Announcing Accord: A Java implementation of Chord

Accord is a new Java implementation of the Chord p2p lookup protocol. I’ve been working on this gradually over the past few months and recently received a large contribution of code from Marco Bazzoni, who has joined the project.

The basic functionality behind Chord is actually quite simple: given a set of nodes, Chord maps a given piece of data onto the node responsible for holding it. A basic example: you have 100 nodes containing stock information and you’re trying to locate data based on a CUSIP number. How do you find the node that has the data? Well, Chord solves this problem in an elegant fashion.

Based on the input value (your CUSIP number), Chord makes a SHA-1 digest (a 160-bit integer). Then, starting from any “known” node in the cluster, the Chord algorithm compares the digest with the identifier of the known node, which is (not coincidentally) also a SHA-1 digest, in this case of that node’s network location. Using a partial routing table (the finger table) that each Chord node maintains, the node containing the data is located in O(log N) messages. There are a lot of details about how node joins, departures and failures are handled; these are covered in the Chord Tech Report.
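To make the identifier step concrete, here’s a minimal sketch (illustrative only, not Accord’s actual API) of how a key or a node’s network address becomes a 160-bit Chord identifier:

import java.math.BigInteger;
import java.security.MessageDigest;

public class ChordId {
	// Hash any string (a CUSIP, a node's host:port) into the shared
	// 160-bit identifier space used for both keys and nodes.
	public static BigInteger idFor(String key) throws Exception {
		MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
		byte[] digest = sha1.digest(key.getBytes("UTF-8")); // 20 bytes = 160 bits
		return new BigInteger(1, digest); // treat as a non-negative integer
	}
}

Because keys and node identifiers live in the same circular 160-bit space, a lookup is just a matter of routing toward the node whose identifier most closely succeeds the key’s digest.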

The beauty of Chord is that it’s fully distributed, which is important if you want something that’s both resilient and reliable.

Now, the reason this caught my eye and prompted me to start Accord has to do with my work on Whirlycache. I spent a long time trying to figure out a good way to make Whirlycache distributed like some of the other open-source caching solutions. All of the other caches of any significance use JGroups, but I didn’t want to go down that road.

First, let’s just pretend that there isn’t a problem with using LGPL Java code, to make this explanation simpler. Second, I see no point in making a new wheel when the new wheel looks exactly like all the other wheels. We have several examples of people building distributed caches on the same toolkit, so why create another? It seemed pointless and didn’t offer any opportunity for me to add anything new. I was also concerned by the fact that JGroups uses IP multicast, because my understanding of multicast is that its performance on a given network depends greatly on the local router’s ability to process multicast packets. Maybe that’s a misconception, so please correct me if I’m wrong about that.

But I was also influenced by Microsoft’s approach to locking in SQL Server clusters, which they call “shared-nothing” (here’s a quickie article just to give you the basics). Locking is a complex problem, and distributed locking is an even tougher nut to crack, as anybody who has worked with enterprise-level transactions will tell you. So the fact that I could totally decentralize my data, spread it across a bunch of nodes, reduce lock contention at the network level and still get the benefit of Whirlycache’s speed inside each individual node really appealed to me. One reason Whirlycache is the fastest in-memory object cache has to do with how we handle locking internally: there isn’t a lot of it, and the little locking that does happen occurs at a very granular level, so the chance of contention is very low.
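Whirlycache’s actual internals aren’t reproduced here, but the general idea of granular locking can be sketched like so: give each bucket of the cache its own lock instead of serializing everything behind one monitor, so threads working on different keys rarely contend (class and field names below are made up for illustration):

import java.util.HashMap;
import java.util.Map;

public class StripedCache {
	private static final int STRIPES = 16;
	private final Map[] buckets = new Map[STRIPES];

	public StripedCache() {
		for (int i = 0; i < STRIPES; i++) {
			buckets[i] = new HashMap();
		}
	}

	// The bucket doubles as the lock, so contention only occurs when
	// two threads hit the same stripe at the same time.
	private Map bucketFor(Object key) {
		return buckets[(key.hashCode() & 0x7fffffff) % STRIPES];
	}

	public void put(Object key, Object value) {
		Map bucket = bucketFor(key);
		synchronized (bucket) {
			bucket.put(key, value);
		}
	}

	public Object get(Object key) {
		Map bucket = bucketFor(key);
		synchronized (bucket) {
			return bucket.get(key);
		}
	}
}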

How to handle data redundancy on the network as it pertains to Whirlycache is unknown at this time, but I’m not too concerned about it. After all, we are talking about caches, not authoritative data stores. If the cache gets cleared, the systems it supports still need to be able to operate properly (maybe slowly, but still!). Ideas about this should be directed at the Accord dev mailing list.

So while that gives you an idea of where Whirlycache will go, please keep in mind that Accord will live as a separate entity and will be usable for whatever other projects you can dream up. After you have had a chance to read the Chord documentation, we’d love to get your help on the project.

Caching, Illustrated

Let’s just say that you have some friends (who might be called Chris and Mary) who are starting a new radio show (let’s call it Open Source), and they put a link to a 54MB MP3 on their website.

And let’s just suppose that another guy, who runs a really popular website called Scripting News, links from his website to this big MP3 file. What do you think will happen?

Well, in a word, this will happen:

[Graph: bandwidth usage before and after mod_expires was enabled]

Since the Open Source website runs on my infrastructure, it’s up to me to make sure that things keep working, even in circumstances like this. Now, I’ve done lots of work with caching (at the application layer and via HTTP) in the past, making slow websites fast and popular websites usable again. It’s not magic, but there are a lot of tricks you can play with a standard Apache configuration to get the most mileage out of your hardware.

In this case, since mod_expires wasn’t enabled for my client, I switched it on. Doing so reduced bandwidth consumption, since lots of people started getting the content from their ISPs’ proxy caches instead. This gets content to users faster and reduces load on my servers.
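I won’t reproduce my client’s exact configuration, but the relevant piece of an Apache setup looks something like this (the one-week lifetime is illustrative; tune it to how often your content actually changes):

# Allow browsers and downstream proxies to cache the MP3s for a week
LoadModule expires_module modules/mod_expires.so
ExpiresActive On
ExpiresByType audio/mpeg "access plus 7 days"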

I thought it was a nice graph that illustrated the benefit of knowing how to configure your webserver properly. The maintenance activity at ~2am was a quick bounce of Apache that lasted less than a second. And you’ll notice that bandwidth usage dropped considerably after mod_expires was configured and turned on.

Building J2EE Web Applications with UTF-8 Support

Recently, I’ve spent some time developing a little website project, and I wanted it to support multiple languages and character sets with Unicode from the beginning. This turned out to be harder than I anticipated, but it’s actually not a big deal now that it’s done. Mainly, the problem was documentation, or rather the lack of it. Since my application is based on J2EE, you’ll have to extrapolate for other platforms if that’s not what you’re using.

The first thing you need to do is ensure that your entire stack supports UTF-8, an encoding of Unicode that can represent any Unicode character and will probably handle all of your needs. For example, my application stack is Apache, Jetty, Struts, Hibernate and MySQL. I’ll save you the trouble of having to determine the level of Unicode support in all of these components by informing you that UTF-8 happily exists inside all of these layers.
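One caveat: a couple of these layers need to be told explicitly. With MySQL’s Connector/J JDBC driver, for example, the encoding is typically specified on the connection URL (the host and database name below are placeholders):

jdbc:mysql://localhost/mydb?useUnicode=true&characterEncoding=UTF-8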

Getting Data Out

The simple case is getting data out of the database and onto your browser’s screen. If you’re just starting to write your application, you’ll need to artificially populate your database with some sample text for this experiment. I found a lovely web page with all sorts of UTF-8 text in several languages, so copy/paste some of that text and get it into your database somehow (one way is sketched below). In fact, I’ll save you the trouble by providing you with some Tamil text (chosen because my closest Indian friends speak Tamil!):

நான் கண்ணாடி சாப்பிடுவேன், அதனால் எனக்கு ஒரு கேடும் வராது

Note: while Firefox and MSIE can display this Tamil text just fine, Feedreader isn’t able to…
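For the record, getting the text in from the mysql command-line client is just an insert, assuming a MySQL 4.1+ server with UTF-8 tables and the connection switched to UTF-8 first (the table and column names are made up):

set names utf8;
insert into mytable (text_column) values ('நான் கண்ணாடி சாப்பிடுவேன், அதனால் எனக்கு ஒரு கேடும் வராது');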

Now, if you’re using MySQL Control Center or MySQL Query Browser, you should know up front that while MySQL itself has solid UTF-8 support, neither of these tools does. Actually, Control Center seems to display UTF-8 text properly right after you paste it in, but the same text appears garbled as soon as you use SQL to get the data back out. MySQL Query Browser displays the text in some incomprehensible fashion in all cases:

[Screenshot: illegible UTF-8 text in MySQL Query Browser]

Notice, however, that it is able to properly display ISO-8859-1 text.

I had to convince myself that my eyes weren’t going fuzzy by using message digests to check whether the data was still intact:

select md5(text_column), text_column from mytable

I stored the same text in a file on my disk and md5sum‘d that file to ensure that the digests were identical. It seems kind of silly for MySQL to release such a kickass database server without proper UTF-8 support in their admin tools.
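The shell side of that check is just the following (the file name is hypothetical; make sure the file holds exactly the same bytes, with no trailing newline):

md5sum tamil.txt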

Within Java, all java.lang.String objects are Unicode (internally they’re UTF-16, but the point is that they can hold this text without loss). That’s quite cool, because it means we can get data out of the database using Hibernate (sorry, I can’t be bothered writing SQL anymore), pass it through a Struts Action (the C in MVC) and out into a JSP (the V in MVC).

The JSP needs to be configured a little bit, though. It needs the following declaration at the beginning of the file:

<%@ page contentType="text/html; charset=utf-8" pageEncoding="UTF-8" %>

That makes sure that the right value in the Content-Type header precedes the text/html content. For the record, this header displays as follows (use LiveHTTPHeaders for your Moz/Firefox browser to watch the headers):

Content-Type: text/html; charset=utf-8

You also need this tag in the <head> of your HTML document:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

This is less important in cases where your application only ever serves character data in UTF-8. The convention exists because, given a static web page sitting on disk, a webserver cannot figure out the character encoding used therein. You can use things like mod_mime_magic to figure out what kind of data is in the file, but it’s not smart enough to reliably determine the character set.

Getting Data In

So while getting data out of the database and through the stack is fairly straightforward, getting data from the browser turns out to be slightly more complicated. Data is usually transmitted from browsers to webservers using either the GET or POST methods in HTTP. (Yes, there’s more to HTTP than GET or POST, but when was the last time you wrote a webapp that used them?)

Typical use of a POST request involves a web form that gets filled out by the user in their browser and then sent to the server. How are you supposed to know what character set the user used while filling out the form? According to the HTML 4.01 specification, if you don’t specify the accept-charset attribute of the <form> tag:

The default value for this attribute is the reserved string “UNKNOWN”. User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.

So, barring a buggy browser, you can constrain form submissions to the character set(s) specified in the accept-charset attribute. In my case, I decided to be explicit by specifying “UTF-8” as the value rather than relying on the browser to take the default (also UTF-8, since that’s how the web page containing the form was encoded).
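In markup, that looks like this (the action URL and field names are placeholders):

<form method="post" action="/comment" accept-charset="UTF-8">
	<input type="text" name="comment" />
	<input type="submit" value="Post" />
</form>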

When the user submits the form, the data is sent to our server via HTTP POST, at which point it is up to our application to process the submission. Typically, in Java-land, we build webapps on frameworks (like Struts, Spring, etc.) that sit on top of the Servlet API. The Servlet API defines a number of objects that your chosen servlet container will make available to you, such as ServletRequest and ServletResponse which, predictably enough, model the request and response (HttpServletRequest and HttpServletResponse are the HTTP-specific sub-interfaces that we’re interested in). ServletRequest declares a number of methods, including this one:

public void setCharacterEncoding(String env) throws UnsupportedEncodingException;

At this point, the astute reader will realise that this is exactly what we need to call on every servlet-derived method that will handle requests for our application. In Struts, subclasses of Action handle HTTP requests. The method signature of the execute() method of Action looks like this:

public ActionForward execute(ActionMapping mapping, ActionForm form, HttpServletRequest request, HttpServletResponse response) throws Exception;

So the necessary work required to get Struts to process request parameters encoded in UTF-8 is exactly the same as for regular servlet requests: call setCharacterEncoding() on the request before reading any parameters.
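A minimal sketch inside a Struts Action (mapping.findForward("success") stands in for whatever your Action actually returns):

public ActionForward execute(ActionMapping mapping, ActionForm form,
		HttpServletRequest request, HttpServletResponse response) throws Exception {
	// Call this before the first getParameter(); once the container
	// starts parsing the request, the encoding is fixed.
	request.setCharacterEncoding("UTF-8");
	// ... handle the request as usual ...
	return mapping.findForward("success");
}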


Kind of tedious, no? Well, I thought so until I found a lovely little servlet filter squirreled away in the CVS tree for Tomcat 4.0. You can configure the filter in your web application’s web.xml descriptor with two parameters for conditionally or unconditionally setting the character set on the current request:

<filter>
	<filter-name>Set Character Encoding</filter-name>
	<filter-class>filters.SetCharacterEncodingFilter</filter-class>
	<init-param>
		<param-name>encoding</param-name>
		<param-value>UTF-8</param-value>
	</init-param>
	<init-param>
		<param-name>ignore</param-name>
		<param-value>false</param-value>
	</init-param>
</filter>
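For the filter to actually run, web.xml also needs a matching filter-mapping; mapping it to /* applies the filter to every request:

<filter-mapping>
	<filter-name>Set Character Encoding</filter-name>
	<url-pattern>/*</url-pattern>
</filter-mapping>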

Now, there is one gotcha here, and it has to do with the handling of GET requests. Let’s say that you expose the following URL on your site with some Japanese characters stuck in as part of the URI (go ahead and click on it):

伊國屋ホームページ

Presumably, your browser URL-encoded the request as follows: …

No big deal. This type of decoding should be handled by your servlet container. But here’s the gotcha: if you do an internal forward from the servlet that intercepts the request to another servlet, with the URL-encoded text as part of the rewritten URI, you need to re-encode the URI before the forward.

In other words, if your application receives a request for this:

伊國屋ホームページ

… which in turn results in a forward to this:

伊國屋ホームページ

… then you will have to re-URL-encode the 伊國屋ホームページ part of the request before you do the internal forward. This seems wrong to me; I’m almost wondering if part of my application stack has some kind of bug somewhere, but I haven’t investigated that possibility. Nonetheless, this step was necessary for part of my application to function.

W3C provides some Java code here that you can take for the purpose of URL encoding and decoding strings.
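If you’d rather not pull in outside code, the JDK’s own java.net.URLEncoder can handle the re-encoding step (the path below is made up, and note that URLEncoder does form-style encoding, turning spaces into +, so check that it suits your URIs):

// Re-encode the multibyte path segment before the internal forward.
String decoded = "伊國屋ホームページ"; // already decoded by the container
String encoded = java.net.URLEncoder.encode(decoded, "UTF-8");
request.getRequestDispatcher("/view/" + encoded).forward(request, response);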

Building support for UTF-8 feels good because it’s the right way to build modern applications. And since it’s much easier to start off using UTF-8 than convert later, there’s hardly any reason not to do this from the beginning.