If data is more openly available as XML over HTTP, it’s going to be pretty damn easy for a smart hacker to get at that data and build applications like this impressive example… which is great, but undoubtedly someone will eventually feel like their data is being “stolen” or “misused”.
Reverse engineering HTML has been easy from the very beginning, because Mosaic and then Netscape had a feature that let you view the source of any HTML page. And it’s just as easy to watch the HTTP traffic going back and forth from your desktop computer using tools like Live HTTP Headers or Ethereal. Anybody with a few choice Perl modules can screen-scrape data from a web page and reuse it in another application. For example, let’s say that I wanted to make an RSS feed of guests on the David Letterman show. I could easily write some code to parse the CBS Late Show homepage and pull out the data that I want. It’s easy and it’s great, but am I stealing CBS’s data? Getting consensus around an answer to that question is tricky unless the content is specifically licensed for such use.
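As a rough sketch of just how little code that takes (JavaScript here rather than Perl, and with invented markup and class names, since the real Late Show page would have its own structure):

```javascript
// Scrape guest names out of some hypothetical show-listing HTML.
// The markup here is made up for illustration.
var html =
  '<ul class="guests">' +
  '<li class="guest">Tom Hanks</li>' +
  '<li class="guest">Norah Jones</li>' +
  '</ul>';

// Pull the text out of each guest <li> with a crude regex.
var guests = [];
var re = /<li class="guest">([^<]+)<\/li>/g;
var match;
while ((match = re.exec(html)) !== null) {
  guests.push(match[1]);
}
// guests is now ["Tom Hanks", "Norah Jones"], ready to drop into an RSS feed
```

The regex is welded to the exact markup, which is the weakness of every screen-scraper: the moment CBS redesigns the page, the feed breaks.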
Now, imagine if the guest list doesn’t have to be screen-scraped from a web page, but is instead available as XML for consumption by an AJAX-style web application:
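The feed might look something like this (an invented schema, purely for illustration):

```xml
<guests show="Late Show" date="2005-06-13">
  <guest>
    <name>Tom Hanks</name>
    <role>Actor</role>
  </guest>
  <guest>
    <name>Norah Jones</name>
    <role>Musical Guest</role>
  </guest>
</guests>
```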
Well, that’s much more convenient, isn’t it? With a few XPath queries, you’ll have the data that you want for your application.
I was wondering what it would take to prevent people from reusing data like this. I don’t advocate that people adopt these ideas (though anybody is welcome to), because I think that it’s ultimately better if data is openly available. But, just out of curiosity….
Making XPath queries would be a bit trickier if the node names were unpredictable:
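So instead of readable markup, the same guest list might come back looking like this (node names invented, and regenerated on some schedule):

```xml
<x91f2 show="Late Show" date="2005-06-13">
  <ka3b7>
    <q0c44>Tom Hanks</q0c44>
    <rde19>Actor</rde19>
  </ka3b7>
  <ka3b7>
    <q0c44>Norah Jones</q0c44>
    <rde19>Musical Guest</rde19>
  </ka3b7>
</x91f2>
```

A query like /guests/guest/name now returns nothing; an outsider is reduced to positional queries like /*/*[1]/*[1].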
Unauthorized queries against the XML document would then have to rely on node positions rather than node names, which will work, but will grow brittle over time. Authorized applications would have the correct XPath locations, with the randomized node names, embedded in their HTML. Generating random node names via XSLT transforms would be costly on a busy site, and since the accurate XPaths have to be written into the HTML of the consuming web page, the technique carries some added integration cost as well. Clearly stated: this approach is completely goofy. Let’s see if there’s a better idea…
Another way involves using an authorization token embedded in the HTML of a web page that’s intended to consume the XML in an AJAXy manner:
//Generated on the server and included as part of the dynamically generated HTML
var ajaxToken = "dff0194b-384f-43d3-0059-889247de5f88";
//Make a request for some data and include the token
var req = new XMLHttpRequest();
req.open("GET", "/data/guests.xml", true); //the protected XML resource
req.setRequestHeader("X-Ajax-Token", ajaxToken); //header name is arbitrary
req.send(null);
The token here is a UUID. A good UUID generator will create IDs that are (1) non-sequential and (2) hard to guess. You could just as well use another type of generator, as long as it meets those two requirements. The token, included as a header in requests made by the XMLHttpRequest object, would behave much like a session token, providing access to the protected XML resource for a period of time. Because authorization tokens are only generated by friendly web pages (i.e. on your own servers), you can be sure that any valid token presented to the protected resource is coming from your (authorized) application… or possibly from another screen-scraping application like the one we originally started out with! An unauthorized third party would need a valid authorization token, and the only way to get one would be to parse it out of the web page that presents it to the end user. So this approach isn’t perfect, but it nearly does the trick. It is, however, especially useful for cross-site integration with a data provider.
A third approach involves using traditional HTTP sessions to manage the authorization. When a user requests an AJAX-style web page, simply set an attribute in their session that grants them access to the protected resource. When the XMLHttpRequest object makes requests, it will send along any session cookies currently active for that site. This approach is clearly the easiest to implement and is probably the best approach overall, but it only works when the AJAX web page and the XML data are served from the same domain, so that the session cookie reaches both resources (i.e. not if you get your data from a third party, as is the case with the Google Maps/Craigslist real estate example).
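The session variant needs even less machinery, since the browser resends the cookie on its own. A framework-free sketch (the cookie name sid and the attribute name are invented; any real web framework’s session support would handle all of this for you):

```javascript
// Sketch of the session-cookie approach; cookie and attribute names invented.
var sessions = {};   // sessionId -> session attributes

// When the AJAX-style page is served, flag the user's session.
function serveAjaxPage(sessionId) {
  sessions[sessionId] = { canReadGuestData: true };
}

// When the XML resource is requested, the browser has already attached
// the cookie header, so just look the session up.
function canServeXml(cookieHeader) {
  var match = /(?:^|;\s*)sid=([^;]+)/.exec(cookieHeader || "");
  if (!match) return false;
  var session = sessions[match[1]];
  return Boolean(session && session.canReadGuestData);
}
```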
Ultimately, it’s going to be easier to leave the XML data out in the open rather than to try and lock it up. However, there will undoubtedly be legitimate cases where people will want to prevent reuse of the data by parties outside of traditional control channels. None of these solutions are perfect, but maybe there are some ideas to play with.
New architectures pose new challenges and I suspect that we’re only beginning to understand the security concerns associated with AJAX-style web applications.