Friday, May 18, 2007

That Darned Cat! - 2

Some more thoughts on Twitter performance, as a followup to "That Darned Cat! - 1".

Twitter supports three different communication mediums:

  • SMS text messaging
  • A handful of IM (Instant Messaging) services
  • HTTP - which can be further subdivided into web page access, the Twitter API, and RSS and Atom feeds

I'm not going to talk about the first two, since I'm not familiar with the technical details of how they work, other than to note that I don't see how Twitter can be generating any direct revenue off of HTTP (no ads on the web pages, even), whereas they could certainly be generating revenue off of the SMS traffic they drive to whoever hosts their SMS service. IM? Dunno.

It would appear, or at least I guess, that most of the folks I follow on Twitter are using HTTP, rather than the other communication mediums. Maybe I'm in a microcosm here, but I'm guessing there are a lot of people who only use the HTTP medium. And there's no money to be made there.

So, we've got a web site that's getting absolutely pounded and generating no direct revenue for the traffic it's handling. And it's become a bottleneck. What might we do?

Distribute the load.

Here's a thought on how this might work. Instead of people posting messages to Twitter, have them post to their own site, just like a blog. HTTP-based Twitter clients could then feed off of the personal sites, instead of going through the Twitter.com bottleneck.

This sounds suspiciously like blogging, no? Well, it is a lot like blogging. Twitter itself is a lot like blogging to begin with. Only the posts have to be at most 140 bytes. So let's start thinking about it in that light, and see what tools and techniques we can bring from that world.

For instance, my Twitter 'friends' page is really nothing but the output of a feed aggregator, like Planet Planet or Venus. Only the software to do this would be a lot easier to build.
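To make that concrete, here's a rough sketch of what such an aggregator could look like. This isn't anybody's real code: the feed URLs are made up, it assumes each friend publishes an RSS or Atom feed of their posts, and it leans on the Universal Feed Parser (feedparser) library to do the heavy lifting.

	#!/usr/bin/env python

	# sketch: merge a few friends' personal feeds into one reverse-chronological
	# list, which is roughly all a Twitter 'friends' page is

	import feedparser

	# hypothetical personal feed urls, one per friend
	friend_feeds = [
	    "http://example.com/alice/tweets.atom",
	    "http://example.com/bob/tweets.atom",
	]

	entries = []
	for url in friend_feeds:
	    feed = feedparser.parse(url)
	    for entry in feed.entries:
	        entries.append((entry.get("updated_parsed"), entry.get("title", "")))

	# newest first, just like the 'friends' page
	entries.sort(reverse=True)
	for updated, title in entries[:20]:
	    print title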

Josh notes: Hmm, but doesn't RSS/Atom already give us everything we need for a twitter protocol minus the SMS (and text limit)? Indeed. Twitter already does support both RSS and Atom (I didn't see explicit Atom links, but find an RSS link @ Twitter, and replace the .rss URL suffix with .atom). They aren't things of beauty, but it's something to start with. While you can already use blogging tools to follow Twitter, I'm not sure that makes sense for most people. However, reusing the data formats probably makes a lot of sense.

So, why would Twitter ever want to do something like this? I already mentioned they don't seem to be making any direct revenue off the HTTP traffic, so off-loading some of that is simply going to lower their network bill. They could concentrate instead on providing some kind of value, such as contact management and discovery. Index the TwitterSphere, instead of owning and bottlenecking it. And of course continue to handle SMS and IM traffic, if that happens to bring in some cash.

In the end, I'm not sure any one company can completely 'own' a protocol like this forever. Either they simply won't be able to afford to (expense at running it, combined with a lack of revenue), or something better will come along to replace it.

If you love something, set it free.

There are other ideas. In "Twitter Premium?", Dave Winer suggests building Twitter "peers". This sounds like distributing Twitter from one central site, to a small number of sites. I don't think that's good enough. Things will scale better with millions of sites.

Thursday, May 17, 2007

That Darned Cat! - 1

The performance of Twitter as of late has been abysmal. I'm getting tired of seeing tweets like "Wondering what happened to my last 5 tweets" and "2/3 of the updates from Twitterrific never post for me. Is this normal?". I'm especially tired of seeing that darned cat!

Pssst! I don't think the cat is actually helping! Maybe you should get him away from your servers.

Here's a fun question to ask: do you support ETags?

In order to test whether Twitter is doing any of the typical sorts of caching that it could, via ETag or Last-Modified processing, I wrote a small program to issue HTTP requests with the relevant headers, which will indicate whether the server is taking advantage of this information. The program, http-validator-test.py, is below.

First, here are the results of targeting http://python.org/ :

$ http-validator-test.py http://python.org/
Passing no extra headers
200 OK; Content-Length: 15175; Last-Modified: Fri, 18 May 2007 01:41:57 GMT; ETag: "60193-3b47-b04e2340"

Passing header: If-None-Match: "60193-3b47-b04e2340"
304 Not Modified; Content-Length: None; Last-Modified: None; ETag: "60193-3b47-b04e2340"

Passing header: If-Modified-Since: Fri, 18 May 2007 01:41:57 GMT
304 Not Modified; Content-Length: None; Last-Modified: None; ETag: "60193-3b47-b04e2340"

The first two lines indicate that no special headers were passed in the request, and that a 200 OK response was returned with the specified Last-Modified and ETag headers.

The next two lines show an If-None-Match header was sent with the request, indicating to only send the content if its ETag doesn't match the value passed. It does match, so a 304 Not Modified is returned instead, indicating no content will be sent down (it hasn't changed since you last asked for it).

The last two lines show an If-Modified-Since header was sent with the request, indicating to only send the content if its last modified date is later than the value specified. It's not later, so a 304 Not Modified is returned instead, indicating no content will be sent down (it hasn't changed since you last asked for it).

For content that doesn't change between requests, this is exactly the sort of behaviour you want to see from the server.
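On the wire, the If-None-Match exchange above looks roughly like this (most headers omitted):

	GET / HTTP/1.1
	Host: python.org
	If-None-Match: "60193-3b47-b04e2340"

	HTTP/1.1 304 Not Modified
	ETag: "60193-3b47-b04e2340"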

Now, let's look at the results we get back from going to my Twitter page at http://twitter.com/pmuellr :

$ http-validator-test.py http://twitter.com/pmuellr
Passing no extra headers
200 OK; Content-Length: 26491; Last-Modified: None; ETag: "a246e2e41e13726b7b8f911995841181"

Passing header: If-None-Match: "a246e2e41e13726b7b8f911995841181"
200 OK; Content-Length: 26504; Last-Modified: None; ETag: "1ef9e784fa85059db37831c505baea87"

Passing header: If-Modified-Since: None
200 OK; Content-Length: 26503; Last-Modified: None; ETag: "2ba91b02f418ed74e316c94c438e3788"

Rut-roh. Full content sent down with every request. Probably worse, generated with every request. In Ruby. Also note that no Last-Modified header is returned at all, and different ETag headers were returned for each request.

So there's some low-hanging fruit to be picked, perhaps. Semantically, the data shown on the page did not change between the three calls, so really, the ETag header should not have changed, just as it didn't change in the test of the Python site above. Did anything really change on the page? Let's take a look. Browse to my Twitter page, http://twitter.com/pmuellr, and View Source. The only thing that really looks mutable on this page, given no new tweets have arrived, is the 'time since this tweet arrived' listed for every tweet. That's icky.

But poke around some more, peruse the gorgeous markup. Make sure you scroll right, to take in some of the long, duplicated, inline scripts. Breathtaking!

There's a lot of cleanup that could happen here. But let me get right to the point. There's absolutely no reason that Twitter shouldn't be using their own API in an AJAXy style application. Eating their own dog food. As the default. Make the old 1990's era, web 1.0 page available for those people who turn JavaScript off in their browser. Oh yeah, a quick test of the APIs via curl indicates HTTP requests for API calls do respect If-None-Match processing for the ETag.

The page could go from the gobs of duplicated, mostly static html, to just some code to render the data, obtained via an XHR request to their very own APIs, into the page. As always, less is more.

We did a little chatting on this stuff this afternoon; I have more thoughts on how Twitter should fix itself. To be posted later. If you want part of the surprise ruined, Josh twittered after reading my mind.

Here's the program I used to test the HTTP cache validator headers: http-validator-test.py

	#!/usr/bin/env python

	#--------------------------------------------------------------------
	# do some ETag and Last-Modified tests on a url
	#--------------------------------------------------------------------

	import sys
	import httplib
	import urlparse

	#--------------------------------------------------------------------
	def sendRequest(host, path, header=None, value=None):
	    headers = {}

	    if (header):
	        print "Passing header: %s: %s" % (header, value)
	        headers[header] = value
	    else:
	        print "Passing no extra headers"

	    conn = httplib.HTTPConnection(host)
	    conn.request("GET", path, None, headers)
	    resp = conn.getresponse()

	    stat = resp.status
	    etag = resp.getheader("ETag")
	    lmod = resp.getheader("Last-Modified")
	    clen = resp.getheader("Content-Length")

	    print "%s %s; Content-Length: %s; Last-Modified: %s; ETag: %s" % (
	        resp.status, resp.reason, clen, lmod, etag
	        )
	    print

	    return resp

	#--------------------------------------------------------------------
	if (len(sys.argv) <= 1):
	    print "url expected as parameter"
	    sys.exit()

	x, host, path, x, x, x = urlparse.urlparse(sys.argv[1], "http")
	if not path: path = "/"   # handle a url given with no path, eg http://python.org

	# first request: no validators passed; remember the ETag and Last-Modified returned
	resp = sendRequest(host, path)
	etag = resp.getheader("ETag")
	date = resp.getheader("Last-Modified")

	# replay the request with each validator; a server that honors them returns 304
	resp = sendRequest(host, path, "If-None-Match", etag)
	resp = sendRequest(host, path, "If-Modified-Since", date)

Update - 2007/05/17

Duncan Cragg pointed out that I had been testing the Date header, instead of the Last-Modified header. Whoops, that was dumb. Thanks Duncan. Luckily, it didn't change the results of the tests (the status codes anyway). The program above, and the output of the program have been updated.

Duncan, btw, has a great series of articles on REST on his blog, titled "The REST Dialog".

In addition, I didn't reference the HTTP 1.1 spec, RFC 2616, for folks wanting to learn more about the mysteries of our essential protocol. It's available in multiple formats, here: http://www.faqs.org/rfcs/rfc2616.html.

Tuesday, May 15, 2007

modelled serialization

Too many times I've seen programmers writing their web services, where they are generating the web service output 'by hand'. Worse, incoming structured input to the services (XML or JavaScript), parsed by hand into objects. Maybe not parsed, but DOMs and JSON structures walked. Manually. Egads! Folks, we're using computers! Let the computer do some work fer ya!

In my previous project, we used Eclipse's EMF to model the data we were sending and receiving via RESTy, POXy web services. For an example of what I'm referring to as 'modelling', see this EMF overview and scroll down to "Annotated Java". For us, modelling our data meant adding EMF annotations to our code, and then running some code to generate the EMF goop. What the goop ended up giving you was a runtime version of this model you could introspect on. Basically just like Java introspection and reflection calls, to examine the shape of classes and the state of objects, dynamically. Only with richer semantics. And frankly, just easier, if I remember correctly.

Anyhoo, for the web services we were writing, we constrained the data being passed over the wire to instances of modelled classes. Want to send some data from the server to the client? Describe it with a modelled class. Because the structure of these modelled classes was available at runtime, it was (relatively) easy to write code that could walk over the classes and generate a serialized version of the object (XML or JSON). Likewise, we could take an XML or JSON stream and turn it into an instance of a modelled class fairly easily. Generically. For all our modelled classes. With one piece of code.

Automagic serialization.

One simplification that helped was that we greatly constrained the types of 'features' (what EMF non-users would call attributes or properties) of a class; it turned out to be basically what's allowed in JSON objects: strings, numbers, booleans, arrays, and (recursively) objects. We had a few other primitive types, like dates and UUIDs, but amazingly, we were able to build a large, complex system from a pretty barebones set of primitives and composites. Less is more.
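To make the idea concrete, here's a toy sketch of what that boils down to. None of these class or function names come from EMF or from our actual code; they're invented for the example, and it handles only the bare minimum.

	# toy sketch of model-driven serialization: each modelled class declares its
	# features once, and one generic routine serializes any instance of any of them

	class Feature(object):
	    def __init__(self, name):
	        self.name = name

	class Model(object):
	    features = []                  # subclasses declare their features here

	class Tweet(Model):
	    features = [Feature("text"), Feature("created")]

	class User(Model):
	    features = [Feature("name"), Feature("tweets")]

	def serialize(obj):
	    # walk the declared features, recursing into modelled values and lists,
	    # producing a plain dict ready to be written out as JSON (or XML)
	    result = {}
	    for feature in obj.features:
	        value = getattr(obj, feature.name, None)
	        if isinstance(value, Model):
	            value = serialize(value)
	        elif isinstance(value, list):
	            value = [serialize(v) if isinstance(v, Model) else v for v in value]
	        result[feature.name] = value
	    return result

	def deserialize(cls, data):
	    # turn a dict (parsed from JSON or XML) back into a modelled instance;
	    # a real version would recurse into nested modelled features as well
	    obj = cls()
	    for feature in cls.features:
	        setattr(obj, feature.name, data.get(feature.name))
	    return obj

One serializer and one deserializer cover every modelled class; adding a new kind of data to the system is just a matter of declaring its features.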

For folks familiar with WS-*, none of this should come as a huge surprise. There are basically two approaches to defining your web service data: define it in XML schema, and have tooling generate code for you. Or define it in code, and have tooling generate schema for you. In both cases, serialization code will be generated for you. Neither of these resulted in a pleasing story to me. Defining documents in schema is not simple, certainly harder than defining Java classes. And the code generated from tooling to handle schema tends to be ... not pretty. On the other hand, when starting with code, your documents will be ugly - some folks don't care about that, but I do. The document is your contract. Why do you want your contract to be ugly?

Model driven serialization can be a nice alternative to these two approaches, assuming you're talking about building RESTy or POXy web services. Because it's relatively simple to create a serializer that feels right for you. And you know your data better than anyone; make your data models as simple or complex as you actually need. If you're using Java, and have complex needs, consider EMF, because it can probably do what you need, or at least provide a solid base for what you need.

Besides serialization, data modelling has other uses:

  • Generating human-readable documentation of your web service data. You were planning on documenting it, right? And what, you were going to do it by hand? (There's a little sketch of this after the list.)

  • Generating machine-readable documentation of your web service data; ie, XML schema. I know you weren't going to write that by hand. Tell me you weren't going to write that by hand. Actually, admit it, you probably weren't going to generate XML schema at all.

  • Generating editors, like EMF does. Only these editors would be generated in that 'junky' HTML/CSS/JavaScript trinity, for your web browser client needs. Who wants to write that goop?

  • Writing wrappers for your web services for other languages. At least this helps with the data marshalling. Again, JavaScript is an obvious target language here.

  • Generating database schema and access code, if you want to go hog wild.
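For a taste of the first item, the same made-up feature metadata from the serialization sketch above is enough to drive a bare-bones documentation generator:

	# reuses the toy Model/Feature/User/Tweet classes from the earlier sketch
	def document(cls):
	    print "Resource: %s" % cls.__name__
	    for feature in cls.features:
	        print "    feature: %s" % feature.name
	    print

	for cls in (User, Tweet):
	    document(cls)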

 

If it's not obvious, I'm sold on this modelling stuff. At least lightweight versions thereof.

So I happened to be having a discussion with a colleague the other day about using software modelling to make life easier for people who need to serialize objects over the web. I don't think I was able to get my message across as well as I wanted, and we didn't have much time to chat anyway, so I thought I'd whip up a little sample. Code talks.

This sample is called ToyModSer - Toy Modelled Serializer. It's a toy because it only does a small amount of what you'd really want it to be able to do; but there's enough there to be able to see the value in the concept, and how light-weight you can go. I hope.