pmuellr: 2008-01-13

Saturday, January 19, 2008

schemas are good?

I'm not a "DB guy", by any stretch of the imagination, but something about DeWitt and Stonebraker's "MapReduce: A major step backwards" post seemed immediately ... wrong. Greg Jorgensen's response, "Relational Database Experts Jump The MapReduce Shark", points out at least one obvious thing - they're comparing apples to screwdrivers to a certain extent.

But there was one thing that jumped right out at me, from the original blog post:

"Schemas are good."

Talking specifically about database schemas here, just to be clear :-) And my immediate thought was: That's Not Right. Now, perhaps I'm confusing you; haven't I been yapping for ages about meta-data and schema and how it's going to save the world? Yes I have.

I bristle at the statement, "schemas are good", in relation to DB schemas, in the same way I bristle at folks who make claims about all the benefits of strongly typed languages compared to dynamically typed languages. There's value in being able to talk about types at interface boundaries, where I need to make a contract / commitment about data a program accepts and/or returns, because other people will be using these programs. But most of the code I'm writing is NOT a public interface, nor is the layout of my database; they're both implementation details. Enforcing strict typing at these levels is just getting in my way.

Here's the description in the original paper about why "schemas are good":

"The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce data-set can actually silently break all the MapReduce applications that use that data-set."

In theory, I agree. It's usually nice to keep your DB clean. But we've also moved beyond the days when applications talked directly to the databases; the smart money seems to be exposing your 'data' via a set of services, presumably whose implementation is using a database (or persistence service). Or at least you've probably encapsulated your data access in some kind of library. Those seems like perfectly good places to put your data validation and introspection.

And then, of course, not everyone wants to keep their database clean in the first place.

I thought it was interesting that they didn't mention performance, which you might guess would be another positive aspect of having schema. Wonder why not? Perhaps, having schema doesn't help performance?

Wednesday, January 16, 2008

IDL vs. Human Documentation

Do we really need all this documentation?
© Frank Jepsen

Ping pong; in reply to Steve's post on the same subject ...

Like I said in my previous post, interface definition languages exist for machines to generate code. They're totally inadequate, though, for instructing developers on how to write code to use a service. The need for human documentation in this context isn't quaint or impractical at all - it's simply reality.

Steve also points out the reams of documentation produced by OMG, and produced about WS-*, over the years, as a proof point of this.

He's right. Programmers can not survive on IDL alone, or more generally, meta-data. Human language is still often required to express subtleties or non-intuitive aspects of programming libraries, services, etc., simply because we have no other formal means of doing so. Or you wouldn't want to read it, if we did. Human readable documentation is also key to providing overview, introductory, relationship, etc type of information.

I appreciate having that sort of documentation. Lots of it.

But here's what I see as reality: Flickr Services.

Overlook the fact that parts of the interface aren't terribly RESTy, if you're a REST purist. But go out on a limb for me, and pretend they did the right thing in terms of REST; it's not a big leap, I don't think. Let's ask some questions. Is there some kind of XML schema that describes the input and output documents of these service definitions? Nope; can't find one. What HTTP status codes might get returned? Not documented, though they define their own error codes returned in the HTTP response payload; that commonly seen pattern is something else we need to talk about, BTW. Are ETags in use? Last-Modified checking? Who knows. etc

So, keep in mind that this documentation is all you've got. I'm going to claim it's inadequate.

Now the kicker: this API is still perfectly usable! In fact, I use it every time I write a blog entry, via a GreaseMonkey script to generate my little flickr images with annotations. It's not rocket science! It's HTTP!

While the documentation isn't completely adequate, it's perfectly acceptable, because I'm only using a small number of the APIs. The fact that's it's inadequate is containable. As Steve has previously mentioned, it's easy to do REPL-ish experiments with REST APIs. Heck, Flickr includes an API "Explorer" on their site to let you play, live. Nice!

But here is my fear. It's early days for REST. I want to believe that we'll see lots of people using REST over the coming decade to provide 'API's to their services. Where the promise of DCE, CORBA, WS-* got mired down in their complexities, REST is simple enough to actually be practical.

And that's my tension. It's simple/easy (you pick your semantic) to do simple stuff. I'd like to make sure that it's possible to do hard stuff. What happens when I need to deal with MORE that just one interface, and those interfaces are BIGGER than Flickr's? I feel like we're going to get lost in a sea of differently inadequate documentation. Using meta-data to help describe at least the nuts+bolts layer, is something that can help. The trick is to make sure it doesn't get too much in the way, and that you can keep it in sync with your code. My belief is that both are entirely possible. BTW, the link in the sentence above is a couple steps back in this conversation thread, if you've lost your place.

Let me finish this with a challenge. We all agree that documentation is good. Do you think there's some kind of a standard 'template' that we might agree on, that could provide all the practical, required information regarding usage of REST services? All text, prose, for now; just identify the information actually required. That seems like a worth-while goal, especially to try to educate people on the "way of REST" - and I'm talking about RESTy service providers here. Lots of people don't get REST. Let's teach them, by example, and by the way, I'm sure I still have a lot to learn myself.

You never know; I may see that prose and say "My gods. I've been such a dumb-ass! I can see that there's no way to formalize that to some machine readable format!". But I'll bet you a beer, that I won't say that. Take that prose, create an XHTML template that you could style, morph it into a MicroFormat (might not be so micro), create a JSON mapping when JSON takes over the universe this year, etc. Now I got my meta-data and I'll stop yammering on about it (not likely). But baby steps for now. I'd be happy with prose for now. Let's go top-down.

One more final note :-) I wrote this post last night, before Tim Bray posted about 'therestfulway.com'. Synchronicity? Wouldn't it be great to have a whole site full of patterns, best practices, examples, etc on REST? I've posed this question before, and gotten some positive responses. But some action is obviously required. I can't do it alone, but I'd be happy to help. I'm willing to sign a waiver indicating I won't talk about meta-data for the first six months. Who can pass that up?

phpBB on project zero

java implementation
© Stefan Jansson

My colleague Rob Nicholson posted a blog entry yesterday announcing a new significant accomplishment reached by the Project Zero team working on the PHP interpreter, which is written Java.

They've got phpBB running on Project Zero!

Hats off to the team; they've been working on the PHP interpreter for just over a year, and I'm quite impressed by the progress they've made.

While Project Zero is a commercial product, our PHP team has been making use of the test cases provided by the qa.php.net community, thanks to the generous license of the tests. I'm proud that the team has been able to contribute test cases back to the community as well. Rob tells me they've contributed more than 1000 tests so far. They've been using the existing PHP test cases to validate the interpreter, and have been contributing tests back as they find undocumented or untested behavior, that we need to implement, for which we need to determine the exact semantics.

phpBB is just the start; the team has been busy working on getting another interesting, non-trivial PHP application running. But they told me to keep my big yap shut, so you'll need to wait a bit longer to find out which one ...

Tuesday, January 15, 2008

on naming

No Name
© Patrick (but not me!)

Check out Stephen Wolfram's latest blog post, "Ten Thousand Hours of Design Reviews". In it, he describes how they run design reviews at Wolfram Research, for their Mathematica product.

The whole article is interesting, but the most interesting bit is in the second half of the blog post, where he talks about naming. He couldn't be more right about that; coming up with good names is hard, really hard. And he nails some of the tensions.

I've come up with some schemes to name projects, based on google searches on words with letter mutations. That's actually not terribly difficult. But those names are just labels; class names, function/method names, constant names, etc are things that can have a very long life-time, if you're lucky. Names that your programming peers will have to put up with. It's great to come up with perfect names.

Unlike human languages that grow and mutate over time, Mathematica has to be defined once and for all. So that it can be implemented, and so that both the computers and the people who use it can know what everything in it means.

on qooxdoo

qoo
© stephan

For some reason, I hadn't really looked at the qooxdoo JavaScript framework until last night. I remember noticing it when Eclipse RAP hit my radar, but I guess I didn't drill down on it. And after browsing the doc, playing with the demos, browsing the code, there are a few things that I really like about the framework.

explicit class declarations - So we know that ActionScript and the new drafts of the JavaScript language include explicit class declaration language syntax. Why? Classes provide a type of modularity that makes it easier to deal with building large systems. There are many ways to get this sort of modularity, and other toolkits provide facilities similar to what qooxdoo does. So why do I like the way qooxdoo does it?

It feels pretty familiar. It's fairly concise. Seems easy to introspect on, if you need to. I'm guessing it made it a lot easier to generate the gorgeous doc available in their API Documentation application.

On the downside, some ugliness seeps through. "super" calls. Primitive Types vs. Reference Types. Remembering to use commas, in general, since the class definition is largely a bag of object literals in object literals.

'window system' vs. 'page' GUI - Running the demos, it's clear that this is a toolkit to build applications that look like 'native' applications - a typical GUI program you would run on your desktop. Compared to the the traditional 'page' -based applications that we've been using since we've been surfing the web.

This is of course a contentious issue. Uncanny Valley and all that. Still, seems like there's potentially promise here, for at least certain types of apps.

we don't need no stinkin' HTML or CSS - Rather than follow the typical pattern of allowing/forcing folks to mix and match their 'GUI' bits of code via HTML, CSS and JavaScript, qooxdoo has a "you only need JavaScript" story. It's an interesting idea - it certainly seems like there's potentially value having to deal with just one language instead of three.

Here's some rational on the non-CSS-based styling.

The downside I see is that the API is fairly large, and without syntax assist, I can see that you'd be keeping that API doc open all the time to figure out what methods to call. This style of programming also seems more relevant for 'window system' UIs compared to building typical web 1.0-style 'page' UIs.

Test Runner - 'nuff said.

lots of documentation - I couldn't find any kind of introduction on the widgets in all this though. Still, this level of documentation seems to be above average, relative to other toolkits.

summary - Cool stuff. I'd wondered when I'd see a 'bolted on' class system in JavaScript that I liked, and this one is certainly getting close. Likewise, I've often wondered about completely library-izing the GUI end of browser clients; the shape here seems about right.

Anyone else played with this?

steve vinoski on schema

2007/01/15: if you happen to read through this blog entry, please also read the comments. Steve felt a bit misquoted here, especially in the title, and I offer up my rational, and apology, but more importantly, we continue the argument :-)

As someone who believes there is some value for 'schema' in the REST world, I'd like to respond to Steve Vinoski's latest blog post "Lying Through Their Teeth: Easy vs. Simple".

Since I'm more firmly in the REST camp than the WS-* camp, and since I love an underdog, I'll side with schema just for the halibut. It's probably more correct to say that I haven't made up my mind w/r/t schema than I believe there is some value - my 'pro' stance on schema is a gut feel - I've seen no significant evidence to sway me either way.

I definitely feel like schema has gotten a bad rap from it's association with WS-*. Well-deserved, to some extent; XML schema is overly complex, resulting in folks having differing interpretations of some of the structures; WSDL itself is so overly indirect it's horribly un-DRY, while on the other hand being terribly brittle at points like hardcoding URLs to your service in the document.

So? Yeah, they got it wrong. They got a lot of stuff wrong. Anyone consider we might be throwing the baby (schema) out with the bath water (WS-*)?

I'd say my general reaction to the typical RESTafarian's rant against schema is that it's misplaced, guilt-by-association.

Let's look at some of Steve's arguments:

"After all, without such a language, how can you generate code, which in turn will ease the development of the distributed system by making it all look like a local system?"

That's making a bit of a leap - "making it all look like a local system?". That's pretty much what the WS-* world did. Why would he think someone who is pro-schema would necessarily want to repeat the mistakes of the WS-* world? I don't want to!

Anyone who's done any amount of client-side HTTP programming is going to realize there's a fair amount of boiler plate-ish code here. Setting up connections, marshalling inputs, making requests, demarshalling outputs, etc. Want caching? How about connection pooling? You're going to be writing a little framework, my friend, if you don't have one handy. And once you've done that, perhaps you'd like to wrapper some of the per-resource operation bits in some generated code, just to make your life easier. And note, I count things like dynamic proxies as code generation; it's just that isn't static code generated by a tool that you have to re-absorb back into your application.

And guess what? Your generated code doesn't have to hide the foibles of the network, if it doesn't want to. And as Steve implies, it shouldn't.

"trying to reverse-map your programming language classes into distributed services, such as via Special Object Annotations, is an attempt to turn local design artifacts into distributed ones"

Again, no. Schema and "hiding remoteness" are two separate things, they aren't necessarily related.

Similarly, with REST, you read the documentation and you write your applications appropriately, but of course the focus is different because the interface is uniform.

The documentation? What documentation? I'm picturing here that Steve has in mind a separate document (separate from the code, not generated from the code) written by a developer. But human generated documentation like this is still schema, only it's not understandable by machines, pretty much guaranteed to get out of sync with the code, probably incomplete and/or imprecise. Not that machine generated schema might fare any better, but it couldn't be any worse.

But there are more problems with this thought. The notion of hand-crafted documentation for an API is quaint, but impractical if I'm dealing with more than a handful of APIs. In fact, the "uniform interface" constraint of REST pretty much implies you are going to have a greater number of "interfaces" than had you defined the functionality as a service (or set of services) in WS-*. Though presumably each REST interface has a smaller number of operations. Of course it would be nice to have this documentation highly standardized (at least given a related 'set' of documented interfaces), available electronically, etc. I don't see that happening with documentation generated by a human.

Another problem here is that although the "uniform interfaces" themselves will be generally easy to describe - "GET returns an Xyz in the HTTP response payload" - the format of the data is certainly not uniform. Is this sufficient documentation for data structures sent in HTTP payloads? It's not for me. Documenting data structures 'by example' like that seems to be the status quo, and is of course woefully inadequate.

Lastly, I'll point out that I think the primary reason for having some kind of schema available is specifically to generate standardized, human-readable documentation. But not the only one. I think there are opportunities to have have machine-generated 'client libraries', machine-generated test suites, many-flavored documentation (PDFs, web-based, man pages, context-assisted help in editors, etc), validators, assembly tools for use by non-programmers (think Yahoo! Pipes), etc.

What the proponents of definition languages seem to miss is that such languages are primarily geared towards generating tedious interface-specific code, which is required only because the underlying system forces you to specialize your interfaces in the first place.

Right. Except for the data, which will be different; and URL template variables. Don't forget that sometimes you'll be needing to be setting cache validators up; maybe some special headers. Face it, there's plenty of tedious code here, especially if you're making more than onesy-twosey requests. Especially if you're in a low signal-to-noise ratio language like Java. Tedious code is no fun. Why not framework-ize some of it?

Summary

Even though I have this feeling that schema will be of some value in the REST world, I actually welcome having to have an argument about it. Absolutely, assume we don't need it until we find a reason that we really need it. We haven't yet hit the big time with REST, to know if we'll need it or not. When we see APIs on the order of eBay's, only REAL REST, then we'll know REST will have hit the big time.

For right now though, none of the anti-schema arguments I've ever heard is very compelling to me.

In the meantime, it's also refreshing to see folks experimenting with stuff like WADL. If we end up needing / wanting schema, it would be nice to have some practical experience with a flavor or two, for REST, before trying to settle on some kind of universal standard.

Links