pmuellr is Patrick Mueller

other pmuellr thangs: home page, twitter, flickr, github

Saturday, January 19, 2008

schemas are good?

I'm not a "DB guy", by any stretch of the imagination, but something about DeWitt and Stonebraker's "MapReduce: A major step backwards" post seemed immediately ... wrong. Greg Jorgensen's response, "Relational Database Experts Jump The MapReduce Shark", points out at least one obvious thing - they're comparing apples to screwdrivers to a certain extent.

But there was one thing that jumped right out at me, from the original blog post:

"Schemas are good."

Talking specifically about database schemas here, just to be clear :-) And my immediate thought was: That's Not Right. Now, perhaps I'm confusing you; haven't I been yapping for ages about meta-data and schema and how it's going to save the world? Yes I have.

I bristle at the statement, "schemas are good", in relation to DB schemas, in the same way I bristle at folks who make claims about all the benefits of strongly typed languages compared to dynamically typed languages. There's value in being able to talk about types at interface boundaries, where I need to make a contract / commitment about data a program accepts and/or returns, because other people will be using these programs. But most of the code I'm writing is NOT a public interface, nor is the layout of my database; they're both implementation details. Enforcing strict typing at these levels is just getting in my way.

Here's the description in the original paper about why "schemas are good":

"The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce data-set can actually silently break all the MapReduce applications that use that data-set."

In theory, I agree. It's usually nice to keep your DB clean. But we've also moved beyond the days when applications talked directly to the databases; the smart money seems to be exposing your 'data' via a set of services, presumably whose implementation is using a database (or persistence service). Or at least you've probably encapsulated your data access in some kind of library. Those seems like perfectly good places to put your data validation and introspection.

And then, of course, not everyone wants to keep their database clean in the first place.

I thought it was interesting that they didn't mention performance, which you might guess would be another positive aspect of having schema. Wonder why not? Perhaps, having schema doesn't help performance?

1 comment:

stu said...

We still need schemas to understand the meaning of the data, or its constraints, even if you encapsulate the access. And I do mean database schemas: XML schemas don't give you much in terms of understanding semantics compared to a good 'ol ERD.

Also, not all access paths are canned. I scream every time I see yet-another-XML-language to craft dynamic queries.

There's little difference between a Web Service and a Stored Procedure, it's just a matter of how the data is represented.