pmuellr is Patrick Mueller, an IBMer doing dev advocate stuff for node.js on BlueMix.

other pmuellr thangs: muellerware.org, twitter.com/pmuellr

Thursday, August 21, 2008

better than json

Well, JSON is the new hotness. This week. Again. Thought I'd jot down some thoughts I had on JSON last summer, in the area of improving on JSON.

Before doing that, let me go over some of my core beliefs about JSON:

  • It defines an ASCII serialization format for a lowest common denominator of primitive and composite types used in most common programming languages, so it can be used to serialize natural data structures in many environments.

  • JSON doesn't attempt to handle traditional OO mechanics; it's really just plain old simple data and data structures, which keeps things simpler.

  • The fact that JSON is legal JavaScript, and so can be eval()'d and <script src="">'d in web browsers via JSONP, is not a benefit; it's a security nightmare waiting to happen. JSON parsers are easy to write, and with JSONP you're giving up too much control over your HTTP requests.

  • The 'readability' and 'writability' of JSON, by humans, isn't a big priority for me. But it does bug me that it could easily be more readable and writable, by humans.
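To make the eval() point concrete, here's a small sketch. It's in Python rather than JavaScript, and the `parse_untrusted` name is made up for illustration, but the idea carries over: a real parser only accepts data, while an evaluator accepts any expression you hand it.

```python
import json

def parse_untrusted(text):
    """Parse text as JSON data; return None if it is anything else."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

# Plain data parses fine.
print(parse_untrusted('{"next": null}'))

# An expression that eval() would happily run is rejected by the parser.
print(parse_untrusted('__import__("os").getcwd()'))
```

The second call comes back as None because the input isn't data at all. Handing that same string to an evaluator would have run it.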

Basically, I like what it can express, and I care less about the syntax.

So, what would I change? Turns out, not much, mainly syntax changes to aid in human readability and writability.

Add some additional primitive data types / literal forms. Some obvious ones would be decimal and date. Most languages can support these, somehow. Date gets dicey in terms of subsetting, like date with no time, time with no date, etc.
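Without literal forms for these, every application ends up inventing its own convention. Here's a rough sketch of one such convention in Python; the `encode_extra` helper and the choice to carry decimals and dates as strings are my assumptions, not anything JSON defines.

```python
import json
from datetime import date
from decimal import Decimal

def encode_extra(value):
    """Fallback for json.dumps: render types JSON has no literal for."""
    if isinstance(value, Decimal):
        return str(value)          # assumption: decimals travel as strings
    if isinstance(value, date):
        return value.isoformat()   # assumption: ISO 8601, date with no time
    raise TypeError(f"cannot serialize {type(value).__name__}")

record = {"price": Decimal("19.90"), "posted": date(2008, 8, 21)}
text = json.dumps(record, default=encode_extra)
print(text)

# Reading it back: the reader has to know which strings are really
# decimals or dates, because nothing in the serialization says so.
raw = json.loads(text)
decoded = {
    "price": Decimal(raw["price"]),
    "posted": date.fromisoformat(raw["posted"]),
}
print(decoded == record)
```

The reading half is the annoying part: the type information lives in the reader's head instead of in the data, which is the gap a real literal form would close.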

Remove the idiotic "\/" escaping rule. People think they have to encode the "/" character because it shows up in the 'backslash escapable character list'. Section 8, Examples, of RFC 4627 has an example which clearly does not use escaped "/" characters. I assume this escaping rule had something to do with the way JavaScript is/was parsed in some web browsers. feh.
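A quick check, using Python's standard json module as a stand-in for any conforming parser, shows that the escape is accepted but never needed:

```python
import json

# The escaped and unescaped forms decode to the same string.
escaped = json.loads(r'"http:\/\/example.org\/a"')
plain = json.loads('"http://example.org/a"')
print(escaped == plain)

# And the encoder doesn't bother escaping "/" on the way out.
print(json.dumps("http://example.org/a"))
```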

Remove extraneous punctuation. The "," separators used between elements in {} and [] structures aren't needed. Make them optional. Same with the ":" characters between key / value pairs in {} structures. For human readable output, I'd not use "," but would use ":", because I like the visual parsing you get between key value pairs. I would typically write key / value pairs on a single line anyway, obviating the need for the "," separator.

Stop requiring quotes around strings unless required. This is the biggest noise reducer. There are only 3 keywords in JSON - true, false, and null. For property keys specifically, many times the key values aren't one of these keywords, aren't another primitive literal value, and don't include whitespace or other separator characters. Their values are basically that of 'identifiers', which get used unquoted in JavaScript itself, in JavaScript dot notation and object literals.

I assume this quoted property name rule exists to keep people from inadvertently using JavaScript keywords as property keys, which would then screw up the parse-by-eval() trick. For program generated JSON, it's probably just easier to always use quotes, but when JSON is being rendered for humans, I'd like quotes removed if they could be. When writing literal JSON, I'd also not use quotes if I didn't have to.

The same rules apply for all string usage, not just property keys. But I'd probably typically want to read and write strings as quoted values everywhere but property keys, as that's just what I'm used to.

Using 'bare strings' means that you might write out a property name that in the future became a new JSON keyword. Not good. So perhaps something like the Smalltalk/Ruby symbol syntax construct could be used. Basically a new character prefix, with no terminator, indicating the following characters up to a separator are the string value. If "#" was the prefix, then #abc would be the same thing as "abc".

Or perhaps we can find and set some rules that would allow new keywords and operators to be introduced without breaking serializations that weren't aware of them. The NetRexx language does this.

Provide single line and multi-line comments. Duh.
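Here's a rough sketch of what a reader for these relaxed rules could look like, in Python. Everything in it is an assumption for illustration: the names (`tokenize`, `parse`, `read_value`, `atom`) are made up, "," and ":" are simply skipped wherever they appear, and error reporting is minimal. It's meant to show that the changes above don't make parsing any harder, not to define a format.

```python
import json
import re

TOKEN = re.compile(r"""
    (?P<skip>    \s+ | , | : | //[^\n]* | /\*.*?\*/ )  # whitespace, optional punctuation, comments
  | (?P<punct>   [{}\[\]] )
  | (?P<string>  "(?:[^"\\]|\\.)*" )                   # ordinary quoted JSON string
  | (?P<symbol>  \#[^\s{}\[\],:"]+ )                   # symbol form: the text after the prefix is the string
  | (?P<bare>    [^\s{}\[\],:"]+ )                     # keyword, number, or bare string
""", re.VERBOSE | re.DOTALL)

KEYWORDS = {"true": True, "false": False, "null": None}
NUMBER = re.compile(r"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?")

def tokenize(text):
    """Yield (kind, text) pairs, dropping whitespace, commas, colons and comments."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.lastgroup != "skip":
            yield m.lastgroup, m.group()

def atom(tok):
    """Interpret an unquoted token: keyword, then number, else a bare string."""
    if tok in KEYWORDS:
        return KEYWORDS[tok]
    if NUMBER.fullmatch(tok):
        return json.loads(tok)
    return tok

def read_value(tokens, i):
    """Read one value starting at tokens[i]; return (value, next index)."""
    kind, tok = tokens[i]
    if kind == "string":
        return json.loads(tok), i + 1
    if kind == "symbol":
        return tok[1:], i + 1
    if kind == "bare":
        return atom(tok), i + 1
    if tok == "[":
        items, i = [], i + 1
        while tokens[i] != ("punct", "]"):
            item, i = read_value(tokens, i)
            items.append(item)
        return items, i + 1
    if tok == "{":
        obj, i = {}, i + 1
        while tokens[i] != ("punct", "}"):
            key, i = read_value(tokens, i)
            if not isinstance(key, str):
                raise ValueError(f"object key must be a string, got {key!r}")
            obj[key], i = read_value(tokens, i)
        return obj, i + 1
    raise ValueError(f"unexpected {tok!r}")

def parse(text):
    """Parse relaxed text (strict JSON is accepted too) into Python values."""
    tokens = list(tokenize(text))
    try:
        value, rest = read_value(tokens, 0)
    except IndexError:
        raise ValueError("unexpected end of input") from None
    if rest != len(tokens):
        raise ValueError("trailing tokens after the top-level value")
    return value

sample = """
{
    // comments are allowed
    members: [
        { #href "34KJ" etag: "0hf0" }
        { "href": "aaaa", "etag": "alsj" }   /* strict JSON still works */
    ]
    next: null
}
"""
print(parse(sample))
```

One thing the sketch makes obvious: once ":" and "//" mean something, a bare string can't contain them, so values like URLs still need quotes. That's fine with me, since I'd mostly leave quotes off of property keys only.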

Examples

Here are some examples of using some of these changes, based on the following JSON object that I found on the web somewhere. To be fair, since these examples are supposed to show readability enhancements, I did a reformatting in TextMate. Here's the original, but reformatted, legal JSON.

{
    "members": [
        {
            "href": "34KJDCSKJN2HHF0DW20394",
            "etag": "0hf0239hf2hf9fds09sadfo90ua093j",
            "entity": {
                "id": "example.org:34KJDCSKJN2HHF0DW20394",
                "name": {
                    "unstructured": "Jane Doe"
                },
                "gender": {
                    "displayvalue": "女性",
                    "key":          "FEMALE"
                }
            }
        },
        {
            "href": "aaaaaaaaaaa11111",
            "etag": "alsjdfieflsajfajsfjadsljfalksjd",
            "entity": {
                "id": "example.org:aaaaaaaaaaa11111",
                "name": {
                    "unstructured": "Joe Gregorio"
                },
                "gender": {
                    "displayvalue": "Male",
                    "key":          "MALE"
                }
            }
        }
    ],
    "next": null
}

Let's remove the commas. Comma removal doesn't help much with the readability, but it's much easier to not have to add the commas when you're writing JSON literals by hand.

{
    "members": [
        {
            "href": "34KJDCSKJN2HHF0DW20394"
            "etag": "0hf0239hf2hf9fds09sadfo90ua093j"
            "entity": {
                "id": "example.org:34KJDCSKJN2HHF0DW20394"
                "name": {
                    "unstructured": "Jane Doe"
                }
                "gender": {
                    "displayvalue": "女性"
                    "key":          "FEMALE"
                }
            }
        }
        {
            "href": "aaaaaaaaaaa11111"
            "etag": "alsjdfieflsajfajsfjadsljfalksjd"
            "entity": {
                "id": "example.org:aaaaaaaaaaa11111"
                "name": {
                    "unstructured": "Joe Gregorio"
                }
                "gender": {
                    "displayvalue": "Male"
                    "key":          "MALE"
                }
            }
        }
    ]
    "next": null
}

Now remove the quotes around property keys. That's really nice, but for compatibility with potential new keywords, it probably is best to not use 'bare strings'.

{
    members: [
        {
            href: "34KJDCSKJN2HHF0DW20394"
            etag: "0hf0239hf2hf9fds09sadfo90ua093j"
            entity: {
                id: "example.org:34KJDCSKJN2HHF0DW20394"
                name: {
                    unstructured: "Jane Doe"
                }
                gender: {
                    displayvalue: "女性"
                    key:          "FEMALE"
                }
            }
        }
        {
            href: "aaaaaaaaaaa11111"
            etag: "alsjdfieflsajfajsfjadsljfalksjd"
            entity: {
                id: "example.org:aaaaaaaaaaa11111"
                name: {
                    unstructured: "Joe Gregorio"
                }
                gender: {
                    displayvalue: "Male"
                    key:          "MALE"
                }
            }
        }
    ]
    next: null
}

Now try prefixing property keys with "#" using a symbol-like construct, and not using ":" to separate key value pairs, to make up for having to put the prefix in. Add some comments.

{
    // the list of entries in the list
    #members [
        // Jane Doe's entry
        {
            #href "34KJDCSKJN2HHF0DW20394"
            #etag "0hf0239hf2hf9fds09sadfo90ua093j"
            #entity {
                #id "example.org:34KJDCSKJN2HHF0DW20394"
                #name {
                    #unstructured "Jane Doe"
                }
                #gender {
                    // apparently this data is multi-byte encoded
                    #displayvalue "女性"
                    #key          "FEMALE"
                }
            }
        }
        // Joe Gregorio's entry
        {
            #href "aaaaaaaaaaa11111"
            #etag "alsjdfieflsajfajsfjadsljfalksjd"
            #entity {
                #id "example.org:aaaaaaaaaaa11111"
                #name {
                    #unstructured "Joe Gregorio"
                }
                #gender {
                    #displayvalue "Male"
                    #key          "MALE"
                }
            }
        }
    ]
    #next null
}

I see some readability enhancement with these changes. I think the big bang for the buck is the writability enhancement.

As I said above, the syntax isn't a really big deal for me. It's only a big deal when I have to read it with my eyes or type it at a keyboard. Which we seem to be doing more of these days. In fact, I had to fix two typos in Joe's original sample. Can you find them?

It's funny that with all the whining I do about unreadable and overly verbose XML, I find it easier to read and write XML than JSON. That might well just be because I'm typically reading and writing XML with syntax coloring viewers and editors, but not doing that so much with JSON.