Karl Hans Janke Kollaborativ
Heute die Welt, morgen das Sonnensystem!

JSON with blobs, still context-free

datalang.klein.jpg
Pen and paper design.

Inspired by their talk The Science of Insecurity I took Meredith Patterson and Sergey Bratus by their word and tried to solve my next network communication problem without crossing the line beyond deterministic context-free languages.

The upshot of said talk was that most if not all security problems stem from the fact that some software component could not foresee the consequences of its input. From a language-theoretic point of view, the problem boils down to recognizing the set (language) of acceptable inputs. There are different classes of languages whose recognizers require increasingly complex mechanisms. Things are basically pleasant with regular languages and one step up, aforementioned deterministic context-free ones. Up to this point we can algorithmically decide whether two specifications describe the same language; whether two peers in the network are cleanly interoperable.

When I was looking for a good data serialization format, in addition to my original requirements, I went looking for one that had a deterministic context-free grammar. Incidentally, one of the things I wanted to be able to do was efficiently transfer relatively large blocks of arbitrary data. Unfortunately, what immediately catapults you into the land of (mildly) context-sensitive languages are length fields.

JSON (as implied by the title) would have been my favorite choice, but the best way to put binary blobs in it is by encoding them as Base64-encoded strings.

{ "message":    "Hi, Bob!"
, "attachment": "ZGFmdXFpc3RoaXM="
}

For one thing, this means overhead in encoding time, decoding time and data volume. Also it is unsatisfying because one property of JSON is self-descriptiveness: recognizing a JSON value reveals its type. Base64 blobs would be hidden in strings and force the recipient to know exactly where to expect them.

A short note about overhead and efficiency concerns. It is generally rightfully considered foolish to optimize prematurely. From most standpoints, computers are fast, bandwidth is cheap and you are probably wasting ten times as much elsewhere as avoiding Base64 would ever save. Nevertheless, optimizing for efficiency is not useless and in the right place, a constant factor can make all the difference. Base64 will turn your 3GB download into a 4GB one. More importantly, I am treating this endeavor as much as an academic as a practical one, asking could we as much as do we want to. So below is the answer I came up with.

The idea is to break binary data into chunks of uniform size. I chose 4096 bytes rather arbitrarily. Allow one final chunk of variable length and encode that one in Base64. So every 4kB, there is one character (#) which means another 4k coming. There need not be any such raw chunks; they are always followed by exactly one (possibly empty) Base64 string enclosed in %. Examples:

#.....#.....%ZGFmdXFp%
#.....#.....%%
%ZGFmdXFp%

This syntax is added to JSON, along with a few other extensions.

Design goals

Non-goals

Notwithstanding the goal to support efficiency with large blobs, this format is not meant to squeeze every last bit out of everything. That conflicts with self-descriptiveness and is what Protocol Buffers are for.

Characteristics

Example

{ "null":         null
, "boolean":      true
, "integer":      1234
, "rational":     1234.56
, "exponent":     1234.56e2
, "hexadecimal":  0x123AB.CDxE
, "bytes":        %ZGFmdXFp%
, "string":       "Hello"
, "encoding":     "Mot\xF6rhead"_latin1
, "list":         [23,"skidoo"]
, "record":       {}
}

Show me the code!

Glad you asked! The child currently carries the rather stupid working title datalang and resides in a repository here:

darcs get http://code.khjk.org/datalang/

Included is the ABNF grammar as well as a demo parser implemented in C using hammer. Oh right, hammer. Given that this post has already turned into a novel, I am going to save that for later.

PS: If anyone thinks of a better name than datalang, your suggestion is very welcome at my easily-guessed email address.

Fingerprints are so 90s

p01.klein.jpg

A few years back I prepared a presentation on the so-called Socialist Millionaires' Protocol (SMP) for a university seminar. SMP is a solution to the problem of key authentication devised for OTR (Off-the-Record), the system for instant-messaging encryption.

Today I held a short version of the presentation for non-mathematicians at the CCC Hamburg. For the benefit of the Internet, the awesomely hand-made slides are in English. There is also a handy hand-out with a protocol diagram.

The written presentation for the course is completely in German and math-rich. I did try hard to make it a clear read for the so-inclined. Have fun! :)

An introduction to Bitcoin

I held a little intro talk about Bitcoin last night at a local Linux meetup kinda thing. It was a light technical description of what the system is and how it works.

Here are the slides and their LaTeX sources. That is all.

Blind signature basics

signature.klein.jpg

I'm starting work on my diploma thesis this month. The exact topic isn't set in stone yet, but it will be something crypto. If everything goes dreamy-awesome, I'll find something nice to write about lattice-based blind signatures or somesuch. Background:

So, time to sum up the basics. As far as my history serves, David Chaum invented blind signatures in the 80s for electronic voting but nobody wanted to buy that, so he also invented electronic cash. Then he got really paranoid and didn't sell it either. Real quick summary. ;)

Anyway…

The principle is to mix whatever you want signed (electronic voting ballot, 100 EUR banknote) with a random blinding factor and divide that out only after Trent (your government, bank) has signed. Thus Trent cannot recognize and connect the note to you when it comes back to him later.

The classic algorithm is based on RSA and is painted up fast. Unfortunately, my awesome markup language still has no fancy math support so you have to live with ASCII art:

m = message to be signed
e = public "encryption" (i.e. verification) exponent
n = public modulus
d = secret "decryption" (i.e. signing) exponent
k = blinding factor (just a random number)

x^(de) = x^(ed) = x (mod n)         -- RSA property

Alice prepares:  mk^e               -- blinded message
Trent signs:     (mk^e)^d = m^d k
Alice unblinds:  m^d k / k = m^d    -- signed message
Bob can check:   (m^d)^e = m

One might think that signing something completely blindly might be a bad idea. After all, a bank needs to know the value of the note it is signing. To ensure any desired property of the signed document, Trent can require a cut-and-choose step. In this case Alice must give him n different but equivalent messages. He chooses one of them and asks Alice to unblind all the others. Trent signs the remaining blinded one if the others satisfy the desired property. Alice's chance to cheat of 1:n can be made unattractive by attaching a suitable penalty.

Making a stupid little time waster with Javascript

memory.klein.png
Awesome!

I used the weekend to code a little memory game in Javascript. Apart from being pointless and annoying (especially if your visual memory sucks as much as mine), I think it turned out lovely!

I was hoping to gain some experience with user-facing Javascript from it, as previous excursions into the misunderstood programming language have been minimal in one way or another. So, this time I got to use some actual objects. Structured data, woo!

I poked around the design space a little to see what was up with Crockford's take on prototypal inheritance vs. others' assertions that his dislike of the new-operator was ill-founded. You can look at the code to see what I settled on in this instance, but maybe I'll leave that discussion for another post. Suffice it to say, it's a bit of a mish-mash but I'm sure it will crystallize nicely.

Oh yeah, and this is going onto my side project (I desperately needed one!) site kompilierfreizeit.de. I'm going to collect some nice other time wasters there, but won't say it out too loud, yet. That would just make me feel obligated…

Have fun!