Data structures and Serializations

April 20, 2010

Jonathan Rochkind, in response to a long (and, IMHO, mostly ridiculous) thread on NGC4Lib, has been exploring the boundaries between a data model and its expression/serialization (see here, here, and here), and I thought I’d jump in.

What this post is not

There’s a lot to be said about a good domain model for bibliographic data. I’m so not the guy to say it. I know there are arguments for and against various aspects of the AACR2 and RDA and FRBR, and I’m unable to go into them.

What I am comfortable saying is this:

Anyone advocating or dismissing a data model based on the data structure or serialization most-often associated with that model is missing the goddamn point.

Data serializations

…are boring. They’re unimportant at the data modeling stage, and only barely important when thinking about data structures. For any given data structure there are lots of ways you can serialize it. A standard programming-language hash can be represented in a zillion ways, for example: YAML, JSON, literals in various programming languages, .ini files, etc. Even MARC has two standard serializations (binary and XML) with several more actually in use (Aleph Sequential, for example).

So, let me repeat: serializations are boring and not worth talking about until you’ve got everything else nailed down. Any format you can round-trip your data structure to/from is fine.

Serializations are measured from “less pain” to “more pain”, but all have the exact same expressiveness. Data structures, on the other hand, do not.
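To make the round-trip point concrete, here is a minimal sketch in Python. The record and its field names are made up for illustration, and the XML layout (a `record` element holding `field` children) is just one arbitrary choice, not any standard.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical flat hash with string values; the field names are made up.
record = {"title": "Moby Dick", "author": "Melville, Herman", "year": "1851"}


def to_json(data):
    """Serialize the hash as a JSON object."""
    return json.dumps(data)


def from_json(text):
    """Rebuild the hash from its JSON form."""
    return json.loads(text)


def to_xml(data):
    """Serialize the hash as <record><field name="...">value</field>...</record>."""
    root = ET.Element("record")
    for key, value in data.items():
        ET.SubElement(root, "field", name=key).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Rebuild the hash from the ad hoc XML form above."""
    root = ET.fromstring(text)
    return {field.get("name"): field.text for field in root}


# Two very different-looking serializations, one and the same data structure.
assert from_json(to_json(record)) == record
assert from_xml(to_xml(record)) == record
```

The XML version takes more code (more pain), but both get the same hash back out, which is all a serialization has to do.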

A hierarchy of data structures

Think about the following data structures:

  • An ordered list
  • key-value pairs
  • A hierarchy (e.g., an XML document)
  • An undirected graph
  • A directed graph
  • A labeled, directed multigraph (e.g., a set of RDF Triples)

You don’t have to think very hard to see that any of these can be viewed as a restricted version of the data structures listed after it. An ordered list (array) is just a set of key-value pairs where the keys are each item’s position in the sequence. A set of key-value pairs is a very, very flat hierarchy. A hierarchy is a connected undirected graph without cycles (with one node picked out as the root). An undirected graph is a directed graph where you’re careful to make links both ways. And a directed graph can easily be represented as a set of RDF triples (where you may, for example, only have one label for your relationships: “links to”).

[Note that I didn't say any of these would be efficient implementations!]
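A few of those steps, sketched in Python. The function names and the sample data are my own illustrations, and the "triples" here are plain tuples rather than real RDF; the only point is that each conversion is mechanical and loses nothing.

```python
def list_to_pairs(items):
    """An ordered list as key-value pairs: the keys are the positions."""
    return dict(enumerate(items))


def undirected_to_directed(edges):
    """An undirected graph as a directed one: store every link both ways."""
    directed = set()
    for a, b in edges:
        directed.add((a, b))
        directed.add((b, a))
    return directed


def directed_to_triples(edges, label="links to"):
    """A directed graph as labeled triples that all share a single label."""
    return {(source, label, target) for source, target in edges}


items = ["intro", "body", "index"]
pairs = list_to_pairs(items)
# pairs == {0: 'intro', 1: 'body', 2: 'index'}

# A tiny hierarchy written as undirected parent-child links (no cycles).
hierarchy = {("book", "chapter 1"), ("book", "chapter 2")}
directed = undirected_to_directed(hierarchy)
triples = directed_to_triples(directed)
# Each undirected link becomes two directed edges, and each edge one triple.
```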

The reverse is not true — or, at least, not without an incredible amount of “out of band” information in another layer somewhere.

The structures at the end of the list have more expressiveness. You can just plain model more things in them (give-or-take the out-of-band stuff, composition, etc.) per unit of screwing around. I’m not going to try to model my set of key-value pairs in an array. I could do it, but it would take so much of my attention that the data modeling would suffer.
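Here is roughly what that screwing around looks like, as a small Python sketch with a made-up record. The convention "even positions are keys, odd positions are values" is the out-of-band information: it lives in the code and in your head, not in the array.

```python
# Hypothetical key-value pairs; the field names are made up.
pairs = {"title": "Moby Dick", "year": "1851"}


def pairs_to_array(pairs):
    """Flatten key-value pairs into a plain array: key, value, key, value, ..."""
    flat = []
    for key, value in pairs.items():
        flat.extend([key, value])
    return flat


def array_to_pairs(flat):
    """Recover the pairs. This only works if you already know the convention
    that even positions hold keys and odd positions hold values."""
    return dict(zip(flat[0::2], flat[1::2]))


flat = pairs_to_array(pairs)
# flat == ['title', 'Moby Dick', 'year', '1851']
assert array_to_pairs(flat) == pairs
```

Nothing in `flat` itself says which strings are keys. Anyone handed the array without the convention just sees four strings in a row.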

Don’t handicap yourself

Don’t start with the data structure.

DON’T START WITH THE DATA STRUCTURE!

GET THAT MOTHER-FREAKIN’ DATA STRUCTURE OFF MY MOTHER-FREAKIN’ PLANE!

Seriously. Don’t be stupid. If all you’ve got is a hammer, everything starts to look like a thumb.

If you start off with a restrictive data structure before you even fully define the domain you’re trying to model, you may hose yourself. You may end up making stupid decisions based on the toolchain you’re imagining in your head.

Domain modeling is ridiculously hard for any domain worth modeling. If you start with a handicap (a restrictive data structure) it’s going to be even harder.

No one would think of trying to model bibliographic data using only arrays. That’s premature optimization on an epic scale.

The appeal of RDF Triples

Even if you ignore all the semantics and rules that make RDF Triples a value-added instance of a labeled, directed multigraph, the appeal (to me, anyway) is that any semantic model based on RDF Triples has enormous expressive power at its disposal.

Does it turn out that after you’ve fully satisfied the necessary model for the domain, the semantics you need can actually be accomplished with something earlier in the list? Awesome. Go with it. You’ll get great implementations with good real-life computing characteristics. A database can often usefully be thought of as an implementation of an undirected graph with typed nodes (and, perhaps, some typed links, if you use the column name in the calling table as a “type” of sorts, and add some out-of-band knowledge). And lord knows RDBMSs have great performance characteristics.

But don’t start there. Start with the domain. Model it. Figure out what you need to describe and derive. Then pick the most appropriate data structure.

The nightmare that is MARC

MARC-the-data-structure (not to be confused with a serialization of that data structure, on the one hand, or with the AACR2 on the other) can incompletely (but usefully, I think) be described as:

  • A set of key-value pairs
  • …that have a defined order
  • …where keys can be repeated
  • …and values are strings
  • …and keys are a concatenation of tag/ind1/ind2/code

Control fields are especially restricted (ind1, ind2, and code are all ‘null’). There have been some bullshit attempts at links (e.g., the 880 fields) but really, this is it.
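Taken literally, that description fits in a few lines of Python. This is a sketch of the simplified structure described above, not of any real MARC library; the `Key` type, the `values_for` helper, and the sample values are all made up for illustration.

```python
from typing import NamedTuple, Optional


class Key(NamedTuple):
    """A key is the concatenation tag/ind1/ind2/code."""
    tag: str
    ind1: Optional[str]
    ind2: Optional[str]
    code: Optional[str]


# An ordered list of (key, value) pairs: keys may repeat, values are strings.
# All sample values are invented.
record = [
    (Key("001", None, None, None), "000123456"),  # control field: all nulls
    (Key("245", "1", "0", "a"), "Moby Dick /"),
    (Key("245", "1", "0", "c"), "Herman Melville."),
    (Key("650", " ", "0", "a"), "Whaling"),
    (Key("650", " ", "0", "a"), "Sea stories"),
]


def values_for(record, tag, code=None):
    """Return, in record order, every value whose key matches tag (and code)."""
    return [
        value
        for key, value in record
        if key.tag == tag and (code is None or key.code == code)
    ]


subjects = values_for(record, "650")
title = values_for(record, "245", code="a")
```

Like the description it follows, this is incomplete: for one thing, it does not record which subfields belong to the same field when a tag repeats.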

It doesn’t give us much to work with. It’s restricted. And, sadly, so is our thinking.

Putting the cart before the horse

As Jonathan (and zillions of others) rightly point out, a huge problem in the library world is that there are generations (plural) of working librarians who, because of years of practice, find it incredibly hard to think about bibliographic data as modeled outside the constraints inherent in the MARC data structure. It’s a handicap. It’s an anchor around our necks.

MARC-the-data-model (née AACR2) is not inherently bad because it’s built on an impoverished data structure. It’s bad because it does a shitty job at modeling the bibliographic data space. If we could produce a good model in a crappy data structure like that, well, that’d be awesome because it would indicate that things are simple.

Things, of course, aren’t simple. They’re hard.

So, if you want to complain about MARC or RDA or FRBR, figure out what it’s trying to model and talk about the fidelity of the model with respect to the problem space. But don’t conflate data models, data structures, and serializations.

Oh, and don’t say “PIN Number” or “ATM Machine.” That drives me crazy, too.

Comments:4

  1. Jonathan Rochkind
    April 20, 2010 at 5:12 pm

    A brief exchange Bill and I had in IRC, which I think is further illuminating:

    (5:10:13 PM) jrochkind: BillDueber: I’d say the problem is that MARC is BOTH a “data model” AND a “data structure.” Even though [it] was never designed as a data model, it has become one.

    (5:11:02 PM) BillDueber: jrochkind: Right. We long ago passed the point where the model drives the data structure. It’s [now] the other way around. [which is a bad thing]

  2. MJ Suhonos
    April 20, 2010 at 10:08 pm

    Bill, just to clarify my perspective on the issue, I fully agree with everything you’re saying above. In fact, your explanation is probably the clearest I’ve seen to date. And the thread is definitely ridiculous. :-)

  3. Joe Montibello
    April 21, 2010 at 4:32 pm

    Hi,

    I’m not at all familiar with Domain Modelling – so far what I know about it comes from this blog post, plus a less useful Wikipedia article, plus a random white paper I googled up. (http://www.aptprocess.com/whitepapers/DomainModelling.pdf)

    My question is this: would modelling the domain for a library system consist of coming up with something like the set of behaviors that FRBR describes, and then building a data structure based on that?

    Thanks for an interesting post, anyway. Joe M.

  4. Jakob
    August 6, 2010 at 3:03 am

    Yes, FRBR is one example of a Domain Model – librarians can do this. But with FRBR they failed to define a serialization. Domain Models help you to talk about things with human beings. But to exchange data you need a serialization of the model. I agree that the model must come first, but when you stop there, you end up doing no data exchange but philosophy (which is nice too).

Trackbacks:1

Listed below are links to weblogs that reference Data structures and Serializations

pingback from Painting a picture of metadata « (d)atalog(ue) May 14, 2010

[...] librarian would put their metadata in a data format (or “content format” or “data structure“).  Some examples are binary or XML.  It is the carrier for the content, just like how a CD [...]