Archives: April 2009

After a medium-sized discussion on #code4lib, we’ve collectively decided that…well, ok, no one really cares all that much, but a few people weighed in.

The new format is: A list of arrays. If it’s got two elements, it’s a control field; if it’s got four, it’s a data field.

So…it’s like this now.

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }
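
Since the only thing distinguishing a control field from a data field is now the length of its array, here is a minimal Python sketch of how a consumer might tell them apart. The function name `split_fields` and the trimmed sample record are my own illustrations, not part of the format.

```python
import json

# A trimmed copy of the example record above.
record_json = """
{
  "type": "marc-hash",
  "version": [1, 0],
  "leader": "leader string",
  "fields": [
    ["001", "001 value"],
    ["035", " ", " ", [["a", "(RLIN)MIUG0000733-B"]]],
    ["245", "1", "0", [["a", "Capitalism, primitive and modern;"],
                       ["b", "some aspects of Tolai economic growth"]]]
  ]
}
"""


def split_fields(record):
    """Separate control fields (2 elements) from data fields (4 elements)."""
    control, data = [], []
    for field in record["fields"]:
        if len(field) == 2:
            control.append(field)
        elif len(field) == 4:
            data.append(field)
        else:
            raise ValueError(f"unexpected field length: {field!r}")
    return control, data


record = json.loads(record_json)
control, data = split_fields(record)
```

Both lists keep the fields in their original relative order, which is the whole point of the flat list.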

One Response to “MARC-HASH: The saga continues (now with even less structure)”

  1. [...] I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, [...]


Why do I ever, ever think that MARC might not rely on order? I don’t know.

In any case, control fields will now be just an array of duples:

  control: [
    ['001', 'value of the 001'],
    ['006', 'value of the 006'],
    ['006', 'another 006']
  ]


In my first shot at MARC-in-JSON, which I appropriately (and prematurely) named MARC-JSON, I made a point of losing round-tripability (to and from MARC) in order to end up with a nice, easy-to-work-with data structure based mostly on hashes. “Who really cares what order the subfields come in?” I asked myself.

Well, of course, it turns out some people do. Some even care about the order of the tags. “Only in the 500s…usually” I was told today. All my lovely dreams of using easy-to-access hashes went up in so much smoke.

So…I’m suggesting we try something a little simpler. Something so brain-dead, in fact, that I’m loath to put it down because it’s pretty much the obvious way to do it. To wit:

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "control" : [
      ["001", ["all", "001", "values"]],
      ["002", ["all", "002", "values"]]
    ],
    "data" : [
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }

Stupid MARC allows all the stupid fields to stupid repeat and be out of stupid order and such, so it’s just a lot of arrays. Easily round-tripable.

Why bother? Excellent question, and one that’s a little harder to answer now that the data structure requires so much looping to find anything (the first time, anyway). I guess it’s still a lot easier than working with raw MARC (or, I would claim, MARC-XML), requires no special libraries in any language that supports strings, hashes, and arrays, and can be manipulated with basic language constructs.
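
To make the "no special libraries" claim concrete, here is a small Python sketch that uses only the standard `json` module and a loop. The helper name `subfield_values` and the trimmed record are assumptions for illustration; nothing here is part of the proposal itself.

```python
import json

record = json.loads("""
{
  "type": "marc-hash",
  "version": [1, 0],
  "leader": "leader string",
  "control": [["001", ["all", "001", "values"]]],
  "data": [
    ["035", " ", " ", [["a", "(RLIN)MIUG0000733-B"]]],
    ["035", " ", " ", [["a", "(CaOTULAS)159818014"]]],
    ["245", "1", "0", [["a", "Capitalism, primitive and modern;"]]]
  ]
}
""")


def subfield_values(record, tag, code):
    """Return every value of subfield `code` in fields tagged `tag`, in record order."""
    return [
        value
        for field_tag, _ind1, _ind2, subfields in record["data"]
        if field_tag == tag
        for sub_code, value in subfields
        if sub_code == code
    ]


# Serializing and re-parsing gives back the same structure.
roundtrip_ok = json.loads(json.dumps(record)) == record
system_numbers = subfield_values(record, "035", "a")
```

Every lookup is a linear scan over the fields, which is the "so much looping" cost mentioned above.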

A few things worth noting about the assumptions in my mind:

  • By definition, it’s always UTF-8. The leader should be changed to note this on the sending end, but it’s not required.
  • We include both a type “marc-hash”, and a version with major/minor numbers.
  • Everything is a string.
  • Alpha characters in indicators/tags are all lowercased.
  • A control field is a duple: tag and array of values.
  • A data field has four values:
    1. The tag
    2. Indicator one
    3. Indicator two
    4. An array of duples: subfield and its value
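
The assumptions above are simple enough to check mechanically. The following Python sketch is one possible way to do it; `check_record` is a hypothetical helper, and it only tests the structural rules listed here (it does not look at encoding or at the contents of the leader).

```python
def check_record(record):
    """Return a list of problems found; an empty list means the record looks fine."""
    problems = []
    if record.get("type") != "marc-hash":
        problems.append("type is not 'marc-hash'")
    version = record.get("version")
    if not (isinstance(version, list) and len(version) == 2):
        problems.append("version is not a [major, minor] pair")
    if not isinstance(record.get("leader"), str):
        problems.append("leader is not a string")

    # Control fields: a duple of tag and array of string values.
    for field in record.get("control", []):
        if len(field) != 2 or not isinstance(field[1], list):
            problems.append(f"bad control field: {field!r}")
        elif not all(isinstance(v, str) for v in field[1]):
            problems.append(f"non-string control value in {field[0]}")

    # Data fields: tag, indicator one, indicator two, array of subfield duples.
    for field in record.get("data", []):
        if len(field) != 4:
            problems.append(f"bad data field: {field!r}")
            continue
        tag, ind1, ind2, subfields = field
        for s in (tag, ind1, ind2):
            if not isinstance(s, str) or s != s.lower():
                problems.append(f"tag/indicator not a lowercase string: {s!r}")
        for sub in subfields:
            if len(sub) != 2 or not all(isinstance(x, str) for x in sub):
                problems.append(f"bad subfield in {tag}: {sub!r}")
    return problems
```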

A simple transformation to make it a little more queryable

Let’s say you don’t give a damn about tags that appear out of order, because that’s just a crime against nature, anyway. And you really don’t care what order the subtags appear in most of the time, ’cause really, who does?

A simple run-through (pseudocode ahead):

  my marchash = getTheMarcHash();
  my kindamarc;
  kindamarc{leader} = marchash{leader};

  # Map the control fields by tag => array-of-values.
  # cfield[1] is already an array of values; append each of them.
  foreach cfield (marchash{control}) {
    kindamarc{control}{cfield[0]} ||= [];
    kindamarc{control}{cfield[0]}.push(cfield[1]);
  }

  foreach d (marchash{data}) {
    (tag, ind1, ind2) = (d[0], d[1], d[2]);

    # Build up a hash of subfield => array-of-values for this field
    newd = {};
    foreach subfield (d[3]) {
      (stag, sval) = subfield;
      newd{stag} ||= [];
      newd{stag}.push(sval);
    }

    # Store the subfield hash under both the real indicators and the
    # wildcard '*' so it's easy to find.
    foreach i1 ('*', ind1) {
      foreach i2 ('*', ind2) {
        kindamarc{data}{tag}{i1}{i2} ||= [];
        kindamarc{data}{tag}{i1}{i2}.push(newd);
      }
    }
  }

Control fields are stored as arrays of values associated with the tag. Data fields are built up as a hash of subfield to array-of-values pairs, and then stored both based on the indicator given and the wildcard indicator ‘*’.
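
For anyone who would rather see the transformation in a real language, here is a runnable Python sketch of the same idea. The function name `to_kindamarc`, the sample record, and the `example.org` URLs are all made up for illustration.

```python
from collections import defaultdict


def to_kindamarc(marchash):
    """Build the order-losing but easy-to-query structure described above."""
    control = defaultdict(list)
    for tag, values in marchash["control"]:
        control[tag].extend(values)  # values is already an array of strings

    data = {}
    for tag, ind1, ind2, subfields in marchash["data"]:
        by_code = defaultdict(list)
        for code, value in subfields:
            by_code[code].append(value)
        by_code = dict(by_code)
        # File the field under its real indicators and under the '*' wildcard.
        # Sets avoid filing twice if an indicator happens to be '*' itself.
        for i1 in {"*", ind1}:
            for i2 in {"*", ind2}:
                data.setdefault(tag, {}).setdefault(i1, {}).setdefault(i2, []).append(by_code)

    return {"leader": marchash["leader"], "control": dict(control), "data": data}


# Hypothetical sample record with two 856 fields.
marchash = {
    "type": "marc-hash",
    "version": [1, 0],
    "leader": "leader string",
    "control": [["001", ["12345"]]],
    "data": [
        ["856", "4", "1", [["u", "http://example.org/a"]]],
        ["856", "4", "0", [["u", "http://example.org/b"]]],
    ],
}

kindamarc = to_kindamarc(marchash)
first001 = kindamarc["control"]["001"][0]
# 856s where indicator 2 is '1', whatever indicator 1 is
hits = kindamarc["data"]["856"]["*"]["1"]
```

Each field's subfield hash is shared between the four places it is filed, so the extra lookups cost references rather than copies.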

Basically, this will allow things like this:

  $leader = $kindamarc{leader};
  $first001 = $kindamarc{control}{"001"}[0];

  # Find 856s where indicator 2 is '1'
  @mystuff = $kindamarc{data}{856}{'*'}{1};

It’s easy to see how we could store the index from the original array to make it easy to find the original order, too.

For many, I’m sure, the prospect of dealing with something like this is more daunting than just learning to use MARC-XML or using existing libraries to deal with straight MARC. But there seems to be a set of folks out there for whom this might be useful, so I’m throwing it out there.

2 Responses to “MARC-Hash: a proposed format for JSON/YAML/Whatever-compatible MARC records”

  1. This is basically just MARC-XML translated to JSON, yes?

    Do you really find it easier to work with than MARC-XML? I guess if you’re doing your work in javascript, maybe.

  2. Bill says:

    Meh. I agree — and hence the disclaimers — that it’s nothing special. But “nothing special that everyone agrees on” is better than no agreement at all, and for some people XML processing is a barrier they don’t want to have to deal with. For all its seeming ubiquity, lots of folks never have to deal with XML in their programming.

    I guess my thing is, “IF we’re going to serialize MARC as JSON/YAML/Whatnot, let’s all do it the same way.”
