
Tag: ruby

Reintroducing Traject: Traject 2.0

Traject 2.0.0 released! Now runs under MRI/RBX! traject is an ETL (extract/transform/load) system written in ruby with a special view towards extracting fields from MARC data and writing them out into Solr. [Jonathan Rochkind](http://bibwild.wordpress.com) and I wrote this primarily out of frustration using other tools in this space (e.g., Solrmarc, or my own precursor to traject, marc2solr). Note: Catmandu is another, perl-based system I don’t have any direct experience with. traject had its first release almost a year and a half ago (at least based on the date of my post introducing it), and I’ve used it literally…
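For flavor, a traject configuration is just a ruby file of `to_field` calls evaluated by the `traject` command; a minimal sketch from memory (the field choices and Solr URL are illustrative, and setting names should be checked against the traject README):

```ruby
# config.rb -- evaluated by the traject command, e.g.:
#   traject -c config.rb records.mrc
# (do not run as plain ruby; the DSL methods come from traject itself)

settings do
  provide 'solr.url', 'http://localhost:8983/solr/catalog'  # illustrative URL
end

# Pull the record id from the 001, and a title from 245 $a and $b
to_field 'id',    extract_marc('001', first: true)
to_field 'title', extract_marc('245ab')
```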


Ruby MARC serialization/deserialization revisited

A few years ago, I benchmarked various methods of serializing/deserializing MARC data using the ruby-marc gem. Given that I’m planning on starting fresh with my catalog setup, I thought I’d take a moment to revisit them. The biggest changes since that time have been (a) the continued speed improvements in JRuby, (b) the introduction of the Oj json parser for MRI ruby, and (c) wider availability of msgpack code in the wild. I also wondered what would happen if I tried ruby’s Marshal serialization; maybe it would be faster because I wouldn’t have to "manually" create a MARC::Record object from…
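The shape of that comparison can be sketched with stdlib tools alone; here Marshal and the stock JSON library round-trip the same nested hash (Oj and msgpack would slot in the same way, and the record shape here is invented for illustration):

```ruby
require 'json'
require 'benchmark'

# A nested hash standing in for a deserialized MARC record (illustrative)
record = { 'leader' => 'x' * 24,
           'fields' => (1..50).map { |i| { '500' => "note number #{i}" } } }

# Both serializations should round-trip losslessly before we bother timing them
json_copy    = JSON.parse(JSON.generate(record))
marshal_copy = Marshal.load(Marshal.dump(record))
puts json_copy == record     # => true
puts marshal_copy == record  # => true

# Crude timing of repeated dump+load cycles
n = 1_000
Benchmark.bm(8) do |bm|
  bm.report('json')    { n.times { JSON.parse(JSON.generate(record)) } }
  bm.report('marshal') { n.times { Marshal.load(Marshal.dump(record)) } }
end
```

Marshal skips building intermediate strings-and-hashes by hand, which is exactly the hypothesis the post floats; whether it wins in practice depends on the ruby implementation.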


Even better, even simpler multithreading with JRuby

[Yes, another post about ruby code; I’ll get back to library stuff soon.] Quite a while ago, I released a little gem called threach (for “threaded #each”). It allows you to easily process a block with multiple threads.

```ruby
# Process a CSV file with three threads
File.open('data.csv').threach(3, :each_line) { |line| send_to_db(line) }
```

Nice, right? The problem is that I could never figure out a way to deal with a break or an Exception raised inside the block. The core problem is that once a thread trying to push/pop from a ruby SizedQueue is blocking, there’s no way (I could find) to tell…
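The pattern behind threach is a bounded producer/consumer queue; a minimal sketch of that pattern (not the gem’s actual implementation, hardwired to #each and using a symbol as an end-of-work sentinel):

```ruby
# Minimal threach-style helper: feed items into a SizedQueue and
# process them with n worker threads. Illustrative only.
module MiniThreach
  def threach(n = 2)
    queue = SizedQueue.new(n * 2)        # bounded, so producers block when full
    workers = Array.new(n) do
      Thread.new do
        while (item = queue.pop) != :threach_eof
          yield item                     # run the caller's block on this thread
        end
      end
    end
    each { |item| queue.push(item) }     # produce
    n.times { queue.push(:threach_eof) } # one sentinel per worker
    workers.each(&:join)
  end
end

class Array; include MiniThreach; end

results = Queue.new
[1, 2, 3, 4].threach(2) { |i| results << i * 10 }
out = []
out << results.pop until results.empty?
puts out.sort.inspect  # => [10, 20, 30, 40]
```

Note that this sketch has exactly the weakness the post describes: a `break` inside the block executes on a worker thread, not the thread that called threach, so it can kill the worker and leave the producer blocked on a full queue.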


Ruby gem library_stdnums goes to version 1.0

I just released another (this time pretty good) version of my gem for normalizing/validating library standard numbers, library_stdnums (github source / docs). The short version of the functions available:

- ISBN: get checkdigit, validate, convert isbn10 to/from isbn13, normalize (to 13-digit)
- ISSN: get checkdigit, validate, normalize
- LCCN: validate, normalize

Validation of LCCNs doesn’t involve a checkdigit; I basically just normalize whatever is sent in and then see if the result is syntactically valid. My plan in my Copious Free Time is to do a Java version of these as well and then stick them into a new-style Solr v.3 filter so…
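For a sense of what the ISBN checkdigit function is doing under the hood, here is the standard ISBN-10 algorithm in plain ruby (a hedged sketch; the gem wraps this and the ISBN-13 variant behind StdNum::ISBN.checkdigit):

```ruby
# ISBN-10 checkdigit: weight the first nine digits by position (1..9),
# sum, and take mod 11; a result of 10 is written as 'X'.
def isbn10_checkdigit(isbn)
  digits = isbn.delete('^0-9').chars.first(9).map(&:to_i)
  sum = digits.each_with_index.sum { |d, i| d * (i + 1) }
  check = sum % 11
  check == 10 ? 'X' : check.to_s
end

puts isbn10_checkdigit('0-306-40615')  # => "2"
```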


A short ruby diversion: cost of flow control under Ruby

A couple days ago I decided to finally get back to working on threach to try to deal with problems it had — essentially, it didn’t deal well with non-local exits due to calls to break or even something simple like a NoMethodError. [BTW, I think I managed it. As near as I can tell, threach version 0.4 won’t deadlock anymore] Along the way, while trying to figure out how threads affect the behavior of different non-local exits, I noticed that in some cases there was still work being done by one or more threads long after there was an…


Four things I hate about Ruby

Don’t get me wrong. I use ruby as my default language when possible. I love JRuby in a way that’s illegal in most states. But there are…issues. There are with any language and the associated environment. These are the ones that bug the crap out of me. Ruby is slow. Let’s get this one out of the way right away. Ruby (at least the MRI 1.8.x implementation) is, for many things, slow. Sometimes not much slower. Sometimes (e.g., numerics) a hell of a lot slower. Now, there’s nothing necessarily wrong with that. For what I do, MRI Ruby is usually…


Size/speed of various MARC serializations using ruby-marc

Ross Singer recently updated ruby-marc to include a #to_hash method that creates a data structure that is (a) round-trippable without any data loss, and (b) amenable to serializing to JSON. He’s calling it marc-in-json (even though the serialization is up to the programmer, it’s expected most of us will use JSON), and I think it’s the way to go in terms of JSON-able MARC data. I wanted to take a quick look at the space/speed tradeoffs of using various means to serialize MARC records in the marc-in-json format compared to using binary MARC-21. Why bother? Binary MARC-21 is “broken” in…
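The round-trippable structure is a plain hash of the leader plus an ordered list of fields, which is what makes it so amenable to JSON; a hand-built sketch (the field data is invented for illustration, and real records would come from ruby-marc’s #to_hash):

```ruby
require 'json'

# A hash in the marc-in-json shape: leader string plus ordered fields,
# where control fields map tag => value and data fields carry
# indicators and an ordered subfield list.
record = {
  'leader' => '00000cam a2200000 a 4500',
  'fields' => [
    { '001' => '12345' },
    { '245' => {
        'ind1' => '1', 'ind2' => '0',
        'subfields' => [{ 'a' => 'A title :' }, { 'b' => 'a subtitle' }]
      } }
  ]
}

# Round-trips through JSON with no loss of field order or structure
puts JSON.parse(JSON.generate(record)) == record  # => true
```

Keeping fields as an ordered array (rather than a hash keyed by tag) is what preserves repeated tags and field order, the two things a naive JSON mapping of MARC tends to lose.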


Simple Ruby gem for dealing with ISBN/ISSN/LCCN

I needed some code to deal with ISBN10->ISBN13 conversion, so I put in a few other functions and wrapped it all up in a gem called library_stdnums. It’s only 100 lines of code or so and some specs, but I put it out there in case others want to use it or add to it. Pull requests at the github repo are welcome. Functionality is all as module functions, as follows:

```ruby
# ISBN
char = StdNum::ISBN.checkdigit(ten_or_thirteen_digit_isbn)
boolean = StdNum::ISBN.valid?(ten_or_thirteen_digit_isbn)
thirteenDigitISBN = StdNum::ISBN.convert_to_13(ten_or_thirteen_digit_isbn)
tenDigitISBN = StdNum::ISBN.convert_to_10(ten_or_thirteen_digit_isbn)

# ISSN
char = StdNum::ISSN.checkdigit(issn)
boolean = StdNum::ISSN.valid?(issn)

# LCCN
normalizedLCCN = StdNum::LCCN.normalize(lccn)
```

Again, there’s nothing special here…


Pushing MARC to Solr; processing times and threading and such

[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.] What’s the question? The question came up, “How much time do we spend processing the MARC vs trying to push it into Solr?”. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to solr was still, at best, taking at least as long as the processing stage. I’m interested because I’ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the…


ruby-marc with pluggable readers

I’ve been messing with easier ways of adding parsers to ruby-marc’s MARC::Reader object. The idea is that you can do this:

```ruby
require 'marc'
require 'my_marc_stuff'

mbreader = MARC::Reader.new('test.mrc')  # => Stock marc binary reader
mbreader = MARC::Reader.new('test.mrc', :readertype => :marcstrict)  # => ditto

MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)
mbreader = MARC::Reader.new('test.mrc')  # => Uses My::MARC::Parser now

xmlreader = MARC::Reader.new('test.xml', :readertype => :marcxml)

# ...and maybe further on down the road
asreader = MARC::Reader.new('test.seq', :readertype => :alephsequential)
mjreader = MARC::Reader.new('test.json', :readertype => :marchashjson)
```

A parser need only implement #each and a module-level method #decode_from_string. Read all about it on the github page.
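The registration mechanism itself is just a class-level lookup table from reader-type symbols to parser classes; a toy sketch of that pattern (names are illustrative, not ruby-marc’s actual internals):

```ruby
# Toy pluggable-parser registry: later registrations for a type
# override earlier ones, which is what lets a third-party parser
# replace the stock one.
class Reader
  @parsers = {}

  class << self
    def register_parser(klass, type)
      @parsers[type] = klass
    end

    def parser_for(type)
      @parsers.fetch(type)  # raises KeyError for unknown types
    end
  end
end

class StockStrictParser; end
class MyFastParser; end

Reader.register_parser(StockStrictParser, :marcstrict)
Reader.register_parser(MyFastParser, :marcstrict)  # override the stock one
puts Reader.parser_for(:marcstrict)  # => MyFastParser
```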
