  • Disco-Powered pymarc

    I'd long been interested in developing code with some sort of MapReduce implementation for distributed computing. I have never been able to get my head around Hadoop, so I gave up on that pretty quickly. I recently discovered Disco, a MapReduce framework with an Erlang-based core. Disco also allows you to write your worker code in Python, which was a huge plus to me. After stumbling through the tutorial, I took the word count demo and put together some basic code using pymarc that gathered tag count statistics for a bunch of MARC files. The code's still in a very early form, and arguably should carve up large files into smaller chunks to pass off to the worker processes; I've gotten around this for the time being by splitting up the files using yaz-marcdump. Once I split the files, I pushed them into a tag of DDFS, the Disco Distributed File System. This was a useful way for me to write some demo code both for using pymarc and Disco.
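    The tag count idea can be sketched without a Disco cluster. The following is a minimal local illustration, not the code from the post: the function names are hypothetical, the map and reduce steps run in a single process, and records are only assumed to offer pymarc's `get_fields()` method with a `tag` attribute on each field.

```python
from collections import Counter
from functools import reduce


def map_tags(record):
    """Map step: count the field tags in a single record.

    `record` is assumed to behave like a pymarc Record, i.e. calling
    get_fields() with no arguments returns all of its fields.
    """
    return Counter(field.tag for field in record.get_fields())


def reduce_counts(total, partial):
    """Reduce step: fold one record's counts into the running total."""
    total.update(partial)
    return total


def count_tags(records):
    """Run the map and reduce steps locally over any iterable of records."""
    return reduce(reduce_counts, map(map_tags, records), Counter())


def read_records(path):
    """Yield records from one MARC file (needs pymarc installed)."""
    from pymarc import MARCReader  # lazy import: the rest runs without pymarc

    with open(path, "rb") as handle:
        for record in MARCReader(handle):
            if record is not None:  # skip records the reader could not parse
                yield record
```

    With pymarc installed, `count_tags(read_records("chunk-001.mrc")).most_common(10)` would list the ten most frequent tags in one of the split files; the file name here is only a placeholder.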
  • pybhl: Accessing the Biodiversity Heritage Library's Data Using OpenURL and Python

    Via Twitter, I heard about the Biodiversity Heritage Library's relatively new OpenURL Resolver, announced in their blog about a month ago. More specifically, I heard about Matt Yoder's new Ruby library, rubyBHL, which exploits the BHL OpenURL Resolver to provide metadata about items in their holdings and does some additional screenscraping to return things like links to the OCRed version of the text. In typical fashion, I've ported Matt's library to Python, and have released my code. pybhl is available from my site, PyPI, and Github. Use should be fairly straightforward, as seen below:

        >>> import pybhl
        >>> import pprint
        >>> b = pybhl.BHLOpenURLRequest(genre='book', aulast='smith', aufirst='john', date='1900', spage='5', volume='4')
        >>> r = b.get_response()
        >>> len(['citations'])
        3
        >>> pprint.pprint(['citations'][1])
        {u'ATitle': u'',
         u'Authors': [u'Smith, John Donnell,'],
         u'Date': u'1895',
         u'EPage': u'',
         u'Edition': u'',
         u'Genre': u'Journal',
         u'Isbn': u'',
         u'Issn': u'',
         u'ItemUrl': u'',
         u'Language': u'Latin',
         u'Lccn': u'',
         u'Oclc': u'10330096',
         u'Pages': u'',
         u'PublicationFrequency': u'',
         u'PublisherName': u'H.N. Patterson,',
         u'PublisherPlace': u'Oquawkae [Ill.] :',
         u'SPage': u'Page 5',
         u'STitle': u'',
         u'Subjects': [u'Central America', u'Guatemala', u'Plants', u''],
         u'Title': u'Enumeratio plantarum Guatemalensium imprimis a H.
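    For readers curious what a request like this boils down to, here is a rough sketch of building an OpenURL query string by hand. It is not pybhl's implementation: `build_openurl` is a hypothetical helper, the endpoint URL and the `format` parameter are assumptions to check against BHL's documentation, and the code is written for Python 3 rather than the Python 2 shown in the session above.

```python
from urllib.parse import urlencode

# Assumed endpoint for the BHL OpenURL resolver; verify before relying on it.
BHL_OPENURL = "https://www.biodiversitylibrary.org/openurl"


def build_openurl(base=BHL_OPENURL, response_format="json", **citation):
    """Return a resolver URL for the given citation fields.

    Empty values are dropped, and parameters are sorted so the
    resulting URL is stable from one call to the next.
    """
    params = {key: value for key, value in citation.items() if value}
    params["format"] = response_format  # assumed name of the format parameter
    return base + "?" + urlencode(sorted(params.items()))


url = build_openurl(genre="book", aulast="smith", aufirst="john",
                    date="1900", spage="5", volume="4")
print(url)
```

    Fetching and decoding the response is left out on purpose, since the exact shape of the resolver's output is not something this sketch can vouch for.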
  • "Using the OCLC WorldCat APIs" now available in Python Magazine

    As of last Thursday, I have been inducted into the pantheon of published Python programmers (aye, abuse of alliteration is always acceptable). My article, "Using the OCLC WorldCat APIs," appears in the latest issue (June 2009) of Python Magazine. I'd like to thank my editor, Brandon Craig Rhodes, for helping me along in the process, not the least of which includes catching bugs that I'd overlooked. The article includes a brief history lesson about OCLC, WorldCat, and the WorldCat Affiliate APIs, a detailed introduction to worldcat, my Python module to interact with OCLC's APIs, and a brief introduction to SIMILE Exhibit, which helps generate the holdings mashup referenced earlier on my blog. Subscribers to Python Magazine have access to a copy of the code containing a functional OCLC Web Services key ("wskey") to explore the application.
  • worldcat In The Wild at OCLC's WorldCat Mashathon in Amsterdam

    It's good to see other people using your code. Thanks to the OCLC Devnet Blog, I found out that Etienne Posthumus used worldcat for a demo application he built during the WorldCat Mashathon in Amsterdam last week. Even more interesting is that Etienne's application was deployed on Google App Engine. Courtesy of OCLC's Alice Sneary, there is a brief video of Etienne presenting his application to the other Mashathon attendees:
  • Batch Reindexing for Drupal + Solr

    Crossposted to NYPL Labs. Sorry for any duplication! Hey, do you use Drupal on a site with several thousand nodes? Do you also use the Apache Solr Integration module? If you're like me, you've probably needed to reindex your site but couldn't be bothered to wait for those pesky cron runs to finish; in fact, that's what led me to file a feature request on the module to begin with. Well, fret no more, because thanks to me and Greg Kallenberg, my illustrious fellow Applications Developer at NYPL DGTL, you can finally use Drupal's Batch API to reindex your site. The module is available as an attachment from that same issue node. Nota bene: this is a really rough module, with code swiped pretty shamelessly from the Example Use of the Batch API page. It works, though, and it works well enough as we tear stuff down and build it back up over and over again.
  • DigitalNZ and Brooklyn Museum API Modules for Python

    I've been busy the last few weeks, so I didn't even really announce this to begin with! I've been playing around with some of the cultural heritage APIs that are available, some of which I learned about while I was at Museums and the Web 2009. While I was away I released code for a Python module for interacting with the Brooklyn Museum Collections API. After chatting with Virginia Gow from DigitalNZ, I also got motivated to write a Python module to interact with the DigitalNZ API. The code for both is fairly unpolished, but I'm always ready for feedback! Both modules are available as Mercurial repositories linked from my Bitbucket account. There's also a small cluster of us working on a museum API wiki to begin sorting out some of these issues. Comparatively speaking, the library and archives world has it somewhat easy...
  • Brooklyn Museum Releases API

    The always groundbreaking Brooklyn Museum has now released an API to allow the public to interact with their collections data. I can't even tell you how happy I am about this from an open data perspective. Also, this is the direction that makes the whole "detailed curation by passionate amateurs" thing possible. There are only three simple methods for accessing the data. Ideally, it would be nice to see them put their collections metadata up as linked data, but now I'm daring to dream a little. Hey, wait a minute! I think that's the perfect way to start playing around with the API. Doing some digging through the documentation, I'm seeing that all the objects and creators seem to have URIs. Take a crack at it - the registration form is ready for you.
  • Moving worldcat to Mercurial and Bitbucket

    It's official - I've moved the codebase for worldcat, my Python module for working with the OCLC WorldCat APIs, to be hosted on Bitbucket, which uses the Mercurial distributed version control system. You can find the new codebase at
  • API Fun: Visualizing Holdings Locations

    In my previous post, I included a screenshot of a prototype, but glossed over what it actually does. Given an OCLC record number and a ZIP code, it plots the locations of the nearest holdings of that item on a Google Map. Pulled off in Python (as all good mashups should be), along with SIMILE Exhibit, it uses the following modules: geopy, simplejson, and, of course, worldcat. If you want to try it out, head on over here. The current version of the code will soon be available as part of the examples directory in the distribution for worldcat, which can be found in my Subversion repository.
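    The "nearest holdings" step can be illustrated with a small distance sort. This is only a sketch under stated assumptions: the prototype uses geopy, while the code below substitutes a hand-written haversine formula so that it runs with the standard library alone, and the holdings dictionaries (with `name`, `lat`, and `lon` keys) are a hypothetical shape rather than what worldcat returns.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_holdings(origin, holdings, limit=10):
    """Return up to `limit` holdings, closest to `origin` first.

    `origin` is a (lat, lon) pair, e.g. the geocoded ZIP code.
    """
    def distance(holding):
        return haversine_km(origin, (holding["lat"], holding["lon"]))

    return sorted(holdings, key=distance)[:limit]
```

    The sorted list is what would then be handed to the mapping layer, one marker per holding institution.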
  • This Is All I'm Going To Say On This Here Blogsite Concerning The Brouhaha About The Policy for Use and Transfer of WorldCat Records Because I Have Other, More Interesting And More Complex Problems To Solve (And So Do You)

    The moderated discussion hosted and sponsored by Nylink went pretty well. Also, I don't need the records to have fun with the data; I just need robust APIs. (In fact, as I said today, I'd prefer not to have to deal with the MARC records directly.) Robust APIs would help turn prototypes like this one, which I hacked together in a few hours, into real, usable services.
  • Lightening the load: Drupal and Python

    Man, if this isn't a "you got your peanut butter in my chocolate" thing or what! As I wrote over on the NYPL Labs blog, we've been up to our necks in Drupal at MPOW, and I've found that one of the great advantages of using it is rapid prototyping without having to write a whole lot of code. Again, that's how I feel about Python, too, but you knew that already. Once you've got a prototype built, how do you start piping stuff into it? In Drupal 6, a lot of the contrib modules to do this need work - most notably, I'm thinking about node_import, which as of yet still has no (official) CCK support for Drupal 6 and CCK 2. In addition, you could be stuck with having to write PHP code for the heavy lifting, but where's the joy in that? Well, it so happens that the glue becomes the solvent in this slow, slow dance.
  • Going off the Rails: Really Rapid Prototyping With Drupal

    Previously posted on The other Labs denizens and I are going off the rails on a crazy train deeper down the rabbit hole of reimplementing the NYPL site in Drupal. As I pile my work on the fire, I've found that building things in Drupal is easier than I'd ever thought it to be. It's a scary thought, in part because I'm no fan of PHP (the language of Drupal's codebase). Really, though, doing some things can be dead simple. It's a bit of a truism in the Drupal world at this point that you can build a heck of a lot just by using the CCK and Views modules. The important part is that you can build a heck of a lot without really having to know a whole lot of code. This is what threw me off for so long - I didn't realize that I was putting too much thought into building a model like I normally would with another application framework.
  • deliciouscopy: a dumb solution for a dumb problem

    You'd think there was some sort of tried and true script for Delicious users to repost bookmarks from their inboxes into their accounts, especially given that there are often shared accounts where multiple people will tag things as "for:foo" to have them show up on foo's Delicious account. Well, there wasn't, until now (at least as far as I could tell). Enter deliciouscopy. It uses pydelicious, as well as the Universal Feed Parser and simplejson. It reads a user's inbox, checks to see if the poster of the for:whomever tag was added to your network, and reposts accordingly, adding a via: tag for attribution. It even does some dead simple logging if you need that sort of thing. The code's all there, and GPL license blah blah blah. I hacked this together in about an hour for something at MPOW - namely to repost things to our shared account. It's based on Michael Noll's but diverges from it fairly quickly. Enjoy, and give any feedback if you must.
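    The filtering logic described above is simple enough to sketch on plain data. This is not the deliciouscopy source: `plan_reposts` and the entry dictionaries are hypothetical, fetching the inbox and posting the bookmarks are left out, and dropping the original for: tags before reposting is an assumption of this sketch.

```python
def plan_reposts(inbox, network):
    """Return the bookmarks to repost, each credited with a via: tag.

    `inbox` is a list of dicts with 'url', 'title', 'sender' and 'tags'
    keys; `network` is the set of usernames already in your network.
    """
    reposts = []
    for entry in inbox:
        sender = entry["sender"]
        if sender not in network:
            continue  # ignore inbox items from people outside the network
        # Assumption: drop the for: routing tags, keep everything else.
        tags = [tag for tag in entry["tags"] if not tag.startswith("for:")]
        tags.append("via:" + sender)  # attribution for the original poster
        reposts.append({"url": entry["url"],
                        "title": entry["title"],
                        "tags": tags})
    return reposts
```

    Keeping this step free of network calls makes it easy to log or dry-run the planned reposts before anything is written to the shared account.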
  • Developing Automated Repository Deposit Modules for Archivists' Toolkit?

    I'd like to gauge interest for people to help add code to Archivists' Toolkit to automate the deposit of digital objects into digital repositories. At first glance, the biggest issue is having to deal with differing deposit APIs for each repository, but using something like SWORD would make sense to bridge this gap. Any and all feedback is welcome!
  • Python WorldCat Module v0.1.2 Now Available

    In preparation for the upcoming WorldCat Hackathon starting this Friday, I've made a few changes to worldcat, my Python module for interacting with OCLC's APIs. Most notably, I've added iterators for SRU and OpenSearch requests, which (like the rest of the module) painfully need documentation. It's available either via download from my site or via PyPI; please submit bug reports to the issue tracker as they arise. EDIT: I've bumped up the version number another micro number to 0.1.2 as I've just added the improvements mentioned by Xiaoming Liu on the WorldCat DevNet Blog (LCCN query support, support for tab-delimited and CSV responses for xISSNRequests, and support for PHP object responses for all xIDRequests). EDIT: Thanks to Thomas Dukleth, I was told that code for the Hackathon was to be licensed under the BSD License. Accordingly, I've now dual licensed the module under both GPL and BSD.
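    As a rough picture of what an iterator over paged search results does, here is a generic generator. It is a sketch only and does not reproduce the worldcat module's classes: `iter_records` and `fetch_page` are hypothetical names, and the only detail borrowed from SRU is that record positions start at 1.

```python
def iter_records(fetch_page, page_size=10, start=1):
    """Yield records one by one, requesting further pages as needed.

    `fetch_page(start, count)` is a caller-supplied function that returns
    a list of at most `count` records beginning at position `start`.
    """
    while True:
        page = fetch_page(start, page_size)
        if not page:
            return  # nothing left to fetch
        yield from page
        if len(page) < page_size:
            return  # a short page means this was the last one
        start += len(page)
```

    The caller loops over the generator as if it were one long list, and requests are only made when the loop actually reaches the next page.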
  • Introducing djabberdjaw

    djabberdjaw is an alpha-quality Jabber bot written in Python that uses Django as an administrative interface to manage bot and user profiles. I've included a couple of plugins out of the box that will allow you to perform queries against Z39.50 targets and OCLC's xISBN API (assuming you have the requisite modules). djabberdjaw requires Django 1.0 or later, jabberbot, and xmpppy. It's available either from PyPI (including using easy_install) or via Subversion. You can browse the Subversion repository, too.
  • Slaying the Scary Monsters

    Previously posted on

    Drawings of monster and devil. Digital ID: 434322. New York Public Library

    Getting up to speed is hard anywhere, and it's especially difficult in a large, complex institution like NYPL. Other than just understanding the projects that you're given, you also are thrown headfirst into making sense of the culture, the organization, and all the unspoken and occasionally unseen things that allow you to do your job. There's no clear place to start this, so a good portion of the time you have to keep on top of that while you start thrashing away at your work. The question remains, though, how do you organize this stuff? How do you enable sensemaking in yourself and your peers?

  • Python WorldCat API module now available

    I'd like to humbly announce that I've written a pre-pre-alpha Python module for working with the WorldCat Search API and the xID APIs. The code needs a fair amount of work, namely unit tests and documentation. I've released the code under the GPL. The module, called "worldcat", is available from the Python Package Index. You can also check out a copy of the code from my Subversion repository.
  • Easy Peasy: Using the Flickr API in Python

    Since I'm often required to hit the ground running at $MPOW on projects, I was a little concerned when I roped myself into assisting our photo archives with a Flickr project. The first goal was to get a subset of the photos uploaded, and quickly. Googling and poking around the Cheeseshop led me to Beej's FlickrAPI for Python. Little did I know that it would be dead simple to get this project going. To authenticate:

        import flickrapi

        def create_session(api_key, api_secret):
            """Creates a session using FlickrAPI."""
            session = flickrapi.FlickrAPI(api_key, api_secret)
            # Ask for write permission; a missing token means the user
            # still has to authorize the program in a browser.
            (token, frob) = session.get_token_part_one(perms='write')
            if not token:
                raw_input("Hit return after authorizing this program with Flickr")
            session.get_token_part_two((token, frob))
            return session

    That was less painful than the PPD test for tuberculosis. Oh, and uploading?

        flickr.upload(filename=fn, title=title, description=desc, tags=tags, callback=status)

    Using this little bit of code plus a few other tidbits, I created an uploader that parses CSV files of image metadata exported from an Access database. And when done, the results look a little something like this.
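    The CSV side of such an uploader might look roughly like the following. This is a sketch, not the uploader from the post: the column names (`filename`, `title`, `description`, `tags`) and the semicolon-separated tag column are assumptions about the Access export, and the code is written for Python 3. It only prepares the keyword arguments; the upload call itself is left to the snippet above.

```python
import csv
import io


def upload_arguments(csv_file):
    """Yield one dict of upload keyword arguments per CSV row."""
    for row in csv.DictReader(csv_file):
        # Flickr takes tags as one space-separated string, so each tag is
        # quoted to keep multi-word tags together.
        tags = " ".join('"%s"' % tag.strip()
                        for tag in row["tags"].split(";") if tag.strip())
        yield {
            "filename": row["filename"].strip(),
            "title": row["title"].strip(),
            "description": row["description"].strip(),
            "tags": tags,
        }


sample = io.StringIO(
    "filename,title,description,tags\n"
    "img001.jpg,Main Street,Looking north,street; new york\n"
)
for arguments in upload_arguments(sample):
    print(arguments)
```

    Each yielded dictionary can be passed along as `flickr.upload(callback=status, **arguments)`.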
  • Announcing, or, how I stopped worrying and learned to love Z39.50

    After more than a few late nights and long weekends, I'm proud to announce that I've completed my latest pet programming project. is a lightweight Z39.50-Web gateway, written, naturally, in Python. None of this would be possible without the following Python modules: Aaron Lav's PyZ3950, the beast of burden; Ed Summers' pymarc, the smooth-talking translator; and, quite possibly the best and most straightforward Python web framework available. I initially undertook this project as an excuse to play with PyZ3950 and to teach myself the workings of; I'd played with Django, but it seemed entirely excessive for what I was working on. First, I should mention that isn't designed to be a complete implementation of a Z39.50 gateway. There are many areas in which there is much to be desired, and it's probably not as elegant as some would like. However, that wasn't the point of the project. My ultimate goal was to create a simple client that could be used as a starting point from which to develop a complete web application.
  • No Excuses To The Power of Infinity

    I have no excuses for not updating this blog. I thought about forcing myself to comply with some sort of resolution - you know, given the new year and all - but everyone knows how those turn out. Regardless, I have a whole backlog of things to post about, most notably the countless Python programming projects I've been working on lately. Expect more posts to arise over the next few days as a result of this. Also, I have no excuses for botching up ArchivesBlogs temporarily by mucking about and wiping out some of WordPress's databases that make FeedWordPress, the plugin that grabs content for ArchivesBlogs, do its thing. The recovery was simpler than I thought it would be, but this is probably the largest amount of unplanned downtime we've had. Keep your eyes open, as a replacement for FeedWordPress may itself be coming along sooner or later.
  • When Life Hands You MARC, make pymarc

    It's a bad pun, but what can you expect from someone who neglects his blogs as much as I do? I've been busy, somewhat, and one of my latest forays has been getting a grip on Python, an absolutely wonderful programming language. I actually enjoy writing code again, which is more than a bit scary. I was sick of the mangled scripts and workflows I came up with at MPOW to handle converting MARC data to HTML and other such nonsense. Writing Perl made me feel unclean. After playing around with Ed Summers' pymarc module, I began hacking about and putting my own hooks into the code here and there. I longed for MARC8 to Unicode conversion, which is a necessary evil. Digging around, I came across Aaron Lav's PyZ3950 module, which had its own little MARC code. After bugging Ed via #code4lib, and hassling Aaron in the process, Ed began incorporating the code and I started some testing. Just a short while later, the conversion code worked.
  • An updated version of Nick Gerakines'

    A little over a month ago, Nick Gerakines posted a Perl script to be called from a Procmail configuration file. It seemed to work pretty well, but the anal-retentive cataloger/standards geek in me decided to pass the results through a feed validator. It failed in a few key areas: missing version attribute in the rss tag, improper guid and link tags, and a pubDate with a non-RFC822 date. These all seemed pretty easy to fix, so I went ahead and made some changes. My fixes are a bit inelegant, but they create valid RSS 2.0. It was pretty trivial to add an RSS version number and to fix the guid error; the latter just required adding the isPermaLink="false" attribute to that tag. However, Nick's original code required parsing the pubDate tags to determine when to kill data that was over 6 hours old. I didn't want to be bothered parsing an RFC822 date with this, so I moved that information into a category tag.
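    The three fixes can be shown in a few lines. Nick's script and the changes described here are in Perl; the sketch below is a Python illustration of the same RSS 2.0 details, not a port. The function names are hypothetical, and storing the arrival time as epoch seconds in the category element is an assumption about how that information might be encoded.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import xml.etree.ElementTree as ET


def build_item(title, guid, published):
    """Build one <item>; `published` must be a UTC-aware datetime."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    # A guid that is not a URL has to say so explicitly.
    ET.SubElement(item, "guid", isPermaLink="false").text = guid
    # pubDate in RFC 822 style, e.g. "01 Jan 2009 12:00:00 GMT".
    ET.SubElement(item, "pubDate").text = format_datetime(published, usegmt=True)
    # Machine-friendly timestamp kept separately for expiring old items.
    ET.SubElement(item, "category").text = str(int(published.timestamp()))
    return item


def build_feed(title, link, description, items):
    """Wrap the items in an <rss version="2.0"> document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    channel.extend(items)
    return ET.tostring(rss, encoding="unicode")


when = datetime(2009, 1, 1, 12, 0, tzinfo=timezone.utc)
print(build_feed("Mail", "http://example.org/", "Recent mail",
                 [build_item("Hello", "msg-1", when)]))
```

    Expiring entries older than six hours then only requires comparing the integer in the category element with the current time, with no date parsing involved.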