Description Peddlers and Data.gov: Two Peas In a Pod

As you may have heard, the National Archives issued a press release today announcing the release of three data sets on Data.gov:

The first milestone of the Open Government Directive was met on January 22 with the release of new datasets on Data.gov. Each major government agency has uploaded at least three datasets in this initial action. The National Archives released the 2007—2009 Code of Federal Regulations and two datasets from its Archival Research Catalog. This is the first time this material is available as raw data in XML format.

The Archival Research Catalog, or ARC, is NARA's primary access system for archival description, representing 68% of NARA's entire holdings. This breaks down to the following:

2,720,765 cubic feet
520 record groups
2,365 collections
102,598 series
3,265,988 file units
292,887 items

In addition, there are 6,354,765,793 logical data records and 465,050 artifacts described in ARC.

NARA's decision to share this data is a breakthrough for archives and people who love data. The size of the data provided by NARA in ARC is also immense; the combined descriptions plus contextual information on represented organizations totals approximately 21 gigabytes when uncompressed.

Obviously, transferring this much data is difficult, and when I first decided to get my grubby paws on it, I was quite shocked to discover that NARA didn't bother to compress this data in the first place. Not to be outdone, I corresponded over Twitter with a few people who were just as interested in the data, specifically Simon Spero at the UNC School of Information and Library Science and Richard Urban at UIUC's Graduate School of Library and Information Science. The three of us made a concerted effort to grab the data from NARA's web server and make a compressed version available.

After 6 hours or so of transferring the files and compressing them, Simon has posted the compressed dataset on ibiblio.org, as part of his Fred2.0 dataset project. Download the whole thing, decompress it, and start crunching - there's so much you can do with it! Convert the series descriptions to EAD! Convert the organizational descriptions and histories to EAC! Throw Mitchell Whitelaw's series browser on top of it! The future's in your hands, people, and now the data is too.
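If you want a starting point for the crunching, here is a minimal sketch in Python of one way to walk a very large XML file without loading it all into memory. It is an illustration only: the element names in the sample are made up and are not the ARC schema, and the gzip wrapping is an assumption about how you might keep the files on disk, not a statement about how the ibiblio copy is packaged.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample data; these element names are hypothetical, not ARC's.
SAMPLE = (
    b"<records>"
    b"<series><title>A</title></series>"
    b"<series><title>B</title></series>"
    b"<item/>"
    b"</records>"
)


def count_tags(stream):
    """Stream-parse XML from a binary file object and tally element tags."""
    counts = Counter()
    root = None
    depth = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first element started is the document root
            depth += 1
        else:
            counts[elem.tag] += 1
            depth -= 1
            if depth == 1:
                # A direct child of the root just finished; drop it so
                # memory use stays flat however large the file is.
                root.clear()
    return counts


if __name__ == "__main__":
    # Plain XML from memory.
    print(count_tags(io.BytesIO(SAMPLE)))

    # The same data read through gzip, without decompressing to disk first.
    compressed = io.BytesIO(gzip.compress(SAMPLE))
    with gzip.open(compressed, "rb") as fh:
        print(count_tags(fh))
```

A tag tally like this is a cheap first look at what is in a file before you write anything schema-specific. For a file on disk, pass `open(path, "rb")` or `gzip.open(path, "rb")` in place of the in-memory streams.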

We've talked about posting a torrent, but between the compression and the high bandwidth available from ibiblio, it doesn't seem to be quite as pressing a need. However, if you'd like, it could be arranged. More detail on the datasets, including detailed information about the tags and structure of the data within, can be found on Data.gov.

Publish date: January 28, 2010

by M.A. Matienzo

3 Comments

💬 Jill at January 30, 2010, 02:57 UTC:
Hi Mark,
I work at NARA on social media and ARC. We heard about your blog post from Kate Theimer on our NARAtions blog.
We are really excited to hear that you and others are eager to start crunching the data NARA has made available! It wasn't that we didn't bother to compress the data. NARA IT staff kept running into technical issues while working on the compression, so we decided to go ahead and post it uncompressed rather than hold it back. I'm impressed that you, Simon and Richard figured out a solution to compress and share it.
I hope you'll keep us posted on anything cool that comes out of your work with the data. Have fun!
💬 M.A. Matienzo at January 30, 2010, 03:48 UTC:
Hi Jill, thanks for the additional comments! I appreciate everything that you and your colleagues at NARA did to get the data out there.
💬 Richard at January 30, 2010, 05:24 UTC:
Thanks for the mention, but the credit really goes to you and Simon!
I'll be interested to see what comes of this public data!