The first milestone of the Open Government Directive was met on January 22 with the release of new datasets on Data.gov. Each major government agency has uploaded at least three datasets in this initial action. The National Archives released the 2007-2009 Code of Federal Regulations and two datasets from its Archival Research Catalog. This is the first time this material has been available as raw data in XML format.
The Archival Research Catalog, or ARC, is NARA's primary access system for archival description, representing 68% of NARA's entire holdings. This breaks down to the following:
- 2,720,765 cubic feet
- 520 record groups
- 2,365 collections
- 102,598 series
- 3,265,988 file units
- 292,887 items
In addition, there are 6,354,765,793 logical data records and 465,050 artifacts described in ARC.
NARA's decision to share this data is a breakthrough for archives and people who love data. The size of the data provided by NARA in ARC is also immense; the combined descriptions plus contextual information on represented organizations total approximately 21 gigabytes when uncompressed.
Obviously, transferring this much data is difficult, and when I first decided to get my grubby paws on it, I was quite shocked to discover that NARA hadn't bothered to compress it. Not to be deterred, I corresponded over Twitter with a few people who were just as interested in the data, specifically Simon Spero at the UNC School of Information and Library Science and Richard Urban at UIUC's Graduate School of Library and Information Science. The three of us made a concerted effort to grab the data from NARA's web server and make a compressed version available.
After 6 hours or so of transferring and compressing the files, Simon posted the compressed dataset on ibiblio.org as part of his Fred2.0 dataset project. Download the whole thing, decompress it, and start crunching - there's so much you can do with it! Convert the series descriptions to EAD! Convert the organizational descriptions and histories to EAC! Throw Mitchell Whitelaw's series browser on top of it! The future's in your hands, people, and now the data is too.
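If you're wondering where to start crunching, here is a minimal sketch in Python of one way to take a first look at a file this large: stream through it and count how often each XML tag appears, rather than loading the whole tree at once. Everything specific here is my own assumption and not something from NARA or Data.gov: the function names are made up for illustration, the sample document is invented, and I'm guessing that you might have a gzip-compressed file on hand (the real download may be packaged differently). Check the tag documentation on Data.gov for the actual element names.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(stream):
    """Stream through an XML document and count how often each tag occurs."""
    counts = Counter()
    # iterparse hands back each element as its closing tag is read, so we
    # can inspect it and discard its contents instead of keeping everything.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        counts[elem.tag] += 1
        # Drop the element's children, text, and attributes once counted.
        # The emptied element stays attached to its parent, so memory still
        # grows slowly on very large files, just far less than a full parse.
        elem.clear()
    return counts


def open_maybe_gzipped(path):
    """Open a file for binary reading; assumes a .gz suffix means gzip."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


if __name__ == "__main__":
    # Invented sample standing in for a real ARC file; the tag names are
    # hypothetical and not taken from the actual dataset.
    sample = (
        b"<records>"
        b"<series><title>A</title></series>"
        b"<series><title>B</title></series>"
        b"</records>"
    )
    for tag, n in count_tags(io.BytesIO(sample)).most_common():
        print(tag, n)
```

To try it on a real file, pass the result of `open_maybe_gzipped("your-file.xml")` to `count_tags`, substituting whatever path you actually unpacked. A tag census like this is only a starting point, but it gives you a quick map of the structure before you attempt something like an EAD or EAC conversion.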
We've talked about posting a torrent, but between the compression and the high bandwidth available from ibiblio, it doesn't seem to be quite as pressing a need. However, if you'd like, it could be arranged. More detail on the datasets, including detailed information about the tags and structure of the data within, can be found on Data.gov.