As you may have heard, the National Archives issued a press release today announcing the release of three data sets on Data.gov:
The first milestone of the Open Government Directive was met on January 22 with the release of new datasets on Data.gov. Each major government agency has uploaded at least three datasets in this initial action. The National Archives released the 2007–2009 Code of Federal Regulations and two datasets from its Archival Research Catalog. This is the first time this material is available as raw data in XML format.
The Archival Research Catalog, or ARC, is NARA's primary access system for archival description, representing 68% of NARA's entire holdings. This breaks down to the following:
2,720,765 cubic feet
520 record groups
2,365 collections
102,598 series
3,265,988 file units
292,887 items

In addition, there are 6,354,765,793 logical data records and 465,050 artifacts described in ARC.
NARA's decision to share this data is a breakthrough for archives and people who love data.
What exactly is archival access, and how does archival description make it possible? I feel that, in some form or another, I've been struggling with this question throughout my career. Recently, this blog post from The Top Shelf, the blog of the University of Texas at San Antonio Archives and Special Collections Department, came across my radar, wherein they write (emphasis in original):
UTSA Archives and Special Collections is among the growing number of archives to create an online presence for every one of its collections. ... We were able to utilize inventories generated by former and current collection assistants to create guides to the collection with folder-level and box-level descriptions. The project resulted in access to more than 130 collections and 2000 linear feet of materials.
What defines that accessibility? I certainly don't intend to be a negative Nancy about this - adding finding aids and other descriptive metadata about collections is obviously useful. But how has it necessarily increased access to the materials themselves?
I've been struggling with the fact that (American) archival practice seems to bind contextual description (i.e., description of records creators) to records description. Many of these thoughts have been stirring in my head as a result of my class at Rare Book School. If we take a relatively hardline approach, e.g. the kind suggested by Chris Hurley ("contextual data should be developed independently of the perceived uses to which it will be put", 1, see also 2), it makes total sense to separate them entirely. In fact, it starts making me mad that the <bioghist> tag exists at all in EAD. Contextual description requires that it be written from a standpoint relative to that of the creator it describes. I guess what I keep getting hung up on is whether there could be a relevant case that really merits this direct intellectual binding. I therefore appeal to you, humble readers, to provide me with your counsel. Do you think there are any such cases, and if so, why?
This week in Charlottesville has been a whirlwind exploration of standards and implementation strategies thus far during my class, Designing Archival Description Systems, at Rare Book School. My classmates and I have been under the esteemed tutelage of Daniel Pitti, who has served as the technical architect for both EAD and EAC. Interestingly, there's been a whole lot of talk about linking data, linked data, and Linked Data, date normalization, and print versus online presentation, among other things. In addition, a few things have floated past on my radar screen this week that have seemed particularly pertinent to the class.
The first of these was a post by Stefano Mazzocchi of Metaweb, "On Data Reconciliation Strategies and Their Impact on the Web of Data". In Stefano's post, he wrote about the problem of a priori data reconciliation vs. a posteriori; in other words, whether you iron out the kinks, apply properties like owl:sameAs, etc., on the way in or on the way out.
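To make the distinction concrete, here is a minimal Python sketch of the two strategies. It is not taken from Stefano's post. The identifiers, the triples, and the decision to treat the first member of each owl:sameAs pair as the canonical one are all assumptions made for illustration (owl:sameAs itself is symmetric and has no preferred side).

```python
# Hypothetical data: two sources describe the same person under different IDs.
TRIPLES = [
    ("a:person/1", "ex:designed", "ex:standard/one"),
    ("b:agent/42", "ex:designed", "ex:standard/two"),
]

# Hypothetical owl:sameAs assertions, written as (canonical, alias).
# Picking a canonical side is a simplification for this sketch.
SAME_AS = [("a:person/1", "b:agent/42")]


def canonical(identifier, same_as):
    """Follow alias -> canonical links until no further link applies."""
    aliases = {alias: canon for canon, alias in same_as}
    seen = set()  # guards against accidental cycles in the assertions
    while identifier in aliases and identifier not in seen:
        seen.add(identifier)
        identifier = aliases[identifier]
    return identifier


def reconcile_a_priori(triples, same_as):
    """Rewrite identifiers on the way in, so only canonical IDs are stored."""
    return [
        (canonical(s, same_as), p, canonical(o, same_as))
        for s, p, o in triples
    ]


def query_a_posteriori(triples, subject, same_as):
    """Store triples untouched and resolve equivalences only at query time."""
    target = canonical(subject, same_as)
    return [t for t in triples if canonical(t[0], same_as) == target]


stored = reconcile_a_priori(TRIPLES, SAME_AS)
found = query_a_posteriori(TRIPLES, "b:agent/42", SAME_AS)
```

In the a priori case the alias disappears from the stored data. In the a posteriori case both original identifiers survive, and the equivalence is applied only when someone asks a question.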
Crossposted to NYPL Labs.
I'm staying with colleagues and good friends during my week-long stint in Charlottesville, Virginia for Rare Book School. If you're here - particularly if you're in my class (Daniel Pitti's Designing Archival Description Systems) - let me know. I'm looking forward to a heady week dealing with descriptive standards, knowledge representation, and as always, doing my best to sell the archives world on Linked Data. Notes and thoughts will follow, as always, on here.
This last Tuesday, I spoke at the Annual Meeting of the Archivists' Roundtable of Metropolitan New York, where I gave a talk on archives and the Semantic Web. The presentation went over very well, and colleagues from both the archives field and the semantic technology field were in attendance. I did my best to keep the presentation not overly technical and cover just enough to get archivists to think about how things could be in the future. I also have to give a big hat tip to Dan Chudnov, whose recent keynote at the Texas Conference on Digital Libraries helped me organize my thoughts. Enjoy the slides, and as always, I relish any feedback from the rest of you.
So, it's time for another rant about my issues with EAD. This one is pretty short and straightforward, and comes down to the idea that I should essentially be able to mix and match metadata schemas. This is not a new idea, and I'm tired of the archives community treating it like it is one. Application profiles, as they are called, allow us to define a structured way to combine elements from different schemas, prevent addition of new and arbitrary elements, and tighten existing standards for particular use cases.
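As a rough sketch of what an application profile does, the Python below treats a profile as a whitelist of elements per namespace and flags anything outside it. The profile contents, the sample record, and the function name are hypothetical; real application profiles are usually expressed in a schema language and also cover cardinality and value constraints, which this sketch ignores.

```python
# Namespace URIs for Dublin Core elements and MODS version 3.
DC = "http://purl.org/dc/elements/1.1/"
MODS = "http://www.loc.gov/mods/v3"

# Hypothetical profile: for each namespace, the element names we allow.
PROFILE = {
    DC: {"title", "creator", "date"},
    MODS: {"physicalDescription"},
}


def violations(record, profile):
    """Return the (namespace, name) pairs in record that the profile does not permit."""
    return [
        (ns, name)
        for ns, name, _value in record
        if name not in profile.get(ns, set())
    ]


# Hypothetical record mixing elements from several schemas.
record = [
    (DC, "title", "Jane Doe papers"),
    (MODS, "physicalDescription", "12 linear feet"),
    (MODS, "genre", "correspondence"),        # known schema, element not in profile
    ("urn:example:local", "mood", "grumpy"),  # arbitrary element from an unknown schema
]

problems = violations(record, PROFILE)
```

Here the two permitted elements pass, while the MODS element left out of the profile and the made-up local element are both reported.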
That said, the EAD community has, to a certain extent, accepted the concept of combining XML namespaces, but only on a very limited level. The creation of the EAD 2002 Schema allows EAD data to be embedded into other XML documents, such as METS. However, I can't do it the other way around; for example, I can't work a MODS or MARCXML record into a finding aid. Why not? As I said in my last dEAD Reckoning rant as well as during my talk at EAD@10, the use of encoding analog attributes is misguided, confusing, and just plain annoying.
A while back, I wrote a Bad MARC Rant, and I considered titling this a Bad Metadata Rant. However, as the kids say, I got mad beef with a little metadata standard called Encoded Archival Description. Accordingly, I figured I should begin a new series of posts discussing some of these issues that I have with something that is, for better or for worse, a technological fixture of our profession. This is in part prompted by thoughts that I've had as a result of participating in EAD@10 and attending the Something New for Something Old conference sponsored by the PACSCL Consortial Survey Initiative.
Anyhow, on to my first bone to pick with EAD. I'm incredibly unsatisfied with the controlled access heading tag <controlaccess/>. First of all, it can occur within itself, and because of this, I fear that there will be some sort of weird instance where I have to end up parsing a series of these tags three levels deep. Also, it can contain a <chronlist/>, which also seems pretty strange given that I've never seen any example of events being used as controlled access terms in this way.
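To show what that parsing burden looks like, here is a minimal Python sketch that walks nested <controlaccess/> elements recursively using the standard library's xml.etree.ElementTree. The sample fragment is invented, and it assumes un-namespaced EAD; a schema-based instance would need namespace-qualified tag names. Only child elements with direct text are treated as headings, so something like a <chronlist/> is skipped.

```python
import xml.etree.ElementTree as ET

# Invented fragment: <controlaccess> nested three levels deep.
SAMPLE = """
<controlaccess>
  <subject>Archives</subject>
  <controlaccess>
    <persname>Doe, Jane</persname>
    <controlaccess>
      <geogname>Charlottesville (Va.)</geogname>
    </controlaccess>
  </controlaccess>
</controlaccess>
"""


def collect_headings(element, depth=1):
    """Yield (depth, tag, text) for each heading, descending into nested <controlaccess>."""
    for child in element:
        if child.tag == "controlaccess":
            # The element can contain itself, so recurse rather than assume one level.
            yield from collect_headings(child, depth + 1)
        elif child.text and child.text.strip():
            yield (depth, child.tag, child.text.strip())


root = ET.fromstring(SAMPLE)
headings = list(collect_headings(root))
```

The recursion copes with any depth, but every consumer of the data has to know to do this, which is exactly the annoyance.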
The ICA's Committee on Best Practices and Standards released the first edition of the International Standard for Describing Functions (ISDF). Like much of ICA's other work in descriptive standards for archives, ISDF is designed to be used in conjunction with established standards such as ISAD(G) and ISAAR(CPF), as well as standards in preparation such as ISIAH. ISDF will help both archivists and users understand the contextual aspects of the creation of records of corporate bodies. Through ISDF and related standards, archivists will be able to develop improved descriptive systems that can potentially be implemented using a Linked Data model.