planning for disaster

how to set up unmaintainable indexing workflows

mark a. matienzo

center for history of physics

american institute of physics

background

disclaimer: opinions expressed are mine alone and don't reflect those of CHP or AIP
CHP has an archives, but we have little space
one of our main responsibilities is coordinating placement of collections
even though we don't keep the collections, we maintain metadata
ICOS: MARC data served through Horizon
PHFAWS: EAD/HTML/PDF finding aids

background: PHFAWS

http://aip.org/history/ead/
goal: create a searchable index of finding aids for collections in the history of physics and related fields
contains finding aids both for CHP-held collections and for collections held by other archives
originated in an early collaborative EAD project (ca. 1999)
we still host a few finding aids for other archives

background: indexing

we were pushed to use Verity (AIP had lock-in already)
over a period of 4 years, it never really worked well
2 or 3 months ago we began to rethink the process
BUT we still had to use Verity, and we couldn't run the indexer ourselves

implementation (1)

we created an XML file containing the URLs of the finding aids we wanted to index: http://www.aip.org/history/ead/ead_urls.xml *
browse page created from this file via XSLT: http://www.aip.org/history/ead/browse.html *
redirects (needed by Verity) created via XSLT
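
The idea on this slide, one XML list of URLs driving a generated browse page, can be sketched roughly as follows. This is only an illustration: it uses Python's standard library rather than XSLT, and the element names (`findingaids`, `findingaid`, `title`, `url`) and the sample data are assumptions, since the actual layout of ead_urls.xml is not shown in these slides.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample data; the real ead_urls.xml structure may differ.
SAMPLE = """<findingaids>
  <findingaid>
    <title>Example Papers</title>
    <url>http://example.org/ead/example.xml</url>
  </findingaid>
  <findingaid>
    <title>Smith &amp; Jones Collection</title>
    <url>http://example.org/ead/smith.xml</url>
  </findingaid>
</findingaids>"""


def browse_html(xml_text):
    """Build an HTML list of links from the (assumed) URL list format."""
    root = ET.fromstring(xml_text)
    items = []
    for entry in root.findall("findingaid"):
        title = entry.findtext("title", default="").strip()
        url = entry.findtext("url", default="").strip()
        if not url:
            continue  # skip entries that have no URL to link to
        # Escape both the attribute value and the link text.
        link = '<li><a href="%s">%s</a></li>' % (
            escape(url, quote=True),
            escape(title or url),
        )
        items.append(link)
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    print(browse_html(SAMPLE))
```

Keeping the URL list as the single source means the browse page never has to be edited by hand, but only if the transform is actually rerun whenever the list changes.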

implementation (2)

we were told that Verity could parse this file
in reality, the file was used to generate (via XSLT) a bash script that calls the indexer
we still couldn't run the indexer ourselves
"solution": someone created a CGI perl script where we could "just click a button" to run the bash script
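
To make the "XML file becomes a bash script" step concrete, here is a minimal sketch in Python rather than XSLT. The command name `run_indexer` is a hypothetical placeholder, not a real Verity command, and the `url` element name and sample data are likewise assumptions; the actual generated script is not shown in these slides.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical placeholder; the real indexer command line is not known.
INDEXER = "run_indexer"

# Hypothetical sample data; the real ead_urls.xml structure may differ.
SAMPLE = """<findingaids>
  <findingaid><url>http://example.org/ead/a.xml</url></findingaid>
  <findingaid><url>http://example.org/ead/b.xml?x=1&amp;y=2</url></findingaid>
</findingaids>"""


def indexer_script(xml_text, indexer=INDEXER):
    """Return the text of a bash script with one indexer call per URL."""
    root = ET.fromstring(xml_text)
    lines = ["#!/bin/bash", "set -e"]
    for node in root.iter("url"):
        url = (node.text or "").strip()
        if url:
            # shlex.quote keeps characters such as & and ? away from the shell.
            lines.append("%s %s" % (indexer, shlex.quote(url)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(indexer_script(SAMPLE), end="")
```

Generating a script this way still leaves a second artifact that has to be regenerated, and then run by someone with access, every time the URL list changes.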

end product?

credit: this old house home inspection nightmares

problems

transforms are done manually
if we don't remember to run the transform, the indexer won't pick up new data
we still can't debug indexer problems ourselves
CGI page is not behind the firewall (potential DoS vulnerability)
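
The first two problems come down to generated files going stale relative to the URL list. As a minimal sketch of how that could be detected, assuming the source and the generated files live on the same filesystem (the file names in the usage comment are illustrative only), a timestamp comparison is enough:

```python
import os


def is_stale(source, target):
    """Return True if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)


# Illustrative usage; these paths are assumptions, not the real layout:
#   if is_stale("ead_urls.xml", "browse.html"):
#       print("browse.html needs to be regenerated")
```

A check like this only flags the problem. It does not remove the manual step, and it does nothing about the indexer itself being out of reach.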