ohilist.py
Processing MARC into HTML
Mark A. Matienzo
What is ohilist.py?
- Python script that creates static HTML list of NBL-held oral history interviews from MARC data
- Part of a group of three Python scripts used to convert MARC data into HTML for different purposes
- Used by archives professional staff every few months to generate new list
Why Python?
- Straightforward syntax, even for nonprogrammers
- Old scripts used a mix of languages (Perl; XSLT and Java for transforms; Unix shell; Windows batch)
- pymarc
  - http://pypi.python.org/pypi/pymarc/
  - Does heavy lifting for all three scripts
  - Often faster than Perl's MARC modules
  - Active (but small) development community
  - I contributed code to its development
Using ohilist.py
- Create full dump of MARC data from Horizon, using specific export target
- From command line:
python ohilist.py [marcfile]
- Upload HTML file to AIP webserver
Script architecture
- Consists of three files:
  - ohilist.py: script itself
  - ohitemplate.py: template for HTML
  - aipmarc.py: AIP extensions for pymarc
- Template used to separate the layout from the rest of the code
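One way to picture the split between ohilist.py and ohitemplate.py: the template module holds only markup with placeholders, and the script fills them in. The sketch below uses `string.Template` from the standard library. The names `PAGE_TEMPLATE` and `render_page` and the placeholder names are hypothetical; the real ohitemplate.py may be organized differently.

```python
from string import Template

# Hypothetical stand-in for ohitemplate.py: markup only, no logic.
PAGE_TEMPLATE = Template("""<html>
<head><title>$title</title></head>
<body>
<a name="top"></a>
<h1>$title</h1>
$body
<p>$count interviews listed.</p>
</body>
</html>
""")


def render_page(title, body_parts, count):
    """Fill the template; body_parts is a list of HTML fragments."""
    return PAGE_TEMPLATE.substitute(
        title=title,
        body="".join(body_parts),
        count=count,
    )


if __name__ == "__main__":
    # Made-up fragments, only to show the call.
    print(render_page(
        "Oral history interviews",
        ['<h2><a name="A">A</a></h2>\n'],
        1,
    ))
```

With this arrangement a layout change touches only the template, and the MARC-handling code stays as it is.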
How it works (1) - the Main Loop
for record in reader:
    if record['998'] is not None:
        if record['998']['c'] is not None:
            collection = record['998']['c']
            if collection == 'oh':
                catdb = getCatdb(record)
                bibno = getBibno(record)
                url = 'http://www.aip.org/history/catalog/%s/%s.html' % (catdb, bibno)
                interviewee = marc8_to_unicode(record.author())
                interviewdate = '(Interview date: %s)' % getDate(record)
                interview = [interviewee, interviewdate]
                label = " ".join(interview)
                interviews.append((url, label))
                recordcounter += 1
            else: pass
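The nested tests on field 998 can also be read as a single filter followed by the work of building a (url, label) pair. The sketch below shows that shape using plain dictionaries as stand-ins for pymarc records, so that it runs without a MARC file. `is_oral_history`, `build_entry`, the dictionary layout, and every sample value (including the `catdb` placeholder) are hypothetical and are not part of ohilist.py.

```python
def is_oral_history(record):
    """True when field 998, subfield c, holds the collection code 'oh'."""
    field = record.get('998')
    if field is None:
        return False
    return field.get('c') == 'oh'


def build_entry(catdb, bibno, interviewee, date):
    """Return the (url, label) pair that the main loop appends."""
    url = 'http://www.aip.org/history/catalog/%s/%s.html' % (catdb, bibno)
    label = '%s (Interview date: %s)' % (interviewee, date)
    return (url, label)


# Made-up records: only the first one is an oral history interview.
records = [
    {'998': {'c': 'oh'}, 'name': 'Doe, Jane', 'date': '1975 May 2'},
    {'998': {'c': 'ms'}, 'name': 'Smith, John', 'date': '1980'},
    {'name': 'No 998 field', 'date': ''},
]

interviews = []
for n, record in enumerate(records):
    if not is_oral_history(record):
        continue  # skip everything outside the 'oh' collection
    interviews.append(
        build_entry('catdb', str(1000 + n), record['name'], record['date'])
    )
```

Only the first sample record passes the filter, so `interviews` ends up with one entry.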
How it works (2) - Getting the Date
import sys

def getDate(record):
    datelist = []
    # First choice: dates recorded in the title statement (245 $f and $g)
    if record['245']['f'] or record['245']['g']:
        if record['245']['f']: datelist.append(record['245']['f'])
        if record['245']['g']: datelist.append(record['245']['g'])
        return ' '.join(datelist)
    # Second choice: date in the imprint field (260 $c)
    if record['260']:
        if record['260']['c']: return record['260']['c']
    # Last resort: Date 1 and Date 2 from the 008 fixed field
    if record['008'].value()[7:11].isdigit():
        datelist.append(record['008'].value()[7:11])
        if record['008'].value()[11:15].isdigit():
            datelist.append(record['008'].value()[11:15])
        if len(datelist) > 1: return '-'.join(datelist)
        else: return ''.join(datelist)
    # No usable date: warn on stderr, one message per line
    if getBibno(record) is not None:
        sys.stderr.write('Could not derive date from bib number %s\n' % getBibno(record))
    else:
        sys.stderr.write('No date or bib number in: %s\n' % record['245'].formatField())
    return None
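The last fallback in getDate reads the 008 fixed field, where character positions 07-10 hold Date 1 and positions 11-14 hold Date 2. The sketch below isolates that slicing on a plain string so it can be tried without pymarc. `dates_from_008` is a hypothetical helper, and the sample 008 values are made up for illustration.

```python
def dates_from_008(value):
    """Return 'YYYY', 'YYYY-YYYY', or None from a 40-character 008 string."""
    date1 = value[7:11]    # positions 07-10: Date 1
    date2 = value[11:15]   # positions 11-14: Date 2
    if not date1.isdigit():
        return None
    if date2.isdigit():
        return '%s-%s' % (date1, date2)
    return date1


# Made-up 008 values, padded with blanks to 40 characters.
range_008 = '080101i19751976' + ' ' * 25
single_008 = '080101s1975    ' + ' ' * 25
unknown_008 = '080101nuuuuuuuu' + ' ' * 25

print(dates_from_008(range_008))    # 1975-1976
print(dates_from_008(single_008))   # 1975
print(dates_from_008(unknown_008))  # None
```

As in getDate, Date 2 is only consulted when Date 1 is all digits.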
How it works (3) - Sorting/Index
interviews.sort(key = lambda interviewkey: interviewkey[1].upper())
for interview in interviews:
    for letter in letters:
        initial = interview[1].upper()[0]
        if initial == letter:
            linkdata = '%s<br/>\n' % makeLink(interview[0], interview[1])
            addToIndex(ohiindex, letter, linkdata)
# sorted() works whether keys() returns a list or a view
ohikeys = sorted(ohiindex.keys())
shortcutlist = [makeLink('#' + key, key) for key in ohikeys]
shortcutlinks = " ".join(shortcutlist)
listbody.append(shortcutlinks)
for key in ohikeys:
    listbody.append('<h2><a name="%s">%s</a></h2>\n' % (key, key))
    linklist = [ohilink for ohilink in ohiindex[key]]
    listbody.extend(linklist)
    listbody.append('<br/><a href="#top">Back to Top</a>\n')
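The nested loop over `letters` compares each label's initial against every letter in turn. The same grouping can be done in one pass with a dictionary keyed by initial, sketched below in Python 3. `make_link` and `build_index` are hypothetical stand-ins for makeLink and addToIndex, whose definitions are not shown on these slides, and the sample entries are made up. Unlike the original, this version does not filter against `letters`, so a label starting with a digit or punctuation mark would get its own group.

```python
from collections import defaultdict


def make_link(href, text):
    """Hypothetical stand-in for makeLink: build an HTML anchor."""
    return '<a href="%s">%s</a>' % (href, text)


def build_index(interviews):
    """Group (url, label) pairs by the upper-cased first character of the label."""
    index = defaultdict(list)
    # Sort case-insensitively by label, as the original sort does.
    for url, label in sorted(interviews, key=lambda pair: pair[1].upper()):
        index[label[0].upper()].append('%s<br/>\n' % make_link(url, label))
    return index


# Made-up entries, deliberately out of order and in mixed case.
interviews = [
    ('c.html', 'Brown, C.'),
    ('a.html', 'Adams, A.'),
    ('b.html', 'baker, B.'),
]
index = build_index(interviews)
keys = sorted(index)
shortcuts = ' '.join(make_link('#' + key, key) for key in keys)
```

Here `keys` holds the section letters in order and `shortcuts` is the row of jump links placed at the top of the list.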
Evaluation/Questions
- Good, but not perfect
- Would be better if it didn't need to be run manually
- Nonetheless, best that we can do with what we have
- E-mail: mark @ matienzo.org