Harvard University Library is pleased to announce the public launch of
Harvard's new Web Archive Collection Service (WAX) at
http://wax.lib.harvard.edu.
WAX began in July 2006 as a pilot project funded by the University's
Library Digital Initiative (LDI) to address how collection managers
manage web sites for long-term archiving. It was the first LDI project
specifically oriented toward preserving "born-digital" material.
The pilot was designed to address the capture, management, storage, and
display of web sites for long-term archiving. It was a collaboration of
the University Library's Office for Information Systems with three
University partners, each fielding a single project: the Harvard
University Archives (Harvard University Library); the Arthur and
Elizabeth Schlesinger Library on the History of Women in America
(Radcliffe Institute for Advanced Study); and the Edwin O. Reischauer
Institute of Japanese Studies (Faculty of Arts and Sciences, with
sponsorship from Harvard College Library).
During the pilot, we explored the legal terrain and implemented several
methods of mitigating risks. We investigated various technologies and
developed workflow efficiencies for the collection managers and the
technologists. We analyzed and implemented the metadata and deposit
requirements for long-term preservation in our repository. We continue to
look at ways to ease the labor-intensive nature of the QA process, to
improve display as the software matures, and to assess additional
requirements for long-term preservation.
To date, we are storing 5,159 ARC files for 1,405 WAX harvests
representing 141 seeds (starting URLs) in our Digital Repository Service
(DRS). These harvests comprise 12,133,528 resources (individual HTML
pages, images, graphics, audio or video clips, style sheets, scripts,
etc.) across 335 MIME types, for a total of 392 gigabytes.
WAX was built using several open-source tools developed by the Internet
Archive and other International Internet Preservation Consortium (IIPC)
members. These IIPC tools include the Heritrix web crawler, the Wayback
index and rendering tool, and the NutchWAX index and search tool. WAX
also uses the Quartz open-source job-scheduling software from OpenSymphony.
In February 2009, the pilot public interface was launched and announced
to the University community. WAX has now transitioned to a production
system supported by the University Library's central
infrastructure.
To view the collections, visit http://wax.lib.harvard.edu. For
more information, visit http://hul.harvard.edu/ois/systems/wax,
consult the May 2009 PowerPoint presentation at
http://hul.harvard.edu/ois/support/docs-wax.html,
or contact Wendy Gogel at wendy_gogel@harvard.edu.
Wendy Marcus Gogel
Digital Projects Program Librarian
HUL - Office for Information Systems
