Current Projects & Initiatives
DRS 2 Enhancements
In the fall of 2008, OIS started work on the first major upgrade of the DRS - "DRS 2". This is a multi-year project to enhance all DRS applications, services and documentation. The enhancements are being conducted in phases. Each phase has a theme and is scoped to cover a fixed set of enhancements and include new or enhanced support for a file format or genre. OIS is currently in the second phase of this project (DRS 2.2). For more information see DRS 2 Enhancements.
DuraCloud Pilot Project
During the summer of 2011, OIS is conducting a pilot project using DuraCloud, the cloud storage solution offered by DuraSpace. DuraCloud is being tested as a potential additional preservation storage location for DRS content. This project runs from May to August 2011.
Election 2012 Web Archive
As a member of the International Internet Preservation Consortium (IIPC), Harvard Library, including staff at the Harvard Kennedy School Library and at the Office for Information Systems, is collaborating with other academic libraries as well as non-profit and government organizations on a web archiving project. Through the Election 2012 Web Archive, we will collect and preserve web sites related to the 2012 election campaign in the United States. Our goal is to capture this important historical record to ensure long-term preservation and accessibility for teaching, research and the general public.
Subject experts at the Harvard Kennedy School and at other academic institutions in areas such as political science and public policy are identifying relevant web sites for long-term preservation.
For more information about the IIPC, see below.
For more information about web archiving at Harvard, see About WAX.
To search or browse Harvard’s web archive, see the WAX home page.
Email Archiving Pilot
Recognizing the value of email in the late 20th and 21st centuries, Harvard University Libraries began planning for an email archiving project in early 2007. A working group composed of University archivists, curators, records managers, librarians and technologists studied the problem and recommended undertaking a pilot email archiving project at the University Library. This two-year pilot, which began in late 2008, will implement a system for ingest, processing, preservation, and eventual end-user delivery of email, in anticipation of it becoming an ongoing central service at the University after the pilot. This pilot is managed within OIS and involves staff from three University repositories - the University Archives, Schlesinger Library, and Countway Library. For more information see: Electronic Archiving System (EAS).
FITS (File Information Tool Set)
FITS identifies, validates, and extracts technical metadata for various file formats. It wraps multiple third-party open source tools (JHOVE, Exiftool, National Library of New Zealand Metadata Extractor, DROID, FFIdent, and the File Utility), normalizes and consolidates their output, and reports any errors. For more information see the FITS website.
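The consolidation step described above can be illustrated with a minimal sketch. This is not FITS code and does not reflect its actual output format or conflict rules; the function name, the tool names, the field names, and the normalization (trimmed, lower-cased strings) are all assumptions made for illustration.

```python
from collections import defaultdict


def consolidate(tool_reports):
    """Merge per-tool metadata into one record and flag disagreements.

    tool_reports maps a (hypothetical) tool name to either a dict of
    field -> value, or to an Exception if that tool failed on the file.
    """
    values = defaultdict(dict)  # field -> {normalized value: [tool names]}
    errors = {}

    for tool, report in tool_reports.items():
        if isinstance(report, Exception):
            # Record the failure instead of aborting the whole run.
            errors[tool] = str(report)
            continue
        for field, value in report.items():
            # Assumed normalization: compare values as trimmed, lower-cased text.
            key = str(value).strip().lower()
            values[field].setdefault(key, []).append(tool)

    consolidated, conflicts = {}, {}
    for field, by_value in values.items():
        if len(by_value) == 1:
            # Every tool that reported this field agrees on its value.
            consolidated[field] = next(iter(by_value))
        else:
            # Tools disagree: keep each candidate value with its sources.
            conflicts[field] = dict(by_value)

    return {"metadata": consolidated, "conflicts": conflicts, "errors": errors}


# Hypothetical reports from four tools for one image file.
reports = {
    "tool_a": {"format": "TIFF", "width": 800},
    "tool_b": {"format": "tiff ", "width": 800, "height": 600},
    "tool_c": {"format": "TIFF", "height": 601},
    "tool_d": RuntimeError("unsupported file"),
}
result = consolidate(reports)
```

In this sketch, fields on which all reporting tools agree end up under "metadata", disagreements are kept under "conflicts" together with the tools that produced each value, and tool failures are listed under "errors" rather than stopping the run.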
JHOVE (JSTOR/Harvard Object Validation Environment)
An extensive technical description of the formal characteristics of a digital resource is a necessary precursor to preservation planning for or intervention on that resource. These characteristics are highly dependent upon the format used to represent the resource's abstract content. With funding from the Andrew W. Mellon Foundation, LDI staff collaborated with the JSTOR Electronic-Archiving Initiative (now known as Portico) to produce an extensible tool, called JHOVE (the JSTOR/Harvard Object Validation Environment, pronounced "jove"), for automating format-specific identification, validation, and characterization of digital resources. Harvard and JSTOR have made this tool available to the wider community under an open source license, and it is widely deployed internationally. JHOVE has facilities to extract important technical characteristics of resources created in many commonly-used formats, such as AIFF and WAVE (audio); GIF, JPEG, JPEG 2000, and TIFF (still image); ASCII and UTF-8 (text); and PDF.
Harvard, in conjunction with Portico and Stanford University, has received a second round of funding from the Library of Congress under its National Digital Information Infrastructure and Preservation Program (NDIIPP) to continue the development of JHOVE with significant extensions in functional capability and supported formats. For more information see the JHOVE website.
HathiTrust
In November 2010 Harvard joined HathiTrust, a shared digital repository for primarily digitized book and journal content. HathiTrust is co-owned and co-managed by a partnership of over 50 academic and research libraries. HathiTrust provides preservation and access services for the digital content contributed by its partnering libraries. For more information see the HathiTrust website.
International Internet Preservation Consortium (IIPC)
The IIPC is an international consortium of libraries, archives, museums and cultural heritage institutions involved in web archiving. Members of the IIPC collaboratively conduct research and projects; develop tools, standards and best practices; collect web content; and facilitate the long-term preservation and use of web-based content. In 2010, after the release of Harvard's Web Archiving Collection Service (WAX), Harvard became a member of the IIPC, and it now participates in the Preservation Working Group (PWG). For more information see the IIPC website and the PWG website.
National Digital Stewardship Alliance (NDSA)
Harvard University is a founding member of the NDSA - a network of organizations committed to preserving our national digital heritage. Harvard is represented within the NDSA by individuals from OIS and IQSS. OIS participates in two NDSA working groups - the Infrastructure and Standards & Best Practices working groups. For more information see the NDSA website.
Unified Digital Format Registry (UDFR)
The UDFR is an initiative begun in April 2009 to build a single shared formats registry. UDFR builds on years of work performed by a number of institutions internationally, including Harvard, whether it was for PRONOM, the Global Digital Formats Registry (GDFR), or other format registry projects. It will:
- be technically based on the existing PRONOM database
- create a community governance model for the registry involving all institutions willing to contribute to its development
- develop a mechanism for the distribution of the registry data in such a way as to support local extensions and additions to the database
- develop both technical and organizational support for distributed input to the registry, including some form of quality vetting of contributed data
Harvard serves on the UDFR interim governance working group and the technical working group. Near the end of 2009 the interim governance working group submitted a proposal to the Library of Congress' National Digital Information Infrastructure and Preservation Program (NDIIPP) to fund the one-year program of technical work needed to establish the first version of the UDFR. Under the proposal the work would be conducted at the University of California Curation Center (UC3) of the California Digital Library (CDL). UC3 will provide project oversight and management, and will hire two new staff for the project - a project architect and developer. The proposal was accepted by the Library of Congress in early 2010 and UC3 has now begun the hiring process for the project. When the UDFR becomes operational (estimated to be in the fall of 2011), it will have:
- A publicly accessible web-based user interface that can be used to search, browse, display and download registry records
- An API for tools and services to query, retrieve and export registry records for use in local repositories or applications
- Ability to export information to DROID, a format identification tool created by the UK National Archives
- Automatic tracking of the history of registry information changes
- Population of the registry with all of the PRONOM content
For more information see the UDFR website.
Zone 1: A Rescue Repository for Digital Content
Zone 1 is a project, begun in June 2011 and funded by the Harvard Library Lab, with three parallel activities:
- Development of a prototype rescue repository at Harvard
- An in-depth study by Harvard and MIT of one category of content expected to be deposited to a rescue or long-term preservation repository: faculty records
- A series of discussions at Harvard and MIT of policies that would need to be addressed if the rescue repository were to become a production system at Harvard, or a similar system were implemented at MIT
The Zone 1 prototype rescue repository will serve as a proof of concept for a future production version of the repository, which would close the current gap in secure storage solutions at Harvard. The repository will have a low deposit barrier to ensure that valuable content at Harvard won't be lost. It will also serve as a conduit for review by the Harvard community for potential re-use or long-term stewardship of the content and will facilitate transfer to other repositories.
This project is being conducted by staff from the Harvard University Library Office for Information Systems (OIS), the University Archives at Harvard, and the Institute Archives and Special Collections at MIT.
For more information see the Zone 1 web page.
Past Projects & Initiatives
E-journal Archiving
With funding from the Andrew W. Mellon Foundation, Harvard conducted an in-depth study of the licensing, economic, organizational, and technical issues involved in building a large-scale archive of electronic journals. Working with Blackwell Publishers, the University of Chicago Press, and John Wiley & Sons, Harvard analyzed e-journal content and technical formats, contractual arrangements under which an archive could be assembled and accessed, and who would benefit from and who should pay for archiving. Internally, a technical team studied what changes to the LDI infrastructure were required in order to archive content of this complexity and scale over extended time frames. The results of the Harvard study were widely reported and discussed at meetings of the Digital Library Federation (DLF), the American Library Association (ALA), the American Association for the Advancement of Science (AAAS), the Society for Scholarly Publishing (SSP), and the Coalition for Networked Information (CNI), among others. The April 2002 report is available from the DLF website:
Report on the Planning Year Grant For the Design of an E-journal Archive [pdf]
As a follow-up to the 2002 Mellon Foundation-funded e-journal archiving project, LDI funded a collaborative project with the National Library of Medicine (NLM) to produce an open-source archival and interchange XML Document Type Definition (DTD). The DTD is designed to increase the ease of interchange between publishers and archives for article-level e-journal content. Without this DTD, the structure of e-journal content can vary widely, requiring costly human intervention and multiple parallel workflows within archival repositories. The DTD was designed after extensive document analysis in many subject domains to ensure that it does not reflect the bias of any particular academic discipline. Based on public standards, the DTD features a modular structure that allows customizing and that should be an easy target of transformation from existing XML- or SGML-encoded content. In addition to being used by NLM for the PubMed Central archive, this DTD is well positioned to become a standard format for the transfer and archival storage of scholarly literature.
NLM Journal Archiving and Interchange Tag Suite
LOCKSS (Lots of Copies Keeps Stuff Safe)
The goal of LOCKSS is to preserve access to web-based content, primarily e-journals, by maintaining multiple copies at physically disparate locations and by conducting periodic comparisons among them to ensure that materials remain consistent, authentic, and accessible. LDI staff participated in the alpha and beta development phases of the LOCKSS system.
LOCKSS (Lots of Copies Keeps Stuff Safe)
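The idea of periodically comparing copies held at different locations can be sketched as follows. This is not the LOCKSS polling protocol; it is a simplified illustration that assumes SHA-256 digests and a simple majority vote, and the function names and site names are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return a SHA-256 fingerprint of one copy's content."""
    return hashlib.sha256(data).hexdigest()


def audit(copies):
    """Compare copies and report which locations disagree with the majority.

    copies maps a location name to the bytes held there. Returns a tuple
    (agreed_digest, damaged_locations). If no digest is held by a strict
    majority, returns (None, all locations) because no copy can be trusted.
    """
    digests = {loc: digest(data) for loc, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(digests):
        return None, sorted(digests)
    damaged = sorted(loc for loc, d in digests.items() if d != winner)
    return winner, damaged


def repair(copies, damaged, winner):
    """Overwrite damaged copies with content matching the agreed digest."""
    good = next(data for data in copies.values() if digest(data) == winner)
    for loc in damaged:
        copies[loc] = good


# Hypothetical example: three sites, one of which holds a corrupted copy.
copies = {"site_a": b"article", "site_b": b"article", "site_c": b"artic1e"}
winner, damaged = audit(copies)
if winner is not None:
    repair(copies, damaged, winner)
```

Running the audit on a schedule, rather than once, is what lets a scheme of this kind detect damage that appears over time; the sketch leaves scheduling and network transfer out.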
Registry of Digital Masters
To avoid unnecessary and expensive duplication of digital reformatting efforts, LDI staff participated with the Digital Library Federation (DLF) in plans for a national digital registry of born-digital materials and digitally reformatted books and journals. By consulting the registry before digitization efforts are undertaken, a content owner can determine if an appropriate digital version already exists and is being preserved in a professional manner that obviates the need for local management.
DLF/OCLC Registry of Digital Masters
Digital Repository Certification
The Research Libraries Group (RLG) and the National Archives and Records Administration (NARA) created a joint task force to develop criteria for identifying digital repositories meeting minimal standards for trustworthiness in the professional management and preservation of digital content. LDI staff participated in the development of these criteria, which include both technical and organizational metrics, as well as in follow-on work by the Center for Research Libraries (CRL) to develop a specific audit methodology. LDI staff use the audit guidelines developed by the RLG/NARA task force for self-assessment of the Digital Repository Service (DRS) and for planning its future functional and operational enhancements.
RLG/NARA Digital Repository Certification
CRL Certification of Digital Archives
Archive Ingest and Handling Test (AIHT)
By both policy and design, the LDI Digital Repository Service (DRS) is intended for highly "curated" digital assets; that is, those that are owned and submitted by known users, created according to well-known workflows and meeting well-known technical specifications, and in a small set of approved formats. The Library of Congress organized its Archive Ingest and Handling Test (AIHT) to investigate issues surrounding the import, export, and manipulation of a sizeable test corpus—approximately 57,000 files (12 GB) in more than 90 formats—of unknown provenance by institutions with very different preservation strategies and technological infrastructures. Harvard participated in this test along with Johns Hopkins University, Old Dominion University, and Stanford University. The results of the test highlighted the need for community agreement on descriptive and packaging standards, file transfer and validation tools, and best practices for repository operation and preservation planning.
AIHT: Conceptual Issues from Practical Tests
Harvard's Perspective on the Archive Ingest and Handling Test
D-Lib Magazine (December 2005)
Global Digital Format Registry (GDFR)
Preservation activities depend upon extensive knowledge of the formats which are used to represent digital content. Since this same information is useful to all institutions interested in preserving their digital assets, great economies of scale can be achieved from a centralized repository for this format information. LDI staff have been instrumental in articulating this concept within the digital library and preservation communities. With funding provided by the Andrew W. Mellon Foundation, Harvard and OCLC are engaged in the development of a Global Digital Format Registry (GDFR), a peer-to-peer network of independent, but cooperating format registries that will use a common protocol to synchronize their holdings of important format documentation and technical information. Beyond the technical work in creating the software underlying the GDFR network, Harvard is also cooperating with the National Archives and Records Administration (NARA) in an investigation of the business and governance issues that need to be addressed in order to ensure that the GDFR will remain viable over time as a core service to the preservation community.
Global Digital Format Registry (GDFR)
PDF/A
Adobe's Portable Document Format (PDF) has rapidly become a de facto standard for the dissemination and presentation of electronic documents on the web. Unfortunately, the feature-rich nature of PDF permits tremendous variability in the internal structure of these documents. Further, it allows documents to be dynamically composed at the time of their display from separate external resources, which leads to significant difficulties in ensuring their long-term viability. In order to address these concerns, the International Organization for Standardization (ISO) convened a Joint Working Group to produce a constrained version of PDF suitable for archival preservation, known as PDF/A. Stephen Abrams, then LDI's digital library program manager, was the project leader and document editor for the initial version of the PDF/A standard. The PDF/A standard defines the features that should be required, recommended, restricted, or prohibited in order to make electronic documents more amenable to long-term preservation.
For more information visit the PDF/A Competence Center.
PREMIS (Preservation Metadata Implementation Strategies)
More info on PREMIS
PREMIS: Preservation Metadata Implementation Strategies
NISO Z39.87, Data Dictionary – Technical Metadata for Digital Still Images
More info on NISO Z39.87
ANSI/NISO Z39.87, Data Dictionary – Technical Metadata for Digital Still Images
MIX: NISO Metadata for Images in XML Schema

