OASIS: History and Project Report

Background
Electronic access to information about manuscript and archival collections at Harvard
Genesis of the Digital Finding Aids Project
The Digital Finding Aids Project (DFAP)
Future directions


Background

Librarians have made major investments of time and money to automate their catalogs, because they saw the benefits to themselves and to their patrons of making bibliographic records more easily available. No longer is access to bibliographic information bound to a physical card catalog; both librarians and researchers can, increasingly, obtain this information anywhere, whether within the library building itself or from a computer thousands of miles away. Harvard, with the largest university library in the world, will maximize these benefits when in 1996 it completes its retrospective conversion project for roman alphabet monograph and serial cataloging records, creating an electronic database of some 8 million records in its on-line catalog, HOLLIS.

Eight million records is an impressive figure. However, it is misleading to think that this immense database represents all, or even most, of the research materials available within the Harvard system. The so-called "non-book" collections at Harvard--archives, manuscripts, photographs, drawings, and more--number in the hundreds of millions of items. The question before Harvard is: How can we make the descriptive information for these important and unique research collections more widely available as well?

Electronic access to information about manuscript and archival collections at Harvard

Since the Harvard/Radcliffe Manuscript Survey and Guide Project (1984-1986), many archival and manuscript collections at Harvard have been accessible via records in HOLLIS. These records, called "collection-level" records, are derived from, and only briefly summarize, a detailed listing that more comprehensively describes, controls, and provides access to the collection. This listing is often referred to as a "finding aid"; it can also be called a container listing or inventory.

For archives, manuscripts, and special collections, the finding aid is equal in importance to the collection-level bibliographic record--some would even argue that it is more important. Archivists and curators have for many years been searching for an appropriate computer-based environment where these detailed descriptions could reside. Repositories have been creating finding aids in electronic form over the last two decades, but electronic access to this information has been rudimentary. A survey of computer usage in Harvard's archive and manuscript repositories, conducted by the Harvard/Radcliffe Archivists Group in 1993, showed that more than half used a local word-processing or database system to create finding aids. But most repositories used these systems only for technical processing; there was little public access to them.

The highly decentralized nature of the Harvard library system--there are approximately 110 separate libraries, and some 49 archival repositories, including libraries, museums, hospitals, and a forest--has often made research at Harvard confusing to scholars. Tracking down a letter that a footnote describes only as being "at Harvard" can be a frustrating experience. While HOLLIS has certainly made this situation better, it is not a fully satisfactory solution for collections represented only by very brief summary records in HOLLIS. Harvard archivists and curators hoped that the electronic age would provide a solution.

Genesis of the Digital Finding Aids Project

In 1994 Harvard, like many other institutions, established a gopher site to make electronic finding aids available to scholars over the Internet. The "Archives and Manuscripts at Harvard/Radcliffe" site, part of the HOLLIS portal (formerly known as HOLLIS Plus), includes descriptions of the 49 archival repositories, and finding aids from the Botany Libraries, Houghton, Law, and Schlesinger. While everyone felt that this was a good thing, it soon became obvious that such unstructured text offered only clumsy and frustrating searching, particularly with large files.

While Harvard archivists and curators were working on the gopher site, the university announced that work would soon begin on the design of the next generation of the on-line catalog, dubbed HOLLIS II. In response to intense interest from the archives and manuscripts community, the Harvard University Library Automation Planning Committee appointed a Special Collections Task Force in the spring of 1994, charged to examine the automation needs of repositories that acquire, preserve, and make accessible such collections of material. The report of that Task Force, issued in November 1994, focused on significant differences between electronic bibliographic records describing single items such as monographs, and those that describe collections. The Task Force felt strongly that the information in finding aids must be as easily available to scholars in electronic form as is the collection-level record, and recommended that options for doing this be pursued.

During the preparation of its report, the Task Force investigated work being done at the University of California, Berkeley, using SGML (Standard Generalized Markup Language) to encode electronic finding aids. It came to realize that in order to make all finding aids at Harvard accessible as if in a single database, a common structure was necessary, and it considered SGML the most promising structure available. It seemed to be a powerful yet flexible tool that would address the varying needs of the large Harvard community of archivists for producing finding aids, making these electronic files available remotely, and meeting the growing expectations of our research clientele. The Automation Planning Committee agreed, and established the Digital Finding Aids Project in February 1995.

The Digital Finding Aids Project (DFAP)

The project group (whose members are listed elsewhere on this site) is broadly representative of the Harvard library, archival, and electronic systems community. DFAP's first task was to review available options for the production of SGML-encoded texts, and to make recommendations. The choices ranged from free, simple SGML tools that work with existing document editors to expensive SGML-based publishing systems. Costs ranged from zero to thousands of dollars. The project wanted to recommend software that could be used by all repositories, regardless of hardware platform, so that Harvard archivists could develop and share a common experience and knowledge. The software also had to be affordable, since many repositories are single-person operations with minimal financial support. The ability to import existing ASCII and word-processed files was a further factor, since converting existing finding aids was part of the project's considerations. DFAP concluded that in most cases an off-the-shelf "SGML-aware" word processor would be both economical and effective. Project members purchased WordPerfect 6.1, SGML Edition, then in beta version.

The second task was to decide on a way to "publish" the SGML-encoded finding aids, that is, to make them available to everyone on-line. Concurrently, the Search Engine Evaluation Committee, also appointed by the Automation Planning Committee, investigated available SGML-aware publishing/searching software, and recommended that the University Library's Office for Information Systems acquire OCLC's SiteSearch. SiteSearch supports the inclusion of SGML documents in a database, allows for sophisticated Z39.50-based searching of them, and renders the search results as World Wide Web documents that can be viewed by any Web browser (such as Mosaic or Netscape). The finding aids were to be the "test" database for SiteSearch. Recently, however, after additional experience with SiteSearch, it was decided that it did not handle long text documents such as finding aids well, and that a new search engine was needed. Accordingly, the project now plans to use OpenText (used for large text databases such as the Oxford English Dictionary), and is waiting for version 6 to be released.

The third task, undertaken concurrently, was to mark up some Harvard finding aids using SGML. It quickly became clear that, given the divergence of traditional practice at Harvard repositories, a Harvard/Radcliffe manual of practice (using the Berkeley-developed standard, now called Encoded Archival Description (EAD)) was needed. A draft standard is now completed, although constantly being refined, and individual repositories are working on their own implementation standards. A style sheet, to display all Harvard finding aids, is also being constantly revised. The finding aids that you find at this site are also frequently revised, as the project gains more experience with the challenges of both SGML and of finding a common standard that will suit the varied needs and traditions of Harvard/Radcliffe repositories.

The project is also deeply concerned about training. As we have all struggled to understand the intricacies of the EAD, we have benefitted from the collaborative and supportive environment provided by the project's twice-monthly meetings with their lively discussions. The project hopes to begin designing a Harvard training program in the near future, so that those "lone archivists" among us will have the benefits of the project's early experience, and have a resource for help.

The Digital Finding Aids Project is the first step towards establishing a strong administrative structure for the creation and technical support of SGML-encoded finding aids at Harvard. The project is not the product of a single, technically inclined person working in an isolated unit; it is a collaborative effort involving curators, archivists, catalogers, electronic text specialists, and library systems people, engaging participants from across Harvard. If one person leaves, the project will not grind to a halt. It is designed for the long term, integrating the efforts of archivists with those of book and serial catalogers, reference, and collection development staff at all levels, whose common aim is to make the materials we are all charged to preserve available to all.

Future directions

In this context, it is important to emphasize that all of those on the project came to realize quite quickly that we will get out of the system what we put in. For highly structured documents such as finding aids, whose pages were typed on a variety of typewriters and carry handwritten annotations, scanning to get the information into electronic form is not a workable option--too much time-consuming cleanup is needed. Nor can SGML markup be done by untrained student staff. The markup is highly sophisticated, and requires supervision and training if the marked-up finding aids are to produce more useful search results than the Gopher-based systems we now find so frustrating to use. The project will not end with the mounting of a small database of SGML-encoded finding aids; its work will continue as we establish a Harvard network of SGML-trained archivists, ready to help those at Harvard repositories without technical support staff.

It is not unreasonable to ask at this point--why are we going to all this trouble? If you look at one of the SGML finding aids on this site with the tagging displayed, the tagging does seem rather like learning MARC all over again, only worse, because there are even more tags! The answer is that for the first time, we seem to have an electronic tool that will enable us to create a better finding aid than the ones we create now, with more options for combining information in new ways. We find this exciting not only for what it will mean for our own work as archivists, but also for what it can do to foster more creative and innovative research among our scholarly user community.

We are not alone in our belief that SGML finding aids will be a major force in creating the library of the future. The EAD is rapidly becoming a national standard. The Network Development and MARC Standards Office of the Library of Congress, working with the Society of American Archivists (SAA), will maintain the EAD as it does the MARC standard. The Council on Library Resources, with SAA, is committed to the development and publication of application guidelines for the EAD. And the National Digital Library Federation sees in the EAD a common standard enabling searches for digitized archival materials across collections and institutions. Standards such as HyTime, a hypermedia extension of SGML, can be used to link images and sound to the finding aid. Overall, SGML and the EAD promise future flexibility as such enhancements are demanded by the scholarly community.

It has taken many, many years, but in SGML the archive and library community at Harvard has at last found a tool that will--to use a slightly overworked phrase--help us all to "find common ground" as we work to create the library of the future.