Harvard Digital Finding Aids Project History (March 1999)
Note: This is an edited and updated version of Leslie A. Morris (Houghton Library), "Developing a cooperative intra-institutional approach to EAD implementation" American Archivist 60:4 (Fall 1997 [published August 1998]):388-407. ©The Society of American Archivists. Used with permission.
Abstract
Archives at Harvard University
Selection of SGML authoring software
Finding aids on the World Wide Web
Doing SGML markup with EAD
One repository's experience using EAD
The end of Phase I: OASIS
The larger context
Abstract
The Harvard/Radcliffe Digital Finding Aids Project (DFAP) was established in 1995 to explore options for making finding aids available remotely over the Internet. This paper reports on why SGML/EAD became the delivery medium of choice; the process of selection of SGML authoring software; the ongoing development of cross-Harvard standards and guidelines for application of EAD; cross-repository indexing and development of a search engine; and implementation challenges faced by the Harvard/Radcliffe repositories participating in the project. Also discussed are the development of an EAD template for future Houghton Library finding aids, and problems with retrospective conversion of existing finding aids.
Archives at Harvard University
To the world outside, the words "Harvard University" summon an image of a monolithic institution, one of the country's oldest and wealthiest universities, an organization that speaks with one voice. From the inside, the picture is somewhat different. One of the first pieces of University lore a new employee learns is the rather unlovely saying "Every tub on its own bottom." The "tubs" are the individual faculties, and they have virtually complete autonomy. The Librarians for each faculty report to their respective Deans, not to the University Librarian. Their budgets come from the faculties as well, which means that the Business and Law libraries have access to more funding than the Education and Divinity libraries. There is, however, a "Harvard University Library." It is a department of the Central Administration, and it has a purely coordinating role. But, if one of the faculty libraries decides it does not want to be coordinated--as when the Business School decided not to participate in HOLLIS, the Harvard online catalog--the University Library has no coercive powers.
When one refers to Harvard libraries and archives, therefore, it is important to remember this highly decentralized nature. There are not only 98 separate libraries in the Harvard system, there are 49 separate archival repositories, including archives that are part of libraries, museums, hospitals, and a forest. Each has its own traditions, and each has its own budget. Users of Harvard collections, however, are almost completely unaware of this situation and how it affects their research. Since the mid-1980s, the manuscript and archive community at Harvard has worked very hard to ameliorate the effect of such decentralization of collections on users. The first step was the Harvard/Radcliffe Manuscript Survey and Guide Project (1984-1986), a project to make many archival and manuscript collections accessible via collection-level MARC records in HOLLIS, the then-new Harvard online catalog.
This was a major step forward in providing access to information about Harvard collections, but Harvard archivists still hoped for more. Easy access to the more detailed collection information available in finding aids is as important to us as access to summary descriptions. In 1994, when gopher sites for finding aids were all the rage, Harvard archivists received permission to set up an archives site on the central library server and received a few hours of training from the Office for Information Systems (the Harvard University Library systems office) in how to maintain it. While everyone felt the gopher site was useful for delivering copies of individual finding aids, it was soon apparent that such unstructured text offered only clumsy and frustrating searching, particularly with large files.
While the gopher site was under development, the University Library announced that work would soon begin on the design of the next generation of the online catalog, known as HOLLIS II. In response to intense interest from the archives and manuscripts community (which felt that its needs largely had been ignored in HOLLIS I), the Harvard University Library Automation Planning Committee appointed a Special Collections Task Force. This group was charged to examine the automation needs of Harvard repositories that acquire and make accessible collections of material (as opposed to monographs). The November 1994 report of that task force contained many recommendations, but the one that is key to the Harvard EAD story was the statement that the information in finding aids must be as easily available to scholars in electronic form as is the collection-level MARC record, followed by the recommendation that options for doing this should be pursued.
During the preparation of its report, the Task Force investigated work being done at the University of California, Berkeley, using Standard Generalized Markup Language (SGML) to encode electronic finding aids. Berkeley's work demonstrated that SGML was a powerful yet flexible tool that could address the need of the large Harvard community of archivists to produce electronic finding aids and make them available remotely in order to meet the growing expectations of our research clientele. The Automation Planning Committee agreed with the Task Force's recommendation and established the Digital Finding Aids Project in February 1995.
The Harvard/Radcliffe Digital Finding Aids Project (known as "DFAP") began as a group with members from seven Harvard repositories: Historical Collections, Baker Library, Business School; Manuscripts and Archives, Andover-Harvard Theological Library, Divinity School; Library of the Gray Herbarium, a special library of the Faculty of Arts and Sciences; Manuscript Department, Houghton Library, Harvard College Library; Manuscript Division, Law School Library; Schlesinger Library on the History of Women in America, Radcliffe College; and the Harvard University Archives. In addition to these collection representatives, the group included a member from the Office for Information Systems (OIS), and the Electronic Texts Librarian for Harvard College, part of Research and Bibliographic Services.
The group's original charge was to "plan and oversee the design and deployment of a new computer application system to store, search, and retrieve digital finding aids in SGML format at Harvard/Radcliffe." Membership during this initial project phase was designed to provide broad representation of the Harvard archival and electronic systems community. Once DFAP established standards and procedures, contributions to the finding aids database were sought from all corners of the Harvard archival and library community.
Selection of SGML authoring software
DFAP's first task was to review and recommend SGML authoring software. The group wanted software that could be used by all repositories, regardless of hardware platform, thus enabling Harvard archivists to develop and share a common experience and knowledge. As "minority" members of the library community at Harvard, archivists are accustomed to depending largely on themselves and each other, rather than on understaffed computer support offices (for those who have such support available) for our specialized software needs. The authoring software had to be inexpensive, since several members of the group were single-person operations with minimal financial support. The ability of the software to import existing ASCII and word-processed files was also a factor, since the need for conversion of existing finding aids was also part of the Project's charge.
After investigating many SGML authoring packages, Project members selected WordPerfect 6.1 SGML Edition (working under Windows 3.x) for PC platforms. Perhaps the most important reason for this was cost. Copies could be purchased for $30 each under Harvard's site license with Novell (who then owned WordPerfect; it has since been sold to Corel). Also, about half of the group were already using WordPerfect for their finding aids and so felt very comfortable with the software. Two members of the group were Mac-based, and this posed a difficulty. The only package it was possible to recommend was Author/Editor--at $750, a much more expensive alternative, but still cheaper than many of the other authoring packages tested.
Since 1995, of course, the field has changed. WordPerfect now bundles its SGML module with Versions 7 and 8 (Windows95 or NT required), and many archivists obtain it "free" as standard software on local area networks. Several members of the project have recently purchased Author/Editor and are beginning to use it. The Research Libraries Group makes available to members a package including Author/Editor (SGML authoring), Panorama Pro (SGML browser and style sheet editor), and HotMetal (HTML authoring) at a cost of $375 for the first copy and $600 for subsequent copies. As the DFAP project needed only one copy of Panorama Pro to write the SGML display style sheet and did not need HotMetal, the Harvard systems office negotiated a price with SoftQuad of $275/package for Author/Editor, if at least five copies are purchased at one time. In 1998, SoftQuad sold Author/Editor to Interleaf; we are not certain of the future of the software.
Finding aids on the World Wide Web
The second task of the Project was to decide on a method for "publishing" the SGML-encoded finding aids on the Internet. After evaluating access issues, we decided that the finding aids will be available through two "gateways."
The first "gateway" is a link from the collection-level MARC record in HOLLIS, Harvard's public catalog. HOLLIS is already the database our researchers search first for information about manuscript and archive collections held at Harvard, and the Project has mandated that every SGML finding aid must have a corresponding collection-level MARC record in HOLLIS. One might assume that in a place like Harvard, particularly with the Survey and Guide Project, such records would already be in place. Not so! And Houghton (my own repository) is perhaps the worst offender. The Survey Project covered many, but not all, Houghton collections, and between the end of the project in 1986 and my arrival as curator in 1992, there was no one in the Manuscript Department who knew how to create a MARC record. A lack of support for MARC cataloging is still a problem for many of the smaller Harvard repositories, and this is something that Harvard will need to address programmatically. Keep this in mind if you begin your own EAD project--it will (or should) force you to think about your entire system of collection control, and there may be additional work other than EAD itself that you will need to anticipate.
The second "gateway" to the SGML finding aids is the search interface for a separately-searchable finding aids database (the database was eventually named OASIS, for Online Archival Search Information System). It is extremely important to us to be able to do cross-finding aid, cross-repository searching. This database resides in HOLLIS Plus, a large group of databases outside the online catalog, HOLLIS, and is available at no cost to those outside Harvard. To implement OASIS, an SGML search engine was necessary.
Initially, the DFAP finding aids were to serve as a test database for OCLC's SiteSearch, which the University Library was considering purchasing for use with a number of Harvard databases, including the finding aids. But we found that the finding aids were simply too complicated, and too long, for SiteSearch to handle well without an enormous investment in programming time. For a while, we used a freeware keyword search engine with limited capabilities. Eventually, in 1998, we selected the search engine now in place, OpenText 5, and a Harvard-customized version of a web gateway that was originally developed by the University of Michigan's Digital Library Production Services.
While we waited for the search engine to be implemented, the finding aids could be viewed individually in SGML using Panorama Free, as well as in HTML. A program converts the SGML into HTML documents "on the fly." Until SGML browsers are readily available--and XML offers that possibility in the not-too-distant future--individual finding aids will be viewable in HTML. (The indexes used by the search engine to locate the appropriate individual finding aids will, however, be drawn from the SGML tagging.) The major drawback to viewing a finding aid in HTML is that one loses the powerful navigational features that SGML browsers such as Panorama offer. Browsing a 200-page finding aid in HTML can be tedious; browser utilization of SGML markup makes going back and forth in an electronic finding aid much easier.
Doing SGML markup with EAD
The Harvard Project's third task, undertaken concurrently, was to mark up a variety of Harvard finding aids using SGML. It quickly became clear that, given the divergence of traditional practice at Harvard repositories, a Harvard standard (based, of course, on the Berkeley-developed standard, by this point in time called EAD) was needed. This standard does not so much prescribe the content of the finding aid (if you look at the Harvard guidelines on the DFAP web site, almost every tag is optional), as it does the structure of the finding aid. We discovered, once we trained ourselves to think about the intellectual content of the finding aid divorced from how it "looked" (and this is, truthfully, the single most difficult thing to do in applying the EAD), there were not that many differences in content. Rather, what differed was the order in which information was presented and the level of descriptive detail provided.
One of the initial difficulties was how to make sense of an incredibly detailed and complex SGML Document Type Definition (DTD). I remember very clearly our first few meetings, when it was difficult to see the forest for the trees--the trees being hundreds of pointy angle-brackets--and I felt very depressed at the prospect of trying to make sense of it all. This feeling passed once I realized that it was not necessary to apply all the EAD tags, and as I began to understand the structure of the EAD DTD. We began the job of understanding EAD by passing around copies of what we each considered to be "typical" finding aids, and then we analyzed their various parts, trying to map them to the many EAD tags. This process will, I hope, be much simpler once the formal application guidelines are available from the Society of American Archivists.
But even with guidelines, I would urge every archivist to undertake this exercise of evaluating finding aid structure; I found it an illuminating and stimulating experience. Archivists are accustomed to doing things a certain way, not examining why, or whether it is the best way, or if what we are doing is comprehensible to our user community. Learning how to apply EAD forces one to think hard about all these issues. I believe that over the next few years, finding aids will change--and not only the finding aids themselves, but also collection-level MARC records, and what information goes in which place, and how much overlap there is between them.
By analyzing and mapping the elements in each of the finding aids, we soon decided that less was more. We chose to mark the structure of the front matter of the finding aid at a fairly general, not at the most specific, level. We use optional "level" and "attribute" tags only if we feel a need to do so for indexing purposes. There are an enormous number of options available in EAD; the challenge is to use the minimum level of markup that will produce a useful electronic document. This is not to say that a Harvard repository is prohibited from doing more detailed markup; it can be as detailed as the repository deems appropriate, as long as the basic structure conforms to the Harvard guidelines. But in terms of broad access to the information in the finding aid, we reached consensus that in many instances, implementing EAD to its finest, most granular level really was not necessary.
There is greater detail, and more variety in practice, in the markup of the container listing portion of the finding aid. Here, each repository judges the level of access needed by its user community. With literary collections, for example, every correspondent's name may be an important access point. In a corporate archive, name access may not be as important, and the markup of names may be much more selective. Thus far, the participating repositories have marked up all names and all dates. Many form and genre terms are also marked up. Most problematic is use of the <subject> tag, since this leads into a discussion of the usefulness of indexing uncontrolled vocabulary terms (this is also an issue with indexing names, as not all repositories do name authority work). We have not yet determined the ideal level of markup, but continue our discussions and our experimentation.
As a general principle, we mark up finding aids to a level that allows the building of useful indexes. (See Indexing decisions[link]) There is a school of thought that says that full-text keyword searching of finding aids, not field-specific indexing, provides adequate access. Harvard's experience with HOLLIS, with some 8 million bibliographic records, led us to favor field-specific keyword indexes, which we believe will provide more rapid and more precise retrieval of pertinent information.
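To make the distinction between field-specific and full-text indexing concrete, here is a minimal sketch in Python. It is not the indexing used by OASIS. It assumes a small, invented, XML-well-formed fragment of a container list using the EAD elements <persname>, <unitdate>, and <genreform>, and the function name, sample data, and identifier "fa-001" are all hypothetical. Each chosen tag gets its own index, so a name search never matches a term that appears only as a date or genre.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, XML-well-formed fragment of an EAD container list.
# Real SGML finding aids may need conversion before an XML parser accepts them.
SAMPLE = """
<c01>
  <did>
    <unittitle>Letters to <persname>Jones, John</persname></unittitle>
    <unitdate>1860-1865</unitdate>
  </did>
  <c02>
    <did>
      <unittitle><genreform>Diaries</genreform> of <persname>Smith, Mary</persname></unittitle>
      <unitdate>1862</unitdate>
    </did>
  </c02>
</c01>
"""

# Tags chosen for illustration; a repository would pick its own list.
INDEXED_TAGS = ("persname", "unitdate", "genreform")


def build_indexes(xml_text, finding_aid_id):
    """Return {tag: {normalized term: set of finding aid ids}}."""
    indexes = {tag: defaultdict(set) for tag in INDEXED_TAGS}
    root = ET.fromstring(xml_text)
    for tag in INDEXED_TAGS:
        for element in root.iter(tag):
            # Collapse whitespace and lowercase so variant spacing matches.
            term = " ".join("".join(element.itertext()).split()).lower()
            if term:
                indexes[tag][term].add(finding_aid_id)
    return indexes


indexes = build_indexes(SAMPLE, "fa-001")
print(sorted(indexes["persname"]))
```

A full-text keyword index would instead pool every word of the document into a single index, which is simpler to build but cannot tell a personal name from the same string used elsewhere.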
In January 1997, the Project did a survey to attempt to quantify the scope of a possible Harvard retrospective conversion project for finding aids. Twelve of the 49 repositories responded. (This did, however, include all the largest collections, with the exception of the Medical School, then undergoing renovation and reorganization.) There are more than 14,000 finding aids in these 12 Harvard repositories. Only 21% are already in electronic form, 50% are typed pages or cards, and the remaining 29% are handwritten pages and cards. We do not know exactly how many pages of data this represents, but we estimate at least 500,000. In terms of numbers of alphanumeric characters, a Harvard finding aids database will be roughly equal in size to HOLLIS. Field-specific indexing will allow more precise retrieval within a very large database. Full keyword searching also will be an option, and it will be interesting to compare the uses of each approach.
In planning EAD projects, it is important to keep this question of scale in mind. For Harvard, given the potential size of a finding aids database, tagging at a level that will produce field-specific indexes is important. In the long term, if we think on a national or even international scale about access to electronic finding aids, it seems equally important. But for smaller repositories interested primarily in making finding aids available locally, detailed markup may not be seen as necessary; certainly developing a separate search engine would not be a wise use of resources. A few consortia, including RLG and SOLINET, are providing host sites for finding aids marked up by their members; this may be the best solution for small- and medium-sized repositories who want to make their collections available to the wider research community.
The Harvard encoding guidelines have been in place (and available on the DFAP web site) since fall 1996. They are modified and clarified as project members gain more markup experience. Individual repositories also are working on their own implementation standards, translating the Harvard guidelines into local practice. An SGML style sheet, which displays all Harvard finding aids in a uniform style, was developed using Panorama Pro; this work was done largely by Project member Susan von Salis of the Schlesinger Library with the assistance of MacKenzie Smith of the Office for Information Systems. The HTML style sheet was modified by MacKenzie Smith from that used by the University of Michigan. More than 60 finding aids had been encoded as of August 1997, and more than 120 by February 1999.
The diversity of archives at Harvard, and the considerable experience of the repositories' staffs, has made Project meetings lively forums for informed discussion and creative problem solving. Having participated in such a collaborative approach to understanding and implementing this technology, I have intense admiration for those who have undertaken EAD projects largely on their own initiative. At Harvard, we have found it necessary to meet regularly: for the first year we met for two hours every two weeks, and more recently, about every two months. We need to give each other moral support, because we all work on this project in our "spare" time. If it were not a congenial and extremely hard-working group, none of us would have the energy to keep going, given everything else we are expected to do.
Having many people involved in the project also makes its likelihood of survival much higher. When a project is the brain-child of one person, the project can atrophy or die if that person leaves the organization. At Harvard, there are now enough people involved and committed that one of us can leave (and some have), and the project will continue. I would recommend that any "lone arranger" who is considering EAD implementation try to find at least one other local archive with whom to collaborate. It is helpful to have someone with whom you can talk face to face (e-mail is good, but not really the same). As mentioned earlier, implementing EAD forces you to rethink how your finding aids are structured and what they accomplish; this takes considerable thought and energy, and it helps enormously to discuss the issues with other archivists.
One repository's experience using EAD
Over the past 10-15 years, Houghton finding aids were produced using a fairly standard format. In planning for the inclusion of EAD tagging in our workflow, I took the Harvard EAD guidelines, went through a typical Houghton finding aid, and created a template with SGML tags. (See Houghton Guidelines [link]) Instead of the manuscript cataloger having to know a lot about EAD markup in order to insert the appropriate tags, the tags appear already on the screen as part of the template, and the cataloger simply "fills in the form." The template is for a collection of personal papers, which is the bulk of what we acquire, and the template is modified when necessary. The template can receive text as well as tags, so any repeating text can also be inserted. This reduces the amount of keying and also helps enforce which tags should and should not be used. While inevitably there is some "tweaking" of the markup that needs to be done, the bulk of it is done quickly and easily. The template works equally well in both WordPerfect and Author/Editor (we used both initially, although now use only WordPerfect). One of the advantages of SGML is that data (and thus templates) can migrate between authoring packages with a minimum of fuss.
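The "fill in the form" idea can be sketched in a few lines of Python. This is only an illustration, not the Houghton template (which was built in WordPerfect and Author/Editor): the skeleton below is invented, uses a handful of EAD elements from the descriptive identification area, and the function name and sample values are hypothetical. The tags and the repeating repository text are fixed in the template, and the cataloger supplies only the variable content.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical skeleton: tags and repeating text are fixed,
# $placeholders mark what the cataloger fills in.
TEMPLATE = Template("""<archdesc level="collection">
  <did>
    <repository>Houghton Library, Harvard University</repository>
    <origination><persname>$creator</persname></origination>
    <unittitle>$title</unittitle>
    <unitdate>$dates</unitdate>
    <physdesc>$extent</physdesc>
  </did>
</archdesc>""")


def fill_template(**fields):
    """Insert cataloger-supplied text into the tagged skeleton.

    Markup characters in the values are escaped so that typed text
    cannot be mistaken for tags. A missing field raises KeyError.
    """
    safe = {name: escape(value) for name, value in fields.items()}
    return TEMPLATE.substitute(safe)


record = fill_template(
    creator="Doe, Jane",
    title="Jane Doe papers",
    dates="1860-1865",
    extent="3 boxes",
)
print(record)
```

Because the skeleton fixes which tags appear and in what order, it enforces the structure while leaving the content free, which is the same division of labor the guidelines aim for.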
While the template approach works well for new finding aids, it is not particularly useful for Houghton's "legacy" finding aids. A finding aid created 5 years ago is formatted differently than one created 30 years ago. While the two contain much of the same information, the difference in the "look" makes it difficult to formulate routines for automatic conversion. This is even more the case when one looks at all the Harvard repositories over time. It will be necessary to formulate conversion guidelines not only for each repository, but for different time periods, and different kinds of collection finding aids, within each repository. Conversion of existing finding aids cannot take place within each archivist's day-to-day work; outside funding and staff clearly will be needed.
The end of Phase I: OASIS
After long delay, occasioned by changes in technology, changes in the EAD itself, and competing development priorities within the library systems office, a search interface was implemented for the finding aids database in July 1998. At this point, the database was no longer a "project," it was in production--so it needed a name. We called the database OASIS (Online Archival Search Information System): an acronym easy to pronounce and spell, into which we could cram most of the words we felt were necessary to describe the system!
The group had formulated its preliminary indexing decisions [link] back in February 1997, and, updated to EAD v.1, this document was used to formulate the access points needed on the search screen. For several months, we debated terminology (what to call the indexes), drafted and re-drafted "help" screens, refined the HTML display of the finding aids, made minor changes in the guidelines to resolve indexing and display inconsistencies and, overall, worked to make OASIS understandable to those happening upon it for the first time.
In these discussions, two of the indexes--dates and names--posed particular challenges.
Dates appear in finding aids in a variety of forms: 1860-65, 1860-1865, 28 Feb. 1865, February 28, 1865, etc. Prospectively, we could reduce some of this variation by requiring in the guidelines that years be given in four digits. Because OpenText searches for a specific string of numbers, a search finds a given year only if it appears in four-digit form. Initially, we believed that LiveLink could not handle date ranges. Upon further investigation, however, we found that it could handle ranges, e.g. a search for 1862 would retrieve a finding aid with the date range 1860-1865.
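The date problem can be illustrated with a short Python sketch. It is independent of the search engine and does not describe how OpenText works internally. It assumes only the date forms listed above, and the function names are hypothetical. The sketch expands an abbreviated end year such as "65" to four digits and then tests whether a searched year falls inside the resulting span.

```python
import re

# A four-digit year, a hyphen, then a four-digit or two-digit end year.
YEAR_RANGE = re.compile(r"\b(\d{4})\s*-\s*(\d{4}|\d{2})\b")
SINGLE_YEAR = re.compile(r"\b(\d{4})\b")


def year_span(date_text):
    """Return (first_year, last_year) for a date string, or None."""
    match = YEAR_RANGE.search(date_text)
    if match:
        start_text, end_text = match.group(1), match.group(2)
        if len(end_text) == 2:
            # Borrow the century from the start year: "1860-65" -> 1865.
            end_text = start_text[:2] + end_text
        return (int(start_text), int(end_text))
    match = SINGLE_YEAR.search(date_text)
    if match:
        year = int(match.group(1))
        return (year, year)
    return None


def matches_year(date_text, year):
    """True if `year` falls within the span described by `date_text`."""
    span = year_span(date_text)
    return span is not None and span[0] <= year <= span[1]


for sample in ["1860-65", "1860-1865", "28 Feb. 1865", "February 28, 1865"]:
    print(sample, year_span(sample), matches_year(sample, 1862))
```

Other date forms (for example, ranges that cross a century boundary with a two-digit end year) would need additional rules.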
Names also posed a difficulty because they often are not normalized within finding aids, and because they can appear in both direct (John Jones) and inverted (Jones, John) order. Obviously, when one does a name search, one wants to retrieve all forms of the name with one search. A partial solution was found by using proximity searching. Using two separate search boxes, one can put the first name in one and last name in the other, and find them near each other (within 40 characters). While this sometimes produces false hits, these are easily identifiable from the search results screen. We hope that enhancements to the search engine in the future will allow us to limit the "near" to a specific EAD tag, such as the <persname> (personal name) tag.
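The proximity approach can be sketched in Python as follows. This is an illustration of the idea, not the OASIS implementation: the function name is hypothetical, and distance is measured between the starting positions of the two words, which is a simplifying assumption. The 40-character window comes from the description above.

```python
import re


def names_near(text, first, last, window=40):
    """True if `first` and `last` occur as whole words within `window`
    characters of each other, in either order (case-insensitive)."""
    lowered = text.lower()

    def positions(word):
        pattern = r"\b" + re.escape(word.lower()) + r"\b"
        return [m.start() for m in re.finditer(pattern, lowered)]

    first_positions = positions(first)
    last_positions = positions(last)
    return any(
        abs(f - l) <= window
        for f in first_positions
        for l in last_positions
    )


print(names_near("Letters from John Jones to his sister", "John", "Jones"))
print(names_near("Jones, John. Correspondence", "John", "Jones"))
# A false hit: two different people whose names happen to be close together.
print(names_near("John Smith wrote to Mary Jones", "John", "Jones"))
```

The third call shows why false hits occur: the search knows only that the two words are near each other, not that they belong to the same name. Limiting the comparison to text inside a single <persname> element would remove that kind of match.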
The larger context
I will conclude with some observations about the Digital Finding Aids Project and EAD within the larger Harvard community.
My first point has to do with the "inevitability" of EAD implementation within the Harvard context. Harvard is wary of innovation, as are many other institutions. It has been critical for us to stress that EAD is an outgrowth of all the work that has gone before. Without the Survey and Guide project to create collection-level MARC records, without experimentation with gopher sites, and without the recommendations of the Special Collections Task Force within the context of Harvard's preparations for a new online catalog, DFAP would not (and probably should not) have happened at Harvard.
Secondly, the long-term "survivability" of DFAP depends on its relevance within the overall Harvard electronic information environment. We are not solely a group of archivists. With the Electronic Texts Librarian for Harvard College as a DFAP member, we keep in close touch with what is being planned in electronic information resources, and can anticipate how our project can utilize and enhance these efforts. We obtained approval to switch the proposed search engine to LiveLink because that software was to be used for a resource much in demand by the faculty, the Oxford English Dictionary, and so we were able to argue that we should not stay with SiteSearch. Archives and manuscripts will continue to be a minority group within the larger electronic information system, and it is up to us to keep current with the larger context and to understand how it can serve our needs.
It also has been essential to have a good relationship with OIS, the university's library systems office, given that individual archives at Harvard will never have the staff and expertise to maintain a large and complex database over time. The University Library has been very supportive of what the manuscript and archive community is attempting to accomplish. But equally, members of DFAP have put much time and effort into the project; it certainly has not been a question of archivists saying, "We want this, please give it to us." The archivists in DFAP have been largely responsible for writing the style sheet and the SGML to HTML conversion program (Susan von Salis and Kim Brookes, from Radcliffe College), are maintaining the project's web site (Lisa DeCesare), are doing all of the actual markup without external programming support, and have invested much time in developing standards. OIS has provided the programming expertise for the implementation of the search engine, something most archivists cannot reasonably be expected to do themselves. But our contribution has been considerable.
The archival community is in a good position to get needed support from library systems offices at the moment, because what we want to do fits in well with other projects that many libraries are undertaking or planning to undertake. At Harvard, one member of DFAP is now the University Library's Digital Projects Librarian, four members were part of various Task Groups responsible for selecting HOLLIS II, the next generation Harvard online catalog, and two members are part of the College Library's new Digital Library Projects development team. The nationwide effort to define and develop a National Digital Library will highlight the contribution that archives and EAD can make; it also will draw needed resources into the development of fully functional electronic finding aids for all kinds of collections.
This brings me to my third point: the necessity of explaining the project to non-archivists within an institution. By making presentations to various library managers groups, we have garnered interest and support among the upper management tier, as well as potentially bringing into the project many additional finding aids from visual, microform, and audio recording collections across the university. We regularly issue "press releases" on University listservs and have had enthusiastic feedback, particularly from reference librarians, who are now much more aware of manuscript and archive collections when giving research advice to students and faculty. Particularly in a large and decentralized institution such as Harvard, such "specialized" collections can become isolated from the larger research context. Electronic finding aids, integrated into the overall information structure of the university, enable us to demonstrate that "special collections" are essential to the research mission of the university, not luxury items that are peripheral to the "real" business of research. This, to me, is the most compelling argument for implementing EAD.


