Overview: Web Archiving

LDI is embarking on a two-year pilot project to acquire web sites for long-term archiving, with university project partners providing content and curatorial perspective.

The key technical focus will be the development of services for the ingest, storage, preservation and basic delivery of web sites. The pilot will harvest surface web content in a focused area to enable small-scale, curated collections. In general, the "surface web" refers to content that search engines can discover through web crawlers, as opposed to content held in a database that web crawlers cannot reach.

During the course of this project, LDI staff and curatorial staff from the partners' project units will address collections management issues specific to archiving web sites, such as establishing policies concerning intellectual property rights issues, exploring best practices for describing archived web sites, and determining searching and browsing requirements for delivery.

For an introduction to the new service, including background information and preliminary designs, see the September 2006 Architecture Review presentation "Web Archiving Collection Service - WAX" (HTML version or PDF version).

Pilot Project Test

Right now, we are testing our pilot software and our ability to crawl, harvest and archive web sites. Our crawler is called:

hul-wax

Our crawler will obey all common instructions in robots.txt files. You can explicitly allow or block our crawler on your web site by adding an entry for it to your robots.txt file.

The following text added to the robots.txt file will allow our harvester to crawl your web site:

User-agent: hul-wax
Disallow:

The following text added to the robots.txt file will prevent our harvester from crawling your web site:

User-agent: hul-wax
Disallow: /

More information about robots.txt files can be found at: http://www.robotstxt.org/robotstxt.html
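
If you would like to check how a robots.txt entry is likely to be interpreted before publishing it, the two entries above can be tested locally. The following is a minimal sketch using the robots.txt parser in Python's standard library. It is not our crawler's own code, and our crawler's parsing may differ in edge cases. The helper name can_crawl and the URL http://www.example.org/index.html are hypothetical and used only for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration; it relies on Python's standard
    parser, not on the hul-wax crawler itself.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two entries shown above, as they would appear in a robots.txt file.
ALLOW_RULES = "User-agent: hul-wax\nDisallow:\n"
BLOCK_RULES = "User-agent: hul-wax\nDisallow: /\n"

# Hypothetical page on your site (an assumption, not a real target).
SITE_URL = "http://www.example.org/index.html"

if __name__ == "__main__":
    print("allow entry permits hul-wax:", can_crawl(ALLOW_RULES, "hul-wax", SITE_URL))
    print("block entry permits hul-wax:", can_crawl(BLOCK_RULES, "hul-wax", SITE_URL))
```

Note that each entry applies only to the crawler named in its User-agent line. Other crawlers are unaffected unless your robots.txt file has entries of their own for them.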

If our pilot project is successful, our long-term plan is to create and maintain a system to collect, store and display web sites, including those we are collecting now, as documents of our history, in the same way that we manage other significant historical materials.

We currently aim to begin displaying the harvested material in December of 2008.