Office for Information Systems - Harvard University Library

Web Archiving Resources

This list of resources covering web archiving tools and practices was last updated in 2007.

Mailing and News Lists
IP Rights
Metadata / Cataloging
Web Archives
Research Projects and Reports
Workshops and Conferences
General Web Archiving Information
Tools
    Deep Web / Database Software
    Harvesting Services
    Harvesting Software
    Discovery, Delivery and Access Software
    General Web Archiving Suites


Mailing and News Lists

archive-crawler@yahoogroups.com
http://crawler.archive.org/mail-lists.html
Group for discussing the Heritrix (http://crawler.archive.org/) web crawler.

web-archive@cru.fr
http://listes.cru.fr/sympa/info/web-archive
General list focused on web content archiving, the legal deposit of online publications and digital preservation. The list was established following a decision made at the ECDL2001 (European Conference on Research and Advanced Technology for Digital Libraries 2001) workshop in September 2001. Very little traffic.

IP Rights

Charlesworth, Andrew. (February 2003). Legal issues relating to the archiving of Internet resources in the UK, EU, USA and Australia. JISC, The Wellcome Trust, University of Bristol.
http://www.jisc.ac.uk/uploaded_documents/archiving_legal.pdf

(2003) “Website Permissions”, Copyright & Fair Use, Chapter 6, Stanford University Libraries.
http://fairuse.stanford.edu/Copyright_and_Fair_Use_Overview/chapter6/index.html

Field, Thomas G., Jr. (2002) “Copyright on the Internet”, Franklin Pierce Law Center.
http://www.fplc.edu/tfield/copynet.htm

Kavcic-Colic, Alenka. (August 2002) Archiving the Web - some legal aspects. 68th IFLA Council and General Conference, Glasgow.
http://www.ifla.org/IV/ifla68/papers/116-163e.pdf

Minow, Mary. (2002) “Copyright: What You Need to Know about Your Library's Web Page”. California Library Association. Reprinted from California Libraries, March 2000, Vol. 10, No. 3.
http://www.cla-net.org/resources/articles/minow_copyright.php

National Diet Library of Japan copyright notes:
http://www.ndl.go.jp/en/attention/index.html

“Copyright Management Center”, Indiana and Purdue Universities.
http://www.copyright.iupui.edu/index.htm
Includes a fair-use checklist and a permissions information section.

Copyright Statement for the September 11 Web Archive, Library of Congress
http://lcweb2.loc.gov/diglib/lcwa/html/sept11/sept11-overview.html#copyright

Preserving Access to Digital Information (PADI): Intellectual property rights management, The National Library of Australia
http://www.nla.gov.au/padi/topics/28.html

Harper, Georgia K. (2001) “The UT System Crash Course in Copyright”, University of Texas.
http://www.utsystem.edu/ogc/intellectualproperty/cprtindx.htm

U.S. Copyright Office
http://www.copyright.gov/
(December 2004) “Copyright Registration for Online Works”, Circular 66
http://www.copyright.gov/circs/circ66.html

Metadata / Cataloging

Appendix 2 of Dollar Consulting. (July 20, 2001). Archival Preservation of Smithsonian Web Resources: Strategies, Principles, and Best Practices. Smithsonian Institution Archives.
http://siarchives.si.edu/research/main_pubs.html

International Internet Preservation Consortium (IIPC)
http://netpreserve.org
The IIPC's Framework Working Group is working on metadata schema specifications for harvested Internet/web resources, but no drafts are publicly available on the web site yet.

Election 2002 Web Archive Cataloging and Description, MINERVA, Library of Congress, 2004.
http://lcweb4.loc.gov/elect2002/catalog/2860.html

Web Archives

Sites that permit public access to archive

The Internet Archive (IA)
http://www.archive.org
A 501(c)(3) public nonprofit organization that receives web content on a regular basis from Alexa Internet and displays it through the Wayback Machine interface. The largest web archive in the world.

MINERVA, Library of Congress
http://www.loc.gov/minerva/
A selective web archive based on themes of national importance, such as national political elections, wars and terrorism.

PANDORA (Preserving and Accessing Networked Documentary Resources of Australia), National Library of Australia with nine other Australian libraries and cultural collecting organizations
http://pandora.nla.gov.au/index.html
About PANDORA: http://pandora.nla.gov.au/about.html

UK Government Web Archive
http://www.nationalarchives.gov.uk/preservation/archivedwebsites.htm
A selective collection of UK Government websites, archived at regular intervals from August 2003, and developed by The National Archives using the services of the Internet Archive.

Web Archiving Project (WARP), National Diet Library, Japan
http://warp.ndl.go.jp/ (Site is in Japanese.)

Sites that Require a Password

DACHS (Digital Archive for Chinese Studies)
The metadata section of DACHS is open to everyone. Access to the documents and resources themselves is restricted to password holders. To apply for a password (free of charge), send the DACHS staff (dachs@sino.uni-heidelberg.de) a description of your research purpose and institutional affiliation.
http://www.sino.uni-heidelberg.de/dachs/

Sites that do not permit public access to the web archives – note: these sites still have useful documentation that is accessible

Austrian On-line Archive (AOLA), Austrian National Library and Technical University of Vienna (Not publicly-accessible)
http://www.ifs.tuwien.ac.at/~aola/
Attempting to archive the entire Austrian web space. They started with the Nedlib harvester, and then switched to the Combine harvester.

DAMP (Digital Archive for Web Publications), University of Zagreb and the National and University Library (NUL) in Zagreb, Croatia
Archiving the Croatian web space.
An article about DAMP:
http://widwisawn.cdlr.strath.ac.uk/Issues/Vol3/issue3_3_1.html

Kulturarw3 - KB Web Archive, Royal Library - the National Library of Sweden (Only viewable within the Royal Library)

netarchive.dk, The State and University Library and the Royal Library, Denmark (initially can only be used for research purposes and with prior permission from the Danish Data Protection Agency)
http://netarchive.dk/

WebArchiv, National Library of the Czech Republic and Masaryk University in Brno
http://www.webarchiv.cz/index-e.html

Research Projects and Reports

Cornell Web Library

A project to make a large body of the Internet Archive's material available to academic researchers for primarily social science research. Involves the creation of tools and techniques for managing the web material.

http://www.dlib.org/dlib/february06/arms/02arms.html

The web at risk, California Digital Library
http://www.cdlib.org/inside/projects/preservation/webatrisk/
The California Digital Library has been awarded a three-year, $2.4 million grant from the Library of Congress as part of the National Digital Information Infrastructure and Preservation Program (NDIIPP) to develop web archiving tools that will be used by libraries to capture, curate, and preserve collections of web-based government and political information.

Boudrez, Filip and Sofie Van den Eynde. (July 2002). Archiving websites, DAVID (Digital Archiving in Flemish Institutions and Administrations) Project.
http://www.dma.be/david/website/teksten/Rapporten/Report5.pdf

Boyko, Andrew. (July 2004). Test Bed Taxonomy for Crawler. Metrics and Testbed Working Group, International Internet Preservation Consortium.
http://netpreserve.org/publications/reports.php
Describes the challenges of harvesting, detecting internal links and presenting web content.

Center for Research Libraries. (June 2004). Political Communications Web Archiving. Andrew W. Mellon Foundation.
http://www.crl.edu/content/PolitWebReport.htm
This report's 40 appendices cover a wide range of topics including evaluations of harvesters and archives, automatic harvesting of metadata, a METS template for metadata, etc.

Crook, Edgar. (August 15, 2006). For the Record: Assessing the Impact of Archiving on the Archived. National Library of Australia.
http://www.rlg.org/en/page.php?Page_ID=20962

Dollar Consulting. (July 20, 2001). Archival Preservation of Smithsonian Web Resources: Strategies, Principles, and Best Practices. Smithsonian Institution Archives.
http://siarchives.si.edu/pdf/dollar_report.pdf

Dollar Consulting. (October 15, 2002). Archival Preservation of Web Resources: Digital Quality Assurance Tool - Technical Evaluation and Recommendations. Smithsonian Institution Archives.
http://siarchives.si.edu/research/main_pubs.html

Dollar Consulting. (July 1, 2002). Archival Preservation of Web Resources: HTML to XHTML Migration Test Technical Evaluation and Recommendations. Smithsonian Institution Archives.
http://siarchives.si.edu/research/main_pubs.html

Gupta, Amarnath. (January 18, 2001). Preserving Presidential Library Websites. San Diego Supercomputer Center.
http://legacy.sdsc.edu/TR/TR-2001-03.pdf

Kenney, Anne R.; Nancy Y. McGovern, Peter Botticelli, Richard Entlich, Carl Lagoze, Sandra Payette. (January 2002). Preservation Risk Management for Web Resources. D-Lib Magazine, Vol. 8, No. 1.
http://www.dlib.org/dlib/january02/kenney/01kenney.html

Marill, Jennifer; Andrew Boyko, Michael Ashenfelder, Martha Anderson and Gina Jones. (July 20, 2004). Web Harvesting Survey. Metrics and Testbed Working Group, International Internet Preservation Consortium.
http://netpreserve.org/publications/reports.php
Classifies web site content into types and rates them in terms of the relative ease of harvesting, parsing and presenting the content.

Rauber, Andreas; Andreas Aschenbrenner, Oliver Witvoet, Robert M. Bruckner, Max Kaiser. (December 2002). Uncovering Information Hidden in Web Archives: A Glimpse at Web Analysis building on Data Warehouses. D-Lib Magazine, Vol. 8, No. 12.
http://www.dlib.org/dlib/december02/rauber/12rauber.html

Senserini, Alessandro; Robert B. Allen, Gail Hodge, Nikkia Anderson and Daniel Smith Jr. (November 2004). Archiving and Accessing Web Pages: The Goddard Library Web Capture Project. D-Lib Magazine, Vol. 10, No. 11.
http://www.dlib.org/dlib/november04/hodge/11hodge.html
A prototype project that experimented with harvesting selected web sites, using both automatically extracted and manually entered metadata.

Smithsonian Institution Archives Records Management Team. (May 2003). Archiving Smithsonian Websites: An Evaluation and Recommendation for a Smithsonian Institution Archives Pilot Project. Smithsonian Institution Archives.
http://siarchives.si.edu/research/main_pubs.html

Schneider, Stephen M. (April 2004). Library of Congress Election 2002 Web Archive Project: Final Project Report. Research Foundation of the State University of New York, SUNY Institute of Technology, WebArchivist.org.
http://www.webarchivist.org/resources.htm

Schneider, Stephen M. (June 2004). Library of Congress September 11 Web Archive Project: Final Project Report. Research Foundation of the State University of New York, SUNY Institute of Technology, WebArchivist.org.
http://www.webarchivist.org/resources.htm

Workshops and Conferences

International Web Archiving Workshop (IWAW)
http://bibnum.bnf.fr/ecdl/
This annual workshop, organized by the French National Library, has been held in conjunction with the European Conference on Digital Libraries (ECDL) since 2001. Most of the proceedings of IWAW2001 - IWAW2005 are available from this website.

General Web Archiving Information

International Internet Preservation Consortium (IIPC)
http://netpreserve.org
A consortium of national libraries and major archives developing common web archiving tools, standards and techniques.

Preserving Access to Digital Information (PADI): Web Archiving, The National Library of Australia
http://www.nla.gov.au/padi/topics/92.html

Web Archiving Bibliography, Austrian On-Line Archive.
http://www.ifs.tuwien.ac.at/~aola/links/WebArchiving.html

Tools

Deep Web / Database Software

DeepArc, National Library of France (BnF)
http://deeparc.sourceforge.net/
A graphical editor for exporting database content into an XML document. Allows users to map their database model to an XML Schema. GPL license. The purpose of this software is for the web publishers of a deep web database to export their database structure and content into XML for submission to a web archive. This work is sponsored by the IIPC (International Internet Preservation Consortium).
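
The general idea of exporting database content to XML can be illustrated with a minimal sketch. This is not DeepArc itself and does not use its mapping model; the table, column names and XML layout below are hypothetical, and the sketch assumes column names are valid XML element names.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_table_to_xml(conn, table, root_tag="database"):
    """Serialize every row of one table as XML, one element per column."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]

    root = ET.Element(root_tag)
    table_el = ET.SubElement(root, "table", name=table)
    for row in cursor:
        row_el = ET.SubElement(table_el, "row")
        for column, value in zip(columns, row):
            cell = ET.SubElement(row_el, column)
            # NULL values become empty elements in this simple layout.
            cell.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical in-memory database standing in for a publisher's deep web database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, title TEXT)")
conn.execute("INSERT INTO articles VALUES (1, 'Archiving the Web')")

xml_text = export_table_to_xml(conn, "articles")
print(xml_text)
```

A real export would also need to record the schema (types, keys, relationships), which is the part a mapping to an XML Schema is meant to capture.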

Xinq (XML Archive INQuiry Tool), National Library of Australia
http://sourceforge.net/projects/xinq/
http://www.nla.gov.au/xinq/
A search and browse tool for accessing an XML database. Apache Software License. Written in Java and XML. The purpose of this software is to provide access to a deep web database that was archived as XML. This work is sponsored by the IIPC (International Internet Preservation Consortium).

Harvesting Services

Archive-It

A subscription harvesting service provided by the Internet Archive. Through a web based interface, users can capture, catalogue and archive their institution's own web site or build additional collections, and then search and browse the collection when complete.
http://www.archive-it.org/

Harvesting Software

Open source harvesting software

Combine Harvesting Robot
http://combine.it.lth.se/
Harvesting and indexing software written in Perl and C++ and released under the GPL license. Formerly used, and possibly still used, by the Swedish, Danish and Austrian archives. It is unclear whether it is still actively developed.

GNU Wget
http://www.gnu.org/software/wget/wget.html
A non-interactive command-line tool under the GPL license that can be used from scripts and other programs.
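
As a rough illustration of calling Wget from a script, the sketch below assembles a command line for mirroring a single site. The seed URL and output directory are placeholders, and the chosen options are only one plausible starting point, not a recommended harvesting configuration.

```python
import shlex


def build_wget_command(seed_url, output_dir, wait_seconds=1):
    """Return an argument list for a polite, self-contained mirror of one site."""
    return [
        "wget",
        "--mirror",             # recursive download with timestamping
        "--page-requisites",    # also fetch images, stylesheets, etc.
        "--convert-links",      # rewrite links so the copy browses offline
        "--no-parent",          # do not climb above the seed directory
        f"--wait={wait_seconds}",           # pause between requests
        f"--directory-prefix={output_dir}",  # where to store the files
        seed_url,
    ]


# Placeholder seed URL and directory, for illustration only.
command = build_wget_command("http://www.example.org/", "harvest")
print(shlex.join(command))

# To actually run the harvest (requires wget to be installed):
# import subprocess
# subprocess.run(command, check=True)
```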

Heritrix, Internet Archive and Nordic National Libraries
http://crawler.archive.org/
A robust web archiving harvester under the LGPL license. Has very flexible means to configure and control the harvest. Designed to be extensible by writing new Java modules. Configurable through a web interface. This work is sponsored by the IIPC (International Internet Preservation Consortium).

HTTrack
http://www.httrack.com/
An offline browser under the GPL license that can be used from a graphical interface or the command line.

Nalanda iVia Focused Crawler (NIFC)

http://ivia.ucr.edu/projects/Nalanda/

Designed to find Web resources with the same topic as a seed set of known resources. NIFC was created by Dr. Soumen Chakrabarti at the Indian Institute of Technology (Bombay), and further developed in collaboration with the iVia team.

Nedlib Harvester, Center for Scientific Computing - the Finnish IT Center for Science
http://www.csc.fi/english/csc/news/news/nedlib-2000-5-10/?searchterm=Nedlib%20Harvester
Developed as a part of the Nedlib project funded by the European Union. Written in C and dependent on the MySQL database. No longer supported or developed.

Commercial harvesting software

Internet Researcher, Zylox Software
http://www.zylox.com/
A Windows-only offline browsing tool with a graphical interface.

Offline Explorer and Mass Downloader, MetaProducts Software Corporation
http://www.metaproducts.com/mp/mpProducts_List.asp
Various offline browsing tools for Windows. The MetaProducts Offline Explorer Pro 2.1 is used by DACHS (Digital Archive for Chinese Studies) -
http://www.sino.uni-heidelberg.de/dachs/

RafaBot, Spadix Software
http://www.spadixbd.com/rafabot/
A Windows-only offline browsing tool with a graphical interface. In addition to accepting a list of URLs, RafaBot can be given search terms and will use search engines to find and download all matching web sites.

SuperBot, Sparkleware
http://www.sparkleware.com/superbot/index.html
A simple Windows-only offline browsing tool with a graphical interface.

SurfSaver, askSam Systems
http://www.surfsaver.com/
An add-on to Microsoft Internet Explorer.

Teleport Webspiders
http://www.tenmax.com/teleport/home.htm
Various sophisticated Windows-only versions with different interfaces (graphical, console, scriptable) and feature sets.

WebCopier, MaximumSoft Corp.
http://www.maximumsoft.com/index.html
An offline browsing tool with a graphical interface in multiple versions for different operating systems and performance/feature level.

Discovery, Delivery and Access Software

ARC Access Tools
http://archive-access.sourceforge.net/
Internet Archive's list of tools for processing and accessing content in ARC files.

Kea

http://www.nzdl.org/Kea/

A GPLed tool for automatic keyword extraction from text documents. Originally written in a combination of Perl, C and Java; now available in an all-Java version. From the New Zealand Digital Library at the University of Waikato, New Zealand.

libiViaMetadata

http://ivia.ucr.edu/manuals/unstable/libiViaMetadata/5.4.0/

A GPLed C++ library for assigning descriptive metadata to web files. Developed under the iVia Project. Includes the PhraseRate program which is described at

http://ivia.ucr.edu/projects/PhraseRate/

NutchWAX (Nutch + Web Archive eXtensions), Internet Archive and Nordic National Libraries
http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html#toc
A tool for indexing and searching web archives. Currently works only with the ARC format (http://www.archive.org/web/researcher/ArcFileFormat.php). Implemented as a Java servlet. Parsers can be added to handle different formats, e.g. xpdf for PDF files. This work is sponsored by the IIPC (International Internet Preservation Consortium).
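
To give a sense of what the ARC format contains, the sketch below parses the one-line header that precedes each record in a version 1 ARC file (URL, IP address, archive date, content type and length, separated by spaces). The sample line is invented, and the sketch assumes a well-formed header with no spaces inside the URL; it is not part of NutchWAX.

```python
from datetime import datetime
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int  # size in bytes of the record body that follows


def parse_arc_v1_header(line):
    """Split one ARC v1 URL-record header line into typed fields."""
    url, ip_address, date, content_type, length = line.strip().split(" ")
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


# Invented example line, for illustration only.
sample = "http://www.example.org/index.html 192.0.2.10 20041102120000 text/html 202\n"
header = parse_arc_v1_header(sample)
print(header.url, header.archive_date.isoformat(), header.length)
```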

Wayback, Internet Archive

http://archive-access.sourceforge.net/projects/wayback/

The open source version of the Internet Archive's proprietary search and display interface, the "Wayback Machine" (listed next).

Wayback Machine, Internet Archive
http://www.archive.org/web/web.php
A proprietary interface to the Internet Archive's huge collection of web pages archived from 1996 to the present.
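
Archived pages in the Wayback Machine are addressed by a 14-digit timestamp followed by the original URL. The sketch below builds such an address; the base address shown is an assumption based on the publicly visible URL pattern, and the Wayback Machine generally redirects to the capture closest to the requested time.

```python
from datetime import datetime


def wayback_url(original_url, when, base="http://web.archive.org/web"):
    """Build a Wayback Machine address for a page as captured near `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # YYYYMMDDhhmmss
    return f"{base}/{timestamp}/{original_url}"


# Placeholder page and date, for illustration only.
address = wayback_url("http://www.example.org/", datetime(2004, 11, 2, 12, 0, 0))
print(address)
```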

WERA (Web Archive Access), Internet Archive and National Library of Norway
http://nwa.nb.no/
An archive viewer application that provides Wayback Machine-like access to web archive collections, along with full-text search and easy navigation between different versions of a web page. WERA is based on, and replaces, the NwaToolset. It uses the NutchWAX search engine and is written in PHP and Java. This work is sponsored by the IIPC (International Internet Preservation Consortium).

General Web Archiving Suites

Software that is more of a system of web archiving tools rather than individual applications

DataFountains

http://ivia.ucr.edu/manuals/unstable/DataFountains/2.2.0/

A tool for discovering, harvesting and describing web resources. Developed under the iVia Project.

PANDAS (PANDORA Digital Archiving System), National Library of Australia
http://pandora.nla.gov.au/pandas.html
Tools for controlling the harvest, conducting quality assurance checking, initiating archiving processes, managing the metadata including access restrictions, and producing management reports. Uses the HTTrack harvester. PANDAS was created to enable very selective harvesting and is not intended for large-scale automated harvests. The developers of this software are re-engineering PANDAS to use IIPC tools like Heritrix and WERA, and to be better integrated with their digital repository.

WebArchivist Software Suite, SUNY Institute of Technology and University of Washington
http://www.webarchivist.org/resources.htm
Tools for entering metadata, searching, analyzing and displaying archived sites. The software isn't licensed yet but according to the product's website the plan is to make this software available to other organizations. Used for the Library of Congress' Election 2002 (http://lcweb4.loc.gov/elect2002/) and September 11 (http://september11.archive.org/) web archives as well as the Asian Tsunami Web Archive (http://tsunami.archive.org/).