Web Archiving Resources
This list of resources covering web archiving tools and practices was last updated in 2007.
Mailing and News Lists
IP Rights
Metadata / Cataloging
Web Archives
Research Projects and Reports
Workshops and Conferences
General Web Archiving Information
Tools
Deep Web / Database Software
Harvesting Services
Harvesting Software
Discovery, Display and Access Software
General Web Archiving Suites
Mailing and News Lists
archive-crawler@yahoogroups.com
http://crawler.archive.org/mail-lists.html
Group for discussing web crawling with Heritrix
(http://crawler.archive.org/).
web-archive@cru.fr
http://listes.cru.fr/sympa/info/web-archive
General list focused on web content archiving, the legal deposit of online
publications and digital preservation. Its creation was decided during
the ECDL2001 (European Conference on Research and Advanced Technology
for Digital Libraries 2001) workshop, in September 2001. Very little traffic.
IP Rights
Charlesworth, Andrew. (February
2003). Legal issues relating to the archiving of Internet resources
in the UK, EU, USA and Australia. JISC, The Wellcome Trust, University
of Bristol.
http://www.jisc.ac.uk/uploaded_documents/archiving_legal.pdf
(2003) “Website Permissions”,
Copyright & Fair Use, Chapter 6, Stanford University Libraries.
http://fairuse.stanford.edu/Copyright_and_Fair_Use_Overview/chapter6/index.html
Field, Thomas G., Jr. (2002)
“Copyright on the Internet”, Franklin Pierce Law
Center.
http://www.fplc.edu/tfield/copynet.htm
Kavcic-Colic, Alenka. (August
2002) Archiving the Web - some legal aspects. 68th IFLA Council
and General Conference, Glasgow.
http://www.ifla.org/IV/ifla68/papers/116-163e.pdf
Minow, Mary. (2002) “Copyright:
What You Need to Know about Your Library's Web Page”. California
Library Association. Reprinted from California Libraries, March
2000, Vol. 10, No. 3.
http://www.cla-net.org/resources/articles/minow_copyright.php
National Diet Library of Japan
copyright notes:
http://www.ndl.go.jp/en/attention/index.html
“Copyright Management
Center”, Indiana and Purdue Universities.
http://www.copyright.iupui.edu/index.htm
Includes a fair-use checklist and a permissions information section.
Copyright Statement for the
September 11 Web Archive, Library of Congress
http://lcweb2.loc.gov/diglib/lcwa/html/sept11/sept11-overview.html#copyright
Preserving Access to Digital
Information (PADI): Intellectual property rights management,
The National Library of Australia
http://www.nla.gov.au/padi/topics/28.html
Harper, Georgia K. (2001) “The
UT System Crash Course in Copyright”, University of Texas.
http://www.utsystem.edu/ogc/intellectualproperty/cprtindx.htm
U.S. Copyright Office
http://www.copyright.gov/
(December 2004) “Copyright Registration for Online Works”,
Circular 66
http://www.copyright.gov/circs/circ66.html
Metadata / Cataloging
Appendix 2 of Dollar Consulting.
(July 20, 2001). Archival Preservation of Smithsonian Web Resources:
Strategies, Principles, and Best Practices. Smithsonian Institution
Archives.
http://siarchives.si.edu/research/main_pubs.html
International Internet Preservation
Consortium (IIPC)
http://netpreserve.org
The IIPC's Framework Working Group is working on metadata schema specifications
for harvested Internet/web resources, but no drafts are publicly available
on the web site yet.
Election 2002 Web Archive Cataloging
and Description, MINERVA, Library of Congress, 2004.
http://lcweb4.loc.gov/elect2002/catalog/2860.html
Web Archives
Sites that permit public access to the archive
The Internet Archive (IA)
http://www.archive.org
A 501(c)(3) public nonprofit organization that receives web content from
Alexa Internet on a regular basis and displays it through the Wayback Machine
interface. The largest web archive in the world.
MINERVA, Library of Congress
http://www.loc.gov/minerva/
A selective web archive based on themes of national importance, e.g. national
political elections, wars and terrorism.
PANDORA (Preserving and Accessing
Networked Documentary Resources of Australia), National Library of Australia
with nine other Australian libraries and cultural collecting organizations
http://pandora.nla.gov.au/index.html
About PANDORA: http://pandora.nla.gov.au/about.html
UK Government Web Archive
http://www.nationalarchives.gov.uk/preservation/archivedwebsites.htm
A selective collection of UK Government websites, archived at regular
intervals from August 2003, and developed by The National Archives using
the services of the Internet Archive.
Web Archiving Project (WARP),
National Diet Library, Japan
http://warp.ndl.go.jp/ (Site is in
Japanese.)
Sites that require a password
DACHS (Digital Archive for Chinese
Studies)
The metadata section of DACHS is open to everyone. Access to the documents and resources themselves is restricted to password holders. To apply for a password (free of charge), send the DACHS staff (dachs@sino.uni-heidelberg.de) a description of your research purpose and your institutional affiliation.
http://www.sino.uni-heidelberg.de/dachs/
Sites that do not permit public access to the web archives (note: these sites still offer useful, publicly accessible documentation)
Austrian On-line Archive (AOLA),
Austrian National Library and Technical University of Vienna (Not publicly-accessible)
http://www.ifs.tuwien.ac.at/~aola/
Attempting to archive the entire Austrian web space. They started with
the Nedlib harvester, and then switched to the Combine harvester.
DAMP (Digital Archive for Web
Publications), University of Zagreb and the National and University Library
(NUL) in Zagreb, Croatia
Archiving the Croatian web space.
An article about DAMP:
http://widwisawn.cdlr.strath.ac.uk/Issues/Vol3/issue3_3_1.html
Kulturarw3 - KB Web Archive,
Royal Library - the National Library of Sweden (Only viewable within the
Royal Library)
netarchive.dk, The State and
University Library and the Royal Library, Denmark (initially can only
be used for research purposes and with prior permission from the Danish
Data Protection Agency)
http://netarchive.dk/
WebArchiv, National Library
of the Czech Republic and Masaryk University in Brno
http://www.webarchiv.cz/index-e.html
Research Projects and Reports
Cornell Web Library
A project to make a large body of the Internet Archive's material available to academic researchers, primarily for social science research. Involves the creation of tools and techniques for managing the web material.
http://www.dlib.org/dlib/february06/arms/02arms.html
The Web at Risk, California
Digital Library
http://www.cdlib.org/inside/projects/preservation/webatrisk/
The California Digital Library has been awarded a three-year, $2.4 million
grant from the Library of Congress as part of the National Digital Information
Infrastructure and Preservation Program (NDIIPP) to develop web archiving
tools that will be used by libraries to capture, curate, and preserve
collections of web-based government and political information.
Boudrez, Filip and Sofie Van
den Eynde. (July 2002). Archiving websites, DAVID (Digital Archiving
in Flemish Institutions and Administrations) Project.
http://www.dma.be/david/website/teksten/Rapporten/Report5.pdf
Boyko, Andrew. (July 2004).
Test Bed Taxonomy for Crawler. Metrics and Testbed Working Group,
International Internet Preservation Consortium.
http://netpreserve.org/publications/reports.php
Describes the challenges of harvesting, detecting internal links and presenting
web content.
Center for Research Libraries.
(June 2004). Political Communications Web Archiving. Andrew W.
Mellon Foundation.
http://www.crl.edu/content/PolitWebReport.htm
This report's 40 appendices cover a wide range of topics including evaluations
of harvesters and archives, automatic harvesting of metadata, a METS template
for metadata, etc.
Crook, Edgar. (August 15, 2006). For the Record: Assessing the Impact
of Archiving on the Archived. National Library of Australia.
http://www.rlg.org/en/page.php?Page_ID=20962
Dollar Consulting. (July 20,
2001). Archival Preservation of Smithsonian Web Resources: Strategies,
Principles, and Best Practices. Smithsonian Institution Archives.
http://siarchives.si.edu/pdf/dollar_report.pdf
Dollar Consulting. (October
15, 2002). Archival Preservation of Web Resources: Digital Quality
Assurance Tool - Technical Evaluation and Recommendations. Smithsonian
Institution Archives.
http://siarchives.si.edu/research/main_pubs.html
Dollar Consulting. (July 1,
2002). Archival Preservation of Web Resources: HTML to XHTML Migration
Test Technical Evaluation and Recommendations. Smithsonian Institution
Archives.
http://siarchives.si.edu/research/main_pubs.html
Gupta, Amarnath. (January 18,
2001). Preserving Presidential Library Websites. San Diego Supercomputer
Center.
http://legacy.sdsc.edu/TR/TR-2001-03.pdf
Kenney, Anne R.; Nancy Y. McGovern,
Peter Botticelli, Richard Entlich, Carl Lagoze, Sandra Payette. (January
2002). Preservation Risk Management for Web Resources. D-Lib
Magazine, Vol. 8, No. 1.
http://www.dlib.org/dlib/january02/kenney/01kenney.html
Marill, Jennifer; Andrew Boyko,
Michael Ashenfelder, Martha Anderson and Gina Jones. (July 20, 2004).
Web Harvesting Survey. Metrics and Testbed Working Group, International
Internet Preservation Consortium.
http://netpreserve.org/publications/reports.php
Classifies web site content into types and rates them in terms of the
relative ease of harvesting, parsing and presenting the content.
Rauber, Andreas; Andreas Aschenbrenner,
Oliver Witvoet, Robert M. Bruckner, Max Kaiser. (December 2002). Uncovering
Information Hidden in Web Archives: A Glimpse at Web Analysis building
on Data Warehouses. D-Lib Magazine, Vol. 8, No. 12.
http://www.dlib.org/dlib/december02/rauber/12rauber.html
Senserini, Alessandro; Robert
B. Allen, Gail Hodge, Nikkia Anderson and Daniel Smith Jr. (November 2004).
Archiving and Accessing Web Pages: The Goddard Library Web Capture
Project. D-Lib Magazine, Vol. 10, No. 11.
http://www.dlib.org/dlib/november04/hodge/11hodge.html
A prototype project that experimented with harvesting selected web sites,
using both automatically extracted and manually entered metadata.
Smithsonian Institution Archives
Records Management Team. (May 2003). Archiving Smithsonian Websites:
An Evaluation and Recommendation for a Smithsonian Institution Archives
Pilot Project. Smithsonian Institution Archives.
http://siarchives.si.edu/research/main_pubs.html
Schneider, Stephen M. (April
2004). Library of Congress Election 2002 Web Archive Project: Final
Project Report. Research Foundation of the State University of New
York, SUNY Institute of Technology, WebArchivist.org.
http://www.webarchivist.org/resources.htm
Schneider, Stephen M. (June
2004). Library of Congress September 11 Web Archive Project: Final
Project Report. Research Foundation of the State University of New
York, SUNY Institute of Technology, WebArchivist.org.
http://www.webarchivist.org/resources.htm
Workshops and Conferences
International Web Archiving
Workshop (IWAW)
http://bibnum.bnf.fr/ecdl/
This annual workshop, organized by the French National Library, has been
held in conjunction with the European Conference on Digital Libraries
(ECDL) since 2001. Most of the proceedings of IWAW2001 - IWAW2005 are
available from this website.
General Web Archiving Information
International Internet Preservation
Consortium (IIPC)
http://netpreserve.org
A consortium of national libraries and major archives developing common
web archiving tools, standards and techniques.
Preserving Access to Digital
Information (PADI): Web Archiving, The National Library of Australia
http://www.nla.gov.au/padi/topics/92.html
Web Archiving Bibliography,
Austrian On-Line Archive.
http://www.ifs.tuwien.ac.at/~aola/links/WebArchiving.html
Tools
Deep Web / Database Software
DeepArc, National Library of
France (BnF)
http://deeparc.sourceforge.net/
A graphical editor for exporting database content into an XML document.
Allows users to map their database model to an XML Schema. GPL license.
It is intended to let the publishers of a deep web database export their
database structure and content into XML for submission to a web archive.
This work is sponsored by the IIPC (International Internet Preservation
Consortium).
Xinq (XML Archive INQuiry Tool),
National Library of Australia
http://sourceforge.net/projects/xinq/
http://www.nla.gov.au/xinq/
A search and browse tool for accessing an XML database. Apache Software
License. Written in Java and XML. The purpose of this software is to provide
access to a deep web database that was archived as XML. This work is sponsored
by the IIPC (International Internet Preservation Consortium).
Harvesting Services
Archive-It
A subscription harvesting service
provided by the Internet Archive. Through a web-based interface, users
can capture, catalogue and archive their institution's own web site or
build additional collections, and then search and browse the collection
when complete.
http://www.archive-it.org/
Harvesting Software
Open source harvesting software
Combine Harvesting Robot
http://combine.it.lth.se/
Harvesting and indexing software written in Perl and C++ and released under
the GPL license. Formerly used (and possibly still used) by Swedish, Danish
and Austrian archives. It is unclear whether it is still actively developed.
GNU Wget
http://www.gnu.org/software/wget/wget.html
A non-interactive command-line tool for retrieving files over HTTP, HTTPS
and FTP. Released under the GPL license; can be used from scripts and other
programs.
Heritrix, Internet Archive and
Nordic National Libraries
http://crawler.archive.org/
A robust web archiving harvester under the LGPL license. Has very flexible
means to configure and control the harvest. Designed to be extensible
by writing new Java modules. Configurable through a web interface. This
work is sponsored by the IIPC (International Internet Preservation Consortium).
HTTrack
http://www.httrack.com/
An offline browser under the GPL license that can be used from a graphical
interface or the command line.
Nalanda iVia Focused Crawler (NIFC)
http://ivia.ucr.edu/projects/Nalanda/
Designed to find Web resources with the same topic as a seed set of known resources. NIFC was created by Dr. Soumen Chakrabarti at the Indian Institute of Technology (Bombay), and further developed in collaboration with the iVia team.
Nedlib Harvester, Center for
Scientific Computing - the Finnish IT Center for Science
http://www.csc.fi/english/csc/news/news/nedlib-2000-5-10/?searchterm=Nedlib%20Harvester
Developed as a part of the Nedlib project funded by the European Union.
Written in C and dependent on the MySQL database. No longer supported
or developed.
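As the GNU Wget entry above notes, a command-line harvester can be driven from scripts and other programs. The following is a minimal Python sketch of that idea, not a recommended harvesting configuration. The function name build_wget_mirror_command, the seed URL and the output directory are hypothetical; the options used (--mirror, --page-requisites, --convert-links, --no-parent, --wait, --directory-prefix) are standard Wget options, and Wget is assumed to be installed and on the PATH.

```python
import shlex
from urllib.parse import urlparse


def build_wget_mirror_command(seed_url, output_dir, wait_seconds=1):
    """Build an argument list for mirroring one site with GNU Wget.

    The function name and defaults are illustrative assumptions;
    only the Wget options themselves come from Wget.
    """
    parsed = urlparse(seed_url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError("seed_url must be an absolute http(s) URL")
    return [
        "wget",
        "--mirror",            # recursive retrieval with timestamping
        "--page-requisites",   # also fetch images, CSS, etc. needed to render pages
        "--convert-links",     # rewrite links so the local copy is browsable
        "--no-parent",         # do not ascend above the seed directory
        f"--wait={wait_seconds}",             # pause between requests
        f"--directory-prefix={output_dir}",   # where the copy is stored
        seed_url,
    ]


if __name__ == "__main__":
    # Hypothetical seed URL and output directory.
    cmd = build_wget_mirror_command("http://www.example.org/", "archive")
    # Print the command for review; to run it, pass cmd to subprocess.run().
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when the script passes it to subprocess.run().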
Commercial harvesting software
Internet Researcher, Zylox Software
http://www.zylox.com/
A Windows-only offline browsing tool with a graphical interface.
Offline Explorer and Mass Downloader,
MetaProducts Software Corporation
http://www.metaproducts.com/mp/mpProducts_List.asp
Various offline browsing tools for Windows. The MetaProducts Offline Explorer
Pro 2.1 is used by DACHS (Digital Archive for Chinese Studies) -
http://www.sino.uni-heidelberg.de/dachs/
RafaBot, Spadix Software
http://www.spadixbd.com/rafabot/
A Windows-only offline browsing tool with a graphical interface. In addition
to accepting a list of URLs, RafaBot can take search terms and download
all matching web sites found through search engines.
SuperBot, Sparkleware
http://www.sparkleware.com/superbot/index.html
A simple Windows-only offline browsing tool with a graphical interface.
SurfSaver, askSam Systems
http://www.surfsaver.com/
An add-on to Microsoft Internet Explorer.
Teleport Webspiders
http://www.tenmax.com/teleport/home.htm
Various sophisticated Windows-only versions with different interfaces
(graphical, console, scriptable) and feature sets.
WebCopier, MaximumSoft Corp.
http://www.maximumsoft.com/index.html
An offline browsing tool with a graphical interface, available in multiple
versions for different operating systems and performance/feature levels.
Discovery, Display and Access Software
ARC Access Tools
http://archive-access.sourceforge.net/
Internet Archive's list of tools for processing and accessing content
in ARC files.
Kea
A GPLed tool for automatic keyword extraction from text documents. Originally written in a combination of Perl, C and Java; now available in an all-Java version. From the New Zealand Digital Library at the University of Waikato, New Zealand.
libiViaMetadata
http://ivia.ucr.edu/manuals/unstable/libiViaMetadata/5.4.0/
A GPLed C++ library for assigning descriptive metadata to web files. Developed under the iVia Project. Includes the PhraseRate program which is described at
http://ivia.ucr.edu/projects/PhraseRate/
NutchWAX (Nutch + Web Archive
eXtensions), Internet Archive and Nordic National Libraries
http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html#toc
A tool for indexing and searching web archives. Currently works only with
the ARC format
(http://www.archive.org/web/researcher/ArcFileFormat.php).
Implemented as a Java servlet. Parsers can be added to handle different
formats, e.g. xpdf for PDF files. This work is sponsored by the IIPC
(International Internet Preservation Consortium).
Wayback, Internet Archive
http://archive-access.sourceforge.net/projects/wayback/
The open source version of the Internet Archive's proprietary search and display interface, the "Wayback Machine" (listed next).
Wayback Machine, Internet Archive
http://www.archive.org/web/web.php
A proprietary interface to the Internet Archive's huge collection of web
pages archived from 1996 to the present.
WERA (Web Archive Access), Internet
Archive and National Library of Norway
http://nwa.nb.no/
An archive viewer application that provides Wayback Machine-like access
to web archive collections, along with full text search and easy navigation
between different versions of a web page.
WERA is based on, and replaces, the NwaToolset. It uses the NutchWAX search
engine and is written in PHP and Java. This work is sponsored
by the IIPC (International Internet Preservation Consortium).
General Web Archiving Suites
Software that is more a system of web archiving tools than an individual application
DataFountains
http://ivia.ucr.edu/manuals/unstable/DataFountains/2.2.0/
A tool for discovering, harvesting and describing web resources. Developed under the iVia Project.
PANDAS (PANDORA Digital Archiving
System), National Library of Australia
http://pandora.nla.gov.au/pandas.html
Tools for controlling the harvest, conducting quality assurance checking,
initiating archiving processes, managing the metadata including access
restrictions, and producing management reports. Uses the HTTrack harvester.
PANDAS was created to enable very selective harvesting and is not intended
for large-scale automated harvests. The developers of this software are
re-engineering PANDAS to use IIPC tools like Heritrix and WERA, and to
be better integrated with their digital repository.
WebArchivist Software Suite, SUNY Institute of
Technology and University of Washington
http://www.webarchivist.org/resources.htm
Tools for entering metadata, searching, analyzing and displaying archived
sites. The software is not yet licensed, but according to the product's
website the plan is to make it available to other organizations.
Used for the Library of Congress' Election 2002 (http://lcweb4.loc.gov/elect2002/)
and September 11 (http://september11.archive.org/)
web archives as well as the Asian Tsunami Web Archive (http://tsunami.archive.org/).
