Harvard University Library - Library Digital Initiative

Harvard University Library / Office for Information Systems


METS Java Toolkit
DRAFT Version 1.0 (alpha) 2002/03/15

Abstract

We present a Java toolkit for the procedural construction, validation, marshalling, and unmarshalling of METS files. METS, the Metadata Encoding & Transmission Standard, is an XML encoding for descriptive, administrative, and structural metadata regarding objects within a digital library. The toolkit API is based on Sun's JAXB binding framework, under which elements are represented by classes with accessor and mutator methods for attributes and the element content model, and additional methods for validation and for marshalling to and from instance documents. JAXB was chosen as the basis for the METS API due to its anticipated market acceptance as a key component of Sun's Java XML Pack bundle. The toolkit's parser is James Clark's XP.

The toolkit was developed to allow procedural processing of METS files in the context of an archiving project in which multiple content providers submit materials packaged in METS files to a centralized archive. To achieve necessary operating efficiencies, a maximum level of automation is required for the creation of syntactically valid METS files on the provider side and for the ingest of those METS files on the archive side. The toolkit will be used as the basis for development of these automated systems.

The METS Framework

METS is intended to provide a standardized XML encoding for transmission of complex digital library objects between systems. While it provides standard containers and encoding mechanisms for descriptive and administrative metadata, it does not define the content or format of that metadata. The content and format of structural metadata, however, are explicitly mandated within the METS specification.

METS incorporates by reference a subset of the XML XLink schema for defining simple relationships between METS files and external entities.

The METS schema is expressed using the W3C XML Schema definition language. The standard is maintained by the Network Development and MARC Standards Office of the Library of Congress, and is being developed as an initiative of the Digital Library Federation.

The JAXB Framework

In creating an API for the METS toolkit we had two options:

  1. Define a new API
  2. Use an existing API specification

In an increasingly balkanized software environment, the decision to define yet another new API should be made sparingly, and only if no existing specification is found to be appropriate. We selected the JAXB specification as the basis for the METS API due to its functional completeness as well as its anticipated market acceptance as a key component of Sun's Java XML Pack bundle.

The JAXB product consists of a set of base classes (the javax.xml.bind package); a schema-driven compiler for generating schema-specific binding classes, which are sub-classed from the JAXB base classes; and a set of run-time marshalling classes (the javax.xml.marshal package) invokable by the derived binding classes. JAXB provides mechanisms for local and global type and structural validation, and for marshalling between binding class instantiations and XML instance documents.

The organization of JAXB is presented graphically in the following figure:

The JAXB specification is still in early pre-release development. Although a preliminary implementation of the class generator has been released, it accepts only a subset of the XML DTD language as input; XML Schema is not supported at all. Thus, all of the base classes, as well as the METS schema-specific classes used in the toolkit, were constructed manually, following the JAXB API specification.

Note: Sun has announced that the future production release of the JAXB specification may be incompatible with the current pre-release version. Whether to keep the METS toolkit in its current form or to migrate it into compliance with a subsequent JAXB specification at some future time remains under discussion.

The Toolkit

The METS toolkit consists of an implementation of the javax.xml.bind and javax.xml.marshal packages (org.mets.xml.bind and org.mets.xml.marshal, respectively) and the METS API package org.mets.xml.mets following the JAXB binding framework API for schema-derived classes.

In general, each element defined in the METS schema maps to a public class:

public class Element
{
  public Element ();
  ...
}

with public type-specific accessor and mutator methods for all element attributes:

public type getAttr ();                 // Scalar-valued attribute
public void setAttr (type value);

public List getAttr  ();                // List-valued attribute
public void List.add (type value);
...

and content model:

public List getContent ();
public void List.add   (Element content);
...

List-valued attributes are represented as ordered lists of objects stored in an ArrayList object, which implements the List interface. Thus, to add a value to a list-valued attribute, first retrieve the list and then use one of the standard List mutator methods to add the value to the list:

Element element = new Element ();
...
List list = element.getAttr ();
list.add (value);                       // value is of the attribute's type

Similarly, child elements are represented in the content model as an ordered list of objects. To add a repeatable child element to its parent's content model, first retrieve the list and then use one of the standard mutator methods to add the child to the list:

Parent parent = new Parent ();
...
Child child = new Child ();             // Instantiate child element
List list = parent.getContent ();       // Get parent content model list
list.add (child);                       // Add child to content model
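
The accessor, mutator, and list conventions described above can be sketched with a small stand-in class. This is a minimal illustration rather than the toolkit's source: SketchElement, its refs attribute, and ElementPatternDemo are hypothetical names, and generics are used for brevity even though the toolkit itself targets J2SE 1.3.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a schema-derived element class (not a toolkit class).
class SketchElement {
    private final String name;
    private String id;                                              // scalar-valued attribute
    private final List<String> refs = new ArrayList<>();            // list-valued attribute
    private final List<SketchElement> content = new ArrayList<>();  // content model

    SketchElement(String name) { this.name = name; }

    String getName() { return name; }

    // Scalar attribute: ordinary accessor and mutator pair.
    String getID() { return id; }
    void setID(String id) { this.id = id; }

    // List-valued attribute and content model: only an accessor is exposed.
    // Callers mutate the returned live list with the standard List methods.
    List<String> getRefs() { return refs; }
    List<SketchElement> getContent() { return content; }
}

class ElementPatternDemo {
    public static void main(String[] args) {
        SketchElement parent = new SketchElement("parent");
        parent.setID("abc");
        parent.getRefs().add("x1");                 // add to list-valued attribute

        SketchElement child = new SketchElement("child");
        parent.getContent().add(child);             // add child to content model

        System.out.println(parent.getContent().size());
    }
}
```

Because the accessors return the live list, no separate "add" method is needed on the element class itself.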

The Mets class encapsulating the root <mets> element provides public methods for validation:

public void validate ();

and marshalling to and from instance documents:

public void marshal   (OutputStream os);
public void unmarshal (InputStream  in);

To encapsulate arbitrary metadata elements, possibly namespace-qualified, defined external to the METS specification, the toolkit provides a generic element class:

public class Any
{
  public Any (String name);
  public Any (String namespace, String name);

  public String getNamespace ();
  public String getName  ();
  public String getQName ();
  ...
}

with public accessor and mutator methods for the ID attribute:

public String getID ();
public void   setID (String id);

a public list-valued accessor method for all other attributes:

public List getAttributes ();
public void List.add (Object value);
...

and a public list-valued accessor method for its content model:

public List getContent ();
public void List.add (Element content);
...

A generic attribute class is provided to support type-independent manipulation of attributes, possibly namespace-qualified:

public class Attribute
{
  public Attribute (String name, Object value);
  public Attribute (String namespace, String name, Object value);

  public String getNamespace ();
  public String getName  ();
  public String getQName ();
  public Object getValue ();
}

Note that the generic element class does not perform any local validation on its attributes or content; global validation is performed for ID uniqueness.
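
Since the generic element holds its attributes as a plain list, finding one by name means scanning that list. The following sketch shows one way this might look. SketchAttribute only mirrors the constructor and accessor signatures listed above, and AttributeLookup.find is a hypothetical helper, not part of the toolkit; treating a null namespace as "unqualified" is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Stand-in mirroring the Attribute signatures shown above (not the toolkit class).
class SketchAttribute {
    private final String namespace;   // assumed null when the attribute is unqualified
    private final String name;
    private final Object value;

    SketchAttribute(String name, Object value) { this(null, name, value); }

    SketchAttribute(String namespace, String name, Object value) {
        this.namespace = namespace;
        this.name = name;
        this.value = value;
    }

    String getNamespace() { return namespace; }
    String getName() { return name; }
    Object getValue() { return value; }
}

class AttributeLookup {
    // Hypothetical helper: linear scan for the first attribute matching
    // both namespace and name; returns null when nothing matches.
    static Object find(List<SketchAttribute> attrs, String namespace, String name) {
        for (SketchAttribute a : attrs) {
            if (Objects.equals(a.getNamespace(), namespace) && a.getName().equals(name)) {
                return a.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<SketchAttribute> attrs = new ArrayList<>();
        attrs.add(new SketchAttribute("my", "attr", "value"));
        attrs.add(new SketchAttribute("plain", Integer.valueOf(7)));
        System.out.println(find(attrs, "my", "attr"));
        System.out.println(find(attrs, null, "plain"));
    }
}
```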

The toolkit explicitly defines the default namespace to be the METS namespace, http://www.loc.gov/METS/.

The XMLScanner class used for unmarshalling is built using the token-level interface of XP.

Procedural Construction

The general recursive rule for procedural construction of a METS file is:

  1. The root document content model is the <mets> element
  2. For each element in the content model, instantiate the appropriate element object, set its attributes, define its content model, and then add it to the content model of its parent:

Mets mets = new Mets ();
mets.setID ("1234");
...

  MetsHdr metsHdr = new MetsHdr ();
  metsHdr.setRECORDSTATUS ("Prod");
  ...

    Agent agent = new Agent ();
    agent.setROLE (Role.CREATOR);
    ...

      Name name = new Name ();
      name.setContent (new PCData ("S. L. Abrams"));
      ...

    agent.getContent ().add (name);
    ...

  metsHdr.getContent ().add (agent);
  ...

mets.getContent ().add (metsHdr);
...

  DmdSec dmdSec = new DmdSec ();
  dmdSec.setID ("abc");
  ...

    MdWrap mdWrap = new MdWrap ();
    mdWrap.setMDTYPE (Mdtype.DC);
    ...

      XmlData xmlData = new XmlData ();

        Any any = new Any ("my", "descMD");
        any.getAttributes ().add (new Attribute ("my", "attr", "value"));
        ...

          ...

        any.getContent ().add (...);
        ...

      xmlData.getContent ().add (any);
      ...

    mdWrap.getContent ().add (xmlData);
    ...

  dmdSec.getContent ().add (mdWrap);
  ...

mets.getContent ().add (dmdSec);
...

Once the entire content tree has been constructed, it can be validated and marshalled.

mets.validate ();
mets.marshal (System.out);
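
To make the build-then-marshal flow concrete without the toolkit on the classpath, the sketch below uses a generic stand-in node and a simple recursive writer. SketchNode and ConstructionDemo are hypothetical; the writer is not the toolkit's XMLWriter, it emits no XML declaration or namespace declaration, and it performs no validation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical generic node standing in for the schema-specific classes.
class SketchNode {
    final String name;
    final Map<String, String> attrs = new LinkedHashMap<>();  // keeps insertion order
    final List<SketchNode> content = new ArrayList<>();
    String text;                                              // character data, if any

    SketchNode(String name) { this.name = name; }

    // Escape the characters that are significant in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Depth-first marshalling: start tag, text, children, end tag.
    void marshal(StringBuilder out) {
        out.append('<').append(name);
        for (Map.Entry<String, String> a : attrs.entrySet()) {
            out.append(' ').append(a.getKey()).append("=\"")
               .append(escape(a.getValue())).append('"');
        }
        if (text == null && content.isEmpty()) {
            out.append("/>");
            return;
        }
        out.append('>');
        if (text != null) out.append(escape(text));
        for (SketchNode child : content) child.marshal(out);
        out.append("</").append(name).append('>');
    }
}

class ConstructionDemo {
    // Builds the header portion of the example tree bottom-up, then marshals it.
    static String build() {
        SketchNode mets = new SketchNode("mets");
        mets.attrs.put("ID", "1234");

        SketchNode metsHdr = new SketchNode("metsHdr");
        metsHdr.attrs.put("RECORDSTATUS", "Prod");

        SketchNode agent = new SketchNode("agent");
        agent.attrs.put("ROLE", "CREATOR");

        SketchNode name = new SketchNode("name");
        name.text = "S. L. Abrams";

        agent.content.add(name);
        metsHdr.content.add(agent);
        mets.content.add(metsHdr);

        StringBuilder out = new StringBuilder();
        mets.marshal(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```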

Validation

Validation occurs at both the global (document scope) and local level (element scope).

  1. Global Validation
    1. ID uniqueness
    2. IDREF-to-ID consistency, i.e., all IDREFs refer to existing IDs
  2. Local Validation
    1. Attribute type correctness
    2. Existence of required attributes
    3. Content model (sequence, choice, etc.)

Note: In the current implementation, the IDREF-to-ID consistency validation is performed using a mechanism that is an extension to JAXB.

Note: In the current implementation, the ordering of sequential content model elements is not validated.
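
The two global checks listed above can be illustrated independently of the toolkit. In this sketch the IDs and IDREFs are assumed to have already been collected from the document tree into two lists; GlobalValidationSketch and its check method are hypothetical names and do not reflect how the toolkit gathers or reports them.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class GlobalValidationSketch {
    // Returns a list of problem descriptions; an empty list means both checks passed.
    static List<String> check(List<String> ids, List<String> idrefs) {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new HashSet<>();

        // ID uniqueness: Set.add returns false if the value is already present.
        for (String id : ids) {
            if (!seen.add(id)) {
                problems.add("duplicate ID: " + id);
            }
        }

        // IDREF-to-ID consistency: every reference must name a declared ID.
        for (String ref : idrefs) {
            if (!seen.contains(ref)) {
                problems.add("dangling IDREF: " + ref);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        ids.add("abc");
        ids.add("abc");
        List<String> refs = new ArrayList<>();
        refs.add("xyz");
        for (String problem : check(ids, refs)) {
            System.out.println(problem);
        }
    }
}
```

The IDREF pass runs only after all IDs have been collected, so forward references within the document are handled correctly.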

Marshalling/Unmarshalling

The procedure to marshal a METS file is to create an in-memory representation composed of instantiated objects, and then invoke the validate() and marshal() methods of the Mets object that represents the root <mets> element:

Mets mets = new Mets ();
...

mets.validate ();
mets.marshal (outputStream);

This generates a well-formed and valid METS file on the output stream.

The procedure to unmarshal a METS file to an in-memory representation is to instantiate the root <mets> class, and invoke its unmarshal() and validate() methods:

Mets mets = new Mets ();
mets.unmarshal (inputStream);
mets.validate ();

At this point, the entire document is represented by a tree structure of instantiated objects, against which the standard accessor and mutator functions can be invoked.
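
For readers who want to try the parse-then-traverse pattern without the toolkit, the following sketch uses the DOM parser bundled with the JDK in place of the toolkit's XP-based XMLScanner. It is a substitute technique shown only for comparison: DomWalkSketch, its helper methods, and the sample document are hypothetical, and the resulting tree consists of generic DOM nodes rather than METS-specific binding objects.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomWalkSketch {
    // Parse an input stream into an in-memory tree, with namespace support enabled.
    static Document parse(InputStream in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    // Recursively count element nodes at or below the given node.
    static int countElements(Node node) {
        int count = (node.getNodeType() == Node.ELEMENT_NODE) ? 1 : 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            count += countElements(children.item(i));
        }
        return count;
    }

    // Small illustrative document declaring the METS namespace as the default.
    static InputStream sample() {
        String xml = "<mets xmlns=\"http://www.loc.gov/METS/\" ID=\"m1\">"
                   + "<metsHdr RECORDSTATUS=\"Prod\"><agent ROLE=\"CREATOR\">"
                   + "<name>S. L. Abrams</name></agent></metsHdr></mets>";
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Document doc = parse(sample());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
        System.out.println(countElements(doc));
    }
}
```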

Implementation

The current version of the METS toolkit is based on the METS schema Version 1.0 (zeta) of February 8, 2002; the JAXB working draft specification Version 0.21 of May 30, 2001; and version 0.5 of XP. The toolkit has been built and tested with the Sun J2SE 1.3.1_02-b02 JDK under Solaris 2.7.

Javadoc for the METS toolkit packages is available.

A distribution version of the METS toolkit can be downloaded as a gzipped tar file, mtk-20020315.tar.gz (378 KB).

gunzip mtk-20020315.tar.gz
tar xvf mtk-20020315.tar

The distribution directory structure is as follows:

mtk-1.0/
        README
        Makefile
        Marshal.java           # Test marshalling application
        Unmarshal.java         # Test unmarshalling application
        marshal.xml            # Marshalling output
        unmarshal.xml          # Unmarshalling output
        bin/
             bind.jar
             marshal.jar
             mets.jar
             xp.jar
        com/
             jclark/
                    util/
                         ...
                    xml/
                         tok/
                                 ContentToken.java
                                 Encoding.java
                                 ...
        doc/
             index.html
             ...
        org/
             mets/
                    xml/
                         bind/
                                 Makefile
                                 MarshallableObject.java
                                 MarshallableRootElement.java
                                 PCData.java
                                 ValidatableObject.java
                                 ...
                         marshal/
                                 Makefile
                                 XMLScanner.java
                                 XMLWriter.java
                         mets/
                                 Makefile
                                 Mets.java
                                 MetsHdr.java
                                 ...

The marshalling application, Marshal, procedurally constructs a METS file and marshals it to standard output. The unmarshalling application, Unmarshal, unmarshals the file named by its argument, validates it, and then re-marshals it to standard output.

To test the marshalling and unmarshalling applications:

cd mtk-1.0
java Marshal > marshal.xml
java Unmarshal marshal.xml > unmarshal.xml

The two files, marshal.xml and unmarshal.xml, should be identical.
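
One way to confirm the round trip is a byte-for-byte comparison of the two output files. The class below is a hypothetical helper, not part of the distribution; it assumes both files are small enough to read into memory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class RoundTripCheck {
    // True when both files have exactly the same bytes.
    static boolean identical(Path a, Path b) {
        try {
            return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Default to the file names used in the test commands above.
        Path a = Paths.get(args.length > 0 ? args[0] : "marshal.xml");
        Path b = Paths.get(args.length > 1 ? args[1] : "unmarshal.xml");
        if (!Files.exists(a) || !Files.exists(b)) {
            System.out.println("missing input file");
            return;
        }
        System.out.println(identical(a, b) ? "identical" : "different");
    }
}
```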

The following list documents currently unsupported features of the METS and JAXB specifications:

  1. Collection properties are initialized to non-null but empty lists, as opposed to the JAXB norm of initialization to null, which would require an invocation of the Element.emptyAttr() method (before setting attribute values) or the Parent.emptyElement() method (before defining an element content model).
  2. No strict validation of sequence ordering.
  3. The <area>, <par>, and <seq> elements are not supported.
  4. Cannot specify non-UTF-8 encoding in XMLWriter.
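
The practical effect of the first item in this list can be seen by contrasting the two initialization styles. Both classes below are hypothetical stand-ins: EagerElement follows the toolkit's described behaviour of starting with an empty list, while LazyElement models the null-until-emptied convention attributed to JAXB above, with emptyValues() playing the role of the emptyAttr() or emptyElement() call.

```java
import java.util.ArrayList;
import java.util.List;

// Toolkit-style (as described above): the list exists from construction.
class EagerElement {
    private final List<String> values = new ArrayList<>();

    List<String> getValues() { return values; }
}

// Null-until-emptied style: the list must be created explicitly before use.
class LazyElement {
    private List<String> values;              // null until emptyValues() is called

    List<String> getValues() { return values; }

    // Hypothetical analogue of emptyAttr(): install a fresh empty list.
    void emptyValues() { values = new ArrayList<>(); }
}

class InitializationDemo {
    public static void main(String[] args) {
        EagerElement eager = new EagerElement();
        eager.getValues().add("a");           // works immediately

        LazyElement lazy = new LazyElement();
        System.out.println(lazy.getValues() == null);
        lazy.emptyValues();                   // required before the first add
        lazy.getValues().add("a");
        System.out.println(lazy.getValues().size());
    }
}
```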

The following limitations are in force in the current implementation:

  1. A single token cannot span more than two buffers.

Things to do:

  1. Base 64 encoder/decoder methods in binData and FContent classes.
  2. Add time zone field to dates.
  3. Support for the named entities and character references (&#xHH;).
  4. Support comments and PIs.
  5. Better diagnostic error messages.
  6. Use specific, rather than general exceptions (when documented).
  7. Full coverage of all public methods in the bind and marshal package classes.
  8. Monitor JAXB specification updates.

References

Java API for XML Binding (JAXB), Sun Microsystems, <http://java.sun.com/xml/jaxb/>.

Metadata Encoding & Transmission Standard (METS), Library of Congress, Network Development and MARC Standards Office, <http://www.loc.gov/standards/mets/>.

XP - An XML Parser in Java, James Clark, <http://www.jclark.com/xml/xp/>.


Stephen Abrams
Harvard University Library, Office for Information Systems
1280 Massachusetts Avenue, Suite 404
Cambridge, MA 02138
stephen_abrams@harvard.edu
Copyright © 2002 by the President and Fellows of Harvard College.
This information about the METS Java Toolkit is provided for evaluation purposes only.

The METS specification was developed as an initiative of the Digital Library Federation and is maintained by the Network Development and MARC Standards Office of the Library of Congress.
The toolkit is compliant with the JAXB API specification, copyright © 2001 Sun Microsystems, Inc.
The toolkit uses the XP parser, copyright © 1997,1998 James Clark.