Office for Information Systems - Harvard University Library

Diacritics and Special Characters in the EAD

The EAD DTD supports a subset of the ISO 8879 character entity sets. The supported sets are documented in the EAD Application Guidelines (version 1.0), section 6.5.2 on General Entities. They cover: Latin-1 and Latin-2, Greek symbols (also alternate Greek symbols, Greek letters, and Monotoniko Greek), Diacritics, Numeric and special graphics, Publishing symbols, and General technical characters. Physically, these character sets live in files named according to the form ISOxxxx.ent, where xxxx is the particular set (e.g. ISOlat1.ent or ISOpub.ent).

However, just because the EAD supports a character doesn't mean that the character can be indexed and displayed properly in the specific system that the finding aids live in. In particular, when the Web is involved and the display mechanism is HTML, there are severe limits on which diacritics and special characters display correctly. In fact, the result can vary from browser to browser, and from user to user depending on which fonts are installed on their PC.

Indexing of Special Characters
Display of Special Characters
Handling of Musical symbols
What characters are supported where?


Indexing of Special Characters

OASIS strips out diacritic marks and normalizes special characters before indexing the finding aids so that searching and retrieval will always work no matter what character-generating capabilities the user has on their local PC. The normalization rules follow those of HOLLIS where there is overlap with the ALA character set.
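
As a rough illustration of what stripping diacritics before indexing can look like, here is a minimal Python sketch. It is not the OASIS or HOLLIS normalization code; the function name `strip_diacritics` is hypothetical, and the approach (Unicode decomposition followed by removal of combining marks) is simply one common way to get the same effect.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining diacritic marks removed (illustrative only)."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_diacritics("Dvořák"))
```

Letters such as "ø" or "ł" have no Unicode decomposition, so a sketch like this leaves them untouched; handling them would need an explicit mapping table, which is the sort of rule the "normalizes special characters" step would have to supply.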

Display of Special Characters

OASIS doesn't do anything with text entities it finds in finding aids (these are the entities with the form "&delta;", which is the way these characters are handled in SGML/XML). If the entity is part of ISO Latin-1 then most Web browsers will interpret it correctly and display it as intended. If the entity is NOT in ISO Latin-1 then sometimes it will display correctly and sometimes the user will see the string "&delta;" in the document. Most PCs can handle some of the common non-Latin-1 characters, but again, this varies from case to case.
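
To get a feel for which entity references in a finding aid fall inside ISO Latin-1, a small checker along the following lines may help. This is only a sketch: `classify_entities` is a hypothetical helper, and it uses the HTML 4 entity table from Python's standard library (`html.entities.name2codepoint`) as a stand-in for what a browser recognizes. The ISO 8879 entity names overlap with that table but are not identical to it.

```python
import re
from html.entities import name2codepoint

# Matches entity references such as &eacute; and captures the bare name.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def classify_entities(text: str) -> dict[str, str]:
    """Label each entity reference found in text (illustrative only)."""
    result = {}
    for name in ENTITY_REF.findall(text):
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Not an HTML 4 entity: a browser would likely show it as typed.
            result[name] = "unknown"
        elif codepoint <= 0xFF:
            # Code points 0-255 are the ISO Latin-1 range.
            result[name] = "latin-1"
        else:
            result[name] = "non-latin-1"
    return result


print(classify_entities("caf&eacute; &delta; C&sharp;"))
```

Entities reported as "non-latin-1" or "unknown" are the ones worth checking by eye in the HTML display.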

Handling of Musical symbols

Musical symbols (and similar things such as math symbols, publishing marks, etc.) are supported in EAD and OASIS to a limited extent right now. There are a bunch of "entity reference sets" that WordPerfect and Author/Editor know about which are valid in EAD and look like "&sharp;". To put one in a finding aid you use the same mechanism you are using now to put diacritics and special characters into your finding aids. If you're curious about what entities are defined, they're in the files ISOpub.ent, ISOlat1.ent, etc. on your PC.

For music symbols, the actual entity references you would use are &sharp; &flat; and &natural; from the ISOpub.ent set. If you put these into a finding aid the HTML display will not know what to do with them, so they will display as is (i.e. you will see "Sonata in C&sharp; by J.S. Bach" in the HTML display in OASIS). In Panorama they will be replaced with whatever string was defined in the ISOpub.ent file, and I make no promises about how reasonable those might look.

In OASIS you can retrieve these finding aids using the symbols if you search on the whole string "&sharp;" or "C&sharp;", but not on the word "sharp" by itself without the surrounding "&" and ";" characters. At least it's consistent.
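
The replacement step described for Panorama, where an entity reference is swapped for a defined string, can be sketched in a few lines of Python. This is not what OASIS or Panorama actually runs: the `MUSIC_ENTITIES` table and the `expand_music_entities` function are hypothetical, and the Unicode music characters used as replacements are an assumption that only pays off on a display that can render them.

```python
import re

# Hypothetical replacement table: entity name -> Unicode music symbol.
MUSIC_ENTITIES = {
    "sharp": "\u266f",    # MUSIC SHARP SIGN
    "flat": "\u266d",     # MUSIC FLAT SIGN
    "natural": "\u266e",  # MUSIC NATURAL SIGN
}


def expand_music_entities(text: str) -> str:
    """Replace known music entity references; leave all others as typed."""
    def replace(match: re.Match) -> str:
        # Fall back to the original "&name;" text when the name is not in the table.
        return MUSIC_ENTITIES.get(match.group(1), match.group(0))

    return re.sub(r"&([A-Za-z]+);", replace, text)


print(expand_music_entities("Sonata in C&sharp; by J.S. Bach"))
```

Any entity not listed in the table passes through unchanged, which mirrors the "display as is" behavior described above.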

What characters are supported where?

  1. What character sets does EAD support?

    See the detailed documentation in the EAD Application Guidelines, section 6.5.2.1 (pp. 176-178).
  2. What character sets does WordPerfect8 support?

    The EAD DTD is compiled for WordPerfect using the ISO entity sets that ship with the software. A mapping file called ISO8879.map is used, which includes these entity sets: Latin-1, Latin-2, Greek letters, Greek symbols, Monotoniko Greek, Alternate Greek symbols, Russian Cyrillic, Non-Russian Cyrillic, Numeric and Special graphic characters, Diacritics, Publishing symbols, General technical characters, Box and line drawing, and Added math symbols (including ordinary, binary operators, relations, negated relations, arrow relations, and delimiters). NOTE that this is more sets than are supported in the EAD.
  3. What character sets does OASIS support?

    ISO Latin-1 today, ALA eventually.

    Note on ALA character set support: since there's no way to display many of the ALA characters in Web browsers today I'm not sure what we could do except strip them.

    When Web browsers support Unicode then we should be able to resolve many of these issues. Until then, the best we can do is throw away characters that can't be displayed to the users. In the meantime, I don't want to invalidate your finding aids just because you've used a legal, but currently undisplayable, character. There's an item on the OASIS enhancements list to look into this, but it's not a high priority compared to some of the other things on that list...
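
The "throw away characters that can't be displayed" fallback mentioned in the last answer can be pictured with a short Python sketch. It assumes the text has already been turned into ordinary characters and that "displayable" simply means "inside ISO Latin-1"; the function name `drop_undisplayable` is hypothetical and this is not the OASIS implementation.

```python
def drop_undisplayable(text: str) -> str:
    """Keep only characters that exist in ISO Latin-1 (illustrative only)."""
    # errors="ignore" silently drops every character Latin-1 cannot encode.
    return text.encode("latin-1", errors="ignore").decode("latin-1")


print(drop_undisplayable("café δ"))
```

Because the dropped characters leave no trace, a stripping step like this belongs on the display side only; the finding aid itself keeps the legal character.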