Our standard minimum accuracy rate for text is 99.995%, which amounts to one error in 20,000 characters or better.
The character-level accuracy rate must be substantiated by sampling and proofreading at least 5% of the data (as measured in both pages and bytes). The only sanction we employ to guarantee this rate is rejection of non-compliant data: any text that does not meet specification is rejected and must be corrected or redone and resubmitted.
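The arithmetic behind the rate can be sketched as follows. This is an illustration only, not our acceptance procedure: the function names are hypothetical, and the decision to compare a sample's raw error count against the threshold (with no statistical confidence interval) is an assumption made for the example.

```python
from fractions import Fraction

# Minimum accuracy from the specification: 99.995%,
# i.e. at most one error per 20,000 characters.
MIN_ACCURACY = Fraction(99995, 100000)


def max_errors(chars: int) -> int:
    """Largest whole number of errors that still meets the minimum rate."""
    return int(chars * (1 - MIN_ACCURACY))


def sample_passes(chars_proofread: int, errors_found: int) -> bool:
    """True if a proofread sample meets the rate (assumed, simplistic check)."""
    return errors_found <= max_errors(chars_proofread)


def min_sample_size(total: int) -> int:
    """At least 5% of the data, rounded up; apply to pages and to bytes alike."""
    return -(-total * 5 // 100)  # ceiling division


print(max_errors(1_000_000))   # errors tolerated in a million characters
print(min_sample_size(412))    # pages to proofread in a 412-page book
```

Exact fractions are used rather than floating point so that boundary cases (exactly 20,000 characters, exactly one error) are not disturbed by rounding.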
At the minimum, we capture all text that uses characters based on the Latin alphabet (whether or not they have diacritics or other attachments); all standard symbols; unusual symbols when they can be identified and are found in running text; and most standard non-alphabetic conventional signs, 'dingbats,' and similar symbols, at least when they are part of the text stream.
That is, every printed character (considered with respect to the universe of characters) must be captured distinctively, so long as it belongs to the Latin alphabet and its customary extensions, including all standard diacritics and supplementary characters (e.g. thorn, yogh, eth, j, v, w, and of course Arabic numerals); to the set of common conventional signs and symbols (para, sect, *, etc.); or to the modern set of punctuation marks (,.; etc.). Also captured should be any recognizable sign or symbol that occurs regularly within the books in question, intermingled with such Latin characters (etc.), even if a novel character encoding has to be invented for it. In practice, almost everything we capture belongs to Latin-1 (ISO 8859-1), with occasional use of the other most commonly used ISO character sets (ISOlat2, ISOtech, ISOpub, ISOdia, ISOnum), and occasionally supplemented by characters from other alphabets or symbolisms. In Unicode terms, this represents a subset of the Latin, supplemented Latin, extended Latin, punctuation, general punctuation, number, diacritic, and symbol blocks, with rare extension to (say) Greek, Hebrew, IPA, and Cyrillic. Use of certain 'overloaded' ASCII-based characters should be discussed and decided on a case-by-case basis. In general, less ambiguity is better, but we are accustomed to using the traditional ambiguous forms in many cases, e.g. HEX 27 (') as both apostrophe and single opening or closing quote (and occasionally also for minutes or prime); HEX 22 (") as both opening and closing quote mark; or HEX 2D (-) as hyphen, figure-dash, and minus-sign. In one or two cases, we customarily divert some of these overused forms to idiosyncratic purposes, e.g. HEX 7C (|) to indicate an EOL hyphen or HEX 5E (^) to flag a superscripted character; any such use must be declared in advance.
Capture of extended text in non-Latin alphabets is optional.
Locally, there are forms of text that we leave uncaptured, but outside suppliers are again free to exceed our standard and capture it all. We do not generally attempt to capture handwritten additions or corrections to printed works; running text in non-Latin (e.g. Greek, Arabic) alphabets (that is, a word or more, as opposed to the use of individual characters used as symbols or sigla), with a few exceptions; and text in idiosyncratic symbolic systems, e.g. a novel system of shorthand or a personal cipher.
Finally, any text received must resolve in some logical manner the problem of end-of-line hyphens. Locally, our solutions depend on the nature of the material. In modern materials (i.e., 19th century or later), we typically remove end-of-line soft hyphens and preserve the relatively few EOL hard hyphens. In older material, with uncertain hyphenation rules, we preserve all hyphens but capture EOL and other hyphens using separate characters or markup.
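For older material, the "preserve but distinguish" approach might be sketched like this. The sketch assumes the HEX 7C (|) convention mentioned above for flagging an EOL hyphen; the function name, and the choice to join the two halves of the word with no intervening space, are assumptions made for illustration.

```python
def mark_eol_hyphens(lines, eol_mark="|"):
    """Join printed lines, replacing each line-final hyphen with eol_mark.

    Hyphens inside a line are left untouched, so EOL hyphens and other
    hyphens remain distinguishable in the captured text.
    """
    out = []
    for line in lines:
        line = line.rstrip()
        if line.endswith("-"):
            # Flag the EOL hyphen; the next line is appended directly.
            out.append(line[:-1] + eol_mark)
        else:
            # An ordinary line end becomes a single space.
            out.append(line + " ")
    return "".join(out).rstrip()


printed = ["the com-", "mon-wealth of", "England"]
print(mark_eol_hyphens(printed))  # the com|mon-wealth of England
```

No attempt is made here to decide whether a given EOL hyphen is soft or hard; in modern material that decision would still need a dictionary or a human reader.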
We can accept most forms of character encoding that are unambiguous and readily convertible to other forms, so long as we are supplied with an inventory and explanation of characters encoded. We are most familiar with, and readiest to deal with, documents encoded in the traditional SGML manner using ISO 646 (US-ASCII) supplemented by SDATA character entities, preferably but not necessarily exclusively those in the standard ISO character entity sets. We find that this method gives us ready control over the characters in our texts: our SGML declarations forbid 8-bit characters; and our DTDs limit character entities to the specified sets.
Any other form of encoding would require some discussion beforehand, and should allow the same level of control. We should have no trouble accepting documents encoded using ISO 8859-1 (Latin-1) whether or not supplemented by character entities; XML-style numeric (Unicode) character entities; XML general entities equated with ISO entity names in braces (e.g. á => {aacute}); Universal Character Names (UCNs; aka Java Unicode escapes); and Unicode/UTF-8 encoding. The last should be used only if the actual codepoints used are constrained to a defined subset of Unicode and listed in a character inventory, for reasons already given.
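A check of delivered UTF-8 text against its declared character inventory could look roughly like the following. The inventory used here (printable ASCII plus thorn and yogh) and both function names are hypothetical; a real inventory would be the one supplied with the data.

```python
import unicodedata


def characters_outside_inventory(text, inventory):
    """Return {character: count} for every character not in the inventory."""
    unexpected = {}
    for ch in text:
        if ch not in inventory:
            unexpected[ch] = unexpected.get(ch, 0) + 1
    return unexpected


def describe(ch):
    """Human-readable codepoint and Unicode name, for the rejection report."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


# Hypothetical inventory: printable ASCII plus thorn and yogh.
inventory = set(map(chr, range(0x20, 0x7F))) | set("þȝ")

for ch, count in characters_outside_inventory("þe ȝere cœli", inventory).items():
    print(describe(ch), count)  # U+0153 LATIN SMALL LIGATURE OE 1
```

This gives the same kind of control that the SGML declaration and entity sets provide: anything outside the declared subset is surfaced before the text is accepted.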
Our minimum standard for markup is roughly that described in level 4 of the "TEI-in-libraries" guidelines: light markup, chiefly structural but falling back on presentational markup when necessary; calculated to allow intelligible display, efficient navigation, the discrimination of all salient features, automatic linking to corresponding page images, and the automatic extraction of a table of contents or outline directly from the marked-up structure.
We welcome densely encoded text, but recognize that accurate, intelligent, interpretive markup costs a lot. The deeper and denser it goes, the more it costs. As a result, given our institutional interest in bulk production as opposed to hand-crafted work, we have tended to reduce costs as much as possible by restricting markup to only:
The same would be our minimum standard for text accepted from outside.
In practice, assuming that the books are reasonably modern in convention and general in type, we expect at least the following features to be marked up:
Our standard DTD also makes provision for some more detailed markup that we require only occasionally, but only when the matter justifies it and makes it feasible, such as:
Our present DTD does not allow for the following, but easily could be made to.
At the production level, we are best able to work with SGML-encoded text, but are happy to accept XML as well. In either case, our minimum standard would specify only convertibility. That is, we are happy to accept either SGML or XML, so long as the dtd and document instances can be readily converted from one to the other without significant loss of information or technical impediments.
Any XML or SGML features that would hinder such conversion would require discussion beforehand.
We are a TEI shop and expect markup in most cases to employ at least TEI-based semantics, if not a fully TEI-compliant dtd. This is perhaps not a minimum requirement, but is certainly a strong preference. For most projects, especially those involving outsourced keying and coding, we have come to rely on a small subset of TEI that we refer to as our 'vendor' dtd (now in version 2), and derivatives thereof. We have found it adequate to an extremely wide range of material and readily adaptable even to some specialized texts. 'Vendor' and its derivatives handle the encoding for projects as diverse as EEBO/Evans, the Corpus of Middle English, and the Encyclopedic Survey of the University of Michigan. Nevertheless, we realize that projects may require more specialized markup that cannot be readily or efficiently captured using the vendor dtd, and will be glad to discuss more specialized markup schemes; we've used such schemes ourselves for a number of small-scale text conversion projects, such as Knight's American Mechanical Dictionary.
We recommend a text capture and encoding regime along the lines described below, but recognize that particular classes of material may require a different approach.
NOTE: the material that follows is a much abridged and slightly altered version of the documentation for the TCP (EEBO/Evans) projects. The full version (complete with TCP peculiarities), with much additional documentation including extensive sample files, can be found online at http://www.lib.umich.edu/eebo/docs. Documentation for other and earlier DLPS projects, including version 1 of the vendor dtd and keying and coding instructions, is linked from http://www-personal.umich.edu/~pfs/codetext.html. Much of this material is obsolete, but still contains some useful examples, especially in the case of Knight's dictionary.
All DLPS dtds potentially exist in at least two versions: (1) a limited version intended for use by conversion firms (V); and (2) an inhouse version intended for use by inhouse markup reviewers (R). All DLPS dtds potentially exist in at least two formats: (1) as an SGML dtd (S); and (2) as an XML dtd (X). The latest version of our aforementioned 'vendor' dtd exists presently only as a reviewer version in SGML format under the name "vndr2rs.dtd" (or may be invoked with the public ID DOCTYPE ETS PUBLIC "-//UMDLPS//DTD Proof 2.0//EN"). It may be found online at [URL]. This dtd is an extract from TEI P3/P4 (with some slight modifications) and uses TEI semantics; the TEI guidelines (TEI P3 or P4) may be safely used as a general guide to the meaning of particular tags, though local usage may dictate some specific practices. TEI P3 documentation is available online at Michigan: http://www.hti.umich.edu/t/tei/ .
Note: grayed-out elements (marked also with a *) should be used cautiously, rarely, or not at all.
Element | Used to tag a... | Brief description |
---|---|---|
<abbr> | word | Contains an abbreviation, especially a word that uses a diacritic or an abbreviation mark (e.g. overlining, etc.) for which no character-level provision has been made. Optional EXPAN attribute contains the expanded form. |
*<add> | span | Contains material added after printing, usually by hand. |
<argument> | head/foot | Contains a summary (in prose or verse) found at the head or foot of a division. Often labeled "Argument". Sometimes extended to similar material. |
*<author> | bibl part | Contains name of author, editor, etc. (role defined by attrib ROLE), espec. within bibliographic citation |
<b> | span | Contains bold-face text. Equals <HI REND="b">. May be used instead of HI or mixed with it. |
<back> | part of text | Contains the "back" matter belonging to a given <TEXT>. Compare <BODY>, <FRONT>. |
<bibl> | span | Contains a bibliographic citation. Usually obligatory only within <EPIGRAPH>, but may be used almost anywhere, e.g. in association with quotations or in bibliographic footnotes. May contain AUTHOR TITLE DATE IMPRINT |
<body> | part of text | Contains the main body of a given <TEXT>. Compare <FRONT>, <BACK>. |
<byline> | head/foot | Contains the 3rd-person statement of authorship of a given text division; not always easy to distinguish from SIGNED, a 1st-person attribution. |
<cell> | part of table | Table cell. Use like HTML <TD>. "ROWS" attribute = HTML "ROWSPAN"; "COLS" attribute=HTML "COLSPAN". Cells containing headings or labels (=HTML <TH>) should add attribute ROLE="label". |
<closer> | head/foot | Appears at foot of text division; corresponds to <OPENER> at head of division. Used especially when there is internal structure, e.g. <SIGNED>, <SALUTE>. Compare <TRAILER> |
<date> | span | Contains a date. Usually obligatory only within <DATELINE> and <HEAD>, but may appear almost anywhere. |
<dateline> | head/foot | Use within OPENER or CLOSER. Contains span of text at head or foot of text division indicating the circumstances of writing (especially the place and/or date). |
*<del> | span | Contains material deleted (e.g. scratched out) after printing, usually by hand. |
<div1> | part of front,back,body | A subdivision of <FRONT> <BACK> or <BODY>. |
<div2> | part of div1 | A subdivision of <DIV1> |
<div3> | part of div2 | A subdivision of <DIV2> |
<div4> | part of div3 | A subdivision of <DIV3> |
<div5> | part of div4 | A subdivision of <DIV4> |
<div6> | part of div5 | A subdivision of <DIV5> |
<div7> | part of div6 | A subdivision of <DIV6> |
<document> | entire item | Toplevel container for entire item, temporary header excluded. |
*<emph> | span | Contains text set apart, e.g. by typeform, as being emphatic. |
<epigraph> | head/foot | Contains quotation or motto (whether or not accompanied by <BIBL>) at head or foot of text division. Use also for scriptural quotations at head of sermons or commentary chapters. |
<figDesc> | description of figure | Used with controlled vocabulary to indicate form and content of illustrations, especially when other means of identification (e.g. a caption) are lacking, and especially for maps and portraits: <FIGDESC>Map of Africa</FIGDESC> |
<figure> | illustration 'event' | Marks location of illustrations within text. Captions (or similar text) attached to an illustration are captured within <HEAD> <P> or <L> tags within the <FIGURE> tag. <FIGURE> may nest. |
*<foreign> | span | Contains text set apart, e.g. by typeform, as being in a language other than the primary one. Has attrib LANG |
<front> | part of text | Contains the "front" matter of a given <TEXT>. Compare <BODY>, <BACK>. |
*<fw> | span | Occurs only in non-empty variant of PB and MILESTONE tags; contains literal text of material associated with milestone or page break, e.g. page number, running header, or milestone number. |
<gap> | gap 'event' | Empty tag used to mark the location and nature of material present but not captured (DESC="music", DESC="math", DESC="foreign", DESC="intruder", DESC="duplicate", etc.), or of material that is missing or illegible (DESC="missing", DESC="illegible", DESC="blank"). |
<group> | group of texts | Used to group <TEXT>s if item consists of more than one separate <TEXT> (usu. signalled by separate title page) |
<head> | head/foot | Contains heading for a text division (<DIV>), a stanza (<LG>), an argument (<ARGUMENT>), or a list or table (<LIST>, <TABLE>). Appears at the top (head) of the structural division. Compare <TRAILER>. Also used to capture the caption of an illustration. |
<hi> | span | Contains text that is designed to be set apart for some reason from the surrounding text, unless the reason is specified by use of a structural tag (e.g. <HEAD>). Attribute REND indicates presentation (e.g. REND="i" (italic), REND="b" (bold), REND="u" (underline), REND="marginal quotes" (marg. quotes).) Cp. <I> <B> <U> (may be used instead of I, B, U, or mixed with them) |
<i> | span | Contains italic text. Equals <HI REND="i">. May be used instead of HI or mixed with it. |
<idg> | (id elements) | Contains ID numbers CAT, VID, BIBNO used to identify item; ID attribute contains tracking number or primary unique item identifier. |
*<imprint> | bibl part | Contains imprint info (publisher, pubplace, date) within bibliographic citation. |
<insert> | (macro) | Shortcut for <Q><TEXT><BODY><DIV0>: for inserted textual objects such as quoted documents and letters. Defined identically to *DIV0. |
<item> | part of list | Contains an item in an ordered or unordered list (where it may contain a <LABEL>); or the second item in a dictionary list or list of pairings (in which the first item is tagged as <LABEL>). ATTRIB ROLE="label" when item serves as heading to list column. |
<l> | verse structure | Contains a partial or complete line of verse. Often part of <LG>. |
<label> | part of list | Contains a label attached to an item in a list. May either be paired with the <ITEM> or contained within it. When paired, may use ATTRIB ROLE="label" to tag label used as column header. |
*<lb> | line-break 'event' | Used rarely. Empty tag used to indicate a line break. Use only if there is no other way to indicate the relationship between the material before and after the break. |
<lg> | verse structure | Contains a group of verse lines that form a structural unit, e.g. a stanza, refrain, or verse paragraph. May nest. |
<list> | special format | Contains an ordered or unordered list of short items (long items can usually be treated as paragraphs) [cp. HTML <UL>, <OL>]; or a list of label-item pairs [cp. HTML <DL>]. |
<milestone> | milestone 'event' | Empty tag used to indicate a numbered stage in an (often non-structural) series (e.g. page number in a different edition; year in a running chronology). Optional non-empty version contains literal text of milestone in FW tag. |
*<name> | span | Contains a name. Use attrib TYPE to indicate personal, place, etc.; attrib NORM to contain controlled normalized version. |
<note> | note | Contains most material that appears with but stands outside the main text flow, whether or not anchored by footnote numbers, etc. Use attrib "N" to record marker; attrib "PLACE" to record location of note relative to text block. Generally NOT used for end notes, since that would require excessive text displacement. |
<opener> | head/foot | Appears at head of text division; corresponds to <CLOSER> at foot of division. Used especially when there is internal structure, e.g. <SIGNED>, <SALUTE>. Compare <HEAD> |
<p> | prose structure | Paragraph. Basic unit of prose structure. Use "N" attrib. if numbered. |
<pb> | page-break 'event' | Placed at beginning of each page to mark page-break "event". "N" attribute used to capture printed page numbers; "REF" attrib. contains number of image on which page appears. Optional non-empty version contains literal text of page number or running heads in FW tag. |
<pscript> | postscript/prescript (material appended to DIV) | Used to capture self-standing block of text (perhaps with heading and its own signature) that is added to the end of a division after the usual closing elements (e.g. the signature or dateline), especially postscripts to letters. Contained within CLOSER or OPENER and defined otherwise the same as a DIV7 |
*<ptr> | linking tag | Empty pointer to another location in the document, referenced with TARGET attrib. |
*<publisher> | imprint part | Contains publisher name within imprint |
*<pubplace> | imprint part | Contains publication place within imprint. |
<q> | span | Contains "block" quotations of all kinds (even if set off by typographic cues other than indentation). Use also as an alternative to INSERT to embed quoted documents within prose, using <Q><TEXT><BODY>... etc. |
*<ref> | linking tag | Contains material constituting a cross-reference to another location in the document, referenced with TARGET attrib. |
<row> | part of table | Table row = HTML <TR>. |
<salute> | head/foot | Greeting attached to letter or letter-like text division, placed within <OPENER> or <CLOSER>: 'my lord,' 'dear sister.' |
<signed> | head/foot | Signature statement (1st-person attribution) attached to letter or letter-like text division, or to other 'verbal actions' (e.g. a praise poem, will, or proclamation); placed within <OPENER> or <CLOSER>. 'Your friend,' 'Yours always'. Actual name optionally tagged as NAME. |
<sp> | drama structure | "Speech"--the basic unit of drama or drama-like texts; normally headed by <SPEAKER> element. |
<speaker> | drama structure | When found at the head of a speech, contains the name or designation of the speaker (or speakers) |
<stage> | drama structure | Contains stage directions of all kinds, whether within the text or in the margin. |
<table> | special format | Contains tabular material that cannot easily be made intelligible without retaining the two-dimensional layout of the original page. Tables containing nothing searchable (e.g. all symbols or numbers) may be omitted and captured as <FIGURE>. |
*<term> | keyword | Contains controlled keywords (index terms) in keywords tag. |
<text> | self-standing item | Usually=the whole book. May also be used to tag embedded documents that are substantially complete. (cp. <INSERT> <Q> and <GROUP>). |
*<title> | bibl part | Contains document title, espec. within bibliographic citation |
<trailer> | head/foot | Contains heading for a text division (<DIV>), a <TABLE>, or a <LIST>. Appears at the bottom (foot) of the structural division. Compare <HEAD>. |
<u> | span | Contains underlined text. Equals <HI REND="u">. May be used instead of HI or mixed with it. |
*<unclear> | span | Used rarely to contain text that is difficult to read but has nevertheless been partly or completely captured with some degree of doubt. |
Specialized vs. general markup. As a rule, if it is not clear that something qualifies for specialized treatment, it can safely be captured as straight text. If you're not sure whether an elaborate treatment is justified, use the simpler treatment instead. This is almost always the safe thing to do: we don't lose any text that way, and we don't perpetrate any incorrect markup: better LESS markup than WRONG markup.
Page-image ID numbers. The beginning of each page (including the first page and all blank pages) should be recorded with a <PB> tag. The REF attribute of the <PB> tag is required: its value should be the filename or number of the page-image file, such as will provide unambiguous reference to the appropriate page image. E.g., a page appearing on the third page image will begin with <PB REF="3">; a page appearing on the seventh page image might begin with <PB REF="7">. If it is necessary for some reason to capture the contents of the images in an altered sequence (e.g. because the scans were taken out of order), the REF values must still reflect the original sequence, e.g. as reflected in the filenames of the tiff files.
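Because REF is required on every <PB>, a delivered file can be screened for omissions with something like the sketch below. It uses a regular expression rather than a real SGML parser, which is an assumption made for brevity; the function name is hypothetical.

```python
import re

# Matches an opening <PB ...> tag and captures its attribute string.
PB_TAG = re.compile(r"<PB\b([^>]*)>", re.IGNORECASE)
REF_ATTR = re.compile(r'\bREF\s*=\s*"([^"]*)"', re.IGNORECASE)


def pb_refs(sgml):
    """Return the REF value of each <PB> tag in order, or None where missing."""
    refs = []
    for tag in PB_TAG.finditer(sgml):
        ref = REF_ATTR.search(tag.group(1))
        refs.append(ref.group(1) if ref else None)
    return refs


sample = '<PB REF="3" N="1"><P>...</P><PB N="2"><P>...</P><PB REF="7">'
print(pb_refs(sample))  # ['3', None, '7']
```

A None in the result marks a page break that cannot be linked to its page image. The REF values are reported as found and are not required to be ascending, since images captured out of sequence legitimately keep their original numbers.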
File naming. The text captured from each book should be returned as a single file, [idnumber].sgm (zipped up either singly or as a batch in a standard .zip file). A resubmission of a file previously submitted should insert a "rev" (for "revised") in the filename, e.g. WB1187.rev.sgm.
Full bibliographic metadata is normally stored separately from the text, e.g. in a TEI header or (better) in a MARC record from which a TEI header can be automatically generated, and is not regarded as part of text encoding per se.
The text itself as delivered must contain some bare minimum administrative metadata, with the option to provide more.
Required
Optional
With a few standard exceptions noted below, the text will be recorded in its entirety, first page to last, in the order it was intended to be read (top left to bottom right, left column before right column, etc.).
The chief (and rare) exception is parallel texts. Running parallel texts, printed in a multi-column, multi-row, or facing-page arrangement, or some combination thereof, need to be treated as separate texts (normally, separate <DIV>s, sometimes perhaps separate <TEXT>s), each one recorded until its end and not restarted on each page. Notes and other material relating to only one of the texts on a page need to be embedded in that text, not in any of the others. If a single heading or figure applies to more than one of the parallel texts, it should be recorded at the appropriate place in each text to which it applies.
Partial or fragmentary parallel texts will normally be broken primarily at the chapter or section level (e.g. <DIV1 TYPE="chapter">), then into parallel versions of that chapter (e.g. <DIV2 TYPE="version">) when necessary. But full parallel texts (e.g. an entire Latin-English parallel New Testament, or a Latin-English parallel Boethius) will normally be broken primarily into versions first (<DIV1 TYPE="version">), then each version into its chapters (<DIV2 TYPE="chapter">).
All material should be recorded in the form in which it appears in the book: do not attempt to correct spelling or typographic errors (except upside-down letters or physically displaced text). Spaces between words should always consist of one space character. Spacing around punctuation should be normalized, either to modern standards or to contemporary ones, if they clearly differ.
Page numbers as printed in the book will be preserved only as the value of the "N" attribute of the <PB> (page-break) tag (unless the non-empty form of the PB tag is chosen). Unnumbered pages should receive a <PB> tag with the N attribute omitted. Incorrect page numbers, if they arise from typographic error, should be recorded just as they appear (otherwise: see comments on out-of-order pages). Page numbers will usually consist of arabic or roman numerals, but may also appear as letters or letter-number combinations. If there appear to be multiple separate paginations, choose one to record with the <PB> tag; record the other with a <MILESTONE> tag. Ignore any typographic elements used to set off the page number. E.g. -2-, {p. 2}, and PAGE 2 should all be recorded as <PB N="2">; (ccii) .cc.ii. and -ccii- should all be recorded as <PB N="ccii">; etc.
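The normalization of printed page numbers can be sketched as below. The list of labels to discard (page, pag, p, fol, folio) is an assumption extrapolated from the examples in this document, and the function names are hypothetical. REF is left out of the generated tag to mirror the examples above; in delivered text it would still be required.

```python
import re

# Labels to discard (assumed list) and any non-alphanumeric decoration.
_LABEL = re.compile(r"\b(?:folio|fol|page|pag|p)\b\.?", re.IGNORECASE)
_DECORATION = re.compile(r"[^0-9A-Za-z]+")


def pb_n_value(printed):
    """Strip labels and typographic decoration from a printed page number."""
    bare = _LABEL.sub("", printed)
    return _DECORATION.sub("", bare)


def pb_tag(printed=None):
    """Build a <PB> tag; unnumbered pages get no N attribute at all."""
    n = pb_n_value(printed) if printed else ""
    return f'<PB N="{n}">' if n else "<PB>"


for printed in ["-2-", "{p. 2}", "PAGE 2", "(ccii)", ".cc.ii.", "-ccii-"]:
    print(pb_tag(printed))
print(pb_tag())  # <PB>
```

One known limitation of this sketch: a page "number" consisting solely of the letter p would be mistaken for a label and dropped, so letter-only paginations need a human check.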
Placement of <PB> tags. The rules are: (1) "pages always break at the top"; that is, <PB> tags will be inserted in the text at the actual location of the page break (the "top" of the page), regardless of the location of the page number on the printed page. (2) "Divisions begin at page breaks; they don't end there"; that is, if a structural break of some kind coincides with the page break (e.g., if a new section (<DIV>), paragraph, stanza, etc., begins at the head of the new page), the <PB> tag should be tucked just inside (immediately after) the opening tag for the first non-empty structural element, neither inside the closing tag for the old division nor between the two divisions. And (3) "Words cannot break at page breaks"; that is, if a hyphenated word straddles a page break, finish the word and any attached punctuation, then insert the <PB> tag. Treat the hyphen as any other end-of-line hyphen.
In parallel texts, material on a single page is often recorded at widely separated points in the data stream (once in each parallel <DIV>). In that case, the <PB> tag, including the page number, should be repeated, i.e., recorded in both <DIV>s. (The same may be done in the case of footnotes that carry over to the next page, but in fact we generally omit the extra PB in that case.)
Foliation. Some books may be foliated instead of paginated, i.e., every leaf may receive a number, rather than every page (in which case, typically, the back page of each leaf has no number). Record a foliated book in the same way as a paginated book, supplying the folio number as the value of the "N" attribute of the <PB> tag. A typical page sequence in this kind of book will look like this:
<PB N="iij"> <PB> <PB N="iv"> <PB> <PB N="v"> <PB>
The folio number may be explicitly labeled as such ("Fol.xvii." or "Folio .cxli."). Discard the label and punctuation and record only the actual number (<PB N="xvii"> <PB N="cxli">), again unless you are using the non-empty PB option.
Page breaks in unbreakable objects. Occasionally an object such as a table will spread across a two-page opening so that the opening becomes in effect a single page. (This is different from a table that is simply continued from one page to the next.) In this case there is no sensible place to insert the <PB> tag that marks the break between the left and right pages, so it should be inserted before the unbreakable object, with a double "N" value.
E.g., if a single table is spread across pp. 46 and 47 (both of them on image 22), the tagging should look like this:
<P> <PB REF="22" N="46-47"> <TABLE> ... </TABLE> </P>
Objects that span two or more IMAGES (as opposed to pages) are another matter. This happens fairly commonly with large fold-outs, which may have been filmed in sections. In some cases, it may be possible to break these up into separate objects. In that case, each piece of the original foldout will be tagged as a separate (e.g.) <TABLE> with intervening <PB> tags to indicate the appropriate image on which the piece appears. In other cases, it is more feasible to treat the set of images as a single unbreakable object and insert the PBs in a group at the beginning of it.
Other (largely) non-structural numerations and alternative numerations. If the book contains some other running numeration system alongside folio or page references, you may use the milestone element to record it. If the nature of the unit is obvious (e.g. chapter 1 -- chapter 2 -- [etc.]), you may use the "unit" attribute to capture that information: <MILESTONE UNIT="chapter" N="1"> <MILESTONE UNIT="chapter" N="2">. Particularly complex sets of MILESTONEs often appear in works with a running marginal chronology, or set of chronologies; in these books, one sometimes finds a suitable UNIT value at the head of the column of marginal years: <MILESTONE UNIT="year before Christ" N="1234">. Note that this applies only to a sequence; occasional notes of this sort should be recorded simply as <NOTE>. If in doubt whether a set of numbers represents <MILESTONE>s or <NOTE>s, use <NOTE>. (Milestones can of course also be found embedded in notes that contain additional information.) Some books contain conflicting structural enumerations, e.g. a system of proposition numbers in the margins that does not correspond with the chapter numbers; the former may be recorded using <MILESTONE> tags.
Some books mark the fine structure of the book or of the book's argument with marginal sequences of numbers. In many cases, such small units of structure (without headings) do not merit tagging as DIVs and the marginal indications can be readily tagged as MILESTONEs.
Biblical verse numbers inserted in the text of a translation or paraphrase (either verse or prose) are usually most readily tagged as MILESTONEs (<MILESTONE UNIT="verse" N="14">).
Line numbers in verse should be recorded only as the value of the "N" attribute of the <L> tag. Record in this fashion only line numbers actually printed in the book, and use the form of the number that appears in the book. (Line numbers in prose should usually be regarded as non-structural--that is, they do not correspond to any structure that we are tagging--and recorded as milestones, as above.)
Stanza, chapter, section numbers, etc. (that is, sequential numbers that appear in the headings to <LG>s and numbered <DIV>s) should be included as they appear in the book as part of the text surrounded by the appropriate <HEAD> tag, but should also be recorded, if possible as an arabic number, as the value of the "N" attribute of the appropriate <DIV> or <LG> tag.
<DIV2 TYPE="chapter" N="5"><HEAD>Chapter V.</HEAD>
<LG N="14"><HEAD>Stanza XIV.</HEAD>
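Deriving the arabic N value from a heading such as "Chapter V." can be sketched as follows. Treating the last word of the heading as its number is an assumption that will misfire on some headings (one ending in the pronoun "I", for instance), so the result should be regarded as a suggestion for review; both function names are hypothetical.

```python
import re

_ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    """Convert a roman numeral to int; accepts 'j' for 'i' (as in 'iij')."""
    s = numeral.lower().replace("j", "i")
    if not s or any(ch not in _ROMAN for ch in s):
        raise ValueError(f"not a roman numeral: {numeral!r}")
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        value = _ROMAN[ch]
        # A smaller value before a larger one is subtractive (iv = 4).
        if nxt != " " and value < _ROMAN[nxt]:
            total -= value
        else:
            total += value
    return total


def n_from_heading(heading):
    """Suggest an arabic N value from the last word of a heading, or None."""
    words = re.findall(r"[0-9A-Za-z]+", heading)
    if not words:
        return None
    last = words[-1]
    if last.isdigit():
        return str(int(last))
    try:
        return str(roman_to_int(last))
    except ValueError:
        return None


print(n_from_heading("Chapter V."))   # 5
print(n_from_heading("Stanza XIV."))  # 14
```

The heading text itself is never altered; only the attribute value is normalized.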
Note that though the chief use of the N attribute is to record numbers, it can be used (guardedly) to record any comparable information, especially if it is sequential: alphabetical sequences (N="a" N="b"); names of countries; reigns of kings; in the latter cases, the N attribute can serve a normalizing function: <DIV1 TYPE="reign" N="Edward I">.
Paragraph numbers (sequential numbers appearing at the beginning of a series of paragraphs that you have not chosen to regard as <DIV>s) should be included as they appear in the book as part of the text surrounded by the <P> tags, but should also be recorded, if possible as an arabic number, as the value of the "N" attribute of the <P> tag.
Item numbers and label numbers in lists should be recorded as part of the text included within the <ITEM> (or <LABEL>) tags. They may additionally, if feasible, be recorded and normalized as values of the attribute "N". See below under "Lists and Tables."
Enumerations in tables may be variously treated: given a column of their own, left as part of the text in a row, or even made part of an embedded <LIST>, whichever adequately represents the information most simply and efficiently. It is usually best to include the numbers as part of the text. See below under "Lists and Tables."
Language. Supply a value for the LANG attribute of numbered <DIV>s and of whole <TEXT>s, but do so only if the bulk of the text (barring notes) in that <DIV> or in that <TEXT> is in the indicated language. Supply the attribute at the highest level at which it applies: e.g., if an entire text is in Latin, add LANG="lat" to the <TEXT> tag, but not to all the <DIV> tags within that <TEXT>; if one of the <DIV1>s in a text is in Latin and other is in English, assign LANG="lat" to one of the <DIV1>s and LANG="eng" to the other; and so on. Assume that the LANG property is inherited. Optionally: mark units as small as LG, Q, and FOREIGN with LANG attributes.
Assign multiple LANG values to the same <DIV> or <TEXT> only if it contains two or more languages in some kind of organized relationship. E.g., a bilingual Latin/English dictionary should be coded as <TEXT LANG="lat;eng"> (with a semicolon between the two codes). Use the USMARC 3-letter language codes published by the Library of Congress at http://lcweb.loc.gov/marc/languages/ (these are identical to the 3-letter codes contained in the ISO standard 639-2; see http://lcweb.loc.gov/standards/iso639-2/langhome.html). Unless the scope and nature of the project require it, do not normally attempt to differentiate between forms of the same language: e.g., record LANG="fre" for French texts and LANG="eng" for English ones, not LANG="frm" ('Middle French') or LANG="enm" ('Middle English').
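As a rough illustration of the rule for LANG values (3-letter codes, joined by a semicolon when there is more than one), here is a minimal Python sketch. The function name `lang_attribute` is hypothetical. It checks only the shape of each code (three lowercase letters), not whether the code actually appears in the Library of Congress list.

```python
import re


def lang_attribute(*codes: str) -> str:
    """Build a LANG attribute from one or more 3-letter language codes."""
    if not codes:
        raise ValueError("supply at least one language code")
    for code in codes:
        # Shape check only: three lowercase letters, e.g. 'lat', 'eng', 'fre'.
        if not re.fullmatch(r"[a-z]{3}", code):
            raise ValueError(f"not a 3-letter lowercase code: {code!r}")
    # Multiple languages are separated by a semicolon.
    return 'LANG="' + ";".join(codes) + '"'


print("<TEXT " + lang_attribute("lat", "eng") + ">")
```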
TYPEs of DIV. Supply a value for the TYPE attribute of numbered <DIV> elements if the appropriate value is obvious; otherwise, omit the attribute entirely. You may find it useful to consult our (aging) TCP list of common and preferred DIV TYPEs. If you do supply a value, use these rules:
If the designation in the book is a verbose version of a common English term, use the simpler form. E.g., if the book says "Prefatory Remarks by the Author," you shouldn't be afraid to translate this into <DIV1 TYPE="preface">.
As a fallback, use whatever is printed.
<DIV1 TYPE="poem"> See further under Poetry, below.
Provide attribute values only when instructed to and when there is specific information to supply. Do not supply values of this sort: TYPE="unknown" or TYPE="unspecified".
Captions. It is not always easy to distinguish between captions and other text within the illustration. Captions may appear below the illustration, above it, in a circle around it, or even within it (e.g. on a "shield" or similar device), and may often be distinguished from other text by the fact that they provide a summary identification or description of the illustration. If in doubt, assuming that the text can be read, capture it.
Mixed text and illustration (e.g. where the woodcut frames the text, or where a block of text (e.g. a poem) is printed by means of woodcut) can in most cases be captured by treating the illustration per se and the text as separate items. In the case of a poem printed by means of woodcut at the bottom of a larger illustration, for example, it is often easiest to capture like this: <P><FIGURE></FIGURE></P><LG><L>... </L></LG>.
In-line illustrations, if they are truly in-line (that is, can be unambiguously located within a line of text) should be inserted (as <FIGURE>) within the text at the appropriate spot. If the appropriate location is not quite so obvious (e.g., an illustration occupying two or three lines of text inserted in the text or placed in the margin), use the rules for marginal notes (below). That is: if the correct location can be identified easily (e.g. by an identifying phrase, "as shown in this figure:") place the <FIGURE> tag within the text at that point; if not, simply place it after the nearest sentence-ending punctuation (e.g. a period or colon).
This should be distinguished from spaces deliberately left blank. If these are significant and occur within the text, e.g. as blank spaces left to be filled in by hand in a legal or commercial form, capture these as <GAP DESC="blank">.
There is no firm rule as to *which* copy to keep and which to <GAP> out, except that it would be sensible to keep the better copy and exclude the worse one. Often it will be the second copy which is the better (it is because the photographers thought there might be something wrong with the first copy that they made a second copy). If there is a duplicate run of images, and one is complete and the other incomplete, normally you should keep the complete set and exclude the incomplete set. If the situation gets more complicated, e.g. if both sets are incomplete but are missing different pages, or if only one set is complete but includes some bad images that can be replaced by images from the incomplete set, you may have to mix and match. In any case, the desired result is the best possible text, from the best images, in the right order.
Any images that are given the <GAP> treatment should be represented by separate <GAP> tags for each page (not each image); do not attempt to represent a span of pages or a span of images with a single <GAP> tag. This is so that each image, regardless of whether it is captured or not, will still be represented in the text by a <PB> tag. Each <PB> tag should, of course, point to the actual image number using the REF attribute.
Surrounding structures should be preserved if possible, at the highest level that applies. A line of verse quoted in Greek, for example, should be recorded as <Q><L><GAP DESC="foreign"></L></Q>; a paragraph in Greek as <P><GAP DESC="foreign"></P>; and a stanza in Greek as <LG><GAP DESC="foreign"></LG>.
Record as: the semicircle .18.15, <GAP DESC="foreign"> .21.7, <GAP DESC="foreign"> .23
Extended spans of music should be captured using a single <GAP> tag, so long as other material (such as text, illustrations, or a page-break) does not interrupt.
Lyrics printed between lines of music should be recorded as ordinary prose. At every point at which the line of lyrics ends and a line or two of musical notation appears, insert within the running prose a <GAP DESC="music"> tag.
The illegibility threshold. Two extremes should be avoided as far as possible: (1) using the illegibility markers promiscuously to avoid capturing text about which there is some difficulty; and (2) "creative" capture of text that really cannot be read, simply in order to avoid using the illegibility marker. We have prepared some examples of both overuse (EXAMPLE SET 1) and underuse (EXAMPLE SET 1; EXAMPLE SET 2; see also the bottom of SET 3) of the illegibility markers. It is admittedly not always easy to tell when a letter can be recognized with sufficient confidence to make its capture reliable.
One text or many? Most works will consist of a single <TEXT> containing a single <BODY> element (optionally also a <FRONT> and/or <BACK> element for front and back matter respectively). Some works will consist instead of a <GROUP> element that contains multiple <TEXT>s (each <TEXT> with its own <BODY> and, optionally, <FRONT> and <BACK>). The GROUP element will normally be reserved for items that contain several works published or bound together, each with its own title page, that were originally printed separately, e.g. the collected works of an author.
The DTD allows for two options that cover most real books:
The <BODY> (and, if necessary, the <FRONT> and <BACK> elements) will normally be divided into numbered <DIV>s corresponding to the main divisions of the text. Very simple documents, on the other hand, with no internal division (a work consisting of a single poem, for example, or a tract containing only a series of paragraphs) do not require <DIV>s at all: <BODY><P> is sufficient. Use no more <DIV> layers than necessary.
The numbered <DIV> elements, from <DIV1> to <DIV7>, represent a hierarchy: the <BODY> (as also the <FRONT> and <BACK> matter) is subdivided into <DIV1>s; <DIV1>s, if necessary, are subdivided into <DIV2>s, and so on. <DIV>s divide into parts: with few exceptions, you need to have more than one of something to call it a <DIV>.
Individual small texts embedded within a larger work (e.g. entire poems quoted within a chapter of a treatise) should usually not be tagged as <DIV>s but should instead be placed within <Q> tags. The <Q> element may if necessary contain an entire <TEXT>, with its own <BODY>, <FRONT>, <BACK>, numbered <DIV>s etc. The <INSERT> element is a shortcut for this structure. See further under quotations, below.
Useful clues to the DIV structure include:
In general, these are not sufficient to establish a <DIV> and should instead be recorded as ordinary text. Numbered paragraphs, for example, should simply retain the number as part of the paragraph (and as the value of the "N" attribute of the <P> tag), but there is no need to call the number a <HEAD> and therefore make the <P> a <DIV>.
<P N="3">¶ III. In the third place, the Calvinist partie striveth ...
Marginal "headings" that you decide not to treat as <HEAD>s can usually be encoded either as <NOTE>s, with the PLACE attribute set to "marg", or (if they contain a sequential numeration) as <MILESTONE>s.
TYPES of DIVs. See above under "attributes."
Front matter (material to include in the <FRONT> element) typically includes title pages, dedications, tables of contents, prefaces, prologues, honorific poems and prose blurbs (encomia), remarks "to the Reader", etc., each of which should be recorded with a numbered <DIV>, their subsections recorded with higher-numbered <DIV>s, etc., just as with <BODY>.
Title pages do not require special tags. Each title page should be recorded as a numbered <DIV> within the <FRONT> element. Include both the front and back (recto and verso), if there is material there to record. If there are multiple title pages, record each in a separate <DIV>. Most title pages can be recorded as simple blocks of prose text (recorded with <P>s). Other structural tags (e.g. <HEAD> or <EPIGRAPH>) should be avoided; verse quotations and illustrations on the title page should of course be recorded as such, using <LG>, <L>, <Q>, and <FIGURE>. Entirely engraved title pages should place the text for the entire page within a <FIGURE> tag and supply the TYPE value TYPE="engraved title page".
Back matter (material to include in the <BACK> element) typically includes indexes, glossaries, colophons, afterwords, appendices, etc., each of which should be recorded with a numbered <DIV>, their subsections recorded with higher-numbered <DIV>s, etc., as with <BODY> and <FRONT>.
Do not attempt to record the physical appearance of the page (centering, extra spaces, justification, type face, type size, etc.), though such cues may and should be used to determine the beginning and end of divisions within the text, the distinction between text and notes, etc. On type forms, see the special instructions below about use of the <HI> tag and its relatives.
Record physical line-breaks (with the <LB> tag) only (1) if the text is unintelligible without a break; and (2) if the break is not reflected by a structural tag. Many times, it is better to repeat a tag than to insert a line break in the middle of one; but more often it is possible to get by without doing either, especially if there is any punctuation at the line break. E.g., record this:
CHAP. XI.
Some Advantages and Helps for raising and affecting the Soul by Meditation.
like this:
<HEAD>CHAP. XI.</HEAD> <HEAD>Some Advantages and Helps for raising and affecting the Soul by Meditation.</HEAD>
or, better, like this:
<HEAD>CHAP. XI. Some Advantages and Helps for raising and affecting the Soul by Meditation.</HEAD>
but NOT like this:
<HEAD>CHAP. XI. <LB>Some Advantages and Helps for raising and affecting the Soul by Meditation.</HEAD>
However, some loosely formatted text can only be rendered intelligible by use of <LB> tags, in much the same circumstances that would require the use of the <PRE> tag in HTML. Common uses include the capture of inscriptions with significant line breaks, and quasi-mathematical text such as syllogisms:
Rover is a dog<LB> All dogs have tails<LB> ERGO Rover has a tail
Paragraph breaks should be recorded with <P> in prose and with <LG> (line-group or stanza) in verse.
Set the default typeform for a given region using the REND attribute of the various structural tags, e.g. DIV1 TYPE="dedication" REND="italic". Text in roman type is the default and does not need to be marked. Set default typeforms at the highest level that applies (much as when assigning the LANG attribute).
Mark text in a different typeform within a given region by using a mixture of the I B U and HI tags. I B and U are simply shortcuts for HI REND="italic" HI REND="bold" and HI REND="underlined". In practice, it is often easiest to use the separate I B and U tags, and reserve HI for typeforms not otherwise accounted for.
Treat I, B, U, and HI as cancelling the value of the REND attribute of the structural element. But treat I, B, U, and HI as cumulative with respect to each other, much as is done in HTML: <I><B> = bold italic.
Alternatively, treat the tags as mutually exclusive and use the REND attribute of HI to indicate combined typeforms, separating the values with a semicolon. E.g., <HI REND="bold;italic">
When punctuation coincides with the end of a span marked by the <HI> elements, and there is doubt as to whether the punctuation belongs inside or outside the closing tag, place it within the closing </HI> tag:
<HI>Sillepsis,</HI> or the Double supply.
Record superscripted and subscripted text using the keyboard "circumflex" or "caret" character (^ = DECIMAL 94, HEX 5E) before each superscripted character (^a;, ^b; 5^t^h; 2^n^d) and the same "caret" character doubled (^^) before each subscripted character (i.e., ^^a;, ^^b;, etc.), including punctuation characters.
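The caret convention can be sketched as two small Python helpers: one caret before every superscripted character, two carets before every subscripted character. The function names are hypothetical, and the sketch simply prefixes every character it is given (punctuation included), so the caller decides which characters belong to the raised or lowered span.

```python
def superscript(text: str) -> str:
    """Prefix each character with a single caret (HEX 5E)."""
    return "".join("^" + character for character in text)


def subscript(text: str) -> str:
    """Prefix each character with a doubled caret."""
    return "".join("^^" + character for character in text)


# '5' followed by a superscripted 'th', and '2' followed by a superscripted 'nd'
print("5" + superscript("th"))
print("2" + superscript("nd"))
# a subscripted 'a'
print(subscript("a"))
```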
Record large initials, "drop caps," etc., as ordinary capital letters.
Record "small caps" as ordinary capital letters.
Record vertical text (text printed perpendicularly to the main text) as if it were horizontal.
<Q>s are used for block quotations, whether of prose or verse. Don't use them for ordinary "inline quotations."
"Block quotations" include both quotations that are set off from the main text by indentation and blank lines (in the modern way) and also lengthy quotations that are set off by the use of other typographic cues such as a change of typeface (if unambiguously marking a block quotation). If you're not sure if a block of text is a <Q>, simply record the appearance of the text (using, e.g. <P> and <HI>).
See below for the special problem of marginal quotation marks or marginal inverted commas.
<Q>s are usually the best way to tag even very substantial items embedded in prose, e.g. a poem or a document of some kind quoted within a chapter, or within a note, or within an introduction.
<Q> can if necessary even contain an entire <TEXT>, with its own <FRONT> matter, <BODY>, <DIV> structure, and so on. Use <Q> for such embedded items (or the stand-in tag <INSERT>), rather than trying to treat them as <DIV>s of the main text (unless that's really what they are). Treating them as <DIV>s forces you to treat all the material surrounding them as <DIV>s too, at the same level.
Prefer this:
<DIV1 TYPE="introduction"> <P>blah blah</P> <P>blah blah</P> <P> <Q>here's a poem</Q> </P> <P>blah blah</P> </DIV1>
to this:
<DIV1 TYPE="introduction"> <DIV2 TYPE="stuff before the poem"> <P>blah blah</P> <P>blah blah</P> </DIV2> <DIV2 TYPE="poem"> <LG><L>here's a poem</L></LG> </DIV2> <DIV2 TYPE="stuff after the poem"> <P>blah blah</P> </DIV2> </DIV1>
Block quotations accompanied by citations should record the quotation within <Q> tags and the citation within <BIBL> tags.
Most material that is set off from the main body of the text but is adjacent and related to it can be safely tagged as <NOTE>. (But arguments (summaries at the head of <DIV>s), salutations, and speaker names and stage directions in drama are among the note-like features that have their own tags. Also see above concerning MILESTONEs.)
With the exception of end notes, record each note at the point in the main text to which it relates, set off by appropriate tags, not at the point where it appears on the page.
A note that spills onto the next page needs to be treated as a single note, not two, and should be placed in the text where it applies.
If the note points to a place in the text which is marked with a flag of some kind (e.g. a footnote reference number, an asterisk (*), etc.), discard this marker from both note and text once it has served its purpose by locating the <NOTE> in the right place in the text. The marker should be preserved only as the value of the "N" attribute of the <NOTE> tag. Notes that use non-alphabetical symbols such as "daggers," section-marks, paragraph marks, etc., should preserve those characters too in the "N" attribute if possible, using character entities or however the character would be represented elsewhere in the text, like this: <NOTE N="†">. If the character is not recognized as corresponding to a readily available character entity, supply "#" or "@" as the value, using the rules for unrecognized symbols. If the note contains a marker, but the text does not, or vice versa, act as if the marker were present in both places. If the note contains a marker that is different from the one in the text, use one or the other (usually the one that makes most sense in the local sequence) and ignore the other.
Sometimes notes can be accurately placed only by noting their sequence. There may be three marginal notes on a page, for example, matched by three asterisks in the text; the first note is inserted at the first asterisk, the second note at the second asterisk, and the third note at the third asterisk.
If the note is keyed to the text by line number, verse number, etc., place the note at the end of the line (etc.) to which it applies, and discard the literal number from the note, if that can be done without loss of clarity.
Use the "PLACE" attribute of the <NOTE> tag to indicate where the note appears on the page:
- PLACE="marg" in margin or adjacent to the text (even if part of it runs across the whole page because of lack of room in the margin, or if it is set into the edge of the text as a "shoulder" note)
- PLACE="foot" in a footnote, below the text
- PLACE="inter" interlinearly (between the lines of text)
If there are multiple distinguishable sets of notes in the same location (two sets of footnotes, for example; or multiple sets of marginal notes marked by different kinds of flags, one set marked by numbers, one by letters), distinguish them by using the SERIES attribute with something distinctive as its value (usually a simple number): PLACE="foot" SERIES="1" and PLACE="foot" SERIES="2" for example.
Example of book with two sets of marginal notes, one keyed to letters, one to numbers; record them as <NOTE PLACE="marg" SERIES="1"> and <NOTE PLACE="marg" SERIES="a">
Notes that apply to two (or more) distinct loci or lines should be reproduced and inserted at *both* (or all) the relevant points.
These need to be distinguished from notes that apply to a span of loci or lines; notes applying to a span of lines should be placed after the last line in the span with indications of the length of the span (e.g., "14-23" [with reference to line numbers] or "*-*" [with reference to two "*" flags in the text]) retained.
A note that appears next to a single verse line or set of lines and seems to relate to that line (or set) should be placed at the end of the line(s) in question.
A note that relates to a specified group of lines, verses, etc., should be moved into the text at the end of the last item to which it applies. If there are line numbers, the line number indication in the note should be preserved. If physical arrangement, rather than explicit line numbers, serves to specify the line or verse number range, and there are line numbers in the verse, supply the appropriate number range in brackets at the beginning of the note.
Notes referenced to a line (verse, etc.) number followed by "f." ("2365 f." meaning "line 2365 and following") should be treated as notes referenced to a span of two lines (in this case, 2365-66), that is, placed at the end of the second line (2366), with the full line reference preserved in the note: <NOTE PLACE="foot">2365 f.: ... </NOTE>
A note that seems to relate to an entire text division (e.g. a <DIV> or <P>) should be inserted at the beginning of the text that comprises that division, or at the end of the <HEAD> if that is more convenient (and if it has one). E.g. a marginal note applying to a paragraph as a whole may be inserted at the beginning of the paragraph. This occurs commonly in books that contain a running summary or set of running headers in the margins: if these are not treated as <HEAD>s or <ARGUMENT>s, they should be treated as <NOTE>s (PLACE="marg") and inserted at the beginning of the section to which they apply. If the summary is found centered at the head of the text proper (instead of in the margin) it should usually be given a tag of its own and tagged as <ARGUMENT> or <HEAD> (see below under "Heads").
A marginal note in a prose text that seems to apply vaguely to the material next to which it is placed should be inserted at the end of the nearest sentence (as marked by punctuation--a period, semicolon, or colon), or at some other break in the text if that seems more appropriate.
A note that relates generally to the material on a page, or for which the appropriate place cannot readily be determined, should be attached to the last line of text at the bottom of the page.
End notes (whether appearing at the end of the book or at the end of the section), especially if they occupy any considerable space (a page or more) should not be inserted in the main text, but should instead be captured on the page and in the place where they appear. Depending on their extent, they may be captured as <P>s, <DIV>s, or even list <ITEM>s. All reference information in the note (e.g. footnote number, brief quotation of words from the main text) should be left in place.
Additionally and optionally, provide links from the main text to the end notes using the <PTR> or <REF> tags. You will need to supply each note with a unique ID based on the ID number for the book as a whole (e.g. ID="A12345-page17-note3"), and reference that ID in the TARGET attribute of the PTR or REF element. The PTR element is an empty element, to be used when there is no particular literal text serving as the cross-reference; the REF element contains any literal text that serves as a cross-reference, e.g. a note marker: <REF TARGET="A12345-page17-note3">^3</REF>.
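A minimal Python sketch of the linking scheme follows. It assumes the ID pattern shown in the example (book ID, then "page" and the page number, then "note" and the note number); the guidelines require only that the ID be unique and based on the book's ID, so treat this pattern and the function names `note_id`, `ref_tag`, and `ptr_tag` as illustrative.

```python
def note_id(book_id: str, page: int, note: int) -> str:
    """Build a unique note ID such as 'A12345-page17-note3'."""
    return f"{book_id}-page{page}-note{note}"


def ref_tag(target: str, marker: str) -> str:
    """REF wraps the literal text that serves as the cross-reference."""
    return f'<REF TARGET="{target}">{marker}</REF>'


def ptr_tag(target: str) -> str:
    """PTR is an empty element, used when there is no literal marker text."""
    return f'<PTR TARGET="{target}">'


target = note_id("A12345", 17, 3)
print(ref_tag(target, "^3"))
print(ptr_tag(target))
```

The same string must appear as the ID of the end note and as the TARGET of the pointer, which is why both are derived from one call to `note_id`.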
Reference numbers in the text that point to something other than a note (e.g. to some part of an illustration), or for which the target cannot be found, should simply be recorded as part of the text.
Passages of verse (especially 2 or more lines, quoted and arranged as verse) within a note will normally be most readily coded as a quotation (<Q>) containing <L>s or <LG>s, embedded within the <NOTE> element.
Notes comprising a running interlinear commentary or interlinear gloss pose special problems.
In general, prefer to record itemized sequences as <LIST>s rather than <TABLE>s if possible. Use <TABLE> when the material cannot be readily understood without the spatial organization that tables provide. It is sometimes possible to capture items outside the main text flow as <NOTE>s or <MILESTONE>s rather than resorting to a <TABLE>.
The TEI 'list' element may be compared to a merger of the HTML OL, UL, and DL elements. It contains essentially two content models, one consisting of a sequence of ITEMs, the other consisting of a series of LABEL-ITEM pairs. The former is more commonly used.
Numbered sequences of items when the items themselves are blocks of text of considerable size (numbered paragraphs, for example) should not be treated as lists, but simply as numbered paragraphs (<P N="3">3. ...).
Complex lists (lists within lists) should be encoded with nested <LIST> tags, i.e. a <LIST> tag within an <ITEM> of another <LIST>:
<LIST> <ITEM> .. </ITEM> <ITEM><LIST> <ITEM> .. </ITEM> <ITEM> .. </ITEM> </LIST> </ITEM> </LIST>
Outline structures, genealogical trees and similar tree structures, and complex formatting involving braces can often best be tagged as nested lists, sometimes nested to a very deep level.
Treat any numbers that enumerate items in a list as part of the text of that item; do not record them with separate <LABEL> tags, though you may (optionally) also include them in normalized form as the N attribute of the ITEM tag. E.g.:
<LIST> <HEAD>Sins</HEAD> <ITEM>1. Avarice</ITEM> <ITEM>2. Sloth</ITEM> <ITEM>3. Pride</ITEM> </LIST>
or
<LIST> <HEAD>Sins</HEAD> <ITEM N="1">1. Avarice</ITEM> <ITEM N="2">2. Sloth</ITEM> <ITEM N="3">3. Pride</ITEM> </LIST>
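The optional normalization can be sketched in Python as follows. The item number stays in the text of the item and is copied into the N attribute only when a leading number is actually present. The function names and the pattern used to detect a leading number (digits, an optional period or closing parenthesis, then a space) are assumptions made for this illustration.

```python
import re


def item_with_n(item_text: str) -> str:
    """Wrap item text in <ITEM>, adding N when the text starts with a number."""
    match = re.match(r"\s*(\d+)[.)]?\s", item_text)
    if match:
        # The printed number is kept in the text and repeated in N.
        return f'<ITEM N="{match.group(1)}">{item_text}</ITEM>'
    return f"<ITEM>{item_text}</ITEM>"


def list_with_head(head: str, items: list[str]) -> str:
    """Assemble a simple <LIST> with a <HEAD> and one <ITEM> per entry."""
    parts = ["<LIST>", f"<HEAD>{head}</HEAD>"]
    parts += [item_with_n(item) for item in items]
    parts.append("</LIST>")
    return " ".join(parts)


print(list_with_head("Sins", ["1. Avarice", "2. Sloth", "3. Pride"]))
```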
Typical indexes and tables of contents can be readily tagged using simple lists containing only <ITEM>s, especially if there is punctuation between the items and the page numbers. Always prefer this option if possible. E.g.:
<LIST> <HEAD>M.</HEAD> <ITEM>Malva, Wild Mallow, 46.</ITEM> <ITEM>Maple, 87, 91.</ITEM> <ITEM>March Mallows, 59.</ITEM> <ITEM>Matricaria, Featherfew, 54.</ITEM> <ITEM>Meadow Saffron, 19.</ITEM> <ITEM>Medune celebrated, 35.</ITEM> <ITEM>Meleagris, checquer'd Daffedil, 52.</ITEM> <ITEM>Melilot, Plaister Claver, 46.</ITEM> <ITEM>Melissa, Balm, 59.</ITEM> </LIST>
The page number is technically an internal cross-reference and may therefore be tagged using the <REF> tag, especially if it is desired to provide links from the table of contents to the items listed therein.
Even when punctuation is lacking (e.g. when the indexed item is left justified and the page number right justified), simple <ITEM>s will often do. Here is an example without punctuation (and including some nested lists):
<LIST> <HEAD>M.</HEAD> <ITEM>Man, <LIST> <ITEM>at variance with himself &c. 24</ITEM> <ITEM>An inbred malice in him 48</ITEM> <ITEM>Pindars account of him 97</ITEM> <ITEM>Vnable to judge of crimes 229</ITEM> <ITEM>He hath a will but not the power to resist God 125</ITEM> <ITEM>Prone to aggravate his own afflictions 254</ITEM> </LIST> </ITEM> <ITEM>Masanissa, his famous plot. 142</ITEM> <ITEM>Mercy, <LIST> <ITEM>what it is 68</ITEM> <ITEM>How it differs from pitty Ib.</ITEM> </LIST> </ITEM> <ITEM>Michael Ducas, the great plague in his reign 267,268</ITEM> <ITEM>Mithridates, his cruelty 276</ITEM> </LIST>
OPTIONALLY, lists of pairs may be tagged with the element pair <LABEL> and <ITEM> (in that order). If you use this option, you may omit any "leader" (e.g. a dot leader) between the paired items. E.g.:
THE PLAYERS' NAMES
The Prince...............Jn. Longfellow
The Pauper...............Thomas Goodrich
Joan the Tappester........Jack Smithson
<LIST> <HEAD>THE PLAYERS' NAMES</HEAD> <LABEL>The Prince</LABEL><ITEM>Jn. Longfellow</ITEM> <LABEL>The Pauper</LABEL><ITEM>Thomas Goodrich</ITEM> <LABEL>Joan the Tappester</LABEL><ITEM>Jack Smithson</ITEM> </LIST>
Such 2-column table-like lists, if they contain column headers, may distinguish them by use of the ROLE attribute on the ITEM and LABEL elements, thus:
THE PLAYERS' NAMES
(The character) (The Player)
The Prince...............Jn. Longfellow
The Pauper...............Thomas Goodrich
Joan the Tappester........Jack Smithson
<LIST> <HEAD>THE PLAYERS' NAMES</HEAD> <LABEL ROLE="label">(The character)</LABEL><ITEM ROLE="label">(The Player)</ITEM> <LABEL>The Prince</LABEL><ITEM>Jn. Longfellow</ITEM> <LABEL>The Pauper</LABEL><ITEM>Thomas Goodrich</ITEM> <LABEL>Joan the Tappester</LABEL><ITEM>Jack Smithson</ITEM> </LIST>
Tables should be recorded as you would using HTML tables, oriented by row, with the number of columns determined by the number of cells within the row. Use the spatial organization of the text to determine the number of rows and columns (not necessarily reflected in printed border lines). The ROWS and COLS attributes of the <CELL> tag should be used just like the ROWSPAN and COLSPAN attributes of the <TD> in HTML to indicate cells that extend across two or more rows or columns. Cells that contain a heading or label for a row (or column) should receive the attribute ROLE="label".
DLPS dtd | HTML equivalent |
---|---|
<TABLE> | <TABLE> |
<ROW> | <TR> |
<CELL> | <TD> |
<CELL ROLE="label"> | <TH> |
<CELL ROWS=""> | <TD ROWSPAN=""> |
<CELL COLS=""> | <TD COLSPAN=""> |
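The correspondences in the table above can be applied mechanically. The following Python sketch rewrites a simple, non-nested table from the tags used here into the HTML equivalents. The function name `dlps_table_to_html` is hypothetical, and the regular expression is an assumption that holds only when no <CELL> contains another <TABLE>; nested tables would need a real parser.

```python
import re


def dlps_table_to_html(sgml: str) -> str:
    """Rewrite ROW/CELL markup as TR/TD/TH, following the table above."""

    def convert_cell(match: re.Match) -> str:
        attributes, body = match.group(1), match.group(2)
        # A label cell corresponds to <TH>; any other cell to <TD>.
        is_label = 'ROLE="label"' in attributes
        attributes = attributes.replace('ROLE="label"', "")
        attributes = re.sub(r"\bROWS=", "ROWSPAN=", attributes)
        attributes = re.sub(r"\bCOLS=", "COLSPAN=", attributes)
        attributes = " ".join(attributes.split())
        tag = "TH" if is_label else "TD"
        open_tag = f"<{tag} {attributes}>" if attributes else f"<{tag}>"
        return f"{open_tag}{body}</{tag}>"

    html = re.sub(r"<CELL([^>]*)>(.*?)</CELL>", convert_cell, sgml, flags=re.DOTALL)
    return html.replace("<ROW>", "<TR>").replace("</ROW>", "</TR>")


sample = (
    '<TABLE><ROW><CELL ROLE="label" COLS="3">On the fyrst Sonday in Aduent.</CELL></ROW>'
    "<ROW><CELL>Rom. xiii.</CELL><CELL>C</CELL></ROW></TABLE>"
)
print(dlps_table_to_html(sample))
```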
Particularly complex tables may be recorded (again as in HTML) with nested <TABLE> tags, i.e., a <TABLE> within a <CELL>, or by combinations of <LIST> and <TABLE>, i.e. a <LIST> within a table <CELL> or a <TABLE> within a list <ITEM>.
Physical arrangements that cannot easily be accommodated by our simple table model (e.g., labels with text running vertically) may need to be adapted and adjusted until they fit; it is more important to preserve the relationships between the items in the table than to preserve its exact layout.
Tables that continue from one page to the next may be tagged as one continuous table, with an embedded <PB> tag, especially if its headings are not repeated on the new page. If the headings are repeated, it is usually easier to close the old <TABLE> and open a new one on the new page.
These are to be distinguished from tables that spread across a page. See above under "Page breaks in unbreakable objects."
Difficult-to-capture complex tables containing only numbers or symbols (i.e., without any substantial textual content worth searching) may optionally be captured as <FIGURE> as if they were illustrations. Note, however, that just as with the captions attached to "real" <FIGURE>s, the heading for the table should be included within a <HEAD> tag inside the <FIGURE> tag.
For example, this table may be tagged like this: <FIGURE><HEAD>A Table of Houses for the Latitude of 51.degr.34. min. <HI>Sol in Aries.</HI></HEAD></FIGURE>
Here is a sample simple table (this one is simple enough that it could almost be done as a <LIST>).
Recorded as:
<DIV TYPE="table">
<HEAD><HI>TABLE</HI></HEAD>
<HEAD>By this table, shall ye fynde the Epistles and Gospels, for the Son|daies, and other feastiuall dayes.</HEAD>
<P>FOR TO fynde them the sooner, shall ye seke for these capital letters, <HI>A, B, C D,</HI> whi|che stande by the syde of this boke alwaies, On or vnder the letter shall you fynde a crosse ✗, where the Epistle or the Gospell begynneth, and where the end is, there shal ye find an halfe crosse, @ And the fyrst lyne in this table is alway the e|pistle, and the seconde lyne is alway the Gospell.</P>
<TABLE>
<ROW><CELL ROLE="label" COLS="3">On the fyrst Sonday in Aduent.</CELL></ROW>
<ROW>
<CELL>Rom. xiii.</CELL>
<CELL>C</CELL>
<CELL>And for as muche as we knowe</CELL>
</ROW>
<ROW>
<CELL>Math. xxi.</CELL>
<CELL>A</CELL>
<CELL>Nowe when they drew nye vnto</CELL>
</ROW>
<ROW><CELL ROLE="label" COLS="3">On the second sonday in the Aduent.</CELL></ROW>
<ROW>
<CELL>Rom. xv.</CELL>
<CELL>A</CELL>
<CELL>what so euer thynges are writen</CELL>
</ROW>
<ROW>
<CELL>Luc. xx.</CELL>
<CELL>C</CELL>
<CELL>And there shall be signes</CELL>
</ROW>
</TABLE>
</DIV>
Headings at the head of text divisions, tables, lists, and stanzas (<DIV>s and <LG>s) should be tagged as <HEAD>. Subheadings may be tagged with a second <HEAD> tag, with the TYPE attribute set to "sub," i.e., <HEAD TYPE="sub">, though in most cases it is probably better to combine the two into one HEAD.
Some headings have special tags (see below). If heading-like material doesn't fall clearly into one of these special categories, use simple <HEAD>.
HEAD has a quite inclusive content model, and may be used to tag objects of various kinds occurring at the head of a division. One such use is the division-beginning illustration occurring before the textual heading of a chapter. E.g. a portrait of "Prince x" occurring at the head of a chapter about Prince x. In such cases, the object may be treated as a special kind of heading (since FIGURE can't appear before HEAD):
<HEAD TYPE="illustration"> <FIGURE> <HEAD>x, princeps</HEAD> <FIGDESC>Portrait of Prince x</FIGDESC> </FIGURE> </HEAD>
Other heading-like material with its own tags includes:
"Idlenesse is lesse harmefull then vnprofitable occupation."
PUTTENHAM
<EPIGRAPH> <Q>"Idlenesse is lesse harmefull then vnprofitable occupation."</Q> <BIBL>PUTTENHAM</BIBL></EPIGRAPH>
Epigraphs are a common place to find bits of non-roman script; record those bits with <GAP DESC="foreign"> as described above, but place the "foreign" portion inside the <Q> or <BIBL> of the <EPIGRAPH> tag.
Commentaries and sermons frequently quote a passage of text at the beginning (or at the beginning of each division), then comment on it. These passages may usually be readily encoded as <EPIGRAPH><Q> ... </Q></EPIGRAPH>, though occasionally there may be enough commingled head-like material to force all of it into <HEAD>: <HEAD>A sermon on <BIBL>Rom. 8:28</BIBL> <Q>All thinges worke together for good</Q> with some reflections on Providence.</HEAD>.
Material at the end of a text division that is set off from the main text is normally to be tagged as a <TRAILER> or <CLOSER>. <TRAILER> is the more general tag, corresponding to HEAD and used for material without such internal structures as datelines, salutations, or signatures. Typical <TRAILER>s include "Amen," "Finis," and title-like material such as explicits ("here ends the tract written by Master John Knox"; "Explicit liber de gubernatione Dei."). <CLOSER>, on the other hand, is the counterpart of <OPENER>; it is used when the concluding material includes lengthy or complex information, including datelines, salutations, or signatures, especially in letters. See Letters, below, for examples. Requests for prayer for the author's soul are typically recorded as <CLOSER>s.
Epigraphs and bylines can appear at the foot of a division as well as at its head (see above for a description of epigraphs).
<BYLINE> vs. <SIGNED>. It is not always easy to decide whether to use byline (3rd-person) or signed (1st-person) for ascriptions of authorship. If the phrase actually uses "by" ("By Philip Sidney"), <BYLINE> is the better choice. If the item is a document that is normally signed in order to take effect (a letter, a will, an edict or proclamation), <SIGNED> is better.
Verse lines. Each verse line should be enclosed in <L> tags. Though the REND attribute can be used to indicate degrees or levels of indentation, we normally do not attempt to record the varying indentation of verse lines, but rather pay attention to indentation only insofar as it indicates a stanza break or a "broken" (carry-over) line (see below).
Broken (carry-over) lines. Sometimes when a verse line is too long to fit on the page, its last word or two is placed (sometimes marked off with an opening bracket or opening parenthesis) at the end of the next line or at the end of the preceding line (wherever it fits best); or rarely at the end of a line several lines away. Such detached bits of verse lines should be recorded if possible at the end of the line to which they really belong.
Mary had a little lamb, [snow.
Its fleece was white as
<L>Mary had a little lamb,</L>
<L>Its fleece was white as snow.</L>
Partial lines occur commonly either when a verse extract is quoted, or when a line of verse is interrupted by some larger feature, e.g. a change of speaker in verse drama. The DTD includes the TEI PART attribute by which it is possible to tag partial lines as initial, medial, or final (PART="I" "M" or "F"), or some similar scheme, but we do not generally require this level of markup.
Groups of lines (<LG>s).
<DIV1 TYPE="poem">
<L>When the cat's away</L>
<L>The mice will play</L>
</DIV1>
<P>A stitch in time saves nine.</P>
<LG>
<L>When the cat's away</L>
<L>The mice will play</L>
</LG>
<P>Too many cooks spoil the broth</P>
<P>John walked along, chanting constantly:
<Q>
<L>When the cat's away</L>
<L>The mice will play</L>
</Q>
But no one noticed.</P>
<P>John walked along, chanting constantly:
<Q>
<LG>
<L>Red rover Red rover,</L>
<L>Come over Come over</L>
</LG>
<LG>
<L>The bird's on the wing,</L>
<L>The dog's had his fling.</L>
</LG>
</Q>
But no one noticed.</P>
Lines vs. line-groups. It is often unclear when a group of lines has enough organization to be called a stanza (line-group <LG>). If in doubt, err on the side of fewer line-groups rather than more. And be consistent throughout a particular poem, so that a particular structure is not sometimes tagged as a <LG> and sometimes left untagged. Clues to look at include, in decreasing order of significance:
<LG>s vs. <DIV>s. It is not always easy to distinguish between <LG>s and <DIV>s: both can have headings; both can nest to create a structural hierarchy. Metrical units (true stanzas) are always <LG>s; verse paragraphs of irregular length are frequently best recorded as <LG>s, especially if they are not consistently supplied with headings. On the other hand, <DIV>s should be used for line-groups big enough to have true titles, or to appear in tables of contents. Any poem with its own title deserves its own <DIV>.
Groups of stanzas within a poem should receive a numbered <DIV> tag. In most cases, you will use only a single level of <LG> (no nesting), and treat it effectively as the lowest-level text division. Any grouping of stanzas is therefore recorded as a <DIV>. (But see comments on songs in plays, etc., below.)
Entire poems. Each poem will usually be recorded as a <DIV> of the appropriate number (<DIV1> etc.), with TYPE="poem" (or TYPE="sonnet" etc. if you prefer to distinguish forms). Poems may, of course, be subdivided further into <DIV>s and <LG>s of various types. If a poem is quoted within a prose context, it is usually easiest to treat it as a <Q>. See next.
Poetry mixed with prose. When poetry is truly interspersed with prose, and either the poetry is the predominant form, or there is no clearly predominant form, the prose should be recorded within <P> tags, the verse within <LG> tags. When poetry gives way to prose, close the <LG> and open a <P>; when prose gives way to poetry, close the <P> and open an <LG>, even if the actual prose paragraph, or even the last sentence, is not finished.
Exceptions:
Aside from a few special tags (below), prose drama should be recorded like other prose (in <P>s, etc.) and verse drama like other verse (in <LG>s, <L>s, etc.), including the rules for interspersed poetry and prose.
Cast lists. Cast lists (DIV TYPE="dramatis personae") should be recorded like other lists, usually with the <LIST> tag. Cast lists will commonly appear as separate <DIV>s (within the <FRONT> matter of a book if the book contains one play). For complex cast lists, use nested lists and labels to indicate cast groupings.
Stage directions. Stage directions should be recorded with the <STAGE> element. Stage directions sometimes appear between the columns of a multicolumn text, or in the margin, where they look like notes. In other books, they may be centered (as if they were headings) or indented (as if they were little paragraphs). They are occasionally typographically distinct (in italics; within parentheses; or both).
Speakers. The name (sometimes abbreviated) of the speaker is recorded with <SPEAKER>. In print, these appear at the head of a speech: e.g. typically above the first line of the speech (sometimes centered), in the margin, in an indented line of its own at the head of the speech, or in italics at the beginning of the first line of the speech. Regardless of where it appears in print, the <SPEAKER> tag is tucked into the beginning of the appropriate <SP> ("speech") tag.
Additional text associated with the speaker's name may be included in the <SPEAKER> tag, if it cannot readily be disentangled and that is the most convenient way to do it, like this: <SPEAKER>Mr. Jones, chanting in unison with three butchers</SPEAKER>. Readily separable material perhaps belongs rather in STAGE: <SPEAKER>Mr. Jones</SPEAKER><STAGE>descending from garden whilst reading a letter aloud</STAGE>. Multiple names should be enclosed in a single set of <SPEAKER> tags, like this: <SPEAKER>Mr. Jones and Mrs. Smith.</SPEAKER>
Speeches. The basic unit of drama is the SPEECH (<SP>). A speech normally continues uninterrupted as long as the character speaking it is uninterrupted by another speaker or by the end of a division (act, scene, etc.). If a speech begins or ends in the middle of a verse line or stanza, break the line or stanza: i.e., treat it as two lines (or stanzas), one in one <SP> and one in the next. The PART attribute on L allows such broken lines to be reconstructed, but we do not normally require its use.
"Songs" and other material specially set off within a speech should not normally be given any special tagging; if they have headings, they may need to be recorded as a nested <LG>. In exceptional cases when they contain an elaborate structure they may be recorded as a quotation (<Q>).
Prologues and Epilogues should normally be treated as part of the play, recorded as <SP>s like any other speech, though they may sometimes require a numbered <DIV> of their own.
Acts and Scenes. The act/scene structure should be recorded with appropriately TYPEd and numbered <DIV>s (e.g., <DIV2 TYPE="act" N="3"><HEAD>ACT III</HEAD><DIV3 TYPE="scene" N="4"><HEAD>Scene iv</HEAD><SP>...).
Personal letters that appear as text divisions should be treated as <DIV>s just like any other text division (chapters, sections, etc.). Letters quoted within running text (e.g. a letter quoted within the chapter of a book) should be treated like any quoted and inserted document, using either <Q><TEXT><BODY><DIV1 TYPE="letter"> or with the equivalent shortcut, <INSERT TYPE="letter">.
Note that dedications frequently look like letters, since they contain salutations and signatures, but they're not: treat them as <DIV TYPE="dedication">. (You may, however, still use <OPENER> <CLOSER> <SIGNED> <SALUTE> etc. in such letter-like divisions, if they apply.)
Affaicter. To trim, tricke, decke, dresse curiously, make neat, spruce, fine; to refine; also, to tame, reclaime, breake, make gentle, bring to ciuilitie.
Affaicter vn oiseau. To man a hauke throughly.
Affaicterie: f. A trimming, tricking, decking, neat, quaint, or fine dressing; also, neatnesse, nicenesse, curiositie, quaintnesse; also, a breaking, taming, reclayming, ciuilizing, making gentle; (hence) also, the through manning of a hauke, &c.
...can be recorded like this. The encoding of the phrasal subentry for "Affaicter vn oiseau" with a <DIV2> is probably superfluous in this case (a new paragraph with a <HI> heading would do as well); it is encoded more thoroughly here as an example of what can be done with more complex entries if necessary.
<DIV1 TYPE="entry"><HEAD>Affaicter.</HEAD> <P>To trim, tricke, decke, dresse curiously, make neat, spruce, fine; to refine; also, to tame, reclaime, breake, make gentle, bring to ciuilitie.</P> <DIV2 TYPE="subentry"> <HEAD>Affaicter vn oiseau.</HEAD> <P>To man a hauke throughly.</P> </DIV2></DIV1> <DIV1 TYPE="entry"> <HEAD>Affaicterie: f.</HEAD> <P>A trimming, tricking, decking, neat, quaint, or fine dressing; also, neatnesse, nicenesse, curiositie, quaintnesse; also, a breaking, taming, reclayming, ciuilizing, making gentle; (hence) also, the through manning of a hauke, &c.</P></DIV1>
In general, punctuation should be retained, but its spacing somewhat regularized. When a colon, semicolon, comma, question mark, closing quotation mark, or period falls between words, place a space after it, but none before it (unless it is being used to set off a number, like this: .lxvi. or .45. in which case it should be spaced as shown; that is, the periods should "hug" the number at front and back, without spaces). When an opening quotation mark falls between words, place a space before it, but none after it. When a virgule falls between words, place a space before and after it. In case of doubt, follow the spacing system of the original as best you can.
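The spacing rule above can be illustrated with a minimal Python sketch. It is not part of the specification: the function name `regularize_spacing` is hypothetical, and the sketch assumes plain text with no SGML tags or character entities (a semicolon closing an entity would be mishandled). It covers only the comma, semicolon, colon, question mark, and virgule; periods and quotation marks are left alone because they need the judgment calls described above.

```python
import re

# Marks that take one space after and none before when they fall between words.
# Periods and quotation marks are deliberately excluded from this sketch:
# periods may "hug" numbers (.lxvi.) and HEX 22 is ambiguous between
# opening and closing quote.
_AFTER_ONLY = re.compile(r"(?P<before>\w)\s*(?P<mark>[,;:?])\s*(?=(?P<after>\w))")
_VIRGULE = re.compile(r"(?P<before>\w)\s*/\s*(?=(?P<after>\w))")


def _between_numerals(match):
    """True when the characters on both sides of the mark are digits."""
    return match.group("before").isdigit() and match.group("after").isdigit()


def regularize_spacing(text):
    """Regularize spacing around a few punctuation marks (assumed plain text)."""

    def fix_mark(match):
        if _between_numerals(match):
            return match.group(0)  # e.g. 8:28 is left as keyed
        return f'{match.group("before")}{match.group("mark")} '

    def fix_virgule(match):
        if _between_numerals(match):
            return match.group(0)  # fractions and dual dates: 23/47, 12/22
        return f'{match.group("before")} / '

    text = _AFTER_ONLY.sub(fix_mark, text)
    return _VIRGULE.sub(fix_virgule, text)
```

Marks standing between two numerals are left untouched, on the assumption that they are references, fractions, or dates rather than punctuation between words.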
Record the various forms of colon, period, comma, semicolon, and virgule (slanted line) with their modern keyboard equivalents ( : . , ; / ); a vertical bar should be recorded using the &verbar; entity (since we have reserved the keyboard character for another purpose).
Question marks vary considerably in form (some of them looking like inverted semicolons); record them all with the standard "?"
Opening and closing double quotation marks should be captured consistently: either (preferably) both should be recorded using the ordinary keyboard double-quote character (" = HEX 22); or the opening quotes should be distinguished from the closing quotes using the &ldquo; and &rdquo; entities.
Opening and closing single quotation marks, as well as apostrophes, should likewise be recorded consistently, either all with the same character, the ordinary keyboard single-quote character (' = HEX 27), or using &lsquo; etc.
Hyphens (and figure-dashes but not other dashes) should normally be recorded using the ordinary hyphen character.
Hyphens at the end of a line should be recorded as the ordinary keyboard "pipe" (vertical bar) character, unless they appear between numerals, when they should be recorded with the ordinary hyphen.
If there is no end-of-line hyphen, but you think that there should have been (i.e., that a single word has been broken across two lines), place a plus sign, instead of a space, between the two halves: "cro+wn" "pri+nce". We recognize that since this requires interpretation of the text, it must remain an optional instruction subject to the discretion of the vendor.
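The end-of-line hyphen rule can be sketched in Python as follows. This is an illustration only: `mark_eol_hyphens` is a hypothetical helper, the input is assumed to be a list of printed lines in order, and the optional plus-sign convention is not attempted because it requires interpretation of the text.

```python
def mark_eol_hyphens(lines):
    """Replace an end-of-line hyphen with "|" unless it stands between numerals.

    `lines` is assumed to be the printed lines in reading order.
    Trailing whitespace on each line is dropped.
    """
    out = []
    for i, line in enumerate(lines):
        stripped = line.rstrip()
        if stripped.endswith("-"):
            before = stripped[-2:-1]  # character just before the hyphen, if any
            following = lines[i + 1].lstrip()[:1] if i + 1 < len(lines) else ""
            # A hyphen between numerals stays an ordinary hyphen.
            if not (before.isdigit() and following.isdigit()):
                stripped = stripped[:-1] + "|"
        out.append(stripped)
    return out
```

For example, `mark_eol_hyphens(["company confirmes reso-", "lutions"])` gives `["company confirmes reso|", "lutions"]`, while a line ending `12-` followed by a line beginning `14` keeps its hyphen.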
"Figure" dashes (dashes between numerals) may be recorded using the standard keyboard hyphen character
Other dashes should be recorded using the entity &mdash;, regardless of where they appear, or how long they are.
The "minus" sign (−), if it can be distinguished from the m-dash and hyphen, should be recorded with a character entity (&minus;).
The "times" (multiplication) sign (×), if it can be distinguished from the "X", should be recorded with a character entity (&times;).
Ellipses, whether two characters or many (strings of dots or asterisks indicating omitted or missing text), should be recorded either as ordinary text, using periods or asterisks as appropriate ( . . . . . * * * * * . . . ), or as special ellipsis characters or entities (e.g. &hellip;), so long as one practice or the other is followed consistently.
Some books mark extended quotations by placing quotation marks at the beginning of every quoted line. The same technique is used in other books to mark proverbs and other sententious remarks. E.g.,
he made reasons...seyenge: God made alle thynges
" by reason, and governethe thynges
" made by reason; the sterres be movede by reason; and so
" oure naturalle lyfe excedynge from reason by slawthe and
" ignoraunce awe to be reducede by lawes and reasons.
" Wherefore thau3he there be somme thynges in the rule of
" seynte Benedicte, the intellect of whom the dullenesse of my
" mynde may not comprehende, y suppose hit be beste to 3iffe
" credence to auctorite. Wherefore also he persuadeth hymselfe ...
O no (said Cecropia) company confirmes reso-
" lutions, & lonelines breeds a werines of ones thoughts,
" and so a sooner consenting to reasonable profers.
In prose, record the first and last of the marginal quotation marks with the special entities &startq; (first mark) and &endq; (last mark). If there is only one such marginal quotation mark (as sometimes happens with short quotations or proverbs), use both entities in sequence (&startq;&endq;).
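The prose rule can be sketched in Python as follows. The helper `record_marginal_quotes` is hypothetical, and the sketch makes three assumptions that the text does not state outright: keyed lines arrive as (has_mark, text) pairs, each call handles one contiguous quoted passage, and marks between the first and the last are simply not recorded.

```python
def record_marginal_quotes(lines):
    """Record the first and last marginal quotation marks of one prose passage.

    `lines` is a list of (has_mark, text) pairs, where has_mark says whether
    the printed line begins with a marginal quotation mark.
    Assumption: intermediate marks are dropped.
    """
    marked = [i for i, (has_mark, _) in enumerate(lines) if has_mark]
    out = []
    for i, (_, text) in enumerate(lines):
        prefix = ""
        if marked and i == marked[0]:
            prefix += "&startq;"
        if marked and i == marked[-1]:
            prefix += "&endq;"  # a lone mark yields &startq;&endq;
        out.append(prefix + text)
    return out
```

A passage with a single marked line comes out with both entities in sequence, matching the short-quotation case described above.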
In verse, simply record the quotation marks using the " character as it appears in the print, preferably followed by a space to distinguish it from other uses of ".
The verse marks are left alone. The prose marks are removed and the marked block is resolved either into a block quotation using <Q> or into a highlighted section using <HI REND="marginal quotes">.
Braces and brackets that group multiple lines should be ignored if all they do is group portions of ordinary running text, such as poetry. But if they are used to link one piece of text to another, such as frequently in tables and lists, their meaning needs to be interpreted. Sometimes this will require entering text more than once, e.g. if the brace means "this word applies to all these other words," the easiest technique may be simply to apply the word to all of the other words by entering it as many times; sometimes it may require treating the single item as a head or label for a list containing the grouped items; sometimes it may involve attaching a ROWS or COLS attribute to a table <CELL>. Many variations are possible, which the following examples can only suggest.
chapter { 1 How to build a kite
        { 2 When to fly a kite
        { 3 Famous kite flyers of our time
        { 4 When not to fly a kite
        { 5 "I've flown it: now what?"
(Brace used like "ditto" mark to associate one word repeatedly with a series of items; may be recorded as follows, by repeating the word:)
<LIST> <LABEL>chapter 1</LABEL> <ITEM>How to build a kite</ITEM> <LABEL>chapter 2</LABEL> <ITEM>When to fly a kite</ITEM> <LABEL>chapter 3</LABEL> <ITEM>Famous kite flyers of our time</ITEM> <LABEL>chapter 4</LABEL> <ITEM>When not to fly a kite</ITEM> <LABEL>chapter 5</LABEL> <ITEM>"I've flown it: now what?"</ITEM> </LIST>
Dramatis Personae
townspeople { Joe
            { Mary
            { Bothom
            { Josephus
Joan, a noblewoman
John, a philosopher
(Brace used to associate one item as a head of a set of other items; may be recorded as follows, placing the one item in a <HEAD> tag and the list of items in <LIST> and <ITEM> tags:)
<LIST> <HEAD>Dramatis Personae</HEAD> <ITEM>townspeople <LIST> <ITEM>Joe</ITEM> <ITEM>Mary</ITEM> <ITEM>Bothom</ITEM> <ITEM>Josephus</ITEM> </LIST> </ITEM> <ITEM>Joan, a noblewoman</ITEM> <ITEM>John, a philosopher</ITEM> </LIST>
(Brace used in a table to place one cell in conjunction with a set of other cells; may be recorded using the COLS or ROWS attribute of the <CELL> tag:)
<TABLE> <ROW> <CELL>In apice trianguli.</CELL> <CELL ROWS="3">Triangulus.</CELL> </ROW> <ROW> <CELL>In basi praecedens 3.</CELL> </ROW> <ROW> <CELL>Sequens & vltima. 3.</CELL> </ROW> </TABLE>
Basic letter forms. We assume that most letters encountered will belong to the modern standard Latin alphabet, though their appearance may be strange. Books from different periods will raise peculiar issues best addressed individually. Here, for example, are some considerations that apply especially to the capture of early printed books:
Ligatures. Ligatured characters may be variously treated, so long as firm control of the character inventory is maintained and all representations can be readily resolved to combinations of single characters.
This is an "oe": [image]
These are all "ae": [images]
The common form of the "ss" ligature that consists of a tall-s followed by a short-s has sometimes caused problems in recognition. Here are two examples:
[image] = possibility
[image] = Passion
Fractions. For the fifteen common fractions listed in either ISOpub or ISOnum (namely: ½, ¼, ¾, ⅛, ⅜, ⅝, ⅞, ⅓, ⅔, ⅕, ⅖, ⅗, ⅘, ⅙, ⅚), use the entity. Otherwise, simply use the "front slash" (virgule) character between the numbers (e.g., 23/47).
NOTE: Some documents use dual dates (e.g. "12/22 Dec. 1635") because of the discrepancy of ten days between the calendars of different countries caused by the adoption of the Gregorian calendar. These are not really fractions at all, though they look like fractions; they should always be recorded using the "slash" method: 12/22. Likewise dual-year dates (e.g. 1651/2 or 1667/68) are frequently printed so that the end of the date looks like a fraction. Again, it is not; these should always be captured using the slash (1651/2; 1667/68).
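The fraction rule can be sketched in Python as follows. The function `record_fraction` is a hypothetical helper; the entity names follow the ISO pattern `frac` plus numerator plus denominator (e.g. `&frac12;`, `&frac58;`), which is an assumption to check against the entity sets actually declared for a project.

```python
# The fifteen common fractions listed in ISOnum and ISOpub.
COMMON_FRACTIONS = {
    (1, 2), (1, 4), (3, 4),
    (1, 8), (3, 8), (5, 8), (7, 8),
    (1, 3), (2, 3),
    (1, 5), (2, 5), (3, 5), (4, 5),
    (1, 6), (5, 6),
}


def record_fraction(numerator, denominator):
    """Return the entity for a common fraction, else the slash form."""
    if (numerator, denominator) in COMMON_FRACTIONS:
        return f"&frac{numerator}{denominator};"
    return f"{numerator}/{denominator}"
```

Dual dates such as 12/22 or 1651/2 should never be passed to a helper like this: they only look like fractions and are always keyed with the slash.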
Ampersands, whether shaped like & or like "7," should be recorded as &amp;.
"Old-style" roman numerals. Of the letters used commonly in Roman numerals (I V X L C D M), two, namely "M" and "D," can appear in a variant form that makes use of an extra character that resembles a backwards-facing letter "c," combined with "I" and regular "c". E.g., this means "M.D.C.": [image] (Since I can't represent a backwards "c" on the keyboard, I'll use "(" for "c" and ")" for backwards-c in what follows.) "(I)" is a variant form of "M"; "I)" is a variant form of "D". (If you look closely, you'll see that "(I)" almost looks like an "M" and "I)" almost looks like a "D".) When you find this style of Roman numerals, represent the combination "(I)" as "M" and "I)" as "D". For further examples, see the document on roman numerals.
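Using the keyboard stand-ins introduced above ("(" for "c" and ")" for backwards-c), the substitution can be sketched in Python. The function name `normalize_old_style_numeral` is hypothetical, and the sketch covers only the two variant forms described here; further forms in the document on roman numerals are not handled.

```python
def normalize_old_style_numeral(keyed):
    """Map the old-style forms "(I)" and "I)" to "M" and "D".

    "(I)" must be replaced first: it contains "I)", which would
    otherwise be turned into "D" and leave a stray "(" behind.
    """
    return keyed.replace("(I)", "M").replace("I)", "D")
```

With this ordering, the keyed string `(I).I).C.` resolves to `M.D.C.`.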
Letters printed upside-down (a common printer's error), if recognized, should be recorded as if turned right side up. Displaced type of any sort should be put back where it belongs, if possible.
Reserved characters
Diacritics. Recognizable letters with diacritics should be recorded using the standard ISO character entity, if available; or if not, composed from the base character plus the appropriate diacritic(s) from the ISO diacritics set. [if using Unicode, prefer the precomposed characters to the multi-byte composed characters.]
An abbreviation stroke over two or more contiguous letters (whether or not it crosses an upright stroke on one of the letters) should be treated as a generic abbreviation mark; i.e., it should not be recorded as a character at all, but the entire word should be placed within <ABBR> tags. Roman numerals (as in dates) are sometimes "overlined" in whole or in part. Do not record the overlining as such, but place the entire numeral within <ABBR> tags. See also the special document on Roman numerals.
Abbreviation symbols. A number of abbreviation symbols, mostly based on ordinary letters, are distinctive enough and consistent enough in appearance to be recognized. Each should be recorded with its own character entity. These should be rare to nonexistent in books after the seventeenth century.
The following table illustrates the commonest abbreviation symbols. More may be added later. Note that some have conditions attached; e.g., the "q3"- or "q;"-like symbol illustrated below means "-que" when it appears at the end of a word, but means something quite different (e.g. "quam," especially if it has a stroke over it) when it stands alone. It should therefore be recorded as &abque; only when it appears at the end of a word.
Symbol | Record as: | Meaning | Examples: | conditions: |
---|---|---|---|---|
 | &abper; | per, par | | |
 | &abpro; | pro | | |
 | &abus; | -us | | at the end of a word only |
 | &abque; | -que | | in Latin at the end of a word only; may appear as a separate word in French; to be distinguished from "Esq;" abbreviation for "esquire." |
 | &abquod; | quod/quoth | | |
 | &absed; | sed | | only when forming a word by itself |
 | &abser; | ser | | .. |
 | &abcon; | con- cum- | | at the beginning of a word only |
 | &abrum; | -rum | | at the end of a word only |
 | &abis; | -is | | at the end of a word only |
Letters from other alphabets, e.g. Hebrew and Greek, when used singly (as opposed to in whole words or extended text) should be recorded with ISO standard character entities.
Other symbols include alchemical and astrological symbols, which will rarely if ever appear as part of words, but may appear in or as marginal notes, in designations of units of measure, in calendrical tables, etc.
A selection follows.
Symbol | Example | Meaning | Record as |
---|---|---|---|
Zodiacal signs | | | |
 | | Aries | &Aries; |
 | | Taurus | &Taurus; |
 | | Gemini | &Gemini; |
 | | Cancer | &Cancer; |
 | | Leo | &Leo; |
 | | Virgo (may also appear as abbreviation for "minim" ('drop') in medical recipes) | &Virgo; |
 | | Libra | &Libra; |
 | | Scorpio (may also appear as abbreviation for "minim" ('drop') in medical recipes) | &Scorp; |
 | | Sagittarius | &Sagitt; |
 | | Capricorn | &Capri; |
 | | Aquarius | &Aquar; |
 | | Pisces | &Pisces; |
Planetary signs (used in alchemy also for corresponding metals) | | | |
 | | Sun (or gold) | &Sun; |
 | | Moon (or silver) | &Moon; |
 | | Mercury (the planet or the metal) | &Merc; |
 | | Venus (or copper) | &Venus; |
 | | Earth (the planet) | &Earth; |
 | | Mars (or iron) | &Mars; |
 | | Jupiter (or tin) | &Jupit; |
 | | Saturn (or lead) | &Saturn; |
Apothecaries' symbols | | | |
 | | ounce (apothecaries' unit of measure) | &ounce; |
 | | dram or drachm (apothecaries' unit of measure) | &dram; |
 | | scruple (apothecaries' unit of measure) | &scruple; |
 | | "Recipe" ('take ...') in recipes and prescriptions | &rx; (from ISOpub) |
 | | "Semis" ('half') with units of measure | ss (not really a symbol, just the ordinary letter "s" doubled; the second, variant form is rare and should perhaps be marked by <ABBR> tags around the basic "ss" capture.) |
Alchemical signs | | | |
 | | antimony | &antimony; |
 | | sal armoniac (in (al)chemical contexts only) | &salarmon; |
 | | fire (in (al)chemical contexts only) | &fire; |
 | | water | &water; |
 | | earth (the element) | &earth; |
 | | subli- (forming words like "sublimate") | &absubli; |
 | | precip- (forming words like "precipitate") | &abprecipi; |
 | | sulphur or sulphu- (forming words like 'sulphuris') | &sulphur; |
 | | oil or oleum | &oil; |
 | | tartar (tartrate? tartaric acid?) | &tartar; |
 | | vitriol (sulphuric acid) or vitrio- (forming words like 'vitriolata') | &vitriol; |
 | | salt | &salt; |
 | | nitre or saltpetre (potassium nitrate) | &nitre; |
Other signs | | | |
 | | cross (any variety: Greek, Latin, Maltese) | &cross; |
 | | capitulum (paragraph) | &para; |
 | | right-pointing index finger (left-pointing finger also found) | &rindx; &lindx; |
A number of options are available in the vendor dtd which are not described here except as entries in the summary list, since they are not part of normal coding specification. These include:
In addition, only minimal or severely restricted instructions are given above for the following elements, which are capable of much wider application than we are accustomed to giving them: