Key in the text of Knight's American Mechanical Dictionary at an accuracy rate of 1 error in 20,000 characters or better (99.995% accuracy rate). Mark it up in SGML as instructed, so as to produce a searchable textbase. Accuracy will be judged on the basis of samples (minimum: 5% of pages) proofread by University of Michigan staff.
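As a rough illustration of the accuracy arithmetic above, the following Python sketch computes how many errors a proofread sample may contain and still meet the stated rate. The function names and the sample size are hypothetical; how samples are actually drawn and judged is up to University of Michigan staff.

```python
# Threshold stated in the specification: at most 1 error per 20,000 characters.
CHARS_PER_ALLOWED_ERROR = 20_000

def max_allowed_errors(sample_chars: int) -> int:
    """Largest error count that still meets the 1-in-20,000 rate for a sample."""
    return sample_chars // CHARS_PER_ALLOWED_ERROR

def meets_accuracy(sample_chars: int, errors_found: int) -> bool:
    """True if the observed error rate is 1 in 20,000 or better."""
    # Integer comparison avoids floating-point rounding at the boundary.
    return errors_found * CHARS_PER_ALLOWED_ERROR <= sample_chars

# Hypothetical proofread sample of 500,000 characters.
print(max_allowed_errors(500_000))   # 25
print(meets_accuracy(500_000, 25))   # True
print(meets_accuracy(500_000, 26))   # False
```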
The item to be converted is a three-volume, alphabetically arranged encyclopedia of the mechanical arts, stored on our server as 400dpi gray-scale image files. Vendor access to the data is to be arranged (see last section, below).
The conversion product will consist of one or more (as convenient) text files, marked up in valid SGML in accordance with our proprietary dtd and keying instructions, and delivered via ftp or CD-ROM (preferably the former).
Files will be returned with a consistent system doctype declaration or with this public doctype declaration: <!DOCTYPE KNIGHT PUBLIC "-//UMDLPS//DTD knight 1.0//EN">
The original consists of about 2800 images, one page per image. The output text will probably amount to 10-15 MB of data.
<KNIGHT> (the entire work) should be subdivided into three <VOL>s (with N attribute = "1" "2" or "3"). Each <VOL> will contain <FRONT> matter preceding the <BODY>. The entries go in <BODY>; everything else goes in <FRONT>.
The <BODY> of each volume (the run of entries) is subdivided into alphabetical sections, each marked by a simple heading (<HEAD>): "A", "B", etc. Tag each alphabetical section as a <PART> of the <BODY> (with N attribute = "A", "B", etc.). If a section is interrupted by the volume break, simply close the <PART> and reopen it where it resumes in the next volume. If a volume break interrupts an entry, move the tail end of the entry forward so that the entry is complete before the </PART> </BODY> </VOL>.
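The nesting described above can be outlined with a small Python sketch. It prints only the structural tags named in the text (KNIGHT, VOL, FRONT, BODY, PART); headings, entries and front matter are left out and shown as "...". The function name and the split of letters across volumes are invented for illustration and do not reflect where the real volume breaks fall.

```python
def knight_skeleton(volumes: dict) -> str:
    """Outline <KNIGHT> with its <VOL>, <FRONT>, <BODY> and <PART> tags.

    `volumes` maps a volume number to the alphabetical sections it contains.
    """
    lines = ["<KNIGHT>"]
    for vol_no, letters in volumes.items():
        lines.append(f'<VOL N="{vol_no}">')
        lines.append("<FRONT> ... </FRONT>")      # front matter precedes the body
        lines.append("<BODY>")
        for letter in letters:
            lines.append(f'<PART N="{letter}"> ... </PART>')
        lines.append("</BODY>")
        lines.append("</VOL>")
    lines.append("</KNIGHT>")
    return "\n".join(lines)

# Hypothetical split: section C is interrupted by the break between volumes 1
# and 2, so it is closed in volume 1 and reopened in volume 2.
print(knight_skeleton({1: ["A", "B", "C"], 2: ["C", "D"], 3: ["E"]}))
```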
As an encyclopedia in dictionary form, Knight's Dictionary is arranged primarily by the entry. The entry (<E>) will therefore be the chief element of the markup. Each entry is divided into a headword (<HW>) and a definition (<DEF>).
Typical simple entries
Transcription:
<E> <HW>A`ces.</HW> <DEF> <P><LABEL>Nautical.</LABEL> Hooks for the chains.</P> </DEF> </E>
<E> <HW>A-cet`i-fi-er.</HW> <DEF> <P>An apparatus for exposing cider, wort, or other wash to the air to hasten the acetification of the fermented liquor. See <REF>GRADUATOR</REF>.</P> </DEF> </E>
<E> <HW>Ac`e-tim`e-ter.</HW> <DEF> <P>See <REF>ACIDIMETER</REF>.</P> </DEF> </E>
<E> <HW>Ac`e-tom`e-ter.</HW> <DEF> <P>A hydrometer suitably graduated for ascertaining the strength of acetic acid and vinegar.</P> </DEF> </E>
<E> <HW>Ach`ro-mat`ic Con-dens`er.</HW> <DEF> <P>An achromatic lens or combination used to concentrate rays upon an object in a microscope. See <BIBL><I>Carpenter on the Microscope,</I> pp. 117-119, ed. 1857.</BIBL></P> </DEF> </E>
[Sample and transcription: simple entry containing a simple list]
More elaborate lists follow the same pattern. For examples of lists with recommended transcriptions, see http://www.umich.edu/~pfs/knight/dox/lists.html
[Sample and transcription: simple entry containing a simple table]
More elaborate tables follow the same pattern. For an example of a more complex set of tables, with recommended transcription, see http://www.umich.edu/~pfs/knight/dox/tables.html
Aside from the basic sequence of HEADWORD-DEFINITION, the internal arrangement of material in the entries can differ considerably. Labels always (?) appear within paragraphs, but lists, tables, and illustrations can appear within paragraphs or between them; lists can contain other lists; etc.
A number of entries contain block quotations, both in prose and in verse. These should be tagged with <Q>; if the source is given, this should be tagged as <BIBL>.
[Sample and transcription: entry containing a block quotation in verse]
A very few entries contain a more elaborate internal structure, e.g. sets of tables, where the sets themselves are organizational units with headings. For this (rare) case of complex objects inserted within an entry, we have provided an <INSERT> tag which may contain P LIST TABLE etc., or may be subdivided into <DIV1>s (and those if necessary into <DIV2>s).
Each volume of the book contains some 'front matter' that does not fall within entries: title page, preface, lists of plates, etc. This material should be tagged as <DIV1>s of <FRONT>, e.g. the title page as <DIV1 TYPE="title page">, the preface as <DIV1 TYPE="preface">, but the DIVs themselves are internally tagged using ordinary <P> <LIST> <TABLE> etc. as appropriate--no special tags for author or title, etc.
For a sample capture of front matter, see http://www.umich.edu/~pfs/knight/dox/front.html
All information arising from the division of the book into pages should be recorded as attributes of the <PB> tag. The numerical designation of the image file should be captured in the REF attribute; the printed page number (if any) should be captured in the N attribute; and the running headers should be captured in the "HEAD" attribute. On most pages, running headers will consist of a pair of words or phrases, one over each column. Include both, separated by a space.
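The attribute rules above can be sketched as a small Python helper that assembles a <PB> tag and leaves out any attribute for which the page supplies no information. The function name and all sample values (image id, page number, running headers) are invented; the sketch also assumes that attribute values contain no double-quote characters.

```python
from typing import Optional, Sequence

def pb_tag(ref: str, n: Optional[str] = None, heads: Sequence[str] = ()) -> str:
    """Build a <PB> tag; attributes with nothing to record are left out."""
    attrs = [f'REF="{ref}"']            # numerical designation of the image file
    if n:
        attrs.append(f'N="{n}"')        # printed page number, if the page has one
    if heads:
        running_head = " ".join(heads)  # one header per column, separated by a space
        attrs.append(f'HEAD="{running_head}"')
    return "<PB " + " ".join(attrs) + ">"

# Hypothetical values throughout.
print(pb_tag("0123", "25", ["ACETIMETER.", "ACHROMATIC CONDENSER."]))
print(pb_tag("0002"))  # a page with no printed number and no running header
```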
The <PB> tag itself should be placed as near as possible to the point at which the page actually 'breaks': that is, at the point at which the reader would turn the page, insert the <PB> tag for the new page.
When a new division of some kind begins on the new page, move the <PB> tag forward as far as it needs to go in order for the file to be valid. This means tucking it into the first content-bearing element (e.g. E or DIV1).
Placement of <PB> tags. The rules are: (1) "Pages always break at the top"; that is, <PB> tags will be inserted in the text at the actual location of the page break, regardless of the location of the page number on the printed page. (2) "Words cannot break at page breaks"; that is, if a hyphenated word straddles a page break, finish the word and any attached punctuation, then insert the tag. And (3) "Divisions begin at page breaks; they don't end there"; that is, if a structural break of some kind coincides with the page break (e.g., if a new section, paragraph, etc., begins at the head of the new page), the <PB> tag should be tucked inside the opening tag for the new division, NOT inside the closing tag for the old division.
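Rule (2) can be illustrated with a short Python sketch that joins the text on either side of a page break so that no word straddles the tag. The function name, its parameters and the sample text are hypothetical. Whether the hyphen itself is kept depends on whether it marks a real compound, which the keyer must judge; the drop_hyphen flag only stands in for that decision.

```python
import re

def place_pb(before: str, after: str, pb: str, drop_hyphen: bool = True) -> str:
    """Join text around a page break so that no word straddles the <PB> tag."""
    before = before.rstrip()
    after = after.lstrip()
    if before.endswith("-"):
        # Finish the word first: take the first whitespace-delimited token of the
        # new page, which carries any attached punctuation with it.
        match = re.match(r"\S+", after)
        tail = match.group(0) if match else ""
        rest = after[len(tail):].lstrip()
        stem = before[:-1] if drop_hyphen else before
        return f"{stem}{tail} {pb} {rest}".rstrip()
    # No broken word: the tag goes exactly where the page breaks.
    return f"{before} {pb} {after}".rstrip()

# Hypothetical page break falling inside the word "liquor".
print(place_pb("of the fermented li-", "quor. See GRADUATOR.", '<PB REF="0124">'))
```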
Provide attribute values (e.g. the three attributes of the PB element; the N attribute of FIGURE PLATE VOL and PART) only when instructed to and when there is specific information to supply. Do not supply values of this sort: TYPE="unknown" or TYPE="unspecified".
The work includes both full-page plates and over 5000 individual illustrations. Tag the full-page plates as <PLATE> and the other illustrations as <FIGURE>. If a figure (or more likely a plate) contains other numbered or captioned figures within it, tag each subordinate illustration as a <FIGURE> tag within the <PLATE> tag.
"Figure 45" ... Most of the illustrations and plates are headed by a number (roman numeral for plates, arabic numeral for figures). Capture this number as the N attribute of the <FIGURE> or <PLATE> tag.
Captions and other text. Capture captions or caption-like material attached to an illustration as a <HEAD> tag within the <FIGURE> (or <PLATE>) tag. Other text that pertains to the whole illustration (excluding, e.g., lettered references to various pieces and parts of the picture) may be tagged using <P> within <FIGURE> (or within <PLATE>). "See" references on a plate ("See truss, p. 43"), for example, should be captured in a <P>.
Follow this link for a sample capture of a simple plate in context:
http://www.umich.edu/~pfs/knight/dox/plate.html
[Sample and transcription: entry with multiple illustrations]
[Sample and transcription: another illustrated entry (excerpt)]
Text sequence. All pages should be captured in the order they were intended to be read. In most cases this is straightforward: down the lefthand column from top to bottom and then down the righthand column from top to bottom. On some pages, however, especially where the page is interrupted by a large illustration or table, the sequence is disrupted. It may, for example, proceed halfway down the left column, then halfway down the right, then back to the lower half of the left column, followed by the lower half of the right. Follow the order that makes sense and that honors the integrity of the entries.
Typical problem with text sequence
Illustration placement. Most illustrations appear within the entry that they are intended to illustrate, and they should be recorded so. But often the physical arrangement of the page dictated that the illustration appear outside of the entry that it illustrated. In that case, if you can determine which entry it really belongs to (using the information in the caption), move the <FIGURE> tag so that it appears within the correct <E>ntry. It is important that so far as possible, the figures illustrating a given entry appear within that entry. (This applies to <FIGURE>s only: leave <PLATE>s where they are.)
For samples of these two typical problems, follow this link:
http://www.umich.edu/~pfs/knight/dox/probs.html.
Occasionally, illustrations appear after a line of text that ends with a hyphenated word. Treat hyphenated words in this situation just as you do when they appear at page breaks. That is: words should not be broken by <FIGURE>s. If a hyphenated word precedes an illustration, first finish the word and any attached punctuation, and then insert the FIGURE tag.
Italics. Capture text in italics using the <I> tag. Omit the <I> tag if you are already capturing the distinctiveness of the italicized portion using some structural tag, e.g. <LABEL>.
Small caps. Capture text in small caps as if it were in regular caps.
End-of-line hyphens. If feasible, selectively remove end-of-line hyphens where they were introduced simply in order to justify the line, leaving only the hyphens that mark real compounds. If this is not judged feasible, record all end-of-line hyphens as "|" (the keyboard 'pipe' character = HEX 7C) instead of "-", and reserve "-" for hyphens that appear other than at the end of the line. (In the latter case, capture any real vertical bar or 'pipe' character with the entity &verbar;.)
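The fallback convention (every end-of-line hyphen keyed as "|") can be sketched in Python as follows. This is only an illustration under two assumptions that the text does not settle: that a literal vertical bar is written with the ISOnum entity &verbar;, and that the two halves of a broken word are joined with no space after the pipe. The function name and sample lines are invented.

```python
def mark_eol_hyphens(lines: list) -> str:
    """Join printed lines, recording every end-of-line hyphen as '|'."""
    pieces = []
    for line in lines:
        # Protect literal pipes first so they cannot be mistaken for hyphen markers.
        line = line.strip().replace("|", "&verbar;")
        if line.endswith("-"):
            pieces.append(line[:-1] + "|")   # assumed: next line follows directly
        else:
            pieces.append(line + " ")        # ordinary line break becomes a space
    return "".join(pieces).rstrip()

# Hypothetical printed lines; the mid-line hyphen in "well-known" is left alone.
print(mark_eol_hyphens(["A well-known appa-", "ratus for exposing", "cider to the air."]))
```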
Dashes. Capture dashes as &mdash;.
Quotation marks, apostrophes. Use the simple upright keyboard quotation-mark and apostrophe characters (except as noted below under 'syllabification'). Do not attempt to distinguish between 'opening' and 'closing' quotation marks.
Non-ASCII characters. Capture non-ASCII characters using the standard ISO character entities (ISOpub, ISOtech, ISOnum, ISOlat1, ISOlat2).
Unrecognized symbols. Unrecognized symbols should be captured as "#".
Individual non-roman letters. Individual letters in Greek should be captured using the ISO Greek-1 entity set. E.g. (for the cases that I noticed):
Individual letters in other non-Roman alphabets or writing systems (Hebrew, Chinese, Arabic, etc.) should be recorded as unrecognized symbols with #.
Mathematical operators (+ - = etc.). Use the keyboard character for = . Otherwise, capture as special characters using standard character entities, e.g. as follows.
Illegibility. If you cannot read a letter, capture it as "$". If you cannot read an entire word, capture it as "$word$"; any larger illegible section should be captured as "$span$".
Dollar signs. Since we use "$" to mark illegibilities, capture a literal dollar sign using the entity &dollar;.
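Because "$" doubles as the illegibility marker, a keyed file can be checked mechanically for the three marker forms. The Python sketch below is one possible check, not part of the keying requirements; the function names and sample text are invented, and the entity name &dollar; (from ISOnum) is an assumption about how a literal dollar sign is written.

```python
import re

# Longer markers must be tried before the bare "$", or "$word$" would be
# counted as two illegible letters.
MARKER = re.compile(r"\$word\$|\$span\$|\$")

def illegibility_report(keyed_text: str) -> dict:
    """Count illegible letters, words and larger spans in keyed text."""
    counts = {"letter": 0, "word": 0, "span": 0}
    for match in MARKER.finditer(keyed_text):
        token = match.group(0)
        if token == "$word$":
            counts["word"] += 1
        elif token == "$span$":
            counts["span"] += 1
        else:
            counts["letter"] += 1
    return counts

def escape_dollar(printed_text: str) -> str:
    """Replace literal dollar signs in the printed text (assumed entity: &dollar;)."""
    return printed_text.replace("$", "&dollar;")

# Hypothetical keyed line.
print(illegibility_report("The $word$ is 3 fe$t long; $span$ cost &dollar;5."))
```

Literal dollar signs have to be replaced before any illegibility markers are added; otherwise the two uses cannot be told apart afterwards.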
Syllabification and stress in head-words. Headwords (HWs) are printed using hyphens and vertical accent marks (') to indicate pronunciation. Capture the hyphens as ordinary hyphens (-), but capture the accent marks using the 'back-tick' character (`).
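The headword convention amounts to a single character substitution, sketched here in Python. The function names are hypothetical. The second helper, which strips the pronunciation marks to give a plain search form, is an extra illustration that assumes every hyphen in the headword is a syllable break; that will not hold for headwords that are real hyphenated compounds.

```python
def key_headword(printed: str) -> str:
    """Replace printed accent marks (') with back-ticks; hyphens stay as they are."""
    return printed.replace("'", "`")

def plain_form(keyed: str) -> str:
    """Strip accent marks, syllable hyphens and the final period (assumption:
    every hyphen marks a syllable break, not a real compound)."""
    return keyed.replace("`", "").replace("-", "").rstrip(".")

print(key_headword("Ac'e-tim'e-ter."))            # Ac`e-tim`e-ter.
print(plain_form("Ach`ro-mat`ic Con-dens`er."))   # Achromatic Condenser
```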
Punctuation placement and spacing. Normalize spacing around punctuation:
Font size. Changes in font size should be disregarded, except as a clue to structural divisions (e.g. a block quote).
Superscripts and subscripts. Capture subscripted and superscripted text by placing the character ^ before each superscript character and ^^ before each subscript character.
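A minimal Python sketch of this convention, with invented function names and examples: each superscript character gets its own "^", and each subscript character its own "^^".

```python
def superscript(chars: str) -> str:
    """Prefix every superscript character with '^'."""
    return "".join("^" + c for c in chars)

def subscript(chars: str) -> str:
    """Prefix every subscript character with '^^'."""
    return "".join("^^" + c for c in chars)

# Hypothetical examples.
print("x" + superscript("2"))        # x^2
print("H" + subscript("2") + "O")    # H^^2O
print("10" + superscript("12"))      # 10^1^2
```

Anyone decoding the keyed text later needs to test for "^^" before "^", since the subscript marker begins with the superscript marker.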
Braces (especially in tables). Braces are always difficult. As in EEBO, we'll say: interpret them as best you can. Sometimes this will require entering text more than once, e.g. if the brace means "this word applies to all these other words," the easiest technique may be simply to apply the word to all of the other words by entering it that many times; sometimes it may require treating the single item as a head or label for a list containing the grouped items; sometimes it may involve attaching a ROWS or COLS attribute to a table <CELL>. Unfortunately, many variations are possible.
Text in non-roman alphabet. Extended text (a word or more) in any non-Roman alphabet, Greek included, should be captured as <GAP DESC="foreign">.
Musical notation. Capture each piece of musical notation of any length as <GAP DESC="music">.
Mathematical formulas. Those that cannot be captured as simple text strings (e.g. 2 × 3 = 6 or √4 = 2) should be captured as <GAP DESC="math">.
Record paragraph breaks with the <P> tag.
Do not record text columns (except in tables) or column breaks.
Do not record line breaks as such (except when the line breaks indicate a change from one item to another in a <LIST> or <TABLE> of course).
Record page breaks with the <PB> tag (see above).
Do not record indentation; instead use the indentation as a cue to structure: e.g., the indent may indicate a new paragraph (<P>), a block quote (<Q>), etc.
Handwritten notes or additions, and purely decorative typographic markings, should not be recorded at all.
Though I have not noticed any footnotes or marginal notes in this book, the dtd contains a NOTE element that should be used to capture any that do occur. The PLACE attribute is used to capture the location of the note (PLACE="foot" for footnotes; PLACE="marg" for marginal notes); the N attribute is used to capture the footnote marker (e.g. "1" "*" etc.), which is otherwise to be omitted; and the note itself should be embedded in the text at the point to which it refers (usually at the location indicated by the note marker).
Please bring to our notice any other situations that the instructions failed to anticipate and we will work out a way to cope with them.
[Sample and transcription: one headword vs. multiple headwords]
A few other elements are included in the dtd, but are unlikely to see much (or any) use, except perhaps in the preface: DATELINE*, DATE*, SALUTE*, SIGNED*, OPENER*, CLOSER*.
Elements marked with an asterisk above (*) are TEI tags and retain the meanings assigned them in TEI P3. (See http://www.hti.umich.edu/t/tei/ for a quick TEI lookup by element name.) Most of the other tags in the dtd are in fact shortcuts for TEI tags. Thus INSERT is essentially equivalent to the TEI sequence Q TEXT BODY DIV0; VOL is equivalent to the TEI 'TEXT' tag; PART is equivalent to the TEI 'DIV1 TYPE="part"'; E is equivalent to the TEI 'DIV2 TYPE="entry"'; HW is equivalent to 'DIV3 TYPE="headword"'; DEF is equivalent to 'DIV3 TYPE="definition"'; and I is equivalent to 'HI REND="ital"'.
Images of the book covers are included in the image set. Treat these as blank pages.
The samples supplied with this guide use invented values for the "REF" attribute of the PB tag. For the conversion project itself, we will supply id numbers for the images that should be referenced by this attribute.
Three matters are left to be arranged, subject to discussion and negotiation with the conversion vendor: