TEXT CONVERSION PROJECT:
Knight's American Mechanical Dictionary

Introduction

Scope of the job

Key in the text of Knight's American Mechanical Dictionary at an accuracy rate of 1 error in 20,000 characters or better (99.995% accuracy rate). Mark it up in SGML as instructed, so as to produce a searchable textbase. Accuracy will be judged on the basis of samples (minimum: 5% of pages) proofread by University of Michigan staff.

Source of data

The item to be converted is a three-volume alphabetically arranged encyclopedia of the mechanical arts, stored on our server as 400dpi gray-scale image files. Vendor access to the data is still to be arranged (see the last section, below).

Output data

The conversion product will consist of one or more (as convenient) text files, marked up in valid SGML in accordance with our proprietary dtd and keying instructions, and delivered via ftp or CD-ROM (preferably the former).

Files will be returned with a consistent system doctype declaration or with this public doctype declaration:
<!DOCTYPE KNIGHT PUBLIC "-//UMDLPS//DTD knight 1.0//EN">

Volume of data

The original consists of about 2800 images, one page per image. The output text will probably amount to 10-15 MB of data.

Organization of the book

Overall high-level structure

<KNIGHT> (the entire work) should be subdivided into three <VOL>s (with N attribute = "1" "2" or "3"). Each <VOL> will contain <FRONT> matter preceding the <BODY>. The entries go in <BODY>; everything else goes in <FRONT>.

The <BODY> of each volume (the run of entries) is subdivided into alphabetical sections, each marked by a simple heading (<HEAD>): "A", "B", etc. Tag each alphabetical section as a <PART> of the <BODY> (with N attribute = "A", "B", etc.). If a section is interrupted by the volume break, simply close the <PART> and reopen it where it resumes in the next volume. If a volume break interrupts an entry, move the tail end of the entry forward so that the entry is complete before the </PART> </BODY> </VOL>.
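The skeleton described above can be sketched roughly as follows. This is an illustration, not a substitute for the dtd: the element names, the N attributes, and the doctype line come from this guide, while the choice of "G" as the letter split across the volume break and the comments standing in for omitted content are assumptions made for the example.

```sgml
<!DOCTYPE KNIGHT PUBLIC "-//UMDLPS//DTD knight 1.0//EN">
<KNIGHT>
<VOL N="1">
<FRONT>
<!-- title page, preface, lists of plates, etc. -->
</FRONT>
<BODY>
<PART N="A">
<HEAD>A</HEAD>
<!-- entries (E) for the letter A -->
</PART>
<!-- PARTs for the intervening letters -->
<PART N="G">
<HEAD>G</HEAD>
<!-- entries for G up to the volume break; the last entry is completed here -->
</PART>
</BODY>
</VOL>
<VOL N="2">
<FRONT>
<!-- front matter of the second volume -->
</FRONT>
<BODY>
<PART N="G">
<!-- the PART is reopened: remaining entries for G -->
</PART>
<!-- PARTs for the following letters -->
</BODY>
</VOL>
<!-- VOL N="3" follows the same pattern -->
</KNIGHT>
```

The point of the sketch is only the nesting: an interrupted section appears as two <PART>s carrying the same N value, one in each volume, and no entry is ever split between them.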

Primary organization by entry

As an encyclopedia in dictionary form, Knight's Dictionary is arranged primarily by entry. The entry (<E>) therefore will be the chief element of the markup. Each entry is divided into:

One or more headwords (<HW>)
A definition (<DEF>), containing chiefly:
1. Paragraphs (<P>)
2. Lists (<LIST>)
3. Tables (<TABLE>)
4. Illustrations (<FIGURE>)
5. Field labels, e.g. "Nautical" "Bridge-building" "Carpentry" (<LABEL>)
6. Cross references, e.g. "See SCREWDRIVER" (<REF>)

Typical simple entries

Sample

Transcription

<E>
<HW>Abb.</HW>
<DEF>
 <LABEL>Weaving.</LABEL> Yarn for the warp.
</DEF></E>
<E>
<HW>A-bee`.</HW>
<DEF>
 <LABEL>Fabric.</LABEL> A woven stuff of wool and cotton made in Aleppo.
</DEF>
</E>
<E>
<HW>A-beam`.</HW>
<DEF>
 Opposite the center of the ship's side; as, "the wind is abeam."
</DEF>
</E>
<E>
<HW>Ab`e-run`ca-tor.</HW>
<DEF>
 A weeding-machine.
</DEF>
</E>
<E>
<HW>A-but`ting-joint.</HW>
<DEF>
 <LABEL>Carpentry.</LABEL> A joint in which the fibers of one piece are perpendicular to those of the other.
 <LABEL>Machinery.</LABEL> A joint in which the pieces meet at a right angle.
</DEF>
</E>

<E>
<HW>A`ces.</HW>
<DEF>
 <LABEL>Nautical.</LABEL> Hooks for the chains.
</DEF>
</E>
<E>
<HW>A-cet`i-fi-er.</HW>
<DEF>
 An apparatus for exposing cider, wort, or other wash to the air to hasten the acetification of the fermented liquor. See <REF>GRADUATOR</REF>.
</DEF>
</E>
<E>
<HW>Ac`e-tim`e-ter.</HW>
<DEF>
 See <REF>ACIDIMETER</REF>.
</DEF>
</E>
<E>
<HW>Ac`e-tom`e-ter.</HW>
<DEF>
 A hydrometer suitably graduated for ascertaining the strength of acetic acid and vinegar.
</DEF>
</E>
<E>
<HW>Ach`ro-mat`ic Con-dens`er.</HW>
<DEF>
 An achromatic lens or combination used to concentrate rays upon an object in a microscope. See <BIBL>Carpenter on the Microscope, pp. 117-119, ed. 1857.</BIBL>
</DEF>
</E>

Simple entry containing simple list.

Sample

Transcription
<E>
<HW>Ag`ate.</HW>
<DEF>
1. <LABEL>Printing.</LABEL> A size of type between Pearl and Nonpareil; called Ruby in England.
<LIST>
<ITEM>Pearl.</ITEM>
<ITEM>Agate, or Ruby.</ITEM>
<ITEM>Nonpareil</ITEM>
</LIST>
2. The draw-plate of the gold-wire drawers; so called because the drilled eye is an agate.
3. The pivotal cup of the compass-card.
</DEF>
</E>

More elaborate lists follow the same pattern. Follow this link:

http://www.umich.edu/~pfs/knight/dox/lists.html

to see examples of lists with recommended transcriptions.

Simple entry containing simple table.

Sample.

Transcription.
<E>
<HW>Bind`ing-joist.</HW>
<DEF>
<LABEL>Carpentry.</LABEL> A joist whose ends rest upon the wall-plates, and which support the bridging or floor joists and the ceiling joists below. A binder. See <REF>JOIST</REF>; <REF>FLOOR</REF>.
The binding-joist is employed to carry common joists when the area of the floor or ceiling is so large that it is thrown into bays. With large floors the binding-joists are supported by girders. See <REF>GIRDER</REF>.
Binding-joists should have the following dimensions: —
<TABLE>
<ROW>
 <CELL ROLE="label">Length of Bearing. Feet.</CELL>
 <CELL ROLE="label">Depth. Inches.</CELL>
 <CELL ROLE="label">Width. Inches.</CELL>
</ROW>
<ROW>
 <CELL>6</CELL>
 <CELL>6</CELL>
 <CELL>4</CELL>
</ROW>
<ROW>
 <CELL>8</CELL>
 <CELL>7</CELL>
 <CELL>4½</CELL>
</ROW>
<ROW>
 <CELL>10</CELL>
 <CELL>8</CELL>
 <CELL>5</CELL>
</ROW>
<ROW>
 <CELL>12</CELL>
 <CELL>9</CELL>
 <CELL>5½</CELL>
</ROW>
<ROW>
 <CELL>14</CELL>
 <CELL>10</CELL>
 <CELL>6</CELL>
</ROW>
<ROW>
 <CELL>16</CELL>
 <CELL>11</CELL>
 <CELL>6½</CELL>
</ROW>
<ROW>
 <CELL>18</CELL>
 <CELL>12</CELL>
 <CELL>7</CELL>
</ROW>
<ROW>
 <CELL>20</CELL>
 <CELL>13</CELL>
 <CELL>7½</CELL>
</ROW>
</TABLE>

</DEF>
</E>

More elaborate tables follow the same pattern. Follow this link:

http://www.umich.edu/~pfs/knight/dox/tables.html

to see an example of a more complex set of tables, with recommended transcription.

Aside from the basic sequence of HEADWORD-DEFINITION, the internal arrangement of material in the entries can differ considerably. Labels always (?) appear within paragraphs, but lists, tables, and illustrations can appear within paragraphs or between them; lists can contain other lists; etc.

Other structural elements within entries

A number of entries contain block quotations, both in prose and in verse. These should be tagged with <Q>; if the source is given, this should be tagged as <BIBL>.

Entry containing block quotation in verse.

Sample.

Transcription.
<E>
<HW>Blank`et.</HW>
<DEF>
 1. <LABEL>Fabric.</LABEL> A coarse, heavy, open, woolen fabric, adapted for bed covering, and usually napped. It may be twilled or otherwise. A name applied to any coarse woolen robe used as a wrapping.
 <Q><L>Antiphanes, that witty man, says:</L>
 <L>'Cooks come from Elis, pots from Argos,</L>
 <L>Corinth blankets sends in barges.'</L>
 <BIBL>ATHENAEUS (A.D. 220)</BIBL>
 </Q>
 The poncho is a blanket with a hole in the center for the head to go through. It is worn by the South Americans, Mexicans, and Pueblo Indians.
 2. <LABEL>Printing.</LABEL> A piece of woolen, felt, or prepared rubber, placed between the inner and outer tympans, to form an elastic interposit between the face of the type and the descending platen.
</DEF>
</E>

A very few entries contain a more elaborate internal structure, e.g. sets of tables, where the sets themselves are organizational units with headings. For this (rare) case of complex objects inserted within an entry, we have provided an <INSERT> tag which may contain P LIST TABLE etc., or may be subdivided into <DIV1>s (and those if necessary into <DIV2>s).
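As a rough sketch of how such an <INSERT> might look when it holds two headed sets of tables: the headword, headings, and cell contents below are placeholders invented for the example, and the exact content model (for instance, whether a paragraph may precede the INSERT) should be checked against the dtd.

```sgml
<E>
<HW>Hypothetical Headword.</HW>
<DEF>
<P>Ordinary paragraph introducing the sets of tables.</P>
<INSERT>
<DIV1>
<HEAD>Heading of the first set</HEAD>
<TABLE>
<ROW><CELL ROLE="label">Column label</CELL><CELL ROLE="label">Column label</CELL></ROW>
<ROW><CELL>value</CELL><CELL>value</CELL></ROW>
</TABLE>
</DIV1>
<DIV1>
<HEAD>Heading of the second set</HEAD>
<TABLE>
<ROW><CELL ROLE="label">Column label</CELL><CELL ROLE="label">Column label</CELL></ROW>
<ROW><CELL>value</CELL><CELL>value</CELL></ROW>
</TABLE>
</DIV1>
</INSERT>
</DEF>
</E>
```

Where the inserted object has no internal headings of its own, the simpler form (P, LIST, or TABLE directly inside INSERT, or no INSERT at all) is probably preferable.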

Material not within entries

Each volume of the book contains some 'front matter' that does not fall within entries: title page, preface, lists of plates, etc. This material should be tagged as <DIV1>s of <FRONT>, e.g. the title page as <DIV1 TYPE="title page">, the preface as <DIV1 TYPE="preface">, but the DIVs themselves are internally tagged using ordinary <LIST> <TABLE> etc. as appropriate--no special tags for author or title, etc.

For a sample capture of front matter, follow this link:
http://www.umich.edu/~pfs/knight/dox/front.html

Page-breaks, page numbers, running headers

All information arising from the division of the book into pages should be recorded as attributes of the <PB> tag. The numerical designation of the image file should be captured in the REF attribute; the printed page number (if any) should be captured in the N attribute; and the running headers should be captured in the "HEAD" attribute. On most pages, running headers will consist of a pair of words or phrases, one over each column. Include both, separated by a space.

The <PB> tag itself should be placed as near as possible to the point at which the page actually 'breaks': that is, at the point at which the reader would turn the page, insert the <PB> tag for the new page.

When a new division of some kind begins on the new page, move the <PB> tag forward as far as it needs to go in order for the file to be valid. This means tucking it into the first content-bearing element (e.g. E or DIV1).

Placement of tags. The rules are: (1) "pages always break at the top"; that is, <PB> tags will be inserted in the text at the actual location of the page break, regardless of the location of the page number on the printed page. (2) "Words cannot break at page breaks"; that is, if a hyphenated word straddles a page break, finish the word and any attached punctuation, then insert the tag. And (3) "Divisions begin at page breaks; they don't end there"; that is, if a structural break of some kind coincides with the page break (e.g., if a new section, paragraph, etc., begins at the head of the new page), the <PB> tag should be tucked inside the opening tag for the new division, NOT inside the closing tag for the old division.
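The placement rules can be illustrated with a few fragments. Everything specific in them is invented for the example: the REF values, the page numbers, the running headers, and the position of the page breaks. We also assume that the dtd accepts a <PB> directly after the opening <E> tag; if it does not, move the tag further in until the file is valid.

```sgml
<!-- Rule 2: the print has "abut-" at the foot of one page and "ment."
     at the head of the next. Finish the word and its punctuation first,
     then insert the tag for the new page. -->
which is then the abutment. <PB REF="00000012" N="2" HEAD="ABUTMENT. ACETIMETER.">2. A solid or stationary surface

<!-- Rule 3: a new entry begins at the head of the new page. The tag goes
     inside the new entry, not at the end of the old one. -->
</DEF>
</E>
<E><PB REF="00000013" N="3" HEAD="ACHROMATIC. AGATE."><HW>Ag`ate.</HW>
<DEF>

<!-- A page with no printed number and no running header: supply REF only. -->
<PB REF="00000001">
```

Note that the HEAD value in the first two fragments holds both running headers, one from over each column, separated by a space.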

Attribute values

Provide attribute values (e.g. the three attributes of the PB element; the N attribute of FIGURE PLATE VOL and PART) only when instructed to and when there is specific information to supply. Do not supply values of this sort: TYPE="unknown" or TYPE="unspecified".

Illustrations

The work includes both full-page plates and over 5000 individual illustrations. Tag the full-page plates as <PLATE> and the other illustrations as <FIGURE>. If a figure (or more likely a plate) contains other numbered or captioned figures within it, tag each subordinate illustration as a <FIGURE> tag within the <PLATE> tag.

"Figure 45" ... Most of the illustrations and plates are headed by a number (roman numeral for plates, arabic numeral for figures). Capture this number as the N attribute of the <FIGURE> or <PLATE> tag.

Captions and other text. Capture captions or caption-like material attached to an illustration as a <HEAD> tag within the <FIGURE> (or <PLATE>) tag. Other text that pertains to the whole illustration (excluding, e.g., lettered references to various pieces and parts of the picture) may be tagged using <P> within <FIGURE> (or within <PLATE>). "See" references on a plate ("See truss, p. 43"), for example, should be captured in a <P>.

Follow this link for a sample capture of a simple plate in context:
http://www.umich.edu/~pfs/knight/dox/plate.html
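For quick reference, a plate containing two numbered figures and a "See" reference might be captured along these lines. The plate number and the captions are placeholders, and we assume here that the element used for the loose text is the ordinary paragraph tag (P).

```sgml
<PLATE N="IV"><HEAD>Caption of the plate as printed.</HEAD>
<FIGURE N="1"><HEAD>Caption of the first figure on the plate.</HEAD></FIGURE>
<FIGURE N="2"><HEAD>Caption of the second figure on the plate.</HEAD></FIGURE>
<P>See truss, p. 43.</P>
</PLATE>
```

Lettered references to parts of the picture (A, B, C, etc.) are left out of the capture, as described above.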

Entry with multiple illustrations

Sample

Transcription

<E>
<HW>A-but`ment.</HW>
<DEF>
 A fixed point or surface, affording a relatively immovable object against which a body abuts or presses while resisting or moving in the contrary direction. See <REF>PIER</REF>; <REF>SKEWBACK</REF>.
 <FIGURE N="4"><HEAD>Pier Abutment.</HEAD></FIGURE>
 1. <LABEL>Building.</LABEL> A structure which receives the lateral thrust of an arch. The abutment may be a pier or wing walls forming a horizontal arch; or the arch may be continued to a piled or hewn foundation, which is then the abutment.
 2. <LABEL>Machinery.</LABEL> A solid or stationary surface against which a fluid reacts.
 a. The wedge which lifts the piston of one form of rotary steam-engine, and which forms a surface for the steam to react against as it presses the piston forward in its circular path.
 b. The wedge block in a rotary pump, where the piston traverses an annular chamber.
 <FIGURE N="5"><HEAD>Piled Abutment</HEAD></FIGURE>
 c. One of the cylinder heads of a steam-engine, receiving the back pressure of the steam which is made effective upon the piston.
 3. <LABEL>Carpentry.</LABEL> The junction of two pieces of timber, where the grain of one is at a right angle to that of the other, or nearly so.
 <FIGURE N="6"><HEAD>Movable Abutment.</HEAD></FIGURE>
 4. <LABEL>Fire-arms</LABEL> The block at the rear of the barrel of a fire-arm (especially a breech-loader), which receives the rearward force of the charge in firing.
 It has the function of the breech-plug or breech-pin in the muzzle-loading fire-arm.
 <FIGURE N="7"><HEAD>Stationary Abutment</HEAD></FIGURE>
 A similar term is applied to the corresponding portion in breech-loading cannon.
 In Fig. 6, the abutment D is movable on its axis so as to expose the rear of the bore for the insertion of the cartridge.
 In Fig. 7, the abutment D is stationary, relatively to the stock, and the barrel slips away from the abutment to allow the insertion of the cartridge. The variations in this arrangement are very numerous, and the different devices form the subjects of numerous patents in the United States and foreign countries. See <REF>FIRE-ARM</REF>; <REF>BREECH-LOADING</REF>.
 5. <LABEL>Suspension Bridge.</LABEL> The masonry or natural rock in and to which the ends of a suspension cable are anchored.
 <FIGURE N="8"><HEAD>Suspension Bridge Abutment</HEAD></FIGURE>
 6. <LABEL>Hydraulic Engineering.</LABEL> A dam is in some sense an abutment, as it sustains the lateral thrust of water. See <REF>DAM.</REF>
 </DEF></E>

Another illustrated entry (excerpt)

Sample

Transcription.
<E><HW></HW><DEF>
... 1864. This invention consists in the employment of a pair of suspending straps which pass over the
 <FIGURE N="19"><HEAD>Weber's Knapsack</HEAD></FIGURE> shoulder in connection with another shorter pair of straps attached to the top of the knapsack near its center, and also a pair of straps attached, one to each end of the knapsack, for the purpose of varying the position and shifting the weight of the same when desirable.
 WEBER, January 31, 1865. The frame of the knapsack is capable of being changed into a couch, and the cover forms a shelter. The central section has jointed and folding sides.
 RUSH, March 25, 1862. The frame of the knapsack is made of two parts, hinged together
 <FIGURE N="20"><HEAD>Rush's Knapsack.</HEAD></FIGURE> At the thick end of one part are pivoted two arms, which, when thrown out, rest upon the edge of the knapsack, and serve to hold the canvas for forming a bed.
 FRODSHAM AND LEVETT, October 1, 1861. This invention consists of an india-rubber ... 
 </DEF></E>

Special problems: text sequence and illustration placement.

Text sequence. All pages should be captured in the order they were intended to be read. In most cases this is straightforward: down the lefthand column from top to bottom and then down the righthand column from top to bottom. On some pages, however, especially where the page is interrupted by a large illustration or table, the sequence is disrupted. It may, for example, proceed halfway down the left column, then halfway down the right, then back to the lower half of the left column, followed by the lower half of the right. Follow the order that makes sense and that honors the integrity of the entries.

Typical problem with text sequence

Illustration placement. Most illustrations appear within the entry that they are intended to illustrate, and they should be recorded so. But often the physical arrangement of the page dictated that the illustration appear outside of the entry that it illustrated. In that case, if you can determine which entry it really belongs to (using the information in the caption), move the <FIGURE> tag so that it appears within the correct <E>ntry. It is important that so far as possible, the figures illustrating a given entry appear within that entry. (This applies to <FIGURE>s only: leave <PLATE>s where they are.)

For samples of these two typical problems, follow this link:
http://www.umich.edu/~pfs/knight/dox/probs.html.

Occasionally, illustrations appear after a line of text that ends with a hyphenated word. Treat hyphenated words in this situation just as you do when they appear at page breaks. That is: words should not be broken by <FIGURE>s. If a hyphenated word precedes an illustration, first finish the word and any attached punctuation, and then insert the FIGURE tag.

Character-level and word-level capture (italics, hyphens, character encoding, etc.)

Italics. Capture text in italics using the <I> tag. Omit the tag if you are already capturing the distinctiveness of the italicized portion in question using some structural tag, e.g. <LABEL>.

Small caps. Capture text in small caps as if it were in regular caps.

End-of-line hyphens. If feasible, selectively remove end-of-line hyphens where they were introduced simply in order to justify the line, leaving only the hyphens that mark real compounds. If this is not judged feasible, record all end-of-line hyphens as "|" (the keyboard 'pipe' character = HEX 7C) instead of "-", and reserve "-" for hyphens that appear other than at the end of the line. (Any real example of the vertical bar or 'pipe' character should in the latter case be captured with the entity &verbar;.)
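The two choices can be compared on a short passage. The wording is borrowed from one of the sample entries in this guide, but the line breaks are invented for the illustration.

```sgml
<!-- Printed lines (line ends shown here by a slash):
       especially a breech- / loader, which receives the rear- / ward force -->

<!-- Choice 1: resolve the hyphens, keeping only the real one -->
especially a breech-loader, which receives the rearward force

<!-- Choice 2: flag every end-of-line hyphen with the pipe character -->
especially a breech|loader, which receives the rear|ward force
```

Under the second choice no judgment is made at keying time; both line-end hyphens are recorded the same way and are resolved later.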

Dashes. Capture dashes as &mdash;.

Quotation marks, apostrophes. Use the simple upright keyboard quotation-mark and apostrophe characters (except as noted below under 'syllabification'). Do not attempt to distinguish between 'opening' and 'closing' quotation marks.

Non-ASCII characters. Capture non-ASCII characters using the standard ISO character entities (ISOpub, ISOtech, ISOnum, ISOlat1, ISOlat2).

Unrecognized symbols. Unrecognized symbols should be captured as "#".

Individual non-roman letters. Individual letters in Greek should be captured using the ISO Greek-1 entity set. E.g. (for the cases that I noticed):

&ogr; (lower case omicron)
&Agr; (upper case Alpha)
&Bgr; (upper case Beta)
&Ggr; (upper case Gamma)

Individual letters in other non-Roman alphabets or writing systems (Hebrew, Chinese, Arabic, etc.) should be recorded as unrecognized symbols with #.
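A brief illustration of the distinction, using invented sentences. The entity &agr; (lower case alpha) is not in the list above but belongs to the same ISO Greek-1 set.

```sgml
<!-- A single Greek letter: use the ISO Greek-1 entity -->
<P>the angle &agr; formed by the two arms</P>

<!-- A single letter from another writing system (e.g. Hebrew):
     record it as an unrecognized symbol -->
<P>the character # stamped upon the coin</P>
```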

Mathematical operators (+ - = etc.). Use the keyboard character for = . Otherwise, capture as special characters using standard character entities, e.g. as follows.

&divide; = divide sign (÷)
&times; = multiply sign (×)
&plus; = plus sign (+)
&plusmn; = plus-or-minus sign (±)
&radic; = square-root (radical) sign (√)

Illegibility. If you cannot read a letter, capture it as "$". If you cannot read an entire word, capture it as "$word$"; any larger illegible section should be captured as "$span$".
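For instance, supposing (purely for illustration) that parts of a phrase from one of the sample entries were unreadable in the image:

```sgml
<!-- one unreadable letter in the word "piston" -->
as it presses the pi$ton forward in its circular path.

<!-- one unreadable word -->
as it presses the $word$ forward in its circular path.

<!-- several unreadable words or lines -->
as it presses $span$ circular path.
```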

Dollar signs. Since we use "$" to mark illegibilities, capture a literal dollar sign using the entity &dollar;.

Syllabification and stress in head-words. Headwords (HWs) are printed using hyphens and vertical accent marks (') to indicate pronunciation. Capture the hyphens as ordinary hyphens (-), but capture the accent marks using the 'back-tick' character (`).

Punctuation placement and spacing. Normalize spacing around punctuation:

no spaces before and one space after : ; . , ! ?
no spaces around —

Font size. Changes in font size should be disregarded, except as a clue to structural divisions (e.g. a block quote).

Superscripts and subscripts. Capture subscripted and superscripted text by placing the character ^ before each superscript character and ^^ before each subscript character.
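A few invented strings show how we read this rule; note in particular that the marker is repeated before each raised or lowered character, not placed once before the group.

```sgml
<!-- superscripts: one ^ before each raised character -->
x^2        <!-- x with a raised 2 -->
10^1^2     <!-- 10 with a raised 12 -->

<!-- subscripts: ^^ before each lowered character -->
CO^^2      <!-- CO with a lowered 2 -->
a^^1^^0    <!-- a with a lowered 10 -->
```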

Braces (especially in tables). Braces are always difficult. As in EEBO, we'll say: interpret them as best you can. Sometimes this will require entering text more than once, e.g. if the brace means "this word applies to all these other words," the easiest technique may be simply to apply the word to all of the other words by entering it once for each of them; sometimes it may require treating the single item as a head or label for a list containing the grouped items; sometimes it may involve attaching a ROWS or COLS attribute to a table <CELL>. Unfortunately, many variations are possible.
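To make the three techniques concrete, suppose the print shows the word "Oak" followed by a brace that embraces the two words "seasoned" and "green" (an invented case). Any of the following captures might be reasonable, depending on the context in which the brace appears:

```sgml
<!-- (a) repeat the braced word for each item it governs -->
<LIST>
<ITEM>Oak, seasoned</ITEM>
<ITEM>Oak, green</ITEM>
</LIST>

<!-- (b) treat the braced word as the head of a list -->
<LIST>
<HEAD>Oak</HEAD>
<ITEM>seasoned</ITEM>
<ITEM>green</ITEM>
</LIST>

<!-- (c) in a table, let one cell span the braced rows -->
<TABLE>
<ROW><CELL ROWS="2">Oak</CELL><CELL>seasoned</CELL></ROW>
<ROW><CELL>green</CELL></ROW>
</TABLE>
```

In (a) the comma is supplied by the keyer, so (b) or (c) is probably the safer choice where it fits the layout.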

Material to flag but not capture

Text in non-roman alphabet. Extended text (a word or more) in any non-Roman alphabet, Greek included, should be captured as <GAP DESC="foreign">.

Musical notation. Capture each piece of musical notation of any length as <GAP DESC="music">.

Mathematical formulas. Those that cannot be captured as simple text strings (e.g. 2 × 3 = 6 or √4 = 2) should be captured as <GAP DESC="math">.
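The three kinds of gap might appear in running text as follows. The sentences are invented placeholders; only the DESC values are taken from the instructions above. GAP is an empty tag, so it has no closing tag.

```sgml
<P>The name is derived from the Greek <GAP DESC="foreign">, a wedge.</P>

<P>The call is sounded thus: <GAP DESC="music"></P>

<P>The strain upon the beam is found by the formula <GAP DESC="math"></P>
```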

Block-level capture

Record paragraph breaks with the <P> tag.

Do not record text columns (except in tables) or column breaks.

Do not record line breaks as such (except when the line breaks indicate a change from one item to another in a <LIST> or <TABLE>, of course).

Record page breaks with the <PB> tag (see above).

Do not record indentation; instead use the indentation as a cue to structure: e.g., the indent may indicate a new paragraph (<P>), a block quote (<Q>), etc.

Text not to record at all

Handwritten notes or additions and purely decorative typographic markings should not be recorded at all.

Notes

Though I have not noticed any footnotes or marginal notes in this book, the dtd contains a NOTE element that should be used to capture any that do occur. The PLACE attribute is used to capture the location of the note (PLACE="foot" for footnotes; PLACE="marg" for marginal notes); the N attribute is used to capture the footnote marker (e.g. "1" "*" etc.), which is otherwise to be omitted; and the note itself should be embedded in the text at the point to which it refers (usually at the location indicated by the note marker).
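Should a footnote turn up, the capture might look something like this. The text is a placeholder, and whether the dtd expects plain text or paragraphs inside NOTE is an assumption to be checked.

```sgml
<P>Text of the paragraph up to the point that carries the marker<NOTE PLACE="foot" N="*">Text of the footnote as printed at the foot of the page.</NOTE> and then the remainder of the paragraph.</P>
```

The asterisk printed in the text and at the foot of the page is recorded only in the N attribute; it is not repeated in the running text or inside the note.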

Things missed.

Please bring to our notice any other situations that the instructions failed to anticipate and we will work out a way to cope with them.

Individual tags: guide to use.

BACK*

Contains the back matter, if any, attached to each of the three volumes. Appears within: VOL

BIBL*

Contains bibliographical information (e.g. author or title or page number) associated with a quotation. Appears within Q P INSERT DIV1 DIV2 EPIGRAPH

BODY*

Contains the entries in each volume. Appears within: VOL

CELL*

Contains a data cell of a table (cp. HTML 'td' or 'th'). Multi-column or multi-row cells are designated using the COLS (cp. HTML 'COLSPAN') and ROWS (cp. HTML 'ROWSPAN') attributes. Cells used as headings for columns or rows are designated using the ROLE="label" attribute (cp. HTML 'TH' tag). Appears within: ROW.

DEF

Contains the definition section of an entry. Appears within: E

DIV1*

A division of the FRONT matter (or of an INSERTed text object). May appear within FRONT BACK INSERT.

DIV2*

A division of a DIV1. May appear within DIV1.

E

Contains an individual entry in the Dictionary. Appears within: PART

FIGURE*

Marks the location of an illustration occupying less than a full page. Contains captions (HEAD) and some other text, as above. Appears within: DEF, P, FIGURE, PLATE, HEAD, TRAILER, LIST, TABLE, CELL, ITEM

FRONT*

Contains the front matter attached to each of the three volumes. Appears within: VOL

GAP*

Empty tag marking the location of something that has not been captured, e.g. music, mathematical equations, or words in a non-roman script. In general, can appear wherever text data (PCDATA) can.

HEAD*

Contains a heading, e.g. of structural division (BODY PART DIV1 or DIV2), TABLE, LIST; or the caption of an illustration (FIGURE or PLATE). Appears within: FIGURE PLATE LIST TABLE DIV1 DIV2 INSERT BODY PART

HW

Contains the headword(s) of an entry (printed in bold). Phrases should be captured as a single HW, but two words separated by semicolon should be captured as two separate HWs.

One headword vs. multiple headwords

Sample

Transcription

<E>
<HW>A-but`ment Arch.</HW>
<DEF>
An end arch of a bridge.
</DEF>
</E>
<E>
<HW>A-can`tha-lus</HW>
<HW>A`can-tha`bo-lus</HW>
<DEF>
An instrument for extracting thorns or splinters from a wound.
</DEF>
</E>

Appears within: E

I

Contains text in italics. Omit tag if you are already capturing the distinctiveness of the italicized portion in question using some structural tag, e.g. <LABEL>. Appears: almost anywhere.

INSERT

Contains a textual object, inserted in an entry, that is too complex to be rendered using P TABLE or LIST. Appears within: DEF

ITEM*

Contains an item in a list. Indications of sequence (1. 2., etc.) should be recorded as plain text within the item. Appears within: LIST.

KNIGHT

Contains the entire work

L*

Contains a line of verse (poetry). Appears within: Q DEF LG INSERT

LABEL

Contains an indication of the field of industry to which a given term or definition belongs (printed in italics within parentheses). Omit the parentheses and capture just the text of the label. Appears within: DEF.

LG*

Contains an organized group of verse lines (poetic stanza). Appears within Q INSERT DIV1 DIV2

LIST*

Contains an ordered or unordered list of ITEMS (and an optional HEAD). Appears within: P DEF ITEM CELL DIV1 DIV2 INSERT etc.

NOTE*

Contains text of notes (especially footnotes and marginal notes). Appears in P L CELL ITEM HEAD etc.

PART

Contains an alphabetical section of entries. Appears within: BODY

PB*

Empty tag flagging page-break 'event'. Appears: almost anywhere, but not between entries, or between DIVs of the FRONT matter, or between FRONT and BODY.

PLATE

Marks the location of a full-page illustration or set of illustrations. Contains captions (HEAD) and some other text, as above. Appears within: FRONT BACK PART E DEF TABLE LIST

Q*

Contains a block quotation. Appears within: P DEF DIV1 DIV2 INSERT etc.

REF*

Contains a word or phrase intended as a cross-reference to another entry in the Dictionary. Restrict the contents of the tag to the word or phrase itself with following punctuation, i.e. See <REF>SCREWDRIVER.</REF>, not <REF>See SCREWDRIVER.</REF>; See under <REF>HAMMER.</REF>, not <REF>See under HAMMER.</REF>. Appears within: P CELL ITEM.

ROW*

Contains the row of a table. (cp. HTML 'TR' tag) Appears within: TABLE

TABLE*

Contains a table. Oriented by the row (like HTML). Contains ROWs (cp. HTML 'TRs') and an optional HEAD. Appears within: DEF P CELL ITEM DIV1 DIV2 INSERT etc.

TRAILER*

Same as HEAD, except that it appears at the end of section instead of the beginning.

VOL

Contains one volume of the entire work. Appears within: KNIGHT

A few other elements are included in the dtd, but are unlikely to see much (or any) use, except perhaps in the preface: DATELINE*, DATE*, SALUTE*, SIGNED*, OPENER*, CLOSER*.

Elements marked with an asterisk above (*) are TEI tags and retain the meanings assigned them in TEI P3. (See http://www.hti.umich.edu/t/tei/ for a quick TEI lookup by element name.) Most of the other tags in the dtd are in fact shortcuts for TEI tags. Thus INSERT is essentially equivalent to the TEI sequence Q TEXT BODY DIV0; VOL is equivalent to the TEI 'TEXT' tag; PART is equivalent to the TEI 'DIV1 TYPE="part"'; E is equivalent to the TEI 'DIV2 TYPE="entry"'; HW is equivalent to 'DIV3 TYPE="headword"'; DEF is equivalent to 'DIV3 TYPE="definition"'; and I is equivalent to 'HI REND="ital"'.

Misc.

Images of the book covers are included in the image set. Treat these as blank pages.

The samples supplied with this guide use invented values for the "REF" attribute of the PB tag. For the conversion project itself, we will supply id numbers for the images that should be referenced by this attribute.

To be arranged / subject to discussion

Three matters are left to be arranged, subject to discussion and negotiation with the conversion vendor:

The vendor's method of access to the image data remains to be worked out. It may be possible to work directly from the on-line system, or we may be able to provide the images themselves in some form (electronic or print). In the meantime, the page images can be viewed online by following THIS LINK and then following the link for image browse.
As discussed above, we are uncertain whether the conversion vendor will be willing or able to resolve end-of-line hyphens. Though this is a relatively modern book, its usage with respect to the hyphenation of words is unusual (in particular, it seems to hyphenate compounds much more than we would nowadays). The vendor is therefore offered the choice: (1) resolve EOL hyphens (i.e., selectively omit the hyphen, because it is used simply in order to break a word at line end, or retain it, if it represents a 'real' hyphen, i.e., a hyphen that would be present even if the line did not end at that point). Or (2) capture all EOL hyphens with a distinctive character (|) and leave us to resolve them.
Also to do with hyphens, we expect as part of our post-processing to provide 'normalized' forms of all the headwords, forms that omit pronunciation and syllabification information. A vendor willing and able to perform this step, however, is welcome to include it in their proposal. An attribute has been included on the HW element (NORM) designed for this purpose. The difficulty arises again from the use of hyphens, since the headwords as printed do not distinguish between hyphens used to separate elements of compounds (which we would want to retain) and hyphens used to separate syllables (which we would want to remove). Here are some sample headword elements with the attribute included:
- <HW NORM="abb">Abb.</HW>
- <HW NORM="abeam">A-beam`.</HW>
- <HW NORM="abee">A-bee`.</HW>
- <HW NORM="aberuncator">Ab`e-run`ca-tor.</HW>
- <HW NORM="abutment arch">A-but`ment Arch.</HW>
- <HW NORM="abutment">A-but`ment.</HW>
- <HW NORM="abutting-joint">A-but`ting-joint.</HW>
- <HW NORM="acanthabolus">A`can-tha`bo-lus</HW>
- <HW NORM="acanthalus">A-can`tha-lus</HW>
- <HW NORM="aces">A`ces.</HW>
- <HW NORM="acetifier">A-cet`i-fi-er.</HW>
- <HW NORM="acetimeter">Ac`e-tim`e-ter.</HW>
- <HW NORM="acetometer">Ac`e-tom`e-ter.</HW>
- <HW NORM="achromatic condenser">Ach`ro-mat`ic Con-dens`er.</HW>
- <HW NORM="agate">Ag`ate.</HW>
- <HW NORM="agitator">Ag`i-ta`tor.</HW>
- <HW NORM="bevel plumb-rule">Bev`el Plumb-rule.</HW>
- <HW NORM="bevel scroll-saw">Bev`el Scroll-saw.</HW>
- <HW NORM="beveling-machine">Bev`el-ing-ma-chine`.</HW>
- <HW NORM="bevel-square">Bev`el-square.</HW>
- <HW NORM="bevel-tool">Bev`el-tool.</HW>
- <HW NORM="binding-joist">Bind`ing-joist.</HW>
- <HW NORM="binnacle">Bin`na-cle.</HW>
- <HW NORM="binocle">Bin`o-cle</HW>
- <HW NORM="binocular eyepiece">Bi-noc`u-lar Eye-piece</HW>
- <HW NORM="blade">Blade.</HW>
- <HW NORM="blanket">Blank`et.</HW>
- <HW NORM="medallion">Me-dal`lion.</HW>
- <HW NORM="medal-machine">Med`al-ma-chine`.</HW>
- <HW NORM="medicator">Med`i-cat-or.</HW>
- <HW NORM="medicinal bath">Me-dic`i-nal Bath.</HW>
- <HW NORM="medicinal cup">Me-dic`i-nal Cup.</HW>
- <HW NORM="medicine-spoon">Med`i-cine-spoon.</HW>
- <HW NORM="medium">Me`di-um.</HW>
- <HW NORM="medley">Med`ley.</HW>
- <HW NORM="meerschaum">Meer`schaum.</HW>
- <HW NORM="shoe-sewing machine">Shoe-sew`ing Ma-chine`.</HW>
- <HW NORM="shoe-shave">Shoe-shave.</HW>
- <HW NORM="shoe-sole">Shoe-sole.</HW>
