Interim Progress Report
25 April 1996
Contents
- The Project
- The Objectives
- The Progress to Date
- The Decisions
- The Process
- Reflections
  - On the decisions
  - On the process
- Prospects
- Bibliography
- Attachments
  - File ~pfs/codebook.html
    Collecting point and index for working documents.
  - File ~pfs/cb/status.html
    Table providing latest revision information on all working
    documents.
  - File ~pfs/cb/bb.html
    Message and announcement archive for reference and historical
    purposes.
  - File ~pfs/cb/cb1.dtd
    Slightly obsolete version of standalone CodeBook dtd.
  - File ~pfs/cb/elements.html
    List of elements declared in the CodeBook dtd, with short
    description and section number in the dtd.
  - File ~pfs/cb/descrips.html
    List of code book elements rearranged in order of descriptive
    caption.
  - File ~pfs/cb/cb-tei.html
    Documentation of steps taken to integrate standalone CodeBook dtd
    into TEI as a "base tag set."
  - File ~pfs/cb/tei/teicb1.dtd
    First attempt (now obsolete) to convert standalone CodeBook dtd to
    the parameterized form required by TEI in its modules.
The Project
The larger project of which the CodeBook DTD is a part is an attempt
by the Inter-university Consortium for Political and Social Research
(ICPSR), a consortium comprising some of the most significant
statistical archives and data-producing agencies,
to make uniformly searchable and accessible the statistical codebooks
issued by the agencies that produce, publish, or
archive statistical data sets. Many of the details, and even the general
shape, of this larger project and the larger SGML-based system presupposed by it,
remain invisible to those of us who are
engaged with the dtd proper, especially as regards the distribution of
responsibility, the nature of the funding, and the possible diversity of
implementation among the member institutions of the ICPSR.
The more limited responsibilities assigned the dtd group include the
development of an SGML Document Type Definition suitable for the
encoding of data-archive code books or data dictionaries; the creation
of all the ancillary files and style sheets required to make the dtd
usable; the creation of all the documentation required by taggers,
presumptively with little or no knowledge of SGML, in interpreting
the dtd while tagging codebooks; and cooperation with programmers in
integrating the dtd and its attachments into larger systems. The dtd
should be suitable not only for tagging new codebooks but also for
retrospective conversion of old ones; it should apply not only to print
codebooks but also to digital-only documentation. It should be usable
not only by American and Canadian but also by international data
archives. And it should not only accommodate the extremely variable
structures of existing codebooks, but impose enough constraints to
promote increased consistency among future codebooks.
Objectives
The objectives of the group, though modified by experience, have in
general comprised the following:
- Interpret the code book document analyses produced by the ICPSR
committee and written by David Barber (Michigan), John Brandt
(Michigan), and Ann Green (Yale). Clarify obscurities in the analyses
through meetings and correspondence. Elicit the continuing input of
data specialists from ICPSR institutions regarding faults in the
analyses.
- Revise these analyses in the light of our own analyses of existing
code books, the experience of tagging them with interim versions of the dtd,
and the experience of translating the analyses into SGML.
- Translate the analyses into a valid SGML dtd, being careful both
to honor the intentions of the analyzers and to exercise a discretion
schooled in the options available under SGML.
- Learn enough SGML to make the previous step possible. Achieve a
conceptual grasp of the options available, both as regards document
analysis and as regards SGML implementation.
- Map out and implement a practical procedure or work-flow management
scheme.
- Document the procedure and the process so that every decision and
every declaration remains transparent both among ourselves and to later
users.
- Determine the extent to which TEI (or EAD or MARC) SGML elements,
sections, entities, etc., may be borrowed for describing codebooks.
Should it prove feasible, incorporate CodeBook.dtd within TEI itself,
with the prospect of having it eventually accepted as a TEI module in
future releases of TEI.
- Tag sample codebooks for demonstration, tutorial, and experimental
purposes: to determine if the dtd is adequate, to show that it is (in
the face of informed experience), and to teach taggers how to implement
the dtd.
- Write a Panorama style sheet so that the samples can be
displayed.
- Create visual displays of the dtd hierarchy using Near and Far or
Near and Far Lite.
- Work with programmers to create scripts that will search the tagged
codebooks via a CGI call, and generate SGML and SGML-filtered-to-HTML
output online.
- Create full user documentation.
Progress to Date
A good start has been made on these objectives, but only that.
- Interpretation. The analyses have been interpreted.
Changes continue to be made both in the analyses and in the
interpretations through (frequent) correspondence, chiefly from Ann
Green at Yale, and (rare) meetings of the ICPSR Committee. Obscurities
nevertheless remain, most of them annotated as such within the dtd, but
some probably unrealized and unacknowledged.
- Revision. Some minimal revision (or at least loose
interpretation) of the analyses appears in the dtd, but more the result
of general considerations of logic and flexibility than of specific
experiences with non-compliant codebooks. We have only just begun to
compare the dtd with codebooks. The revisions that have been made
in the course of translation, though thoroughly annotated in the dtd,
have not made their way back into the analyses, since there is little
provision in our working procedures for changes flowing (as it were)
backward. There should be.
- Translation. As of last night, a "complete" dtd has
been created, complete in that every element has been declared, supplied
with a content model, and equipped with an attributes list--and every
piece of the document analyses has been represented in the dtd. More
recent changes in the analyses have yet to be incorporated, the
resultant version has yet to be converted to a TEI-compliant modular
form, and it has yet to parse. But the basic translation has been
done.
- Learning SGML. I for one am much more confident
with SGML/dtd basics than I was two months ago, and the process has certainly been
a genuine learning experience for all concerned. Whether our conceptual
grasp is adequate to the task has yet to be determined.
- Development of procedures. Though our procedures
are still largely dictated by the fact that all of us are contributing to the
project only our "spare" time after 100 or 120 hour/week work schedules,
and the governing principles are still flexibility and consensus, a more
rigorous delegation of tasks in the past few weeks has generated a
productivity noticeably greater than that apparent during the more
haphazardly arranged weeks that preceded. Duplicate effort has been
eliminated, and the results of each contributor's labors are available
in predictable ways. I have within the past few days created a version
control system that will govern and document every change in the dtd
and its derivative documents and do so transparently. A similar system
for the analyses is in the works.
- Documentation. Documentation has grown up with our
procedural maturity, and contributed to it. The dtd itself is physically
laid out in a way conducive to human legibility, nearly every tag is
annotated in some way, and all changes, either from one version to
another or from the analyses to the SGML translation, are fully
explained and defended. Eventually these comments, lengthy as they
have grown, should be removed from the dtd and placed in a
separate file attached to each element (perhaps in the manner recommended by
Eve Maler, who suggests that a separate form should be filled out for
every element explaining its content and defending its necessity and
appropriateness). I have created several web pages, some within the past
few days, that extend the explicit documentation further: documenting
the steps taken to integrate the CodeBook dtd into TEI, listing in one
convenient place the revision status of all documents, and archiving the
more important e-mail messages and announcements by date and subject.
The alphabetical elements list, created first simply in order to assure
consistency in naming practices, to avoid conflict with TEI names, to
generate the list necessary for inclusion in one of the TEI entity files, and
to act as an index to the dtd (by section number), will also serve as
the basis of the user documentation. When the dtd reaches sufficient
stability that we can begin on user documentation, the individual bits
of dtd code will be pasted into the alphabetical list of elements and
explained by example--the bulk of the necessary documentation, though
hardly all of it. Additional documentation (e.g. controlled vocabulary
lists and thesauri, attribute lists, etc.) is proposed but not yet
created.
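As a rough illustration of what the parameterized form involves, the fragment below follows the pattern of the TEI P3 dtd files, in which each element name is reached through an n.-prefixed parameter entity and each declaration sits in a marked section that can be switched off. The choice of otherMat and its content model are assumptions made for the sketch, not the declarations actually found in the CodeBook dtd.

```sgml
<!-- Entry of the kind that could be generated from the alphabetical
     elements list (hypothetical; patterned on the TEI entity file of
     element names) -->
<!ENTITY % n.otherMat 'otherMat' >

<!-- Declaration in parameterized form: redefining the otherMat entity
     as 'IGNORE' beforehand would suppress the element entirely -->
<!ENTITY % otherMat 'INCLUDE' >
<![ %otherMat; [
<!ELEMENT %n.otherMat;  - -  (%specialPara;)  >
<!ATTLIST %n.otherMat;       %a.global;       >
]]>
```

Since every element needs one line of the first kind, a clean alphabetical list of names is the natural source from which to produce them.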
- Incorporation of TEI elements. (Or determination otherwise
of the usefulness of TEI). I believe that the method that I proposed
to the group for incorporation of codeBook into TEI, and have since then
pursued, will prove a practical one. Another five to ten hours of
conversion time, and we should be able to see if the codebook/TEI
amalgam will parse as a single DTD. If it does, codeBook will have at
its disposal a host of well-defined, thoroughly bug-tested, and
well-documented tags, mostly at the lowest levels, and will to that
extent be spared a very considerable amount of development effort and
time.
- Sample tagging. Others in the group have been responsible
this week for tagging sample codebooks with our tags. I have not yet
seen the results.
- Stylesheet-Display-SGML filtering-User Guidelines.
All of these are still in the planning stages; all depend on having a
valid, parsable, stable dtd to begin with and will have to be postponed
till that date.
Decisions
Countless small decisions have been made in the course of the
project. Among the most important in their influence on the SGML implementation
should be numbered the following:
- Content-oriented element selection and structure.
- If the choice in SGML development is between procedural, presentational,
structural, and informational markup, the choice has been (with the
ICPSR Committee's strong support) to use an exclusively informational
structure at all but the lowest levels. Whether this decision can be
sustained in practice, or must be modified by the inclusion of some upper-level
tags to represent divisions like chapters or even page numbers, remains to be
seen, but the sense of the Committee
is certainly that (in the words of one member) "our business is
metadata, not text." The immediate implications of this decision for the
dtd structure include not only the selection of exclusively
informational tags at the upper levels, but the abandonment of any sort
of constraints on sequence or repeatability for most of the elements.
The creator of a structurally-oriented dtd (like the default text structure
of TEI) has the luxury of knowing that in most texts some matter in the
front will be followed by some matter in the middle and then by some
matter at the back, and that there will be some sort of internal scheme
of chapters or books or stanzas to tag. An informational tagger, on the
other hand, can never count on the author appearing before the title, or
the data files description before the data description, or the study
description before the questionnaire, or on any kind of information
appearing without being commingled with other kinds. The lack of
predictable sequence rules out constraints on order; the distribution,
mingling, and interruption of given kinds of information by other kinds
exclude restraints on repeatability.
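A minimal sketch of what the absence of constraints on sequence and repeatability means for a content model. The first declaration is a simplified version of the TEI default text structure; in the second, the names codeBook, docDscr, fileDscr, and dataDscr are placeholders assumed for illustration, and the model is not quoted from the CodeBook dtd.

```sgml
<!-- Structurally oriented model (simplified): fixed order, each part
     at most once -->
<!ELEMENT text     - -  (front?, body, back?) >

<!-- Informational model: any part, in any order, any number of times
     (element names partly assumed) -->
<!ELEMENT codeBook - -  (docDscr | stdyDscr | fileDscr | dataDscr | otherMat)* >
```

The price of the second model is that a parser will accept a codebook with no study description at all, or with five of them; completeness can no longer be checked by validation alone.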
- Concentration on retrospective conversion.
- In keeping with this philosophy, it is the sense of the Committee
that it is vain to put much value on the possibility of creating a
normative codebook structure that entails constraints on sequence.
Even if such a beast were created, the institutional habits even of the
ICPSR member institutions are too deeply ingrained to allow them readily
to adopt an unfamiliar format. Instead, we are urged to make the
structure loose enough to accommodate the varied codebooks already in
print and on line, and not to worry that this retrospectively-induced
laxity will induce prospective laxity as well. My own suggestion in this
regard was to emulate the TEI "Dictionaries" module, with which I have
spent hundreds of hours, in its solution to the same problem. Older
dictionaries, which exist in a bewildering variety of structures, may
be tagged with what is effectively an entirely different dtd invoked
by use of the "entryFree" instead of the "entry" element at the entry
level of the dictionary. In entryFree mode, virtually any element
can appear virtually anywhere. Prospective encoders, on the other hand,
are encouraged to use the more restrictive "entry" tag and its contents.
Such a division may still be incorporated into CodeBook.dtd, but its
inclusion is of fairly low priority.
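Were such a division adopted, it might look something like the following pair, patterned on the relation between "entry" and "entryFree." The name stdyFree and both content models are hypothetical.

```sgml
<!-- Strict form, for newly written codebooks: fixed order, title
     required exactly once -->
<!ELEMENT stdyDscr - -  (titl, author+, otherMat*) >

<!-- Free form, for retrospective conversion: the same children with
     no constraint on order or number -->
<!ELEMENT stdyFree - -  (titl | author | otherMat)* >
```

The tagger, not the dtd, then decides which regime applies to a given codebook, simply by choosing one element or the other.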
- Use of TEI parameter entities.
- The TEI parameter entities for low-level elements (chiefly
%phrase.seq; %paraContent; and %specialPara;) have, at my suggestion,
been used as the tip-of-the-branch content models for almost all of
the CodeBook trees. Each entity includes #PCDATA, but also includes
a wealth of other elements (presentational, structural, and
informational), that thereby become available to the tagger with little
effort from the dtd-creator. The lowest of the three, %phrase.seq; is
described in the TEI guidelines as consisting of "broth" (as opposed
to chunks or soup), or phrase-level elements like:
    abbr        abbreviation
    address     address
    date        date
    dateRange   dateRange
    emph        emphasis
    foreign     foreign
    hi          highlighted
    name        name
    title       title
    ptr         pointer
    lang        language
Moving up to the paragraph-content level with %paraContent; allows
inclusion of lists, tables, figures, captions, and notes, among many
other things. %specialPara; includes all of these plus paragraphs
themselves, and is capable of handling all the presentational and
structural tagging of most running prose found in a code book, as well
as that of any figures, tables, or graphs that interrupt it (though
most of those tend to be excluded from the digital versions of codebooks
in any case, which are mostly mere ASCII at present).
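The following fragment suggests how the three entities can serve as tip-of-the-branch content models. Only the entity names come from TEI; the element name universe and the pairing of each element with a particular entity are assumptions made for illustration.

```sgml
<!-- Phrase level: character data plus abbr, date, name, and the like -->
<!ELEMENT label    - -  (%phrase.seq;)   >

<!-- Paragraph-content level: adds lists, tables, figures, and notes -->
<!ELEMENT universe - -  (%paraContent;)  >

<!-- Adds paragraphs themselves, for stretches of running prose -->
<!ELEMENT txt      - -  (%specialPara;)  >
```

Each declaration costs the dtd-creator a single line, yet everything the entity admits becomes available inside the element.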
- Creation of a TEI base tag set.
- In order to meet the simultaneous needs to (1) create an
informationally-based dtd (quite opposed to most of TEI's structurally-
oriented treatment of text, not data); and (2) make use of
the TEI facility with presentational and structural elements, I
proposed that we convert the codeBook dtd into a simulacrum of a
TEI base tag set, a close enough facsimile to convince TEI that
it belonged, and invoke it (using the teikey2.ent 'INCLUDE'/'IGNORE'
values) alongside at least the additional tag set for figures and
graphs (for obvious reasons!) and perhaps also the additional set
for names and dates. I have not yet decided whether to invoke the
TEI default text structure (front-body-back; div/div/div), mostly
because, while seeing its usefulness for the representation of such
features in the codebooks, I do not yet see a way to allow this structure
to coexist happily with the CodeBook's content-based "bibliographic info--study
description--data files description--data description" structure. Some
experimentation will be necessary. The TEI header should also be
mentioned: its advantages lie in its control of bibliographic
information and its potential as a source for the automated generation
of MARC records; its corresponding drawbacks are its restrictiveness
and the obligation it imposes on the tagger to catalog the item, both
of which can be vexing.
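In a document instance, an arrangement of this kind would be invoked roughly as follows. The public identifier and the TEI.figures and TEI.names.dates keywords are those of TEI P3; the keyword TEI.codeBook is hypothetical, standing for whatever name the new base tag set receives once it has been registered alongside the others in the TEI entity files.

```sgml
<!DOCTYPE TEI.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN" "tei2.dtd" [
  <!-- Hypothetical keyword for the CodeBook base tag set -->
  <!ENTITY % TEI.codeBook    'INCLUDE' >
  <!-- Additional tag sets: figures and graphs; names and dates -->
  <!ENTITY % TEI.figures     'INCLUDE' >
  <!ENTITY % TEI.names.dates 'INCLUDE' >
]>
```

Any tag set not named in the subset stays switched off, so the choice of whether to add the default text structure can be tried out one declaration at a time.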
- Limited use of generic tags.
- The document analyses have been consistent in proposing distinct
tags for every kind of information, and even for the same kind of
information if its focus or hierarchical placement varied. A strict
and literal translation of the analyses would have required separate
tags even for general catchall elements like "other materials": there
would be an "other materials--study level" tag alongside an "other
materials--data level" tag. "Record label" is to be distinguished from
"variable label," and so on. While this is a defensible practice in part
(it is a mistake, as Maler points out, to think that adding an attribute
to an element entails any less overhead for the coder than adding a
completely new element: in many authoring systems it may well involve more),
in part it also results in a needless
proliferation of elements for what is basically the same kind of
information. It may be that a certain unfamiliarity or uneasiness with
SGML has left the analyzers unhappy with the thought of relying on
context for information, with the thought that "otherMat" nested inside
"stdyDscr" represents "Study-Description-Other-Materials" just as
clearly as a separate "S-D-O-M" tag would do, and therefore somewhat
resistant to exploiting the contextual powers of an SGML text base.
In any case, though hesitant to strike out too much on my own, I have
converted a half dozen tags or so to more generic tags (otherMat,
label, txt, stat[istic], etc.), though supplying each of them with an
optional "level" attribute for the timid.
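A sketch of one such generic tag. The content model and the three values of the "level" attribute are assumptions for illustration, and the sample text is invented.

```sgml
<!-- Generic element: its meaning normally follows from its parent -->
<!ELEMENT otherMat - -  (%specialPara;) >

<!-- Optional attribute for those who prefer to state the level
     explicitly (value list assumed) -->
<!ATTLIST otherMat      level  (study | file | data)  #IMPLIED >

<!-- In an instance, either form is valid inside stdyDscr -->
<otherMat>Interviewer instructions, 12 pages.</otherMat>
<otherMat level="study">Interviewer instructions, 12 pages.</otherMat>
```

Because the attribute is #IMPLIED, a tagger who trusts the context can ignore it entirely.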
- Limited use of attributes, nesting, and parameter
entities.
- Attributes have been used overwhelmingly for one purpose: either
to constrain possible values automatically (via declared values in
the ATTLIST), or to allow control of possible values through the
use of external authority control. In a few cases, the value in the
text itself is to be simply deleted and replaced by an empty element
with a controlled-value attribute; in most, the form in the text will
be allowed to stand, but will be supplemented by a controlled attribute
value. For all that, attributes are used relatively sparingly, being
reserved for technical specifications (file type, record length,
etc.), geographic and date information (date covered, country), and
bibliographic information (names), though much of the latter ground
is covered by the TEI header.
The TEI %a.global; attribute set is also used throughout, in order
to provide the basic rend, resp, lang, and id attributes.
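Both uses of attributes can be sketched briefly. The element names fileType and nation, the attribute names, and the value lists are invented for the example and should not be read as the CodeBook declarations.

```sgml
<!-- Values constrained by the ATTLIST itself: the parser rejects
     anything outside the declared group. The element is empty,
     replacing the form found in the text. -->
<!ELEMENT fileType - O  EMPTY >
<!ATTLIST fileType      %a.global;
                        charset  (ascii | ebcdic | other)  #REQUIRED >

<!-- Value controlled by an external authority list: the printed form
     stands, and the attribute carries the normalized code -->
<!ELEMENT nation   - -  (%phrase.seq;) >
<!ATTLIST nation        %a.global;
                        code     CDATA                     #IMPLIED >

<!-- In an instance -->
<fileType charset="ascii">
<nation code="CA">Dominion of Canada</nation>
```

The first kind of control is enforced by any validating parser; the second depends on the tagger consulting the authority list, since CDATA accepts any string.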
- Consistent naming. Eight-character names.
- The first is a convenience to all, but especially to the tagger,
whose convenience above all must be considered (since manual tagging
represents the bulk of the costs of any SGML conversion). The latter
is the SGML default, adhered to on principle. Further consideration for
user convenience will have to wait on trials with tagging real codebooks.
The Process
The information management process, as it has developed, can be
more briefly described. John Brandt acts as supervisor, sets meeting
agendas, and funnels information into the group from the ICPSR Committee
and from Ann Green. Ann works chiefly with the document analyses,
revising them constantly against real-world documents. She posts her
changes to her own server at Yale, whence they are picked up by John
and contribute to the revised versions that he posts to his server
at Michigan. At the moment, this process is somewhat confused, and it
is difficult to know where at any one time one might find the latest
versions, or what changes have gone into them. I work primarily with
the dtd itself, its documentation, and its amalgamation into TEI;
prospectively also the creation of visual representations of the
dtd for presentation to ICPSR Committee meetings. In the meantime,
I post new versions of all the relevant documents to a page in my file
space (linked to John's and Ann's), and pick up material on which to
base revisions from their pages, as well as from e-mail from anyone
in the group. I have just begun to archive this e-mail at the same
site, and to implement a careful version-control scheme. A summary
of version information for my various documents is available here
as well. Nancy Vlahakis, after having contributed large chunks of
the DTD, has moved for the time being to test-tagging codebooks,
prospectively also to creating a Panorama style sheet. Though it
has proven difficult to coordinate our activities (since none of
us can give more than a few hours a week to it, usually in spurts),
the use of updated web pages and occasional meetings has by and large
worked. Movement of information against the flow (say from me to Ann)
has been more problematic.
Some Reflections
On the Decisions
- Content-based coding. The real problems with
this approach will only become visible when it is seriously tested
against real documents. Some features of it give one pause:
- The
fact that we have created a "loose" structure first will probably
make it more difficult to add a more constrained structure later.
(Maler, for example, counsels that if one is going to create both,
the strict structure should be constructed first.)
- The infinite
variability of the codebooks' prose and the lack of obvious
connection between what appears on the page and the available
tags may present the tagger with some very puzzling decisions,
and perhaps tempt him or her into "tag abuse" (use of inappropriate
tags for convenience' sake). The problem with informational, as
opposed to presentational or even structural, tagging, is that
it demands nothing short of full understanding on the part of
the tagger, understanding both of the dtd and its intended
application, and of the document being tagged. Everyone
recognizes a chapter or a list; a study-level range, weighting,
variable group, or universe is usually a little harder to pick out.
- The need to jump from one upper branch of the dtd to another
several times in a single page or paragraph is not inconceivable and
would make for some very laborious tagging, or for some difficult
decisions in deciding on the desirable granularity of the tagging.
- Mismatch between structural and informational items creates some
unpleasant problems for the tagger.
- Treating a text as a database ignores its textual nature, at the
risk of assuming a more logical (and vocabulary-controlled) text
than really exists; alternatively, it imposes burdens on the coder
by requiring them to modify or discard the text as printed.
- Some problems are created when information is repeated in different
forms. The title may for example be tagged as "titl" a dozen times,
each time slightly differently. The TEI header helps here.
- Should anyone wish to generate printed codebooks from the SGML
file, they may find it difficult to approximate the appearance of the
original.
- The actual placement of metadata elements has been entirely at the
discretion of the Committee and its appointed subject experts. I
have not been in a position (because of both my specific role and
my ignorance of statistics) to comment on them or assess their aptness.
- Use of TEI entities. Some advantages of using
the TEI entities lie in the attributes attached to
the tags. Using "date," for example, allows for control of the
date form with the "value" attribute regardless of the way that
the date is expressed in the text; similarly, using "abbr" allows
for control of the full form (say,
of an agency) using the "expan" attribute. This feature of the
TEI elements invoked by the TEI entities has been considered
in examining the question of authority control and consistency, and
in several such cases has been relied on to provide the necessary
control.
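In an instance, the two cases mentioned would look something like this. The "value" and "expan" attributes are those of the TEI elements; the ISO-style date format is an assumption, since the form of the normalized value is a matter of local policy.

```sgml
<!-- The printed form stands; the attribute carries the controlled form -->
<date value="1996-04-25">25 April 1996</date>
<abbr expan="Inter-university Consortium for Political and Social Research">ICPSR</abbr>
```

A search routine can then match on the attribute value alone, whatever variant of the date or agency name appears in the text.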
- Creation of TEI base tag set. This method seems
to be frowned on by both the TEI Guidelines themselves (Sperberg-McQueen
and Burnard) and by Maler, the former preferring that users take
advantage
of the TEI's built-in methods of adding elements, extending attribute
lists, and modifying content models, and warning that the creation
of an entire new tag set is possible but requires intimate knowledge
of the inner workings of TEI and the relationship between core, base,
and additional tag sets (Guidelines, 42). The latter suggests
that modification of an existing dtd is warranted when the purpose
is compatibility and interchange capability, not when it is simply convenience.
However, the particular approach that we have taken, midway between
writing a dtd from scratch and modifying an existing one, seems to
obviate both criticisms: simple extension of TEI would not suit
CodeBook's needs, and the gains obtained by using TEI are considerably
more than simply convenience. They include, for example, access to
TEI's customization scheme and documentation.
- Use of Attributes. The current policy seems a good
compromise, except that we should perhaps consider adding
"role" to the TEI %a.global; attribute set, as Maler suggests: a very handy
element extender.
On the Process
If one considers the personnel components of the dtd development
team in terms of Maler's ideal list of required team members (Maler, 72),
it is apparent that some members are missing, or seldom present, and that
each of us has taken on different roles in passing. She suggests that
this is an acceptable situation, so long as each role is clearly
distinguished and consciously taken on. Here we have been remiss, and
it is partly in response to her suggestions that I have recently and
deliberately taken on the role of "recordist," alongside my basic
roles of "implementor," (through seeking outside advice) "guest
expert," and occasionally "facilitator." John has been "project leader"
and "facilitator," occasionally "recordist"; Ann Green perhaps chiefly
"user group" representative. The lack of a visible "project manager,"
and the lack of real input from users, reviewers, and guest experts
(again, using Maler's terminology) has been our weak point as regards
roles and personnel.
When we turn to her discussion of schedule and budget, however,
things become even more grim. The factors making for a slow expensive
project, she says, include:
- Long, complex documents.
- Complexity and variety of document structure.
- Constraints on the project.
- Incompetence of the project leader.
- Unavailability of members of the design team (i.e. the document
analyzers).
- Incompetence of the DTD implementor.
- Lack of discipline on the team in following methodology, documenting
everything, and seriously reviewing all documents and code delivered.
Our documents are often very long, very complex, and extremely various.
We are constrained in time and money (having essentially none of
either).
It is hard to say who the project leader is, exactly.
John and Ann are available from the design team, others more rarely.
The most competent dtd implementor on the staff is me!
The discipline is severely affected by lack of time, though we do
better than most. Considering that we would find it difficult to
come up with the answers to any of her vital questions (some of them
because this is a consortium-based, not a corporate-based, project),
it is amazing that we have gotten as far as we have (Maler, 71):
- Who makes decisions?
- Who funds the project?
- Who decides who is involved?
- Who is actually concerned?
- Who informs the managers of these people?
- Who is in charge and responsible for the results?
Prospects
The critical tests for this project all lie in the future, the most
important within the next few weeks:
- Can the CodeBook DTD + TEI be made to parse?
Have we understood TEI well enough to integrate this added module
into it?
- Can entire code books be readily marked up using CodeBook tags?
- Will the results parse?
If the answers to these are "yes," and I have every reason to suppose
that they are, most of the rest of the project will come along happily
and affirmatively in their wake. The basis of user documentation is
already made; the display and style mechanisms are trivial to complete;
and even the HTML filtering, though hardly trivial, is an established
technology and technique. We should by summer's end be able to produce
what Maler calls the minimum deliverables of a dtd development project:
- The document analysis report (most of this in hand, though all needs
and rationales are not yet expressed therein).
- The dtd code itself, and a demonstration of its validity.
- The DTD user and maintenance documentation.
- Test documents marked up with the DTD.
Bibliography
We have been most deeply engaged with the documents themselves: the
sample codebooks (especially ICPSR reprints of US Census data dictionaries),
the dtd, and the document analyses. The following have, however, also
been essential resources:
- Goldfarb, Charles. The SGML Handbook. Oxford, 1990.
- A standard book useful for handy access to the SGML
standard and its tricks.
- Maler, Eve, and Jeanne El Andaloussi. Developing SGML DTDs:
From Text to Model to Markup. Englewood Cliffs: Prentice Hall,
1995.
- Though given to a little self-indulgence and
silliness, this is not only a very readable and intelligent
book, but a very practical one aimed at almost exactly our level
of expertise and need. Apart from Dale Waldt and Brian Travis,
The SGML Implementation Guide (Springer Verlag, 1995),
a more technical and systems-oriented book, I know of no other
book that addresses the question of practical dtd development
at any length.
- Maler, Eve. Tutorial on DTD Construction. Meeting of the MidWest SGML
Forum, Ann Arbor, MI, 25 January 1996.
- Sperberg-McQueen, C.M. and Lou Burnard. Guidelines for Electronic
Text Encoding and Interchange. [TEI P3] Chicago and Oxford, 1994.
- Has set the standard for good SGML dtd documentation,
though any attempt to modify the TEI dtd still requires careful study of
the files themselves.
- Van Herwijnen, Eric. Practical SGML. 2nd ed. Dordrecht:
Kluwer Academic, 1994.
- Despite some improvements from the first edition, Van
Herwijnen still provides less than meets the eye. His discussions of
dtd development consist mostly of hints: useful hints but still hints.
He is better at some of the more esoteric topics: marked sections,
graphics, EDI, math notation, and references. For this reader at least
he has a knack of making the familiar unfamiliar and the simple opaque,
but his is still a handy second-choice book to have around.
I have also examined the EAD.dtd, rather cursorily, as well as
TEILITE.dtd, FINDAID dtd, and MARC.dtd. I hope to have a look at
the just-released EAD tagging guidelines soon, but have not yet been
able to do so.
Attachments
See the following pages for illustrative documents.
Paul Schaffner :: 25 April 1996