CONTENT AND PROPERTIES OF "SET_B" DATA (the Bellcore archive, part 1)
---------------------------------------------------------------------
This data set was received by the LDC in a "run-off" format,
originally intended as input to a common typesetting process
(something like "nroff" or "troff"). Each individual file stores
material in one language from one day of parliamentary proceedings.
The file name reflects the date and language of the content (with date
rendered as "YYMMDD", e.g. 880901_f for the French version of
proceedings on September 1, 1988).
The text files are partitioned into sub-directories according to the
two-digit year ("86", "87", "88"). There are two additional
directories, which provide mappings for parallel content in the text
data: "tokn_map" and "para_map". These will be explained below.
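
The naming scheme above can be decoded mechanically. The following is
a minimal sketch, not part of the distribution: the names
"SessionFile" and "parse_name" are hypothetical, and the pattern
assumes that the language suffix is always a single lowercase letter,
as in the "880901_f" example.

```python
import re
from datetime import date
from typing import NamedTuple


class SessionFile(NamedTuple):
    day: date    # calendar date of the proceedings
    lang: str    # language suffix taken from the file name, e.g. "f"
    subdir: str  # two-digit year sub-directory, e.g. "88"


# Assumed pattern: six digits (YYMMDD), an underscore, one lowercase letter.
_NAME = re.compile(r"(\d{2})(\d{2})(\d{2})_([a-z])")


def parse_name(name: str) -> SessionFile:
    """Split a set_b file name such as "880901_f" into its parts."""
    match = _NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    yy, mm, dd, lang = match.groups()
    # The year directories are "86", "87" and "88", so the century is 1900.
    return SessionFile(date(1900 + int(yy), int(mm), int(dd)), lang, yy)
```

For example, parse_name("880901_f") gives the date 1988-09-01, the
language suffix "f" and the sub-directory "88".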
The conversion to SGML format retained information about paragraph
boundaries and about the source language of various portions of text.
In addition, each paragraph within a file was assigned a sequential
index. The resulting SGML form is different from that of the "set_a"
data in two regards:
1. there are some lines in each file that contain only an SGML tag,
and no text data; these lines are all of the form:
where "#" represents the one- to three-digit sequence number
assigned to the paragraph; the first paragraph of every file is
identified by sequence number "1" -- that is, the paragraph index
values are NOT unique across files. Also, the paragraph id numbers
are not strictly consecutive: there are occasional gaps in the
numbering of the paragraphs.
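
Two practical consequences follow: a paragraph has to be identified by
file name together with its index, and code should not assume that the
indexes run 1, 2, 3, ... without interruption. The sketch below
illustrates both points. It assumes the paragraph indexes have already
been extracted from a file in ascending order; the names
"paragraph_key" and "find_gaps" are hypothetical.

```python
from typing import Iterable, List, Tuple


def paragraph_key(file_name: str, para_id: int) -> Tuple[str, int]:
    """Build a key that is unique across files (a bare index is not)."""
    return (file_name, para_id)


def find_gaps(para_ids: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (first_missing, last_missing) for each gap in the numbering.

    Assumes the ids are given in ascending order and that numbering
    starts at 1, as it does for the first paragraph of every file.
    """
    gaps = []
    expected = 1
    for pid in para_ids:
        if pid > expected:
            gaps.append((expected, pid - 1))
        expected = pid + 1
    return gaps
```

For instance, find_gaps([1, 2, 3, 6, 7, 10]) returns
[(4, 5), (8, 9)].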
With regard to the " ", and to a French word that is
centered between bytes 362 and 363 in the line that begins with the
tag " "; these two words are purported to establish a
correspondence in the translation. Byte offsets into a paragraph are
counted from 1: the first character of the line (the open-angle
bracket "<") is at position 1.
The token alignments may identify punctuation marks in the text, as
well as word tokens; that is, the character found at a given offset
may be a comma, period, colon or other non-alphanumeric character. In
this case, the "tokens" that make up the alignment pairing are the
punctuation marks themselves, not the words that they are adjacent to.
(If the character found at an alignment offset is alphanumeric, the
given character position should represent the center of a word token
that establishes an alignment pairing.)
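
The lookup rule just described can be sketched as follows. This is an
illustration under stated assumptions, not the tokenization used to
build the maps: "token_at" is a hypothetical name, a word is taken to
be a maximal run of alphanumeric characters, an offset that falls
between two bytes is assumed to be given as a fraction (e.g. 362.5),
and the text is assumed to be in a single-byte encoding so that
character positions equal byte positions.

```python
def token_at(line: str, offset: float) -> str:
    """Return the token centered at a 1-based offset into a paragraph line."""
    # Position 1 is the leading "<", so subtract 1 for a Python index.
    # A fractional offset (center between two bytes) is rounded down;
    # both neighbouring bytes belong to the same token anyway.
    index = int(offset) - 1
    if not 0 <= index < len(line):
        raise IndexError(f"offset {offset} lies outside the line")
    if not line[index].isalnum():
        # A punctuation mark is itself the aligned token.
        return line[index]
    # Otherwise widen to the surrounding run of alphanumeric characters.
    start = index
    while start > 0 and line[start - 1].isalnum():
        start -= 1
    end = index + 1
    while end < len(line) and line[end].isalnum():
        end += 1
    return line[start:end]
```

With the made-up line "<t> Oui, monsieur." (the tag here is only a
placeholder), token_at(line, 6) returns "Oui", token_at(line, 8)
returns "," and token_at(line, 13.5) returns "monsieur".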
Obviously, this mapping does not cover all tokens in either language,
but it does serve to establish a large quantity of reference points
for lexical correspondences.
In order to make it easier to use the mapping information, all the map
files and text data for set_b have been published without compression.
It should be pointed out that Melamed had set parameters in his
token-mapping algorithm to trade off some amount of accuracy for
greater execution speed when treating this data set. A more careful
application of the method (especially with a cleaner version of the
source texts) would likely yield a better set of correspondences.
-- Final corrections to text content
The token mapping and paragraph alignment processes revealed some
apparent corruptions in a subset of the text files; the origin of the
corruption is not known (it appeared in the materials received by the
LDC), but the symptom appeared as a "mis-filing" of (portions of) some
proceedings. The "mis-filing" showed signs of being due to some
software malfunction, whereby a final portion of one text, starting at
some arbitrary position, was appended to the end of some other text;
sometimes this would result in the same text content appearing in two
files, and sometimes the appended material was from the other
language. Often, the appended material began in mid-sentence.
Surprisingly, there were many cases where both the English and French
files for a given session were found to contain appended material that
was likewise parallel in content, though the starting points of the
appended material were not well aligned.
We have tried to locate these corruptions in the SGML text files, and
to eliminate material that was fragmented, duplicated elsewhere, or
clearly unrelated to material in the corresponding file in the other
language. Some instances of these problems presumably remain.
Because the token correspondences were computed before a number of
corruptions in the text files were discovered and corrected, it is
possible that some mapping files will contain ranges of false
correspondences. We have tried to identify and fix or remove faulty
mappings, but some residual errors are likely to have escaped notice.