Concurrent Markup Hierarchies (CMH) Resources

How to search multiple hierarchies? Here is our approach:

Case study example

  1. Encodings
  2. Overlapping markup
  3. Queries

1. Encodings

We consider an example with three markup hierarchies over a single text document. The hierarchies presented in the example are likely to appear in real document encodings (except that, in general, underlines and italics may overlap each other). However, the corresponding DTDs are given only for the purpose of grouping elements into hierarchies, and we simplify them by not defining attributes. The individual hierarchies are not validated.

The DTD and the encoding for each hierarchy are given below. Numeric n attribute values were assigned to tags having the same label in order to differentiate them in the query evaluation results.

Text hierarchy:
  <!ELEMENT doc (p)*>
  <!ELEMENT p (sentence)*>
  <!ELEMENT sentence (#PCDATA|w)*>
  <!ELEMENT w (#PCDATA)>

Physical hierarchy:
  <!ELEMENT doc (#PCDATA|page)*>
  <!ELEMENT page (#PCDATA|line)*>
  <!ELEMENT line (#PCDATA)>

Condition hierarchy:
  <!ELEMENT doc (#PCDATA|u|i)*>
  <!ELEMENT u (#PCDATA)>
  <!ELEMENT i (#PCDATA)>

The three encoded documents follow in the same order: text, physical, condition.

<?xml version="1.0"?>
<doc id="CP56483">
...
<p>
<sentence n="13"> <w n="1">Where</w> <w n="2">there</w> <w n="3">are</w>
<w n="4">charges</w> <w n="5">that</w> <w n="6">by</w> one means 
of another the vote is being denied, we must 
find out all of the facts -- the extent, the 
methods, the results.</sentence> <sentence n="14">The 
same is true of substantial


<w n="7">charges</w> that unwarranted economic of other
pressures are being applied to deny
<w n="8">fundamental</w> <w n="9">rights</w> <w n="10">safe-
guarded</w> <w  n="11">by</w> the Constitution and laws of <w n="12">the</w>
United States.</sentence>
</p>...
</doc>
<?xml version="1.0"?>
<doc id="CP56483">
<page n="1">
...
<line n="20"> Where there are</line>
<line n="21">charges that by one means</line>
<line n="22">of another the vote is being denied, we must</line>
<line n="23">find out all of the facts -- the extent, the</line>
<line n="24">methods, the results. The</line>
<line n="25">same is true of substantial</line>
</page>
<page n="2">
<line n="1">charges that unwarranted economic of other</line>
<line n="2">pressures are being applied to deny</line>
<line n="3">fundamental rights safe-</line>
<line n="4">guarded by the Constitution and laws of the</line>
<line n="5">United States.</line>
...
</page>
</doc>
<?xml version="1.0"?>
<doc id="CP56483">
... 
Where there are
charges that by one means 
of another the vote is being denied, <u n="1">we must
find out all of the facts</u> -- <i>the extent, the
methods, the results.</i> The
same is true of substantial

charges that unwarranted economic of other
pressures are being applied to <u n="2">deny
fundamental rights </u>safe-
guarded by the Constitution and laws of the
United States
....
</doc>
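
Because the three encodings mark up the same text, each element can be described by the character range it covers. The following is a minimal sketch of that idea, not the test program's implementation: element_spans is a hypothetical helper, and the two inputs are trimmed fragments of the encodings above. It assumes the fragments carry identical text content, which the full encodings above do not guarantee without whitespace normalization.

```python
import xml.etree.ElementTree as ET


def element_spans(xml_text):
    """Return (tag, n, start, end) per element, as offsets into the text content.

    Hypothetical helper for illustration; elements are listed in the order
    their end tags appear.
    """
    root = ET.fromstring(xml_text)
    spans = []
    pos = 0  # running offset into the concatenated character data

    def walk(elem):
        nonlocal pos
        start = pos
        pos += len(elem.text or "")        # text before the first child
        for child in elem:
            walk(child)
            pos += len(child.tail or "")   # text between this child and the next
        spans.append((elem.tag, elem.get("n"), start, pos))

    walk(root)
    return spans


# Trimmed fragments of the text and physical encodings (same underlying text).
TEXT_FRAGMENT = (
    '<sentence n="13"><w n="4">charges</w> <w n="5">that</w> '
    '<w n="6">by</w> one means</sentence>'
)
PHYSICAL_FRAGMENT = '<line n="21">charges that by one means</line>'

if __name__ == "__main__":
    for span in element_spans(TEXT_FRAGMENT) + element_spans(PHYSICAL_FRAGMENT):
        print(span)
```

With offsets in hand, relationships between elements of different hierarchies reduce to comparisons between integer ranges.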

2. Overlapping markup

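There are several instances of overlapping markup in the encodings above. For example, sentence 14 begins on page 1 and ends on page 2; word 10 ("safe-guarded") starts in line 3 of page 2 and ends in line 4; and the first underlined span (u 1) starts in line 22 of page 1 and ends in line 23.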
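
One way to make the notion of overlap concrete is to treat every element as a half-open character range and call two elements overlapping when they share text but neither contains the other. This is a sketch under that assumed definition; the function name overlaps is hypothetical, and the text is the excerpt covered by lines 22 and 23 of page 1.

```python
def overlaps(a, b):
    """True when spans a and b share text but neither contains the other.

    Spans are (start, end) pairs with end exclusive. This definition is an
    assumption made for illustration.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    shares_text = a_start < b_end and b_start < a_end
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return shares_text and not a_in_b and not b_in_a


# Lines 22 and 23 of page 1, separated by a line break.
TEXT = (
    "of another the vote is being denied, we must\n"
    "find out all of the facts -- the extent, the"
)
BREAK = TEXT.index("\n")
LINE_22 = (0, BREAK)
LINE_23 = (BREAK + 1, len(TEXT))
# <u n="1"> covers "we must ... the facts", which crosses the line break.
U_1 = (TEXT.index("we must"), TEXT.index(" -- "))

if __name__ == "__main__":
    print(overlaps(U_1, LINE_22), overlaps(U_1, LINE_23), overlaps(LINE_22, LINE_23))
```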

3. Queries

The answers to the following queries can be obtained using the available test program:
  1. Find all words in line 21 of page 1.
    Query: //page[@n="1"]/line[@n="21"]/xdescendant::w
  2. Which are the sentences entirely or partially in page 1?
    Query: //sentence[xancestor::page[@n="1"] or overlapping::page[@n="1"]]
  3. Find all document lines that contain the word "safeguarded" (a line is considered to contain a word if the word lies entirely within the line or starts in it).
    Query: //line[xdescendant::w[string(.)="safeguarded"] or following-overlapping::w[translate(string(.),"\n\r-","")="safeguarded"]]
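
The extended axes used in these queries can be read as range tests between elements of different hierarchies. The sketch below mimics query 3 on lines 3 and 4 of page 2. The semantics given to xdescendant and following-overlapping are assumptions inferred from the query descriptions, and Node, locate, and lines_containing are hypothetical names; this is not the test program's implementation.

```python
from typing import NamedTuple


class Node(NamedTuple):
    tag: str    # element name, e.g. "line" or "w"
    n: str      # value of the n attribute
    start: int  # offset of the first character covered
    end: int    # offset one past the last character covered


def locate(text, tag, pieces):
    """Find each (n, substring) piece left to right and return it as a Node."""
    nodes, pos = [], 0
    for n, piece in pieces:
        start = text.index(piece, pos)
        pos = start + len(piece)
        nodes.append(Node(tag, n, start, pos))
    return nodes


def xdescendant(context, nodes, tag):
    """Assumed semantics: nodes of `tag` lying entirely inside `context`."""
    return [x for x in nodes
            if x.tag == tag and context.start <= x.start and x.end <= context.end]


def following_overlapping(context, nodes, tag):
    """Assumed semantics: nodes of `tag` that start inside `context` and end after it."""
    return [x for x in nodes
            if x.tag == tag and context.start <= x.start < context.end < x.end]


TEXT = "fundamental rights safe-\nguarded by the Constitution and laws of the"
LINES = locate(TEXT, "line", [
    ("3", "fundamental rights safe-"),
    ("4", "guarded by the Constitution and laws of the"),
])
# Words 8 to 11 only; word 12 is left out to keep the lookup simple.
WORDS = locate(TEXT, "w", [
    ("8", "fundamental"), ("9", "rights"), ("10", "safe-\nguarded"), ("11", "by"),
])
DROP = str.maketrans("", "", "\n\r-")  # plays the role of translate(...) in query 3


def lines_containing(word):
    """Return the n values of lines that hold `word` entirely or where it starts."""
    hits = []
    for line in LINES:
        inside = [w for w in xdescendant(line, WORDS, "w")
                  if TEXT[w.start:w.end] == word]
        starting = [w for w in following_overlapping(line, WORDS, "w")
                    if TEXT[w.start:w.end].translate(DROP) == word]
        if inside or starting:
            hits.append(line.n)
    return hits


if __name__ == "__main__":
    print(lines_containing("safeguarded"))
```

Under these assumed semantics the sketch selects line 3, the line in which the hyphenated word starts, and not line 4, where it ends.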

Results:
Query: //page[@n="1"]/line[@n="21"]/xdescendant::w


Query: //sentence[xancestor::page[@n="1"] or overlapping::page[@n="1"]]


Query: //line[xdescendant::w[string(.)="safeguarded"] or following-overlapping::w[translate(string(.),"\n\r-","")="safeguarded"]]


Send comments and questions to Emil Iacob: ieiacob at georgiasouthern.edu