The Extended XPath language (EXPath) for querying Concurrent Markup Hierarchies

This document defines the semantics of the Extended XPath language (EXPath) for Concurrent Markup Hierarchies (CMH).

Table of contents

1 Introduction
2 Extended Axes
3 Expressions
    3.1 Node Tests
    3.2 Union Expression
    3.3 Relation Expression
4 Extended Core Function Library
    4.1 Node Sets Collection Functions
    4.2 String Functions
    4.3 Boolean Functions
    4.4 Number Functions
5 Data Model
    5.1 GODDAG
    5.2 Query Examples

1 Introduction

Extended XPath (EXPath) extends regular XPath to select nodes in a GODDAG (see Section 5.1).

One key difference between EXPath and XPath is the return type of a location step: in EXPath a location step evaluates to a node-set-collection, that is, one node-set per hierarchy. Consequently, the context of an expression evaluation is the same as the context of an XPath expression, with the following amendments:

2 Extended Axes

The following new axes are introduced:
  1. xdescendant: includes all nodes in the GODDAG whose text ranges are included in the text range of the current context node, excluding the current context node.
  2. xdescendant-or-self: is the xdescendant set of the current context node plus the current context node.
  3. xancestor: includes all nodes in the GODDAG whose text ranges include the text range of the current context node, excluding the current context node.
  4. xancestor-or-self: is the xancestor set of the current context node plus the current context node.
  5. xfollowing: includes all nodes in the GODDAG whose text ranges follow the text range of the current context node.
  6. xpreceding: includes all nodes in the GODDAG whose text ranges precede the text range of the current context node.
  7. preceding-overlapping: includes all nodes in the GODDAG whose text ranges contain (not on the border) the start tag, but not the end tag, of the current context node.
  8. following-overlapping: includes all nodes in the GODDAG whose text ranges contain (not on the border) the end tag, but not the start tag, of the current context node.
  9. overlapping: is the union of the preceding-overlapping and following-overlapping sets of the current context node.
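
The axis definitions above can be sketched in terms of character offsets. This is a minimal illustration, not the reference implementation: the Node class, the half-open [start, end) offsets, and the choice to treat ranges that share a border as inclusion rather than overlap are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical GODDAG node reduced to the text range it covers."""
    name: str        # element name
    hierarchy: str   # ID of the hierarchy the node belongs to
    start: int       # offset of the first character covered
    end: int         # offset one past the last character covered


def xdescendant(ctx, nodes):
    # Range included in the context range; the context node is excluded.
    return [n for n in nodes
            if n != ctx and ctx.start <= n.start and n.end <= ctx.end]


def xancestor(ctx, nodes):
    # Range includes the context range; the context node is excluded.
    return [n for n in nodes
            if n != ctx and n.start <= ctx.start and ctx.end <= n.end]


def xfollowing(ctx, nodes):
    # Range starts at or after the end of the context range.
    return [n for n in nodes if n.start >= ctx.end]


def xpreceding(ctx, nodes):
    # Range ends at or before the start of the context range.
    return [n for n in nodes if n.end <= ctx.start]


def preceding_overlapping(ctx, nodes):
    # Range strictly contains the context start tag but not the end tag.
    return [n for n in nodes if n.start < ctx.start < n.end < ctx.end]


def following_overlapping(ctx, nodes):
    # Range strictly contains the context end tag but not the start tag.
    return [n for n in nodes if ctx.start < n.start < ctx.end < n.end]


def overlapping(ctx, nodes):
    return preceding_overlapping(ctx, nodes) + following_overlapping(ctx, nodes)


# Ten characters of text marked up in three hierarchies (made-up data).
line = Node("line", "physical", 0, 10)
w1 = Node("w", "words", 0, 5)
w2 = Node("w", "words", 5, 10)
dmg = Node("dmg", "damage", 3, 7)
nodes = [line, w1, w2, dmg]
```

With this data, overlapping(dmg, nodes) selects both words, because the damaged range starts inside the first word and ends inside the second.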

3 Expressions

3.1 Node Tests

The following extensions of the XPath node tests are added:

The following node test is added in EXPath:

3.2 Union Expression

A union ("|") operation on two node-set-collections yields a node-set-collection containing one node-set per component hierarchy. Each node-set in the result is the union of the node-sets that the operands hold for the same hierarchy.
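
As a rough sketch of this rule, a node-set-collection can be modelled as a mapping from hierarchy ID to a set of nodes. The dictionary representation, the hierarchy IDs, and the treatment of a hierarchy missing from one operand as an empty node-set are assumptions made for the example.

```python
def union(left, right):
    """Union of two node-set-collections, computed hierarchy by hierarchy."""
    hierarchies = left.keys() | right.keys()
    return {h: left.get(h, set()) | right.get(h, set()) for h in hierarchies}


# Nodes are represented by plain strings here (made-up data).
a = {"words": {"w1"}, "damage": {"dmg1"}}
b = {"words": {"w2"}, "physical": {"line1"}}
result = union(a, b)
```

Here result holds {"w1", "w2"} for the "words" hierarchy, while the "damage" and "physical" node-sets are carried over unchanged.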

3.3 Relation Expression

Under preparation!

4 Extended Core Function Library

This section provides the semantics of core XPath functions when used for CMH as well as new core library functions of EXPath.

4.1 Node Sets Collection Functions

Function: number last()

The last function returns a number equal to the context size from the EXPath expression evaluation context.

Function: number position()

The position function returns a number equal to the context position from the EXPath expression evaluation context.

Function: number count(node-set-collection)

The count function returns the number of nodes in the argument node-set-collection.
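
A minimal sketch of count, assuming a node-set-collection is modelled as a mapping from hierarchy ID to a set of nodes. Whether a node shared by several hierarchies (such as the root) is counted once or once per hierarchy is not stated above; this sketch simply adds up the sizes of the node-sets.

```python
def count(collection):
    """Number of nodes in a node-set-collection, summed over its node-sets."""
    return sum(len(node_set) for node_set in collection.values())


# Made-up collection with two hierarchies.
collection = {"words": {"w1", "w2"}, "damage": {"dmg1"}}
total = count(collection)
```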

Function: node-set-collection id(object)

The id function selects elements by their unique ID, as the id function in XPath does. Note that the result type is node-set-collection.

Function: string local-name(node-set-collection?)

The local-name function applies the local-name function of XPath to each node-set in the node-set-collection argument and returns the concatenation of all returned strings, separated by a blank space.

Function: string namespace-uri(node-set-collection?)

The namespace-uri function applies the namespace-uri function of XPath to each node-set in the node-set-collection argument and returns the concatenation of all returned strings, separated by a blank space.

Function: string name(node-set-collection?)

The name function applies the name function of XPath to each node-set in the node-set-collection argument and returns the concatenation of all returned strings, separated by a blank space.
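
The concatenation rule shared by local-name, namespace-uri, and name can be sketched as follows. The model is an assumption: a node-set-collection is a mapping from hierarchy ID to a list of node names in document order, and the hierarchies are visited in insertion order. As in XPath, the name of a node-set is taken from its first node in document order, and an empty node-set yields an empty string.

```python
def name(collection):
    """Per-hierarchy XPath-style name(), joined with a blank space."""
    per_hierarchy = [names[0] if names else "" for names in collection.values()]
    return " ".join(per_hierarchy)


# Made-up collection: two word nodes in one hierarchy, one dmg node in another.
label = name({"words": ["w", "w"], "damage": ["dmg"]})
```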

Function: string hierarchy(), boolean hierarchy(string)

With no argument, the hierarchy function returns the document hierarchy ID of the context node. With a string argument, it returns true if the context node belongs to the hierarchy given as parameter, and false otherwise.

4.2 String Functions

Function: string string(object?)

The string function converts an object to a string as follows:

The semantics of the other string functions in the core functions library of XPath is unchanged.

Function: string toLowerCase(string)

The toLowerCase function returns its string argument converted to lower case.

Function: string toUpperCase(string)

The toUpperCase function returns its string argument converted to upper case.

4.3 Boolean Functions

Function: boolean boolean(object)

The boolean function converts its argument to a boolean as follows:

The semantics of the other boolean functions in the core functions library of XPath is unchanged.

Function: boolean matches(string, string)

The matches function returns true if and only if the first string argument matches the regular expression given in the second string argument; otherwise (including the case of an invalid regular expression) it returns false. For details of the regular expression syntax, see the java.lang.String.matches() documentation.
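
The behaviour described here can be approximated in Python as a sketch. Java's String.matches requires the whole string to match, which corresponds to re.fullmatch. Python's regular expression syntax is not identical to Java's, so this only illustrates the whole-string rule and the treatment of invalid patterns.

```python
import re


def matches(text, pattern):
    """True only if the whole of text matches pattern; False on a bad pattern."""
    try:
        return re.fullmatch(pattern, text) is not None
    except re.error:
        # An invalid regular expression yields false rather than an error.
        return False
```

For example, matches("dmg12", r"dmg\d+") holds, while matches("xdmg12", r"dmg\d+") does not, because the pattern must cover the entire string.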

4.4 Number Functions

Function: number number(object?)

The number function converts its argument to a number as follows:

If the argument is omitted, it defaults to a node-set-collection with a node-set containing the context node as its only member.

Function: number sum(node-set-collection)

The sum function returns the sum, over all nodes in the argument node-set-collection, of the result of converting the string-value of each node to a number.
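
A minimal sketch of sum, assuming a node-set-collection is modelled as a mapping from hierarchy ID to a list of node string-values. The conversion follows the XPath 1.0 rule for strings: optional surrounding whitespace, an optional minus sign, and a decimal number; anything else converts to NaN, which then propagates through the sum.

```python
import math
import re

# XPath 1.0 number syntax: whitespace, optional '-', digits with optional fraction.
_XPATH_NUMBER = re.compile(r"[ \t\r\n]*-?(\d+(\.\d*)?|\.\d+)[ \t\r\n]*")


def to_number(value):
    """Convert a string-value to a number, or NaN if it is not numeric."""
    return float(value) if _XPATH_NUMBER.fullmatch(value) else math.nan


def sum_collection(collection):
    """Sum the converted string-values of every node in every node-set."""
    return sum(to_number(v) for node_set in collection.values() for v in node_set)


# Made-up string-values spread over two hierarchies.
total = sum_collection({"a": ["1", "2.5"], "b": [" 4 "]})
```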

The semantics of the other number functions in the core functions library of XPath is unchanged.

5 Data Model

5.1 GODDAG

To represent a distributed XML document, we use the General Ordered-Descendant Directed Acyclic Graph (GODDAG) data structure proposed by Sperberg-McQueen and Huitfeldt. Informally, a GODDAG for a distributed XML document can be thought of as the graph that unites the DOM trees of the individual components by merging the root node and the text nodes. However, because the scopes of XML elements from different component documents may overlap, a GODDAG features one more node type not found in DOM trees, which we call a leaf node. In a GODDAG, leaf nodes are children of the text nodes, and each represents a consecutive sequence of content characters that is not broken by an XML tag in any of the components of the distributed XML document. While each CMH component has its own text nodes in a GODDAG, the leaf nodes are shared among all of them.

A GODDAG has the following types of nodes: root node (unique for the GODDAG), element nodes, attribute nodes, text nodes, and leaf nodes (see the figure below). Note that, in the figure below, the root node at the bottom is the same as the root node at the top: they are drawn separately for simplicity.

The string-value of a node in a GODDAG is the string-value of that node in its respective hierarchy.
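
The leaf nodes described in this section can be sketched by splitting the shared text at every tag position that occurs in any hierarchy. The representation is an assumption made for illustration: each hierarchy is reduced to a list of half-open (start, end) character ranges, one per element, and empty elements are ignored.

```python
def leaf_ranges(text_length, hierarchies):
    """Split [0, text_length) at every tag boundary of every hierarchy."""
    boundaries = {0, text_length}
    for ranges in hierarchies.values():
        for start, end in ranges:
            boundaries.update((start, end))
    ordered = sorted(boundaries)
    # Each pair of consecutive boundaries delimits one leaf node.
    return list(zip(ordered, ordered[1:]))


# Ten characters: two words in one hierarchy, one damaged span in another.
hierarchies = {"words": [(0, 5), (5, 10)], "damage": [(3, 7)]}
leaves = leaf_ranges(10, hierarchies)
```

In this example the damaged span cuts both words, so the two word-level text ranges are divided into four shared leaves: (0, 3), (3, 5), (5, 7), and (7, 10).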

5.2 Query Examples

  1. Q: Find all damaged characters.
    A: /descendant::dmg/descendant::text()
  2. Q: Find all words containing damaged characters.
    A: /descendant::w[xancestor::dmg or xdescendant::dmg or overlapping::dmg]
  3. Q: Find all words containing damaged characters ONLY.
    A: /descendant::w[xancestor::dmg and xdescendant::dmg]
  4. Q: Find all damaged characters which have been restored from other manuscripts.
    A: /descendant::dmg/descendant::text()[xancestor::res]
  5. Q: Find all words containing damaged characters which have been restored from other manuscripts.
    A: /descendant::dmg/xdescendant::w[descendant::text()[xancestor::res]]
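
The predicate of query 2 can be traced on a small model. In this sketch each w and dmg element is reduced to a half-open (start, end) character range, and the ranges and function name are made up for the example. A word is selected if some dmg range includes it (xancestor), lies inside it (xdescendant), or crosses exactly one of its borders (overlapping).

```python
def contains_damage(word, damages):
    """Evaluate [xancestor::dmg or xdescendant::dmg or overlapping::dmg]."""
    ws, we = word
    for ds, de in damages:
        is_xancestor = ds <= ws and we <= de
        is_xdescendant = ws <= ds and de <= we
        is_overlapping = ds < ws < de < we or ws < ds < we < de
        if is_xancestor or is_xdescendant or is_overlapping:
            return True
    return False


# Three words and one damaged span covering the end of the first word
# and the start of the second (made-up data).
words = [(0, 5), (5, 10), (10, 14)]
damages = [(3, 7)]
selected = [w for w in words if contains_damage(w, damages)]
```

The first two words are selected through the overlapping axis; the third lies entirely after the damaged span and is not.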

Implementation

Implementation and examples can be found here.


Send comments and questions to Emil Iacob: emil.iacob at gmail.com.