Research
Research interests
- Semistructured databases
- XML
- Query languages and processing
- Algorithms and complexity
- Digital libraries
- Computer graphics
- Programming languages
Current research
My current research focuses on engineering aspects of
document-centric XML problems stemming from text encodings in
humanities applications. I actively participate in two ongoing
projects: the Electronic Boethius Project and the ARCHway Project.
The work in these projects is materialized in an electronic edition
development platform, the Edition Production Technology (EPPT).
Here is a summary of some of the current research problems (joint work
with Dr. Alex Dekhtyar):
- Framework for processing XML documents with overlapping markup.
We have formally defined multiple markup hierarchies and designed
a framework for managing document-centric XML with overlapping
(non-hierarchical) markup. The framework generalizes the
traditional XML processing framework at every stage: representing
multiple markup hierarchies, parsing, querying, and authoring
complex document-centric XML with overlapping markup hierarchies.
One advantage of the overlapping markup formalism we proposed is
its flexibility: in theory, overlapping markup can be imported
into our framework from, and exported from it to, a wide range of
alternative representations. We have designed, implemented, and
tested algorithms for merging, filtering, and
updating documents with overlapping markup hierarchies.
- Algorithms for concurrently parsing XML documents with overlapping markup.
We use the Kentucky General Ordered-Descendant Directed Acyclic Graph data
structure (KyGODDAG, based on GODDAG introduced as an abstract data structure by
C.M. Sperberg-McQueen and C. Huitfeldt) as an object model for
overlapping XML markup documents, in the same way DOM trees are
used to represent XML data. We described and implemented the
DOM-style API for the KyGODDAG representation, and developed and
implemented algorithms for parsing overlapping XML markup data
into KyGODDAG.
- Design and implementation of an XPath extension for querying
overlapping XML markup.
XPath and XQuery are ill-suited to
expressing certain important information needs over overlapping
XML markup (e.g., requests for overlapping content given two
tags). In addition, XPath is defined on the DOM tree structure,
whereas concurrent XML documents are modelled using KyGODDAG graphs.
We redefined the XPath semantics on KyGODDAG and extended it with
features that are specific to processing overlapping markup,
such as the overlapping axis. We developed efficient
algorithms (with performance comparable to that of the algorithms
developed by Gottlob et al. for XPath) for processing extended
XPath queries over KyGODDAG structures.
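To make the idea of an overlapping axis concrete, here is a minimal sketch. It is not the KyGODDAG implementation: it assumes each hierarchy is reduced to a flat list of elements carrying character offsets into the shared text, and it reads "overlapping" as "sharing some text while neither element contains the other". The names `Element` and `overlapping`, and the sample text and tags, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    hierarchy: str  # name of the markup hierarchy the element belongs to
    tag: str
    start: int      # offset of the first character covered (inclusive)
    end: int        # offset one past the last character covered (exclusive)


def overlapping(node, candidates):
    """Return the candidates that partially overlap `node`.

    Two elements partially overlap when they share some text
    but neither one contains the other.
    """
    result = []
    for other in candidates:
        shares_text = other.start < node.end and node.start < other.end
        node_contains = node.start <= other.start and other.end <= node.end
        other_contains = other.start <= node.start and node.end <= other.end
        if shares_text and not node_contains and not other_contains:
            result.append(other)
    return result


text = "the quick brown fox jumps"

# Physical hierarchy: two manuscript lines.
line1 = Element("physical", "line", 0, 15)    # "the quick brown"
line2 = Element("physical", "line", 16, 25)   # "fox jumps"

# Linguistic hierarchy: a word and a phrase that crosses the line break.
word = Element("linguistic", "w", 4, 9)            # "quick"
phrase = Element("linguistic", "phrase", 10, 19)   # "brown fox"

hits = overlapping(phrase, [line1, line2])
```

Here the phrase straddles both lines, so both are returned, while the word sits entirely inside the first line and overlaps nothing. A tree-based XPath axis has no direct way to ask this question, because the two hierarchies cannot be nested in a single well-formed tree.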
- Potential validity of document-centric XML documents.
Document-centric XML is often created via prolonged manual editing
of the underlying textual content. In such cases, the document
might not become valid until very late in the editing process. At
the same time, human editors introducing markup into the document
need to know whether or not the current state of the document
allows for a valid extension. We introduced the notion of
potentially valid XML documents and carried out a preliminary
study of the problem of potential validity. We proved that
potentially valid XML documents form a context-free language, and
we provided efficient (linear-time) algorithms for
deciding potential validity.
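The flavor of the question can be shown on a toy case. The sketch below is not the algorithm from the work described above. It assumes a single element whose content model is given as a hypothetical hand-written DFA over child element names, and it ignores the requirement that new markup must wrap existing text. Under those assumptions, a sequence of children allows a valid extension if it is a subsequence of some sequence the DFA accepts.

```python
def reachable(dfa, states):
    """All states reachable from `states` by zero or more transitions."""
    seen = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in dfa.get(state, {}).values():
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def potentially_valid(children, dfa, start, accepting):
    """True if child elements can be inserted so that the DFA accepts."""
    # Elements still to be inserted may move the DFA forward "for free".
    current = reachable(dfa, {start})
    for name in children:
        current = {dfa[s][name] for s in current if name in dfa.get(s, {})}
        if not current:
            return False  # no extension can make this child fit
        current = reachable(dfa, current)
    return bool(current & accepting)


# Hypothetical content model: (title, author+, body)
dfa = {
    0: {"title": 1},
    1: {"author": 2},
    2: {"author": 2, "body": 3},
    3: {},
}
accepting = {3}

missing_author = potentially_valid(["title", "body"], dfa, 0, accepting)
wrong_order = potentially_valid(["body", "title"], dfa, 0, accepting)
```

With the content model fixed, each child is processed in constant time, so the check is linear in the number of children. The sequence `title, body` is not yet valid but can be repaired by inserting an `author`, whereas `body, title` cannot be repaired by insertion alone.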
- Data structures and algorithms for persistent storage and querying
of image-based text encodings.
Electronic editions of manuscripts
are in general based on manuscript folio images. In such cases it
is important to store image features in text encodings and to be
able to efficiently query for these features. We developed data
structures for storing image-text and text-image mappings and
algorithms for efficiently querying them.
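As an illustration of what a text-image mapping has to answer, the following sketch links encoded elements to rectangular regions of a folio image and supports lookups in both directions. It is only a sketch under stated assumptions: the class names, element identifiers, folio names, and pixel coordinates are hypothetical, and the region query is a linear scan where an efficient implementation would use a spatial index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    folio: str   # identifier of the folio image
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int

    def intersects(self, other):
        """True if both rectangles lie on the same folio and share area."""
        return (self.folio == other.folio
                and self.x < other.x + other.width
                and other.x < self.x + self.width
                and self.y < other.y + other.height
                and other.y < self.y + self.height)


class ImageTextMap:
    def __init__(self):
        self._by_element = {}  # element id -> Region

    def link(self, element_id, region):
        self._by_element[element_id] = region

    def region_of(self, element_id):
        """Text-to-image lookup: where does this element appear?"""
        return self._by_element.get(element_id)

    def elements_in(self, query):
        """Image-to-text lookup: which elements touch this region?"""
        return sorted(eid for eid, region in self._by_element.items()
                      if region.intersects(query))


mapping = ImageTextMap()
mapping.link("line-1", Region("f36v", 100, 50, 800, 40))
mapping.link("line-2", Region("f36v", 100, 95, 800, 40))
mapping.link("line-3", Region("f37r", 100, 50, 800, 40))

top_of_folio = mapping.elements_in(Region("f36v", 0, 0, 1000, 92))
```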
- Data structures for storing, querying, and publishing complex data
represented in XML.
Almost all research carried out so far on
storing and querying XML data has addressed data-centric XML
representations and, to a lesser degree, specific problems of
document-centric XML data. The importance of text in XML
representations was recently emphasized by W3C's working draft on
XQuery 1.0 and XPath 2.0 Full-Text. The problem of representing
non-hierarchical structures, in both data-centric and
document-centric XML data, has captured the attention of computer
scientists only recently. As we need to handle a huge variety of
data nowadays, and XML is the de facto standard for data
representation and interchange, we expect an increase in XML data
complexity: mixed data-centric -- document-centric content,
multiple XML hierarchies, etc. Consequently, representing,
querying, and publishing complex XML data need special attention.
- Storage of multi-hierarchical XML in relational databases.
Storage of concurrent XML data poses
challenges stemming from the fact that there is more than one
structure to be stored. The naive solution of storing each
structure independently is not satisfactory, because queries often
involve relationships between different structures. For
multi-hierarchical XML repositories in relational databases, we
plan to examine the use of XPath and XQuery (more powerful and
programmatic than SQL for querying XML data) for efficiently
extracting whole documents or fragments of documents, and of XSLT
for publishing to HTML, PDF, or other output formats.
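One way to see why independent storage falls short is to try the alternative. The sketch below keeps all hierarchies in a single table keyed by offsets into the shared text, so a cross-hierarchy question becomes an ordinary self-join. This is an illustrative schema only, not a design from the projects above; the table name, column names, and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE element (
        id        INTEGER PRIMARY KEY,
        hierarchy TEXT    NOT NULL,  -- which markup hierarchy
        tag       TEXT    NOT NULL,
        start_pos INTEGER NOT NULL,  -- inclusive offset into the shared text
        end_pos   INTEGER NOT NULL   -- exclusive offset into the shared text
    )
""")
conn.executemany(
    "INSERT INTO element (hierarchy, tag, start_pos, end_pos) VALUES (?, ?, ?, ?)",
    [
        ("physical", "line", 0, 15),
        ("physical", "line", 16, 25),
        ("linguistic", "w", 4, 9),
        ("linguistic", "phrase", 10, 19),
    ],
)

# Pair every physical element with the linguistic elements that share text with it.
rows = conn.execute("""
    SELECT a.tag, a.start_pos, b.tag, b.start_pos
    FROM element AS a
    JOIN element AS b
      ON a.start_pos < b.end_pos AND b.start_pos < a.end_pos
    WHERE a.hierarchy = 'physical' AND b.hierarchy = 'linguistic'
    ORDER BY a.start_pos, b.start_pos
""").fetchall()
```

Because both hierarchies refer to the same coordinate system, the join needs no knowledge of either tree. If each hierarchy were stored as a separate document, the same query would first have to realign their text content.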
- Index structures for complex XML data.
Besides storing XML data in relational databases, current XML data
processing solutions include storing XML data in native XML
databases. XML databases are known as efficient solutions for
managing large collections of documents with non-regular and
non-homogeneous structures. In order to make search operations
efficient, special purpose index structures need to be designed to
support simple queries (over one hierarchy) as well as complex
queries over several hierarchies.