
XML toolkit (XMLTK)

The XML Toolkit (XMLTK) is a collection of command-line tools for processing XML files. Its design follows the tools commonly used under Unix, such as sort, tail, or grep. Besides tools for sorting, nesting, and aggregation, its essential components are the XML stream index (SIX) and the XML stream processor, which make it possible to process XML data streams. The tools can also be combined into pipelines for more complex processing of XML data. XPath expressions are used to identify the nodes in an XML structure. To achieve high processing performance, XMLTK defines a special binary XML intermediate format.

XPath for Node Identification

Tools similar to those offered by the XML Toolkit are well known from the Unix world. Those tools are line-based and use regular expressions. XMLTK, on the other hand, uses XPath expressions derived from a subset of XSLT's pattern grammar to identify the nodes in the XML structure of the files to be processed. The XMLTK tools are not intended as a general-purpose XML transformation language. The goal of XMLTK is to provide programs that solve basic XML-related tasks in a simple way. Because the tools treat XML data as data streams, that is, as continuous sequences of data records, they can process files of any size. Like their Unix counterparts, the XMLTK commands can be connected with the pipe symbol directly on the command line to handle more complex tasks.

An initial group of XMLTK programs includes:

  • xsort for sort operations,
  • xdelete for deleting data,
  • xagg for aggregating data,
  • xflatten for flattening levels in a hierarchy,
  • xpair for replicating certain nodes,
  • xnest for grouping data,
  • xhead for the beginning of a document,
  • xtail for the end of a document.
These tools allow simple operations on XML data to be controlled through Unix-style parameters on the command line.
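
To illustrate how a tool of this kind can work on a stream, the following Python sketch mimics the idea behind xhead. It is not XMLTK code: the function name xhead_sketch, the plain slash-separated path syntax, and the use of Python's standard iterparse are assumptions made for this example only.

```python
import io
import xml.etree.ElementTree as ET

def xhead_sketch(source, path, n):
    """Return the text of the first n elements whose absolute path equals `path`.

    `path` is a plain slash-separated path such as "/bib/book/title". This is
    a simplification for illustration, not XMLTK's actual XPath subset.
    """
    target = [step for step in path.split("/") if step]
    stack = []    # element names from the root down to the current node
    results = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == target:
                results.append((elem.text or "").strip())
                if len(results) == n:
                    break  # stop early: the rest of the stream is never parsed
            stack.pop()
            elem.clear()   # drop the content of the element that just ended
    return results

doc = (b"<bib><book><title>A</title></book>"
       b"<book><title>B</title></book>"
       b"<book><title>C</title></book></bib>")
print(xhead_sketch(io.BytesIO(doc), "/bib/book/title", 2))  # prints ['A', 'B']
```

Because the loop stops as soon as n matches have been seen, the cost depends on where the matches occur rather than on the total size of the input, which is the property the stream-oriented tools rely on.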

The goal when processing large XML streams is to achieve the highest possible performance. This is the task of the XML stream processor, which provides an application programming interface (API) written in the C programming language. The programs listed above use the same API to evaluate XPath expressions. The XML stream processor matches so-called variables, which mark the nodes of an XPath query tree, against the incoming XML data stream. Via the intermediate step of a nondeterministic finite automaton (NFA), the query tree is finally mapped to a deterministic finite automaton (DFA). The DFA performs the pattern matching that finds the specified XML data within the XML data stream. A major advantage of this approach is that it achieves a constant throughput of XML data that is independent of the number of patterns to be checked.
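
The automaton idea can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the XMLTK implementation or its C API: it supports only child steps (/), descendant steps (//), and the * wildcard, and the names compile_paths, LazyDFA, and match_stream are hypothetical. Each DFA state is a set of NFA states, and transitions are computed on first use and then cached.

```python
def compile_paths(paths):
    """Turn each path such as "/a//b" into a list of (axis, name) steps."""
    compiled = []
    for path in paths:
        steps, axis = [], "child"
        for token in path.split("/")[1:]:
            if token == "":          # an empty token marks a "//" step
                axis = "descendant"
            else:
                steps.append((axis, token))
                axis = "child"
        compiled.append(steps)
    return compiled

class LazyDFA:
    def __init__(self, paths):
        self.steps = compile_paths(paths)
        # NFA state (i, k): k steps of expression i have been matched.
        self.start = frozenset((i, 0) for i in range(len(self.steps)))
        self.table = {}  # (dfa_state, tag) -> dfa_state, filled on demand

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:
            nxt = set()
            for i, k in state:
                steps = self.steps[i]
                if k == len(steps):
                    continue                 # fully matched: no further moves
                axis, name = steps[k]
                if axis == "descendant":
                    nxt.add((i, k))          # may still match deeper down
                if name == tag or name == "*":
                    nxt.add((i, k + 1))
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def accepted(self, state):
        return sorted(i for i, k in state if k == len(self.steps[i]))

def match_stream(dfa, events):
    """Run ("open", tag) / ("close", tag) events through the DFA."""
    stack, hits = [dfa.start], []
    for kind, tag in events:
        if kind == "open":
            state = dfa.step(stack[-1], tag)
            stack.append(state)
            hits.extend((tag, i) for i in dfa.accepted(state))
        else:
            stack.pop()
    return hits

events = [("open", "bib"), ("open", "book"), ("open", "title"), ("close", "title"),
          ("open", "author"), ("close", "author"), ("close", "book"), ("close", "bib")]
dfa = LazyDFA(["/bib/book/title", "//author"])
print(match_stream(dfa, events))  # prints [('title', 0), ('author', 1)]
```

Once a transition has been cached, each opening tag costs one dictionary lookup regardless of how many path expressions were compiled, which is the intuition behind throughput that does not depend on the number of patterns.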

Furthermore, the XML stream processor creates a so-called stream index (SIX), which stores the start and end positions of a node as a pair of values. This makes it possible to process very large sets of patterns, each defined by an XPath expression. If none of the XPath expressions can match the nodes of a subtree, that subtree can generally be skipped.
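
The following Python sketch shows how start/end pairs allow a reader to jump over a subtree. It is only an approximation of the idea: the in-memory list of offsets, the simplified regular-expression tokenizer (no comments, CDATA sections, or ">" inside attribute values), and the names build_index and count_with_skips are assumptions, and the set of prunable element names stands in for the decision that no XPath expression can match inside a subtree.

```python
import re

# Simplified tag pattern: opening, closing, and empty-element tags only.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def build_index(xml):
    """Return [(start, end)] character offsets of every element, sorted by start."""
    index, open_starts = [], []
    for m in TAG.finditer(xml):
        closing, empty = m.group(1), m.group(3)
        if closing:
            index.append((open_starts.pop(), m.end()))
        elif empty:
            index.append((m.start(), m.end()))
        else:
            open_starts.append(m.start())
    index.sort()
    return index

def count_with_skips(xml, index, wanted, prunable):
    """Count <wanted> elements, jumping over subtrees rooted at a prunable name.

    Returns (number of matches, number of tags actually read).
    """
    ends = dict(index)              # start offset -> end offset
    pos, count, tags_read = 0, 0, 0
    while True:
        m = TAG.search(xml, pos)
        if m is None:
            return count, tags_read
        tags_read += 1
        pos = m.end()
        if m.group(1):              # closing tag: nothing to do
            continue
        name = m.group(2)
        if name == wanted:
            count += 1
        if name in prunable:
            pos = ends[m.start()]   # jump straight past the whole subtree

xml = ("<bib><book><title>A</title>"
       "<review><title>x</title><p>y</p></review></book>"
       "<book><title>B</title></book></bib>")
index = build_index(xml)
print(count_with_skips(xml, index, "title", {"review"}))  # prints (2, 11)
print(count_with_skips(xml, index, "title", set()))       # prints (3, 16)
```

With the review subtree marked as prunable, the reader touches 11 tags instead of 16; on large documents with many irrelevant subtrees the saving grows accordingly.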

XMLTK supports combining its tools into pipelines. For the parsing step that this requires, XMLTK defines a special binary intermediate format that assigns a unique number to each element name that occurs. This also makes comparisons between XML identifiers simple.
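
The numbering of element names can be illustrated with a short Python sketch. The token layout used here (tuples tagged OPEN, CLOSE, or TEXT) is an assumption for illustration and does not reproduce XMLTK's actual binary format; only the idea of replacing each element name with a small integer is taken from the description above.

```python
import xml.parsers.expat

OPEN, CLOSE, TEXT = 0, 1, 2   # hypothetical token kinds

def tokenize(xml_text):
    """Parse once and return (tokens, ids), with element names replaced by integers."""
    ids = {}       # element name -> small integer, assigned on first occurrence
    tokens = []

    def name_id(name):
        return ids.setdefault(name, len(ids))

    def on_start(name, attrs):
        tokens.append((OPEN, name_id(name)))

    def on_end(name):
        tokens.append((CLOSE, name_id(name)))

    def on_text(data):
        if data.strip():
            tokens.append((TEXT, data))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True          # deliver adjacent character data in one piece
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)
    return tokens, ids

tokens, ids = tokenize("<bib><book><title>A</title></book><book/></bib>")
print(ids)  # prints {'bib': 0, 'book': 1, 'title': 2}
```

A downstream stage that receives such tokens can test whether an element is a book by comparing two integers instead of two strings, and it does not need to parse XML text again.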

The XMLTK tools are specialized mainly for the fast execution of simple XML transformations and therefore do not take namespaces into account. Integration with other XML processing tools is nevertheless possible by exchanging XML text at the level of the respective operating system.

Article information
English: XML toolkit - XMLTK
Updated at: 29.10.2013