Tox is a simple tokenizer for XML and HTML data, written in C. Its core is a finite state machine (FSM) that covers most of the basic states needed to parse XML and HTML data. The main design goal is to minimize code size and RAM usage. Tox can be configured to handle different character types such as unsigned char or unsigned short. It is not yet a fully functional XML parser: handling of UTF-8 input, DTD processing, and validation are missing.
Author: Eckhart Köppen
License: LGPL 2.0
The state of a parser is held in a parser_state structure. It should be cleared before usage, i.e. all fields should be set to zero. To parse some text, the following fields need to be set:
To capture parsing events and process the components of an XML document, callback functions need to be set. They are called with the current parser_state structure. Currently, the following callbacks can be set:
The value of the parsed component (tag name, attribute name, etc.) can be found in the buffer field of the parser_state structure. It is stored in the same format (i.e. character width) as the input text.
Parsing is started by calling tox_parse. The parser processes one character at a time, so parsing can be interrupted at any point. The parser_state structure can be reused across calls to tox_parse, e.g. to parse blocks of data arriving over a socket connection.
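The idea of resumable, character-at-a-time parsing can be illustrated with a small stand-alone sketch. It does not use the Tox API: sketch_parser, sketch_init and sketch_feed are hypothetical names invented for this example, and the logic only counts start tags. The point it tries to show is that when all parsing state lives in one structure, input can be fed in arbitrary chunks, with a chunk boundary falling anywhere, even in the middle of a tag.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not part of Tox: all parsing state lives in one
   structure so that input can arrive in arbitrary chunks. */
struct sketch_parser {
    int in_tag;          /* 1 while between '<' and '>' */
    int at_tag_start;    /* 1 for the character right after '<' */
    unsigned open_tags;  /* number of start tags seen so far */
};

/* Clear the state before first use (all fields zero). */
void sketch_init(struct sketch_parser *p)
{
    memset(p, 0, sizeof *p);
}

/* Consume one chunk, one character at a time. Because nothing is kept
   outside *p, the next chunk simply continues where this one stopped. */
void sketch_feed(struct sketch_parser *p, const char *data, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        char c = data[i];

        if (!p->in_tag) {
            if (c == '<') {
                p->in_tag = 1;
                p->at_tag_start = 1;
            }
            continue;
        }
        if (p->at_tag_start) {
            /* End tags, comments, declarations, processing
               instructions and an empty "<>" are not counted. */
            if (c != '/' && c != '!' && c != '?' && c != '>')
                p->open_tags++;
            p->at_tag_start = 0;
        }
        if (c == '>')
            p->in_tag = 0;
    }
}
```

Feeding the chunks "<a><", "b>text</" and "b></a>" one after another yields the same count as feeding "<a><b>text</b></a>" in a single call, which is the property that makes this style suitable for data arriving over a socket.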
HTML parsing is supported if the code is compiled with the macro HTML defined. In HTML mode, Tox is much more forgiving and accepts input that would otherwise be rejected as incorrect XML.
Tox is built around a table containing most of the relevant states for parsing XML documents (range_table). The parsing function tox_parse simply examines the current character in the given text and performs state transitions. Relevant data is captured in the buffer field of the parser_state structure. The capture is controlled by the pump field which gets toggled according to the information in the range_table.
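The table-driven approach described above can be sketched in a few lines. This is only an illustration under stated assumptions: the row layout, the state names and the function sketch_first_tag_name are hypothetical and are not taken from Tox's actual range_table. Each row maps a state and a character range to a next state and a pump value, and the character is copied into the buffer only while the pump is on.

```c
#include <stddef.h>

/* Hypothetical sketch; Tox's real range_table may look quite different. */
enum sketch_state { S_TEXT, S_TAG_OPEN, S_TAG_NAME };

struct sketch_range {
    enum sketch_state state;  /* state this row applies to */
    unsigned char lo, hi;     /* inclusive character range */
    enum sketch_state next;   /* state after the transition */
    int pump;                 /* 1 = capture on, 0 = capture off */
};

/* The first matching row wins; each state ends with a catch-all row. */
static const struct sketch_range sketch_table[] = {
    { S_TEXT,     '<', '<', S_TAG_OPEN, 0 },
    { S_TEXT,     0,   255, S_TEXT,     0 },
    { S_TAG_OPEN, 'a', 'z', S_TAG_NAME, 1 },
    { S_TAG_OPEN, 'A', 'Z', S_TAG_NAME, 1 },
    { S_TAG_OPEN, 0,   255, S_TEXT,     0 },
    { S_TAG_NAME, 'a', 'z', S_TAG_NAME, 1 },
    { S_TAG_NAME, 'A', 'Z', S_TAG_NAME, 1 },
    { S_TAG_NAME, 0,   255, S_TEXT,     0 },
};

static const struct sketch_range *sketch_find(enum sketch_state s,
                                              unsigned char c)
{
    size_t i;
    size_t n = sizeof sketch_table / sizeof sketch_table[0];

    for (i = 0; i < n; i++) {
        const struct sketch_range *r = &sketch_table[i];
        if (r->state == s && c >= r->lo && c <= r->hi)
            return r;
    }
    return NULL;
}

/* Copy the name of the first start tag in text into buf (at most
   cap - 1 characters, always NUL-terminated). Returns the length. */
size_t sketch_first_tag_name(const char *text, char *buf, size_t cap)
{
    enum sketch_state state = S_TEXT;
    int pump = 0;
    size_t len = 0;

    if (cap == 0)
        return 0;
    for (; *text != '\0'; text++) {
        unsigned char c = (unsigned char)*text;
        const struct sketch_range *row = sketch_find(state, c);

        if (row == NULL)
            break;              /* not reachable with catch-all rows */
        if (pump && !row->pump)
            break;              /* pump switched off: name complete */
        state = row->next;
        pump = row->pump;
        if (pump && len + 1 < cap)
            buf[len++] = (char)c;
    }
    buf[len] = '\0';
    return len;
}
```

Keeping the transitions in a constant table rather than in nested switch statements is one way to keep code size small, since the parsing loop itself stays the same no matter how many states are added.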
Tox consists of just two files:
A simple example describing how to use tox can be found in main.c.