ntox is a Newton unit of the C-based tox XML/HTML tokenizer. It has a very small memory footprint (package size and heap usage) and a simple API to facilitate integration in other applications. Source code and binary packages are provided.
The file NTox.pkg provides a unit with the signature |NTox:40Hz|, which exports two functions named parse and parseHTML. The unit exists only to make the parser easier to share between several applications on one Newton device. Alternatively, the native module NTox.ntkc can be included directly in an NTK project; this C module exports the function NTox.Parse. There are therefore two ways of calling the parser:
Using the unit:
Using the native module:
The parameters for the parse functions are a parser frame, the text and the number of bytes or characters to be parsed (this can be nil to parse the whole text). The text can be passed in as a NewtonScript string, a binary object or a byte array. A binary object and a byte array are assumed to have one-byte elements as opposed to a NewtonScript string with two-byte elements. Binary objects and strings must include a trailing zero byte unless the length parameter is given.
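The length rule described above can be sketched in C as follows. This is only an illustration, not ntox code: the function name elements_to_parse is hypothetical, a negative length stands in for a nil length parameter, and the terminator is assumed to be one all-zero element (one zero byte for one-byte input, two zero bytes for a two-byte string).

```c
#include <stddef.h>

/* Hypothetical helper, not part of ntox.
 * Returns how many elements (bytes or two-byte characters) would be parsed.
 *   elem_size: 1 for binary objects and byte arrays, 2 for NewtonScript strings
 *   length:    explicit element count, or a negative value standing in for nil */
static size_t elements_to_parse(const void *data, size_t elem_size, long length)
{
    const unsigned char *p = data;
    size_t n = 0;

    if (length >= 0)
        return (size_t)length;      /* explicit length: no terminator is needed */

    /* No length given: scan for the first all-zero element (assumed terminator). */
    for (;;) {
        int all_zero = 1;
        for (size_t i = 0; i < elem_size; i++) {
            if (p[n * elem_size + i] != 0)
                all_zero = 0;
        }
        if (all_zero)
            return n;
        n++;
    }
}
```

With an explicit length the data does not have to be terminated, which is why the trailing zero is only required when the length is nil.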
The parser frame contains the callback functions to be invoked when certain parts of an XML document have been parsed:
elementCallback: Is called for start and end tag names.
attrnameCallback: Is called for attribute names.
attrvalCallback: Is called for attribute values.
wordCallback: Is called for words in the text between tags (separated by whitespace).
wsCallback: Is called for whitespace between words.
These callback functions receive one argument containing the most recently parsed part as a NewtonScript string, e.g. a tag name, an attribute value or a word. The buffer holding this part is limited to 512 bytes, which gives a maximum of 512 characters for single-byte input streams and 256 characters for two-byte input streams. This limit can be changed by recompiling ntox. The receiver of the invocation is the parser frame.
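The character limits follow directly from the buffer size divided by the element width. A minimal sketch of that arithmetic in C, where the constant and function names are hypothetical and not taken from the ntox sources:

```c
#include <stddef.h>

/* 512 is the buffer limit stated in the text; the name is an assumption. */
enum { CALLBACK_BUFFER_BYTES = 512 };

/* Maximum number of characters a single callback argument can hold
 * for a given element width (1 = single-byte input, 2 = two-byte input). */
static size_t max_token_chars(size_t bytes_per_char)
{
    return CALLBACK_BUFFER_BYTES / bytes_per_char;
}
```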
A typical piece of code showing the usage of the parser would look like this:
As shown above, the parser frame can be reused after parsing of the given text has finished, since it contains state information about the last parsing process. This makes it possible to parse chunks of text as they are received over a socket connection and treat them as one document (the parser uses no lookahead).
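To illustrate why saved state and the absence of lookahead allow chunked parsing, here is a deliberately tiny tokenizer in C that only reports tag names. It is not the tox implementation; every name in it (chunk_state, feed, log_tag, TOKEN_MAX) is made up for this sketch. The point is that all progress lives in the state structure, so a tag split across two chunks is still reported exactly once.

```c
#include <stddef.h>
#include <string.h>

enum { TOKEN_MAX = 32 };                 /* illustrative limit, not the ntox value */

typedef void (*tag_callback)(const char *name, void *user);

/* Everything the tokenizer needs to resume lives here (hypothetical layout). */
typedef struct {
    int    in_tag;                       /* 1 while between '<' and '>' */
    size_t len;                          /* characters collected so far */
    char   name[TOKEN_MAX];              /* partial tag name carried across chunks */
} chunk_state;

/* Example callback: counts tags and remembers the most recent name. */
typedef struct {
    int  count;
    char last[TOKEN_MAX];
} tag_log;

static void log_tag(const char *name, void *user)
{
    tag_log *log = user;
    log->count++;
    strcpy(log->last, name);             /* safe: name is at most TOKEN_MAX - 1 chars */
}

/* Consume one chunk, one character at a time, without looking ahead. */
static void feed(chunk_state *st, const char *chunk, size_t n,
                 tag_callback cb, void *user)
{
    for (size_t i = 0; i < n; i++) {
        char c = chunk[i];
        if (!st->in_tag) {
            if (c == '<') {              /* start collecting a tag name */
                st->in_tag = 1;
                st->len = 0;
            }
        } else if (c == '>') {           /* tag complete: report it */
            st->name[st->len] = '\0';
            cb(st->name, user);
            st->in_tag = 0;
        } else if (st->len < TOKEN_MAX - 1) {
            st->name[st->len++] = c;     /* overlong names are silently truncated */
        }
    }
}
```

Feeding "<ti" and then "tle>Hi</title>" through the same chunk_state invokes the callback for title and /title, just as if the text had arrived in one piece.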
The parsing process for HTML is very similar to the XML parsing process. It is invoked via the parseHTML function which is called in the same way as the parse function.
ntox transforms the HTML into a more compact form suitable for the Newton. It does not parse CSS, tables or any positional elements. Fonts and font styles are used only in a very limited way. The result is a stream of text, anchor information and style data which can be displayed in simple text views.
The callbacks used by the HTML parsing process are:
elementCallback: Is called for the tags a and title.
attrnameCallback: Is called for all attribute names of any a tag.
attrvalCallback: Is called for all attribute values of any a tag.
wordCallback: Is called for words in the text between tags (including whitespace and entities).
Implementation
NTox is a wrapper for the tox XML tokenizer. It takes care of the conversion between NewtonScript objects and C structures by checking the provided parser frame for previously set state information and saving this information after a completed parsing process.
It is important to know that ntox locks the passed string or binary object on the NewtonScript heap for the duration of the parsing process (a byte array passed as the data to be parsed is not locked). This could lead to heap fragmentation if too many parsers are invoked and too much text is processed at once. Apart from this, the parser uses only a few bytes for its state information. This data resides on the C stack and is cleaned up automatically after the parser finishes. It is, however, present on the stack during the invocation of the callback functions, as is the buffer containing the most recently parsed part of the XML document.