htmlparse(n) | HTML Parser | htmlparse(n) |
htmlparse - Procedures to parse HTML strings
package require Tcl 8.2
package require struct::stack 1.3
package require cmdline 1.1
package require htmlparse ?1.2?
::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html
::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag
::htmlparse::mapEscapes html
::htmlparse::2tree html tree
::htmlparse::removeVisualFluff tree
::htmlparse::removeFormDefs tree
The htmlparse package provides commands that allow libraries and applications to parse HTML in a string into a representation of their choice.
The following commands are available:
All information beyond the HTML string itself is specified via options; these are explained below.
To help in understanding the options, here is some background information about the parser.
It is capable of detecting incomplete tags in the HTML string given to it. Under normal circumstances this will cause the parser to throw an error, but if the option -incvar is used to specify a global (or namespace) variable, the parser will store the incomplete part of the input into this variable instead. This will aid greatly in the handling of incrementally arriving HTML, as the parser will handle whatever it can and defer the handling of the incomplete part until more data has arrived.
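The behaviour described above can be sketched as follows. This is a minimal illustration, assuming the tcllib htmlparse package is installed; the callback name collect, the variables seen and pending, and the two-chunk input are made up for this example.

```tcl
package require htmlparse

# Hypothetical callback: record each tag, prefixed by its slash (if any).
set seen {}
proc collect {tag slash param text} {
    lappend ::seen $slash$tag
}

# Variable that carries the incomplete tail of the input between calls.
set pending ""

# The first chunk ends in the middle of the <b> tag. Without -incvar
# the parser would throw an error for it.
foreach chunk [list "<p>Hello <b" ">world</b></p>"] {
    ::htmlparse::parse -cmd collect -incvar ::pending $chunk
}

puts $seen
```

Each call here runs in normal mode, so the callback should also see the virtual root tag around every chunk, in addition to the tags of the HTML itself.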
Another feature of the parser is its two possible modes of operation. The normal mode is active if the option -queue is not present in the command invoking the parser. If it is present, the parser will go into the incremental mode instead.
The main difference is that a parser in normal mode will immediately invoke the command prefix for each tag it encounters. In incremental mode, however, the parser will generate a number of scripts which invoke the command prefix for groups of tags in the HTML string and then store these scripts in the specified queue. It is then the responsibility of the caller of the parser to ensure the execution of the scripts in the queue.
Note: The queue object given to the parser has to provide the same interface as the queue defined in tcllib -> struct. This means, for example, that all queues created via that tcllib module can be immediately used here. Still, the queue doesn't have to come from tcllib -> struct as long as the same interface is provided.
In both modes the parser will return an empty string to the caller.
The -split option may be given to a parser in incremental mode to specify the size of the groups it creates. In other words, -split 5 means that each of the generated scripts will invoke the command prefix for 5 consecutive tags in the HTML string. A parser in normal mode will ignore this option and its value.
The option -vroot specifies a virtual root tag. A parser in normal mode will invoke the command prefix for it immediately before and after it processes the tags in the HTML, thus simulating that the HTML string is enclosed in a <vroot> </vroot> combination. In incremental mode, however, the parser is unable to provide the closing virtual root, as it never knows when the input is complete. In this case the first script generated by each invocation of the parser will contain an invocation of the command prefix for the virtual root as its first command.
The following options are available:
In incremental mode, however, the generated scripts will invoke the command prefix with five arguments appended. The last four of these are the same as in normal mode: tag, slash, param, and textBehindTheTag. The first is a placeholder string (@win@) for a clientdata value to be supplied later, during the actual execution of the generated scripts. This could be a Tk window path, for example. This allows the user of this package to preprocess HTML strings without committing them to a specific window, object, or other target during parsing. This connection can be made later. It also means that it is possible to cache preprocessed HTML. Of course, nothing prevents the user of the parser from replacing the placeholder with an empty string.
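A minimal sketch of incremental mode, assuming the queue comes from the tcllib struct::queue package. The callback name render, the variable log, and the clientdata value .viewer are made up for this example, and string map is only one possible way to substitute the placeholder.

```tcl
package require struct::queue
package require htmlparse

# Hypothetical callback for incremental mode: the first argument is the
# clientdata that takes the place of the @win@ placeholder.
set log {}
proc render {win tag slash param text} {
    lappend ::log [list $win $slash$tag $text]
}

set q [::struct::queue]

# The parser only stores scripts in the queue. With -split 2 each
# script is meant to cover two consecutive tags.
::htmlparse::parse -cmd render -split 2 -queue $q \
    {<h1>Title</h1><p>Some text</p>}

# Later: bind the stored scripts to a concrete target and run them.
while {[$q size] > 0} {
    set script [string map [list @win@ .viewer] [$q get]]
    uplevel #0 $script
}

puts [join $log \n]
```

Because the placeholder is substituted only when a script is executed, the same queued scripts could be cached and later bound to a different target.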
The first argument, clientdata, is optional and present only if this command is invoked by a parser in incremental mode. It contains whatever the user of this package wishes.
The second argument, tag, contains the name of the tag which is currently processed by the parser.
The third argument, slash, is either empty or contains a slash character. It allows the callback to distinguish between opening (slash is empty) and closing tags (slash contains a slash character).
The fourth argument, param, contains the un-interpreted list of parameters to the tag.
The fifth and last argument, textBehindTheTag, contains the text found by the parser behind the tag named in tag.
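The argument list above can be illustrated with a custom callback for normal mode, which receives the four arguments without clientdata. This is a sketch only; the callback name onTag, the variable events, the virtual root name doc, and the sample HTML are made up.

```tcl
package require htmlparse

# Hypothetical callback: classify each invocation as an opening or a
# closing tag and record all four arguments.
set events {}
proc onTag {tag slash param text} {
    if {[string equal $slash "/"]} {
        set kind close
    } else {
        set kind open
    }
    lappend ::events [list $kind $tag $param $text]
}

::htmlparse::parse -cmd onTag -vroot doc {<a href="x.html">link</a> tail}

foreach event $events {
    puts $event
}
```

Note that param arrives as the raw string found inside the tag; splitting it into attribute names and values is left to the callback.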
The internal callback does some basic checking of HTML validity and tries to recover from the most basic errors. The command returns the contents of its second argument. Side effects are the creation and manipulation of a tree object.
Each node in the generated tree represents one tag in the input. The name of the tag is stored in the attribute type of the node. Any HTML attributes coming with the tag are stored unmodified in the attribute data of the node. In other words, the command does not parse HTML attributes into their names and values.
If a tag contains text, its node will have children of type PCDATA containing this text. The text will be stored in the attribute data of these children.
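The node attributes described above can be inspected with a small recursive dump. This sketch assumes a tree object from tcllib struct::tree with the version 2 interface (get node key, keyexists node key); the helper dumpTree and the sample HTML are made up for illustration.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper: return one line per node, indented by depth,
# showing the attributes type and data where they exist.
proc dumpTree {t node depth} {
    set line [string repeat "  " $depth]
    # Defensive lookups: not every node is assumed to carry both keys.
    foreach key {type data} {
        if {[$t keyexists $node $key]} {
            append line "$key={[$t get $node $key]} "
        }
    }
    set lines [list [string trimright $line]]
    foreach child [$t children $node] {
        foreach sub [dumpTree $t $child [expr {$depth + 1}]] {
            lappend lines $sub
        }
    }
    return $lines
}

# 2tree expects an existing, empty tree.
set t [::struct::tree]
::htmlparse::2tree {<div class="x">Hello <b>world</b></div>} $t

set lines [dumpTree $t [$t rootname] 0]
puts [join $lines \n]
$t destroy
```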
This document, and the package it describes, will undoubtedly contain bugs and other problems. Please report such in the category htmlparse of the Tcllib SF Trackers [http://sourceforge.net/tracker/?group_id=12883]. Please also report any ideas for enhancements you may have for either package and/or documentation.
struct::tree
html, parsing, queue, tree
Text processing
1.2 | htmlparse |