From 97d5c458cfa039d857301e1ca7d5af3beb37131d Mon Sep 17 00:00:00 2001 From: Jacob McDonnell Date: Sun, 26 Apr 2026 16:38:00 -0400 Subject: build: Better Build System --- static/plan9-4e/man2/html.2 | 1420 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1420 insertions(+) create mode 100644 static/plan9-4e/man2/html.2 (limited to 'static/plan9-4e/man2/html.2') diff --git a/static/plan9-4e/man2/html.2 b/static/plan9-4e/man2/html.2 new file mode 100644 index 00000000..ef641e41 --- /dev/null +++ b/static/plan9-4e/man2/html.2 @@ -0,0 +1,1420 @@ +.TH HTML 2 +.SH NAME +parsehtml, +printitems, +validitems, +freeitems, +freedocinfo, +dimenkind, +dimenspec, +targetid, +targetname, +fromStr, +toStr +\- HTML parser +.SH SYNOPSIS +.nf +.PP +.ft L +#include <u.h> +#include <libc.h> +#include <html.h> +.ft P +.PP +.ta \w'\fLToken* 'u +.B +Item* parsehtml(uchar* data, int datalen, Rune* src, int mtype, +.B + int chset, Docinfo** pdi) +.PP +.B +void printitems(Item* items, char* msg) +.PP +.B +int validitems(Item* items) +.PP +.B +void freeitems(Item* items) +.PP +.B +void freedocinfo(Docinfo* d) +.PP +.B +int dimenkind(Dimen d) +.PP +.B +int dimenspec(Dimen d) +.PP +.B +int targetid(Rune* s) +.PP +.B +Rune* targetname(int targid) +.PP +.B +uchar* fromStr(Rune* buf, int n, int chset) +.PP +.B +Rune* toStr(uchar* buf, int n, int chset) +.SH DESCRIPTION +.PP +This library implements a parser for HTML 4.0 documents. +The parsed HTML is converted into an intermediate representation that +describes how the formatted HTML should be laid out. +.PP +.I Parsehtml +parses an entire HTML document contained in the buffer +.I data +and having length +.IR datalen . +The URL of the document should be passed in as +.IR src . +.I Mtype +is the media type of the document, which should be either +.B TextHtml +or +.BR TextPlain . +The character set of the document is described in +.IR chset , +which can be one of +.BR US_Ascii , +.BR ISO_8859_1 , +.B UTF_8 +or +.BR Unicode .
+The return value is a linked list of +.B Item +structures, described in detail below. +As a side effect, +.BI * pdi +is set to point to a newly created +.B Docinfo +structure, containing information pertaining to the entire document. +.PP +The library expects two allocation routines to be provided by the +caller, +.B emalloc +and +.BR erealloc . +These routines are analogous to the standard malloc and realloc routines, +except that they should not return if the memory allocation fails. +In addition, +.B emalloc +is required to zero the memory. +.PP +For debugging purposes, +.I printitems +may be called to display the contents of an item list; individual items may +be printed using the +.B %I +print verb, installed on the first call to +.IR parsehtml . +.I validitems +traverses the item list, checking that all of the pointers are valid. +It returns +.B 1 +if everything is ok, and +.B 0 +if an error was found. +Normally, one would not call these routines directly. +Instead, one sets the global variable +.I dbgbuild +and the library calls them automatically. +One can also set +.IR warn , +to cause the library to print a warning whenever it finds a problem with the +input document, and +.IR dbglex , +to print debugging information in the lexer. +.PP +When an item list is finished with, it should be freed with +.IR freeitems . +Then, +.I freedocinfo +should be called on the pointer returned in +.BI * pdi\f1. +.PP +.I Dimenkind +and +.I dimenspec +are provided to interpret the +.B Dimen +type, as described in the section +.IR "Dimension Specifications" . +.PP +Frame target names are mapped to integer ids via a global, permanent mapping. +To find the value for a given name, call +.IR targetid , +which allocates a new id if the name hasn't been seen before. +The name of a given, known id may be retrieved using +.IR targetname . +The library predefines +.BR FTtop , +.BR FTself , +.B FTparent +and +.BR FTblank .
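As a rough illustration of the allocation contract described above, the following is a minimal sketch of caller-supplied emalloc and erealloc for a system with the standard C library. The size_t parameter type, the error messages, and the use of exit are assumptions made for this sketch; the library's header declares the signatures it actually expects, and a Plan 9 program would more likely report the failure with sysfatal.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch only: parameter types and failure handling are assumptions,
 * not taken from the library source.
 */

/* Allocate n bytes of zeroed memory; never returns on failure. */
void*
emalloc(size_t n)
{
	void *p;

	/* calloc zeroes the block, as emalloc is required to do */
	p = calloc(1, n ? n : 1);
	if(p == NULL){
		fprintf(stderr, "emalloc: out of memory\n");
		exit(1);
	}
	return p;
}

/* Resize p to n bytes, preserving its contents; never returns on failure. */
void*
erealloc(void *p, size_t n)
{
	p = realloc(p, n ? n : 1);
	if(p == NULL){
		fprintf(stderr, "erealloc: out of memory\n");
		exit(1);
	}
	return p;
}
```

Only emalloc is required to zero memory, so any bytes added by erealloc in this sketch are left uninitialized.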
+.PP +The library handles all text as Unicode strings (type +.BR Rune* ). +Character set conversion is provided by +.I fromStr +and +.IR toStr . +.I FromStr +takes +.I n +Unicode characters from +.I buf +and converts them to the character set described by +.IR chset . +.I ToStr +takes +.I n +bytes from +.IR buf , +interpreted as belonging to character set +.IR chset , +and converts them to a Unicode string. +Both routines null-terminate the result, and use +.B emalloc +to allocate space for it. +.SS Items +The return value of +.I parsehtml +is a linked list of variant structures, +with the generic portion described by the following definition: +.PP +.EX +.ta 6n +\w'Genattr* 'u +typedef struct Item Item; +struct Item +{ + Item* next; + int width; + int height; + int ascent; + int anchorid; + int state; + Genattr* genattr; + int tag; +}; +.EE +.PP +The field +.B next +points to the successor in the linked list of items, while +.BR width , +.BR height , +and +.B ascent +are intended for use by the caller as part of the layout process. +.BR Anchorid , +if non-zero, gives the integer id assigned by the parser to the anchor that +this item is in (see section +.IR Anchors ). +.B State +is a collection of flags and values described as follows: +.PP +.EX +.ta 6n +\w'IFindentshift = 'u +enum +{ + IFbrk = 0x80000000, + IFbrksp = 0x40000000, + IFnobrk = 0x20000000, + IFcleft = 0x10000000, + IFcright = 0x08000000, + IFwrap = 0x04000000, + IFhang = 0x02000000, + IFrjust = 0x01000000, + IFcjust = 0x00800000, + IFsmap = 0x00400000, + IFindentshift = 8, + IFindentmask = (255< +element. +.B Background +is as described in the section +.IR "Background Specifications" , +and +.B backgrounditem +is set to be an image item for the document's background image (if given as a URL), +or else nil.
+.B Text +gives the default foreground text color of the document, +.B link +the unvisited hyperlink color, +.B vlink +the visited hyperlink color, and +.B alink +the color for highlighting hyperlinks (all in 24-bit RGB format). +.B Target +is the default target frame id. +.B Chset +and +.B mediatype +are as for the +.I chset +and +.I mtype +parameters to +.IR parsehtml . +.B Scripttype +is the type of any scripts contained in the document, and is always +.BR TextJavascript . +.B Hasscripts +is set if the document contains any scripts. +Scripting is currently unsupported. +.B Refresh +is the contents of a +.B "" +tag, if any. +.B Kidinfo +is set if this document is a frameset (see section +.IR Frames ). +.B Frameid +is this document's frame id. +.PP +.B Anchors +is a list of hyperlinks contained in the document, +and +.B dests +is a list of hyperlink destinations within the page (see the following section for details). +.BR Forms , +.B tables +and +.B maps +are lists of the various forms, tables and client-side maps contained +in the document, as described in subsequent sections. +.B Images +is a list of all the image items in the document. +.SS Anchors +.PP +The library builds two lists for all of the +.B <a> +elements (anchors) in a document. +Each anchor is assigned a unique anchor id within the document. +For anchors which are hyperlinks (the +.B href +attribute was supplied), the following structure is defined: +.PP +.EX +.ta 6n +\w'Anchor* 'u +typedef struct Anchor Anchor; +struct Anchor +{ + Anchor* next; + int index; + Rune* name; + Rune* href; + int target; +}; +.EE +.PP +.B Next +points to the next anchor in the list (the head of this list is +.BR Docinfo.anchors ). +.B Index +is the anchor id; each item within this hyperlink is tagged with this value +in its +.B anchorid +field. +.B Name +and +.B href +are the values of the correspondingly named attributes of the anchor +(in particular, href is the URL to go to).
+.B Target +is the value of the target attribute (if provided) converted to a frame id. +.PP +Destinations within the document (anchors with the name attribute set) +are held in the +.B Docinfo.dests +list, using the following structure: +.PP +.EX +.ta 6n +\w'DestAnchor* 'u +typedef struct DestAnchor DestAnchor; +struct DestAnchor +{ + DestAnchor* next; + int index; + Rune* name; + Item* item; +}; +.EE +.PP +.B Next +is the next element of the list, +.B index +is the anchor id, +.B name +is the value of the name attribute, and +.B item +points to the item within the parsed document that should be considered +to be the destination. +.SS Forms +.PP +Any forms within a document are kept in a list, headed by +.BR Docinfo.forms . +The elements of this list are as follows: +.PP +.EX +.ta 6n +\w'Formfield* 'u +typedef struct Form Form; +struct Form +{ + Form* next; + int formid; + Rune* name; + Rune* action; + int target; + int method; + int nfields; + Formfield* fields; +}; +.EE +.PP +.B Next +points to the next form in the list. +.B Formid +is a serial number for the form within the document. +.B Name +is the value of the form's name or id attribute. +.B Action +is the value of any action attribute. +.B Target +is the value of the target attribute (if any) converted to a frame target id. +.B Method +is one of +.B HGet +or +.BR HPost . +.B Nfields +is the number of fields in the form, and +.B fields +is a linked list of the actual fields. +.PP +The individual fields in a form are described by the following structure: +.PP +.EX +.ta 6n +\w'Formfield* 'u +typedef struct Formfield Formfield; +struct Formfield +{ + Formfield* next; + int ftype; + int fieldid; + Form* form; + Rune* name; + Rune* value; + int size; + int maxlength; + int rows; + int cols; + uchar flags; + Option* options; + Item* image; + int ctlid; + SEvent* events; +}; +.EE +.PP +Here, +.B next +points to the next field in the list.
+.B Ftype +is the type of the field, which can be one of +.BR Ftext , +.BR Fpassword , +.BR Fcheckbox , +.BR Fradio , +.BR Fsubmit , +.BR Fhidden , +.BR Fimage , +.BR Freset , +.BR Ffile , +.BR Fbutton , +.B Fselect +or +.BR Ftextarea . +.B Fieldid +is a serial number for the field within the form. +.B Form +points back to the form containing this field. +.BR Name , +.BR value , +.BR size , +.BR maxlength , +.B rows +and +.B cols +each contain the values of corresponding attributes of the field, if present. +.B Flags +contains per-field flags, of which +.B FFchecked +and +.B FFmultiple +are defined. +.B Image +is only used for fields of type +.BR Fimage ; +it points to an image item containing the image to be displayed. +.B Ctlid +is reserved for use by the caller, typically to store a unique id +of an associated control used to implement the field. +.B Events +is the same as the corresponding field of the generic attributes +associated with the item containing this field. +.B Options +is only used by fields of type +.BR Fselect ; +it consists of a list of possible options that may be selected for that +field, using the following structure: +.PP +.EX +.ta 6n +\w'Option* 'u +typedef struct Option Option; +struct Option +{ + Option* next; + int selected; + Rune* value; + Rune* display; +}; +.EE +.PP +.B Next +points to the next element of the list. +.B Selected +is set if this option is to be displayed initially. +.B Value +is the value to send when the form is submitted if this option is selected. +.B Display +is the string to display on the screen for this option. +.SS Tables +.PP +The library builds a list of all the tables in the document, +headed by +.BR Docinfo.tables . 
+Each element of this list has the following format: +.PP +.EX +.ta 6n +\w'Tablecell*** 'u +typedef struct Table Table; +struct Table +{ + Table* next; + int tableid; + Tablerow* rows; + int nrow; + Tablecol* cols; + int ncol; + Tablecell* cells; + int ncell; + Tablecell*** grid; + Align align; + Dimen width; + int border; + int cellspacing; + int cellpadding; + Background background; + Item* caption; + uchar caption_place; + Lay* caption_lay; + int totw; + int toth; + int caph; + int availw; + Token* tabletok; + uchar flags; +}; +.EE +.PP +.B Next +points to the next element in the list of tables. +.B Tableid +is a serial number for the table within the document. +.B Rows +is an array of row specifications (described below) and +.B nrow +is the number of elements in this array. +Similarly, +.B cols +is an array of column specifications, and +.B ncol +the size of this array. +.B Cells +is a list of all cells within the table (structure described below) +and +.B ncell +is the number of elements in this list. +Note that a cell may span multiple rows and/or columns, thus +.B ncell +may be smaller than +.BR nrow*ncol . +.B Grid +is a two-dimensional array of cells within the table; the cell +at row +.B i +and column +.B j +is +.BR Table.grid[i][j] . +A cell that spans multiple rows and/or columns will +be referenced by +.B grid +multiple times, however it will only occur once in +.BR cells . +.B Align +gives the alignment specification for the entire table, +and +.B width +gives the requested width as a dimension specification. +.BR Border , +.B cellspacing +and +.B cellpadding +give the values of the corresponding attributes for the table, +and +.B background +gives the requested background for the table. +.B Caption +is a linked list of items to be displayed as the caption of the +table, either above or below depending on whether +.B caption_place +is +.B ALtop +or +.BR ALbottom . 
+Most of the remaining fields are reserved for use by the caller, +except +.BR tabletok , +which is reserved for internal use. +The type +.B Lay +is not defined by the library; the caller can provide its +own definition. +.PP +The +.B Tablecol +structure is defined for use by the caller. +The library ensures that the correct number of these +is allocated, but leaves them blank. +The fields are as follows: +.PP +.EX +.ta 6n +\w'Point 'u +typedef struct Tablecol Tablecol; +struct Tablecol +{ + int width; + Align align; + Point pos; +}; +.EE +.PP +The rows in the table are specified as follows: +.PP +.EX +.ta 6n +\w'Background 'u +typedef struct Tablerow Tablerow; +struct Tablerow +{ + Tablerow* next; + Tablecell* cells; + int height; + int ascent; + Align align; + Background background; + Point pos; + uchar flags; +}; +.EE +.PP +.B Next +is only used during parsing; it should be ignored by the caller. +.B Cells +provides a list of all the cells in a row, linked through their +.B nextinrow +fields (see below). +.BR Height , +.B ascent +and +.B pos +are reserved for use by the caller. +.B Align +is the alignment specification for the row, and +.B background +is the background to use, if specified. +.B Flags +is used by the parser; ignore this field. +.PP +The individual cells of the table are described as follows: +.PP +.EX +.ta 6n +\w'Background 'u +typedef struct Tablecell Tablecell; +struct Tablecell +{ + Tablecell* next; + Tablecell* nextinrow; + int cellid; + Item* content; + Lay* lay; + int rowspan; + int colspan; + Align align; + uchar flags; + Dimen wspec; + int hspec; + Background background; + int minw; + int maxw; + int ascent; + int row; + int col; + Point pos; +}; +.EE +.PP +.B Next +is used to link together the list of all cells within a table +.RB ( Table.cells ), +whereas +.B nextinrow +is used to link together all the cells within a single row +.RB ( Tablerow.cells ). +.B Cellid +provides a serial number for the cell within the table. 
+.B Content +is a linked list of the items to be laid out within the cell. +.B Lay +is reserved for the user to describe how these items have +been laid out. +.B Rowspan +and +.B colspan +are the number of rows and columns spanned by this cell, +respectively. +.B Align +is the alignment specification for the cell. +.B Flags +is some combination of +.BR TFparsing , +.B TFnowrap +and +.B TFisth +or'd together. +Here +.B TFparsing +is used internally by the parser, and should be ignored. +.B TFnowrap +means that the contents of the cell should not be +wrapped if they don't fit the available width; +rather, the table should be expanded if need be +(this is set when the nowrap attribute is supplied). +.B TFisth +means that the cell was created by the +.B <th> +element (rather than the +.B <td> +element), +indicating that it is a header cell rather than a data cell. +.B Wspec +provides a suggested width as a dimension specification, +and +.B hspec +provides a suggested height in pixels. +.B Background +gives a background specification for the individual cell. +.BR Minw , +.BR maxw , +.B ascent +and +.B pos +are reserved for use by the caller during layout. +.B Row +and +.B col +give the indices of the row and column of the top left-hand +corner of the cell within the table grid.
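The relationship between the cells list and the grid, as described in this section, can be sketched with simplified stand-in structures. MiniCell and MiniTable below are hypothetical types that copy only the relevant field names from Tablecell and Table, and countlist and countgrid are illustrative helpers, not library routines. The check for a nil grid entry is a defensive assumption, since the text does not say whether empty grid slots can occur.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for Tablecell. */
typedef struct MiniCell MiniCell;
struct MiniCell
{
	MiniCell*	next;		/* next cell in the table-wide list */
	int	rowspan;
	int	colspan;
	int	row;		/* row of the top left-hand corner */
	int	col;		/* column of the top left-hand corner */
};

/* Hypothetical, reduced stand-in for Table. */
typedef struct MiniTable MiniTable;
struct MiniTable
{
	int	nrow;
	int	ncol;
	MiniCell*	cells;
	int	ncell;
	MiniCell***	grid;		/* grid[i][j] for i < nrow, j < ncol */
};

/* Count the cells by walking the cells list; each cell occurs once. */
int
countlist(MiniTable *t)
{
	MiniCell *c;
	int n;

	n = 0;
	for(c = t->cells; c != NULL; c = c->next)
		n++;
	return n;
}

/*
 * Count the cells by scanning the grid.  A spanning cell is referenced
 * from several grid positions, so count it only at its top left corner.
 */
int
countgrid(MiniTable *t)
{
	MiniCell *c;
	int i, j, n;

	n = 0;
	for(i = 0; i < t->nrow; i++)
		for(j = 0; j < t->ncol; j++){
			c = t->grid[i][j];
			if(c != NULL && c->row == i && c->col == j)
				n++;
		}
	return n;
}
```

For a well-formed table both counts should agree with ncell, which is why ncell can be smaller than nrow*ncol when cells span rows or columns.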
+.SS Client-side Maps +.PP +The library builds a list of client-side maps, headed by +.BR Docinfo.maps , +and having the following structure: +.PP +.EX +.ta 6n +\w'Rune* 'u +typedef struct Map Map; +struct Map +{ + Map* next; + Rune* name; + Area* areas; +}; +.EE +.PP +.B Next +points to the next element in the list, +.B name +is the name of the map (used to bind it to an image), and +.B areas +is a list of the areas within the image that comprise the map, +using the following structure: +.PP +.EX +.ta 6n +\w'Dimen* 'u +typedef struct Area Area; +struct Area +{ + Area* next; + int shape; + Rune* href; + int target; + Dimen* coords; + int ncoords; +}; +.EE +.PP +.B Next +points to the next element in the map's list of areas. +.B Shape +describes the shape of the area, and is one of +.BR SHrect , +.B SHcircle +or +.BR SHpoly . +.B Href +is the URL associated with this area in its role as +a hypertext link, and +.B target +is the target frame it should be loaded in. +.B Coords +is an array of coordinates for the shape, and +.B ncoords +is the size of this array (number of elements). +.SS Frames +.PP +If the +.B Docinfo.kidinfo +field is set, the document is a frameset. +In this case, it is typical for +.I parsehtml +to return nil, as a document which is a frameset should have no actual +items that need to be laid out (such will appear only in subsidiary documents). +It is possible that items will be returned by a malformed document; the caller +should check for this and free any such items. +.PP +The +.B Kidinfo +structure itself reflects the fact that framesets can be nested within a document.
+It is defined as follows: +.PP +.EX +.ta 6n +\w'Kidinfo* 'u +typedef struct Kidinfo Kidinfo; +struct Kidinfo +{ + Kidinfo* next; + int isframeset; + + // fields for "frame" + Rune* src; + Rune* name; + int marginw; + int marginh; + int framebd; + int flags; + + // fields for "frameset" + Dimen* rows; + int nrows; + Dimen* cols; + int ncols; + Kidinfo* kidinfos; + Kidinfo* nextframeset; +}; +.EE +.PP +.B Next +is only used if this structure is part of a containing frameset; it points to the next +element in the list of children of that frameset. +.B Isframeset +is set when this structure represents a frameset; if clear, it is an individual frame. +.PP +Some fields are used only for framesets. +.B Rows +is an array of dimension specifications for rows in the frameset, and +.B nrows +is the length of this array. +.B Cols +is the corresponding array for columns, of length +.BR ncols . +.B Kidinfos +points to a list of components contained within this frameset, each +of which may be a frameset or a frame. +.B Nextframeset +is only used during parsing, and should be ignored. +.PP +The remaining fields are used if the structure describes a frame, not a frameset. +.B Src +provides the URL for the document that should be initially loaded into this frame. +Note that this may be a relative URL, in which case it should be interpreted +using the containing document's URL as the base. +.B Name +gives the name of the frame, typically supplied via a name attribute in the HTML. +If no name was given, the library allocates one. +.BR Marginw , +.B marginh +and +.B framebd +are the values of the marginwidth, marginheight and frameborder attributes, respectively.
+.B Flags +can contain some combination of the following: +.B FRnoresize +(the frame had the noresize attribute set, and the user should not be allowed to resize it), +.B FRnoscroll +(the frame should not have any scroll bars), +.B FRhscroll +(the frame should have a horizontal scroll bar), +.B FRvscroll +(the frame should have a vertical scroll bar), +.B FRhscrollauto +(the frame should be automatically given a horizontal scroll bar if its contents +would not otherwise fit), and +.B FRvscrollauto +(the frame gets a vertical scrollbar only if required). +.SH SOURCE +.B /sys/src/libhtml +.SH SEE ALSO +.IR fmt (1) +.PP +W3C World Wide Web Consortium, +``HTML 4.01 Specification''. +.SH BUGS +The entire HTML document must be loaded into memory before +any of it can be parsed. -- cgit v1.2.3