insertion-modes.csv represents the relative aggregate distribution of
invocations of the "in head", "in body", and "in table" insertion
modes, as seen when parsing a nominal ten billion documents. The data
was collected by Google in July 2007, using a parser based on r967 of
the HTML5 specification.
The file is comma-delimited, with three columns. The first column is
the insertion mode block being counted. The second column is the
actual insertion mode in effect when the block was invoked. The last
column is the relative frequency.
So, for example:

   in head,in caption,456020

...means that over ten billion documents, on average, the user agent
will be required to "process the token as if the insertion mode was
'in head'" while the insertion mode was really "in caption" a total of
about 456020 times. This could happen, for example, in the following
...as the start tag will be treated according to the "Anything
else" entry of the "in caption" section, which defers to the "in body"
section, which then defers to the "in head" section.
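
As a rough illustration of how a row can be read, the following Python
sketch parses lines in the three-column layout described above and
converts a count into an average number of invocations per document.
It assumes the file has no header row and that the third column is
always an integer; the names Row, read_rows, and per_document are
made up for this example and are not part of the data set.

```python
import csv
import io
from typing import NamedTuple

# The counts are expressed relative to a nominal ten billion documents.
NOMINAL_DOCUMENTS = 10_000_000_000


class Row(NamedTuple):
    block: str        # column 1: insertion mode block being counted
    actual_mode: str  # column 2: insertion mode in effect at the time
    count: int        # column 3: relative frequency


def read_rows(lines):
    """Parse header-less, three-column CSV lines into Row records.

    Assumption: every non-blank line has exactly three fields and the
    last one is an integer.
    """
    rows = []
    for fields in csv.reader(lines):
        if not fields:  # skip blank lines
            continue
        block, mode, count = fields
        rows.append(Row(block.strip(), mode.strip(), int(count)))
    return rows


def per_document(row):
    """Average invocations per document for one row."""
    return row.count / NOMINAL_DOCUMENTS


# The example row from the text above, read from an in-memory buffer.
sample = io.StringIO("in head,in caption,456020\n")
for row in read_rows(sample):
    print(f'"{row.block}" block while in "{row.actual_mode}": '
          f"{per_document(row):.3g} invocations per document")
```

To read the real file, pass an open file object instead of the
in-memory sample, for example
open("insertion-modes.csv", newline="").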
The numbers are approximations. When the parser specification requires
that a token be "reprocessed", it was usually counted as a second
invocation, but not always, due to some optimisations in the
parser. Similarly, fake start and end tags generated for parts of the
specification that require a user agent to "act as if" a token had
been seen count as separate invocations, except in certain cases where
the parser had those operations optimised away.
The sample data consisted of several billion documents from Google's
index. An attempt was made to eliminate duplicates from the
sample. Documents that did not appear, a priori, to be HTML documents
(e.g. PDF documents, Flash animations, Word documents) were not
included in the sample. Documents which, when parsed, resulted in DOMs
with more than 65535 elements were only parsed up to the creation of
the 65535th element.
The parser was run with scripting enabled but no scripts supported, so