DOCUMENTATION

tokens.csv represents the relative aggregate distribution of tokens as seen when parsing a nominal ten billion documents. The data was collected by Google in July 2007, using a parser based on r967 of the HTML5 specification.

The file is comma-delimited, with five columns. The first column is the phase, and the second column is the insertion mode, if the phase was the main phase. The third column is the token. The token names are self-explanatory, except that successive characters are combined into one token. Parse errors are not treated as tokens and do not figure in this data. For start tags and end tags, the fourth column holds the tag name; the tag name "other" is used for elements that are in the phrasing category. The last column is the relative frequency of that token.

Tokens were counted only when they were emitted from the tokeniser. Tokens that are "reprocessed", and fake tokens that are emitted when the tree construction code is required to "act as if" a particular token had been seen, were not counted. The parser was run with scripting enabled but no scripts supported, so