public class Lexer
extends java.lang.Object
Given a file stream fp, the lexer returns a sequence of tokens.
GetToken(fp) gets the next token.
UngetToken(fp) provides one level of undo.
The tags include an attribute list:
- linked list of attribute/value nodes
- each node has 2 null-terminated strings
- entities are replaced in attribute values
White space is compacted if not in preformatted mode: leading white space is discarded and subsequent white space sequences are compacted to single space chars.
If XmlTags is no, then tag names are folded to upper case and attribute names to lower case.
Not yet done:
- Doctype subset and marked sections
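The white-space rule in the class description can be illustrated with a minimal sketch. This is not JTidy's code: `WhitespaceSketch` and `compact` are hypothetical names, and the sketch works on a whole String instead of the lexer's byte buffer. The `wasWhite` flag plays the role that the `waswhite` field's description suggests (collapsing contiguous white space).

```java
final class WhitespaceSketch {
    /**
     * Hypothetical helper: outside preformatted mode, drop leading white space
     * and collapse every later run of white space to a single space.
     */
    static String compact(String text, boolean preformatted) {
        if (preformatted) {
            return text; // preformatted content is kept verbatim
        }
        StringBuilder out = new StringBuilder();
        boolean wasWhite = true; // true at the start, so leading white space is discarded
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean white = c == ' ' || c == '\t' || c == '\n' || c == '\r';
            if (!white) {
                out.append(c);
                wasWhite = false;
            } else if (!wasWhite) {
                out.append(' '); // first white-space char of a run becomes one space
                wasWhite = true;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + compact("  a \n\t b", false) + "]");
        System.out.println("[" + compact("  a \n\t b", true) + "]");
    }
}
```

Only leading white space is removed in this sketch; a trailing run is still reduced to one space, which matches the wording of the description but may differ from what the real lexer does at the end of a text node.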
Modifier and Type | Class and Description |
---|---|
private static class |
Lexer.W3CVersionInfo
document type.
|
Modifier and Type | Field and Description |
---|---|
protected short |
badAccess
for accessibility errors.
|
protected short |
badChars
for bad char encodings.
|
protected boolean |
badDoctype
set if html or PUBLIC is missing.
|
protected short |
badForm
for mismatched/mispositioned form tags.
|
protected short |
badLayout
for bad style errors.
|
private static int |
CDATA_ENDTAG |
private static int |
CDATA_INTERMEDIATE |
private static int |
CDATA_STARTTAG |
protected int |
columns
at start of current token.
|
protected Configuration |
configuration
configuration.
|
protected int |
doctype
version as given by doctype (if any).
|
protected short |
errors
count of errors.
|
protected java.io.PrintWriter |
errout
error output stream.
|
protected boolean |
excludeBlocks
Netscape compatibility.
|
protected boolean |
exiled
true if moved out of table.
|
static short |
IGNORE_MARKUP
state: ignore markup.
|
static short |
IGNORE_WHITESPACE
state: ignore whitespace.
|
protected StreamIn |
in
file stream.
|
protected Node |
inode
Inline stack for compatibility with Mosaic.
|
protected int |
insert
for inferring inline tags.
|
protected boolean |
insertspace
when space is moved after end tag.
|
protected java.util.Stack |
istack
stack.
|
protected int |
istackbase
start of frame.
|
protected boolean |
isvoyager
true if xmlns attribute on html element.
|
private static short |
LEX_ASP
getToken state: asp.
|
private static short |
LEX_CDATA
getToken state: cdata.
|
private static short |
LEX_COMMENT
getToken state: comment.
|
private static short |
LEX_CONTENT
getToken state: content.
|
private static short |
LEX_DOCTYPE
getToken state: doctype.
|
private static short |
LEX_ENDTAG
getToken state: endtag.
|
private static short |
LEX_GT
getToken state: gt.
|
private static short |
LEX_JSTE
getToken state: jste.
|
private static short |
LEX_PHP
getToken state: php.
|
private static short |
LEX_PROCINSTR
getToken state: procinstr.
|
private static short |
LEX_SECTION
getToken state: section.
|
private static short |
LEX_STARTTAG
getToken state: start tag.
|
private static short |
LEX_XMLDECL
getToken state: xml declaration.
|
protected byte[] |
lexbuf
Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of
all of the elements.
|
protected int |
lexlength
allocated.
|
protected int |
lexsize
used.
|
protected int |
lines
lines seen.
|
static short |
MIXED_CONTENT
state: mixed content.
|
private java.util.List |
nodeList
node list.
|
static short |
PREFORMATTED
state: preformatted.
|
protected boolean |
pushed
true after token has been pushed back.
|
protected Report |
report
report.
|
protected Node |
root
Root node is saved here.
|
protected boolean |
seenEndBody
already seen end body tag?
|
protected boolean |
seenEndHtml
already seen end html tag?
|
protected short |
state
state of lexer's finite state machine.
|
protected Style |
styles
used for cleaning up presentation markup.
|
protected Node |
token
current node.
|
protected int |
txtend
end of current node.
|
protected int |
txtstart
start of current node.
|
protected short |
versions
bit vector of HTML versions.
|
private static java.lang.String |
VOYAGER_11
URI for XHTML 1.1.
|
private static java.lang.String |
VOYAGER_FRAMESET
URI for XHTML 1.0 frameset DTD.
|
private static java.lang.String |
VOYAGER_LOOSE
URI for XHTML 1.0 transitional DTD.
|
private static java.lang.String |
VOYAGER_STRICT
URI for XHTML 1.0 strict DTD.
|
private static Lexer.W3CVersionInfo[] |
W3CVERSION
lists all the known versions.
|
protected short |
warnings
count of warnings in this document.
|
protected boolean |
waswhite
used to collapse contiguous white space.
|
private static java.lang.String |
XHTML_NAMESPACE
xhtml namespace.
|
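The `lexbuf`, `lexlength`, `lexsize`, `txtstart` and `txtend` fields above suggest one growable byte buffer in which each token is recorded as a pair of offsets. The following is a minimal sketch of that idea, assuming a doubling growth policy; `LexBufferSketch` and its methods are hypothetical names and do not reproduce JTidy's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LexBufferSketch {
    private byte[] lexbuf = new byte[4]; // shared text buffer (deliberately tiny here)
    private int lexsize = 0;             // number of bytes in use

    /** Appends one byte, doubling the array when it is full (assumed policy). */
    void addByte(int c) {
        if (lexsize == lexbuf.length) {
            lexbuf = Arrays.copyOf(lexbuf, lexbuf.length * 2);
        }
        lexbuf[lexsize++] = (byte) c;
    }

    /** Number of bytes currently used. */
    int size() {
        return lexsize;
    }

    /** Decodes the span [start, end) of the buffer as UTF-8 text. */
    String text(int start, int end) {
        return new String(lexbuf, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        LexBufferSketch buf = new LexBufferSketch();
        int txtstart = buf.size();               // span start, before appending
        for (byte b : "hello".getBytes(StandardCharsets.UTF_8)) {
            buf.addByte(b);
        }
        int txtend = buf.size();                 // span end, after appending
        System.out.println(buf.text(txtstart, txtend));
    }
}
```

Because spans are stored as offsets, they stay valid when the array is reallocated. Code that keeps a direct reference to the old array would need to be repointed after growth, which is presumably why `updateNodeTextArrays` exists.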
Constructor and Description |
---|
Lexer(StreamIn in,
Configuration configuration,
Report report)
Instantiates a new Lexer.
|
Modifier and Type | Method and Description |
---|---|
void |
addByte(int c)
Adds a byte to lexer buffer.
|
void |
addCharToLexer(int c)
Store char c as UTF-8 encoded byte stream.
|
boolean |
addGenerator(Node root)
Add meta element for Tidy.
|
void |
addStringLiteral(java.lang.String str)
Calls addCharToLexer for each char in the string.
|
(package private) void |
addStringLiteralLen(java.lang.String str,
int len)
Calls addCharToLexer for each char in the string until len is reached.
|
void |
addStringToLexer(java.lang.String str)
Adds a string to lexer buffer.
|
short |
apparentVersion()
Return the html version used in document.
|
boolean |
canPrune(Node element)
Can the given element be removed?
|
void |
changeChar(byte c)
Substitute the last char in buffer.
|
boolean |
checkDocTypeKeyWords(Node doctype)
Check system keywords (keywords should be uppercase).
|
AttVal |
cloneAttributes(AttVal attrs)
Clones an attribute value and adds any ASP or PHP node to the node list.
|
Node |
cloneNode(Node node)
Clones a node and adds it to the node list.
|
(package private) void |
constrainVersion(int vers)
Constrains the HTML version of the document to the given one.
|
void |
deferDup()
Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated.
|
boolean |
endOfInput()
Has end of input stream been reached?
|
short |
findGivenVersion(Node doctype)
Examine DOCTYPE to identify version.
|
boolean |
fixDocType(Node root)
Fixup doctype if missing.
|
void |
fixHTMLNameSpace(Node root,
java.lang.String profile)
Fix xhtml namespace.
|
void |
fixId(Node node)
Duplicates the name attribute as an id and checks whether id and name match.
|
boolean |
fixXmlDecl(Node root)
Ensure XML document starts with <?XML version="1.0"?>.
|
Node |
getCDATA(Node container)
Create a text node for the contents of a CDATA element like style or script which ends with </foo> for some
foo.
|
Node |
getToken(short mode)
Gets a token.
|
short |
htmlVersion()
Choose what version to use for new doctype.
|
java.lang.String |
htmlVersionName()
Choose what version to use for new doctype.
|
Node |
inferredTag(java.lang.String name)
Generates and inserts a new node.
|
int |
inlineDup(Node node)
This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P,
TD, TH, DIV, PRE etc.
|
Node |
insertedToken() |
static boolean |
isCSS1Selector(java.lang.String buf)
In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they
cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a
numeric code (see next item).
|
boolean |
isPushed(Node node)
Is the node in the stack?
|
static boolean |
isValidAttrName(java.lang.String attr)
Check if attr is a valid name.
|
Node |
newLineNode()
Adds a new line node.
|
Node |
newNode()
Creates a new node and adds it to the node list.
|
Node |
newNode(short type,
byte[] textarray,
int start,
int end)
Creates a new node and adds it to the node list.
|
Node |
newNode(short type,
byte[] textarray,
int start,
int end,
java.lang.String element)
Creates a new node and adds it to the node list.
|
(package private) Node |
newXhtmlDocTypeNode(Node root)
Put DOCTYPE declaration between the <?xml version "1.0" ...
|
Node |
parseAsp()
Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to
dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to
tailor the attribute value.
|
java.lang.String |
parseAttribute(boolean[] isempty,
Node[] asp,
Node[] php)
consumes the '>' terminating start tags.
|
AttVal |
parseAttrs(boolean[] isempty)
Parse tag attributes.
|
void |
parseEntity(short mode)
Parse an html entity.
|
Node |
parsePhp()
PHP is like ASP but is based upon XML processing instructions, e.g.
|
int |
parseServerInstruction()
Invoked when < is seen in place of an attribute value, but terminates on whitespace if not ASP, PHP or Tango. This
routine recognizes ' and " quoted strings.
|
char |
parseTagName()
Parses a tag name.
|
java.lang.String |
parseValue(java.lang.String name,
boolean foldCase,
boolean[] isempty,
int[] pdelim)
Parse an attribute value.
|
void |
popInline(Node node)
Pop a copy of an inline node from the stack.
|
protected boolean |
preContent(Node node)
Is content acceptable for pre elements?
|
void |
pushInline(Node node)
Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones
generated from the istack). One issue arises with pushing inlines when the tag is already pushed.
|
boolean |
setXHTMLDocType(Node root)
Adds a new xhtml doctype to the document.
|
void |
ungetToken() |
protected void |
updateNodeTextArrays(byte[] oldtextarray,
byte[] newtextarray)
Update oldtextarray in the current nodes.
|
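The character rule quoted for `isCSS1Selector` in the summary above can be sketched as a simple scan. This is a simplified illustration under stated assumptions, not JTidy's method: `Css1SelectorSketch` and `looksLikeCss1Selector` are hypothetical names, lower-case a-z is accepted alongside A-Z, and backslash escapes and numeric codes are rejected instead of being interpreted.

```java
final class Css1SelectorSketch {
    /** Hypothetical, simplified check of the unescaped part of the rule. */
    static boolean looksLikeCss1Selector(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            boolean digit = c >= '0' && c <= '9';
            boolean highLatin = c >= 161 && c <= 255; // Unicode characters 161-255
            boolean dash = c == '-';
            if (!(letter || digit || highLatin || dash)) {
                return false; // character outside the allowed set
            }
            if (i == 0 && (digit || dash)) {
                return false; // must not start with a dash or a digit
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCss1Selector("main-nav2"));
        System.out.println(looksLikeCss1Selector("1st"));
    }
}
```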
public static final short IGNORE_WHITESPACE
public static final short MIXED_CONTENT
public static final short PREFORMATTED
public static final short IGNORE_MARKUP
private static final java.lang.String VOYAGER_LOOSE
private static final java.lang.String VOYAGER_STRICT
private static final java.lang.String VOYAGER_FRAMESET
private static final java.lang.String VOYAGER_11
private static final java.lang.String XHTML_NAMESPACE
private static final Lexer.W3CVersionInfo[] W3CVERSION
private static final short LEX_CONTENT
private static final short LEX_GT
private static final short LEX_ENDTAG
private static final short LEX_STARTTAG
private static final short LEX_COMMENT
private static final short LEX_DOCTYPE
private static final short LEX_PROCINSTR
private static final short LEX_CDATA
private static final short LEX_SECTION
private static final short LEX_ASP
private static final short LEX_JSTE
private static final short LEX_PHP
private static final short LEX_XMLDECL
protected StreamIn in
protected java.io.PrintWriter errout
protected short badAccess
protected short badLayout
protected short badChars
protected short badForm
protected short warnings
protected short errors
protected int lines
protected int columns
protected boolean waswhite
protected boolean pushed
protected boolean insertspace
protected boolean excludeBlocks
protected boolean exiled
protected boolean isvoyager
protected short versions
protected int doctype
protected boolean badDoctype
protected int txtstart
protected int txtend
protected short state
protected Node token
protected byte[] lexbuf
protected int lexlength
protected int lexsize
protected Node inode
protected int insert
protected java.util.Stack istack
protected int istackbase
protected Style styles
protected Configuration configuration
protected boolean seenEndBody
protected boolean seenEndHtml
protected Report report
protected Node root
private java.util.List nodeList
private static final int CDATA_INTERMEDIATE
private static final int CDATA_STARTTAG
private static final int CDATA_ENDTAG
public Lexer(StreamIn in, Configuration configuration, Report report)
Parameters:
in - StreamIn
configuration - configuration instance
report - report instance, for reporting errors
public Node newNode()
public Node newNode(short type, byte[] textarray, int start, int end)
Parameters:
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE |
Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG |
Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position
public Node newNode(short type, byte[] textarray, int start, int end, java.lang.String element)
Parameters:
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE |
Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG |
Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position
element - tag name
public Node cloneNode(Node node)
Parameters:
node - Node
public AttVal cloneAttributes(AttVal attrs)
Parameters:
attrs - original AttVal
protected void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)
Update oldtextarray in the current nodes.
Parameters:
oldtextarray - previous text array
newtextarray - new text array
public Node newLineNode()
public boolean endOfInput()
Returns:
true if the end of the input stream has been reached
public void addByte(int c)
Parameters:
c - byte to add
public void changeChar(byte c)
Parameters:
c - new char
public void addCharToLexer(int c)
Parameters:
c - char to store
public void addStringToLexer(java.lang.String str)
Parameters:
str - String to add
public void parseEntity(short mode)
Parameters:
mode - mode
public char parseTagName()
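`addCharToLexer` above is described as storing char c as a UTF-8 encoded byte stream. The standard encoding of one code point into 1 to 4 bytes can be sketched as follows. `Utf8Sketch` and `addChar` are hypothetical names, a `ByteArrayOutputStream` stands in for the lexer buffer, and surrogate values and out-of-range input are not checked, so this is an illustration and not JTidy's code.

```java
import java.io.ByteArrayOutputStream;

final class Utf8Sketch {
    /** Appends code point c (0..0x10FFFF, no validation) to out as UTF-8 bytes. */
    static void addChar(ByteArrayOutputStream out, int c) {
        if (c < 0x80) {                       // 1 byte: 0xxxxxxx
            out.write(c);
        } else if (c < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out.write(0xC0 | (c >> 6));
            out.write(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.write(0xE0 | (c >> 12));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        } else {                              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.write(0xF0 | (c >> 18));
            out.write(0x80 | ((c >> 12) & 0x3F));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        addChar(out, 0x20AC); // euro sign
        for (byte b : out.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF); // prints E2 82 AC
        }
        System.out.println();
    }
}
```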
public void addStringLiteral(java.lang.String str)
Parameters:
str - input String
void addStringLiteralLen(java.lang.String str, int len)
Parameters:
str - input String
len - length of the substring to be added
public short htmlVersion()
public java.lang.String htmlVersionName()
public boolean addGenerator(Node root)
Parameters:
root - root node
Returns:
true if the tag has been added
public boolean checkDocTypeKeyWords(Node doctype)
Parameters:
doctype - doctype node
public short findGivenVersion(Node doctype)
Parameters:
doctype - doctype node
public void fixHTMLNameSpace(Node root, java.lang.String profile)
Parameters:
root - root Node
profile - current profile
Node newXhtmlDocTypeNode(Node root)
Put DOCTYPE declaration between the <?xml version "1.0" ... declaration and the html tag. Should also work for any comments, etc. that may precede the html tag.
Parameters:
root - root node
public boolean setXHTMLDocType(Node root)
Parameters:
root - root node
Returns:
true if a doctype has been added
public short apparentVersion()
public boolean fixDocType(Node root)
Parameters:
root - root node
Returns:
false if current version has not been identified
public boolean fixXmlDecl(Node root)
Ensure XML document starts with <?XML version="1.0"?>. Add encoding attribute if not using ASCII or UTF-8 output.
Parameters:
root - root node
public Node inferredTag(java.lang.String name)
Parameters:
name - tag name
public Node getCDATA(Node container)
Parameters:
container - container node
public void ungetToken()
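The class description says that ungetting a token provides one level of undo, and the `pushed` field is described as true after a token has been pushed back. A minimal sketch of that one-level pushback pattern follows. `PushbackSketch` is a hypothetical class that uses plain strings as tokens; it is not the real `Lexer`, whose `getToken` takes a mode argument and returns a `Node`.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class PushbackSketch {
    private final Iterator<String> source;
    private String token;   // most recently returned token
    private boolean pushed; // true after the token has been pushed back

    PushbackSketch(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Returns the pushed-back token if there is one, else the next token, or null at the end. */
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = source.hasNext() ? source.next() : null;
        return token;
    }

    /** One level of undo: the next getToken() call returns the same token again. */
    void ungetToken() {
        pushed = true;
    }

    public static void main(String[] args) {
        PushbackSketch lexer = new PushbackSketch(Arrays.asList("<p>", "text", "</p>"));
        System.out.println(lexer.getToken());
        lexer.ungetToken();
        System.out.println(lexer.getToken()); // same token as the previous call
        System.out.println(lexer.getToken());
    }
}
```

A single boolean is enough because only one token can be pending; calling `ungetToken()` twice in a row still replays just the last token.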
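public Node getToken(short mode)
Parameters:
mode - one of the following:
MixedContent -- for elements which don't accept PCDATA
Preformatted -- white space preserved as is
IgnoreMarkup -- for CDATA elements such as script, style
public Node parseAsp()
Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to tailor the attribute value. Example of ASP in an attribute value: href='<%=rsSchool.Fields("ID").Value%>', where the ASP that generates the attribute value is masked from Tidy by the quote marks.
public Node parsePhp()
PHP is like ASP but is based upon XML processing instructions, e.g. <?php ... ?>.
public java.lang.String parseAttribute(boolean[] isempty, Node[] asp, Node[] php)
Parameters:
isempty - flag is passed as array so it can be modified
asp - asp Node, passed as array so it can be modified
php - php Node, passed as array so it can be modified
public int parseServerInstruction()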
public java.lang.String parseValue(java.lang.String name, boolean foldCase, boolean[] isempty, int[] pdelim)
Parameters:
name - attribute name
foldCase - fold case?
isempty - is attribute empty? Passed as an array reference to allow modification
pdelim - delimiter, passed as an array reference to allow modification
public static boolean isValidAttrName(java.lang.String attr)
Parameters:
attr - String to check, must be non-null
Returns:
true if attr is a valid name
public static boolean isCSS1Selector(java.lang.String buf)
Parameters:
buf - css selector name
Returns:
true if the given string is a valid css1 selector name
public AttVal parseAttrs(boolean[] isempty)
Parameters:
isempty - is tag empty?
public void pushInline(Node node)
Push a copy of an inline node onto the stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. For instance:
<p><em> text <p><em> more text
shouldn't be mapped to
<p><em> text </em></p><p><em><em> more text </em></em>
Parameters:
node - Node to be pushed
public void popInline(Node node)
Parameters:
node - Node to be popped
public boolean isPushed(Node node)
Parameters:
node - Node
Returns:
true if the node is found in the stack
public int inlineDup(Node node)
This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc. For example, <i><h1>italic heading</h1></i> is treated as equivalent to <h1><i>italic heading</i></h1>. This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from the input stream.
Parameters:
node - original node
public Node insertedToken()
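The `inlineDup` description above says the lexer is switched into a mode where tokens come from the inline stack instead of the input stream. The sketch below illustrates that replay idea only. `InlineDupSketch` is hypothetical, uses strings for tokens, and replays the whole stack from index 0; the real class also tracks `istackbase` and works on `Node` objects, which are not modelled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class InlineDupSketch {
    private final List<String> istack = new ArrayList<>(); // open inline tags, outermost first
    private final Iterator<String> input;
    private int insert = -1; // index of the next stack entry to replay, -1 when reading input

    InlineDupSketch(List<String> tokens) {
        this.input = tokens.iterator();
    }

    void pushInline(String tag) {
        istack.add(tag);
    }

    /** Switches to replay mode and returns how many inline tags will be re-emitted. */
    int inlineDup() {
        if (istack.isEmpty()) {
            return 0;
        }
        insert = 0;
        return istack.size();
    }

    /** Returns the next token, taken from the inline stack while in replay mode. */
    String getToken() {
        if (insert != -1) {
            String tag = istack.get(insert++);
            if (insert == istack.size()) {
                insert = -1; // replay finished, go back to the input stream
            }
            return tag;
        }
        return input.hasNext() ? input.next() : null;
    }

    public static void main(String[] args) {
        // Remaining input after "<i><h1>" has been read from <i><h1>italic heading</h1></i>
        InlineDupSketch lexer = new InlineDupSketch(Arrays.asList("italic heading", "</h1>"));
        lexer.pushInline("<i>");
        lexer.inlineDup();                    // entering the block: replay open inlines
        System.out.println(lexer.getToken()); // the inferred <i> inside <h1>
        System.out.println(lexer.getToken());
        System.out.println(lexer.getToken());
    }
}
```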
public boolean canPrune(Node element)
Parameters:
element - node
Returns:
true if the element can be removed
public void fixId(Node node)
Parameters:
node - Node to check for name/id attributes
public void deferDup()
void constrainVersion(int vers)
Parameters:
vers - html version code
protected boolean preContent(Node node)
Parameters:
node - content
Returns:
true if node is acceptable in pre elements