Class Clean
- java.lang.Object
-
- org.w3c.tidy.Clean
-
public class Clean extends java.lang.Object
Clean up misuse of presentation markup. Filters from other formats such as Microsoft Word often make excessive use of presentation markup such as font tags, B, I, and the align attribute. By applying a set of production rules, it is straightforward to transform this to use CSS.
Some rules replace some of the children of an element by style properties on the element, e.g....
....
Such rules are applied to the element's content and then to the element itself until no more rules apply. Once all the rules have been applied to an element, it will have a style attribute with one or more properties.
Other rules strip the element they apply to, replacing it by style properties on the contents, e.g....
...
These rules are applied to an element before processing its content and replace the current element by the first element in the exposed content.
After applying both sets of rules, you can replace the style attribute by a class value and style rule in the document head. To support this, an association of styles and class names is built. A naive approach is to rely on string matching to test when two property lists are the same. A better approach would be to first sort the properties before matching.
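The remark about sorting properties before matching can be illustrated with a small sketch. This is not JTidy code: StyleNormalizer and its methods are hypothetical names, and the parsing is deliberately simplified (it assumes no semicolons inside property values).

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper, not part of JTidy: brings a style string into a
// canonical form so that two property lists can be compared as strings.
class StyleNormalizer {

    // Splits on ';', trims each declaration, normalizes the spacing around
    // the first ':', drops empty pieces, and sorts the declarations.
    static String normalize(String style) {
        return Arrays.stream(style.split(";"))
                .map(decl -> decl.trim().replaceFirst("\\s*:\\s*", ": "))
                .filter(decl -> !decl.isEmpty())
                .sorted()
                .collect(Collectors.joining("; "));
    }

    // Two styles match when their normalized forms are identical.
    static boolean sameStyle(String a, String b) {
        return normalize(a).equals(normalize(b));
    }
}
```

With this normalization, "color:red;font-weight:bold;" and "font-weight: bold; color: red" compare as equal, which plain string matching would miss.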
- Version:
- $Revision: 1125 $ ($Author: aditsu $)
-
-
Method Summary
Modifier and Type, Method - Description
private void addAlign(Node node, java.lang.String align) - Adds an align style.
private void addColorRule(Lexer lexer, java.lang.String selector, java.lang.String color) - Adds a css rule for color.
private void addFontColor(Node node, java.lang.String color) - Adds a font color style.
private void addFontFace(Node node, java.lang.String face) - Adds a font-family style.
private void addFontSize(Node node, java.lang.String size) - Adds a font size style.
private void addFontStyles(Node node, AttVal av) - Add style properties to node corresponding to the font face, size and color attributes.
private java.lang.String addProperty(java.lang.String style, java.lang.String property) - Creates a string with merged properties.
private void addStyleProperty(Node node, java.lang.String property) - Add style property to element, creating style attribute as needed and adding ; delimiter.
private boolean blockStyle(Lexer lexer, Node node) - Symptom: the only child of a block-level element is a presentation element such as B, I or FONT.
void bQ2Div(Node node) - Replace implicit blockquote by div with an indent taking care to reduce nested blockquotes to a single div with the indent set to match the nesting depth.
(package private) static void bumpObject(Lexer lexer, Node html) - Where appropriate move object elements from head to body.
private boolean center2Div(Lexer lexer, Node node, Node[] pnode) - Symptom: <center>.
private void cleanBodyAttrs(Lexer lexer, Node body) - Move presentation attribs from body to style element.
private Node cleanNode(Lexer lexer, Node node) - Applies all matching rules to a node.
void cleanTree(Lexer lexer, Node doc) - Clean an html tree.
void cleanWord2000(Lexer lexer, Node node) - This is a major clean up to strip out all the extra stuff you get when you save as web page from Word 2000.
private StyleProp createProps(StyleProp prop, java.lang.String style) - Create sorted linked list of properties from style string.
private java.lang.String createPropString(StyleProp props) - Create a css property.
private void createStyleElement(Lexer lexer, Node doc) - Create style element using rules from dictionary.
private Node createStyleProperties(Lexer lexer, Node node, Node[] prepl) - Special case: if the current node is destroyed by CleanNode() lower in the tree, this node and its parent no longer exist.
private void defineStyleRules(Lexer lexer, Node node) - Find style attribute in node content, and replace it by corresponding class attribute.
private boolean dir2Div(Lexer lexer, Node node) - Symptom: <dir><li> where <li> is only child.
private void discardContainer(Node element, Node[] pnode) - Used to strip font start and end tags.
void dropSections(Lexer lexer, Node node) - Drop if/endif sections inserted by word2000.
void emFromI(Node node) - Replace i by em and b by strong.
(package private) Node findEnclosingCell(Node node) - Find the enclosing table cell for the given node.
private java.lang.String findStyle(Lexer lexer, java.lang.String tag, java.lang.String properties) - Finds a css style.
private void fixNodeLinks(Node node) - Ensure bidirectional links are consistent.
private boolean font2Span(Lexer lexer, Node node, Node[] pnode) - Replace font elements by span elements, deleting the font element's attributes and replacing them by a single style attribute.
private java.lang.String fontSize2Name(java.lang.String size) - Map a % font size to a named font size.
private java.lang.String gensymClass(Lexer lexer) - Generates a new css class name.
private boolean inlineStyle(Lexer lexer, Node node, Node[] pnode) - If the node has only one b, i, or font child remove the child node and add the appropriate style attributes to parent.
private StyleProp insertProperty(StyleProp props, java.lang.String name, java.lang.String value) - Insert a css style property.
boolean isWord2000(Node root) - Check if the current document is a converted Word document.
void list2BQ(Node node) - Some people use dir or ul without an li to indent the content.
private void mergeClasses(Node node, Node child) - Merge class attributes from 2 nodes.
private boolean mergeDivs(Lexer lexer, Node node) - Symptom: <div><div>...</div></div> Action: merge the two divs.
private java.lang.String mergeProperties(java.lang.String s1, java.lang.String s2) - Create new string that consists of the combined style properties in s1 and s2.
private void mergeStyles(Node node, Node child) - Merge style from 2 nodes.
void nestedEmphasis(Node node) - simplifies ...
private boolean nestedList(Lexer lexer, Node node, Node[] pnode) - Symptom: ...
private boolean niceBody(Lexer lexer, Node doc) - Check deprecated attributes in body tag.
(package private) boolean noMargins(Node node) - Used to hunt for hidden preformatted sections.
private void normalizeSpaces(Lexer lexer, Node node) - Map non-breaking spaces to regular spaces.
Node pruneSection(Lexer lexer, Node node) - node is <![if ...]>, prune up to <![endif]>.
void purgeWord2000Attributes(Node node) - Remove word2000 attributes from node.
(package private) boolean singleSpace(Lexer lexer, Node node) - Does element have a single space as its content?
private void stripOnlyChild(Node node) - Used to strip child of node when the node has one and only one child.
Node stripSpan(Lexer lexer, Node span) - Word2000 uses span excessively, so we strip span out.
private void style2Rule(Lexer lexer, Node node) - Find style attribute in node, and replace it by corresponding class attribute.
private void tableBgColor(Node node)
private void textAlign(Lexer lexer, Node node) - Symptom: <p align=center>.
-
-
-
Field Detail
-
classNum
private int classNum
sequential number for generated css classes.
-
tt
private TagTable tt
Tag table.
-
-
Constructor Detail
-
Clean
public Clean(TagTable tagTable)
Instantiates a new Clean.
- Parameters:
tagTable - tag table instance
-
-
Method Detail
-
insertProperty
private StyleProp insertProperty(StyleProp props, java.lang.String name, java.lang.String value)
Insert a css style property.
- Parameters:
props - StyleProp instance
name - property name
value - property value
- Returns:
StyleProp containing the given property
-
createProps
private StyleProp createProps(StyleProp prop, java.lang.String style)
Create sorted linked list of properties from style string.
- Parameters:
prop - StyleProp
style - style string
- Returns:
StyleProp with given style
-
createPropString
private java.lang.String createPropString(StyleProp props)
Create a css property.
- Parameters:
props - StyleProp
- Returns:
css property as String
-
addProperty
private java.lang.String addProperty(java.lang.String style, java.lang.String property)
Creates a string with merged properties.
- Parameters:
style - css style
property - css properties
- Returns:
merged string
-
gensymClass
private java.lang.String gensymClass(Lexer lexer)
Generates a new css class name.
- Parameters:
lexer - Lexer
- Returns:
generated css class
-
findStyle
private java.lang.String findStyle(Lexer lexer, java.lang.String tag, java.lang.String properties)
Finds a css style.
- Parameters:
lexer - Lexer
tag - tag name
properties - css properties
- Returns:
style string
-
style2Rule
private void style2Rule(Lexer lexer, Node node)
Find style attribute in node, and replace it by corresponding class attribute. Searches for the class in the style dictionary; if none is found, gensyms a new class and adds it to the dictionary. Assumes that node doesn't have a class attribute.
- Parameters:
lexer - Lexer
node - node with a style attribute
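The dictionary lookup described above can be sketched as follows. This is a simplified illustration, not JTidy's implementation: StyleDictionary, its string key, and the "c" + counter naming scheme are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: maps (tag, properties) pairs to generated class
// names, reusing a name when the same pair has been seen before.
class StyleDictionary {
    private final Map<String, String> classByKey = new LinkedHashMap<>();
    private int classNum = 0; // sequential number for generated classes

    // Assumed naming scheme: a fixed prefix followed by a counter.
    private String gensymClass() {
        classNum++;
        return "c" + classNum;
    }

    // Returns the existing class for this tag and property list,
    // or generates and records a new one.
    String findStyle(String tag, String properties) {
        String key = tag + "|" + properties;
        String existing = classByKey.get(key);
        if (existing != null) {
            return existing;
        }
        String name = gensymClass();
        classByKey.put(key, name);
        return name;
    }

    // Renders the rules that would go into a style element in the head.
    String toCss() {
        StringBuilder css = new StringBuilder();
        for (Map.Entry<String, String> e : classByKey.entrySet()) {
            String key = e.getKey();
            int bar = key.indexOf('|');
            css.append(key, 0, bar).append('.').append(e.getValue())
               .append(" { ").append(key.substring(bar + 1)).append(" }\n");
        }
        return css.toString();
    }
}
```

Looking up the same tag and property list twice returns the same class name, so repeated inline styles collapse into a single rule.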
-
addColorRule
private void addColorRule(Lexer lexer, java.lang.String selector, java.lang.String color)
Adds a css rule for color.
- Parameters:
lexer - Lexer
selector - css selector
color - color value
-
cleanBodyAttrs
private void cleanBodyAttrs(Lexer lexer, Node body)
Move presentation attribs from body to style element.
background="foo" -> body { background-image: url(foo) }
bgcolor="foo" -> body { background-color: foo }
text="foo" -> body { color: foo }
link="foo" -> :link { color: foo }
vlink="foo" -> :visited { color: foo }
alink="foo" -> :active { color: foo }
- Parameters:
lexer - Lexer
body - body node
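As a rough illustration of the mapping listed above, the following sketch converts a plain attribute map into CSS rules. BodyAttrRules and its methods are hypothetical; the real method operates on JTidy's Node and Lexer types, and grouping the body properties into one rule is a choice made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turns presentational body attributes into CSS rules.
class BodyAttrRules {

    static List<String> toCssRules(Map<String, String> bodyAttrs) {
        // Properties that apply to the body element itself.
        List<String> bodyProps = new ArrayList<>();
        addIfPresent(bodyProps, bodyAttrs, "background", "background-image: url(%s)");
        addIfPresent(bodyProps, bodyAttrs, "bgcolor", "background-color: %s");
        addIfPresent(bodyProps, bodyAttrs, "text", "color: %s");

        List<String> rules = new ArrayList<>();
        if (!bodyProps.isEmpty()) {
            rules.add("body { " + String.join("; ", bodyProps) + " }");
        }
        // Link colors become rules on the link pseudo-classes.
        addIfPresent(rules, bodyAttrs, "link", ":link { color: %s }");
        addIfPresent(rules, bodyAttrs, "vlink", ":visited { color: %s }");
        addIfPresent(rules, bodyAttrs, "alink", ":active { color: %s }");
        return rules;
    }

    // Formats the template with the attribute value if the attribute exists.
    private static void addIfPresent(List<String> out, Map<String, String> attrs,
                                     String attr, String template) {
        String value = attrs.get(attr);
        if (value != null) {
            out.add(String.format(template, value));
        }
    }
}
```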
-
niceBody
private boolean niceBody(Lexer lexer, Node doc)
Check deprecated attributes in body tag.
- Parameters:
lexer - Lexer
doc - document root node
- Returns:
true if the body doesn't contain deprecated attributes, false otherwise.
-
createStyleElement
private void createStyleElement(Lexer lexer, Node doc)
Create style element using rules from dictionary.
- Parameters:
lexer - Lexer
doc - root node
-
fixNodeLinks
private void fixNodeLinks(Node node)
Ensure bidirectional links are consistent.
- Parameters:
node - root node
-
stripOnlyChild
private void stripOnlyChild(Node node)
Used to strip child of node when the node has one and only one child.
- Parameters:
node - parent node
-
discardContainer
private void discardContainer(Node element, Node[] pnode)
Used to strip font start and end tags.
- Parameters:
element - original node
pnode - passed in as array to allow modification. pnode[0] will contain the final node
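Several rules in this class use a one-element Node[] as an output parameter, since Java passes object references by value. A minimal sketch of that idiom is shown below; OutParamDemo, SimpleNode, and dropIfFont are hypothetical names that stand in for JTidy's own types.

```java
// Hypothetical sketch of the one-element-array output parameter idiom.
class OutParamDemo {

    // Minimal stand-in for a tree node; JTidy's Node class is richer.
    static final class SimpleNode {
        final String name;
        SimpleNode next;

        SimpleNode(String name) {
            this.name = name;
        }
    }

    // Returns true if the node was dropped. In either case pnode[0] refers
    // to the node the caller should continue with.
    static boolean dropIfFont(SimpleNode node, SimpleNode[] pnode) {
        if ("font".equals(node.name)) {
            pnode[0] = node.next; // the caller continues with the successor
            return true;
        }
        pnode[0] = node; // nothing changed
        return false;
    }
}
```

The boolean result reports whether a rule fired, while the array slot carries the replacement node back to the caller.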
-
addStyleProperty
private void addStyleProperty(Node node, java.lang.String property)
Add style property to element, creating style attribute as needed and adding ; delimiter.
- Parameters:
node - node
property - property added to node
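At the string level, the behaviour described above might look like the sketch below. StyleAttr is a hypothetical helper that works on the attribute value only; it is not the JTidy method, which modifies the node's attribute list.

```java
// Hypothetical sketch: computes the new value of a style attribute after
// appending one property, inserting the ';' delimiter when needed.
class StyleAttr {

    static String addStyleProperty(String existingStyle, String property) {
        // No style attribute yet: the property becomes the whole value.
        if (existingStyle == null || existingStyle.trim().isEmpty()) {
            return property;
        }
        String style = existingStyle.trim();
        // Make sure the existing declarations end with a delimiter.
        if (!style.endsWith(";")) {
            style += ";";
        }
        return style + " " + property;
    }
}
```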
-
mergeProperties
private java.lang.String mergeProperties(java.lang.String s1, java.lang.String s2)
Create new string that consists of the combined style properties in s1 and s2. To merge property lists, we build a linked list of property/values and insert properties into the list in order, merging values for the same property name.
- Parameters:
s1 - first property
s2 - second property
- Returns:
merged properties
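A sketch of the merge, using a sorted map in place of the linked list described above. PropertyMerger is hypothetical, and the policy for duplicate names (the first value seen is kept) is an assumption of this example rather than something stated here.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: merges two style strings into one sorted property list.
class PropertyMerger {

    // Parses "name: value; name: value" and inserts each pair into props.
    // Assumption: when a name is already present, the existing value wins.
    private static void insertAll(Map<String, String> props, String style) {
        for (String decl : style.split(";")) {
            int colon = decl.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed declarations
            }
            String name = decl.substring(0, colon).trim();
            String value = decl.substring(colon + 1).trim();
            if (!name.isEmpty()) {
                props.putIfAbsent(name, value);
            }
        }
    }

    static String mergeProperties(String s1, String s2) {
        // TreeMap keeps the properties ordered by name.
        Map<String, String> props = new TreeMap<>();
        insertAll(props, s1);
        insertAll(props, s2);
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (out.length() > 0) {
                out.append("; ");
            }
            out.append(e.getKey()).append(": ").append(e.getValue());
        }
        return out.toString();
    }
}
```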
-
mergeClasses
private void mergeClasses(Node node, Node child)
Merge class attributes from 2 nodes.
- Parameters:
node - Node
child - Child node
-
mergeStyles
private void mergeStyles(Node node, Node child)
Merge style from 2 nodes.
- Parameters:
node - Node
child - Child node
-
fontSize2Name
private java.lang.String fontSize2Name(java.lang.String size)
Map a % font size to a named font size.
- Parameters:
size - size in %
- Returns:
font size name
-
addFontFace
private void addFontFace(Node node, java.lang.String face)
Adds a font-family style.
- Parameters:
node - Node
face - font face
-
addFontSize
private void addFontSize(Node node, java.lang.String size)
Adds a font size style.
- Parameters:
node - Node
size - font size
-
addFontColor
private void addFontColor(Node node, java.lang.String color)
Adds a font color style.
- Parameters:
node - Node
color - color value
-
addAlign
private void addAlign(Node node, java.lang.String align)
Adds an align style.
- Parameters:
node - Node
align - align value
-
addFontStyles
private void addFontStyles(Node node, AttVal av)
Add style properties to node corresponding to the font face, size and color attributes.
- Parameters:
node - font tag
av - attribute list for node
-
textAlign
private void textAlign(Lexer lexer, Node node)
Symptom: <p align=center>. Action: <p style="text-align: center">.
- Parameters:
lexer - Lexer
node - node with center attribute. Will be modified to use css style.
-
tableBgColor
private void tableBgColor(Node node)
-
dir2Div
private boolean dir2Div(Lexer lexer, Node node)
Symptom: <dir><li> where <li> is only child. Action: coerce <dir><li> to <div> with indent. The clean up rules use the pnode argument to return the next node when the original node has been deleted.
- Parameters:
lexer - Lexer
node - dir tag
- Returns:
true if a dir tag has been coerced to a div
-
center2Div
private boolean center2Div(Lexer lexer, Node node, Node[] pnode)
Symptom: <center>. Action: replace <center> by <div style="text-align: center">.
- Parameters:
lexer - Lexer
node - center tag
pnode - pnode[0] is the same as node, passed in as an array to allow modification
- Returns:
true if a center tag has been replaced by a div
-
mergeDivs
private boolean mergeDivs(Lexer lexer, Node node)
Symptom: <div><div>...</div></div> Action: merge the two divs. This is useful after nested <dir>s used by Word for indenting have been converted to <div>s.
- Parameters:
lexer - Lexer
node - first div
- Returns:
true if the divs have been merged
-
nestedList
private boolean nestedList(Lexer lexer, Node node, Node[] pnode)
Symptom: ...
- Parameters:
lexer - Lexer
node - Node
pnode - passed in as array to allow modifications.
- Returns:
true if nested lists have been found and replaced
-
-
blockStyle
private boolean blockStyle(Lexer lexer, Node node)
Symptom: the only child of a block-level element is a presentation element such as B, I or FONT. Action: add style "font-weight: bold" to the block and strip the <b> element, leaving its children.
Example:
<p> <b><font face="Arial" size="6">Draft Recommended Practice</font></b> </p>
becomes:
<p style="font-weight: bold; font-family: Arial; font-size: 6"> Draft Recommended Practice </p>
This code also replaces the align attribute by a style attribute. However, to avoid CSS problems with Navigator 4, this isn't done for the elements caption, tr and table.
- Parameters:
lexer - Lexer
node - parent node
- Returns:
true if the child node has been removed
-
inlineStyle
private boolean inlineStyle(Lexer lexer, Node node, Node[] pnode)
If the node has only one b, i, or font child, remove the child node and add the appropriate style attributes to parent.
- Parameters:
lexer - Lexer
node - parent node
pnode - passed as an array to allow modifications
- Returns:
true if child node has been stripped, replaced by style attributes.
-
font2Span
private boolean font2Span(Lexer lexer, Node node, Node[] pnode)
Replace font elements by span elements, deleting the font element's attributes and replacing them by a single style attribute.
- Parameters:
lexer - Lexer
node - font tag
pnode - passed as an array to allow modifications
- Returns:
true if a font tag has been dropped and replaced by style attributes
-
cleanNode
private Node cleanNode(Lexer lexer, Node node)
Applies all matching rules to a node.
- Parameters:
lexer - Lexer
node - original node
- Returns:
cleaned up node
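The class description says rules are applied until no more of them apply. The general shape of such a loop can be sketched as below. This is an illustration on plain strings with hypothetical names (RuleLoop, applyUntilStable), not how the JTidy method is written; it also assumes the supplied rules eventually stop changing the input.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a "apply rules until none applies" loop.
class RuleLoop {

    // Tries the rules in order; when one changes the input, starts over
    // from the first rule. Stops once a full pass changes nothing.
    static String applyUntilStable(String input, List<UnaryOperator<String>> rules) {
        String current = input;
        boolean changed = true;
        while (changed) {
            changed = false;
            for (UnaryOperator<String> rule : rules) {
                String next = rule.apply(current);
                if (!next.equals(current)) {
                    current = next;
                    changed = true;
                    break; // restart with the first rule
                }
            }
        }
        return current;
    }
}
```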
-
createStyleProperties
private Node createStyleProperties(Lexer lexer, Node node, Node[] prepl)
Special case: if the current node is destroyed by CleanNode() lower in the tree, this node and its parent no longer exist. So we must jump back up the CreateStyleProperties() call stack until we have a valid node reference.
- Parameters:
lexer - Lexer
node - Node
prepl - passed in as array to allow modifications
- Returns:
cleaned Node
-
defineStyleRules
private void defineStyleRules(Lexer lexer, Node node)
Find style attribute in node content, and replace it by corresponding class attribute.
- Parameters:
lexer - Lexer
node - parent node
-
cleanTree
public void cleanTree(Lexer lexer, Node doc)
Clean an html tree.
- Parameters:
lexer - Lexer
doc - root node
-
nestedEmphasis
public void nestedEmphasis(Node node)
simplifies ... ... etc.
- Parameters:
node - root Node
-
emFromI
public void emFromI(Node node)
Replace i by em and b by strong.
- Parameters:
node - root Node
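A recursive rename over a tree could look like the sketch below. EmphasisRenamer and its Element type are hypothetical stand-ins for illustration; the JTidy method works on its own Node and tag dictionary types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: renames i to em and b to strong throughout a tree.
class EmphasisRenamer {

    // Minimal stand-in for an element node.
    static final class Element {
        String name;
        final List<Element> children = new ArrayList<>();

        Element(String name) {
            this.name = name;
        }

        Element add(Element child) {
            children.add(child);
            return this;
        }
    }

    static void emFromI(Element node) {
        if ("i".equals(node.name)) {
            node.name = "em";
        } else if ("b".equals(node.name)) {
            node.name = "strong";
        }
        // Visit the whole subtree, not just the root.
        for (Element child : node.children) {
            emFromI(child);
        }
    }
}
```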
-
list2BQ
public void list2BQ(Node node)
Some people use dir or ul without an li to indent the content. The pattern to look for is a list with a single implicit li. This is recursively replaced by an implicit blockquote.
- Parameters:
node - root Node
-
bQ2Div
public void bQ2Div(Node node)
Replace implicit blockquote by div with an indent taking care to reduce nested blockquotes to a single div with the indent set to match the nesting depth.
- Parameters:
node - root Node
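The reduction of nested blockquotes to a single indented div can be sketched as follows. BlockquoteCollapser, its Element type, the margin-left property, and the emPerLevel parameter are all assumptions of this example; it also ignores the "implicit" flag that the real method checks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collapses a chain of nested blockquotes into one div
// whose indent grows with the nesting depth.
class BlockquoteCollapser {

    // Minimal stand-in for an element node.
    static final class Element {
        final String name;
        String style;
        final List<Element> children = new ArrayList<>();

        Element(String name) {
            this.name = name;
        }
    }

    static Element collapse(Element blockquote, int emPerLevel) {
        int depth = 1;
        Element inner = blockquote;
        // Descend while the only child is another blockquote.
        while (inner.children.size() == 1
                && "blockquote".equals(inner.children.get(0).name)) {
            inner = inner.children.get(0);
            depth++;
        }
        Element div = new Element("div");
        div.style = "margin-left: " + (depth * emPerLevel) + "em";
        div.children.addAll(inner.children); // keep the innermost content
        return div;
    }
}
```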
-
findEnclosingCell
Node findEnclosingCell(Node node)
Find the enclosing table cell for the given node.
- Parameters:
node - Node
- Returns:
enclosing cell node
-
pruneSection
public Node pruneSection(Lexer lexer, Node node)
node is <![if ...]>, prune up to <![endif]>.
- Parameters:
lexer - Lexer
node - Node
- Returns:
cleaned up Node
-
dropSections
public void dropSections(Lexer lexer, Node node)
Drop if/endif sections inserted by word2000.
- Parameters:
lexer - Lexer
node - Node root node
-
purgeWord2000Attributes
public void purgeWord2000Attributes(Node node)
Remove word2000 attributes from node.
- Parameters:
node - node to cleanup
-
stripSpan
public Node stripSpan(Lexer lexer, Node span)
Word2000 uses span excessively, so we strip span out.
- Parameters:
lexer - Lexer
span - Node span
- Returns:
cleaned node
-
normalizeSpaces
private void normalizeSpaces(Lexer lexer, Node node)
Map non-breaking spaces to regular spaces.
- Parameters:
lexer - Lexer
node - Node
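On a plain string, the mapping amounts to replacing U+00A0 with an ordinary space. The sketch below uses a hypothetical SpaceNormalizer helper; the JTidy method itself walks text nodes in the lexer buffer rather than Java strings.

```java
// Hypothetical sketch: maps non-breaking spaces (U+00A0) to regular spaces.
class SpaceNormalizer {

    static String normalizeSpaces(String text) {
        return text.replace('\u00A0', ' ');
    }
}
```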
-
noMargins
boolean noMargins(Node node)
Used to hunt for hidden preformatted sections.
- Parameters:
node - checked node
- Returns:
true if the node has a "margin-top: 0" or "margin-bottom: 0" style
-
singleSpace
boolean singleSpace(Lexer lexer, Node node)
Does element have a single space as its content?
- Parameters:
lexer - Lexer
node - checked node
- Returns:
true if the element has a single space as its content
-
cleanWord2000
public void cleanWord2000(Lexer lexer, Node node)
This is a major clean up to strip out all the extra stuff you get when you save as web page from Word 2000. It doesn't yet know what to do with VML tags, but these will appear as errors unless you declare them as new tags, such as o:p which needs to be declared as inline.
- Parameters:
lexer - Lexer
node - node to clean up
-
isWord2000
public boolean isWord2000(Node root)
Check if the current document is a converted Word document.
- Parameters:
root - root Node
- Returns:
true if the document has been generated by Microsoft Word.
-
-