Package org.w3c.tidy

Class Clean


  • public class Clean
    extends java.lang.Object
    Clean up misuse of presentation markup. Filters from other formats such as Microsoft Word often make excessive use of presentation markup such as font tags, B, I, and the align attribute. By applying a set of production rules, it is straightforward to transform this markup to use CSS. Some rules replace some of the children of an element by style properties on the element, e.g.

    <p><b>...</b></p>

    becomes

    <p style="font-weight: bold">...</p>

    Such rules are applied to the element's content and then to the element itself until no more rules apply. Once all the rules have been applied to an element, it will have a style attribute with one or more properties. Other rules strip the element they apply to, replacing it by style properties on the contents, e.g.
    <dir><li><p>...</li></dir>

    becomes

    <p style="margin-left: 1em">...

    These rules are applied to an element before processing its content and replace the current element by the first element in the exposed content.

    After applying both sets of rules, you can replace the style attribute by a class value and a style rule in the document head. To support this, an association of styles and class names is built. A naive approach is to rely on string matching to test when two property lists are the same. A better approach is to sort the properties before matching.
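
    The difference between naive string matching and matching on sorted properties can be illustrated with a minimal sketch. The class StyleKeySketch and its canonical method are hypothetical names introduced here for illustration; they are not part of this package, and the parsing is deliberately simplistic (it assumes values contain no semicolons).

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class StyleKeySketch {
    /** Re-emits a "name: value; name: value" list with properties sorted by name. */
    static String canonical(String style) {
        Map<String, String> props = new TreeMap<>(); // TreeMap keeps keys sorted
        for (String decl : style.split(";")) {
            int colon = decl.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed declarations
            }
            String name = decl.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            props.put(name, decl.substring(colon + 1).trim());
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(e.getKey()).append(": ").append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "font-weight: bold; color: red";
        String b = "color: red;font-weight: bold;";
        System.out.println(a.equals(b));                       // naive comparison
        System.out.println(canonical(a).equals(canonical(b))); // comparison after sorting
    }
}
```

    With sorting, two property lists that differ only in order or spacing map to the same key, so they can share one generated class.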

    Version:
    $Revision: 1125 $ ($Author: aditsu $)
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private int classNum
      sequential number for generated css classes.
      private TagTable tt
      Tag table.
    • Constructor Summary

      Constructors 
      Constructor Description
      Clean​(TagTable tagTable)
      Instantiates a new Clean.
    • Method Summary

      Modifier and Type Method Description
      private void addAlign​(Node node, java.lang.String align)
      Adds an align style.
      private void addColorRule​(Lexer lexer, java.lang.String selector, java.lang.String color)
      Adds a css rule for color.
      private void addFontColor​(Node node, java.lang.String color)
      Adds a font color style.
      private void addFontFace​(Node node, java.lang.String face)
      Adds a font-family style.
      private void addFontSize​(Node node, java.lang.String size)
      Adds a font size style.
      private void addFontStyles​(Node node, AttVal av)
      Add style properties to node corresponding to the font face, size and color attributes.
      private java.lang.String addProperty​(java.lang.String style, java.lang.String property)
      Creates a string with merged properties.
      private void addStyleProperty​(Node node, java.lang.String property)
      Add style property to element, creating style attribute as needed and adding ; delimiter.
      private boolean blockStyle​(Lexer lexer, Node node)
      Symptom: the only child of a block-level element is a presentation element such as B, I or FONT.
      void bQ2Div​(Node node)
      Replace implicit blockquote by div with an indent taking care to reduce nested blockquotes to a single div with the indent set to match the nesting depth.
      (package private) static void bumpObject​(Lexer lexer, Node html)
      Where appropriate move object elements from head to body.
      private boolean center2Div​(Lexer lexer, Node node, Node[] pnode)
      Symptom: <center>. Action: replace <center> by <div style="text-align: center">.
      private void cleanBodyAttrs​(Lexer lexer, Node body)
      Move presentation attribs from body to style element.
      private Node cleanNode​(Lexer lexer, Node node)
      Applies all matching rules to a node.
      void cleanTree​(Lexer lexer, Node doc)
      Clean an html tree.
      void cleanWord2000​(Lexer lexer, Node node)
      This is a major clean up to strip out all the extra stuff you get when you save as web page from Word 2000.
      private StyleProp createProps​(StyleProp prop, java.lang.String style)
      Create sorted linked list of properties from style string.
      private java.lang.String createPropString​(StyleProp props)
      Create a css property.
      private void createStyleElement​(Lexer lexer, Node doc)
      Create style element using rules from dictionary.
      private Node createStyleProperties​(Lexer lexer, Node node, Node[] prepl)
      Special case: if the current node is destroyed by cleanNode() lower in the tree, this node and its parent no longer exist.
      private void defineStyleRules​(Lexer lexer, Node node)
      Find style attribute in node content, and replace it by corresponding class attribute.
      private boolean dir2Div​(Lexer lexer, Node node)
      Symptom: <dir><li> where <li> is only child.
      private void discardContainer​(Node element, Node[] pnode)
      Used to strip font start and end tags.
      void dropSections​(Lexer lexer, Node node)
      Drop if/endif sections inserted by word2000.
      void emFromI​(Node node)
      Replace i by em and b by strong.
      (package private) Node findEnclosingCell​(Node node)
      Find the enclosing table cell for the given node.
      private java.lang.String findStyle​(Lexer lexer, java.lang.String tag, java.lang.String properties)
      Finds a css style.
      private void fixNodeLinks​(Node node)
      Ensure bidirectional links are consistent.
      private boolean font2Span​(Lexer lexer, Node node, Node[] pnode)
      Replace font elements by span elements, deleting the font element's attributes and replacing them by a single style attribute.
      private java.lang.String fontSize2Name​(java.lang.String size)
      Map a % font size to a named font size.
      private java.lang.String gensymClass​(Lexer lexer)
      Generates a new css class name.
      private boolean inlineStyle​(Lexer lexer, Node node, Node[] pnode)
      If the node has only one b, i, or font child remove the child node and add the appropriate style attributes to parent.
      private StyleProp insertProperty​(StyleProp props, java.lang.String name, java.lang.String value)
      Insert a css style property.
      boolean isWord2000​(Node root)
      Check if the current document is a converted Word document.
      void list2BQ​(Node node)
      Some people use dir or ul without an li to indent the content.
      private void mergeClasses​(Node node, Node child)
      Merge class attributes from 2 nodes.
      private boolean mergeDivs​(Lexer lexer, Node node)
      Symptom: <div><div>...</div></div> Action: merge the two divs.
      private java.lang.String mergeProperties​(java.lang.String s1, java.lang.String s2)
      Create new string that consists of the combined style properties in s1 and s2.
      private void mergeStyles​(Node node, Node child)
      Merge style from 2 nodes.
      void nestedEmphasis​(Node node)
      Simplifies nested emphasis such as <b><b> ... </b> ... </b>.
      private boolean nestedList​(Lexer lexer, Node node, Node[] pnode)
      Symptom: <ul><li><ul>...</ul></li></ul>.
      private boolean niceBody​(Lexer lexer, Node doc)
      Check deprecated attributes in body tag.
      (package private) boolean noMargins​(Node node)
      Used to hunt for hidden preformatted sections.
      private void normalizeSpaces​(Lexer lexer, Node node)
      Map non-breaking spaces to regular spaces.
      Node pruneSection​(Lexer lexer, Node node)
      node is <![if ...]> prune up to <![endif]>.
      void purgeWord2000Attributes​(Node node)
      Remove word2000 attributes from node.
      (package private) boolean singleSpace​(Lexer lexer, Node node)
      Does element have a single space as its content?
      private void stripOnlyChild​(Node node)
      Used to strip child of node when the node has one and only one child.
      Node stripSpan​(Lexer lexer, Node span)
      Word2000 uses span excessively, so we strip span out.
      private void style2Rule​(Lexer lexer, Node node)
      Find style attribute in node, and replace it by corresponding class attribute.
      private void tableBgColor​(Node node)  
      private void textAlign​(Lexer lexer, Node node)
      Symptom: <p align=center>.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • classNum

        private int classNum
        sequential number for generated css classes.
    • Constructor Detail

      • Clean

        public Clean​(TagTable tagTable)
        Instantiates a new Clean.
        Parameters:
        tagTable - tag table instance
    • Method Detail

      • insertProperty

        private StyleProp insertProperty​(StyleProp props,
                                         java.lang.String name,
                                         java.lang.String value)
        Insert a css style property.
        Parameters:
        props - StyleProp instance
        name - property name
        value - property value
        Returns:
        StyleProp containing the given property
      • createProps

        private StyleProp createProps​(StyleProp prop,
                                      java.lang.String style)
        Create sorted linked list of properties from style string.
        Parameters:
        prop - StyleProp
        style - style string
        Returns:
        StyleProp with given style
      • createPropString

        private java.lang.String createPropString​(StyleProp props)
        Create a css property.
        Parameters:
        props - StyleProp
        Returns:
        css property as String
      • addProperty

        private java.lang.String addProperty​(java.lang.String style,
                                             java.lang.String property)
        Creates a string with merged properties.
        Parameters:
        style - css style
        property - css properties
        Returns:
        merged string
      • gensymClass

        private java.lang.String gensymClass​(Lexer lexer)
        Generates a new css class name.
        Parameters:
        lexer - Lexer
        Returns:
        generated css class
      • findStyle

        private java.lang.String findStyle​(Lexer lexer,
                                           java.lang.String tag,
                                           java.lang.String properties)
        Finds a css style.
        Parameters:
        lexer - Lexer
        tag - tag name
        properties - css properties
        Returns:
        style string
      • style2Rule

        private void style2Rule​(Lexer lexer,
                                Node node)
        Find style attribute in node, and replace it by corresponding class attribute. Search for the class in the style dictionary; if it is not found, gensym a new class and add it to the dictionary. Assumes that node doesn't have a class attribute.
        Parameters:
        lexer - Lexer
        node - node with a style attribute
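
        The lookup-or-generate step described above can be sketched as follows. StyleDictionarySketch is a hypothetical stand-in, not the implementation in this class; in particular the "c" prefix for generated class names and the key format are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StyleDictionarySketch {
    // Maps "tag{properties}" to the class name generated for it.
    private final Map<String, String> classByRule = new LinkedHashMap<>();
    private int classNum = 1; // sequential number, in the spirit of the classNum field

    /** Assumed naming scheme: "c" followed by the next sequential number. */
    private String gensymClass() {
        return "c" + classNum++;
    }

    /** Returns the class for this tag and property list, generating one on first use. */
    String findStyle(String tag, String properties) {
        return classByRule.computeIfAbsent(tag + "{" + properties + "}", key -> gensymClass());
    }

    public static void main(String[] args) {
        StyleDictionarySketch dict = new StyleDictionarySketch();
        System.out.println(dict.findStyle("p", "color: red"));    // new entry
        System.out.println(dict.findStyle("p", "color: red"));    // reuses the entry
        System.out.println(dict.findStyle("span", "color: red")); // different tag, new entry
    }
}
```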
      • addColorRule

        private void addColorRule​(Lexer lexer,
                                  java.lang.String selector,
                                  java.lang.String color)
        Adds a css rule for color.
        Parameters:
        lexer - Lexer
        selector - css selector
        color - color value
      • cleanBodyAttrs

        private void cleanBodyAttrs​(Lexer lexer,
                                    Node body)
        Move presentation attribs from body to style element.
         background="foo" -> body { background-image: url(foo) }
         bgcolor="foo" -> body { background-color: foo }
         text="foo" -> body { color: foo }
         link="foo" -> :link { color: foo }
         vlink="foo" -> :visited { color: foo }
         alink="foo" -> :active { color: foo }
         
        Parameters:
        lexer - Lexer
        body - body node
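
        The attribute-to-rule mapping listed above can be expressed as a small table-driven sketch. BodyAttrsSketch and toCssRules are hypothetical names; the sketch works on a plain attribute map rather than on Node and Lexer, and it groups the three body properties into a single body rule, which is a choice made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class BodyAttrsSketch {
    /** Converts presentational body attributes into CSS rules. */
    static List<String> toCssRules(Map<String, String> attrs) {
        List<String> bodyProps = new ArrayList<>();
        if (attrs.containsKey("background")) {
            bodyProps.add("background-image: url(" + attrs.get("background") + ")");
        }
        if (attrs.containsKey("bgcolor")) {
            bodyProps.add("background-color: " + attrs.get("bgcolor"));
        }
        if (attrs.containsKey("text")) {
            bodyProps.add("color: " + attrs.get("text"));
        }

        List<String> rules = new ArrayList<>();
        if (!bodyProps.isEmpty()) {
            rules.add("body { " + String.join("; ", bodyProps) + " }");
        }
        addColorRule(rules, ":link", attrs.get("link"));
        addColorRule(rules, ":visited", attrs.get("vlink"));
        addColorRule(rules, ":active", attrs.get("alink"));
        return rules;
    }

    /** Appends "selector { color: value }" when the attribute was present. */
    private static void addColorRule(List<String> rules, String selector, String color) {
        if (color != null) {
            rules.add(selector + " { color: " + color + " }");
        }
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new java.util.LinkedHashMap<>();
        attrs.put("bgcolor", "white");
        attrs.put("link", "blue");
        toCssRules(attrs).forEach(System.out::println);
    }
}
```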
      • niceBody

        private boolean niceBody​(Lexer lexer,
                                 Node doc)
        Check deprecated attributes in body tag.
        Parameters:
        lexer - Lexer
        doc - document root node
        Returns:
        true if the body doesn't contain deprecated attributes, false otherwise.
      • createStyleElement

        private void createStyleElement​(Lexer lexer,
                                        Node doc)
        Create style element using rules from dictionary.
        Parameters:
        lexer - Lexer
        doc - root node
      • fixNodeLinks

        private void fixNodeLinks​(Node node)
        Ensure bidirectional links are consistent.
        Parameters:
        node - root node
      • stripOnlyChild

        private void stripOnlyChild​(Node node)
        Used to strip child of node when the node has one and only one child.
        Parameters:
        node - parent node
      • discardContainer

        private void discardContainer​(Node element,
                                      Node[] pnode)
        Used to strip font start and end tags.
        Parameters:
        element - original node
        pnode - passed in as array to allow modification. pnode[0] will contain the final node
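
        Java has no out-parameters, so several methods in this class take a one-element array that the callee can overwrite. The sketch below illustrates that pattern on a simplified tree. DiscardContainerSketch and its nested Node class are hypothetical and use a child list instead of the linked node structure of org.w3c.tidy.Node; the choice of which node to continue with is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class DiscardContainerSketch {
    /** Minimal tree node used only for this illustration. */
    static final class Node {
        final String name;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    /**
     * Replaces element by its children in the parent's child list.
     * pnode[0] receives the node at which processing should continue.
     * Assumes element has a parent.
     */
    static void discardContainer(Node element, Node[] pnode) {
        Node parent = element.parent;
        int index = parent.children.indexOf(element);
        List<Node> exposed = new ArrayList<>(element.children);

        element.children.clear();
        element.parent = null;
        parent.children.remove(index);
        parent.children.addAll(index, exposed);
        for (Node child : exposed) {
            child.parent = parent;
        }

        if (!exposed.isEmpty()) {
            pnode[0] = exposed.get(0);            // first exposed child
        } else if (index < parent.children.size()) {
            pnode[0] = parent.children.get(index); // following sibling
        } else {
            pnode[0] = null;                       // nothing left to process
        }
    }

    public static void main(String[] args) {
        Node p = new Node("p");
        Node font = new Node("font").add(new Node("#text"));
        p.add(font);
        Node[] pnode = { font };
        discardContainer(font, pnode);
        System.out.println(pnode[0].name);
    }
}
```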
      • addStyleProperty

        private void addStyleProperty​(Node node,
                                      java.lang.String property)
        Add style property to element, creating style attribute as needed and adding ; delimiter.
        Parameters:
        node - node
        property - property added to node
      • mergeProperties

        private java.lang.String mergeProperties​(java.lang.String s1,
                                                 java.lang.String s2)
        Create new string that consists of the combined style properties in s1 and s2. To merge property lists, we build a linked list of property/values and insert properties into the list in order, merging values for the same property name.
        Parameters:
        s1 - first property
        s2 - second property
        Returns:
        merged properties
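
        A minimal sketch of the sorted linked-list merge described above. MergePropertiesSketch and its Prop node are hypothetical stand-ins for this class and StyleProp. How values for a repeated property name are combined is not stated here; the sketch assumes the first value seen is kept.

```java
final class MergePropertiesSketch {
    /** One node of a singly linked, name-sorted property list. */
    static final class Prop {
        final String name;
        final String value;
        Prop next;

        Prop(String name, String value, Prop next) {
            this.name = name;
            this.value = value;
            this.next = next;
        }
    }

    /** Inserts name/value so that the list stays sorted by name. */
    static Prop insert(Prop head, String name, String value) {
        Prop prev = null;
        Prop cur = head;
        while (cur != null) {
            int cmp = cur.name.compareTo(name);
            if (cmp == 0) {
                return head; // assumption: an existing property keeps its value
            }
            if (cmp > 0) {
                break; // insertion point found
            }
            prev = cur;
            cur = cur.next;
        }
        Prop node = new Prop(name, value, cur);
        if (prev == null) {
            return node; // new head
        }
        prev.next = node;
        return head;
    }

    /** Adds every "name: value" declaration in style to the list. */
    static Prop parse(Prop head, String style) {
        for (String decl : style.split(";")) {
            int colon = decl.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed declarations
            }
            head = insert(head, decl.substring(0, colon).trim(), decl.substring(colon + 1).trim());
        }
        return head;
    }

    static String mergeProperties(String s1, String s2) {
        StringBuilder sb = new StringBuilder();
        for (Prop p = parse(parse(null, s1), s2); p != null; p = p.next) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(p.name).append(": ").append(p.value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mergeProperties("font-weight: bold; color: red", "color: blue; font-family: Arial"));
    }
}
```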
      • mergeClasses

        private void mergeClasses​(Node node,
                                  Node child)
        Merge class attributes from 2 nodes.
        Parameters:
        node - Node
        child - Child node
      • mergeStyles

        private void mergeStyles​(Node node,
                                 Node child)
        Merge style from 2 nodes.
        Parameters:
        node - Node
        child - Child node
      • fontSize2Name

        private java.lang.String fontSize2Name​(java.lang.String size)
        Map a % font size to a named font size.
        Parameters:
        size - size in %
        Returns:
        font size name
      • addFontFace

        private void addFontFace​(Node node,
                                 java.lang.String face)
        Adds a font-family style.
        Parameters:
        node - Node
        face - font face
      • addFontSize

        private void addFontSize​(Node node,
                                 java.lang.String size)
        Adds a font size style.
        Parameters:
        node - Node
        size - font size
      • addFontColor

        private void addFontColor​(Node node,
                                  java.lang.String color)
        Adds a font color style.
        Parameters:
        node - Node
        color - color value
      • addAlign

        private void addAlign​(Node node,
                              java.lang.String align)
        Adds an align style.
        Parameters:
        node - Node
        align - align value
      • addFontStyles

        private void addFontStyles​(Node node,
                                   AttVal av)
        Add style properties to node corresponding to the font face, size and color attributes.
        Parameters:
        node - font tag
        av - attribute list for node
      • textAlign

        private void textAlign​(Lexer lexer,
                               Node node)
        Symptom: <p align=center>. Action: <p style="text-align: center">.
        Parameters:
        lexer - Lexer
        node - node with an align attribute. Will be modified to use a css style.
      • tableBgColor

        private void tableBgColor​(Node node)
      • dir2Div

        private boolean dir2Div​(Lexer lexer,
                                Node node)
        Symptom: <dir><li> where <li> is only child. Action: coerce <dir> <li> to <div> with indent. The clean up rules use the pnode argument to return the next node when the original node has been deleted.
        Parameters:
        lexer - Lexer
        node - dir tag
        Returns:
        true if a dir tag has been coerced to a div
      • center2Div

        private boolean center2Div​(Lexer lexer,
                                   Node node,
                                   Node[] pnode)
        Symptom: <center>. Action: replace <center> by <div style="text-align: center">.

        Parameters:
        lexer - Lexer
        node - center tag
        pnode - pnode[0] is the same as node, passed in as an array to allow modification
        Returns:
        true if a center tag has been replaced by a div
      • mergeDivs

        private boolean mergeDivs​(Lexer lexer,
                                  Node node)
        Symptom: <div><div>...</div></div> Action: merge the two divs. This is useful after nested <dir>s used by Word for indenting have been converted to <div>s.
        Parameters:
        lexer - Lexer
        node - first div
        Returns:
        true if the divs have been merged
      • nestedList

        private boolean nestedList​(Lexer lexer,
                                   Node node,
                                   Node[] pnode)
        Symptom: <ul><li><ul>...</ul></li></ul>. Action: discard outer list.
        Parameters:
        lexer - Lexer
        node - Node
        pnode - passed in as array to allow modifications.
        Returns:
        true if nested lists have been found and replaced
      • blockStyle

        private boolean blockStyle​(Lexer lexer,
                                   Node node)
        Symptom: the only child of a block-level element is a presentation element such as B, I or FONT. Action: add style "font-weight: bold" to the block and strip the <b> element, leaving its children. Example:
         <p>
         <b><font face="Arial" size="6">Draft Recommended Practice</font></b>
         </p>
         
        becomes:
         <p style="font-weight: bold; font-family: Arial; font-size: 6">
         Draft Recommended Practice
         </p>
         

        This code also replaces the align attribute by a style attribute. However, to avoid CSS problems with Navigator 4, this isn't done for the elements caption, tr and table.

        Parameters:
        lexer - Lexer
        node - parent node
        Returns:
        true if the child node has been removed
      • inlineStyle

        private boolean inlineStyle​(Lexer lexer,
                                    Node node,
                                    Node[] pnode)
        If the node has only one b, i, or font child remove the child node and add the appropriate style attributes to parent.
        Parameters:
        lexer - Lexer
        node - parent node
        pnode - passed as an array to allow modifications
        Returns:
        true if child node has been stripped, replaced by style attributes.
      • font2Span

        private boolean font2Span​(Lexer lexer,
                                  Node node,
                                  Node[] pnode)
        Replace font elements by span elements, deleting the font element's attributes and replacing them by a single style attribute.
        Parameters:
        lexer - Lexer
        node - font tag
        pnode - passed as an array to allow modifications
        Returns:
        true if a font tag has been dropped and replaced by style attributes
      • cleanNode

        private Node cleanNode​(Lexer lexer,
                               Node node)
        Applies all matching rules to a node.
        Parameters:
        lexer - Lexer
        node - original node
        Returns:
        cleaned up node
      • createStyleProperties

        private Node createStyleProperties​(Lexer lexer,
                                           Node node,
                                           Node[] prepl)
        Special case: if the current node is destroyed by cleanNode() lower in the tree, this node and its parent no longer exist. So we must jump back up the createStyleProperties() call stack until we have a valid node reference.
        Parameters:
        lexer - Lexer
        node - Node
        prepl - passed in as array to allow modifications
        Returns:
        cleaned Node
      • defineStyleRules

        private void defineStyleRules​(Lexer lexer,
                                      Node node)
        Find style attribute in node content, and replace it by corresponding class attribute.
        Parameters:
        lexer - Lexer
        node - parent node
      • cleanTree

        public void cleanTree​(Lexer lexer,
                              Node doc)
        Clean an html tree.
        Parameters:
        lexer - Lexer
        doc - root node
      • nestedEmphasis

        public void nestedEmphasis​(Node node)
        Simplifies nested emphasis such as <b><b> ... </b> ... </b> etc.
        Parameters:
        node - root Node
      • emFromI

        public void emFromI​(Node node)
        Replace i by em and b by strong.
        Parameters:
        node - root Node
      • list2BQ

        public void list2BQ​(Node node)
        Some people use dir or ul without an li to indent the content. The pattern to look for is a list with a single implicit li. This is recursively replaced by an implicit blockquote.
        Parameters:
        node - root Node
      • bQ2Div

        public void bQ2Div​(Node node)
        Replace implicit blockquote by div with an indent taking care to reduce nested blockquotes to a single div with the indent set to match the nesting depth.
        Parameters:
        node - root Node
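
        The collapsing of nested blockquotes into one indented div can be sketched on a simplified single-child tree. BlockquoteIndentSketch and its Node class are hypothetical, the sketch does not model the "implicit" flag, and the indent of 2em per nesting level is an assumption chosen for the example.

```java
final class BlockquoteIndentSketch {
    static final int EM_PER_LEVEL = 2; // assumed indent per nesting level

    /** Minimal node with at most one child, used only for this illustration. */
    static final class Node {
        final String name;
        final Node onlyChild; // null for a leaf
        String style;

        Node(String name, Node onlyChild) {
            this.name = name;
            this.onlyChild = onlyChild;
        }
    }

    /** Replaces a chain of nested blockquotes by one div indented to match the depth. */
    static Node bq2Div(Node blockquote) {
        int depth = 1;
        Node inner = blockquote;
        while (inner.onlyChild != null && "blockquote".equals(inner.onlyChild.name)) {
            inner = inner.onlyChild;
            depth++;
        }
        Node div = new Node("div", inner.onlyChild);
        div.style = "margin-left: " + (depth * EM_PER_LEVEL) + "em";
        return div;
    }

    public static void main(String[] args) {
        Node p = new Node("p", null);
        Node nested = new Node("blockquote", new Node("blockquote", p));
        System.out.println(bq2Div(nested).style);
    }
}
```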
      • findEnclosingCell

        Node findEnclosingCell​(Node node)
        Find the enclosing table cell for the given node.
        Parameters:
        node - Node
        Returns:
        enclosing cell node
      • pruneSection

        public Node pruneSection​(Lexer lexer,
                                 Node node)
        node is <![if ...]> prune up to <![endif]>.
        Parameters:
        lexer - Lexer
        node - Node
        Returns:
        cleaned up Node
      • dropSections

        public void dropSections​(Lexer lexer,
                                 Node node)
        Drop if/endif sections inserted by word2000.
        Parameters:
        lexer - Lexer
        node - root node
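
        Pruning everything between a <![if ...]> marker and its matching <![endif]> can be sketched on a flat token list. DropSectionsSketch is a hypothetical helper that works on strings rather than on Node and Lexer, and it assumes that sections may nest, which it tracks with a depth counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DropSectionsSketch {
    /** Returns the tokens that lie outside every if/endif section. */
    static List<String> dropSections(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        int depth = 0; // number of currently open if sections
        for (String token : tokens) {
            if (token.startsWith("<![if")) {
                depth++;
                continue;
            }
            if (token.startsWith("<![endif")) {
                if (depth > 0) {
                    depth--;
                }
                continue;
            }
            if (depth == 0) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "<p>", "<![if !vml]>", "<img>", "<![endif]>", "text", "</p>");
        System.out.println(dropSections(tokens));
    }
}
```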
      • purgeWord2000Attributes

        public void purgeWord2000Attributes​(Node node)
        Remove word2000 attributes from node.
        Parameters:
        node - node to cleanup
      • stripSpan

        public Node stripSpan​(Lexer lexer,
                              Node span)
        Word2000 uses span excessively, so we strip span out.
        Parameters:
        lexer - Lexer
        span - span node
        Returns:
        cleaned node
      • normalizeSpaces

        private void normalizeSpaces​(Lexer lexer,
                                     Node node)
        Map non-breaking spaces to regular spaces.
        Parameters:
        lexer - Lexer
        node - Node
      • noMargins

        boolean noMargins​(Node node)
        Used to hunt for hidden preformatted sections.
        Parameters:
        node - checked node
        Returns:
        true if the node has a "margin-top: 0" or "margin-bottom: 0" style
      • singleSpace

        boolean singleSpace​(Lexer lexer,
                            Node node)
        Does element have a single space as its content?
        Parameters:
        lexer - Lexer
        node - checked node
        Returns:
        true if the element has a single space as its content
      • cleanWord2000

        public void cleanWord2000​(Lexer lexer,
                                  Node node)
        This is a major clean up to strip out all the extra stuff you get when you save as web page from Word 2000. It doesn't yet know what to do with VML tags, but these will appear as errors unless you declare them as new tags, such as o:p which needs to be declared as inline.
        Parameters:
        lexer - Lexer
        node - node to clean up
      • isWord2000

        public boolean isWord2000​(Node root)
        Check if the current document is a converted Word document.
        Parameters:
        root - root Node
        Returns:
        true if the document has been generated by Microsoft Word.
      • bumpObject

        static void bumpObject​(Lexer lexer,
                               Node html)
        Where appropriate move object elements from head to body.
        Parameters:
        lexer - Lexer
        html - html node