Configuring Apache Tika’s HtmlParser

In my previous post about Apache Tika, I showed off a small Hello World program that demonstrated how quickly you can use it to parse HTML files. One of the first issues you will probably run into, though, is that Tika's HtmlParser does not handle every tag out of the box. For example, the code tag is not recognized. To deal with that, you need to create a custom HtmlMapper. In the code example below, I created an HtmlMapper that accepts all tags. Besides expanding the set of tags that the HtmlParser can handle, custom HtmlMappers are great for isolating the specific blocks you are interested in by discarding the ones you do not care about.

import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.ToHTMLContentHandler;

public class HelloApacheTika2 {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://chrisjordan.ca/post/15345467825/configuring-apache-tikas-htmlparser");

        ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
        Metadata metadata = new Metadata();

        // Register the custom mapper so the HtmlParser uses it instead of its default.
        ParseContext parseContext = new ParseContext();
        parseContext.set(HtmlMapper.class, new AllTagMapper());
        HtmlParser parser = new HtmlParser();

        // try-with-resources closes the stream even if parsing fails.
        try (InputStream input = url.openStream()) {
            parser.parse(input, toHTMLHandler, metadata, parseContext);
        }
        System.out.println(toHTMLHandler.toString());
    }
}

/**
 * An HtmlMapper that accepts all tags and attributes.
 */
class AllTagMapper implements HtmlMapper {
 
    @Override
    public String mapSafeElement(String name) {
        return name.toLowerCase();
    }

    @Override
    public boolean isDiscardElement(String name) {
        return false;
    }

    @Override
    public String mapSafeAttribute(String elementName, String attributeName) {
        return attributeName.toLowerCase();
    }

}
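
The other use mentioned above, discarding blocks you do not care about, might look roughly like the sketch below. The class name SelectiveTagMapper and both lists of names are made up for illustration. I also wrote it without the Tika import so it compiles on its own; to plug it in, you would declare it as "implements HtmlMapper" and pass an instance to parseContext.set(HtmlMapper.class, ...) the same way AllTagMapper is registered. It assumes the parser drops an attribute when mapSafeAttribute returns null and skips an element together with its content when isDiscardElement returns true, which is how I understand the interface to work.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical mapper logic that keeps most markup but discards a few
 * blocks entirely. The method signatures mirror the three HtmlMapper
 * methods implemented by AllTagMapper.
 */
class SelectiveTagMapper {

    // Assumed list of elements whose whole content should be thrown away.
    private static final Set<String> DISCARDED =
            new HashSet<>(Arrays.asList("script", "style", "nav", "footer"));

    // Assumed list of attributes worth keeping; all others are dropped.
    private static final Set<String> KEPT_ATTRIBUTES =
            new HashSet<>(Arrays.asList("href", "src", "class"));

    // Pass every element name through, normalized to lower case.
    public String mapSafeElement(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    // True means: skip this element and everything inside it.
    public boolean isDiscardElement(String name) {
        return DISCARDED.contains(name.toLowerCase(Locale.ROOT));
    }

    // Null means: do not emit this attribute.
    public String mapSafeAttribute(String elementName, String attributeName) {
        String attribute = attributeName.toLowerCase(Locale.ROOT);
        return KEPT_ATTRIBUTES.contains(attribute) ? attribute : null;
    }
}
```

The names are lower-cased before every lookup, so the checks do not depend on the case in which the parser reports a tag or attribute.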
