Configuring Apache Tika’s HtmlParser

In my previous post about Apache Tika, I showed a small Hello World program that demonstrates how quickly you can use it to parse HTML files. One of the first issues you will probably run into, though, is that its HtmlParser does not handle all tags out of the box. For example, the code tag is not recognized. To deal with that, you need to create a custom HtmlMapper and register it in the ParseContext. In the code example below, I created an HtmlMapper that accepts all tags and attributes. In addition to expanding the set of tags that the HtmlParser can handle, custom HtmlMappers are great for isolating the specific blocks you are interested in by discarding the ones you do not care about.

import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.ToHTMLContentHandler;

public class HelloApacheTika2 {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://chrisjordan.ca/post/15345467825/configuring-apache-tikas-htmlparser");

        ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
        Metadata metadata = new Metadata();

        // Register the custom mapper so the HtmlParser uses it instead of its default.
        ParseContext parseContext = new ParseContext();
        parseContext.set(HtmlMapper.class, new AllTagMapper());
        HtmlParser parser = new HtmlParser();

        // try-with-resources closes the stream even if parsing fails.
        try (InputStream input = url.openStream()) {
            parser.parse(input, toHTMLHandler, metadata, parseContext);
        }
        System.out.println(toHTMLHandler.toString());
    }
}

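/**
 * An HtmlMapper that accepts all tags and attributes.
 */
class AllTagMapper implements HtmlMapper {

    // Treat every element as safe; only normalize its name to lower case.
    @Override
    public String mapSafeElement(String name) {
        return name.toLowerCase();
    }

    // Never drop an element together with its content.
    @Override
    public boolean isDiscardElement(String name) {
        return false;
    }

    // Treat every attribute as safe; only normalize its name to lower case.
    @Override
    public String mapSafeAttribute(String elementName, String attributeName) {
        return attributeName.toLowerCase();
    }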

}
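
To make the "isolating specific blocks" idea a bit more concrete, here is a rough sketch of a more selective mapper. It is deliberately stand-alone so it compiles without Tika: the class name SelectiveTagMapper and the two tag lists are my own assumptions, and the class only mirrors the three HtmlMapper method signatures. As far as I understand the HtmlMapper contract, returning null from mapSafeElement drops the tag but keeps the text inside it, while returning true from isDiscardElement drops the element along with everything in it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: mirrors the three HtmlMapper method signatures but is
// deliberately stand-alone. With Tika on the classpath, declare it as
// "implements HtmlMapper" and add @Override to each method.
class SelectiveTagMapper {

    // Assumed list of elements to keep in the output.
    private static final Set<String> KEEP =
            new HashSet<>(Arrays.asList("p", "pre", "code", "a"));

    // Assumed list of elements to drop along with everything inside them.
    private static final Set<String> DISCARD =
            new HashSet<>(Arrays.asList("script", "style", "nav"));

    // Returns the lower-cased name for kept elements, or null for the rest.
    public String mapSafeElement(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return KEEP.contains(lower) ? lower : null;
    }

    // True only for elements on the discard list.
    public boolean isDiscardElement(String name) {
        return DISCARD.contains(name.toLowerCase(Locale.ROOT));
    }

    // Keeps only href on links; every other attribute is dropped.
    public String mapSafeAttribute(String elementName, String attributeName) {
        String element = elementName.toLowerCase(Locale.ROOT);
        String attribute = attributeName.toLowerCase(Locale.ROOT);
        return "a".equals(element) && "href".equals(attribute) ? attribute : null;
    }
}
```

Once it implements HtmlMapper, it would be registered the same way as the AllTagMapper above, through parseContext.set(HtmlMapper.class, ...).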

Parsing HTML with Apache Tika

Every now and then, I have to parse some HTML files. There are a lot of ways you can go about doing that. Recently, I have started using Apache Tika, and it does a pretty reasonable job (i.e. better than what I have done before). There is not a lot of documentation on Tika, so I had to do a bit of hacking to get my head around it.

A good start is this quick Hello World Tika program I put together. It parses this article. The TeeContentHandler passes the SAX events from the HtmlParser on to each of the ContentHandlers it was initialized with. For the purposes of this example, I am showing off three different handlers. The LinkContentHandler extracts links, which is useful for crawlers. The BodyContentHandler extracts the plain text of the page body, which is useful for indexers. The ToHTMLContentHandler produces XHTML, which is useful for extracting specific blocks of text and so is also good for indexers. One thing to be aware of when using the HtmlParser is that, by default, it does not support all tags. For example, it currently skips over the code tag. My next post will explain how to configure the HtmlParser to not do that :-)

import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.ContentHandler;

public class HelloApacheTika {

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://chrisjordan.ca/post/15219674437/parsing-html-with-apache-tika");

        // One handler per kind of output; the tee forwards every event to all three.
        LinkContentHandler linkHandler = new LinkContentHandler();
        ContentHandler textHandler = new BodyContentHandler();
        ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
        TeeContentHandler teeHandler = new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);

        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        HtmlParser parser = new HtmlParser();

        // try-with-resources closes the stream even if parsing fails.
        try (InputStream input = url.openStream()) {
            parser.parse(input, teeHandler, metadata, parseContext);
        }

        System.out.println("title:\n" + metadata.get("title"));
        System.out.println("links:\n" + linkHandler.getLinks());
        System.out.println("text:\n" + textHandler.toString());
        System.out.println("html:\n" + toHTMLHandler.toString());
    }
}
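
If the tee idea is not obvious, the sketch below shows the same fan-out using nothing but the SAX classes that ship with the JDK. To be clear, this is not how Tika implements TeeContentHandler; FanOutHandler, LinkCollector, TextCollector and FanOutDemo are names I made up for illustration, and the JDK's XML parser stands in for the HtmlParser, so the input has to be well-formed XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the tee idea: forwards events to every target.
// Only the three events used in this sketch are forwarded.
class FanOutHandler extends DefaultHandler {
    private final ContentHandler[] targets;

    FanOutHandler(ContentHandler... targets) {
        this.targets = targets;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        for (ContentHandler t : targets) t.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        for (ContentHandler t : targets) t.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler t : targets) t.characters(ch, start, length);
    }
}

// Collects the href of every a element, roughly what a crawler wants.
class LinkCollector extends DefaultHandler {
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String href = atts.getValue("href");
        if ("a".equals(qName) && href != null) links.add(href);
    }
}

// Collects all character data, roughly what an indexer wants.
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

class FanOutDemo {
    // Parses well-formed XHTML once and feeds every target from that single pass.
    static void parse(String xhtml, ContentHandler... targets) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new FanOutHandler(targets));
    }

    public static void main(String[] args) throws Exception {
        LinkCollector links = new LinkCollector();
        TextCollector text = new TextCollector();
        parse("<p>Hello <a href=\"http://example.com/\">Tika</a></p>", links, text);
        System.out.println("links: " + links.links);
        System.out.println("text: " + text.text);
    }
}
```

The point of the design is that the document is parsed once, and each handler builds its own view of it from the same stream of events.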

If you are using Maven, you need to add the following dependencies:

<!-- Apache Tika -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.0</version>
</dependency>

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.0</version>
</dependency>