Parsing HTML with Apache Tika
Every now and then, I have to parse some HTML files, and there are a lot of ways to go about it. Recently, I started using Apache Tika, and it does a pretty reasonable job (i.e. better than what I have used before). There is not a lot of documentation on Tika, so I had to do a bit of hacking to get my head around it.
A good start is this quick Hello World Tika program I put together. It parses this article. The TeeContentHandler passes the events it receives from the HtmlParser on to each of the ContentHandlers it was initialized with. For the purposes of this example, I am showing off three different handlers. The LinkContentHandler extracts links; useful for crawlers. The BodyContentHandler extracts the plain text of the page body; useful for indexers. The ToHTMLContentHandler serializes the XHTML that the parser produces; useful for pulling out specific blocks of text, which is also good for indexers.
One thing to be aware of when using the HtmlParser is that, out of the box, it does not support all tags. For example, it currently skips over the code tag. My next post will explain how to configure the HtmlParser to not do that :-)
import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.ContentHandler;

public class HelloApacheTika {

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://chrisjordan.ca/post/15219674437/parsing-html-with-apache-tika");

        // The tee forwards the parser's events to each of these three handlers.
        LinkContentHandler linkHandler = new LinkContentHandler();
        ContentHandler textHandler = new BodyContentHandler();
        ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
        TeeContentHandler teeHandler = new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);

        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        HtmlParser parser = new HtmlParser();

        // try-with-resources closes the stream even if parsing fails.
        try (InputStream input = url.openStream()) {
            parser.parse(input, teeHandler, metadata, parseContext);
        }

        System.out.println("title:\n" + metadata.get("title"));
        System.out.println("links:\n" + linkHandler.getLinks());
        System.out.println("text:\n" + textHandler.toString());
        System.out.println("html:\n" + toHTMLHandler.toString());
    }
}
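If the tee idea is new to you, here is a rough sketch of it using only the SAX classes that ship with the JDK. This is not how Tika implements TeeContentHandler, LinkContentHandler, or BodyContentHandler. The class names SimpleTee, LinkCollector, and TextCollector are made up for illustration, and the sketch only forwards two kinds of events and assumes well-formed XHTML input.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TeeSketch {

    /** Hypothetical tee: forwards the events it receives to every handler it was given. */
    static class SimpleTee extends DefaultHandler {
        private final DefaultHandler[] handlers;

        SimpleTee(DefaultHandler... handlers) {
            this.handlers = handlers;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler h : handlers) {
                h.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler h : handlers) {
                h.characters(ch, start, length);
            }
        }
    }

    /** Hypothetical link handler: collects the href of every <a> element. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default JDK parser is not namespace-aware, so the tag name is in qName.
            if ("a".equals(qName) && atts.getValue("href") != null) {
                links.add(atts.getValue("href"));
            }
        }
    }

    /** Hypothetical text handler: accumulates all character data. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello <a href=\"http://example.com/\">Tika</a></p></body></html>";
        LinkCollector links = new LinkCollector();
        TextCollector text = new TextCollector();

        // One parse pass feeds both collectors through the tee.
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                new SimpleTee(links, text));

        System.out.println("links: " + links.links);
        System.out.println("text: " + text.text);
    }
}
```

The point of the pattern is that the document is read once while each handler keeps only what it cares about, which is why a single parser.parse call in the Tika program can fill all three handlers.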
If you are using Maven, you need to add the following dependencies:
<!-- Apache Tika -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.0</version>
</dependency>