Configuring Apache Tika’s HtmlParser
So in my previous post about Apache Tika, I showed off a small Hello World program that demonstrated how you can quickly use it to parse HTML files. One of the first issues you will probably encounter using Tika, though, is that its HtmlParser does not handle all tags out of the box. For example, the code tag is not recognized. To deal with that, you need to create a custom HtmlMapper. In the code example below, I created an HtmlMapper that accepts all tags. In addition to expanding the number of tags that the HtmlParser can handle, custom HtmlMappers are great for isolating the specific blocks that you are interested in by discarding the ones that you do not care about.
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.ToHTMLContentHandler;
public class HelloApacheTika2 {
public static void main (String args[]) throws Exception {
URL url = new URL("http://chrisjordan.ca/post/15345467825/configuring-apache-tikas-htmlparser");
InputStream input = url.openStream();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
// Register the custom mapper; a plain constructor call avoids the reflective Class.newInstance()
parseContext.set(HtmlMapper.class, new AllTagMapper());
HtmlParser parser = new HtmlParser();
parser.parse(input, toHTMLHandler, metadata, parseContext);
System.out.println(toHTMLHandler.toString());
}
}
/**
* An HtmlMapper that accepts all tags and attributes.
*
*/
class AllTagMapper implements HtmlMapper {
@Override
public String mapSafeElement(String name) {
return name.toLowerCase();
}
@Override
public boolean isDiscardElement(String name) {
return false;
}
@Override
public String mapSafeAttribute(String elementName, String attributeName) {
return attributeName.toLowerCase();
}
}
Parsing HTML with Apache Tika
Every now and then, I have to parse some HTML files. There are a lot of ways you can go about doing that. Recently, I have started using Apache Tika and it does a pretty reasonable job (i.e. better than what I have done before). There is not a lot of documentation on Tika so I had to do a bit of hacking to get my head around it.
A good start is this quick Hello World Tika program I put together. It parses this article. The TeeContentHandler passes data from the HtmlParser to each of the ContentHandlers that it has been initialized with. For the purposes of this example, I am showing off three different handlers. The LinkContentHandler is great for extracting links; useful for crawlers. The BodyContentHandler pulls out all the text on a page; useful for indexers. The ToHTMLContentHandler produces XHTML; useful for extracting specific blocks of text, which is also good for indexers. One thing to be aware of when using the HtmlParser is that, out of the box, it does not support all tags. For example, it currently skips over the code tag. My next post will explain how to configure the HtmlParser to not do that :-)
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.ContentHandler;
public class HelloApacheTika {
public static void main (String args[]) throws Exception {
URL url = new URL("http://chrisjordan.ca/post/15219674437/parsing-html-with-apache-tika");
InputStream input = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
ContentHandler textHandler = new BodyContentHandler();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
TeeContentHandler teeHandler = new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
HtmlParser parser = new HtmlParser();
parser.parse(input, teeHandler, metadata, parseContext);
System.out.println("title:\n" + metadata.get("title"));
System.out.println("links:\n" + linkHandler.getLinks());
System.out.println("text:\n" + textHandler.toString());
System.out.println("html:\n" + toHTMLHandler.toString());
}
}
If you are using Maven, you need to add the following dependencies:
<!-- Apache Tika -->
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.0</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.0</version>
</dependency>
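If the fan-out that TeeContentHandler does is hard to picture, here is a minimal sketch of the same idea using only the SAX classes that ship with the JDK. MiniTee, LinkCollector and TextCollector are hypothetical names made up for this illustration; they are not Tika classes, and the JDK parser used here needs well-formed XHTML, whereas Tika's HtmlParser copes with messy HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TeeSketch {

    /** Hypothetical stand-in for TeeContentHandler: forwards each SAX event to every handler. */
    static class MiniTee extends DefaultHandler {
        private final DefaultHandler[] handlers;

        MiniTee(DefaultHandler... handlers) {
            this.handlers = handlers;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler h : handlers) {
                h.startElement(uri, local, qName, atts);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler h : handlers) {
                h.characters(ch, start, length);
            }
        }
    }

    /** Collects the href of every anchor element, roughly what a link handler is for. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("a".equals(qName) && atts.getValue("href") != null) {
                links.add(atts.getValue("href"));
            }
        }
    }

    /** Collects all character data, roughly what a text handler is for. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses a well-formed XHTML string once and feeds the events to the given handler. */
    static void parse(String xhtml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        LinkCollector links = new LinkCollector();
        TextCollector text = new TextCollector();
        // One pass over the document, two handlers fed from it
        parse("<p>See <a href=\"http://tika.apache.org/\">Tika</a> docs</p>",
                new MiniTee(links, text));
        System.out.println("links: " + links.links);
        System.out.println("text: " + text.text);
    }
}
```

The point of the design is that the document is only parsed once, no matter how many handlers you hang off the tee.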
So… new city, new website… again!
Hey all,
I have changed cities and websites again! A bit of a coincidence that I have done both at the same time again. I have migrated off of Drupal and onto Tumblr. Drupal has been good to me but I just do not have time to maintain it on my server anymore especially since I only had a single Website running on it. I really like Tumblr as a platform and it is super convenient so that is why I am using it for now.
You will notice that I was able to import all my old blog posts. Tumblr does not import blog posts natively so I wrote a Web app to do it for me. I posted it to my GitHub, Blog Teleporter. It is a pretty simple app that crawls Drupal blogs (for now) and posts them to Tumblr. There are a few things I need to tighten up on it but it runs end to end and can be easily extended to crawl other blog platforms. I think I am going to add in one more platform before I declare it a version 1.0. You might notice that all of my old posts have a new tag, teleporter. My app added it so you can easily see which posts you have imported. That allows for easy rollbacks on Tumblr in case there is a problem importing.
That is basically why I have not made any posts in the past few months. The combination of me getting a new job, handing off the old one, and promising myself to not post until I migrated off of Drupal did it. There are a lot of goodies in my Blog Teleporter though and I will be posting about them shortly.
If you want to migrate your blog onto Tumblr and my Blog Teleporter does not support your platform, just let me know. I am happy to add it as another feature.
Creating an Apache CXF Web Service Client
So in earlier blog posts, I discussed how to make an Axis 1.4 client and secure it. Axis 1.4, however, has reached end of life; no more work is being done on it. Now, if you are thinking about using Axis 2 to make a client in order to stick with a package that is actively being maintained, I would strongly advise against it. Axis 2 is really designed to be used on a Java application server and it is not friendly at all for standalone clients. There are literally two dozen jars plus modules that you need to add to your classpath if you want your client to work.
Fortunately, there is another Apache Web Service project, CXF. From what I can see, CXF is really a continuation of where Axis 1.4 left off. Even for Java Web apps, CXF is much friendlier to use than Axis 2.
Below is an example of how to set up a CXF client.
Step 1. Download CXF
Step 2. Download the wsdl that you want to use. For this post, similar to the previous ones, I am going to use the global weather wsdl
Step 3. Use the wsdl2java CXF utility to generate Java code representing the global weather Web Service. The command should look like the following:
wsdl2java globalweather.wsdl
Step 4. Copy the code from step 3 into your source code directory. If you are working in Eclipse, you can drag and drop the code into your src directory
Step 5. Add the following files from the CXF lib directory to your classpath. If you are working in Eclipse, you can drag and drop these files into your own lib directory
- commons-collections-3.2.1.jar
- commons-lang-2.4.jar
- commons-logging-1.1.1.jar
- cxf-2.2.12.jar
- geronimo-activation_1.1_spec-1.0.2.jar
- geronimo-annotation_1.0_spec-1.1.1.jar
- geronimo-javamail_1.4_spec-1.6.jar
- geronimo-jaxws_2.1_spec-1.0.jar
- geronimo-servlet_2.5_spec-1.2.jar
- geronimo-stax-api_1.0_spec-1.0.1.jar
- geronimo-ws-metadata_2.0_spec-1.1.2.jar
- jaxb-api-2.1.jar
- jaxb-impl-2.1.13.jar
- neethi-2.0.4.jar
- saaj-api-1.3.jar
- saaj-impl-1.3.2.jar
- wsdl4j-1.6.2.jar
- wss4j-1.5.10.jar
- wstx-asl-3.2.9.jar
- xml-resolver-1.2.jar
- XmlSchema-1.4.7.jar
- xmlsec-1.4.3.jar
Step 6. Create a Web Service client. Below is an example of a client that works with the global weather Web Service:
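import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import net.webservicex.*;
public class HelloCXF {
public static void main (String args[]) {
try {
// A bare file name has no protocol, so new URL("globalweather.wsdl") would throw
// MalformedURLException; convert the local path to a file: URL instead.
URL wsdl = new File("globalweather.wsdl").toURI().toURL();
GlobalWeather service = new GlobalWeather(wsdl);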
GlobalWeatherSoap weatherSoap = service.getGlobalWeatherSoap();
System.out.println("Weather for New York");
System.out.println(weatherSoap.getWeather("New York", "United States"));
}
catch (MalformedURLException mue) {
System.err.println("problem with the wsdl url");
System.exit(1);
}
}
}
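One detail worth checking in a client like this is the WSDL location, because the generated service class takes a java.net.URL and a URL has to carry a protocol. The sketch below uses only the JDK to show the difference between a bare file name and a proper file: URL. WsdlLocation, toUrl and isValidUrl are names I made up for the illustration.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

class WsdlLocation {

    /** Turns a local file path into an absolute file: URL. */
    static URL toUrl(String path) {
        try {
            return new File(path).toURI().toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot convert to URL: " + path, e);
        }
    }

    /** Reports whether java.net.URL accepts the string as it stands. */
    static boolean isValidUrl(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A bare file name is rejected because it has no protocol
        System.out.println(isValidUrl("globalweather.wsdl"));
        // The converted form starts with file: and points at the working directory
        System.out.println(toUrl("globalweather.wsdl"));
    }
}
```

The same conversion works for a WSDL that lives anywhere on disk; for one that is served over HTTP you can pass the http:// address to the URL constructor directly.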
Axis 1.4 Web Service Client Side authentication
So in a previous post, I gave an example of how to set up an Axis 1.4 Web Service client. That is fine if you do not have to log in to use the service. Often, however, you do have to authenticate your client, especially if your work has paid for the third-party Web Service that you want to use. Here you want to use WSS4J. Below are the steps to get yourself set up:
Step 1. Download WSS4J
Step 2. Add the contents of the WSS4J lib directory to your classpath. If you are working with a project in Eclipse, you probably want to copy the contents of WSS4J lib to your own lib directory
Step 3. Create a deployment descriptor file. Below is an example of one that we will refer to as sample.wsdd:
<deployment xmlns="http://xml.apache.org/axis/wsdd/" xmlns:java="http://xml.apache.org/axis/wsdd/providers/java">
<transport name="http" pivot="java:org.apache.axis.transport.http.HTTPSender"/>
<globalConfiguration >
<requestFlow >
<handler type="java:org.apache.ws.axis.security.WSDoAllSender" >
<parameter name="action" value="UsernameToken"/>
<parameter name="user" value="sampleUsername"/>
<parameter name="passwordCallbackClass" value="SamplePWCallback"/>
<parameter name="passwordType" value="PasswordDigest"/>
</handler>
</requestFlow >
</globalConfiguration >
</deployment>
Step 4. Create a CallbackHandler. In my example below, I use the Apache Commons Configuration package to get the username and password from a properties file in order to avoid hardcoding them.
import java.io.IOException;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.UnsupportedCallbackException;
import org.apache.commons.configuration.Configuration;
import org.apache.commons.configuration.ConfigurationException;
import org.apache.commons.configuration.PropertiesConfiguration;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.ws.security.WSPasswordCallback;
public class SamplePWCallback implements CallbackHandler {
private Configuration _config;
private static Log _logger = LogFactory.getLog(SamplePWCallback.class);
public SamplePWCallback () {
super();
_logger.info("Initializing Sample PW Callback");
_config = null;
try {
_config = new PropertiesConfiguration("sample.properties");
}
catch (ConfigurationException ce) {
_logger.error("Error loading configuration file", ce);
}
}
public void handle(Callback[] callbacks) throws IOException,
UnsupportedCallbackException {
for (int i = 0; i < callbacks.length; i++) {
if (callbacks[i] instanceof WSPasswordCallback) {
WSPasswordCallback pc = (WSPasswordCallback)callbacks[i];
// set the password given a username
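// _config stays null if sample.properties could not be loaded, so guard against that
if (_config != null && pc.getIdentifier() != null && pc.getIdentifier().equals(_config.getString("username"))) {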
pc.setPassword(_config.getString("password"));
}
}
else {
throw new UnsupportedCallbackException(callbacks[i], "Unrecognized Callback");
}
}
}
}
Step 5. Access your wsdd in your Web Service client. Let’s assume the global weather Web Service that I used in the previous Axis 1.4 blog post required authentication. Here is what your client would look like.
import NET.webserviceX.www.*;
import org.apache.axis.configuration.FileProvider;
public class HelloWeather {
public static void main(String[] args) throws Exception {
FileProvider wsdd = new FileProvider("sample.wsdd");
wsdd.setSearchClasspath(true);
GlobalWeatherLocator locator = new GlobalWeatherLocator(wsdd);
GlobalWeatherSoap globalWeatherSoap = locator.getGlobalWeatherSoap();
System.out.println("Weather for New York");
System.out.println(globalWeatherSoap.getWeather("New York", "United States"));
}
}
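If you want to try the callback flow from Step 4 without Axis or WSS4J on the classpath, here is a minimal sketch that uses only the JDK. PasswordRequest is a hypothetical stand-in for WSPasswordCallback, java.util.Properties stands in for Commons Configuration, and the property keys username and password are the same ones the handler above reads. The sample values are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.UnsupportedCallbackException;

class CallbackSketch {

    /** Hypothetical stand-in for WSPasswordCallback: carries a user name, receives a password. */
    static class PasswordRequest implements Callback {
        private final String identifier;
        private String password;

        PasswordRequest(String identifier) {
            this.identifier = identifier;
        }

        String getIdentifier() {
            return identifier;
        }

        String getPassword() {
            return password;
        }

        void setPassword(String password) {
            this.password = password;
        }
    }

    /** Supplies the password for the single user named in the properties. */
    static class PropertiesPasswordHandler implements CallbackHandler {
        private final Properties config;

        PropertiesPasswordHandler(Properties config) {
            this.config = config;
        }

        @Override
        public void handle(Callback[] callbacks) throws IOException, UnsupportedCallbackException {
            for (Callback callback : callbacks) {
                if (!(callback instanceof PasswordRequest)) {
                    throw new UnsupportedCallbackException(callback, "Unrecognized Callback");
                }
                PasswordRequest request = (PasswordRequest) callback;
                String user = config.getProperty("username");
                // Only hand out the password when the requested user matches
                if (user != null && user.equals(request.getIdentifier())) {
                    request.setPassword(config.getProperty("password"));
                }
            }
        }
    }

    /** Runs the handler for one user and returns the password, or null for an unknown user. */
    static String lookup(String propertiesText, String user) {
        try {
            Properties config = new Properties();
            config.load(new StringReader(propertiesText));
            PasswordRequest request = new PasswordRequest(user);
            new PropertiesPasswordHandler(config).handle(new Callback[] { request });
            return request.getPassword();
        } catch (IOException | UnsupportedCallbackException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up contents for a sample.properties file
        String props = "username=sampleUsername\npassword=samplePassword\n";
        System.out.println(lookup(props, "sampleUsername"));
        System.out.println(lookup(props, "someoneElse"));
    }
}
```

The handler never volunteers a password; it only fills one in when the framework asks about a user it recognizes, which is the same contract the WSS4J handler relies on.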
Logging uncaught exceptions in a Java application
Logging exceptions is pretty important as it allows you to troubleshoot problems that come up with your production applications. If you have a regular Java application, there is a trick to logging uncaught exceptions. Those are the exceptions that propagate out of a thread without anything catching them, typically unchecked exceptions that the compiler does not force you to put in a try/catch block. To log these exceptions, you need to set a default uncaught exception handler. The example below defines one that writes an ERROR log entry through Commons Logging, which hands the entry to log4j when log4j is on the classpath.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
public class HelloWorld {
private static Log _logger = LogFactory.getLog(HelloWorld.class);
public static void main (String args[]) {
//setting uncaught exception logging
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
public void uncaughtException(Thread t, Throwable e) {
_logger.error("An uncaught exception has been thrown", e);
}
});
}
}
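To watch a handler actually fire, here is a minimal sketch that uses only the JDK. It is an illustration rather than the setup above: it installs the handler on a single worker thread with setUncaughtExceptionHandler, which takes the same Thread.UncaughtExceptionHandler interface, so that the demo does not change global state, and it captures the exception instead of logging it. UncaughtDemo and runAndCapture are names I made up.

```java
import java.util.concurrent.atomic.AtomicReference;

class UncaughtDemo {

    /** Runs a task on a new thread and returns whatever escaped it uncaught, or null. */
    static Throwable runAndCapture(Runnable task) {
        final AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread worker = new Thread(task, "worker");
        // Per-thread handler; Thread.setDefaultUncaughtExceptionHandler covers every thread
        worker.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            @Override
            public void uncaughtException(Thread t, Throwable e) {
                seen.set(e);
            }
        });
        worker.start();
        try {
            // The handler runs on the dying thread, so it has finished once join returns
            worker.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        Throwable escaped = runAndCapture(new Runnable() {
            @Override
            public void run() {
                throw new IllegalStateException("boom");
            }
        });
        System.out.println("handler saw: " + escaped);
    }
}
```

In a real application you would swap the capture for a logger call and register the handler once, early in main, before any worker threads start.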
Java case insensitive String replaceAll
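Here’s a quick handy tip for replacing all occurrences of a substring regardless of case in Java. Put “(?i)” at the beginning of the pattern you pass to replaceAll. The method treats its first argument as a regular expression, and “(?i)” is the embedded flag that switches on case-insensitive matching. For example: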
String a = "hello HELLO hEllO";
String b = a.replaceAll("(?i)hello", "bye");
System.out.println(a);
System.out.println(b);
The above code will output:
hello HELLO hEllO
bye bye bye
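Because replaceAll works with regular expressions, a search string that contains characters such as . or $ will not be matched literally. A minimal sketch of a helper that handles that case, using Pattern.quote and Matcher.quoteReplacement from the JDK, could look like the following. The class and method names are my own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseInsensitiveReplace {

    /** Replaces every case-insensitive occurrence of a literal substring. */
    static String replaceIgnoreCase(String text, String literal, String replacement) {
        // Pattern.quote makes regex metacharacters in the search string match literally;
        // Matcher.quoteReplacement does the same for $ and \ in the replacement.
        return Pattern.compile(Pattern.quote(literal), Pattern.CASE_INSENSITIVE)
                .matcher(text)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceIgnoreCase("hello HELLO hEllO", "hello", "bye"));
        // "1.5" is matched literally, so "1x5" is left alone
        System.out.println(replaceIgnoreCase("Cost: 1.5 or 1x5", "1.5", "$2"));
    }
}
```

Note that both “(?i)” and Pattern.CASE_INSENSITIVE only fold ASCII letters by default; add Pattern.UNICODE_CASE if you need other alphabets handled too.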
How to install a jar locally in Maven
This topic is probably trivial to Maven veterans; however, to people getting started, it is pretty important. As you probably already know if you got here from Googling, Maven is a neat build tool for Java that goes out and downloads all your jar dependencies for you. For example, if you need to use the Apache Commons Lang package, you specify the dependency in your pom.xml and Maven will download it for you. Under the hood, what is happening is that Maven will first check your local package repository, located in a directory called .m2 in your home directory, and if it cannot find the package there, it will check remote repositories.
This approach is great for Maven distributed packages and the most commonly used ones are available. If you happen to need to use a package that is not available though, then you will have to install it locally in order for Maven to build your applications with it. This situation happened to me when my work got a license for a new DBMS that had a closed source driver.
Installing a jar is relatively painless. Simply use the install:install-file goal. Your command should look similar to the one below.
mvn install:install-file -Dfile=closed-source-driver.jar -DgroupId=com.closed-source -DartifactId=com.closed-source.driver -Dversion=1.0 -Dpackaging=jar
Once you have installed the jar, you can now include it as a dependency in any of your projects. Based on the example above, the dependency tag in your pom.xml should look like the following.
<dependency>
<groupId>com.closed-source</groupId>
<artifactId>com.closed-source.driver</artifactId>
<version>1.0</version>
<scope>compile</scope>
</dependency>
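If you want to check that the install worked, it helps to know where the jar ends up. Maven's default local repository lives in .m2/repository under your home directory, with the dots in the groupId turned into directories, followed by the artifactId, the version, and a file named artifactId-version.jar. Here is a minimal sketch that builds that path for the example above. LocalRepoPath and jarPath are names I made up, and the sketch assumes you have not moved the local repository in your settings.xml.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class LocalRepoPath {

    /** Builds the path the default local repository layout uses for an artifact's jar. */
    static Path jarPath(Path repoRoot, String groupId, String artifactId, String version) {
        Path dir = repoRoot;
        // Each dot-separated part of the groupId becomes one directory level
        for (String part : groupId.split("\\.")) {
            dir = dir.resolve(part);
        }
        return dir.resolve(artifactId)
                .resolve(version)
                .resolve(artifactId + "-" + version + ".jar");
    }

    public static void main(String[] args) {
        // Assumes the default location; settings.xml can point the repository elsewhere
        Path repo = Paths.get(System.getProperty("user.home"), ".m2", "repository");
        System.out.println(jarPath(repo, "com.closed-source", "com.closed-source.driver", "1.0"));
    }
}
```

If the file printed by this sketch exists after you run the install command, Maven will be able to resolve the dependency.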
How to build an Axis 1.4 XML-RPC based Web Service client
So Web Services have been around for a while now and one would think that the popular packages out there would support almost any service. Unfortunately, that is a bad assumption to make, precisely because of how long Web Services have existed. Some earlier versions employed a variation of the SOAP message that we use today called XML-RPC. Very few contemporary Web Service packages support XML-RPC, yet even now you will run into someone using it, like I did at work.
I found that Axis 1.4 works pretty much out of the box for these older style Web Services. Axis 2, like the rest, does not support XML-RPC.
To create a client, you will be best served by using the WSDL2Java utility. To use it, you will need to make sure the following jars are in your classpath; they come packaged in the Axis 1.4 download so you don’t have to google for them.
- axis.jar
- jaxrpc.jar
- saaj.jar
- commons-logging-1.0.4.jar
- commons-discovery-0.2.jar
- wsdl4j-1.5.1.jar
Once you have added those jars to your classpath, you can let Axis do most of the work for you in creating a client. For this example, we are going to build a client that gets the weather from the global weather Web Service. This Web Service is pretty basic, so it is good for learning how to make a client; however, its uptime can be spotty.
Here are the steps for creating a client.
Step 1: Download the globalweather.wsdl. This file contains information describing the Web Service.
Step 2: Run the following command:
java org.apache.axis.wsdl.WSDL2Java globalweather.wsdl
This Axis utility will generate, from the wsdl, the Java package needed for interacting with the Web Service. The code will be laid out in the directory where you ran the utility and, at the time of writing this article, should have the following structure:
NET/webserviceX/www/
Step 3: Copy the code you generated in Step 2 to your client’s working directory. You might not need to do this step if you are going to write your client in the same directory in which you generated the package. If you are using Eclipse or any IDE though, you will want to copy the NET directory into one of your source code folders. I use Eclipse so I literally dragged and dropped it.
Step 4: Create a client class. The code below is a simple client that gets the weather for New York.
import NET.webserviceX.www.*;
public class HelloWeather {
public static void main(String[] args) throws Exception {
GlobalWeatherLocator locator = new GlobalWeatherLocator();
GlobalWeatherSoap globalWeatherSoap = locator.getGlobalWeatherSoap();
System.out.println("Weather for New York");
System.out.println(globalWeatherSoap.getWeather("New York", "United States"));
}
}
Axis hides a lot of the boilerplate work involved in a Web Service and it is pretty simple to use. The only complexity that you will encounter is when the Web Services you use get complex, and generally those ones have some support documentation that explains what all the end points are and how you should use them.
Building Self Contained Executable Jars - 2 ways
Building an executable jar file is generally pretty simple if you do not want to package any library jar files within it. You simply have to insert a Main-Class entry into the manifest file that specifies which class has the main method that you want to execute. That should be sufficient for small Java applications where you do not need to use a jarred library; however, once you need to use one, problems arise.
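For the simple case just described, the Main-Class entry is all there is to it. Here is a minimal sketch, using only the java.util.jar classes from the JDK, that writes a manifest with a Main-Class entry into a jar and reads it back. It builds an otherwise empty jar in memory purely for illustration; ManifestSketch and its method names are my own, and a real build would normally let the jar tool, Maven, or Ant do this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

class ManifestSketch {

    /** Builds an otherwise empty jar in memory whose manifest names the main class. */
    static byte[] jarWithMainClass(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Set Manifest-Version first; it is the standard leading attribute of a manifest
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(Attributes.Name.MAIN_CLASS, mainClass);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            // The constructor writes META-INF/MANIFEST.MF as the first entry
            JarOutputStream jar = new JarOutputStream(bytes, manifest);
            // A real build would add the compiled .class files here
            jar.close();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the Main-Class entry back out of a jar's manifest. */
    static String readMainClass(byte[] jarBytes) {
        try {
            JarInputStream jar = new JarInputStream(new ByteArrayInputStream(jarBytes));
            try {
                return jar.getManifest().getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
            } finally {
                jar.close();
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] jar = jarWithMainClass("com.chrisjordan.examples.HelloWorld");
        System.out.println("Main-Class: " + readMainClass(jar));
    }
}
```

The java -jar launcher reads this same entry to decide which class to start.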
One approach is to simply put your external jars in your classpath; however, that is only really acceptable if you are not deploying your application anywhere. Plus, having to push support jar files and update classpaths is a pain. A much better approach is to wrap everything up into a single jar file.
Now the Java class loader does not know how to load classes that are in a jar inside another jar which means just manipulating the manifest file will not be enough. Fortunately, there is a nifty utility called one-jar that helps us build these self contained executable jar files. Here I am going to show you 2 ways to use it, one using Maven and one using Ant.
Way 1: Maven
Using Maven is by far the easiest approach. You need to add the following to your pom.xml
<plugins>
<plugin>
<groupId>org.dstovall</groupId>
<artifactId>onejar-maven-plugin</artifactId>
<version>1.3.0</version>
<executions>
<execution>
<configuration>
<mainClass>com.chrisjordan.examples.HelloWorld</mainClass>
</configuration>
<goals>
<goal>one-jar</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
<pluginRepository>
<id>onejar-maven-plugin.googlecode.com</id>
<url>http://onejar-maven-plugin.googlecode.com/svn/mavenrepo</url>
</pluginRepository>
This Maven extension will literally package all the jar dependencies along with your code into a single executable jar file. Simply use the Maven goal install or package to generate it:
mvn install
mvn package
Way 2: Ant
Ant is still a really popular build tool but in order to use one-jar, it requires a little more work.
Step 1: Download one-jar
Step 2: Expand it somewhere that you can reference easily. It is probably best to expand it in your project directory: if you are working in a team and are committing your code to a repository, it will need to be there so others can use it without having to think about it.
Step 3: Create a task in your build.xml file similar to the one below
<property name="one-jar.dist.dir" value="one-jar-ant"/>
<import file="${one-jar.dist.dir}/one-jar-ant-task.xml" optional="true" />
<target name="jar" depends="clean, build, tests">
<!-- Construct the One-JAR file -->
<one-jar destfile="${target.dir}/hello-world.jar">
<manifest>
<attribute name="One-Jar-Main-Class" value="com.chrisjordan.examples.HelloWorld" />
</manifest>
<main>
<fileset dir="${build.main.dir}"/>
</main>
<lib>
<fileset dir="${target.dir}/lib" includes="*.jar"/>
</lib>
<fileset dir="${build.main.res}" includes="**"/>
</one-jar>
</target>
So as you can see from the above task, there are some gotchas with using one-jar in Ant. You are responsible for downloading the one-jar package and expanding it in a directory that you reference from your build.xml. As well, you have to specify the jars and property files that you want packaged up.
Now to build your executable jar, you simply have to call ant with the one-jar target you created, which is named jar in the build.xml above.