After having worked at Gilt for a couple years, I think I can safely say that Seattle has a different sense of fashion.
So I have been in Seattle for about 24 hours now and I am starting to figure out the quirks of where I am living. There are drastically fewer people here than in New York, which is great during the day but makes me worried about zombies at night (thank you AMC Walking Dead). I discovered this morning while shopping for home goods that there is a monorail that goes between the downtown core and my place. Magnets now hurtle me between where I sleep and where I shop.
From time to time, you need to figure out how much data is in a table. This bit of SQL, specific to Postgres, gets you the size of the one that you specify.
SELECT pg_size_pretty(pg_relation_size('your_table'));
Configuring Apache Tika’s HtmlParser
So in my previous post about Apache Tika, I showed off a small Hello World program that demonstrated how you can quickly use it to parse HTML files. One of the first issues you will probably encounter using Tika though is that its HtmlParser does not handle all tags out of the box. For example, the code tag is not recognized. To deal with that, you need to create a custom HtmlMapper. In the code example below, I created an HtmlMapper that accepts all tags. In addition to expanding the number of tags that the HtmlParser can handle, custom HtmlMappers are great for isolating specific blocks that you are interested in by discarding the ones that you do not care about.
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.ToHTMLContentHandler;
public class HelloApacheTika2 {
public static void main (String args[]) throws Exception {
URL url = new URL("http://chrisjordan.ca/post/15345467825/configuring-apache-tikas-htmlparser");
InputStream input = url.openStream();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
// Register the custom mapper so the HtmlParser uses it instead of its default one.
parseContext.set(HtmlMapper.class, new AllTagMapper());
HtmlParser parser = new HtmlParser();
parser.parse(input, toHTMLHandler, metadata, parseContext);
System.out.println(toHTMLHandler.toString());
}
}
/**
* An HtmlMapper that accepts all tags and attributes.
*
*/
class AllTagMapper implements HtmlMapper {
@Override
public String mapSafeElement(String name) {
return name.toLowerCase();
}
@Override
public boolean isDiscardElement(String name) {
return false;
}
@Override
public String mapSafeAttribute(String elementName, String attributeName) {
return attributeName.toLowerCase();
}
}
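The other use I mentioned, isolating the blocks you care about, might look roughly like the sketch below. The class name `SelectiveTagMapper` and both tag lists are made up for illustration. To keep it compilable without Tika on the classpath, the class does not declare `implements HtmlMapper`; it only mirrors the three methods from AllTagMapper above, so adding that clause and the import should be what it takes to plug it into a ParseContext. As I understand the HtmlMapper contract, returning null from mapSafeElement skips the tag but still processes the text inside it, while isDiscardElement drops the tag and everything in it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical mapper with the same three methods as AllTagMapper.
// Assumption: add "implements HtmlMapper" and the Tika import to use it for real.
class SelectiveTagMapper {
    // Tags to keep in the output (illustrative choice, not a Tika default).
    private static final Set<String> KEEP =
            new HashSet<>(Arrays.asList("p", "pre", "code", "a"));
    // Tags whose whole content should be thrown away (also illustrative).
    private static final Set<String> DISCARD =
            new HashSet<>(Arrays.asList("script", "style", "nav"));

    public String mapSafeElement(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        // null means: do not emit this tag.
        return KEEP.contains(lower) ? lower : null;
    }

    public boolean isDiscardElement(String name) {
        return DISCARD.contains(name.toLowerCase(Locale.ROOT));
    }

    public String mapSafeAttribute(String elementName, String attributeName) {
        String element = elementName.toLowerCase(Locale.ROOT);
        String attribute = attributeName.toLowerCase(Locale.ROOT);
        // Keep only href on links; drop every other attribute.
        return "a".equals(element) && "href".equals(attribute) ? attribute : null;
    }

    public static void main(String[] args) {
        SelectiveTagMapper mapper = new SelectiveTagMapper();
        System.out.println("CODE   -> " + mapper.mapSafeElement("CODE"));
        System.out.println("div    -> " + mapper.mapSafeElement("div"));
        System.out.println("script discarded? " + mapper.isDiscardElement("script"));
    }
}
```

A whitelist like this is the conservative option: anything you did not explicitly list never reaches your ContentHandler.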
Parsing HTML with Apache Tika
Every now and then, I have to parse some HTML files. There are a lot of ways you can go about doing that. Recently, I have started using Apache Tika and it does a pretty reasonable job (i.e. better than what I have done before). There is not a lot of documentation on Tika so I had to do a bit of hacking to get my head around it.
A good start is this quick Hello World Tika program I put together. It parses this article. The TeeContentHandler passes data from the HtmlParser to the ContentHandlers that it has been initialized with. For the purposes of this example, I am showing off three different handlers. The LinkContentHandler is great for extracting links; useful for crawlers. The BodyContentHandler extracts all the text on a page; useful for indexers. The ToHTMLContentHandler produces XHTML; useful for extracting specific blocks of text, which is also good for indexers. One thing to be aware of when using the HtmlParser is that natively, it does not support all tags. For example, it currently skips over the code tag. My next post will explain how to configure the HtmlParser to not do that :-)
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.ContentHandler;
public class HelloApacheTika {
public static void main (String args[]) throws Exception {
URL url = new URL("http://chrisjordan.ca/post/15219674437/parsing-html-with-apache-tika");
InputStream input = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
ContentHandler textHandler = new BodyContentHandler();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
TeeContentHandler teeHandler = new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
HtmlParser parser = new HtmlParser();
parser.parse(input, teeHandler, metadata, parseContext);
System.out.println("title:\n" + metadata.get("title"));
System.out.println("links:\n" + linkHandler.getLinks());
System.out.println("text:\n" + textHandler.toString());
System.out.println("html:\n" + toHTMLHandler.toString());
}
}
If you are using Maven, you need to add the following dependencies:
<!-- Apache Tika -->
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.0</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.0</version>
</dependency>
I actually forgot I posted this answer on Stack Overflow until today when my cousin complimented me on it :-)
So… new city, new website… again!
Hey all,
I have changed cities and websites again! A bit of a coincidence that I have done both at the same time again. I have migrated off of Drupal and onto Tumblr. Drupal has been good to me but I just do not have time to maintain it on my server anymore especially since I only had a single Website running on it. I really like Tumblr as a platform and it is super convenient so that is why I am using it for now.
You will notice that I was able to import all my old blog posts. Tumblr does not import blog posts natively so I wrote a Web app to do it for me. I posted it to my GitHub, Blog Teleporter. It is a pretty simple app that crawls Drupal blogs (for now) and posts them to Tumblr. There are a few things I need to tighten up on it but it runs end to end and can be easily extended to crawl other blog platforms. I think I am going to add in one more platform before I declare it a version 1.0. You might notice that all of my old posts have a new tag, teleporter. My app added it so you can easily see which posts you have imported. That allows for easy rollbacks on Tumblr in case there is a problem importing.
That is basically why I have not made any posts in the past few months. The combination of me getting a new job, handing off the old one, and promising myself to not post until I migrated off of Drupal did it. There are a lot of goodies in my Blog Teleporter though and I will be posting about them shortly.
If you want to migrate your blog onto Tumblr and my Blog Teleporter does not support your platform, just let me know. I am happy to add it as another feature.
Creating and Managing Threads in Java
Threading in the early days of Java used to be a pain. Since Java 1.5 and the introduction of the ExecutorService, it is much easier to start up and manage threads. Code that you want executed in a separate thread must be encapsulated in a class that implements Runnable. An instance of that class can then be passed to an ExecutorService that will handle its execution. Below is a trivial example but it demonstrates how to use the ExecutorService and thread pools.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.log4j.Logger;
public class ExecutorServiceExample {
public static void main (String args[]) {
final Logger logger = Logger.getLogger(ExecutorServiceExample.class);
ExecutorService threadExecutor = Executors.newFixedThreadPool( 10 );
for (int i = 0; i < 20; i++) {
threadExecutor.execute(new Runnable() {
@Override
public void run() {
try {
logger.info("This thread is going to sleep");
Thread.sleep((long)(Math.random() * 10000));
} catch (InterruptedException e) {
logger.error("Oh noes... interrupted thread?!",e);
}
}
});
}
threadExecutor.shutdown();
try {
threadExecutor.awaitTermination(5, TimeUnit.MINUTES);
logger.info("All threads done");
} catch (InterruptedException e) {
logger.error("Oh noes... interrupted thread?!",e);
}
}
}
In the example above, I created an ExecutorService that manages a pool of 10 threads. It then executes 20 anonymous classes which log an info message and sleep between 0 and 10 seconds. After that, it waits for the threads to finish executing and then logs another info message.
As you can see from the output below, the first 10 tasks start almost simultaneously, one on each thread in the pool. After that, the remaining 10 tasks start in whatever order threads become free, so the thread names show up out of order.
Output:
2011-09-13 22:07:19,511 [pool-1-thread-1] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:19,512 [pool-1-thread-3] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:19,511 [pool-1-thread-2] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:19,512 [pool-1-thread-4] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:19,512 [pool-1-thread-5] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:19,512 [pool-1-thread-6] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:19,512 [pool-1-thread-7] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:19,513 [pool-1-thread-8] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:19,513 [pool-1-thread-9] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:19,513 [pool-1-thread-10] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:19,619 [pool-1-thread-1] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:21,424 [pool-1-thread-6] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:21,815 [pool-1-thread-2] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:22,567 [pool-1-thread-10] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:23,474 [pool-1-thread-10] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:23,863 [pool-1-thread-9] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:24,473 [pool-1-thread-8] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:24,569 [pool-1-thread-7] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:25,506 [pool-1-thread-5] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:27,492 [pool-1-thread-4] INFO ExecutorServiceExample - This thread is going to sleep
2011-09-13 22:07:34,308 [main] INFO ExecutorServiceExample - All threads done
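If you want to poke at the thread reuse without setting up log4j, here is a rough variant that only uses the JDK. It is a sketch, not the code that produced the output above: the class name `PoolReuseSketch`, the `runTasks` helper and the much shorter sleep are my own choices for illustration. It runs the tasks on a fixed pool and reports how many finished and how many distinct worker threads were involved.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper class; names are for illustration only.
class PoolReuseSketch {
    // Runs taskCount short tasks on a fixed pool of poolSize threads.
    // Returns { tasks completed, distinct worker threads used }.
    static int[] runTasks(int poolSize, int taskCount) throws InterruptedException {
        final Set<String> threadNames = ConcurrentHashMap.newKeySet();
        final AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < taskCount; i++) {
            pool.execute(new Runnable() {
                @Override
                public void run() {
                    // Record which pool thread picked up this task.
                    threadNames.add(Thread.currentThread().getName());
                    try {
                        // Sleep up to 50 ms (much shorter than the 10 s above).
                        Thread.sleep((long) (Math.random() * 50));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();                              // no new tasks accepted
        pool.awaitTermination(1, TimeUnit.MINUTES);   // wait for queued tasks to finish
        return new int[] { completed.get(), threadNames.size() };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = runTasks(10, 20);
        System.out.println(result[0] + " tasks ran on " + result[1] + " threads");
    }
}
```

The number of threads never exceeds the pool size no matter how many tasks you queue, which is the whole point of using a fixed pool.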
Java SFTP
Believe it or not, there are very few options currently when it comes to implementing SFTP in Java. I believe that the Apache Mina project will eventually provide support for it (http://mina.apache.org/sshd/index.html); however, the project is still at version 0.5 and I have not found any good documentation for it. jCraft’s JSch package is the most straightforward library that I have used so far, and in this article I am going to provide a sample implementation.
JSch: http://www.jcraft.com/jsch/
JSch download: http://sourceforge.net/projects/jsch/files/jsch.jar/0.1.44/jsch-0.1.44.jar/download
Something to keep in mind here is that SFTP is not FTPS. SFTP is the SSH File Transfer Protocol, which is why I am hoping that Apache Mina will eventually support it. FTPS is FTP over SSL, which you can interact with using Apache FtpServer (http://mina.apache.org/ftpserver/). Until there is a mature Apache project, I recommend JSch. Actually, I recommend that you first try using SSHFS (http://fuse.sourceforge.net/sshfs.html) to mount your SFTP account on your file system and, if you cannot set it up, then use JSch.
import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import com.jcraft.jsch.SftpProgressMonitor;
public class SFTPExample {
public static void main(String args[]) throws Exception {
String user = "enter user name";
String password = "enter password";
String host = "enter host";
int port = 22;
String knownHostsFilename = "/path/to/.ssh/known_hosts";
String sourcePath = "/path/to/test_file";
String destPath = "test_file";
JSch jsch = new JSch();
jsch.setKnownHosts(knownHostsFilename);
Session session = jsch.getSession(user, host, port);
session.setPassword(password);
session.connect();
ChannelSftp sftpChannel = (ChannelSftp) session.openChannel("sftp");
sftpChannel.connect();
System.out.println("Uploading test file");
sftpChannel.put(sourcePath, destPath, new ExampleProgressMonitor());
System.out.println("Downloading test file");
sftpChannel.get(destPath, sourcePath + ".new", new ExampleProgressMonitor());
System.out.println("Remove test file");
sftpChannel.rm(destPath);
sftpChannel.exit();
session.disconnect();
}
}
class ExampleProgressMonitor implements SftpProgressMonitor {
private double count;
private double max;
private String src;
private int percent;
private int lastDisplayedPercent;
ExampleProgressMonitor() {
count = 0;
max = 0;
percent = 0;
lastDisplayedPercent = 0;
}
public void init(int op, String src, String dest, long max) {
this.max = max;
this.src = src;
count = 0;
percent = 0;
lastDisplayedPercent = 0;
status();
}
public boolean count(long count) {
this.count += count;
percent = (int) ((this.count / max) * 100.0);
status();
return true;
}
public void end() {
percent = (int) ((count / max) * 100.0);
status();
}
private void status() {
if (lastDisplayedPercent <= percent - 10) {
System.out.println(src + ": " + percent + "% " + ((long) count) + "/" + ((long) max));
lastDisplayedPercent = percent;
}
}
}
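The only slightly subtle part of the monitor is the check in status(), which prints a line only once progress has moved at least ten percentage points since the last printed line. To see how that behaves without a server, here is a stand-alone simulation of the same arithmetic. It does not touch JSch at all; the class name `ProgressThrottleSketch`, the `displayedPercents` helper and the fixed chunk size are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the throttling in ExampleProgressMonitor.status().
class ProgressThrottleSketch {
    // Feeds fixed-size chunks through the same percent/threshold logic and
    // returns the percentages that would have been printed.
    static List<Integer> displayedPercents(long totalBytes, long chunkBytes) {
        if (totalBytes <= 0 || chunkBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        List<Integer> shown = new ArrayList<>();
        double count = 0;
        int lastDisplayedPercent = 0;
        long remaining = totalBytes;
        while (remaining > 0) {
            // The last chunk may be smaller than the others.
            long chunk = Math.min(chunkBytes, remaining);
            remaining -= chunk;
            count += chunk;
            int percent = (int) ((count / totalBytes) * 100.0);
            // Same condition as status(): print only after a 10 point jump.
            if (lastDisplayedPercent <= percent - 10) {
                shown.add(percent);
                lastDisplayedPercent = percent;
            }
        }
        return shown;
    }

    public static void main(String[] args) {
        // Simulate a 1000 byte transfer arriving in 64 byte chunks.
        System.out.println(displayedPercents(1000, 64));
    }
}
```

One thing the logic implies: because end() goes through the same check, the final line is only printed when the last displayed value was 90 or lower, so a transfer can finish without ever showing 100%.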
Formatting JSON from command line
So when working on REST services that return JSON, I often hit them from command line using curl. If the JSON message that is returned is rather large, it can be a pain to read. Python provides a nice little tool though for formatting it. You can pipe a JSON message returned from a curl to it like so:
curl http://my.rest.service.org | python -mjson.tool | less