File Type Detection with Apache Tika

Apache Tika is a free and open source Java framework for content analysis developed under the umbrella of the Apache Software Foundation. Like other Apache projects, it is distributed under the Apache License, Version 2.0.

Tika’s main function is to extract metadata and structured text content from various documents. (The list of supported document formats can be found here.) Its implementation relies heavily on external parser libraries; for example, it uses Apache PDFBox to parse PDF files.

Tika also provides file type detection functionality similar to that of the file utility available in Unix-like systems. The Detector interface represents file type detection capability, and it has several different implementations. Fortunately, you do not have to worry about the details. The Tika class in the API provides convenient access to Tika’s functionality, hiding the technical details of the underlying implementation.

File type detection methods are all named detect(). Depending on the overload, they examine the filename extension (if available), read the beginning of the document content, or both, in order to determine the file type, which is returned as an Internet media type string (e.g. "image/jpeg").
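To make the idea of "name plus content" concrete, here is a toy sketch written with the JDK only. It is not Tika’s implementation: the ToyTypeSniffer class, its tiny list of signatures, and the rule that content wins over the file name are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Toy illustration only; Tika's real detectors are far more thorough.
public class ToyTypeSniffer {

    // Leading bytes of a PNG file: 0x89 'P' 'N' 'G' CR LF 0x1A LF
    private static final byte[] PNG_MAGIC =
            {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    // Leading bytes of a PDF file: "%PDF-"
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Guesses a media type from the leading bytes, falling back to the file name. */
    public static String detect(byte[] prefix, String name) {
        // 1. Content check: compare the leading bytes with known signatures.
        if (startsWith(prefix, PNG_MAGIC)) return "image/png";
        if (startsWith(prefix, PDF_MAGIC)) return "application/pdf";
        // 2. Name check: use the extension when the content gave no answer.
        if (name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".jpg") || lower.endsWith(".jpeg")) return "image/jpeg";
            if (lower.endsWith(".txt")) return "text/plain";
        }
        // 3. Nothing matched: report the generic binary type.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A PNG signature is recognized even though the name says ".txt".
        System.out.println(detect(PNG_MAGIC, "misnamed.txt"));   // image/png
        // No usable content, so the extension decides.
        System.out.println(detect(new byte[0], "photo.JPG"));    // image/jpeg
        // Neither content nor name helps.
        System.out.println(detect(new byte[0], null));           // application/octet-stream
    }
}
```

The result is always a media type string, which is the same kind of value the detect() methods of the Tika class return.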

The following code snippet demonstrates how to determine the file type of a resource identified by a URL:

import java.io.IOException;
import java.net.URL;

import org.apache.tika.Tika;

public class TikaDemo {

    public static void main(String[] args) {
        Tika tika = new Tika();
        try {
            String mediaType = tika.detect(new URL("http://tika.apache.org/tika.png"));
            System.out.println(mediaType);
        } catch (IOException e) {
            System.err.println(e.getMessage());
        }
    }

}

The above code prints image/png to the standard output. To compile and run the program, download tika-app-1.0.jar from here and add it to the classpath. Alternatively, if your project is a Maven project, simply add the following dependency to the POM:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.0</version>
    <scope>compile</scope>
</dependency>

Note that the detect() method that takes an InputStream as a parameter reads from the stream to determine the file type. If the stream supports the mark() and reset() methods, the current position is marked before any content is read, and the method automatically repositions the stream to the marked position once the file type is determined.
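The mark-and-reset pattern described here can be sketched with the JDK alone. The MarkResetPeek class and its peek() helper are hypothetical names used for illustration; this shows the general technique, not Tika’s own code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MarkResetPeek {

    /** Reads up to n leading bytes, then rewinds the stream to where it was. */
    public static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        in.mark(n);                      // remember the current position
        byte[] buffer = new byte[n];
        int total = 0;
        while (total < n) {
            int read = in.read(buffer, total, n - total);
            if (read < 0) break;         // end of stream reached early
            total += read;
        }
        in.reset();                      // return to the marked position
        return Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) throws IOException {
        // ByteArrayInputStream supports mark/reset out of the box.
        InputStream in = new ByteArrayInputStream(
                "%PDF-1.4 rest of file".getBytes(StandardCharsets.US_ASCII));

        byte[] head = peek(in, 5);
        System.out.println(new String(head, StandardCharsets.US_ASCII)); // %PDF-

        // The stream is back at the start, so the next byte is again '%'.
        System.out.println((char) in.read()); // %
    }
}
```

Because the stream is rewound, a caller can inspect the leading bytes and still hand the complete stream to a parser afterwards.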

However, if the stream does not support this feature, the bytes read during detection are consumed and are no longer available for further processing, which may be an undesired side effect. Remember, you can always wrap an InputStream in a BufferedInputStream to guarantee that the markSupported() method returns true. That is, replace a tika.detect(stream) method call with the following:

stream = new BufferedInputStream(stream);
tika.detect(stream);
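The effect of the wrapper can be checked without Tika. The sketch below uses only JDK classes; BufferedWrapDemo, unmarkableStream() and ensureMarkSupported() are hypothetical names, and the anonymous FilterInputStream merely stands in for a stream, such as a FileInputStream, that reports no mark support.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedWrapDemo {

    /** Builds an in-memory stream that reports no mark/reset support. */
    public static InputStream unmarkableStream(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    /** Returns the stream itself if it supports mark/reset, otherwise a buffered wrapper. */
    public static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) throws IOException {
        InputStream raw = unmarkableStream(new byte[] {'A', 'B', 'C', 'D', 'E'});
        System.out.println(raw.markSupported());     // false

        InputStream stream = ensureMarkSupported(raw);
        System.out.println(stream.markSupported());  // true

        // Peek at the first bytes, then rewind: nothing is lost.
        stream.mark(4);
        stream.read(new byte[4]);
        stream.reset();
        System.out.println((char) stream.read());    // A
    }
}
```

After wrapping, keep using the BufferedInputStream rather than the original stream, because the bytes already read from the original now live in the wrapper’s buffer.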

tika-app-1.0.jar also provides both a command line and a graphical user interface to access Tika’s functionality. (See the TikaCLI and the TikaGUI classes.) For more information on running Tika, execute the following command:

java -jar tika-app-1.0.jar --help

For example, the following command will return the file type of the specified resource:

java -jar tika-app-1.0.jar --detect http://tika.apache.org/tika.png

Note that the tika-core Maven artifact does not contain the command line and graphical user interfaces. If you need them in your Maven project, use the tika-app artifact instead.
