File Type Detection with Apache Tika

Apache Tika is a free and open source Java framework for content analysis developed under the umbrella of the Apache Software Foundation. Like other Apache projects, it is distributed under the Apache License, Version 2.0.

Tika’s main function is to extract metadata and structured text content from various documents. (The list of supported document formats can be found here.) Its implementation relies heavily on external parser libraries; for example, it uses Apache PDFBox for parsing PDF files.

Tika also provides file type detection functionality similar to that of the file utility available on Unix-like systems. File type detection capability is represented by the Detector interface, which has several different implementations. Fortunately, you do not have to worry about the details: the Tika class in the API provides convenient access to Tika’s functionality, hiding the technical details of the underlying implementation.

File type detection methods are all named detect(). These methods examine the filename extension (if available) and consume the content of the document to determine the file type, which is returned as an Internet media type string (e.g. "image/jpeg").

The following code snippet demonstrates how to determine the file type of a resource identified by a URL:

import java.io.IOException;
import java.net.URL;

import org.apache.tika.Tika;

public class TikaDemo {

    public static void main(String[] args) {
        Tika tika = new Tika();
        try {
            String mediaType = tika.detect(new URL("http://tika.apache.org/tika.png"));
            System.out.println(mediaType);
        } catch(IOException e) {
            System.err.println(e.getMessage());
        }
    }

}

The above code prints image/png to the standard output. To compile and run the program, download tika-app-1.0.jar from here and add it to the classpath. Alternatively, if your project is a Maven project, simply add the following dependency to the POM:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.0</version>
    <scope>compile</scope>
</dependency>

Note that the detect() method that takes an InputStream parameter consumes the content of the stream to determine the file type. If the stream supports the mark() and reset() methods, the current position is marked before the content of the stream is read, and the method automatically repositions the stream to the marked position after the file type is determined.

However, if the stream does not support marking, the content consumed during detection is not available for further processing, which may be an undesired side effect. Remember that you can always wrap an InputStream in a BufferedInputStream to guarantee that the markSupported() method returns true. That is, replace a tika.detect(stream) method call with the following:

stream = new BufferedInputStream(stream);
tika.detect(stream);
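The mark/reset behavior described above can be demonstrated with the standard library alone. The following sketch is my own illustration (the class and the peek() helper are hypothetical, and it does not use Tika itself): it mimics what a detector does with a markable stream, reading a prefix of the content and then rewinding, so that the full content remains available afterwards.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MarkResetDemo {

    // Simulates what a detector does internally: mark the stream, read a
    // prefix of the content, then reset to the marked position.
    static byte[] peek(InputStream stream, int limit) throws IOException {
        stream.mark(limit);
        byte[] prefix = new byte[limit];
        int n = stream.read(prefix);
        stream.reset();
        return Arrays.copyOf(prefix, Math.max(n, 0));
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "%PDF-1.4 lorem ipsum".getBytes(StandardCharsets.US_ASCII);
        InputStream stream = new BufferedInputStream(new ByteArrayInputStream(content));

        // Wrapping in a BufferedInputStream guarantees mark/reset support.
        System.out.println(stream.markSupported()); // prints true

        peek(stream, 4); // reads "%PDF" and rewinds the stream

        // The full content is still available after peeking.
        byte[] all = new byte[content.length];
        System.out.println(stream.read(all) == content.length); // prints true
    }
}
```

A stream that does not support marking (for example, a raw socket stream) would lose the peeked prefix; wrapping it as shown above avoids that.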

tika-app-1.0.jar also provides both a command line and a graphical user interface to access Tika’s functionality. (See the TikaCLI and TikaGUI classes.) For more information on running Tika, execute the following command:

java -jar tika-app-1.0.jar --help

For example, the following command will return the file type of the specified resource:

java -jar tika-app-1.0.jar --detect http://tika.apache.org/tika.png

It should be noted that the tika-core Maven artifact does not contain the command line and graphical user interfaces. If you need them in your Maven project, use the tika-app artifact instead.

Even Geeks Can Cook (Part 1): Fruit Pie Made Easy

This post is the first of my new series called Even Geeks Can Cook.

Cooking is a rather algorithmic activity. Thus, it can be fun even for geeks. Believe it or not, the following recipe can be implemented in 15 minutes (not including baking time) with no cooking skills required. You can even make use of your geek skills during the implementation process. (For example, being familiar with the concept of uniform distribution may come in handy.)

Ingredients:

  • 250 g flour
  • 200 g sugar
  • 10 g vanilla sugar
  • 12 g baking powder
  • 1 dl milk
  • 1 dl cooking oil
  • a pinch of cinnamon
  • 4 eggs
  • canned fruits (sour cherry, peach)

Mix the flour, sugar, vanilla sugar, baking powder, cinnamon, milk, cooking oil and the eggs (not including eggshells) in a bowl and stir them well to make a smooth paste. Line a pan with parchment paper, then pour the paste into it.

Now it’s time to deal with the fruits! You can choose the amount of fruits to taste. Drain all fruits through a sieve to get rid of any liquid. Chop the peaches into small pieces. Roll the fruits in some flour (1 tablespoon of flour is enough for both fruits). Add the fruits to the pan, distributing them uniformly.

Ready for baking (observe the distribution of fruits!)

Preheat the oven to 180 °C and bake for 30 minutes. That’s all!

Output of the baking process

Yum, yum

Thanks to my friends, Laci, Lívia and Zsuzsa for being my beta testers. (They seemed pretty satisfied with the results.) Special thanks go to Móni for the excellent recipe.

Auto-generation of UML Class Diagrams in Maven Projects

The yWorks UML Doclet is a handy Javadoc extension that automatically creates good-looking UML diagrams from Java classes and embeds them into the generated API documentation.

Although the tool is not free software, its Community Edition is available free of charge under the following conditions:

The Community Edition of the Software is licensed to you free of charge. It comes without support and warranties of any kind. The Community Edition of the Software inserts a web link into your output files that points to the yWorks website. You may not change that link or prevent either display of the link or the intended use as means of navigation to get to the yWorks website in any way.

These terms are quite reasonable; the embedded link is a small price to pay for such a great tool.

The functionality offered by the product is also available in Apache Maven projects via the Maven Javadoc Plugin. The following is a minimal POM that demonstrates how to use the doclet in Maven projects:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>my</groupId>
  <artifactId>project</artifactId>
  <packaging>jar</packaging>
  <version>1.0</version>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-site-plugin</artifactId>
        <version>3.0</version>
        <configuration>
          <reportPlugins>
            <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-project-info-reports-plugin</artifactId>
            </plugin>
            <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-javadoc-plugin</artifactId>
              <version>2.8.1</version>
              <configuration>
                <doclet>ydoc.doclets.YStandard</doclet>
                <docletPath>${yDoc.path}/lib/ydoc.jar:${yDoc.path}/lib/styleed.jar:${yDoc.path}/resources</docletPath>
                <additionalparam>-umlautogen</additionalparam>
              </configuration>
            </plugin>
          </reportPlugins>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Can’t wait to try it? Download and unpack the archive of the Community Edition, then run Maven with mvn site -DyDoc.path=PATH where PATH is the path of the directory that contains your yDoc installation. Alternatively, the path can also be given in the properties element of the POM:

<properties>
  <yDoc.path>PATH</yDoc.path>
</properties>

The screenshot below shows the appearance of the enhanced API documentation.

API documentation with auto-generated UML class diagram

Thanks to László Aszalós for recommending this excellent tool.

Analyzing Web Server Log Files with RapidMiner (Part 3): from Sessions to Transactions

The Transform Log to Session operator of the Web Mining Extension transforms an example set returned by the Read Server Log operator to a set of transactions suitable for performing association analysis.

The Transform Log to Session operator in a process

The mandatory parameters of the operator are named session attribute and resource attribute. The former determines the attribute used for identifying sessions while the latter determines the attribute used for identifying resources.

Parameters of the Transform Log to Session operator

The result of the operator is an example set in which each session is represented by a single example. The examples have many integer valued attributes each of which corresponds to a resource. The value of such an attribute represents the number of times the resource has been requested during the session.

Result of the Transform Log to Session operator

Note that performing association analysis may require further processing. For example, the integer attributes must be transformed to binomial ones so that the FP-Growth operator can be applied.

Analyzing Web Server Log Files with RapidMiner (Part 2): Sessions

The Read Server Log operator returns an example set. Each entry of the example set corresponds to an entry in the log file. The operator automatically associates a session attribute value with each of the examples. Many examples may share the same session attribute value. But hey, what exactly is a session?

Example set returned by the Read Server Log operator

The operator has an important parameter named session timeout whose value must be a non-negative integer, measured in milliseconds. A session is a series of log file entries that correspond to HTTP requests initiated by the same user agent from the same host, such that the time difference between the first and the last entry of the session is less than or equal to session timeout.

When the first hit by a specific user agent from a specific host is found in the log file, a new session attribute value is assigned to the corresponding example. All subsequent hits by the same user agent from the same host within session timeout are associated with the same session attribute value. Once session timeout is exceeded, a new session attribute value is generated for the same user agent and host.
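The session assignment described above can be sketched in a few lines of Java. This is my own illustration (the class and method names are hypothetical), not the extension’s actual implementation; it assumes the hits of one (host, user agent) pair are already sorted by time:

```java
import java.util.ArrayList;
import java.util.List;

public class SessionDemo {

    // Assigns a session id to each timestamp (milliseconds, sorted ascending)
    // for the hits of one (host, user agent) pair. A new session starts when a
    // hit is more than `timeout` milliseconds after the session's first hit.
    static List<Integer> assignSessions(long[] times, long timeout) {
        List<Integer> sessions = new ArrayList<Integer>();
        int session = 0;
        long sessionStart = times.length > 0 ? times[0] : 0;
        for (long t : times) {
            if (t - sessionStart > timeout) {
                session++;          // timeout exceeded: open a new session
                sessionStart = t;
            }
            sessions.add(session);
        }
        return sessions;
    }

    public static void main(String[] args) {
        // Hits at 0 s, 10 s and 40 s with a 30-second timeout:
        // the third hit starts a new session.
        System.out.println(assignSessions(new long[] {0, 10000, 40000}, 30000));
        // prints [0, 0, 1]
    }
}
```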

The operator automatically converts date and time values to integers. The time attribute of the resulting example set stores the number of minutes since January 1, 1970, 00:00:00 GMT. (See the source code of the com.rapidminer.operator.io.loganalysis.LogFileSourceOperator class for implementation details.) Note that some information is lost during date and time conversion, since seconds in time values are discarded.
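The conversion itself is plain integer division. The following snippet is my own illustration of the described behavior (the class and method names are hypothetical), not the operator’s code:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeConversionDemo {

    // Converts a timestamp to whole minutes since January 1, 1970, 00:00:00 GMT.
    // Integer division by 60,000 ms discards the seconds, as described above.
    static long minutesSinceEpoch(Date date) {
        return date.getTime() / (60 * 1000);
    }

    public static void main(String[] args) throws Exception {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        format.setTimeZone(TimeZone.getTimeZone("GMT"));

        Date date = format.parse("1970-01-01 00:02:30");
        System.out.println(minutesSinceEpoch(date)); // prints 2 (seconds discarded)
    }
}
```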

Analyzing Web Server Log Files with RapidMiner (Part 1): Quick Start Guide and Bugfix

The RapidMiner Web Mining Extension provides the Read Server Log operator to read web server log files. Unfortunately, it is not straightforward to bring the operator to life based solely on the help text. Moreover, the operator does not work in the current version (5.1.4) of the extension.

Parameters of the Read Server Log operator

First, you need some data files to play with the operator. Download the following ZIP file available from the KDnuggets website: http://www.kdnuggets.com/web_mining_course/kdlog.zip. The archive contains an anonymized Apache HTTP Server log file (da-11-16.ipntld.log). More information about the data can be found here. Place the log file in a separate directory and provide its path in the log dir parameter.

Here comes the tricky part: you must provide a configuration file describing the log file’s format in the config file parameter. The operator uses the polliwog Java library to process web server log files. Download the file polliwog-bin-stable-0.7.tar.gz from the project’s website. The file apache-combined-log-entry-format.xml under the polliwog-0.7/data/ directory describes the Apache Combined Log Format in which our log file is stored. Place this file somewhere in your file system and provide its path in the config file parameter.

Now you can run the process, which results in the following error:

Yes, it is a bug, which I have already reported to the developers. The error is caused by missing classes. Fortunately, the problem can be fixed quite easily: copy gentlyWEB.jar and jdom-1.0.jar from polliwog’s 3rd-party-jars/ directory to the lib/ directory of your RapidMiner installation. (This may require administrator privileges.) The operator works fine after restarting RapidMiner. (I have tested it on Linux.)

To fix the problem properly, the developers should bundle the content of these two JAR files with the Web Mining Extension.

We are Starting Soon

I have just created this blog; my first post is due tomorrow morning (CET).