Monthly Archives: March 2012

Using Web Services in RapidMiner

The Enrich Data by Webservice operator of the RapidMiner Web Mining Extension allows you to interact with web services in your RapidMiner process.

A web service can be invoked for each example of an example set. (Note that this may be time-consuming.) All strings of the form <%attribute%> in a request will be automatically replaced with the corresponding attribute value of the current example. The operator provides several different methods to parse the response, including regular expressions and XPath location paths. By parsing the response you can add new attributes to your example set.

For demonstration purposes we will use the Google Geocoding API. This web service also offers reverse geocoding functionality, i.e. provides a human-readable address for a geographical location. To see how it works, click on the following link: http://maps.googleapis.com/maps/api/geocode/xml?latlng=47.555214,21.621423&sensor=false. Notice that latitude and longitude values are passed to the service in the latlng query string parameter.
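The placeholder substitution described above can be illustrated with a small Java sketch. This is not RapidMiner's implementation, only a model of the idea: the class name PlaceholderDemo, the method fill, the use of a plain Map to stand in for an example, and the attribute names Latitude and Longitude are all assumptions made for illustration. Leaving unknown placeholders untouched is also a choice of this sketch, not documented operator behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderDemo {

    // Matches placeholders of the form <%name%>; group 1 captures the attribute name.
    private static final Pattern PLACEHOLDER = Pattern.compile("<%(.+?)%>");

    // Replaces each placeholder with the value of the attribute of the same name.
    // Placeholders without a matching attribute are left as they are (sketch assumption).
    static String fill(String template, Map<String, String> example) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = example.getOrDefault(matcher.group(1), matcher.group());
            matcher.appendReplacement(result, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // A Map stands in for one example (row) of the example set.
        Map<String, String> example = new LinkedHashMap<>();
        example.put("Latitude", "47.555214");
        example.put("Longitude", "21.621423");

        String template = "http://maps.googleapis.com/maps/api/geocode/xml"
                + "?latlng=<%Latitude%>,<%Longitude%>&sensor=false";
        System.out.println(fill(template, example));
    }
}
```

With the values above, the sketch produces the same URL as the link in the previous paragraph.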

We will use this data file for our experiment. The file contains earthquake data that originates from the Earthquake Search service provided by the United States Geological Survey (USGS). Consider the following RapidMiner process that is available from here:

A RapidMiner process that uses the Enrich Data by Webservice operator to interact with a web service

First, the data file is read by the Read CSV operator. Then the Sort and Filter Example Range operators are used to select the 50 highest-magnitude earthquakes. Finally, the Enrich Data by Webservice operator invokes the web service to retrieve country names for the geographical locations of these 50 earthquakes. (Only a small subset of the entire data is used to prevent excessive network traffic.)

The parameters of the Enrich Data by Webservice operator should be set as follows (see the figure below):

  • Set the value of the query type parameter to XPath
  • Set the value of the attribute type parameter to Nominal
  • Uncheck the checkbox of the assume html parameter
  • Set the value of the request method parameter to GET
  • Set the value of the url parameter to http://maps.googleapis.com/maps/api/geocode/xml?latlng=<%Latitude%>,<%Longitude%>&sensor=false
Parameters of the Enrich Data by Webservice operator

Finally, click on the Edit List button next to the xpath queries parameter that will bring up an Edit Parameter List window. Enter the string Country into the attribute name field and the string //result[type = 'country']/formatted_address/text() into the query expression field.

Setting of the xpath queries parameter
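To see what this XPath expression selects, the query can be tried outside RapidMiner with the standard javax.xml.xpath API. The sketch below is only an illustration: the class name XPathCountryDemo and the method country are made up, and SAMPLE_RESPONSE is a hand-written, heavily abbreviated document that assumes the response has result elements with type and formatted_address children. It is not an actual response from the service.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class XPathCountryDemo {

    // Hypothetical, abbreviated sample; a real response would contain many more elements.
    static final String SAMPLE_RESPONSE =
            "<GeocodeResponse>"
          + "<status>OK</status>"
          + "<result><type>locality</type><type>political</type>"
          + "<formatted_address>Debrecen, Hungary</formatted_address></result>"
          + "<result><type>country</type><type>political</type>"
          + "<formatted_address>Hungary</formatted_address></result>"
          + "</GeocodeResponse>";

    // Evaluates the query from the post against the given XML document.
    // Returns an empty string if no result element has a type child equal to 'country'.
    static String country(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(
                    "//result[type = 'country']/formatted_address/text()",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate the query", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(country(SAMPLE_RESPONSE)); // prints Hungary
    }
}
```

The predicate [type = 'country'] holds when any type child of a result element equals country, so only the second result in the sample matches.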

That’s all! Unfortunately, running the process results in the following error:

Process Failed

Well, this is a bug that I have already reported to the developers. (See the bug report here.) The following trick solves the problem: set the request method parameter of the Enrich Data by Webservice operator to POST, enter some arbitrary text into the service method parameter, then set the request method parameter to GET again.

The figure below shows the enhanced example set that contains country names provided by the web service (see the Country attribute).

Enhanced example set with country names

File Type Detection with Apache Tika

Apache Tika is a free and open source Java framework for content analysis developed under the umbrella of the Apache Software Foundation. Like other Apache projects, it is distributed under the Apache License, Version 2.0.

Tika’s main function is to extract metadata and structured text content from various documents. (The list of supported document formats can be found here.) Its implementation relies heavily on external parser libraries. For example, it uses Apache PDFBox for parsing PDF files.

Tika also provides file type detection functionality similar to that of the file utility available in Unix-like systems. The Detector interface represents file type detection capability. This interface has several different implementations. Fortunately, you do not have to worry about the details. The Tika class in the API provides convenient access to Tika’s functionality hiding technical details of the underlying implementation.

File type detection methods are all named detect(). The methods examine the filename extension (if available) and consume the content of the document in order to determine the file type, which is returned as an Internet media type string (e.g. "image/jpeg").

The following code snippet demonstrates how to determine the file type of a resource identified by a URL:

import java.io.IOException;
import java.net.URL;

import org.apache.tika.Tika;

public class TikaDemo {

    public static void main(String[] args) {
        Tika tika = new Tika();
        try {
            String mediaType = tika.detect(new URL("http://tika.apache.org/tika.png"));
            System.out.println(mediaType);
        } catch(IOException e) {
            System.err.println(e.getMessage());
        }
    }

}

The above code will print image/png to the standard output. To compile and run the program, download tika-app-1.0.jar from here and add it to the classpath. Alternatively, if your project is a Maven project, simply add the following dependency to the POM:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.0</version>
    <scope>compile</scope>
</dependency>

Note that the detect() method that takes an InputStream as a parameter consumes content from the stream to determine the file type. If the stream supports the mark() and reset() methods, then the current position is marked before the content of the stream is read. In this case, the method automatically repositions the stream to the marked position after the file type is determined.

However, if the stream does not support this feature, the bytes read during detection will no longer be available for further processing. This may be an undesired side effect. Remember, you can always wrap an InputStream in a BufferedInputStream to guarantee that the markSupported() method returns true. That is, replace a tika.detect(stream) method call with the following:

stream = new BufferedInputStream(stream);
String mediaType = tika.detect(stream);
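The effect of the wrapping can be observed without Tika. The following sketch uses only the standard library. The class MarkResetDemo and the methods peek, unmarkable and nextByteAfterPeek are hypothetical names, and peek is merely a stand-in for a detector that reads a few leading bytes. It is not how Tika itself is implemented.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class MarkResetDemo {

    // Stand-in for a detector: reads up to n bytes and rewinds only if the stream allows it.
    static void peek(InputStream in, int n) throws IOException {
        boolean canReset = in.markSupported();
        if (canReset) {
            in.mark(n);
        }
        for (int i = 0; i < n && in.read() != -1; i++) {
            // The bytes are discarded; a real detector would inspect them.
        }
        if (canReset) {
            in.reset();
        }
    }

    // Wraps a byte array in a stream that reports no mark support.
    static InputStream unmarkable(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    // Returns the first byte that is still readable after peeking at n bytes.
    static int nextByteAfterPeek(InputStream in, int n) {
        try {
            peek(in, n);
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "abcdefgh".getBytes(StandardCharsets.US_ASCII);
        // Without mark support the first four bytes are lost: prints e
        System.out.println((char) nextByteAfterPeek(unmarkable(data), 4));
        // Wrapped in a BufferedInputStream the stream is rewound: prints a
        System.out.println((char) nextByteAfterPeek(new BufferedInputStream(unmarkable(data)), 4));
    }
}
```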

tika-app-1.0.jar also provides both a command line and a graphical user interface to access Tika’s functionality. (See the TikaCLI and the TikaGUI classes.) For more information on running Tika execute the following command:

java -jar tika-app-1.0.jar --help

For example, the following command will return the file type of the specified resource:

java -jar tika-app-1.0.jar --detect http://tika.apache.org/tika.png

Note that the tika-core Maven artifact does not contain the command line and graphical user interfaces. If you need them in your Maven project, use the tika-app artifact instead.

Even Geeks Can Cook (Part 1): Fruit Pie Made Easy

This post is the first of my new series called Even Geeks Can Cook.

Cooking is a rather algorithmic activity. Thus, it can be fun even for geeks. Believe it or not, the following recipe can be implemented in 15 minutes (not including baking time) with no cooking skills required. You can even make use of your geek skills during the implementation process. (For example, being familiar with the concept of uniform distribution may come in handy.)

Ingredients:

  • 250 g flour
  • 200 g sugar
  • 10 g vanilla sugar
  • 12 g baking powder
  • 1 dl milk
  • 1 dl cooking oil
  • a pinch of cinnamon
  • 4 eggs
  • canned fruits (sour cherry, peach)

Mix the flour, sugar, vanilla sugar, baking powder, cinnamon, milk, cooking oil and the eggs (not including eggshells) in a bowl and stir them well to make a smooth paste. Put parchment paper in a pan then pour the paste into it.

Now it’s time to deal with the fruits! You can choose the amount of fruit to your own taste. Drain all fruits through a sieve to get rid of any liquid. Chop the peaches into small pieces. Roll the fruits in some flour (you can use 1 tablespoon of flour for both fruits). Add the fruits to the pan, distributing them uniformly.

Ready for baking (observe the distribution of fruits!)

Preheat the oven to 180 °C and bake the pie for 30 minutes. That’s all!

Output of the baking process

Yum, yum

Thanks to my friends, Laci, Lívia and Zsuzsa for being my beta testers. (They seemed pretty satisfied with the results.) Special thanks go to Móni for the excellent recipe.
