## Processing live data feeds with RapidMiner

The Open File operator was introduced in version 5.2 of RapidMiner. It returns a file object for reading content from a local file, a URL, or a repository blob entry. Many data import operators, including Read CSV, Read Excel and Read XML, have been extended to accept a file object as input. With this new feature, you can process live data feeds directly in RapidMiner.

Many data import operators provide a wizard to guide users through parameter setting. Unfortunately, wizards cannot use file objects; they always present a file chooser dialog on start. When dealing with data from the web, you can still use the wizards as follows: download the data file and pass your local copy to the wizard. After a successful import you can even delete the local file, because data import operators ignore their file name parameter when they receive a file object as input.

A simple use case is presented below for demonstration purposes.

The United States Geological Survey’s (USGS) Earthquake Hazards Program provides real-time earthquake data. Real-time feeds are available here. Data is updated periodically and is available for download in multiple formats. For example, click here to get data in CSV format about all M2.5+ earthquakes of the past 30 days (the feed is updated every fifteen minutes).

Let’s see how to read this feed in a RapidMiner process. First, download the feed to your computer. The local copy is required only to set the parameters of the Read CSV operator by using the Import Configuration Wizard. For this purpose you can use a smaller data file, for example this one.

Import the local copy of the feed using the wizard. Select the following data types for the attributes:

• Src (source network): polynomial
• EqId: polynomial
• Version: integer
• Datetime: date_time
• Lat: real
• Lon: real
• Magnitude: real
• NST (number of reporting stations): integer
• Region: text

Important: the value of the date format parameter must be set to `E, MMM d, yyyy H:mm:ss z` to ensure correct handling of the Datetime attribute. For details about date and time pattern strings consult the API documentation of the SimpleDateFormat class (see section titled Date and Time Patterns). It is also important to set the value of the locale parameter to one of the English locales.
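To make the role of the pattern and the locale more concrete, here is a rough Python analogy. RapidMiner itself relies on Java's SimpleDateFormat, so this is only a sketch: the sample value is hypothetical (shaped to match the pattern, not copied from the feed), and the `strptime` format string is an approximate counterpart of the Java pattern, not an exact translation.

```python
from datetime import datetime, timezone

# Hypothetical sample value shaped to match the pattern above (an assumption,
# not a line copied from the feed).
sample = "Friday, April 13, 2012 20:07:35 UTC"

# Approximate strptime counterpart of "E, MMM d, yyyy H:mm:ss z".
# %A and %B match day and month names of the active locale, which is why
# an English locale matters here as well. The zone is matched literally.
parsed = datetime.strptime(sample, "%A, %B %d, %Y %H:%M:%S UTC")
parsed = parsed.replace(tzinfo=timezone.utc)

print(parsed.isoformat())
```

With a non-English locale the day and month names would not match and parsing would fail, which mirrors the reason for choosing an English locale in the operator.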

Once the local file is imported successfully, drag the Open File operator into the process and connect its output port to the input port of the Read CSV operator. Then set the parameters of the Open File operator as follows: set the resource type parameter to `URL`, and provide the URL of the feed in the url parameter.

A RapidMiner process that uses the Open File operator to read a data feed from the web

Now you can delete the local data file; the operator will read the feed from the URL when the process is run.
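The idea behind the file object can be illustrated outside RapidMiner, too. The Python sketch below is an analogy, not RapidMiner code: the parser only sees a text stream, so it does not matter whether the stream comes from a local copy or from a URL. The function names `read_rows` and `open_url` are hypothetical, and the sample row is made up to match the attribute list above.

```python
import csv
import io
import urllib.request


def read_rows(stream):
    """Parse CSV text from any text stream into a list of dicts."""
    return list(csv.DictReader(stream))


def open_url(url):
    """Return a text stream for the resource at url (performs a network request)."""
    response = urllib.request.urlopen(url)
    return io.TextIOWrapper(response, encoding="utf-8", newline="")


# Made-up sample standing in for the local copy of the feed (an assumption).
sample = io.StringIO(
    "Src,EqId,Version,Datetime,Lat,Lon,Magnitude,NST,Region\n"
    'xx,0001,1,"Friday, April 13, 2012 20:07:35 UTC",10.5,-20.25,4.7,12,Example region\n'
)

rows = read_rows(sample)
print(rows[0]["Magnitude"])
```

Reading the live feed would then be a matter of calling `read_rows(open_url(feed_url))`, where `feed_url` is the address of the feed.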


## Cross-validation in RapidMiner

Cross-validation is a standard statistical method to estimate the generalization error of a predictive model. In $k$-fold cross-validation a training set is divided into $k$ equal-sized subsets. Then the following procedure is repeated for each subset: a model is built using the other $(k - 1)$ subsets as the training set and its performance is evaluated on the current subset. This means that each subset is used for testing exactly once. The result of the cross-validation is the average of the performances obtained from the $k$ rounds.
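The procedure can be sketched in a few lines of Python. This is a generic illustration, not RapidMiner's implementation: the helper names `k_fold_indices` and `cross_validate` are hypothetical, the folds are taken as contiguous index blocks (an assumption, real tools usually shuffle or stratify), and the `train` and `evaluate` callbacks stand in for whatever learner and performance measure are used.

```python
def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n, k, train, evaluate):
    """Run k rounds; each fold is the test set exactly once."""
    scores = []
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = train(train_idx)                    # built on the other k - 1 folds
        scores.append(evaluate(model, test_idx))    # evaluated on the current fold
    return sum(scores) / k, scores


# Dummy callbacks, only to show the call shape: the "model" is the training
# set size and the "performance" is the test set size.
average, per_round = cross_validate(16, 3, train=len, evaluate=lambda m, t: len(t))
print(k_fold_indices(16, 3), per_round)
```

With 16 examples and 3 folds, the fold sizes are 5, 5 and 6 in some order.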

This post explains how to interpret cross-validation results in RapidMiner. For demonstration purposes, we consider the following simple RapidMiner process that is available here:

The Read URL operator reads the yellow-small+adult-stretch.data file, a subset of the Balloons Data Set available from the UCI Machine Learning Repository. Since this data set contains only 16 examples, it is very easy to perform all calculations in your head.

The Set Role operator marks the last attribute as the one that provides the class labels. The number of validations is set to 3 on the X-Validation operator, which results in a 5-5-6 partitioning of the examples in our case.

In the training subprocess of the cross-validation process a decision tree classifier is built on the current training set. In the testing subprocess the accuracy of the decision tree is computed on the test set.

The result of the process is the following PerformanceVector:

74.44 is the arithmetic mean of the accuracies obtained from the three rounds, and 10.30 is their standard deviation. However, it is not clear how to interpret the confusion matrix below and the value labelled with the word mikro. You may ask how a single confusion matrix can be returned when several models are built and evaluated in the cross-validation process.

The Write as Text operator in the inner testing subprocess writes the performance vectors to a text file, which helps us understand the results above. The file contains the confusion matrices obtained from each round together with the corresponding accuracy values, as shown below:

```
13.04.2012 22:07:35 Results of ResultWriter 'Write as Text' [1]:
13.04.2012 22:07:35 PerformanceVector:
accuracy: 60.00%
ConfusionMatrix:
True:	T	F
T:	0	0
F:	2	3

13.04.2012 22:07:35 Results of ResultWriter 'Write as Text' [2]:
13.04.2012 22:07:35 PerformanceVector:
accuracy: 83.33%
ConfusionMatrix:
True:	T	F
T:	2	0
F:	1	3

13.04.2012 22:07:35 Results of ResultWriter 'Write as Text' [3]:
13.04.2012 22:07:35 PerformanceVector:
accuracy: 80.00%
ConfusionMatrix:
True:	T	F
T:	2	1
F:	0	2
```

Notice that the confusion matrix on the PerformanceVector (Performance) tab is simply the sum of the three confusion matrices. The value labelled with the word mikro (75) is the accuracy computed from this aggregated confusion matrix. A performance calculated this way is called the mikro average, while the mean of the per-round performances is called the makro average. Note that the confusion matrix behind the mikro average is constructed by evaluating different models on different test sets.
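The numbers can be reproduced with a short Python sketch that uses only the three confusion matrices from the log above. The variable names are my own, and the use of the population standard deviation is an assumption that happens to match the reported 10.30.

```python
from statistics import mean, pstdev

# Confusion matrices of the three rounds, copied from the log above
# (rows: predicted T/F, columns: true T/F).
rounds = [
    [[0, 0], [2, 3]],
    [[2, 0], [1, 3]],
    [[2, 1], [0, 2]],
]


def accuracy(matrix):
    """Accuracy in percent: diagonal entries over all entries."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


per_round = [accuracy(m) for m in rounds]

# Makro average: mean of the per-round accuracies.
makro = mean(per_round)
spread = pstdev(per_round)

# Mikro average: accuracy of the element-wise sum of the matrices.
aggregated = [[sum(m[i][j] for m in rounds) for j in range(2)] for i in range(2)]
mikro = accuracy(aggregated)

print(f"makro: {makro:.2f} +/- {spread:.2f}, mikro: {mikro:.2f}")
```

The two averages differ here because the test sets are not all the same size: the mikro average weights every example equally, while the makro average weights every round equally.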


## Using Web Services in RapidMiner

The Enrich Data by Webservice operator of the RapidMiner Web Mining Extension allows you to interact with web services in your RapidMiner process.

A web service can be invoked for each example of an example set. (Note that this may be time-consuming.) All strings of the form `<%attribute%>` in a request are automatically replaced with the corresponding attribute value of the current example. The operator provides several methods to parse the response, including regular expressions and XPath location paths. By parsing the response, you can add new attributes to your example set.
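The placeholder mechanism can be mimicked in Python as follows. This is a sketch of the idea only, not the operator's actual code: the function name `fill_template` is hypothetical, and representing an example as a dictionary is an assumption made for illustration.

```python
import re


def fill_template(template, example):
    """Replace every <%name%> placeholder with the value of example[name]."""
    return re.sub(r"<%(.+?)%>", lambda match: str(example[match.group(1)]), template)


template = (
    "http://maps.googleapis.com/maps/api/geocode/xml"
    "?latlng=<%Latitude%>,<%Longitude%>&sensor=false"
)

# One example of the example set, represented as a dict (an assumption).
example = {"Latitude": 47.555214, "Longitude": 21.621423}

print(fill_template(template, example))
```

Applying the function to every example yields one request URL per example, which is why invoking the service for a large example set can take a long time.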

For demonstration purposes we will use the Google Geocoding API. This web service also offers reverse geocoding functionality, i.e. it provides a human-readable address for a geographical location. To see how it works, click on the following link: http://maps.googleapis.com/maps/api/geocode/xml?latlng=47.555214,21.621423&sensor=false. Notice that latitude and longitude values are passed to the service in the `latlng` query string parameter.

We will use this data file for our experiment. The file contains earthquake data that originates from the Earthquake Search service provided by the United States Geological Survey (USGS). Consider the following RapidMiner process that is available from here:

A RapidMiner process that uses the Enrich Data by Webservice operator to interact with a web service

First, the data file is read by the Read CSV operator. Then the Sort and Filter Example Range operators are used to select the 50 highest-magnitude earthquakes. Finally, the Enrich Data by Webservice operator invokes the web service to retrieve country names for the geographical locations of these 50 earthquakes. (Only a small subset of the entire data is used to prevent excessive network traffic.)

The parameters of the Enrich Data by Webservice operator should be set as follows (see the figure below):

• Set the value of the query type parameter to `XPath`
• Set the value of the attribute type parameter to `Nominal`
• Uncheck the checkbox of the assume html parameter
• Set the value of the request method parameter to `GET`
• Set the value of the url parameter to `http://maps.googleapis.com/maps/api/geocode/xml?latlng=<%Latitude%>,<%Longitude%>&sensor=false`

Parameters of the Enrich Data by Webservice operator

Finally, click on the Edit List button next to the xpath queries parameter, which brings up an Edit Parameter List window. Enter the string `Country` into the attribute name field and the string `//result[type = 'country']/formatted_address/text()` into the query expression field.

Setting of the xpath queries parameter
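To see what the XPath query selects, the following Python sketch applies an equivalent query to a small response. The XML document is a hand-written, heavily simplified stand-in for a real response (an assumption), and the query is adapted to the XPath subset supported by Python's `xml.etree.ElementTree`, which has no `text()` step.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for a service response (an assumption).
response = """<GeocodeResponse>
  <status>OK</status>
  <result>
    <type>locality</type>
    <type>political</type>
    <formatted_address>Debrecen, Hungary</formatted_address>
  </result>
  <result>
    <type>country</type>
    <type>political</type>
    <formatted_address>Hungary</formatted_address>
  </result>
</GeocodeResponse>"""

root = ET.fromstring(response)

# Pick the result that has a <type> child equal to "country" and
# return the text of its <formatted_address> child.
country = root.findtext(".//result[type='country']/formatted_address")
print(country)
```

The predicate `[type='country']` is what restricts the match to the country-level result; without it, the first `formatted_address` in the response would be returned.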

That’s all! Unfortunately, running the process results in the following error:

Process Failed

Well, this is a bug that I have already reported to the developers. (See the bug report here.) The following trick solves the problem: set the request method parameter of the Enrich Data by Webservice operator to `POST`, enter some arbitrary text into the service method parameter, then set the request method parameter to `GET` again.

The figure below shows the enhanced example set that contains country names provided by the web service (see the Country attribute).

Enhanced example set with country names

## Analyzing Web Server Log Files with RapidMiner (Part 3): from Sessions to Transactions

The Transform Log to Session operator of the Web Mining Extension transforms an example set returned by the Read Server Log operator to a set of transactions suitable for performing association analysis.

The Transform Log to Session operator in a process

The mandatory parameters of the operator are named session attribute and resource attribute. The former determines the attribute used for identifying sessions while the latter determines the attribute used for identifying resources.

Parameters of the Transform Log to Session operator

The result of the operator is an example set in which each session is represented by a single example. The examples have many integer-valued attributes, each of which corresponds to a resource. The value of such an attribute is the number of times the resource was requested during the session.

Result of the Transform Log to Session operator

Note that performing association analysis may require further processing. For example, integer attributes must be transformed to binomial ones before the FP-Growth operator can be applied.
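The transformation and the subsequent conversion to binomial values can be sketched in Python as follows. This illustrates the idea and is not the operator's implementation: the function names `to_transactions` and `binarize` are hypothetical, and the log rows are made up.

```python
from collections import Counter, defaultdict


def to_transactions(rows):
    """Turn (session, resource) pairs into one row of request counts per session."""
    counts = defaultdict(Counter)
    for session, resource in rows:
        counts[session][resource] += 1
    resources = sorted({resource for _, resource in rows})
    return {s: {r: c[r] for r in resources} for s, c in counts.items()}


def binarize(table):
    """Replace each count with True/False: was the resource requested at all?"""
    return {s: {r: n > 0 for r, n in row.items()} for s, row in table.items()}


# Made-up log rows: (session attribute value, resource attribute value).
log = [
    ("s1", "/index.html"),
    ("s1", "/about.html"),
    ("s1", "/index.html"),
    ("s2", "/index.html"),
]

transactions = to_transactions(log)
print(transactions)
print(binarize(transactions))
```

Each session becomes one row with one column per resource, and the binarized table has the shape that frequent itemset mining expects.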


## Analyzing Web Server Log Files with RapidMiner (Part 2): Sessions

The Read Server Log operator returns an example set. Each entry of the example set corresponds to an entry in the log file. The operator automatically associates a session attribute value with each of the examples. Many examples may share the same session attribute value. But hey, what exactly is a session?

Example set returned by the Read Server Log operator

The operator has an important parameter named session timeout, whose value must be a non-negative integer measured in milliseconds. A session is a series of log file entries that correspond to HTTP requests initiated by the same user agent from the same host, such that the time difference between the first and the last log file entry in the session is always less than or equal to session timeout.

When the first hit by a specific user agent from a specific host is found in the log file, a new session attribute value is assigned to the corresponding example. All subsequent hits by the same user agent from the same host within session timeout are associated with the same session attribute value. Once session timeout is reached, a new session attribute value is generated for the same user agent from the same host.
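The rule described above can be sketched in Python like this. It is an illustration based on the description, not the operator's source code: the function name `assign_sessions` is hypothetical, the hits are made up, and times and the timeout are assumed to be expressed in the same unit.

```python
def assign_sessions(hits, timeout):
    """Assign a session id to each hit.

    hits: (time, host, user_agent) tuples sorted by time.
    A hit stays in the current session of its (host, user_agent) pair as long
    as it is at most `timeout` later than the first hit of that session.
    """
    current = {}  # (host, user_agent) -> (session id, time of first hit)
    next_id = 0
    session_ids = []
    for time, host, agent in hits:
        key = (host, agent)
        if key not in current or time - current[key][1] > timeout:
            current[key] = (next_id, time)  # start a new session
            next_id += 1
        session_ids.append(current[key][0])
    return session_ids


# Made-up hits; times and timeout share one unit (an assumption).
hits = [
    (0, "host-a", "agent-x"),
    (10, "host-a", "agent-x"),
    (20, "host-b", "agent-x"),
    (31, "host-a", "agent-x"),
    (40, "host-b", "agent-x"),
]
print(assign_sessions(hits, timeout=30))
```

The fourth hit starts a new session because it arrives 31 time units after the first hit of its session, which exceeds the timeout of 30.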

The operator automatically converts date and time values to integers. The time attribute of the resulting example set stores the number of minutes since January 1, 1970, 00:00:00 GMT. (See the source code of the `com.rapidminer.operator.io.loganalysis.LogFileSourceOperator` class for implementation details.) Note that some information is lost during date and time conversion, since seconds in time values are discarded.
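The information loss is easy to demonstrate with a small Python sketch. It only mirrors the conversion as described above and is not taken from the operator's source code; the function name `minutes_since_epoch` is hypothetical.

```python
from datetime import datetime, timezone


def minutes_since_epoch(moment):
    """Whole minutes elapsed since 1970-01-01 00:00:00 GMT; seconds are dropped."""
    return int(moment.timestamp()) // 60


first = datetime(2012, 4, 13, 22, 7, 0, tzinfo=timezone.utc)
second = datetime(2012, 4, 13, 22, 7, 35, tzinfo=timezone.utc)

# Both moments fall into the same minute, so they map to the same integer.
print(minutes_since_epoch(first) == minutes_since_epoch(second))
```

Two requests made within the same minute therefore become indistinguishable by their time attribute.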


## Analyzing Web Server Log Files with RapidMiner (Part 1): Quick Start Guide and Bugfix

The RapidMiner Web Mining Extension provides the Read Server Log operator to read web server log files. Unfortunately, it is not straightforward to get the operator working based solely on the help text. Moreover, the operator does not work in the current version (5.1.4) of the extension.

First, you need some data files to play with the operator. Download the following ZIP file available from the KDnuggets website: http://www.kdnuggets.com/web_mining_course/kdlog.zip. The archive contains an anonymized Apache HTTP Server log file (`da-11-16.ipntld.log`). More information about the data can be found here. Place the log file in a separate directory and provide its path in the log dir parameter.

Here comes the tricky part. In the config file parameter you must provide a configuration file that describes the format of the log file. The operator uses the polliwog Java library to process web server log files. Download the file `polliwog-bin-stable-0.7.tar.gz` from the project’s website. The file `apache-combined-log-entry-format.xml` under the `polliwog-0.7/data/` directory describes the Apache Combined Log Format in which our log file is stored. Place this file in your file system and provide its path in the config file parameter.

Now you can run the process, which results in the following error:

Yes, it is a bug that I have already reported to the developers. The error is caused by some missing classes. Fortunately, the problem can be fixed quite easily. Copy `gentlyWEB.jar` and `jdom-1.0.jar` from polliwog’s `3rd-party-jars/` directory to the `lib/` directory of your RapidMiner installation. (That may require administrator privileges.) The operator will work fine after restarting RapidMiner. (The author has tested it on Linux.)

To fix the problem permanently, the developers must add the contents of these two JAR files to the Web Mining Extension.
