Monthly Archives: September 2012

Processing live data feeds with RapidMiner

The Open File operator was introduced in version 5.2 of RapidMiner. It returns a file object for reading content from a local file, from a URL or from a repository blob entry. Many data import operators, including Read CSV, Read Excel and Read XML, have been extended to accept a file object as input. With this feature, you can now process live data feeds directly in RapidMiner.

Many data import operators provide a wizard to guide users through setting their parameters. Unfortunately, wizards cannot use file objects; they always present a file chooser dialog on start. When dealing with data from the web, you can still use the wizards as follows: download the data file and pass your local copy to the wizard. After a successful import you can even delete the local file. Data import operators ignore their file name parameter when they receive a file object as input.

The following simple use case demonstrates this.

The United States Geological Survey’s (USGS) Earthquake Hazards Program provides real-time earthquake data. Real-time feeds are available here. Data is updated periodically and is available for download in multiple formats. For example, click here to get data in CSV format about all M2.5+ earthquakes of the past 30 days (the feed is updated every fifteen minutes).

Let’s see how to read this feed in a RapidMiner process. First, download the feed to your computer. The local copy is required only to set the parameters of the Read CSV operator by using the Import Configuration Wizard. For this purpose you can use a smaller data file, for example this one.

Import the local copy of the feed using the wizard. Select the following data types for the attributes:

  • Src (source network): polynomial
  • EqId: polynomial
  • Version: integer
  • Datetime: date_time
  • Lat: real
  • Lon: real
  • Magnitude: real
  • NST (number of reporting stations): integer
  • Region: text

Important: the value of the date format parameter must be set to E, MMM d, yyyy H:mm:ss z to ensure correct handling of the Datetime attribute. For details about date and time pattern strings consult the API documentation of the SimpleDateFormat class (see section titled Date and Time Patterns). It is also important to set the value of the locale parameter to one of the English locales.
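To see why both the pattern and the locale matter, the following is a minimal Java sketch that parses a timestamp with the same SimpleDateFormat pattern. The class name, the helper methods and the sample timestamp are hypothetical and are not taken from the feed; only the pattern string comes from the text above.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DateFormatSketch {

    // The pattern given above for the Datetime attribute.
    static final String PATTERN = "E, MMM d, yyyy H:mm:ss z";

    // Parses the text with the pattern under the given locale; returns null if it does not match.
    static Date parse(String text, Locale locale) {
        try {
            return new SimpleDateFormat(PATTERN, locale).parse(text);
        } catch (ParseException e) {
            return null;
        }
    }

    // Renders a date in UTC in an ISO-like form, for easy inspection.
    static String toIsoUtc(Date date) {
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
        iso.setTimeZone(TimeZone.getTimeZone("UTC"));
        return iso.format(date);
    }

    public static void main(String[] args) {
        // Hypothetical sample value, not copied from the actual feed.
        String sample = "Thursday, September 27, 2012 18:23:45 UTC";
        System.out.println(toIsoUtc(parse(sample, Locale.ENGLISH)));
        // Day and month names are locale-specific, so a non-English locale may fail to parse them.
        System.out.println(parse(sample, Locale.GERMAN) == null ? "not parseable" : "parsed");
    }
}
```

The day and month names in the text are matched against the names of the chosen locale, which is why the locale parameter should be an English one for English timestamps.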

Once the local file is imported successfully, drag the Open File operator into the process and connect its output port to the input port of the Read CSV operator. Set the parameters of the Open File operator as follows: set the value of the resource type parameter to URL, and provide the URL of the feed in the url parameter.

A RapidMiner process that uses the Open file operator to read a data feed from the web

Now you can delete the local data file; the operator will read the feed from the URL when the process is run.

You can download the complete RapidMiner process here.


Free and open source XSD 1.1 validation tool?

Currently, Xerces2 Java seems to be the only free and open source solution for XSD 1.1 validation. You can download Xerces2 Java here. Be careful to pick the version that comes with XSD 1.1 support. (Binary and source distributions are provided as separate files.) This release of the package has complete support for XSD 1.1.

Unfortunately, the distribution does not provide a command line validation tool; you have to write your own from scratch. I provide a simple but handy implementation in xsd11-validator.jar. This JAR also contains Xerces2 Java with all of its dependencies.

You can run the JAR with the command java -jar xsd11-validator.jar to display usage information:

usage: java hu.unideb.inf.validator.SchemaValidator -if <file> | -iu <url>
       [-sf <file> | -su <url>]
 -if,--instanceFile <file>   read instance from the file specified
 -iu,--instanceUrl <url>     read instance from the URL specified
 -sf,--schemaFile <file>     read schema from the file specified
 -su,--schemaUrl <url>       read schema from the URL specified

You must provide an instance document to be validated using either the -if or the -iu option. (Option -if requires a file path as an argument; option -iu requires a URL.) Similarly, you can specify a schema document using either the -sf or the -su option. The -sf and -su options are not mandatory: if they are omitted, the schema is located using the value of the xsi:schemaLocation attribute in the instance. The following is an example of how to use the program:

java -jar xsd11-validator.jar -sf schema.xsd -if instance.xml

From a developer’s standpoint, there is a minor flaw of Xerces2 Java: you will not find the required beta release in any of the publicly available Maven repositories. You must use your own local copy of xercesImpl.jar in your Maven projects. The good news is that its dependencies are available from repositories. Take a look at the source distribution of the command line tool to see how Xerces2 Java can be used in your Maven projects.
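For orientation, here is a minimal sketch of how validation can be driven through the standard JAXP validation API. It is not the source of the command line tool. The class name and the isValid helper are hypothetical, and the XSD 1.1 schema language URI is an assumption that should be checked against the Xerces2 Java documentation; it only works when the XSD 1.1 build of Xerces2 Java is on the classpath.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class ValidatorSketch {

    // Assumed schema language URI for XSD 1.1 in Xerces2 Java; verify against the Xerces documentation.
    static final String XSD_11 = "http://www.w3.org/XML/XMLSchema/v1.1";

    // Returns true if the instance is valid against the schema; returns false if the
    // schema cannot be compiled, the instance is invalid, or reading fails.
    static boolean isValid(String schemaLanguage, Source schema, Source instance) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(schemaLanguage);
            Schema compiled = factory.newSchema(schema);
            compiled.newValidator().validate(instance);
            return true;
        } catch (SAXException | IOException e) {
            System.err.println(e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // args[0]: path of the schema document, args[1]: path of the instance document
        boolean ok = isValid(XSD_11,
                new StreamSource(new File(args[0])),
                new StreamSource(new File(args[1])));
        System.out.println(ok ? "valid" : "invalid");
    }
}
```

Passing XMLConstants.W3C_XML_SCHEMA_NS_URI as the schema language instead makes the same method use whatever XSD 1.0 validator the Java runtime provides, which is a convenient way to try the sketch without Xerces2 Java.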


Exploring the new features of XML Schema 1.1 (Part 1)

XML Schema 1.1 was promoted to a W3C Recommendation in April this year. It’s time to explore the changes compared to the previous version.

First, the name of the standard has been changed to W3C XML Schema Definition Language (XSD). Beyond that, XSD 1.1 offers exciting new features, while preserving backward compatibility. This post is the first in a series of posts that will demonstrate some of the new features of XSD 1.1.

One of the two newly introduced constraining facets is called assertion (the other one is called explicitTimezone). As you will see, it is a powerful new feature that comes in handy for defining datatypes. The facet constrains the value space by a user-provided logical expression that must be satisfied.

The following simple example demonstrates how to use the assertion facet:

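<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="number">
        <xs:simpleType>
            <xs:restriction base="xs:integer">
                <xs:assertion test="abs($value mod 2) eq 1"/>
            </xs:restriction>
        </xs:simpleType>
    </xs:element>

</xs:schema>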

Note that the above looks just like a plain old schema document, except for the assertion element. There is no way to explicitly indicate that XSD 1.1 is being used here.

The test attribute of the assertion element contains an XPath 2.0 expression that will be evaluated as true or false. (The boolean function is used to convert the value of the expression to a boolean.) In the XPath expression $value can be used to refer to the value being checked.

As mod stands for the modulo operation, the value space of the datatype defined is clearly the set of odd integers. Note that an equivalent solution is to use regular expression matching, which is also available in XML Schema 1.0. Replacing the assertion element in line 7 with

    <xs:pattern value=".*[13579]"/>

also results in the same value space.
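To make the equivalence concrete, the following Java sketch mirrors both tests outside of a schema processor. The class and method names are hypothetical, and long is used for brevity even though xs:integer has arbitrary precision. Java's % operator, like the XPath mod operator, yields a result with the sign of the dividend, which is why the absolute value is taken.

```java
public class OddIntegerSketch {

    // Mirrors the assertion test "abs($value mod 2) eq 1" on the parsed value.
    static boolean satisfiesAssertion(String lexical) {
        long value = Long.parseLong(lexical);
        return Math.abs(value % 2) == 1;
    }

    // Mirrors the pattern ".*[13579]". An XSD pattern must match the whole
    // lexical form, which is also how String.matches behaves.
    static boolean matchesPattern(String lexical) {
        return lexical.matches(".*[13579]");
    }

    public static void main(String[] args) {
        String[] samples = {"7", "-3", "10", "0", "+21"};
        for (String s : samples) {
            System.out.println(s + ": assertion=" + satisfiesAssertion(s)
                    + ", pattern=" + matchesPattern(s));
        }
    }
}
```

For every sample both checks agree: the last digit of the lexical form decides whether the integer is odd, regardless of its sign.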

However, there are situations in which regular expressions cannot help. For example, consider the case of palindromes. Let’s try to define a new datatype whose value space is the set of palindrome strings. You may recall from your theory of computation class that this is not possible with regular expressions. The good news is that we can do it by using XPath functions.

Since there is an XPath function called reverse,

<xs:simpleType name="palindromeString">
    <xs:restriction base="xs:string">
        <xs:assertion test="$value eq reverse($value)"/>
    </xs:restriction>
</xs:simpleType>

seems to be a reasonable initial solution. Unfortunately, the function operates on sequences and cannot be used to reverse strings directly.

The following trick will do the job. First, we turn the string being checked into a sequence of Unicode codepoints (i.e., a sequence of integers) using the string-to-codepoints function. Then the reverse function is applied to the resulting sequence. Finally, the codepoints-to-string function is used to turn it back into a string. Thus, our solution is now the following:

<xs:simpleType name="palindromeString">
    <xs:restriction base="xs:string">
        <xs:assertion test="$value eq codepoints-to-string(reverse(string-to-codepoints($value)))"/>
    </xs:restriction>
</xs:simpleType>

One more step is necessary to complete our job: comparison must be performed ignoring case and any punctuation characters. In order to do that we must replace both occurrences of $value with lower-case(replace($value, '[\s\p{P}]', '')) in the test attribute. Here we use the replace function to remove any whitespace and punctuation characters from the string.

Our final solution is the following:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:simpleType name="palindromeString">
        <xs:restriction base="xs:string">
            <xs:assertion test="lower-case(replace($value, '[\s\p{P}]', '')) eq codepoints-to-string(reverse(string-to-codepoints(lower-case(replace($value, '[\s\p{P}]', '')))))"/>
        </xs:restriction>
    </xs:simpleType>

    <xs:element name="palindrome" type="palindromeString"/>

</xs:schema>

For example, the following are all valid instances of the palindrome element:

<palindrome>never odd or even</palindrome>
<palindrome>Madam, I'm Adam</palindrome>
<palindrome>
    A man, a plan, a canal - Panama!
</palindrome>
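The normalization and reversal steps of the assertion can also be traced in plain Java. This is only an illustrative sketch with hypothetical class and method names, not part of the schema. Java's regular expression dialect is close to the XSD one for this character class, but it is not identical (for example, \s covers a slightly different set of whitespace characters).

```java
import java.util.Locale;

public class PalindromeSketch {

    // Mirrors lower-case(replace($value, '[\s\p{P}]', '')):
    // removes whitespace and punctuation, then converts to lower case.
    static String normalize(String value) {
        return value.replaceAll("[\\s\\p{P}]", "").toLowerCase(Locale.ROOT);
    }

    // Mirrors codepoints-to-string(reverse(string-to-codepoints(...))).
    static String reverse(String s) {
        int[] codePoints = s.codePoints().toArray();
        for (int i = 0, j = codePoints.length - 1; i < j; i++, j--) {
            int tmp = codePoints[i];
            codePoints[i] = codePoints[j];
            codePoints[j] = tmp;
        }
        return new String(codePoints, 0, codePoints.length);
    }

    // True if the normalized string reads the same in both directions.
    static boolean isPalindrome(String value) {
        String normalized = normalize(value);
        return normalized.equals(reverse(normalized));
    }

    public static void main(String[] args) {
        String[] samples = {
            "never odd or even",
            "Madam, I'm Adam",
            "A man, a plan, a canal - Panama!",
            "RapidMiner"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isPalindrome(s));
        }
    }
}
```

Working on codepoints rather than on char values keeps characters outside the Basic Multilingual Plane intact when the string is reversed, which matches the behaviour of the XPath functions.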

You can download the examples above in a ZIP archive here.