Monthly Archives: April 2012

Source code syntax highlighting in LaTeX using Pygments

A number of LaTeX packages provide environments for displaying source code; a comprehensive list of them is available here. For a long time I have preferred the fancyvrb package for formatting source code. I have just found an excellent alternative called minted (it is also available on CTAN).

Minted uses Pygments, a general-purpose syntax highlighter written in Python. Python and Pygments have to be installed in order to use minted, and a few other LaTeX packages are also required. (See the documentation for detailed installation instructions.)

The following simple example shows how to use minted:

\documentclass{article}
\usepackage{minted}

\begin{document}

\begin{minted}[frame=single,linenos,mathescape]{java}
public class Fibonacci {

  // The golden ratio $\phi = \frac{1 + \sqrt{5}}{2}$.
  public static final double PHI = (1.0 + Math.sqrt(5.0)) / 2.0;

  public static double fibonacci(long n) {
    if (n < 0) throw new IllegalArgumentException();
    return Math.floor(Math.pow(PHI, n) / Math.sqrt(5.0) + 0.5);
  }

}
\end{minted}

\end{document}

Click here to see the result.

The language java in line 6 can be replaced with many other languages, such as c, c++, sql, tex or xml. Pygments currently supports more than 200 programming, template and markup languages; see the output of the command

pygmentize -L lexers

for an exhaustive list of them. LaTeX source files that use the minted package must be compiled with the -shell-escape option, because minted runs the external pygmentize program during compilation. For example:

pdflatex -shell-escape file.tex

Minted provides a number of options to customize formatting. For convenience, you can choose any of the styles provided with Pygments (you can also write your own style).

A major limitation of the package is that it supports only the Latin-1 character encoding. To overcome this problem, the documentation suggests using the xelatex command instead of pdflatex. (xelatex is part of XeTeX, an extension of TeX that comes with built-in support for Unicode.) Unfortunately, this solution does not work for me. If I try to compile the above LaTeX file with xelatex, I always get the following error:

! Undefined control sequence.
l.27 \ifnum\pdf@shellescape
                           =1\relax

Thanks to Jabba Laci, the great Pythonist, for introducing me to minted.

Readline style command line editing with JLine

JLine 2.x is a free and open source console I/O library written in Java and distributed under the Modified BSD License.

It offers line-editing and history capabilities for console applications that are similar to those provided by the GNU Readline library. For a complete list of its main features, see the wiki page of the project.

Since JLine is available from the Maven Central Repository, the easiest way to get it is to add the following dependency to your project’s POM:

<dependency>
    <groupId>jline</groupId>
    <artifactId>jline</artifactId>
    <version>2.6</version>
    <scope>compile</scope>
</dependency>

The following is a simple example that demonstrates how to use the library:

import java.io.IOException;

import jline.TerminalFactory;
import jline.console.ConsoleReader;

public class ConsoleDemo {

    public static void main(String[] args) {
        try {
            ConsoleReader console = new ConsoleReader();
            console.setPrompt("prompt> ");
            String line = null;
            // readLine() returns null when end-of-file is reached
            while ((line = console.readLine()) != null) {
                console.println(line);  // echo the line back to the console
            }
        } catch(IOException e) {
            e.printStackTrace();
        } finally {
            try {
                // restore the original terminal configuration
                TerminalFactory.get().restore();
            } catch(Exception e) {
                e.printStackTrace();
            }
        }
    }

}

The program uses the ConsoleReader class to read lines from the console until end-of-file is encountered (press control-D to signal end-of-file). Each line read is simply echoed back to the console. Command line history is enabled by default, so you can recall and edit previously entered lines.

JLine supports command line completion that is bound to the TAB key by default. For example, to enable automatic file name completion simply add a FileNameCompleter instance to the console object with the following line of code:

console.addCompleter(new FileNameCompleter());

You can add more completers, such as a StringsCompleter with a collection of strings:

console.addCompleter(
    new StringsCompleter(
        IOUtils.readLines(new GZIPInputStream(ConsoleDemo.class.getResourceAsStream("wordlist.txt.gz")))
    )
);

Here the words come from the gzip-compressed file wordlist.txt.gz, which is loaded from the classpath and read line by line with the IOUtils class from the Commons IO library.
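To make the idea behind completing against a collection of strings more concrete, here is a minimal sketch that uses only the standard library. It is not JLine's implementation; the class name PrefixCompletionSketch, the method candidates and the sample words are made up for illustration. It simply assumes that the candidates for an input are the known words starting with what has been typed so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration of prefix completion; not part of JLine.
class PrefixCompletionSketch {

    /** Returns all words that start with the given prefix, in sorted order. */
    static List<String> candidates(SortedSet<String> words, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet() starts at the first word >= prefix; in a sorted set the
        // words sharing a prefix are contiguous, so we can stop at the first miss.
        for (String word : words.tailSet(prefix)) {
            if (!word.startsWith(prefix)) {
                break;
            }
            matches.add(word);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> words =
                new TreeSet<>(Arrays.asList("help", "hello", "history", "quit"));
        System.out.println(candidates(words, "he"));  // prints [hello, help]
        System.out.println(candidates(words, "x"));   // prints []
    }
}
```

Keeping the words in a sorted set means a lookup does not have to scan the whole wordlist, which matters once the list grows large.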

Command line editing with JLine

The TerminalFactory.get().restore() call in the finally block does some cleanup and restores the original terminal configuration. This cleanup is performed automatically if the jline.shutdownhook system property is set to true.

It’s a bit odd that the API documentation is not available online. However, you can grab the javadoc in a JAR from Maven Central. It should also be noted that the API documentation could be better. (Some of the methods are completely undocumented.) Despite these minor shortcomings, it is an excellent library that deserves attention.

You can download the above example program from here.

Cross-validation in RapidMiner

Cross-validation is a standard statistical method to estimate the generalization error of a predictive model. In k-fold cross-validation the data set is divided into k subsets of (approximately) equal size. Then the following procedure is repeated for each subset: a model is built using the other (k - 1) subsets as the training set and its performance is evaluated on the current subset. This means that each subset is used for testing exactly once. The result of the cross-validation is the average of the performances obtained from the k rounds.
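The partitioning step can be sketched in a few lines of Java. This is only an illustration under my own assumptions: the class name KFoldSketch and its methods are made up, examples are identified by their index, and they are assigned to the folds round-robin. A real tool may shuffle or stratify the examples first, and the model building and evaluation steps are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of k-fold partitioning; not how any particular tool does it.
class KFoldSketch {

    /** Splits the indices 0..n-1 into k folds whose sizes differ by at most one. */
    static List<List<Integer>> folds(int n, int k) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            result.add(new ArrayList<>());
        }
        for (int index = 0; index < n; index++) {
            result.get(index % k).add(index);  // round-robin assignment (assumption)
        }
        return result;
    }

    /** Returns the indices of all folds except the one used for testing. */
    static List<Integer> trainingIndices(List<List<Integer>> folds, int testFold) {
        List<Integer> training = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != testFold) {
                training.addAll(folds.get(i));
            }
        }
        return training;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(16, 3);
        for (int round = 0; round < folds.size(); round++) {
            // Each fold serves as the test set exactly once.
            List<Integer> test = folds.get(round);
            List<Integer> training = trainingIndices(folds, round);
            System.out.printf("round %d: %d training, %d test examples%n",
                    round + 1, training.size(), test.size());
        }
    }
}
```

With 16 examples and k = 3 this sketch produces folds of 6, 5 and 5 examples, so every round trains on 10 or 11 examples and tests on the rest.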

This post explains how to interpret cross-validation results in RapidMiner. For demonstration purposes, we consider the following simple RapidMiner process that is available here:

The Read URL operator reads the yellow-small+adult-stretch.data file, a subset of the Balloons Data Set available from the UCI Machine Learning Repository. Since this data set contains only 16 examples, it is very easy to perform all calculations in your head.

The Set Role operator marks the last attribute as the one that provides the class labels. The number of validations is set to 3 on the X-Validation operator, which results in a 5-5-6 partitioning of the examples in our case.

In the training subprocess of the cross-validation process a decision tree classifier is built on the current training set. In the testing subprocess the accuracy of the decision tree is computed on the test set.

The result of the process is the following PerformanceVector:

The value 74.44 is obviously the arithmetic mean of the accuracies obtained from the three rounds, and 10.30 is their standard deviation. However, it is not clear how to interpret the confusion matrix below and the value labelled with the word mikro. You may ask how a single confusion matrix can be returned if several models are built and evaluated in the cross-validation process.

The Write as Text operator in the inner testing subprocess writes the performance vector of each round to a text file, which helps us understand the results above. The file contains the confusion matrices obtained from each round together with the corresponding accuracy values, as shown below:

13.04.2012 22:07:35 Results of ResultWriter 'Write as Text' [1]: 
13.04.2012 22:07:35 PerformanceVector:
accuracy: 60.00%
ConfusionMatrix:
True:	T	F
T:	0	0
F:	2	3

13.04.2012 22:07:35 Results of ResultWriter 'Write as Text' [2]: 
13.04.2012 22:07:35 PerformanceVector:
accuracy: 83.33%
ConfusionMatrix:
True:	T	F
T:	2	0
F:	1	3

13.04.2012 22:07:35 Results of ResultWriter 'Write as Text' [3]: 
13.04.2012 22:07:35 PerformanceVector:
accuracy: 80.00%
ConfusionMatrix:
True:	T	F
T:	2	1
F:	0	2

Notice that the confusion matrix on the PerformanceVector (Performance) tab is simply the sum of the three confusion matrices. The value labelled with the word mikro (75) is actually the accuracy computed from this aggregated confusion matrix: 12 of the 16 examples are classified correctly. A performance calculated this way is called the mikro average, while the mean of the per-round performances is called the makro average. Note that the confusion matrix behind the mikro average is constructed by evaluating different models on different test sets.
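The numbers can be checked with a short Java sketch that takes the three confusion matrices from the text file as input. The class and method names are my own, and the program only mirrors the arithmetic described here; it is not RapidMiner code. Note that the standard deviation divides by the number of rounds k rather than by k - 1, since that is what reproduces the reported 10.30.

```java
import java.util.Locale;

// Hypothetical sketch that recomputes the averages from the per-round matrices.
class AveragingSketch {

    /** Accuracy in percent of a 2x2 confusion matrix; correct predictions lie on the diagonal. */
    static double accuracy(int[][] m) {
        int correct = m[0][0] + m[1][1];
        int total = m[0][0] + m[0][1] + m[1][0] + m[1][1];
        return 100.0 * correct / total;
    }

    /** Makro average: the mean of the per-round accuracies. */
    static double makroAverage(int[][][] rounds) {
        double sum = 0.0;
        for (int[][] m : rounds) {
            sum += accuracy(m);
        }
        return sum / rounds.length;
    }

    /** Standard deviation of the per-round accuracies, dividing by k (not k - 1). */
    static double standardDeviation(int[][][] rounds) {
        double mean = makroAverage(rounds);
        double squares = 0.0;
        for (int[][] m : rounds) {
            double d = accuracy(m) - mean;
            squares += d * d;
        }
        return Math.sqrt(squares / rounds.length);
    }

    /** Mikro average: the accuracy of the element-wise sum of the matrices. */
    static double mikroAverage(int[][][] rounds) {
        int[][] total = new int[2][2];
        for (int[][] m : rounds) {
            for (int i = 0; i < 2; i++) {
                for (int j = 0; j < 2; j++) {
                    total[i][j] += m[i][j];
                }
            }
        }
        return accuracy(total);
    }

    public static void main(String[] args) {
        // Rows are predicted classes (T, F), columns are true classes (T, F).
        int[][][] rounds = {
            {{0, 0}, {2, 3}},
            {{2, 0}, {1, 3}},
            {{2, 1}, {0, 2}},
        };
        // prints: makro: 74.44 +/- 10.30, mikro: 75.00
        System.out.println(String.format(Locale.ROOT,
                "makro: %.2f +/- %.2f, mikro: %.2f",
                makroAverage(rounds), standardDeviation(rounds), mikroAverage(rounds)));
    }
}
```

The two averages differ because the test sets are not the same size: the mikro average weights every example equally, while the makro average weights every round equally.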
