Monthly Archives: July 2012

Breaking the silence: a note about the Apriori algorithm

It’s been a long time since my last post, here I am back again.

Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar [1] is a good introductory textbook in Data Mining. The book has been translated into Hungarian and will hopefully be published in my country this year. Actually, I am one of the translators of the Hungarian edition.

Apriori is a classic algorithm for mining association rules. Chapter 6 of the book discusses the Apriori algorithm. Unfortunately, I found that the pseudocodes for the rule generation step (see Algorithm 6.2 and 6.3 on pages 351 and 352) do not work as expected. These two pseudocodes are the following:

Here \sigma(X) denotes the support count of the itemset X and the function apriori-gen generates the set of frequent (k + 1)-itemsets from the set of frequent k-itemsets.

The main problem is that Algorithm 6.2 and 6.3 above will never generate rules with 1-item consequents. In the original paper that introduces the Apriori algorithm [2] the set H_1 on line 2 of Algorithm 6.2 is defined as the set of consequents of rules derived from f_k with one item in the consequent. However, this implicitly assumes that rules with 1-item consequents are already available. [2] also states that a separate algorithm is required to generate these rules (see page 14):

The rules having one-item consequents in step 2 of this algorithm can be found by using a modified version of the preceding genrules function in which steps 8 and 9 are deleted to avoid the recursive call.

It should also be noted that line 2 of Algorithm 6.2 is simply equivalent to H_1 = f_k.

Finally, the formula on line 2 of Algorithm 6.3 is misleading, since vertical bars are traditionally used to denote cardinality. (In our case m is not the cardinality of set H_m.) I think that the first two lines of Algorithm 6.3 are unnecessary and can be omitted.

Since the book is widely used as a textbook the above problems should be corrected. I have reported the problems to the authors, I hope that they will update the errata of the book accordingly.

Algorithm 6.2 and 6.3 can be modified as follows:

These modified algorithms work as expected and will generate all rules including the ones with 1-item consequents.

References

  1. Pang-Ning Tan, Michael Steinbach and Vipin Kumar. Introduction to Data Mining. Addison-Wesley, 2005.
  2. Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, September 1994.
Tagged