Mining association rules

I flickr photo by joannapoe shared under a Creative Commons (BY-SA) license

To find such rules, you would have to execute the rule-induction procedure once for every possible combination of attributes, with every possible combination of values, on the right-hand side. That would result in an enormous number of association rules, which would then have to be pruned down on the basis of their coverage (the number of instances that they predict correctly) and their accuracy (the same number expressed as a proportion of the number of instances to which the rule applies). This approach is quite infeasible. Instead, we capitalize on the fact that we are only interested in association rules with high coverage. We ignore, for the moment, the distinction between the left- and right-hand sides of a rule and seek combinations of attribute–value pairs that have a prespecified minimum coverage. These are called item sets: an attribute–value pair is an item.

Association rules

Once all item sets with the required coverage have been generated, the next step is to turn each into a rule, or set of rules, with at least the specified minimum accuracy. Some item sets will produce more than one rule; others will produce none.

Generating rules efficiently

The first stage proceeds by generating all one-item sets with the given
minimum coverage and then using this to generate the two-item sets, three-item sets (third column), and so on.
Each operation involves a pass through the dataset to count the items in each set, and after the pass the surviving item sets are stored in a hash table.  From the one-item sets, candidate two-item sets are generated, and then a pass is made through the dataset, counting the coverage of each two-item set; at the end the candidate sets with less than minimum coverage are removed from the table. The candidate two-item sets are simply all of the one-item sets taken in pairs, because a two-item set cannot have the minimum coverage unless both its constituent one-item sets have minimum coverage, too. This applies in general: a three-item set can only have the minimum coverage if all three of its two-item subsets have minimum coverage as well, and similarly for four-item sets.


Ian H. Witten, Eibe Frank. (1999). Data mining practical machine learning tools and techniques. Elsevier.

Covering algorithms: Constructing rules

In “God’s Own Country” flickr photo by shared under a Creative Commons (BY-ND) license

An alternative approach to a decision tree is to take each class in turn and seek a way of covering all instances in it, at the same time excluding instances not in the class. This is called a covering approach because at each stage you identify a rule that “covers” some of the instances. By its very nature, this covering approach leads to a set of rules rather than to a decision tree.

A difference between the covering algorithms in comparison with recursive divide and conquer decision trees is that, in the multi class case, a decision tree split takes all classes into account, trying to maximize the purity of the split, whereas the rule-generating method concentrates on one class at a time, disregarding what happens to the other classes.

A simple covering algorithm

Covering algorithms operate by adding tests to the rule that is under construction, always striving to create a rule with maximum accuracy. In contrast, divide and-conquer algorithms operate by adding tests to the tree that is under construction, always striving to maximize the separation among the classes. Each of these involves finding an attribute to split on. But the criterion for the best attribute is different in each case. Whereas divide-and-conquer algorithms such as ID3 choose an attribute to maximize the information gain, the covering algorithm we will describe chooses an attribute–value pair to maximize the probability of the desired classification.


The PRISM method for constructing rules. It generates only correct or “perfect” rules. It measures the success of a rule by the accuracy formula p/t. Any rule with accuracy less than 100% is “incorrect” in that it assigns cases to the class in question that actually do not have that class.

The outer loop iterates over the classes, generating rules for each class in turn. Note that we reinitialize to the full set of examples each time round. Then we create rules for that class and remove the examples from the set until there are none of that class left.  Whenever we create a rule, start with an empty rule (which covers all the examples), and then restrict it by adding tests until it covers only examples of the desired class. At each stage choose the most promising test, that is, the one that maximizes the accuracy of  the rule. Finally, break ties by selecting the test with greatest coverage.

Methods such as PRISM can be described as separate-and-conquer algorithms: you identify a rule that covers many instances in the class (and excludes ones not in the class), separate out the covered instances because they are already taken care of by the rule, and continue the process on those that are left. This contrasts nicely with the divide-and-conquer approach of decision trees. The separate step greatly increases the efficiency of the method because the instance set continually shrinks as the operation proceeds.


Ian H. Witten, Eibe Frank. (1999). Data mining practical machine learning tools and techniques. Elsevier.

Divide-and-conquer: Constructing decision trees

A Pollinator in Pink……..HFDF! flickr photo by The Manic Macrographer shared under a Creative Commons (BY) license

The problem of constructing a decision tree can be expressed recursively. First, select an attribute to place at the root node and make one branch for each possible value. This splits up the example set into subsets, one for every value of the attribute. Now the process can be repeated recursively for each branch, using only those instances that actually reach the branch. If at any time all instances at a node have the same classification, stop developing that part of the tree.

The only thing left to decide is how to determine which attribute to split on, given a set of examples with different classes. If we had a measure of the purity of each node, we could choose the attribute that produces the purest daughter nodes. The measure of purity that we will use is called the information and is measured in units called bits. Associated with a node of the tree, it represents the expected amount of information that would be needed to specify whether a new instance should be classified yes or no, given that the example reached that node.



For outlook We can calculate the average information value of these, taking into account the number of instances that go down each branch—five down the first and third and four down the second:


This average represents the amount of information that we expect would be necessary to specify the class of a new instance, given the tree structure in Figure 4.2(a) Before we created any of the nascent tree structures in Figure 4.2, the training examples at the root comprised  line yes and five no nodes, corresponding to an information value of


Thus the tree in Figure is responsible for an information gain of


The way forward is clear. We calculate the information gain for each attribute and choose the one that gains the most information to split on. In the situation of Figure 4.2,


Calculating information

Before examining the detailed formula for calculating the amount of information required to specify the class of an example given that it reaches a tree node with a certain number of yes’s and no’s, consider first the kind of properties we would expect this quantity to have:


1. When the number of either yes’s or no’s is zero, the information is zero.

2. When the number of yes’s and no’s is equal, the information reaches a maximum

3. The information must obey the multistage property illustrated previously.

Remarkably, it turns out that there is only one function that satisfies all these properties, and it is known as the information value or entropy:


The arguments p1, p2, . . . of the entropy formula are expressed as fractions that add up to one, so that, for example, Info([2,3,4]) = entropy(2/3,3/9,4/9) Thus the multistage decision property can be written in general as


Because of the way the log function works, you can calculate the information measure   without having to work out the individual fractions:


The method that has been described using the information gain criterion is essentially the same as one known as ID3. (

A series of improvements to ID3 culminated in a practical and influential system for decision tree induction called C4.5 ( These improvements include methods for dealing with numeric attributes, missing values, noisy data, and generating rules from trees.



Ian H. Witten, Eibe Frank. (1999). Data mining practical machine learning tools and techniques. Elsevier

Algorithms: The Basic Methods 1R and Statistical modeling

In the following 2 post I’m gonna go around the 9 most common data mining algorithms.

Inferring rudimentary rules

Here’s an easy way to find very simple classification rules from a set of instances. Called 1R for 1-rule, it generates a one-level decision tree expressed in the form of a set of rules that all test one particular attribute. The idea is this: we make rules that test a single attribute and branch accordingly. Each branch corresponds to a different value of the attribute. It is obvious what is the best classification to give each branch: use the class that occurs most often in the training data. Then the error rate of the rules can easily be determined. Just count the errors that occur on the training data, that is, the number of instances that do not have the majority class.


Missing values and numeric attributes

The good! Although a very rudimentary learning method, 1R does accommodate both missing values and numeric attributes. It deals with these in simple but effective ways. Missing is treated as just another attribute value so that, for example, if the weather data had contained missing values for the outlook attribute, a rule set formed on outlook would specify four possible class values, one each for sunny, overcast, and rainy and a fourth for missing.

The bad!

this procedure tends to form a large number of categories. The 1R method will naturally gravitate toward choosing an attribute that splits into many categories, because this will partition the dataset into many classes, making it more likely that instances will have the same class as the majority in their partition. This phenomenon is known as overfitting. For 1R, overfitting is likely to occur whenever an attribute has a large number of possible values

Surprisingly, despite its simplicity 1R did astonishingly—even embarrassingly—well in comparison with state-of-the-art learning methods, and the rules it produced turned out to be just a few percentage points less accurate, on almost all of the datasets, than the decision trees produced by a state-of-the-art decision tree induction scheme.

Statistical modeling

Another simple technique is to use all attributes and allow them to make contributions to the decision that are equally important and independent of one another, given the class.


shows a summary of the weather data obtained by counting how many times each attribute–value pair occurs with each value (yes and no) for play.

In the lower part of the table, we rewrote the same information in the form of fractions, or observed probabilities. For example, of the nine days that play is yes, outlook is sunny for two, yielding a fraction of 2/9. For play the fractions are different: they are the proportion of days that play is yes and no, respectively Now suppose we encounter a new example with the values that are shown in The first table We treat the five features in outlook, temperature, humidity, windy, and the overall likelihood that play is yes or no—as equally important, independent pieces of evidence and multiply the corresponding fractions.

likelihood of yes = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053

likelihood of no = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206


This method goes by the name of Naïve Bayes, because it’s based on Bayes’s rule and “naïvely” assumes  independence—it is only valid to multiply probabilities when the events are independent. The assumption that attributes are independent (given the class) in real life certainly is a simplistic one.

The good!

Despite the disparaging name, Naïve Bayes works very well when tested on actual datasets, particularly when combined with some of the attribute selection procedures

The bad!

One thing that can go wrong with Naïve Bayes is that if a particular attribute value does not occur in the training set in conjunction with every class value, things go badly awry. Although But the bug is easily fixed by minor adjustments to the method of calculating pro For example, the upper part of Table 4.2 shows that for play = yes, outlook is sunny for two examples, overcast for four, and rainy for three, and the lower part gives these events probabilities of 2/9, 4/9, and 3/9, respectively. Instead, we could add 1 to each numerator and compensate by adding 3 to the denominator, giving probabilities of 3/12, 5/12, and 4/12, respectively. This will ensure that an attribute value that occurs zero times receives a probability which is nonzero, albeit small. The strategy of adding 1 to each count is a standard technique called the Laplace estimator after the great eighteenth-century French mathematician Pierre Laplace. Although it works well in practice, there is no particular reason for adding 1 to the counts: we could instead choose a small constant μ and use probabilities from frequencies.


Finally, there is no particular reason for dividing m into three equal parts in the numerators: we could use


In practice, the prior probabilities make little difference provided that there are a reasonable number of training instances, and people generally just estimate frequencies using the Laplace estimator by initializing all counts to one instead of to zero.

Missing values and numeric attributes

One of the really nice things about the Bayesian formulation is that missing values are no problem at all. For example, if the value of outlook were missing in the example of Table 4.3, the calculation would simply omit this attribute, yielding Numeric values are usually handled by assuming that they have a “normal”or “Gaussian”  probability distribution. Then, whereas we normalized the counts for the nominal attributes into probabilities, we calculated the mean and standard deviation for each class and each numeric attribute. Thus the mean value of temperature over the yes instances is 73, and its standard deviation is 6.2. The mean is simply the average of the preceding values, that is, the sum divided by the number of values. The standard deviation is the square root of the sample variance, which we can calculate as follows: subtract the mean from each value, square the result, sum them together, and then divide by one less than the number of values.



Naïve Bayes gives a simple approach, with clear semantics, to representing, using, and learning probabilistic knowledge. Impressive results can be achieved using it. It has often been shown that Naïve Bayes rivals, and indeed outperforms, more sophisticated classifiers on many datasets. The moral is, always try the simple things first. Repeatedly in machine learning people have eventually, after an extended struggle, obtained good results using sophisticated learning methods only to discover years later that simple methods such as 1R and Naïve

Bayes do just as well—or even better.

here are many datasets for which Naïve Bayes does not do so well, however, and it is easy to see why. Because attributes are treated as though they were completely independent, the addition of redundant ones skews the learning process.


Ian H. Witten, Eibe Frank. (1999). Data mining practical machine learning tools and techniques. Elsevier

Output: Knowledge Representation

Decision trees

A “divide-and-conquer” approach to the problem of learning from a set of independent instances leads naturally to a style of representation called a decision tree.

If the attribute that is tested at a node is a nominal one, the number of children is usually the number of possible values of the attribute. In this case, because there is one branch for each possible value, the same attribute will not be retested further down the tree.

If the attribute is numeric, the test at a node usually determines whether its value is greater or less than a predetermined constant, giving a two-way split.

The bad!

Missing values pose an obvious problem. It is not clear which branch should be taken when a node tests an attribute whose value is missing. Sometimes, as described in Section 2.4, missing value is treated as an attribute value in its own right. If this is not the case, missing values should be treated in a special way rather than being considered as just another possible value that the attribute might take. A simple solution is to record the number of elements in the training set that go down each branch and to use the most popular branch if the value for a test instance is missing.


Classification rules

Generally, the preconditions are logically ANDed together, and all the tests must succeed if the rule is to fire. However, in some rule formulations the preconditions are general logical expressions rather than simple conjunctions. More difficult: transforming a rule set into tree cannot easily express disjunction between rules Example: rules which test different attributes Symmetry needs to be broken the  replicated subtree problem

One reason why rules are popular is that each rule seems to represent an independent “nugget” of knowledge. New rules can be added to an existing rule set without disturbing ones already there, whereas to add to a tree structure may require reshaping the whole tree.

Association rules

Association rules are really no different from classification rules except that they can predict any attribute, not just the class, and this gives them the freedom to predict combinations of attributes too.

  • The coverage of an association rule is the number of instances for which it predicts correctly—this is often called its support
  • Its accuracy—often called confidence—is the number of instances that it predicts correctly, expressed as a proportion of all instances to which it applies.

Rules with exceptions

Returning to classification rules, a natural extension is to allow them to have exceptions. Then incremental modifications can be made to a rule set by expressing exceptions to existing rules rather than re engineering the entire set.

Rules involving relations

rules are called propositional because the attribute-value language used to define them has the same power as what logicians call the propositional calculus.

Trees for numeric prediction

Because statisticians use the term regression for the process of computing an expression that predicts a numeric quantity, decision trees with averaged numeric values at the leaves are called regression trees.

Instance-based representation

The simplest form of learning is plain memorization, or rote learning Once aset of training instances has been memorized, on encountering a new instance the memory is searched for the training instance that most strongly resembles the new one.

This is known as instance-based learning. In a sense all the other learning methods are “instance-based,” too, because we always start with a set of instances as the initial training information. But the instance-based knowledge representation uses the instances themselves to represent what is learned, rather than inferring a rule set or decision tree and storing it instead. In instance-based learning, all the real work is done when the time comes to classify a new instance rather than when the training set is processed.

n instance-based learning, each new instance is compared with existing ones using a distance metric, and the closest existing instance is used to assign the class to the new one. This is called the nearest-neighbor classification method.  Sometimes more than one nearest neighbor is used, and the majority class of the closest k neighbors (or the distance-weighted average, if the class is numeric) is assigned to the new instance. This is termed the k-nearest-neighbor method.


When clusters rather than a classifier is learned, the output takes the form of a diagram that shows how the instances fall into clusters. In the simplest case this involves associating a cluster number with each instance, which might be depicted by laying the instances out in two dimensions and partitioning the space to show each cluster.

Some algorithms associate instances with clusters probabilistically rather than categorically. In this case, for every instance there is a probability or degree of membership with which it belongs to each of the clusters. This is shown in Figure 3.9(c). Figure 3.9(d) is used, in which elements joined together at lower levels are more tightly clustered than ones joined together at higher levels. Diagrams such as this are called dendrograms. This term means

just the same thing as tree diagrams (the Greek word dendron means “a tree”), but in clustering the more exotic version seems to be preferred—perhaps because biologic species are a prime application area for clustering techniques, and ancient languages are often used for naming in biology.




Ian H. Witten, Eibe Frank. (1999). Data mining practical machine learning tools and techniques. Elsevier.

Input: Concepts, Instances, and Attributes

The input takes the form of concepts, instances, and attributes. We call the thing that is to be learned a concept description.

  • The information that the learner is given takes the form of a set of instances

Four basically different styles of learning appear in data mining applications.

  • classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples.
  • In association learning, any association among features is sought, not just ones that predict a particular class value. In clustering, groups of examples that belong together are sought
  • In numeric prediction, the outcome to be predicted is not a discrete class but a numeric quantity.

Regardless of the type of learning involved, we call the thing to be learned the concept and the output produced by a learning scheme the concept description.

Classification learning is sometimes called supervised because, in a sense, the method operates under supervision by being provided with the actual outcome for each of the training examples

This outcome is called the class of the example.

When there is no specified class, clustering is used to group items that seem to fall naturally together.

The success of clustering is often measured subjectively in terms of how useful the result appears to be to a human user. It may be followed by a second step of classification learning in which rules are learned that give an

intelligible description of how new instances should be placed into the clusters.

What’s in an example?

These instances are the things that are to be classified, associated, or clustered. Although until now we have called them examples, henceforth we will use the more specific term instances to refer to the input. Each instance is an individual, independent example of the concept to be learned.

The input to a data mining scheme is generally expressed as a table of independent instances of the concept to be learned. Because of this, it has been suggested, disparagingly, that we should really talk of file mining rather than database mining. Relational data is more complex than a flat file. A finite set of finite relations can always be recast into a single table, although often at enormous cost in space. Moreover, denormalization can generate spurious regularities in the data, and it is essential to check the data for such artifacts before applying a learning method.

What’s in an attribute?

Each individual, independent instance that provides the input to machine learning is characterized by its values on a fixed, predefined set of features or attributes.

The value of an attribute for a particular instance is a measurement of the quantity to which the attribute refers.

Most practical data mining systems accommodate just two of these four levels of measurement: nominal and ordinal. Nominal attributes are sometimes called categorical, enumerated, or discrete Ordinal attributes are generally called numeric, or perhaps continuous, but without the implication of mathematical continuity.


Preparing the input

The idea of company wide database integration is known as data warehousing.

Sometimes called overlay data, this is not normally collected by an organization but is clearly relevant to the data mining problem. It, too, must be cleaned up and integrated with the other data that has been collected.

ARFF format

a standard way of representing datasets that consist of independent, unordered  instances and do not involve relationships among instances, called an ARFF file.

Attribute types

ARFF files accommodate the two basic data types, nominal and numeric.

Attributes are often normalized to lie in a fixed range, say, from zero to one, by dividing all values by the maximum value encountered or by subtracting the minimum value and  dividing by the range between the maximum and the minimum values. Another normalization technique is to calculate the statistical mean and standard deviation of the attribute values, subtract the mean from each value, and divide the result by the standard deviation. This process is called standardizing a statistical variable and results in a set of values whose mean is zero and standard deviation is one.

Missing values

You have to think carefully about the significance of missing values. They may occur for several reasons, such as malfunctioning measurement equipment, changes in experimental design during data collection, and collation of several similar but not identical datasets. Respondents in a survey may refuse to answer certain questions such as age or income. In an archaeological study, a specimen such as a skull may be damaged so that some variables cannot be measured.

Inaccurate values

Typographic errors in a dataset will obviously lead to incorrect values Duplicate data presents another source of error. People often make deliberate errors when entering personal data into databases.


Ian H. Witten, Eibe Frank. (1999). Data mining practical machine learning tools and techniques. Elsevier.

Introduction to data mining

What is data mining.

In data mining, the data is stored electronically and the search is automated— or at least augmented—by computer.

Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage. The data is invariably present in substantial quantities.

How are the patterns expressed? Useful patterns allow us to make nontrivial predictions on new data. There are two extremes for the expression of a pattern:

  • as a black box whose innards are effectively incomprehensible and as a
  • transparent box whose construction reveals the structure of the pattern.

Such patterns we call structural because they capture the decision structure in an explicit way

Machine learning

Things learn when they change their behavior in a way that makes them perform better in the future.

domain knowledge

Market basket analysis is the use of association techniques to find groups of items that tend to occur together in transactions, typically supermarket checkout data.

What’s the difference between machine learning and statistics? Cynics, looking wryly at the explosion of commercial interest (and hype) in this area, equate data mining to statistics plus marketing. In truth, you should not look for a dividing line between machine learning and statistics because there is a continuum—and a multidimensional one at that—of data analysis techniques. Some derive from the skills taught in standard statistics courses, and others are more closely associated with the kind of machine learning that has arisen out of computer science. Historically, the two sides have had rather different traditions. If forced to point to a single difference of emphasis, it might be that statistics has been more concerned with testing hypotheses, whereas machine learning has been more concerned with formulating the process of generalization as a search through possible hypotheses. But this is a gross oversimplification: statistics is far more than hypothesis testing, and many machine learning techniques do not involve any searching at all.


Ian H. Witten, Eibe Frank. (1999). Data mining practical machine learning tools and techniques. Elsevier.

A new machine learning challenge for the upcoming semester.

As the semester starts a new challenge has appeared, I was assigned with the task of creating a software capable of filtering information in the abstracts of research papers for the purpose of classifying them and creating a network of people that are working more or less in topics in the same area, as it seems that the universities lack founding for every single researcher since research investigation have grown lots in the last couple of decades, a software capable of aggregating professionals with the same interests could potentially reduce research costs. In the upcoming posts I’ll summaries my research toward my machine learning studies, findings and understandings.