Clustering

 
Fair Isle flickr photo by neil1877 shared under a Creative Commons (BY-NC-ND) license

Clustering techniques apply when there is no class to be predicted but rather when the instances are to be divided into natural groups. These clusters presumably reflect some mechanism at work in the domain from which instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to each other than they do to the remaining instances.

The groups that are identified may be exclusive so that any instance belongs in only one group. Or they may be overlapping so that an instance may fall into several groups. Or they may be probabilistic, whereby an instance belongs to each group with a certain probability.

Iterative distance-based clustering

The classic clustering technique is called k-means. First, you specify in advance how many clusters are being sought: this is the parameter k. Then k points are chosen at random as cluster centers. All instances are assigned to their closest cluster center according to the ordinary Euclidean distance metric. Next the centroid, or mean, of the instances in each cluster is calculated—this is the “means” part. These centroids are taken to be new center values for their respective clusters. Finally, the whole process is repeated with the new cluster centers. Iteration continues until the same points are assigned to each cluster in consecutive rounds, at which stage the cluster centers have stabilized and will remain the same forever.
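As an illustration, here is a minimal sketch of that loop in Python with NumPy. The toy dataset, the choice of k, and the use of center stability as the stopping test are assumptions made for this example, not part of the book's presentation.

```python
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0)):
    """Cluster the rows of X into k groups with plain k-means."""
    # Choose k points at random as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Assign every instance to its closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Recompute each center as the centroid (mean) of its instances;
        # an empty cluster simply keeps its old center.
        new_centers = np.array([
            X[assignment == j].mean(axis=0) if np.any(assignment == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move, i.e. the assignments have stabilized.
        if np.allclose(new_centers, centers):
            return assignment, centers
        centers = new_centers

# Two obvious groups in the plane.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels, centers = k_means(X, k=2)
print(labels)
print(centers)
```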

Faster distance calculations

k-means can be slow because every iteration compares each instance with every cluster center, and several shortcuts have been proposed. For example, you can project the dataset and make cuts along selected axes, instead of using the arbitrary hyperplane divisions that are implied by choosing the nearest cluster center. But this inevitably compromises the quality of the resulting clusters.

Here’s a better way of speeding things up. Finding the closest cluster center is not so different from finding nearest neighbors in instance-based learning. Can the same efficient solutions—kD-trees and ball trees—be used? Yes! Indeed they can be applied in an even more efficient way, because in each iteration of k-means all the data points are processed together, whereas in instance-based learning test instances are processed individually.
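To make the idea concrete, the sketch below uses SciPy's cKDTree to find the closest center for every instance in one batched query. This is only an illustration of using a spatial tree instead of the naive all-pairs distance loop; it is not the exact scheme the book describes (which stores the data points themselves in the tree and processes whole nodes at once), and the random data and number of centers are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))      # instances
centers = rng.normal(size=(20, 3))    # current cluster centers

# Build a kD-tree over the centers and query it with every instance:
# one call replaces 10,000 x 20 explicit distance computations.
tree = cKDTree(centers)
dist, nearest = tree.query(X)         # index of the closest center per instance
print(nearest[:10])
```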

 

Bibliography

Ian H. Witten, Eibe Frank. (1999). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.


Instance-based learning

 
house flickr photo by barnyz shared under a Creative Commons (BY-NC-ND) license

In instance-based learning the training examples are stored verbatim, and a distance function is used to determine which member of the training set is closest to an unknown test instance. Once the nearest training instance has been located, its class is predicted for the test instance.

Distance function

Although there are other possible choices, most instance-based learners use Euclidean distance.

The distance between an instance with attribute values $a_1, a_2, \ldots, a_k$ and one with values $b_1, b_2, \ldots, b_k$ is

$\sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_k - b_k)^2}$

When comparing distances it is not necessary to perform the square root operation; the sums of squares can be compared directly.

Different attributes are measured on different scales, so if the Euclidean distance formula were used directly, the effects of some attributes might be completely dwarfed by others that had larger scales of measurement. Consequently, it is usual to normalize all attribute values to lie between 0 and 1, by calculating

$a_i = \frac{v_i - \min v_i}{\max v_i - \min v_i}$

where $v_i$ is the actual value of attribute $i$ and the maximum and minimum are taken over all instances in the training set.
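The following sketch puts the pieces together: min-max normalization as above, then a 1-nearest-neighbor prediction that compares squared distances so the square root is never taken. The toy data and query point are assumptions.

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each attribute (column) of X to lie between 0 and 1."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant attributes
    return (X - lo) / span, lo, span

def one_nn_predict(X_train, y_train, x):
    """Predict the class of x from its single nearest training instance."""
    # Squared Euclidean distance is enough: omitting the square root never
    # changes which instance is closest.
    sq_dists = ((X_train - x) ** 2).sum(axis=1)
    return y_train[sq_dists.argmin()]

# Two numeric attributes on very different scales.
X = np.array([[1.0, 1000.0], [2.0, 1100.0], [9.0, 5000.0], [10.0, 5100.0]])
y = np.array(["a", "a", "b", "b"])

X_norm, lo, span = min_max_normalize(X)
query = (np.array([1.5, 1050.0]) - lo) / span
print(one_nn_predict(X_norm, y, query))   # -> "a"
```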

Nearest-neighbor instance-based learning is simple and often works very well. One drawback is that, in the method described previously, each attribute has exactly the same influence on the decision, just as it does in the Naïve Bayes method. Another problem is that the database can easily become corrupted by noisy exemplars.

 

Bibliography

Ian H. Witten, Eibe Frank. (1999). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

Linear models


bricks. flickr photo by eaortmann shared under a Creative Commons (BY-NC-ND) license

When the outcome, or class, is numeric, and all the attributes are numeric, linear regression is a natural technique to consider. This is a staple method in statistics. The idea is to express the class as a linear combination of the attributes, with predetermined weights:

$x = w_0 + w_1 a_1 + w_2 a_2 + \cdots + w_k a_k$

where $x$ is the class, $a_1, a_2, \ldots, a_k$ are the attribute values, and $w_0, w_1, \ldots, w_k$ are the weights.

Linear regression is an excellent, simple method for numeric prediction, and it has been widely used in statistical applications for decades. Of course, linear models suffer from the disadvantage of, well, linearity. If the data exhibits a nonlinear dependency, the best-fitting straight line will be found, where “best” is interpreted as the least mean-squared difference.
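A minimal sketch of fitting such a model by least squares with NumPy; the synthetic data and true weights are invented for the example.

```python
import numpy as np

# Synthetic numeric data: the class is roughly 1 + 2*a1 + 3*a2 plus noise.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))
x = 1.0 + 2.0 * A[:, 0] + 3.0 * A[:, 1] + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so that w0 acts as the intercept.
A1 = np.column_stack([np.ones(len(A)), A])

# Least squares: the weights that minimize the mean-squared difference.
w, *_ = np.linalg.lstsq(A1, x, rcond=None)
print(w)                 # approximately [1.0, 2.0, 3.0]
predictions = A1 @ w     # predicted class values for the training instances
```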

Linear classification: Logistic regression

We can use any regression technique, whether linear or nonlinear, for classification. The trick is to perform a regression for each class, setting the output equal to one for training instances that belong to the class and zero for those that do not. The result is a linear expression for the class. Then, given a test example of unknown class, calculate the value of each linear expression and choose the one that is largest. This method is sometimes called multiresponse linear regression.

One way of looking at multiresponse linear regression is to imagine that it approximates a numeric membership function for each class. The membership function is 1 for instances that belong to that class and 0 for other instances. Given a new instance we calculate its membership for each class and select the biggest.
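A small sketch of this idea, under the assumption of numeric attributes and a handful of made-up instances: one least-squares fit per class against a 0/1 membership target, and prediction by the largest linear expression.

```python
import numpy as np

def fit_multiresponse(A, classes):
    """One least-squares weight vector per class, with 0/1 membership as target."""
    A1 = np.column_stack([np.ones(len(A)), A])           # bias column
    W = {}
    for c in sorted(set(classes)):
        target = (np.array(classes) == c).astype(float)  # 1 for this class, else 0
        W[c], *_ = np.linalg.lstsq(A1, target, rcond=None)
    return W

def predict(W, a):
    a1 = np.concatenate([[1.0], a])
    # Choose the class whose linear expression evaluates largest.
    return max(W, key=lambda c: a1 @ W[c])

A = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
classes = ["no", "no", "yes", "yes"]
W = fit_multiresponse(A, classes)
print(predict(W, np.array([0.95, 1.0])))   # expected: "yes"
```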

Drawbacks

  • First, the membership values it produces are not proper probabilities because they can fall outside the range 0 to 1.
  • Second, least-squares regression assumes that the errors are not only statistically independent but also normally distributed with the same standard deviation, an assumption that is blatantly violated when the method is applied to classification problems, because the observations only ever take on the values 0 and 1.

A related statistical technique called logistic regression does not suffer from these problems. Instead of approximating the 0 and 1 values directly, thereby risking illegitimate probability values when the target is overshot, logistic regression builds a linear model based on a transformed target variable.
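Concretely (a standard formulation, stated here for completeness rather than quoted from the book), with Pr denoting the probability that the instance belongs to the class given its attribute values, logistic regression models the log-odds as a linear function and recovers the probability with the logistic function:

$\log\frac{\Pr}{1-\Pr} = w_0 a_0 + w_1 a_1 + \cdots + w_k a_k, \qquad \Pr = \frac{1}{1 + e^{-(w_0 a_0 + w_1 a_1 + \cdots + w_k a_k)}}$

The transformed target can take any real value, so a linear model no longer produces illegitimate probabilities.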

Linear classification using the perceptron

To learn a hyperplane that separates the instances pertaining to the different classes, let's assume that there are only two of them. If the data can be separated perfectly into two groups using a hyperplane, it is said to be linearly separable. It turns out that if the data is linearly separable, there is a very simple algorithm for finding a separating hyperplane.

The algorithm is called the perceptron learning rule.

$w_0 a_0 + w_1 a_1 + w_2 a_2 + \cdots + w_k a_k = 0$

Here, a1, a2, . . ., ak are the attribute values, and w0, w1, . . ., wk are the weights that define the hyperplane. We will assume that each training instance a1, a2, . . . is extended by an additional attribute a0 that always has the value 1. This extension, which is called the bias, just means that we don't have to include an additional constant element in the sum.

Set all weights to zero
Until all instances in the training data are classified correctly
    For each instance I in the training data
        If I is classified incorrectly by the perceptron
            If I belongs to the first class, add it to the weight vector
            Otherwise, subtract it from the weight vector

Of course, if the data is not linearly separable, the algorithm will not terminate, so an upper bound needs to be imposed on the number of iterations when this method is applied in practice.
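A minimal runnable version of the rule above, with labels coded as +1 and -1 and a cap on the number of passes (since, as just noted, the loop need not terminate on non-separable data); the toy data is an assumption.

```python
import numpy as np

def perceptron(X, y, max_iters=100):
    """Learn a separating hyperplane with the perceptron rule (y in {+1, -1})."""
    Xb = np.column_stack([np.ones(len(X)), X])   # bias attribute a0 = 1
    w = np.zeros(Xb.shape[1])                    # start with all weights zero
    for _ in range(max_iters):                   # upper bound on iterations
        mistakes = 0
        for a, label in zip(Xb, y):
            if label * (w @ a) <= 0:             # instance is misclassified
                w += label * a                   # add or subtract the instance
                mistakes += 1
        if mistakes == 0:                        # all instances classified correctly
            break
    return w

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))   # weights of a separating hyperplane
```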

Linear classification using Winnow

Winnow only updates the weight vector when a misclassified instance is encountered; it is mistake driven.

The two methods differ in how the weights are updated. The perceptron rule employs an additive mechanism that alters the weight vector by adding (or subtracting) the instance's attribute vector. Winnow employs multiplicative updates and alters weights individually by multiplying them by a user-specified parameter α (or its inverse).

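A sketch of a Winnow-style learner for binary (0/1) attributes follows. The threshold θ, the value of α, and the toy data are assumptions made for the example; the essential point is that updates happen only on mistakes and are multiplicative.

```python
import numpy as np

def winnow(X, y, alpha=2.0, theta=1.5, max_iters=100):
    """Mistake-driven Winnow: multiplicative updates on 0/1 attributes."""
    w = np.ones(X.shape[1])                    # weights start at 1
    for _ in range(max_iters):
        mistakes = 0
        for a, label in zip(X, y):
            predicted = (w @ a) > theta
            if predicted == bool(label):
                continue                       # correct: no update (mistake driven)
            mistakes += 1
            active = a == 1                    # only attributes that are "on" change
            if label:                          # missed a positive: multiply by alpha
                w[active] *= alpha
            else:                              # false alarm: divide by alpha
                w[active] /= alpha
        if mistakes == 0:
            break
    return w

# Target concept: class is true iff the first attribute is 1.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([True, True, False, False])
print(winnow(X, y))
```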

 

Bibliography

Ian H. Witten, Eibe Frank. (1999). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

Mining association rules


I flickr photo by joannapoe shared under a Creative Commons (BY-SA) license

Association rules differ from classification rules in that they can predict any attribute, not just the class. To find such rules, you would have to execute the rule-induction procedure once for every possible combination of attributes, with every possible combination of values, on the right-hand side. That would result in an enormous number of association rules, which would then have to be pruned down on the basis of their coverage (the number of instances that they predict correctly) and their accuracy (the same number expressed as a proportion of the number of instances to which the rule applies). This approach is quite infeasible. Instead, we capitalize on the fact that we are only interested in association rules with high coverage. We ignore, for the moment, the distinction between the left- and right-hand sides of a rule and seek combinations of attribute–value pairs that have a prespecified minimum coverage. These are called item sets: an attribute–value pair is an item.

Association rules

Once all item sets with the required coverage have been generated, the next step is to turn each into a rule, or set of rules, with at least the specified minimum accuracy. Some item sets will produce more than one rule; others will produce none.

Generating rules efficiently

The first stage proceeds by generating all one-item sets with the given minimum coverage and then using them to generate the two-item sets, three-item sets, and so on. Each operation involves a pass through the dataset to count the items in each set, and after the pass the surviving item sets are stored in a hash table. From the one-item sets, candidate two-item sets are generated, and then a pass is made through the dataset, counting the coverage of each two-item set; at the end the candidate sets with less than minimum coverage are removed from the table. The candidate two-item sets are simply all of the one-item sets taken in pairs, because a two-item set cannot have the minimum coverage unless both its constituent one-item sets have minimum coverage, too. This applies in general: a three-item set can only have the minimum coverage if all three of its two-item subsets have minimum coverage as well, and similarly for four-item sets.
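The sketch below illustrates this level-wise generation on a tiny, made-up set of instances; the minimum coverage value is also an assumption. It keeps only item sets meeting the minimum coverage at each level and builds the next level's candidates from the survivors.

```python
from itertools import combinations

# Each instance is the set of attribute-value pairs ("items") that hold for it.
instances = [
    {"outlook=sunny", "windy=false", "play=no"},
    {"outlook=sunny", "windy=true", "play=no"},
    {"outlook=overcast", "windy=false", "play=yes"},
    {"outlook=rainy", "windy=false", "play=yes"},
    {"outlook=rainy", "windy=true", "play=no"},
]
min_coverage = 2

def coverage(item_set):
    """Number of instances that contain every item in the set."""
    return sum(item_set <= inst for inst in instances)

# One-item sets with minimum coverage.
items = {item for inst in instances for item in inst}
current = [frozenset([i]) for i in items if coverage(frozenset([i])) >= min_coverage]

frequent = list(current)
while current:
    # Candidate (k+1)-item sets are unions of surviving k-item sets that share
    # all but one item; a candidate can only reach minimum coverage if its
    # subsets did, so nothing is lost by building them this way.
    candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == len(a) + 1}
    current = [c for c in candidates if coverage(c) >= min_coverage]
    frequent.extend(current)

for item_set in frequent:
    print(sorted(item_set), coverage(item_set))
```

Turning the surviving item sets into rules is then a matter of trying each way of splitting an item set into a left- and right-hand side and keeping the splits whose accuracy meets the specified minimum.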


Bibliography

Ian H. Witten, Eibe Frank. (1999). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

Covering algorithms: Constructing rules


In “God’s Own Country” flickr photo by www.davidbaxendale.com shared under a Creative Commons (BY-ND) license

An alternative approach to a decision tree is to take each class in turn and seek a way of covering all instances in it, at the same time excluding instances not in the class. This is called a covering approach because at each stage you identify a rule that “covers” some of the instances. By its very nature, this covering approach leads to a set of rules rather than to a decision tree.

A difference between covering algorithms and the recursive divide-and-conquer approach of decision trees is that, in the multiclass case, a decision tree split takes all classes into account, trying to maximize the purity of the split, whereas the rule-generating method concentrates on one class at a time, disregarding what happens to the other classes.

A simple covering algorithm

Covering algorithms operate by adding tests to the rule that is under construction, always striving to create a rule with maximum accuracy. In contrast, divide-and-conquer algorithms operate by adding tests to the tree that is under construction, always striving to maximize the separation among the classes. Each of these involves finding an attribute to split on. But the criterion for the best attribute is different in each case. Whereas divide-and-conquer algorithms such as ID3 choose an attribute to maximize the information gain, the covering algorithm we will describe chooses an attribute–value pair to maximize the probability of the desired classification.


PRISM is a method for constructing rules that generates only correct or "perfect" rules. It measures the success of a rule by the accuracy formula p/t: of the t instances the rule covers, p actually belong to the class in question. Any rule with accuracy less than 100% is "incorrect" in that it assigns cases to the class in question that actually do not have that class.

The outer loop iterates over the classes, generating rules for each class in turn. Note that we reinitialize to the full set of examples each time round. Then we create rules for that class, removing the examples each new rule covers, until no examples of that class are left. Whenever we create a rule, we start with an empty rule (which covers all the examples) and then restrict it by adding tests until it covers only examples of the desired class. At each stage we choose the most promising test, that is, the one that maximizes the accuracy of the rule. Finally, ties are broken by selecting the test with greatest coverage.
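Below is a hedged sketch of that loop in Python, a simplification rather than the book's own PRISM code: nominal attributes only, with accuracy p/t as the test-selection score and coverage p as the tie-breaker. The toy dataset is an assumption.

```python
def prism(instances, class_attr="class"):
    """Generate 'perfect' rules for every class, PRISM-style (simplified)."""
    attributes = [a for a in instances[0] if a != class_attr]
    rules = []
    for cls in {inst[class_attr] for inst in instances}:
        remaining = list(instances)                    # reinitialize for each class
        while any(inst[class_attr] == cls for inst in remaining):
            rule, covered = {}, remaining              # empty rule covers everything
            # Add tests until only instances of the desired class are covered.
            while any(inst[class_attr] != cls for inst in covered):
                best, best_score = None, (-1.0, -1)
                for attr in (a for a in attributes if a not in rule):
                    for val in {inst[attr] for inst in covered}:
                        subset = [inst for inst in covered if inst[attr] == val]
                        p = sum(inst[class_attr] == cls for inst in subset)
                        t = len(subset)
                        score = (p / t, p)             # accuracy, then coverage
                        if p and score > best_score:
                            best, best_score = (attr, val), score
                if best is None:                       # no useful test left
                    break
                rule[best[0]] = best[1]
                covered = [inst for inst in covered if inst[best[0]] == best[1]]
            rules.append((cls, dict(rule)))
            # Separate: remove the instances this rule covers, then continue.
            remaining = [inst for inst in remaining
                         if not all(inst.get(a) == v for a, v in rule.items())]
    return rules

data = [
    {"outlook": "sunny", "windy": "false", "class": "no"},
    {"outlook": "sunny", "windy": "true", "class": "no"},
    {"outlook": "overcast", "windy": "false", "class": "yes"},
    {"outlook": "rainy", "windy": "false", "class": "yes"},
    {"outlook": "rainy", "windy": "true", "class": "no"},
]
for cls, rule in prism(data):
    print(f"IF {rule} THEN class = {cls}")
```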

Methods such as PRISM can be described as separate-and-conquer algorithms: you identify a rule that covers many instances in the class (and excludes ones not in the class), separate out the covered instances because they are already taken care of by the rule, and continue the process on those that are left. This contrasts nicely with the divide-and-conquer approach of decision trees. The separate step greatly increases the efficiency of the method because the instance set continually shrinks as the operation proceeds.

Bibliography

Ian H. Witten, Eibe Frank. (1999). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

Divide-and-conquer: Constructing decision trees

 
A Pollinator in Pink……..HFDF! flickr photo by The Manic Macrographer shared under a Creative Commons (BY) license

The problem of constructing a decision tree can be expressed recursively. First, select an attribute to place at the root node and make one branch for each possible value. This splits up the example set into subsets, one for every value of the attribute. Now the process can be repeated recursively for each branch, using only those instances that actually reach the branch. If at any time all instances at a node have the same classification, stop developing that part of the tree.

The only thing left to decide is how to determine which attribute to split on, given a set of examples with different classes. If we had a measure of the purity of each node, we could choose the attribute that produces the purest daughter nodes. The measure of purity that we will use is called the information and is measured in units called bits. Associated with a node of the tree, it represents the expected amount of information that would be needed to specify whether a new instance should be classified yes or no, given that the example reached that node.

Figure 4.2 of the book (not reproduced here) shows a tree stump for each attribute of the weather data. The branches of the outlook stump contain [2,3], [4,0], and [3,2] yes/no instances respectively, and the information values of these class distributions are

info([2,3]) = 0.971 bits
info([4,0]) = 0.0 bits
info([3,2]) = 0.971 bits

For outlook, we can calculate the average information value of these, taking into account the number of instances that go down each branch—five down the first and third and four down the second:

info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits

This average represents the amount of information that we expect would be necessary to specify the class of a new instance, given the tree structure in Figure 4.2(a). Before we created any of the nascent tree structures in Figure 4.2, the training examples at the root comprised nine yes and five no nodes, corresponding to an information value of

info([9,5]) = 0.940 bits

Thus the tree stump for outlook in Figure 4.2(a) is responsible for an information gain of

gain(outlook) = info([9,5]) − info([2,3], [4,0], [3,2]) = 0.940 − 0.693 = 0.247 bits

The way forward is clear. We calculate the information gain for each attribute and choose the one that gains the most information to split on. In the situation of Figure 4.2,

gain(outlook) = 0.247 bits
gain(temperature) = 0.029 bits
gain(humidity) = 0.152 bits
gain(windy) = 0.048 bits

so outlook is selected as the splitting attribute at the root of the tree.

Calculating information

Before examining the detailed formula for calculating the amount of information required to specify the class of an example given that it reaches a tree node with a certain number of yes’s and no’s, consider first the kind of properties we would expect this quantity to have:


1. When the number of either yes’s or no’s is zero, the information is zero.

2. When the number of yes's and no's is equal, the information reaches a maximum.

3. The information must obey the multistage property illustrated previously.

Remarkably, it turns out that there is only one function that satisfies all these properties, and it is known as the information value or entropy:

$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \cdots - p_n \log p_n$

where the logarithms are taken to base 2, so that the result is measured in bits.

The arguments p1, p2, . . . of the entropy formula are expressed as fractions that add up to one, so that, for example, info([2,3,4]) = entropy(2/9, 3/9, 4/9). Thus the multistage decision property can be written in general as

$\mathrm{info}([a, b, c]) = \mathrm{info}([a, b+c]) + \frac{b+c}{a+b+c} \times \mathrm{info}([b, c])$

Because of the way the log function works, you can calculate the information measure without having to work out the individual fractions:

$\mathrm{info}([2,3,4]) = -\frac{2}{9}\log\frac{2}{9} - \frac{3}{9}\log\frac{3}{9} - \frac{4}{9}\log\frac{4}{9} = \frac{-2\log 2 - 3\log 3 - 4\log 4 + 9\log 9}{9}$
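These calculations are easy to verify with a few lines of Python; the counts below are the weather-data figures quoted above.

```python
from math import log2

def info(counts):
    """Entropy of a class distribution, in bits."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

print(round(info([9, 5]), 3))          # 0.940: information at the root

branches = [[2, 3], [4, 0], [3, 2]]    # yes/no counts down the outlook branches
total = sum(sum(b) for b in branches)
after = sum(sum(b) / total * info(b) for b in branches)
print(round(after, 3))                 # 0.694, i.e. the ~0.693 bits quoted above

print(round(info([9, 5]) - after, 3))  # 0.247: gain(outlook)
```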

The method that has been described using the information gain criterion is essentially the same as one known as ID3. (https://en.wikipedia.org/wiki/ID3_algorithm)

A series of improvements to ID3 culminated in a practical and influential system for decision tree induction called C4.5 (https://en.wikipedia.org/wiki/C4.5_algorithm). These improvements include methods for dealing with numeric attributes, missing values, noisy data, and generating rules from trees.

 

Bibliography

Ian H. Witten, Eibe Frank. (1999). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

Algorithms: The Basic Methods. 1R and Statistical modeling

In the following two posts I'm going to go over the nine most common data mining algorithms.

Inferring rudimentary rules

Here's an easy way to find very simple classification rules from a set of instances. Called 1R for 1-rule, it generates a one-level decision tree expressed in the form of a set of rules that all test one particular attribute. The idea is this: we make rules that test a single attribute and branch accordingly. Each branch corresponds to a different value of the attribute. It is obvious what the best classification for each branch is: use the class that occurs most often in the training data. Then the error rate of the rules can easily be determined: just count the errors that occur on the training data, that is, the number of instances that do not have the majority class.

For each attribute,
    For each value of that attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate
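A sketch of 1R for nominal attributes follows (missing values and numeric attributes are left out); the toy dataset is an assumption.

```python
from collections import Counter

def one_r(instances, class_attr="class"):
    """Pick the single attribute whose one-level rule set makes the fewest errors."""
    best = None
    for attr in (a for a in instances[0] if a != class_attr):
        rules, errors = {}, 0
        for value in {inst[attr] for inst in instances}:
            # Majority class among instances with this attribute value.
            counts = Counter(inst[class_attr] for inst in instances
                             if inst[attr] == value)
            majority, hits = counts.most_common(1)[0]
            rules[value] = majority
            errors += sum(counts.values()) - hits   # instances outside the majority
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

data = [
    {"outlook": "sunny", "windy": "false", "class": "no"},
    {"outlook": "sunny", "windy": "true", "class": "no"},
    {"outlook": "overcast", "windy": "false", "class": "yes"},
    {"outlook": "rainy", "windy": "false", "class": "yes"},
    {"outlook": "rainy", "windy": "true", "class": "no"},
]
print(one_r(data))   # e.g. rules on outlook, with one error on this toy data
```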

Missing values and numeric attributes

The good! Although a very rudimentary learning method, 1R does accommodate both missing values and numeric attributes. It deals with these in simple but effective ways. Missing is treated as just another attribute value so that, for example, if the weather data had contained missing values for the outlook attribute, a rule set formed on outlook would specify four possible class values, one each for sunny, overcast, and rainy and a fourth for missing.

The bad!

1R handles numeric attributes by discretizing them, partitioning the range of values into intervals, and this procedure tends to form a large number of categories. The 1R method will naturally gravitate toward choosing an attribute that splits into many categories, because this will partition the dataset into many pieces, making it more likely that instances will have the same class as the majority in their partition. This phenomenon is known as overfitting. For 1R, overfitting is likely to occur whenever an attribute has a large number of possible values.

Surprisingly, despite its simplicity 1R did astonishingly—even embarrassingly—well in comparison with state-of-the-art learning methods, and the rules it produced turned out to be just a few percentage points less accurate, on almost all of the datasets, than the decision trees produced by a state-of-the-art decision tree induction scheme.

Statistical modeling

Another simple technique is to use all attributes and allow them to make contributions to the decision that are equally important and independent of one another, given the class.

Table 4.2 of the book shows a summary of the weather data obtained by counting how many times each attribute–value pair occurs with each value (yes and no) for play.

In the lower part of the table, the same information is rewritten in the form of fractions, or observed probabilities. For example, of the nine days that play is yes, outlook is sunny for two, yielding a fraction of 2/9. For play the fractions are different: they are the proportion of days that play is yes and no, respectively. Now suppose we encounter a new example with the values outlook = sunny, temperature = cool, humidity = high, and windy = true. We treat the five features (outlook, temperature, humidity, windy, and the overall likelihood that play is yes or no) as equally important, independent pieces of evidence and multiply the corresponding fractions:

likelihood of yes = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053

likelihood of no = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206

Normalizing the likelihoods so that they sum to one turns them into probabilities:

probability of yes = 0.0053 / (0.0053 + 0.0206) = 20.5%
probability of no = 0.0206 / (0.0053 + 0.0206) = 79.5%

This method goes by the name of Naïve Bayes, because it's based on Bayes's rule and "naïvely" assumes independence—it is only valid to multiply probabilities when the events are independent. The assumption that attributes are independent (given the class) is certainly a simplistic one in real life.
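The arithmetic above is easy to reproduce; the fractions are the ones quoted for the new day (outlook = sunny, temperature = cool, humidity = high, windy = true).

```python
from math import prod

yes_fractions = [2/9, 3/9, 3/9, 3/9]   # sunny, cool, high, true given play = yes
no_fractions = [3/5, 1/5, 4/5, 3/5]    # the same attribute values given play = no

likelihood_yes = prod(yes_fractions) * 9/14
likelihood_no = prod(no_fractions) * 5/14
print(round(likelihood_yes, 4), round(likelihood_no, 4))   # 0.0053 0.0206

# Normalizing the likelihoods turns them into probabilities:
print(round(likelihood_yes / (likelihood_yes + likelihood_no), 3))   # 0.205
```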

The good!

Despite the disparaging name, Naïve Bayes works very well when tested on actual datasets, particularly when combined with some of the attribute selection procedures.

The bad!

One thing that can go wrong with Naïve Bayes is that if a particular attribute value does not occur in the training set in conjunction with every class value, things go badly awry: a count of zero turns into a probability of zero, which wipes out the whole product. But the bug is easily fixed by minor adjustments to the method of calculating probabilities from frequencies. For example, the upper part of Table 4.2 shows that for play = yes, outlook is sunny for two examples, overcast for four, and rainy for three, and the lower part gives these events probabilities of 2/9, 4/9, and 3/9, respectively. Instead, we could add 1 to each numerator and compensate by adding 3 to the denominator, giving probabilities of 3/12, 5/12, and 4/12, respectively. This will ensure that an attribute value that occurs zero times receives a probability that is nonzero, albeit small. The strategy of adding 1 to each count is a standard technique called the Laplace estimator, after the great eighteenth-century French mathematician Pierre Laplace. Although it works well in practice, there is no particular reason for adding 1 to the counts: we could instead choose a small constant μ and use

$\frac{2 + \mu/3}{9 + \mu}, \qquad \frac{4 + \mu/3}{9 + \mu}, \qquad \frac{3 + \mu/3}{9 + \mu}$

Finally, there is no particular reason for dividing μ into three equal parts in the numerators: we could use

$\frac{2 + \mu p_1}{9 + \mu}, \qquad \frac{4 + \mu p_2}{9 + \mu}, \qquad \frac{3 + \mu p_3}{9 + \mu}$

where $p_1$, $p_2$, and $p_3$ sum to 1 and express prior probabilities of the three possible values of outlook.

In practice, the prior probabilities make little difference provided that there are a reasonable number of training instances, and people generally just estimate frequencies using the Laplace estimator by initializing all counts to one instead of to zero.

Missing values and numeric attributes

One of the really nice things about the Bayesian formulation is that missing values are no problem at all. For example, if the value of outlook were missing in the example of Table 4.3, the calculation would simply omit this attribute.

Numeric values are usually handled by assuming that they have a "normal" or "Gaussian" probability distribution. Whereas we normalize the counts for the nominal attributes into probabilities, for each numeric attribute we calculate the mean and standard deviation for each class. Thus the mean value of temperature over the yes instances is 73, and its standard deviation is 6.2. The mean is simply the average of the values, that is, the sum divided by the number of values. The standard deviation is the square root of the sample variance, which we can calculate as follows: subtract the mean from each value, square the result, sum them together, and then divide by one less than the number of values. The probability density for a given value is then obtained from the normal distribution:

$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

where $\mu$ is the mean and $\sigma$ the standard deviation calculated above.
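For example, a couple of lines compute the density of a temperature value under the yes distribution (mean 73, standard deviation 6.2, as above); the test value 66 is an assumption chosen for illustration.

```python
from math import exp, pi, sqrt

def gaussian_density(x, mean, std):
    """Value of the normal probability density function at x."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (sqrt(2 * pi) * std)

print(round(gaussian_density(66, 73.0, 6.2), 4))   # about 0.034
```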

Discussion

Naïve Bayes gives a simple approach, with clear semantics, to representing, using, and learning probabilistic knowledge. Impressive results can be achieved using it. It has often been shown that Naïve Bayes rivals, and indeed outperforms, more sophisticated classifiers on many datasets. The moral is, always try the simple things first. Repeatedly in machine learning people have eventually, after an extended struggle, obtained good results using sophisticated learning methods only to discover years later that simple methods such as 1R and Naïve Bayes do just as well—or even better.

There are many datasets for which Naïve Bayes does not do so well, however, and it is easy to see why. Because attributes are treated as though they were completely independent, the addition of redundant ones skews the learning process.

Bibliography

Ian H. Witten, Eibe Frank. (1999). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.