A “divide-and-conquer” approach to the problem of learning from a set of independent instances leads naturally to a style of representation called a decision tree.
If the attribute that is tested at a node is a nominal one, the number of children is usually the number of possible values of the attribute. In this case, because there is one branch for each possible value, the same attribute will not be retested further down the tree.
If the attribute is numeric, the test at a node usually determines whether its value is greater or less than a predetermined constant, giving a two-way split.
Missing values pose an obvious problem. It is not clear which branch should be taken when a node tests an attribute whose value is missing. Sometimes, as described in Section 2.4, "missing" is treated as an attribute value in its own right. If this is not the case, missing values should be treated in a special way rather than being considered as just another possible value that the attribute might take. A simple solution is to record the number of instances in the training set that go down each branch and to use the most popular branch if the value for a test instance is missing.
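The node tests and the "most popular branch" fallback described above can be sketched as follows. This is a minimal illustration, not a learning algorithm: the tree is built by hand, and the attribute names, the threshold of 75, and the branch counts are all made-up assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    attribute: str
    # Nominal test: one branch per attribute value.
    # Numeric test: two branches keyed "<=" and ">" (relative to threshold).
    branches: dict
    threshold: Optional[float] = None
    # Hypothetical record of how many training instances took each branch.
    counts: dict = field(default_factory=dict)


def classify(tree, instance):
    """Walk from the root to a leaf; a missing value follows the most popular branch."""
    while isinstance(tree, Node):
        value = instance.get(tree.attribute)
        if value is None:
            # Missing value: take the branch most training instances went down.
            # (Requires non-empty counts; max() raises ValueError otherwise.)
            key = max(tree.counts, key=tree.counts.get)
        elif tree.threshold is not None:
            key = "<=" if value <= tree.threshold else ">"
        else:
            key = value
        tree = tree.branches[key]
    return tree.label


# Hand-built example tree with invented counts.
tree = Node(
    "outlook",
    {
        "sunny": Node("humidity", {"<=": Leaf("yes"), ">": Leaf("no")},
                      threshold=75.0, counts={"<=": 2, ">": 4}),
        "overcast": Leaf("yes"),
        "rainy": Leaf("no"),
    },
    counts={"sunny": 6, "overcast": 4, "rainy": 4},
)

print(classify(tree, {"outlook": "sunny", "humidity": 70}))
print(classify(tree, {"humidity": 90}))  # outlook is missing
```

The nominal node on "outlook" has one branch per value, so the attribute is never tested again below it, whereas the numeric node on "humidity" gives a two-way split.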
Generally, the preconditions are logically ANDed together, and all the tests must succeed if the rule is to fire. However, in some rule formulations the preconditions are general logical expressions rather than simple conjunctions.
Transforming a rule set into a tree is more difficult, because a tree cannot easily express the disjunction between rules:
- Example: rules that test different attributes.
- The symmetry needs to be broken by choosing a single attribute to test at the root.
- The resulting tree then contains identical subtrees; this is the replicated subtree problem.
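As a rough sketch of how conjunctive preconditions work, the snippet below represents each rule as a set of attribute tests that must all succeed. The rule set, the attribute names `a` to `d`, and the default class `y` are illustrative assumptions. The two rules test different attributes, which is exactly the kind of disjunction that a tree can only express by repeating a subtree.

```python
def fires(rule, instance):
    """A rule fires only if every one of its preconditions succeeds (logical AND)."""
    return all(instance.get(attr) == value for attr, value in rule["if"].items())


def classify(rules, instance, default):
    """Return the conclusion of the first rule that fires, else the default class."""
    for rule in rules:
        if fires(rule, instance):
            return rule["then"]
    return default


# Two independent rules for class "x"; the rule set as a whole is their disjunction.
rules = [
    {"if": {"a": 1, "b": 1}, "then": "x"},
    {"if": {"c": 1, "d": 1}, "then": "x"},
]

print(classify(rules, {"a": 1, "b": 1, "c": 0, "d": 0}, default="y"))
print(classify(rules, {"a": 1, "b": 0, "c": 1, "d": 0}, default="y"))
```

A tree for the same concept has to pick one attribute, say `a`, for the root, and the tests on `c` and `d` then have to appear under more than one branch.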
One reason why rules are popular is that each rule seems to represent an independent “nugget” of knowledge. New rules can be added to an existing rule set without disturbing ones already there, whereas to add to a tree structure may require reshaping the whole tree.
Association rules are really no different from classification rules except that they can predict any attribute, not just the class, and this gives them the freedom to predict combinations of attributes too.
- The coverage of an association rule, often called its support, is the number of instances that it predicts correctly.
- Its accuracy, often called confidence, is the number of instances that it predicts correctly, expressed as a proportion of all instances to which it applies.
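The two measures can be computed directly from their definitions. The sketch below assumes a rule is given as an antecedent and a consequent, each a dictionary of attribute values; the function name and the four-instance dataset are invented for illustration.

```python
def support_and_confidence(instances, antecedent, consequent):
    """Support: instances predicted correctly. Confidence: that count over all instances the rule applies to."""
    applies = [x for x in instances
               if all(x.get(k) == v for k, v in antecedent.items())]
    correct = [x for x in applies
               if all(x.get(k) == v for k, v in consequent.items())]
    support = len(correct)
    # If the rule applies to no instance, confidence is reported as 0.0 by convention here.
    confidence = support / len(applies) if applies else 0.0
    return support, confidence


# Made-up data: the rule "if temperature = cool then humidity = normal".
instances = [
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "high"},
    {"temperature": "hot", "humidity": "high"},
]
support, confidence = support_and_confidence(
    instances, {"temperature": "cool"}, {"humidity": "normal"}
)
print(support, confidence)
```

Here the rule applies to three instances and is correct for two of them, so its support is 2 and its confidence is 2/3.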
Rules with exceptions
Returning to classification rules, a natural extension is to allow them to have exceptions. Then incremental modifications can be made to a rule set by expressing exceptions to existing rules rather than reengineering the entire set.
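One possible way to represent a rule with exceptions is as a rule that carries a list of nested exception rules, each of which overrides the parent's conclusion when its own condition holds. The class name, the helper `conclude`, and the iris-style thresholds below are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ExceptionRule:
    condition: Callable[[dict], bool]
    conclusion: str
    exceptions: list = field(default_factory=list)


def conclude(rule: ExceptionRule, instance: dict) -> Optional[str]:
    """Return the rule's conclusion unless an exception applies; None if the rule does not fire."""
    if not rule.condition(instance):
        return None
    for exc in rule.exceptions:
        result = conclude(exc, instance)
        if result is not None:
            return result  # The exception overrides the parent's conclusion.
    return rule.conclusion


# Illustrative rule: versicolor for mid-length petals, EXCEPT setosa if the petal is narrow.
rule = ExceptionRule(
    condition=lambda x: 2.45 <= x["petal_length"] < 4.45,
    conclusion="Iris-versicolor",
    exceptions=[
        ExceptionRule(lambda x: x["petal_width"] < 1.0, "Iris-setosa"),
    ],
)

print(conclude(rule, {"petal_length": 3.0, "petal_width": 1.3}))
print(conclude(rule, {"petal_length": 3.0, "petal_width": 0.5}))
```

Adding a new exception only appends to one rule's `exceptions` list, which is the incremental modification the text describes.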
Rules involving relations
The rules considered so far are called propositional because the attribute-value language used to define them has the same power as what logicians call the propositional calculus.
Trees for numeric prediction
Because statisticians use the term regression for the process of computing an expression that predicts a numeric quantity, decision trees with averaged numeric values at the leaves are called regression trees.
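The idea of averaged numeric values at the leaves can be illustrated with the smallest possible regression tree: a single split with two leaves. In this sketch the threshold is supplied by hand rather than learned, and the data values are invented.

```python
from statistics import mean


def build_stump(xs, ys, threshold):
    """One-split regression tree: each leaf stores the average target of the training instances that reach it."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    # mean() raises StatisticsError on an empty list, so both sides must be non-empty.
    return {"threshold": threshold, "left": mean(left), "right": mean(right)}


def predict(stump, x):
    """Route x to a leaf and return that leaf's averaged value."""
    return stump["left"] if x <= stump["threshold"] else stump["right"]


# Made-up training data.
xs = [1.0, 2.0, 3.0, 8.0, 9.0]
ys = [10.0, 12.0, 14.0, 30.0, 34.0]
stump = build_stump(xs, ys, threshold=5.0)

print(predict(stump, 2.5))
print(predict(stump, 8.5))
```

Every instance that lands in the same leaf gets the same prediction, namely the mean of the training targets in that leaf (12.0 on the left and 32.0 on the right here).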
The simplest form of learning is plain memorization, or rote learning. Once a set of training instances has been memorized, on encountering a new instance the memory is searched for the training instance that most strongly resembles the new one.
This is known as instance-based learning. In a sense all the other learning methods are “instance-based,” too, because we always start with a set of instances as the initial training information. But the instance-based knowledge representation uses the instances themselves to represent what is learned, rather than inferring a rule set or decision tree and storing it instead. In instance-based learning, all the real work is done when the time comes to classify a new instance rather than when the training set is processed.
In instance-based learning, each new instance is compared with existing ones using a distance metric, and the closest existing instance is used to assign the class to the new one. This is called the nearest-neighbor classification method. Sometimes more than one nearest neighbor is used, and the majority class of the closest k neighbors (or the distance-weighted average, if the class is numeric) is assigned to the new instance. This is termed the k-nearest-neighbor method.
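A minimal sketch of k-nearest-neighbor classification follows. It assumes numeric attributes and Euclidean distance, which is only one possible choice of metric; the training points and labels are invented.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training, query, k=1):
    """Assign the majority class among the k training instances closest to the query."""
    neighbors = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tied vote, most_common() favors the class seen first among the neighbors.
    return votes.most_common(1)[0][0]


# Made-up training set of (attributes, class) pairs.
training = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((1.6, 1.6), "b"),
    ((5.0, 5.0), "b"),
    ((5.5, 4.5), "b"),
]

print(knn_classify(training, (1.5, 1.5), k=1))
print(knn_classify(training, (1.5, 1.5), k=3))
```

With k = 1 the single closest instance, an isolated "b", decides the class; with k = 3 the two nearby "a" instances outvote it. Note that there is no training step: all the work happens inside `knn_classify` at classification time.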
When clusters rather than a classifier are learned, the output takes the form of a diagram that shows how the instances fall into clusters. In the simplest case this involves associating a cluster number with each instance, which might be depicted by laying the instances out in two dimensions and partitioning the space to show each cluster.
Some algorithms associate instances with clusters probabilistically rather than categorically. In this case, for every instance there is a probability or degree of membership with which it belongs to each of the clusters. This is shown in Figure 3.9(c). Other algorithms produce a hierarchical structure of clusters, for which a diagram such as Figure 3.9(d) is used, in which elements joined together at lower levels are more tightly clustered than ones joined together at higher levels. Diagrams such as this are called dendrograms. This term means just the same thing as tree diagrams (the Greek word dendron means “a tree”), but in clustering the more exotic version seems to be preferred, perhaps because biological species are a prime application area for clustering techniques, and ancient languages are often used for naming in biology.
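To make the structure of a dendrogram concrete, the sketch below builds one as nested tuples by repeatedly joining the two closest clusters. The merging criterion used here (single linkage on one-dimensional points) is an assumption chosen for brevity, not a method named in the text, and the data points are invented.

```python
def single_linkage(points):
    """Build a dendrogram as nested tuples by repeatedly merging the two closest clusters.

    Requires at least one point. Each cluster is a pair (tree, members).
    """
    clusters = [(p, [p]) for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is the smallest pairwise distance.
                d = min(abs(a - b) for a in clusters[i][1] for b in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters[0][0]


dendrogram = single_linkage([1.0, 1.1, 5.0, 5.3, 9.0])
print(dendrogram)
```

In the resulting nested tuples, pairs that sit deeper in the nesting, such as 1.0 and 1.1, were joined earlier and are therefore more tightly clustered than the groups joined near the top.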
Ian H. Witten, Eibe Frank. (1999). Data mining: Practical machine learning tools and techniques. Elsevier.