The input takes the form of concepts, instances, and attributes. We call the thing that is to be learned the concept.
- The information that the learner is given takes the form of a set of instances.
Four basically different styles of learning appear in data mining applications.
- In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples.
- In association learning, any association among features is sought, not just ones that predict a particular class value.
- In clustering, groups of examples that belong together are sought.
- In numeric prediction, the outcome to be predicted is not a discrete class but a numeric quantity.
Regardless of the type of learning involved, we call the thing to be learned the concept and the output produced by a learning scheme the concept description.
Classification learning is sometimes called supervised because, in a sense, the method operates under supervision by being provided with the actual outcome for each of the training examples.
This outcome is called the class of the example.
When there is no specified class, clustering is used to group items that seem to fall naturally together.
The success of clustering is often measured subjectively in terms of how useful the result appears to be to a human user. It may be followed by a second step of classification learning in which rules are learned that give an intelligible description of how new instances should be placed into the clusters.
What’s in an example?
The input to a learning scheme is a set of instances. These instances are the things that are to be classified, associated, or clustered. Although until now we have called them examples, henceforth we will use the more specific term instances to refer to the input. Each instance is an individual, independent example of the concept to be learned.
The input to a data mining scheme is generally expressed as a table of independent instances of the concept to be learned. Because of this, it has been suggested, disparagingly, that we should really talk of file mining rather than database mining. Relational data is more complex than a flat file. A finite set of finite relations can always be recast into a single table, although often at enormous cost in space. Moreover, denormalization can generate spurious regularities in the data, and it is essential to check the data for such artifacts before applying a learning method.
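To make the denormalization point concrete, here is a minimal sketch in Python. The two relations (orders and suppliers), their contents, and the helper names are hypothetical, chosen only for illustration. Joining them into one flat table repeats each supplier's region on every order row, and the flat table then contains a perfect "supplier determines region" regularity that is an artifact of the join rather than a discovery.

```python
# Hypothetical relations, invented for illustration only.
suppliers = {"S1": "north", "S2": "south"}  # supplier -> region
orders = [                                   # (customer, product, supplier)
    ("C1", "widget", "S1"),
    ("C1", "gadget", "S2"),
    ("C2", "widget", "S1"),
    ("C3", "gizmo", "S2"),
]

def denormalize(orders, suppliers):
    """Join the two relations into one flat table (one row per order)."""
    return [(cust, prod, sup, suppliers[sup]) for cust, prod, sup in orders]

def is_functionally_determined(rows, src, dst):
    """Return True if column `src` always maps to the same value in column `dst`."""
    seen = {}
    for row in rows:
        # setdefault stores the first dst value seen for this src value
        if seen.setdefault(row[src], row[dst]) != row[dst]:
            return False
    return True

flat = denormalize(orders, suppliers)

# The region is now stored once per order instead of once per supplier.
print(len(flat), "flat rows versus", len(suppliers), "supplier rows")

# Column 2 (supplier) perfectly predicts column 3 (region): a spurious
# regularity created by the join, not a property worth "learning".
print(is_functionally_determined(flat, 2, 3))
```

A check like `is_functionally_determined` is one simple way to look for such artifacts before running a learning method; a real dataset would likely need a more tolerant test that allows for noise.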
What’s in an attribute?
Each individual, independent instance that provides the input to machine learning is characterized by its values on a fixed, predefined set of features or attributes.
The value of an attribute for a particular instance is a measurement of the quantity to which the attribute refers.
Four levels of measurement are commonly distinguished: nominal, ordinal, interval, and ratio. Most practical data mining systems accommodate just two of these four levels: nominal and ordinal. Nominal attributes are sometimes called categorical, enumerated, or discrete. Ordinal attributes are generally called numeric, or perhaps continuous, but without the implication of mathematical continuity.
Preparing the input
The idea of company-wide database integration is known as data warehousing.
External data may also be needed. Sometimes called overlay data, this is not normally collected by an organization but is clearly relevant to the data mining problem. It, too, must be cleaned up and integrated with the other data that has been collected.
An ARFF file is a standard way of representing datasets that consist of independent, unordered instances and do not involve relationships among instances.
ARFF files accommodate the two basic data types, nominal and numeric.
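As a rough illustration of the format, the sketch below writes a tiny dataset as ARFF text. The helper `to_arff`, the relation name, and the attribute names and values are all hypothetical. The layout (an `@relation` line, one `@attribute` line per attribute with either `numeric` or a brace-enclosed list of nominal values, then `@data` followed by comma-separated instances, with `?` for a missing value) follows the usual ARFF conventions.

```python
def to_arff(relation, attributes, rows):
    """Serialize a dataset to ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
                "numeric" or a list of allowed nominal values.
    rows:       list of tuples, one per instance; None marks a missing value.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            # Nominal attribute: list the permitted values in braces.
            lines.append(f"@attribute {name} {{{', '.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"

# Hypothetical dataset: two nominal attributes and one numeric attribute.
attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("temperature", "numeric"),
    ("play", ["yes", "no"]),
]
rows = [
    ("sunny", 85, "no"),
    ("overcast", 83, "yes"),
    ("rainy", None, "yes"),  # temperature is missing for this instance
]

arff_text = to_arff("weather", attributes, rows)
print(arff_text)
```

This sketch does no quoting or escaping, so it assumes names and values contain no spaces or commas.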
Attributes are often normalized to lie in a fixed range, say, from zero to one, by dividing all values by the maximum value encountered or by subtracting the minimum value and dividing by the range between the maximum and the minimum values. Another normalization technique is to calculate the statistical mean and standard deviation of the attribute values, subtract the mean from each value, and divide the result by the standard deviation. This process is called standardizing a statistical variable and results in a set of values whose mean is zero and standard deviation is one.
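The two techniques can be sketched in a few lines of Python. The function names and sample values are hypothetical. The handling of a constant attribute (mapping every value to zero) and the use of the population standard deviation are assumptions of this sketch, not something the text specifies.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale to [0, 1]: subtract the minimum, divide by (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the standard deviation.

    Uses the population standard deviation (an assumption), so the result
    has mean 0 and population standard deviation 1.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

print(min_max_normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]

temperatures = [64, 68, 72, 80, 85]  # hypothetical attribute values
z = standardize(temperatures)
print(round(mean(z), 6), round(pstdev(z), 6))
```

Min-max scaling is sensitive to a single extreme value, because that value sets the range. Standardizing is less affected by one outlier but does not bound the result to a fixed interval.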
You have to think carefully about the significance of missing values. They may occur for several reasons, such as malfunctioning measurement equipment, changes in experimental design during data collection, and collation of several similar but not identical datasets. Respondents in a survey may refuse to answer certain questions such as age or income. In an archaeological study, a specimen such as a skull may be damaged so that some variables cannot be measured.
Typographic errors in a dataset will obviously lead to incorrect values. Duplicate data presents another source of error. People often make deliberate errors when entering personal data into databases.
Ian H. Witten, Eibe Frank. (1999). Data mining: Practical machine learning tools and techniques. Elsevier.