Introduction to data mining

What is data mining.

In data mining, the data is stored electronically and the search is automated— or at least augmented—by computer.

Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage. The data is invariably present in substantial quantities.

How are the patterns expressed? Useful patterns allow us to make nontrivial predictions on new data. There are two extremes for the expression of a pattern:

  • as a black box whose innards are effectively incomprehensible and as a
  • transparent box whose construction reveals the structure of the pattern.

Such patterns we call structural because they capture the decision structure in an explicit way

Machine learning

Things learn when they change their behavior in a way that makes them perform better in the future.

domain knowledge

Market basket analysis is the use of association techniques to find groups of items that tend to occur together in transactions, typically supermarket checkout data.

What’s the difference between machine learning and statistics? Cynics, looking wryly at the explosion of commercial interest (and hype) in this area, equate data mining to statistics plus marketing. In truth, you should not look for a dividing line between machine learning and statistics because there is a continuum—and a multidimensional one at that—of data analysis techniques. Some derive from the skills taught in standard statistics courses, and others are more closely associated with the kind of machine learning that has arisen out of computer science. Historically, the two sides have had rather different traditions. If forced to point to a single difference of emphasis, it might be that statistics has been more concerned with testing hypotheses, whereas machine learning has been more concerned with formulating the process of generalization as a search through possible hypotheses. But this is a gross oversimplification: statistics is far more than hypothesis testing, and many machine learning techniques do not involve any searching at all.


Ian H. Witten, Eibe Frank. (1999). Data mining practical machine learning tools and techniques. Elsevier.


Author: enroblog

Computer science student at ITESM

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s