Why has data mining become so popular?

Data Mining Methods - an understandable overview of the most important processes

Nowadays, vast amounts of data are collected and stored. To evaluate these huge amounts of data (big data), not only classical statistical procedures but also new data mining algorithms are used. But there is more to the term data mining than modern gold prospecting in search of monetizable knowledge and information.

In this article we first answer the question “What is data mining?” and give you a data mining definition. Then we present the 5 most important data mining methods: cluster analysis, decision trees, prediction (predictive analytics), association rules, and classification.

A compilation of our services in the area of data mining can be found on our website. We are happy to dig up new knowledge for you in your data treasures! Feel free to contact us.

This article answers the following questions

  • What is data mining?
  • What data mining methods are there?
  • What is a decision tree?
  • What is meant by cluster analysis?
  • How can you make predictions (predictive analysis)?
  • How do you set up association rules?
  • When do you need a classification?

What is data mining?

With the increasing power of electronic media, growing networking, and the explosive growth of electronic storage capacity, the amount of available information has increased enormously.

However, this electronically available data can no longer be compared with data collected on paper. In this context one speaks of big data: the amount of data (usually millions of records), the collection speed (real time), and the range of collection instruments (cameras, satellites, the internet, checkout scanners, ...) are big in every respect.

Such enormous amounts of data place special demands on the evaluation. Big data analyses should:

  • Process large amounts of data efficiently.
  • Deliver reliable, easily interpretable results.
  • Have the shortest possible processing time.
  • Be suitable for different types of data structures (e.g. text analysis, image processing, numbers, coordinates, ...).

Data mining methods are procedures that “track down” previously unknown, novel, useful and important information in big data. On the one hand, the data mining definition includes classic statistical methods such as regression analysis, logistic regression, and generalized linear models (GLMs). On the other hand, new algorithms that meet the above requirements are also common data mining methods. The aim of data mining is to generalize the knowledge gained and thus generate new knowledge.

More about the data mining definition can be found in our glossary.

The distinction between statistical evaluation and data mining is summarized in the following table.

Statistics versus data mining methods

  • Data: Statistics works with manageable amounts of data, from a sample size of about 30 upwards; data mining works with big data.
  • Transferability: In statistics, conclusions about the population are drawn from a sample. In data mining, the population often does not exist, the sample is not defined, and the databases are constantly changing.
  • Evaluation: Statistical evaluations can in principle be done with paper and pencil; data mining is only possible with computers.
  • Time span from data collection to results: Statistical evaluations often take years (e.g. clinical studies); data mining results must be available promptly after data collection (e.g. in criminalistics).
  • Requirements: In statistics, the preconditions of the procedures used must be checked very carefully. Data mining methods are often no longer theoretically founded; instead, several methods are applied in parallel and the best model is then chosen.
  • Aim: Statistics tests hypotheses; data mining generates hypotheses.

The 5 most important data mining methods at a glance

The successful entrepreneur H.J. Geldig would like to optimize his sales strategies. For this reason, he collects data from all visitors to his shop websites in accordance with the applicable data protection laws. Within a very short time, Mr. Geldig has a huge amount of data, which he hands over to experts for analysis. His goal is to increase sales and maximize profit.

The data mining experts then advise him to carry out the following evaluations:

1. Cluster analysis - successful fishing in murky waters

As part of cluster analysis, one tries to divide the huge amount of data into smaller, homogeneous groups. All members of a cluster have similar or common properties; between the groups, the attributes should differ as much as possible.

The clusters are generated without prior knowledge, so the similarity structures within a cluster are not recognizable at first glance. What makes the members of a cluster similar often has to be worked out through additional analyses. In addition, clusters occasionally arise that are of little help in terms of content.

Cluster analysis can thus be used to reduce the huge amount of data to homogeneous units. Further analyses are then carried out only in those clusters that are meaningful in terms of content.

In detail, the following steps are carried out for a cluster analysis:

  • Selection of the variables for the similarity search
  • Determining the distance measure: How is the distance between the data points measured? This depends strongly on the question and on the scale of the data. Chi-square based distance measures are often used for nominal variables; for metrically scaled variables, Euclidean or squared Euclidean distances can be used, for example.
  • Determination of the number of clusters and cluster centers
  • Assignment of the points to the clusters based on the distance measure

The last two steps are repeated iteratively until all observations are assigned to a cluster.
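How these steps interact can be sketched with k-means, one of the most widely used clustering algorithms. The following Python example is a minimal sketch using scikit-learn; the visitor features (pages viewed, minutes on site, past orders) and their values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented visitor data: pages viewed, minutes on site, past orders
X = np.array([
    [3, 1.5, 0], [45, 30.0, 4], [5, 2.0, 1],
    [50, 28.0, 5], [4, 1.0, 0], [40, 25.0, 3],
])

# Standardize so the Euclidean distance measure treats all variables equally
X_scaled = StandardScaler().fit_transform(X)

# Fix the number of clusters, then assign points to cluster centers;
# assignment and center update are repeated until the solution is stable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(labels)                    # cluster assignment per visitor
print(kmeans.cluster_centers_)  # cluster centers in standardized space
```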

A detailed presentation of specific clustering algorithms can be found here.

2. Classification - everyone has their own class!

In classification, objects are assigned to specific classes or groups. In our example, the experts use classification to distinguish buyers from non-buyers. To do this, decision rules are sought in the data that can separate buyers from non-buyers. Classification includes, for example, data mining methods such as neural networks, Bayesian classification, and k-nearest neighbors. Decision trees are also among the classification methods.

Example of a classification: The classes (destinations) are known and specified, the units are sorted according to their travel plans.
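As a minimal sketch of such an assignment, the following Python example classifies shop visitors as buyers or non-buyers with the k-nearest neighbors method from scikit-learn; the features and class labels are invented for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Invented training data: [age, past orders] with label buyer (1) / non-buyer (0)
X_train = [[22, 0], [35, 4], [28, 1], [51, 6], [19, 0], [44, 5]]
y_train = [0, 1, 0, 1, 0, 1]

# A new object is assigned the majority class of its 3 nearest neighbors
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Classify a new visitor: 40 years old, 3 past orders
print(clf.predict([[40, 3]]))  # -> [1], predicted buyer
```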

3. The Decision Tree - When you can't see the forest for the trees.

Decision trees are data mining methods that represent decision rules in the form of a tree. The result is a tree with a root and branches extending from it. The branches split further at nodes and finally end in leaves. These leaves then indicate the class membership or the decision.

Decision trees are popular because they represent rules in a simple and understandable way. The rules are hierarchical, i.e. processed one after the other in a fixed order, and end with a result. For discrete variables, the algorithm works as follows:

  1. First, the characteristic with the highest information content is selected with regard to the prediction of the label (target variable).
  2. A branch of the tree is then created for each value that the attribute can assume.
  3. Steps 1 and 2 are repeated for each new node.
  4. The tree is complete when each node uniquely identifies a class. The last nodes then determine the class. These are also called leaves.

In the case of continuous variables, suitable threshold values are calculated in an additional step. The attribute, “broken down” into groups in this way, can then be used like a categorical (polytomous) feature.

Mr. Geldig is interested in simple rules for deciding which customers can be granted installment payments. Payment by installments, with the values yes/no, is the label. Gender, age, and preferred payment method are the predictors. The decision tree generated from the database can be seen in the following figure. The root of the tree is gender; this is the variable with the highest initial information content. The decision on granting installment payments can be read off at the leaves. According to this decision tree, a 40-year-old woman is refused payment in installments.

Decision tree using the example of customer data
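A tree of this kind can be generated in a few lines. The following Python sketch rebuilds the installment-payment example with scikit-learn; the encoded customer records are invented and do not reproduce the figure exactly.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented encoding: gender (0 = female, 1 = male), age,
# preferred payment method (0 = invoice, 1 = credit card)
X = [[0, 25, 1], [0, 40, 0], [1, 33, 1], [1, 58, 0], [0, 47, 1], [1, 29, 0]]
y = [1, 0, 1, 0, 0, 1]  # installment payment granted: yes (1) / no (0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned hierarchical rules from root to leaves
print(export_text(tree, feature_names=["gender", "age", "payment_method"]))
```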

Decision Trees: Beware of Overfitting!

Decision trees are easy to interpret, but the algorithm does not necessarily lead to the tree with the best classification. Since the tree continues to grow until the data can be unambiguously assigned to a group, the risk of overfitting is high.

One speaks of overfitting when models are specified too closely to the training data. Such models predict the data for which they were optimized without errors, but generalization and transferability to other data are no longer guaranteed. The reason is the inclusion of too many influencing variables. Decision trees with a large number of levels and leaves adapt perfectly to the training data but lead to very high error rates on other data. To prevent overfitting, the decision tree should often be shortened afterwards. This process is called pruning: branches with little information content are removed again. Another option is to use random forest methods: many decision trees are generated on the same data, and the class membership of an individual observation is based on a collective decision across all decision trees.
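Both countermeasures are available in common libraries. The following Python sketch shows cost-complexity pruning and a random forest with scikit-learn; the synthetic data merely stands in for real customer records.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the customer records
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Pruning: cost-complexity pruning cuts back branches with little
# information content (a larger ccp_alpha yields a smaller, more general tree)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

# Random forest: many trees are grown and the class membership of an
# observation is a collective decision across all trees
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(pruned.get_depth(), forest.score(X, y))
```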

4. Who with whom? - Association rules

Association rules are set up to make connections visible; they are also known as dependency rules. The data mining experts advise Mr. Geldig to examine his customers' shopping carts with association rules. Search processes can also be analyzed this way. This results in statements such as: if a customer searches for red wool sweaters, it is very likely that they will also buy yellow socks. The experts can thus create customer profiles and, for example, place targeted advertising.

Association rules are established by determining frequencies for different sets and subsets. Of particular interest are so-called frequent item sets: sets, for example shopping baskets, in which the frequency of certain combinations exceeds a specified threshold. First, each attribute is examined individually; then further attributes that also meet the frequent-item-set condition are added step by step. This yields combinations of attributes that very often appear together. For these frequently occurring combinations, all decompositions are then formed and if-then conclusions are drawn from them.
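The counting logic behind frequent item sets can be illustrated in a few lines of pure Python. The sketch below checks all item pairs in a handful of invented shopping baskets against a minimum support threshold; real implementations (e.g. the Apriori algorithm) extend this step by step to larger item sets.

```python
from collections import Counter
from itertools import combinations

# Invented shopping baskets
baskets = [
    {"red wool sweater", "yellow socks"},
    {"red wool sweater", "yellow socks", "scarf"},
    {"scarf", "gloves"},
    {"red wool sweater", "yellow socks"},
]
min_support = 0.5  # a combination must appear in at least 50% of baskets

# Count how often each pair of items occurs together
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

# Keep the frequent item sets and derive simple if-then rules from them
for (item_a, item_b), count in pair_counts.items():
    support = count / len(baskets)
    if support >= min_support:
        print(f"If {item_a}, then {item_b} (support {support:.0%})")
```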

5. Prediction model (predictive analysis or predictive analytics) - forecasts for the future

In predictive analysis (also called predictive analytics), models that forecast future events are created from the data. Within such a model, experts try to predict the target variable (label) using influencing variables (predictors).

In the simplest case, the model is a linear relationship. The choice of model depends on the scale level of the target variable (label). For dichotomous labels (yes/no characteristics), logistic regression as part of a GLM is an option; for continuous labels, linear regression is available. However, purely data-driven systems such as neural networks can also be used, as can support vector machines, deep learning models, or naive Bayes models.
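As a minimal sketch of such a prediction model for a dichotomous label, the following Python example fits a logistic regression with scikit-learn; the predictors and their values are invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Invented data: [visits per month, average basket value in EUR]
# with the dichotomous label "buys again": yes (1) / no (0)
X = [[1, 20.0], [8, 75.0], [2, 15.0], [10, 90.0], [3, 30.0], [7, 60.0]]
y = [0, 1, 0, 1, 0, 1]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probability that a new customer buys again
print(model.predict_proba([[5, 50.0]])[0][1])
```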

Usually, several candidate models are applied in parallel in data mining. The model quality is then determined by means of cross-validation, and the model with the best average fit is used for the prediction.
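A minimal sketch of this model comparison with scikit-learn could look as follows; the candidate models and the synthetic data are chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the shop data
X, y = make_classification(n_samples=300, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
}

# Average fit over 5 cross-validation folds; the best model is then
# used for the prediction
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```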

Most of these models are based on algorithms so complex that users can no longer fully trace how a prediction comes about.

With Mr. Geldig's data, predictive analysis can be used to encourage customers to buy again in his shop. For example, marketing activities can be started before the customer looks for information elsewhere: customers who buy a coffee machine need coffee or cleaning products at certain intervals. Return rates can also be predicted using predictive analytics and suitable data.

Summary

In this article, we have introduced and explained the 5 most important data mining methods. A comparison of the methods can be found in the table below. The methods used in data mining are often exploratory, but the processes behind them are extremely complex and demanding. In particular, the interpretation of the results and their transferability is a delicate and difficult topic: thanks to huge amounts of data, seemingly perfect models can be developed, but these cannot always be transferred to other data. We would be happy to help you professionally evaluate your data and present the results of your data mining project in a timely, understandable, customer-oriented and effective manner.

We take care of all aspects of dealing with big data independently and reliably. Feel free to contact us!

  • Cluster analysis: No target variable (label); clusters are formed automatically. Statement: formation of homogeneous groups, reduction of the amount of data. Restriction: the similarity features must be determined afterwards.
  • Association rules: Categorical target variable (label). Statement: if-then rules. Restriction: only valid for the given data set; transferability, especially with a time component, must be checked.
  • Prediction model (predictive analysis): Any target variable (label); the choice of model depends on the scale level. Statement: prediction models for future events. Restriction: the best model must be determined on the basis of criteria (e.g. interpretability, fit, ...).
  • Classification: Categorical target variable (label). Statement: assignment to predefined classes. Restriction: large number of possible algorithms, problem of overfitting.
  • Decision tree: Categorical target variable (label). Statement: hierarchical decision rules. Restriction: danger of overfitting; confusing with many levels.

Further sources:

Wikipedia overview article on data mining

Introduction to important data mining processes of the TH Nürnberg

Data mining introduction by Klaus-Peter Wiedmann, Frank Buckler and Holger Buxel


Keywords: cluster analysis, data mining, data mining definition, data mining methods, decision tree, predictive analysis, what is data mining