Thursday, 26 July 2007

By popular demand, here is more information

The first answers are already coming in. A big thank you to everyone who has already submitted answers to our questionnaire. I get quite a few e-mails asking how I am going to use the results, so I feel it is time for more explanation.

The process of Data Mining consists of the following steps (simplified):

1) Data Collection
2) Data Preparation
3) Analysis
4) Application of results

Currently we are on step (1), collecting the data. Although the questions in your e-mails are about step (4), I feel it is important to talk about step (3) as well:

By deploying specialized algorithms, we try to find the common characteristics of a specific class in the data (also known as classification). By class we mean a category. For "happiness" we have two possible categories of people: those who are generally happy and those who are not. In the same manner, several other classes of people exist.

Once we decide which class to analyze (say married/divorced), it is time to perform the analysis, with the goal of creating a predictive model. Once a reasonably accurate model is created, we are ready to predict new, unseen cases of people. That essentially means that when new users submit the questionnaire, they will be given a score (a percentage) expressing the probability of getting a divorce. Note that "divorce" is just one class; predictions can be made for other classes too, once a predictive model has been created for each class. But that's not all. Not only are we able to predict new cases but, with specific algorithms, we are also able to find out which parameters matter for an outcome (i.e. getting a divorce) and how important each parameter is (where each parameter is a question on the questionnaire).
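To give a rough idea of what "being given a score" could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the records are made up (they are not questionnaire answers), the field names age_group, has_children and divorced are hypothetical, and the "model" is a plain frequency count rather than a real data mining algorithm.

```python
from collections import defaultdict

# Hypothetical, made-up training records (NOT real questionnaire data).
TRAINING = [
    {"age_group": "31-40", "has_children": False, "divorced": True},
    {"age_group": "31-40", "has_children": False, "divorced": True},
    {"age_group": "31-40", "has_children": False, "divorced": False},
    {"age_group": "31-40", "has_children": True, "divorced": False},
    {"age_group": "20-30", "has_children": False, "divorced": False},
    {"age_group": "20-30", "has_children": True, "divorced": False},
]


def train(records):
    """Count divorced / total respondents for each combination of answers."""
    counts = defaultdict(lambda: [0, 0])  # key -> [divorced, total]
    for r in records:
        key = (r["age_group"], r["has_children"])
        counts[key][0] += int(r["divorced"])
        counts[key][1] += 1
    return dict(counts)


def score(model, age_group, has_children):
    """Return the estimated divorce probability as a percentage, or None
    when no training record shares this combination of answers."""
    key = (age_group, has_children)
    if key not in model:
        return None
    divorced, total = model[key]
    return 100.0 * divorced / total


model = train(TRAINING)
# A new, unseen respondent: aged 31-40 with no children.
new_user_score = score(model, "31-40", False)
print(new_user_score)
```

A real model would generalize to combinations of answers it has never seen, which this simple counting approach cannot do; that is why score returns None in that case.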

In our example rule:

IF AGE > 31 AND AGE <= 40 AND NUM_OF_CHILDREN = 0 THEN DIVORCE = "TRUE"

Not only do we realize that age and the number of children are important in getting a divorce, but we are also able to know the relative importance of each parameter. For example, a model may tell us that the most important factor in getting a divorce is whether one has children, and the second most important factor is age.
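One common way to rank parameters by importance is information gain, the measure used by many decision tree algorithms. The Python sketch below is only an illustration under stated assumptions: the ten records are made up, the field names are hypothetical, and information gain is just one possible measure, not necessarily the one that will be used on the questionnaire data.

```python
import math
from collections import Counter

# Hypothetical, made-up records (NOT real questionnaire data).
RECORDS = [
    {"age_group": "31-40", "has_children": False, "divorced": True},
    {"age_group": "31-40", "has_children": False, "divorced": True},
    {"age_group": "20-30", "has_children": False, "divorced": True},
    {"age_group": "20-30", "has_children": False, "divorced": True},
    {"age_group": "31-40", "has_children": False, "divorced": False},
    {"age_group": "31-40", "has_children": True, "divorced": True},
    {"age_group": "31-40", "has_children": True, "divorced": False},
    {"age_group": "20-30", "has_children": True, "divorced": False},
    {"age_group": "20-30", "has_children": True, "divorced": False},
    {"age_group": "20-30", "has_children": True, "divorced": False},
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(records, attribute, target="divorced"):
    """How much knowing `attribute` reduces uncertainty about `target`."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder


# Rank the parameters from most to least informative.
ranking = sorted(
    ["age_group", "has_children"],
    key=lambda a: information_gain(RECORDS, a),
    reverse=True,
)
print(ranking)
```

On this made-up data, has_children ranks above age_group, mirroring the example in the text. With real answers the order could of course be different.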

Of course, there are other kinds of analysis that do not search for common characteristics but seek to find associations between variables. Another type of analysis finds homogeneous groups through clustering, but we will leave examples of these for another post.
