Thursday, 26 July 2007

By popular demand, here is more information

The first answers are already coming in. A big thanks to everyone who has already submitted answers to our questionnaire. I get quite a few e-mails asking how I am going to use the results, so I feel it is time for more explanation.

The process of Data Mining consists of the following steps (simplified):

1) Data Collection
2) Data Preparation
3) Analysis
4) Application of results

Currently we are on step (1), collecting the data. Although the questions in your e-mails concern step (4), I feel it is important to talk about step (3) as well:

By deploying specialized algorithms, we try to find the common characteristics of a specific class in the data (a task known as classification). By class we mean a category. For "happiness" we have two possible categories of people: those who are generally happy and those who are not. In the same manner, several other classes of people exist.

Once we decide which class to analyze (say married/divorced), it is time to perform the analysis, with the goal of creating a predictive model. Once a reasonably accurate model is created, we are ready to predict new, unseen cases. Essentially, when new users submit the questionnaire, they will be given a score (a percentage) expressing their probability of getting a divorce. Note that "divorce" is just one class; predictions can be made for several other classes too, once a predictive model has been created for each. But that's not all. Not only are we able to predict new cases but, with specific algorithms, we can also find out which parameters matter for an outcome (e.g. getting a divorce) and how important each parameter is (where each parameter is a question on the questionnaire).
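
To give a feel for what "scoring an unseen case" means, here is a minimal Python sketch. It is not the model that will actually be built: the training records, the age bands and the function names (profile, fit, score) are all made up for illustration, and the "model" is nothing more than the divorce rate observed in each group of similar respondents.

```python
from collections import defaultdict

# Hypothetical training records: (age, number of children, divorced?)
TRAINING = [
    (35, 0, True), (38, 0, True), (33, 0, False), (36, 2, False),
    (39, 1, False), (27, 0, False), (29, 1, False), (45, 2, True),
    (50, 3, False), (34, 0, True),
]

def profile(age, children):
    """Map a respondent to a coarse group: (age band, has children)."""
    if age <= 31:
        band = "<=31"
    elif age <= 40:
        band = "32-40"
    else:
        band = ">40"
    return (band, children > 0)

def fit(records):
    """Estimate the divorce rate observed in each group of the training data."""
    counts = defaultdict(lambda: [0, 0])  # group -> [divorced, total]
    for age, children, divorced in records:
        group = counts[profile(age, children)]
        group[0] += int(divorced)
        group[1] += 1
    return {g: d / n for g, (d, n) in counts.items()}

def score(model, age, children, default=None):
    """Return the estimated divorce probability for a new, unseen respondent.

    Groups never seen during training have no estimate, so `default` is returned.
    """
    return model.get(profile(age, children), default)

model = fit(TRAINING)
print(score(model, 37, 0))  # estimate for a 37-year-old without children
```

A real model would be built from far more answers and far more questions, and would need to be checked against cases it has not seen before it can be trusted.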

In our example rule:

IF AGE >31 AND AGE<=40 AND NUM_OF_CHILDREN = 0 THEN DIVORCE="TRUE"

Not only do we see that age and the number of children matter in getting a divorce, but we also learn the relative importance of each parameter. For example, a model may tell us that the most important factor in getting a divorce is whether one has children, and the second most important is age.
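
One common way to rank parameters is information gain, the measure used by many decision-tree algorithms. I have not settled on a specific importance measure, so take the Python sketch below as one possible illustration only. The respondents and the attribute names (age_band, has_children) are invented for the example.

```python
from collections import Counter
from math import log2

# Hypothetical respondents; each dict holds two answers and the outcome.
PEOPLE = [
    {"age_band": "32-40", "has_children": False, "divorced": True},
    {"age_band": "32-40", "has_children": False, "divorced": True},
    {"age_band": "<=31",  "has_children": False, "divorced": True},
    {"age_band": ">40",   "has_children": False, "divorced": True},
    {"age_band": "32-40", "has_children": True,  "divorced": False},
    {"age_band": "<=31",  "has_children": True,  "divorced": False},
    {"age_band": ">40",   "has_children": True,  "divorced": False},
    {"age_band": "32-40", "has_children": True,  "divorced": True},
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target="divorced"):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Rank the parameters from most to least informative about the outcome.
ranking = sorted(
    ["age_band", "has_children"],
    key=lambda a: information_gain(PEOPLE, a),
    reverse=True,
)
print(ranking)
```

In this made-up data, knowing whether someone has children tells us more about the outcome than knowing their age band, which is the kind of ordering described above.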

Of course, there are other kinds of analysis that do not search for common characteristics but instead look for associations between variables. Another type finds homogeneous groups through clustering, but we will leave examples of these types of analysis for another post.

Thursday, 19 July 2007

Questionnaire is Ready

Although I thought it would take a lot of time, the questionnaire is ready. I used a great tool called websurveytoolbox to create the web questionnaire. With it, I managed to have the questionnaire ready in less than 2 days. All the questions and answer data are saved in a MySQL database, and the tool automatically creates JSP pages with the questions. I also registered the domain lifeanalytics.org, although there is not much to see there for now, except that it hosts the necessary code and database for the questionnaire.

To fill in the questionnaire, please visit:

http://lifeanalytics.org/MainSurvey

I timed how long it takes to fill out the questionnaire: assuming you know "Everyday English", it shouldn't take more than 5 minutes to complete. The questionnaire comprises all kinds of questions, with the (tough) aim of describing one's life as well as possible in terms of facts, personal decisions and points of view.

As a last note, I would like to stress that no personal data is requested, not even your e-mail. If you feel that this effort is worth it, please let other people know. You can even digg it to spread the story ;-)

Wednesday, 18 July 2007

Life, Uncertainty and Mathematics

One of the immediate questions that may arise is "How on earth are you going to predict if someone is going to get a divorce?"

So, what I am essentially trying to do here is to model life and its uncertainties with Mathematics... Now, is this possible?

Certainly there are hundreds of factors that could play a role in getting a divorce. A questionnaire of 100 questions or fewer cannot capture every fact of a person's life. But my goal is just to give it a try and see how it goes. Perhaps tens or even hundreds of thousands of answers will be able to give us a clue as to what is happening.

Each extracted rule (see the previous post for what I mean by rules) will be tested for statistical validity with chi-square tests, with a Bonferroni correction applied to adjust for the number of rules tested. Several other techniques will be used to assess the quality of the extracted models. In other words, if there is something there, we will find it.
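
As a rough illustration of that testing step, here is a Python sketch. It assumes each rule is summarized by a 2x2 table of counts (rule matched or not, divorced or not), uses the plain Pearson chi-square statistic without a continuity correction, and compares each p-value against the Bonferroni threshold alpha / m, where m is the number of rules tested. The counts, the rule names and the function names are invented for the example.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Assumes every row and column total is non-zero.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

def p_value_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2))

def significant_rules(tables, alpha=0.05):
    """Keep the rules whose p-value is below the Bonferroni threshold alpha / m."""
    threshold = alpha / len(tables)
    kept = {}
    for name, table in tables.items():
        p = p_value_1df(chi_square_2x2(*table))
        if p < threshold:
            kept[name] = p
    return kept

# Hypothetical counts per rule, in the order:
# (matched & divorced, matched & not divorced,
#  not matched & divorced, not matched & not divorced)
TABLES = {
    "age 32-40, no children": (82, 18, 300, 600),
    "owns a pet": (35, 65, 347, 553),
}

print(significant_rules(TABLES))
```

With two rules tested, each one has to reach p < 0.025 instead of p < 0.05, which is what protects us from "discovering" rules that are only there by chance.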

Once a model is produced and is reasonably accurate, we will be ready to predict unseen cases. Continuing our divorce example: if a model is 80% correct in predicting whether someone will get a divorce, then anyone who fills in the questionnaire at the end of the process will also find out their probability of getting a divorce and, more importantly, why he or she is likely to get one.





Tuesday, 17 July 2007

LifeAnalytics blog has started

I finally made the decision to start the LifeAnalytics blog. Hopefully, many people will find the findings useful: this is research on the patterns that emerge just by living.

What is LifeAnalytics? Simply put, I will be using analytical techniques known as Data Mining (especially classification and association discovery) to understand key facts about a person's life. For example, what are the common characteristics of people who are divorced? What factors play an important role in increasing the risk of getting a divorce?

Of course, divorce is just one possible outcome for someone who is married. Many other facets and facts make up our lives. As an example, consider the following life facts:

- Having a good marriage
- Being happy about work
- Having phobias
- Having an above-average salary
- Being depressed

The goal, then, is to look at all of those possible outcomes in one's life and try to extract "rules" describing what increases (or decreases) the probability of experiencing the above facts. To do this, thousands of people must describe their lives and the idiosyncrasies of their character by submitting a questionnaire.

Example: by analyzing thousands of people's life facts (submitted via the questionnaire), we may extract the following rule:

IF AGE >31 AND AGE<=40 AND NUM_OF_CHILDREN = 0 THEN DIVORCE="TRUE"


In other words, the example rule above says that people between 32 and 40 years old without children have an increased probability (say, 82%) of getting a divorce. Findings and conclusions like this one will be made available on this blog to anyone interested, free of charge of course.
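
For the curious, this is roughly how a percentage like that is attached to a rule. The Python sketch below applies the condition of the example rule to a handful of invented respondents and computes two standard measures: support (the share of all respondents the rule applies to) and confidence (the share of those who are actually divorced). The data and the function names are assumptions made up for the illustration, not real questionnaire results.

```python
# Hypothetical respondents: (age, number of children, divorced?)
RESPONDENTS = [
    (32, 0, True), (35, 0, True), (40, 0, True), (38, 0, False),
    (36, 0, True), (31, 0, False), (41, 0, True), (33, 2, False),
    (37, 1, False), (28, 0, False),
]

def rule_matches(age, num_of_children):
    """Condition of the example rule: AGE > 31 AND AGE <= 40 AND NUM_OF_CHILDREN = 0."""
    return 31 < age <= 40 and num_of_children == 0

def rule_stats(records):
    """Return (support, confidence) of the example rule on `records`.

    support    = share of all respondents the rule applies to
    confidence = share of those respondents who are divorced
    """
    matched = [divorced for age, kids, divorced in records if rule_matches(age, kids)]
    support = len(matched) / len(records)
    confidence = sum(matched) / len(matched) if matched else 0.0
    return support, confidence

support, confidence = rule_stats(RESPONDENTS)
print(f"support={support:.0%} confidence={confidence:.0%}")
```

A rule with high confidence but very low support applies to only a few people, so both numbers matter when deciding whether a rule is worth reporting.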

Think about it. Living our lives creates "data", and combined with the "data" of thousands of others it may give us some really interesting answers.

Stay tuned: the journey to this kind of knowledge has, hopefully, just begun...