Thursday, 13 December 2007

What People Digg More? - Part 3

In this third and final part on how digg stories are analyzed, I will present an example of the co-occurrence table used to find statistically significant associations between words and story categories.

In the previous part I outlined how stories from digg are collected and how they are transformed into a form suitable for analysis. The last step of the analysis is finding out which subjects people really seem to like and which subjects are not 'digged' so much. To do that, a co-occurrence table is used, which maps words to two categories: HighDiggs (= stories that are interesting) and LowDiggs (= stories that are not interesting).


This is an example of a word co-occurrence table from IBM's Unstructured Information Modeler (UI Modeler)


The statistical significance of the association between a word and a category of interestingness is denoted by color. The more intense the color, the higher the affinity between the word and the category.

In the example table above we see that people are interested in (and therefore 'digg' more):


1) Stories that have pictures
2) US President George W Bush
3) Apple Leopard
4) Ron Paul (not shown in table)


On the other hand, people do not 'digg' stories about:


1) Microsoft (..!)
2) Blogs (not shown in table)



Tuesday, 06 November 2007

What people Digg More? - Part 2

After getting some e-mails requesting more details about the way I analyze diggs, here are the details of the process:

First of all, some coding is obviously necessary to implement software that sifts through digg and records the number of diggs of each story as well as the time that the story has been out. The software is also responsible for selecting all stories that have been out for 10-11 days and for calculating the diggs_per_minute metric, where:

diggs_per_minute = total_diggs / total_minutes
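A minimal sketch of this collection step in Python follows. The function names, the use of datetime values and the sample story are assumptions made for illustration; the original software is not shown in this post.

```python
from datetime import datetime, timedelta


def diggs_per_minute(total_diggs, submitted_at, now):
    """Return total_diggs / total_minutes for one story."""
    total_minutes = (now - submitted_at).total_seconds() / 60.0
    if total_minutes <= 0:
        raise ValueError("story must have been out for a positive amount of time")
    return total_diggs / total_minutes


def in_collection_window(submitted_at, now, min_days=10, max_days=11):
    """True if the story has been out between min_days and max_days."""
    age = now - submitted_at
    return timedelta(days=min_days) <= age <= timedelta(days=max_days)


# Hypothetical story: 1440 diggs, submitted exactly 10 days ago.
now = datetime(2007, 11, 6, 12, 0)
submitted_at = now - timedelta(days=10)
if in_collection_window(submitted_at, now):
    print(diggs_per_minute(1440, submitted_at, now))
```

Stories outside the 10-11 day window are simply skipped, so every score is computed over a comparable time span.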

During the analysis it turned out that the diggs_per_minute metric was not normally distributed: its skewness was positive, at 2.795.






After applying a log transformation, skewness dropped to 0.534, and the transformed metric had a mean value of -2.878.
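The effect of the transformation can be sketched as follows. This assumes the natural logarithm and the moment-based skewness coefficient; the post does not say which log base or which skewness estimator was used, and the sample rates are made up.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical diggs_per_minute values: many small scores and one very "hot" story.
rates = [0.01, 0.02, 0.03, 0.05, 0.08, 0.9]

# The log is only defined for positive rates, so stories with zero diggs
# would have to be dropped or handled separately (an assumption).
log_rates = [math.log(r) for r in rates]

print(skewness(rates), skewness(log_rates))
print(sum(log_rates) / len(log_rates))  # mean of the transformed metric
```

The long right tail of the raw scores is compressed by the logarithm, which is why the skewness of the transformed values is lower.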









The next step is to create a text file, as follows :




Notice that there is a 'highdiggs' or 'lowdiggs' word at the end of each line (story). If the log-transformed diggs_per_minute value of a story exceeds the threshold value -2.878 (the mean found above), 'highdiggs' is appended at the end of the line; otherwise 'lowdiggs' is added.
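The labeling step might look like the sketch below. It assumes that the threshold -2.878 is compared against the natural log of diggs_per_minute (a raw rate can never be negative) and that each line holds the words of one story; the function name and the sample stories are hypothetical.

```python
import math

LOG_THRESHOLD = -2.878  # mean of the log-transformed metric reported above


def label_line(story_text, rate):
    """Append 'highdiggs' or 'lowdiggs' to a story line, given its diggs_per_minute."""
    # rate must be positive, otherwise the logarithm is undefined
    label = "highdiggs" if math.log(rate) > LOG_THRESHOLD else "lowdiggs"
    return f"{story_text} {label}"


# Hypothetical stories and rates, for illustration only.
print(label_line("amazing pictures of the new leopard desktop", 0.5))
print(label_line("microsoft announces a new patch", 0.01))
```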


The last step of the analysis is to use a co-occurrence matrix to see which words are associated with high digg and low digg stories. A chi-square test is used to test for statistical significance of word co-occurrences.
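To give a rough idea of what such a tool computes, here is a minimal sketch of a word/category co-occurrence count with a 2x2 chi-square statistic per word. It is not the algorithm of the IBM tool; the input format (label as the last word of each line) follows the text file described above, and the sample lines are made up.

```python
from collections import Counter, defaultdict


def cooccurrence(lines):
    """Count, per category, in how many stories each word appears."""
    counts = defaultdict(Counter)
    totals = Counter()
    for line in lines:
        *words, label = line.lower().split()
        totals[label] += 1
        counts[label].update(set(words))  # count each word once per story
    return counts, totals


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def word_chi_square(word, counts, totals):
    a = counts["highdiggs"][word]   # word present, high diggs
    b = counts["lowdiggs"][word]    # word present, low diggs
    c = totals["highdiggs"] - a     # word absent, high diggs
    d = totals["lowdiggs"] - b      # word absent, low diggs
    return chi_square_2x2(a, b, c, d)


lines = [
    "funny pictures of cats highdiggs",
    "bush speech pictures highdiggs",
    "microsoft releases patch lowdiggs",
    "microsoft blog update lowdiggs",
]
counts, totals = cooccurrence(lines)
for word in ("pictures", "microsoft", "of"):
    print(word, word_chi_square(word, counts, totals))
```

With one degree of freedom, a statistic above 3.841 corresponds to significance at the 0.05 level. With counts as small as in this toy example the chi-square approximation is not reliable, so real use needs many more stories.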


For the last part of the analysis I use a tool called Unstructured Information Modeler from IBM.



Tuesday, 16 October 2007

What people Digg More?

Due to too much work I wasn't able to write to this blog as much as I wanted. Although people continue to answer the questionnaire (over 400!), I haven't been able to do any further analysis so far that would shed some light on the patterns that emerge from living our lives.

However, I feel that I should write something about my new ventures in text mining. The question that came to my mind was simple:

"What stories do people tend to digg more?"

So I collected all stories on digg, and for each story the number of diggs and the time that the story has been around were recorded. By dividing the number of diggs by the total minutes the story has been out, you get a "Diggs_per_Minute" score which essentially designates which stories are "hot" and which are not.

After the preliminary analysis I immediately found out that it is essential to use data from a specific time period and not just everything. If you think about it, a story should be out for quite a while (say 10 days) so that you are able to get a good estimate of the "Diggs_per_Minute" variable. Stories that have been out for less than 2 days tend to have a much greater Diggs per Minute score than older stories.

So the process is as follows: Diggs from stories that have been out for 10-11 days are collected. I then use text mining techniques to find out what words the stories with many diggs have in common. Don't you think that marketing people would love to know this information?


First Results for Most Digged stories :


1) Stories that have pictures tend to be digged more

2) Having the phrase "Digg this if you....."

3) Specific companies / technologies etc. (e.g. Apple and iPod)


That's all for now, but I will come back with more.



Friday, 14 September 2007

First Results Out : Phobias

Well over 300 people have submitted the questionnaire and described their lives. I must admit that I couldn't wait for this time to come, so that I can start finding out about the patterns that emanate from our lives.

As explained in another post regarding classification analysis, the target in my first attempt is predicting the class "Phobias". Simply put, what are the common characteristics of people who have phobias?

It seems that people who answered > 2 on the good looking scale question are 85% less likely to have phobias. Does this make sense to you?

This is just one example of how interesting patterns may arise from analyzing submitted data. If you have already submitted the questionnaire, why not ask your friends to do it too? The questionnaire can be reached at the following URL:

http://lifeanalytics.org/MainSurvey


Monday, 27 August 2007

LifeAnalytics : Over a month online

Time sure passes by very quickly... it is over one month now since this blog started. Over 200 people have submitted their answers to the questionnaire so far. Over half of the visitors originate from the US, but people from Europe (especially the UK, Germany and the Netherlands) are also producing many hits. A big thanks to kdnuggets (see links area), the best site for data mining news, for listing this blog on its pages.

The truth is that with over 60 questions, we need more people to fill in the survey, so if you are reading this and you haven't filled in the questionnaire yet, you can submit your answers here:


http://lifeanalytics.org/MainSurvey


Very shortly, I will make a new post explaining what types of analysis can also be made, apart from classification and clustering, once we have enough data...

Thursday, 26 July 2007

By public demand, here is more information

The first answers are already coming in. A big thanks to everyone who has already submitted answers to our questionnaire. I get quite a few e-mails asking me how I am going to use the results, so I feel it is time for more explanation.

The process of Data Mining consists of the following steps (simplified..) :

1) Data Collection
2) Data Preparation
3) Analysis
4) Application of results

Currently we are on step (1), collecting the data. Although the questions in your e-mails are about step (4), I feel that it is important to talk about step (3) as well:

By deploying specialized algorithms, we try to find the common characteristics of a specific class in the data (also known as classification). By class we mean a category. For "happiness" we have two possible categories of people: those that are generally happy and those that are not. In the same manner several other classes of people exist.

Once we decide on which class to analyze (say married/divorced), it is time to perform the analysis with the goal of creating a predictive model. Once a reasonably accurate model is created, we are ready to predict new, unseen cases. That essentially means that when new users submit the questionnaire, they will be given a score (= a percentage) for the probability of getting a divorce. Note that "divorce" is just one class; predictions can be made for several other classes too, once a predictive model has been created for each class. But that's not all. Not only are we able to predict new cases, but with specific algorithms we are also able to find out which parameters matter for an outcome (i.e. getting a divorce) and how important each parameter is (where each parameter is a question on the questionnaire).

In our example rule :

IF AGE >31 AND AGE<=40 AND NUM_OF_CHILDREN = 0 THEN DIVORCE="TRUE"

Not only do we realize that age and the number of children are important in getting a divorce, but we are also able to know the relative importance of each parameter. For example, a model may tell us that the most important factor in getting a divorce is whether one has children and the second most important factor is age.
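One common way to rank parameters by importance is information gain, the measure used by many decision tree learners. The sketch below is only an illustration of that idea; the post does not say which algorithm will be used, and the attribute names and records are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(records, attribute, target):
    """Reduction in entropy of `target` after splitting records on `attribute`."""
    base = entropy([r[target] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical questionnaire answers, for illustration only.
records = [
    {"has_children": False, "age_band": "31-40", "divorced": True},
    {"has_children": False, "age_band": "41-50", "divorced": True},
    {"has_children": True, "age_band": "31-40", "divorced": False},
    {"has_children": True, "age_band": "41-50", "divorced": False},
]

for attribute in ("has_children", "age_band"):
    print(attribute, information_gain(records, attribute, "divorced"))
```

In this toy data, knowing whether someone has children removes all uncertainty about the outcome, while the age band removes none, so has_children would be ranked as the more important parameter.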

Of course there are other kinds of analysis which do not search for common characteristics but seek to find associations between variables. Another analysis type finds homogeneous groups through clustering, but we will leave examples of these types of analysis for another post.

Thursday, 19 July 2007

Questionnaire is Ready

Although I thought it would take a lot of time, the questionnaire is ready. I used a great tool called websurveytoolbox to create the web questionnaire. By using it, I managed to have the questionnaire ready in less than 2 days. All data and questions of the questionnaire are saved in a MySQL database and the tool automatically creates JSP pages with the questions. I also registered the domain lifeanalytics.org, although there is not much to see there for now, except that it hosts the necessary code and database for the questionnaire.

To fill in the questionnaire, please visit:

http://lifeanalytics.org/MainSurvey

I timed how long it takes to fill out the questionnaire: assuming you know "Everyday English", it shouldn't take more than 5 minutes to complete. The questionnaire comprises all kinds of questions, with the (tough) aim of describing one's life as well as possible in terms of facts, personal decisions and points of view.

As a last note, I would like to stress that no personal data are asked for, not even your e-mail. If you feel that this effort is worth it, please let other people know. You can even digg it to spread the story ;-)

Wednesday, 18 July 2007

Life, Uncertainty and Mathematics

One of the immediate questions that may arise is "How on earth are you going to predict if someone is going to get a divorce?".

So, what I am essentially trying to do here is to model life and its uncertainties with Mathematics... Now, can this be possible?

Certainly there are hundreds of factors that could play a role in getting a divorce. A questionnaire of 100 or fewer questions cannot capture all the facts of a person's life. But my goal is to just give it a try and see how it goes. Perhaps tens or even hundreds of thousands of answers may be able to give us a clue as to what is happening.

Each rule extracted (see the previous post about what I mean by rules) will be tested for statistical validity through chi-square tests, with a Bonferroni correction applied to adjust for testing many rules. Several other techniques will be used to assess the quality of the extracted models. In other words, if there is something there, we will find it.
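As a rough sketch of how such a test could work: each rule gives a 2x2 table (rule matches or not, outcome true or not), so its chi-square statistic has one degree of freedom, and the Bonferroni correction divides the significance level by the number of rules tested. The function names and the sample statistics below are hypothetical.

```python
import math


def chi2_p_value_1dof(statistic):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at level alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical chi-square statistics for three extracted rules.
statistics = [12.5, 5.0, 3.0]
p_values = [chi2_p_value_1dof(s) for s in statistics]
print(p_values)
print(bonferroni_significant(p_values))
```

The correction is deliberately conservative: the more rules are tested, the stronger the evidence each single rule needs in order to count as significant.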

Once a model is produced and is reasonably accurate, we will be ready to predict unseen cases. In other words, continuing our divorce example, if a model is 80% correct in predicting whether someone will get a divorce, then anyone who fills in the questionnaire at the end of the process will also find out the probability of getting a divorce and, more importantly, why he or she is likely to get one.





Tuesday, 17 July 2007

LifeAnalytics blog has started

I finally made the decision to start the LifeAnalytics blog. Hopefully, many people will benefit from the findings of this research on the patterns that emerge by just living.

What is LifeAnalytics? Simply put, I will be using analytical techniques (especially classification and association discovery), also known as Data Mining, to understand key facts about a person's life. For example, what are the common characteristics of people who are divorced? What factors play an important role in increasing the risk of getting a divorce?

Of course, getting a divorce is just one possible outcome for someone who is married. Several other facets and facts compose our lives... As an example, consider the following life facts:

- Having a good marriage
- Being happy about work
- Having phobias
- Having an above-average salary
- Being Depressed

The goal then, is to look at all of those probable outcomes in one's life and try to extract "rules" that increase (or decrease) the probability of experiencing the above facts. In order to do this, thousands of people must somehow describe their lives and their character idiosyncrasies by submitting a questionnaire.

Example : By analyzing thousands of people's life facts (submitted via questionnaire), we may extract the following rule :

IF AGE >31 AND AGE<=40 AND NUM_OF_CHILDREN = 0 THEN DIVORCE="TRUE"


In other words, the example rule above says that people between 32 and 40 years old without children have an increased probability (say, for example, 82%) of getting a divorce. Findings and conclusions like the example shown above will be given to anyone interested, free of charge of course, through this blog.
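To make the idea concrete, here is a small sketch that applies the example rule to a handful of records and measures how often it holds. The field names and the records are invented for illustration; a real percentage such as the 82% above would come from thousands of submitted questionnaires.

```python
def rule_matches(person):
    """IF AGE > 31 AND AGE <= 40 AND NUM_OF_CHILDREN = 0."""
    return 31 < person["age"] <= 40 and person["num_of_children"] == 0


def rule_confidence(people):
    """Fraction of people matching the rule who are divorced, or None if nobody matches."""
    matched = [p for p in people if rule_matches(p)]
    if not matched:
        return None
    return sum(1 for p in matched if p["divorced"]) / len(matched)


# Hypothetical questionnaire records.
people = [
    {"age": 35, "num_of_children": 0, "divorced": True},
    {"age": 38, "num_of_children": 0, "divorced": True},
    {"age": 40, "num_of_children": 0, "divorced": False},
    {"age": 31, "num_of_children": 0, "divorced": True},   # age 31 does not match
    {"age": 36, "num_of_children": 2, "divorced": False},  # has children, no match
]

print(rule_confidence(people))
```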

Think about it. Living our life creates "data" and along with the "data" of thousands of others we may find some really interesting answers.

Stay tuned, the journey to this kind of knowledge has -hopefully- just begun...