Tuesday, 06 November 2007

What Do People Digg More? - Part 2

After getting some e-mails requesting more details about the way I analyze diggs, here are the details of the process:

First of all, some coding is necessary: software that sifts through Digg and records, for each story, the number of diggs and the time the story has been out. The software also selects all stories that have been out for 10-11 days and calculates the diggs_per_minute metric for each of them,

where:

diggs_per_minute = total_diggs / total_minutes
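As a rough illustration of this step, here is a minimal Python sketch. The Story class, its field names, and the sample values are made up for the example (the actual crawler is not shown in this post); only the 10-11 day window and the formula come from the text above.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class Story:
    # Hypothetical record; the real software's data layout is not shown.
    title: str
    total_diggs: int
    total_minutes: float  # minutes the story has been out


def diggs_per_minute(story: Story) -> float:
    # diggs_per_minute = total_diggs / total_minutes
    return story.total_diggs / story.total_minutes


def in_age_window(story: Story, min_days: float = 10, max_days: float = 11) -> bool:
    # Keep only stories that have been out for 10-11 days.
    age_days = story.total_minutes / MINUTES_PER_DAY
    return min_days <= age_days <= max_days


# Illustrative values only, not real Digg data.
stories = [
    Story("Example story A", 1500, 10.5 * MINUTES_PER_DAY),
    Story("Example story B", 40, 10.2 * MINUTES_PER_DAY),
    Story("Example story C", 900, 3.0 * MINUTES_PER_DAY),  # too recent, dropped
]

selected = [s for s in stories if in_age_window(s)]
rates = {s.title: diggs_per_minute(s) for s in selected}
```

Restricting the sample to one age window means every story has had roughly the same time to collect diggs, so the rates are comparable.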

During the analysis, the diggs_per_minute metric turned out not to be normally distributed: its skewness was 2.795 (positive, i.e. a long right tail).
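For readers who want to check skewness on their own data, here is a small sketch. It assumes the moment-based (Fisher-Pearson) coefficient; the post does not say which estimator produced 2.795, and statistics packages may apply a bias correction that gives slightly different numbers. The sample list is invented.

```python
def skewness(values):
    # Moment-based sample skewness: m3 / m2**1.5 (no bias correction).
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Invented rates: most stories are slow, one is very popular.
sample_rates = [0.01, 0.02, 0.02, 0.03, 0.05, 0.40]
sample_skew = skewness(sample_rates)  # positive: long right tail
```

A value near 0 indicates a roughly symmetric distribution; a clearly positive value means a few very popular stories stretch the right tail.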






After applying a log transformation, skewness dropped to 0.534, and the transformed metric has a mean value of -2.878.
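The transformation itself can be sketched as follows. Two assumptions are made here that the post does not state: the natural logarithm is used, and stories with a rate of zero are skipped because their logarithm is undefined. The input values are invented.

```python
import math


def log_transform(rates):
    # Natural log of each rate; zero rates are skipped (an assumption).
    return [math.log(r) for r in rates if r > 0]


def mean(values):
    return sum(values) / len(values)


# Invented rates; the 0.0 entry stands for a story with no diggs.
sample_rates = [0.01, 0.02, 0.02, 0.03, 0.05, 0.40, 0.0]
log_rates = log_transform(sample_rates)
log_mean = mean(log_rates)  # negative, since all rates are below 1
```

Because diggs_per_minute is usually well below 1, the log-transformed values are negative, which is why the mean reported above is a negative number.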









The next step is to create a text file with one story per line.




Notice that each line (story) ends with the word 'highdiggs' or 'lowdiggs'. If the log-transformed diggs_per_minute value of a story exceeds the threshold of -2.878 (the mean found above), 'highdiggs' is appended to the end of its line; otherwise 'lowdiggs' is appended.
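The labelling rule can be sketched like this. The line layout (story title followed by the label) and the function name are assumptions for illustration, and the natural logarithm is assumed again; only the threshold -2.878 and the two labels come from the text.

```python
import math

THRESHOLD = -2.878  # mean of the log-transformed metric, from the post


def label_line(title: str, rate: float) -> str:
    # Append 'highdiggs' if log(rate) exceeds the threshold, else 'lowdiggs'.
    # Assumes rate > 0 and a "title label" layout for each line.
    label = "highdiggs" if math.log(rate) > THRESHOLD else "lowdiggs"
    return f"{title} {label}"


# Invented titles and rates.
lines = [
    label_line("Example story A", 0.10),   # log(0.10) is about -2.30
    label_line("Example story B", 0.003),  # log(0.003) is about -5.81
]
text_file_body = "\n".join(lines)
```

Using the mean of the transformed metric as the cut-off splits the stories into two groups of broadly similar size, since the transformed distribution is only mildly skewed.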


The last step of the analysis is to build a co-occurrence matrix to see which words are associated with high-digg and low-digg stories. A chi-square test is used to check whether each word co-occurrence is statistically significant.
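To make the idea concrete, here is a hand-rolled sketch for a single word. It is not how the IBM tool works internally (that is not described here); it assumes a plain Pearson chi-square test on a 2x2 table without continuity correction, and the labelled lines are invented.

```python
import math


def contingency(lines, word):
    # 2x2 table: rows = word present / absent, columns = highdiggs / lowdiggs.
    table = [[0, 0], [0, 0]]
    for line in lines:
        *tokens, label = line.lower().split()
        row = 0 if word in tokens else 1
        col = 0 if label == "highdiggs" else 1
        table[row][col] += 1
    return table


def chi_square_2x2(table):
    # Pearson chi-square statistic and p-value for 1 degree of freedom.
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # an empty row or column: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented labelled lines, in the assumed "title label" layout.
labelled = [
    "apple iphone review highdiggs",
    "apple mac rumor highdiggs",
    "local news report lowdiggs",
    "weather update lowdiggs",
]
table = contingency(labelled, "apple")
chi2, p_value = chi_square_2x2(table)
```

With only four lines the test is not trustworthy (the expected counts are far too small); the toy data is there only to show the mechanics. In a real run the same table would be built for every word in the file.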


For this last part of the analysis I use a tool from IBM called Unstructured Information Modeler.


