Tuesday, 20 January 2009

Clustering the Thoughts of Twitter Users

In the last two posts I presented the motivation for, and some of the problems of, analyzing the thoughts of users on the web, and on Twitter in particular. (For more, see Part 1 and Part 2.)

As an example, we are going to look at a specific kind of thought that Twitter users express: what they don't want. Using the Twitter API, I extracted all tweets containing the phrase "i don't want to". The following text file shows the results:




The next step is to remove all phrases that give us no information about what users do not want:



Finally, we remove the phrase "i don't want to" itself. However, consider the following example:

"I must go to Chicago. I don't want to do that"


The steps discussed above will discard the first sentence -which is actually what the user does not want to do- and leave only the phrase "i don't want to do that", which is not particularly informative. At this point we must quantify the problem -let's assume it involves 8.5% of our records- and recall what the Pareto principle is all about.
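The filtering and stripping steps described above can be sketched in a few lines. This is a minimal illustration, not the actual pipeline; the sample tweets and the list of uninformative remainders are assumptions:

```python
def extract_wants(tweets):
    """Keep only the part of each tweet that says what the user doesn't want."""
    phrase = "i don't want to"
    results = []
    for tweet in tweets:
        text = tweet.lower()
        # keep only tweets that actually contain the target phrase
        if phrase not in text:
            continue
        # keep the text from the phrase onwards, then strip the phrase itself
        fragment = text[text.index(phrase) + len(phrase):]
        fragment = fragment.strip(" .!?")
        # discard remainders that carry no information, e.g. "do that"
        if fragment and fragment not in ("do that", "do this", "do it"):
            results.append(fragment)
    return results

tweets = [
    "I must go to Chicago. I don't want to do that",   # uninformative remainder
    "I don't want to go to work tomorrow!",
    "Great weather today",
]
print(extract_wants(tweets))  # ['go to work tomorrow']
```

Note how the first tweet is lost entirely, exactly as described above: the informative sentence precedes the phrase, so the extracted remainder is discarded.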


After some additional pre-processing steps, which are not discussed here, I feed the data to K-Means to see what clusters the algorithm comes up with. For a better presentation of the results, here is a screen capture from IBM's UI Modeler:
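The actual clustering was done in IBM's tool, but the idea can be sketched with a toy K-Means over bag-of-words vectors. This is pure Python with deterministic farthest-point initialization; the sample phrases are made up:

```python
def distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def vectorize(phrases):
    """Binary bag-of-words vectors over the shared vocabulary."""
    vocab = sorted({w for p in phrases for w in p.split()})
    return [[1 if w in p.split() else 0 for w in vocab] for p in phrases]

def kmeans(vectors, k, iterations=10):
    # deterministic initialization: first point, then farthest remaining points
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(range(len(vectors)),
                  key=lambda i: min(distance(vectors[i], c) for c in centroids))
        centroids.append(list(vectors[far]))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # assignment step: each point joins its nearest centroid
        for i, v in enumerate(vectors):
            labels[i] = min(range(k), key=lambda c: distance(v, centroids[c]))
        # update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

phrases = ["go to work", "go to work today", "go to school", "go to school tomorrow"]
labels = kmeans(vectorize(phrases), k=2)
print(labels)  # the two "work" phrases share one label, the two "school" phrases the other
```

A real run would of course use far more phrases, a weighted vocabulary, and a tuned value of k.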




We immediately see -in descending order- what Twitter users do not want:

1) They do not want to go to work
2) They do not want to go to school
3) They do not want to hear about various issues
4) They do not want to buy things


Notice also the top two categories, named Miscellaneous and None. These contain thoughts whose frequency is too small to form a cluster. Together, these two categories make up 69.56% of our records, and at this point we should think again about the Pareto principle.

Please note that not all of the necessary work is discussed here; I had to omit several steps that have to take place. In trying to understand what people actually think, I am using an approach that combines Ontologies, Information Extraction, Clustering and Classification analysis, with the ultimate goal of minimizing the percentage of thoughts (69.56% in this example) that cannot form a cluster and increasing the accuracy of the analysis.

It is also interesting that we could move further down a sentence branch (see this post) for even better insight. Here I presented a cluster analysis of what users do not want; as a next step, we could apply clustering to user thoughts specifically for "I don't want to feel".



Thursday, 15 January 2009

The Statistics of Everyday Talk


As discussed in the previous post, the analysis of free text on the Web -for example, the thoughts expressed by Twitter users- can yield very interesting insights into how users think and how they behave.

In 2001 I visited Trillium, where I attended a very useful seminar on Data Cleaning, Data Quality and Standardization, during which the Pareto principle became -once again- evident. Someone who wishes to standardize entries in a database so that the word "Parkway" is written the same way across all records might find the following distribution of "parkway" entries:

15% of records contain the word "Parkway"
3% of records contain the word "Pkwy"
0.2% of records contain the word "Prkwy"
0.01% of records contain the word "Parkwy"

What this essentially means is that with a single SQL query one can find and correct 15% of the "parkway" variants to whatever standardized form is needed. But for the remaining variations, each query solves only a very small fraction of the problem, and this in turn increases the amount of work required, sometimes overwhelmingly.
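To make the diminishing returns concrete, here is a sketch that generates one standardization query per variant using the distribution above. The table and column names, the canonical form, and the 10,000-row total are illustrative assumptions:

```python
# Counts out of a hypothetical 10,000 rows, matching the percentages above
variants = [("Parkway", 1500), ("Pkwy", 300), ("Prkwy", 20), ("Parkwy", 1)]
total_rows = 10_000

for word, count in variants:
    # each standardization rule corresponds to one SQL UPDATE statement
    sql = f"UPDATE addresses SET street = 'PARKWAY' WHERE street = '{word}'"
    print(f"{sql}  -- fixes {count / total_rows:.2%} of rows")
```

The first rule pays for itself many times over; every rule after that costs the same to write and test but fixes a tiny, shrinking slice of the data.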

In capturing and analyzing natural language we are confronted with the same problem: 60% of people might describe the fact that they don't want to go to sleep with a simple "I don't want to go to sleep". But another 20% might use something like "i don't feel like sleeping", and another 10% something like "i don't want to go to bed right now".

So we immediately see one of the issues that text miners face: we can use different phrases to communicate the same meaning. If we wish to analyze text for classification purposes -say, the sentiment of customers- we could achieve 60-65% accuracy in our results with some effort. For a mere 4% increase in accuracy -from 65% to 69%- the amount of extra effort required could prove prohibitive.
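A rule-based sketch makes the paraphrase problem tangible: each rule we add catches one more variant, and anything outside the rule list falls through. The phrases and labels here are illustrative assumptions, not a real classifier:

```python
# Each (pattern, label) pair is one hand-written rule; the long tail of
# paraphrases is what makes the last few points of accuracy so expensive.
RULES = [
    ("i don't want to go to sleep", "SLEEP_REFUSAL"),
    ("i don't feel like sleeping", "SLEEP_REFUSAL"),
    ("i don't want to go to bed", "SLEEP_REFUSAL"),
]

def classify(text):
    text = text.lower()
    for pattern, label in RULES:
        if pattern in text:
            return label
    return "UNKNOWN"

print(classify("I don't feel like sleeping at all"))  # SLEEP_REFUSAL
print(classify("I refuse to sleep"))                  # UNKNOWN: yet another paraphrase
```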

Consider the following chart :




These are all examples of phrases people use in their everyday talk. We can visualize such phrases as starting with "i don't want to", with each branch adding a new meaning to the phrase. The branches marked with numbers are the parts of speech that give us an idea of what a person doesn't want to do: to go, to feel, to visit, to know. Things get much more difficult, in terms of the effort required, if we wish to add more detail -and probably more insight- to our analysis by moving further down the branches of our sentence tree.

Perhaps for marketers, the ability to quantify the distribution of words at the first level of the tree depicted above could be enough. If we end up with the following word distribution:

To feel : 15%
To know : 7%
To go : 1%
To visit : 1%

then we gain insight into which words to use to market products more effectively.
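Counting the first level of the sentence tree can be sketched as follows; the sample tweets are made up, and a real run would normalize the percentages against the full corpus:

```python
from collections import Counter

PREFIX = "i don't want to "

def first_branch(tweets):
    """Distribution of the first word after "i don't want to" (level 1 of the tree)."""
    words = []
    for t in tweets:
        t = t.lower()
        if PREFIX in t:
            rest = t.split(PREFIX, 1)[1]
            if rest:
                words.append(rest.split()[0])
    return Counter(words)

tweets = [
    "I don't want to go to work",
    "I don't want to feel this way",
    "I don't want to go outside",
    "I don't want to know",
]
print(first_branch(tweets))  # Counter({'go': 2, 'feel': 1, 'know': 1})
```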


In the next post we will go through a hands-on example of analyzing the thoughts of Twitter users, and specifically what people seem to "not want".

Monday, 5 January 2009

Emotions, Beliefs and Analytics


When I first came across Data Mining and Machine Learning in 1998, I had no idea of the kinds of applications this field could have. As time passes, the knowledge available to a data/text miner becomes more and more serious business... actually, very serious business.

Not long ago I saw a presentation in which a map of emotions on the web was created in real time by aggregating specific keywords from blog and forum posts. Twistori is an example of such an application. Now, let's take this idea one step further.

Twitter is a "social messaging utility" in which users describe what they are doing -or what they are feeling and thinking- right now. Users can send "tweets" even through SMS messages. The way these messages are written is an ideal format for text mining: short phrases that summarize what a user wants to say are a text miner's paradise.

It is logical to assume that text mining and information extraction techniques will become more important as more data is generated in the future. It is only a matter of time until the next "killer app" like Facebook, YouTube or Twitter appears. Data/text miners will be able to identify common "thought clusters" of people.

Now, consider the following example: by visiting this link you will get a list of people who have written the phrase "I don't want to...." in their tweets.

Once this textual information is captured, pre-processed and then analyzed through cluster analysis, we could end up with the following clusters of "I don't want-ers":


- The cluster of users that do not want to work again/tomorrow/today (18.5%)

- The cluster of users that do not want to go to sleep (6%)

- The cluster of users that do not want to hurt someone (4.2%)


What is also interesting is the ability to quantify the proportion of cases belonging to each cluster relative to the total number of tweets. As shown in the example above, the most frequently occurring thought comes from people who do not feel like working.


Now, in the same way, one could perform this type of analysis for:

"I Believe...."
"I wish i...."
"I want to buy..."

Essentially, what we are talking about is the extraction of the values, hopes and beliefs of hundreds of thousands -or even millions- of users, in descending order of frequency. Once a first run is performed and clusters are extracted, one could repeat this process every month and observe the trends of those clusters over time. It would also be interesting to see how these thought clusters change after specific world events.

For some people, such as marketers and social researchers -provided that the results are accurate enough- this information is invaluable. Others might feel that such an analysis is bad practice. Of course, there are companies that already capture brand sentiment across the web: Crimson Hexagon and Twitrratr are just two examples.


This post is the first in a series discussing the application of Analytics to capture the thoughts that -as we speak- exist on the Web. We will go through ways one could explore this information; more specifically, we will look at:


  • How clustering can group people's values, beliefs and emotions.

  • Why Ontologies and Natural Language Processing are needed for better results.

  • How classification analysis might give us knowledge of the common characteristics of various 'categories' of users.