Monday, 23 February 2009

Making more sense out of Twitter Tweets


Over the last 5 posts I have described how unstructured text from Twitter can be used for Knowledge Extraction. Specific examples included Sentiment Analysis for products (Amazon's Kindle), segmentation of Twitter users, and finally cluster analysis of the emotions and thoughts expressed by Twitter users.

So far I have discussed some ways that text mining can give us more insight into how people think. Now it is time to bring Information Extraction and Ontologies into the equation.

Information Extraction (IE) is the automated extraction of information such as names (first names, city names, country names etc.), facts or events from unstructured text. An example of IE was given in these posts, where thousands of adverts for flats were extracted and data mining analysis was then performed to identify which characteristics are important for achieving a high rental price.

Ontologies are used for knowledge representation and may also be used for structuring the information that exists on the web. To give an example, consider the following product keywords :
  • Coke
  • Sprite
  • Dr Pepper
If someone asks you what they have in common, your brain looks for generalizations and comes up with the following answers :

  • They are all Carbonated Drinks

  • (Possibly) they all contain sugar since the word "Diet" or "Zero" or "Light" is not mentioned.

Now let's assume we have an Ontology Engine that can do the same and automatically infer that all these products are sugared carbonated drinks. Such a capability enables us to extract facts in a more coherent way. The reason is that we lessen the effect discussed in The Statistics of Everyday Talk and are thus able to capture growing trends, such as people expressing their thoughts about carbonated drinks in general, rather than matching "Coke", "Sprite" and "Dr Pepper" individually. Without Ontologies such a trend could easily be missed.
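To make the roll-up idea concrete, here is a minimal sketch in Python. It is not an Ontology Engine: it performs no inference and only looks keywords up in a hand-written dictionary. The `TAXONOMY` mapping, the `count_concepts` function and the sample tweets are all made up for illustration.

```python
from collections import Counter

# Hypothetical toy taxonomy: product keyword -> more general concept.
TAXONOMY = {
    "coke": "carbonated drink",
    "sprite": "carbonated drink",
    "dr pepper": "carbonated drink",
}

def count_concepts(tweets, taxonomy=TAXONOMY):
    """Count keyword mentions and the concepts they roll up to."""
    keyword_counts = Counter()
    concept_counts = Counter()
    for tweet in tweets:
        text = tweet.lower()
        for keyword, concept in taxonomy.items():
            # Naive substring match; a real system would tokenize properly.
            if keyword in text:
                keyword_counts[keyword] += 1
                concept_counts[concept] += 1
    return keyword_counts, concept_counts

# Invented sample tweets.
tweets = [
    "Having a Coke with lunch",
    "Sprite is all I drink these days",
    "Dr Pepper or nothing",
    "Nice weather today",
]
keyword_counts, concept_counts = count_concepts(tweets)
print(keyword_counts)
print(concept_counts)
```

Each individual keyword appears only once here, while the rolled-up concept collects all three mentions, which is the kind of trend that is easy to miss when matching product names one by one.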

By using Ontologies, or taxonomies where applicable, an associations discovery algorithm can search at different levels of detail. For example, data miners usually employ taxonomic information (e.g. Sprite, Coke, Pepsi = carbonated drinks) when performing associations discovery analysis on supermarket data, and the effort of applying taxonomies almost always pays off in terms of the knowledge extracted about consumer behavior.
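The effect of searching at different levels of detail can be sketched with simple pair counting. This is only an illustration, not a full associations discovery algorithm: the baskets, the `TAXONOMY` mapping and the `pair_counts` function are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical taxonomy, following the example in the text.
TAXONOMY = {
    "sprite": "carbonated drinks",
    "coke": "carbonated drinks",
    "pepsi": "carbonated drinks",
}

def pair_counts(baskets, taxonomy=None):
    """Count how often two items (or their categories) occur together."""
    counts = Counter()
    for basket in baskets:
        # Replace each item by its category when a taxonomy is supplied.
        items = {taxonomy.get(i, i) if taxonomy else i for i in basket}
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

# Invented supermarket baskets.
baskets = [
    ["sprite", "chips"],
    ["coke", "chips"],
    ["pepsi", "chips"],
    ["coke", "bread"],
]
item_level = pair_counts(baskets)
category_level = pair_counts(baskets, TAXONOMY)
print(item_level.most_common(3))
print(category_level.most_common(3))
```

At item level every pair occurs only once, so nothing stands out. At category level the pair of carbonated drinks and chips occurs three times.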

I have used Ontologies over the past 3 years and have seen them in action. The fact that Ontologies can give access to inference and deductive reasoning techniques is of great use. Applying Information Extraction and Natural Language Processing, and then inserting the extracted information into an Ontological setting, has many potential applications.



Sunday, 15 February 2009

Know your customers - The Twitter way


The more I analyze tweets on Twitter, the more interesting I find the whole process. First it was cluster analysis of specific thoughts expressed by Twitter users and then it was Sentiment Mining for Amazon's Kindle. It was just a matter of time before I had the urge to analyze tweets from a broader perspective.

So I decided to perform a segmentation of Twitter users : extract common groups of users, this time not for specific thoughts or specific products but on a more generic basis.

I had two goals in this cluster analysis :

1) Cluster the biographies of users
2) Cluster the tweets of the users.

I then decided that the more information I could collect the better, so the first thing I did was to write a 'spider' program to extract 10,000 Twitter user names. Then, for each Twitter user, the software visits his/her page and extracts :

a) The user's bio
b) Number of followers
c) Number of people following
d) Number of updates
e) 20 latest Tweets
f) Number of re-tweets
g) Number of replies to other users (e.g. when an @user mention exists)
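The fields above could be held in a small record like the one sketched below. This is not the spider itself. The `TwitterUser` class and its field names are hypothetical, and the two counting rules (a re-tweet starts with "RT", a reply starts with "@") are simplifying assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TwitterUser:
    """Hypothetical container for the data collected per user."""
    name: str
    bio: str
    followers: int
    following: int
    updates: int
    tweets: List[str] = field(default_factory=list)  # e.g. the 20 latest

    def retweet_count(self) -> int:
        # Assumption: a re-tweet begins with the "RT" convention.
        return sum(1 for t in self.tweets if t.strip().upper().startswith("RT "))

    def reply_count(self) -> int:
        # Assumption: a reply begins with an @user mention.
        return sum(1 for t in self.tweets if t.strip().startswith("@"))

# Invented example user.
user = TwitterUser(
    name="example_user",
    bio="Mom of two, coffee lover",
    followers=120,
    following=80,
    updates=450,
    tweets=[
        "RT @someone: The new Kindle 2 is out",
        "@friend thanks for the tip!",
        "Making pancakes for the kids",
    ],
)
print(user.retweet_count(), user.reply_count())
```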


Let's see now what we could -potentially- do with such information :

1) Cluster analysis on user bios

2) Cluster analysis on user tweets

3) Classification analysis for identifying the common characteristics of users with many followers

4) Associations discovery between products : Which products tend to be mentioned together in each user's tweets?

5) Identification of common keywords per cluster : If we identify a cluster of users that we characterize as the "Parents", what keywords do "Parents" tend to use more? What about the "Tech junkies" cluster?

But let's start with the first analysis : Clustering the biographies of Twitterers. The analysis generated 30 clusters of users. Some of them are :

1) The Parents
2) The computer Geeks
3) The students
4) The social media addicts
5) The entrepreneurs
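The clustering method itself is not the point here, so the sketch below uses a deliberately simple stand-in: a single-pass "leader" clustering over the word sets of the bios, with Jaccard similarity as the measure. It is not the tool used for the analysis above, and the stop-word list, the threshold and the sample bios are all invented for illustration.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"a", "and", "of", "the", "to", "in", "i", "my"}

def tokens(bio):
    """Lower-cased set of words in a bio, without stop-words."""
    return {w for w in re.findall(r"[a-z]+", bio.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Share of words two sets have in common (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def leader_clusters(bios, threshold=0.15):
    """Assign each bio to the most similar existing cluster, or start a new one."""
    clusters = []  # each cluster: {"words": set, "members": list}
    for bio in bios:
        words = tokens(bio)
        best, best_score = None, 0.0
        for cluster in clusters:
            score = jaccard(words, cluster["words"])
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            best["members"].append(bio)
            best["words"] |= words
        else:
            clusters.append({"words": set(words), "members": [bio]})
    return [c["members"] for c in clusters]

# Invented sample bios.
bios = [
    "Mom of two boys",
    "Proud mom and wife",
    "Python developer and linux geek",
    "Linux geek, open source developer",
]
groups = leader_clusters(bios)
for group in groups:
    print(group)
```

The result of a single-pass method like this depends on the order of the bios and on the threshold, so it is only suitable as a first look at the data.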

I looked at the "Parents" cluster more closely and wanted to find the keywords that this cluster is associated with : "Single" and "Jesus" were some of them.

So we immediately identify one of the many customer groups : the parents, a significant percentage of whom are single. The "Parents" cluster also expresses one of its values : Christianity.

By moving on to each generated cluster and finding the associated keywords, I was able to retrieve the values and beliefs of each cluster. Knowledge Extraction at its best.
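One possible way to find the keywords associated with a cluster is to compare how often a word appears in the cluster's bios with how often it appears in all bios. The sketch below does this with a simple ratio (often called lift). The scoring choice, the function names and the sample bios are assumptions made for illustration, not the method behind the results above.

```python
import re
from collections import Counter

def doc_frequency(bios):
    """Number of bios in which each word appears at least once."""
    counts = Counter()
    for bio in bios:
        counts.update(set(re.findall(r"[a-z']+", bio.lower())))
    return counts

def cluster_keywords(cluster_bios, all_bios, min_count=2):
    """Rank words by (share of cluster bios) / (share of all bios).

    all_bios must include the bios of the cluster itself.
    """
    cluster_df = doc_frequency(cluster_bios)
    overall_df = doc_frequency(all_bios)
    scores = {}
    for word, count in cluster_df.items():
        if count < min_count:
            continue  # ignore words seen too rarely in the cluster
        cluster_share = count / len(cluster_bios)
        overall_share = overall_df[word] / len(all_bios)
        scores[word] = cluster_share / overall_share
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented sample bios.
parents = [
    "Single mom of two",
    "Father of three, love Jesus",
    "Single dad and proud parent",
]
others = [
    "Software developer and coffee addict",
    "Student of economics",
    "Entrepreneur and father of startups",
]
keywords = cluster_keywords(parents, parents + others)
print(keywords)
```

A score above 1 means the word is more common inside the cluster than in the collection as a whole. A word such as "of" scores around 1 because it is spread evenly.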




Wednesday, 11 February 2009

Sentiment Mining for Amazon's Kindle


Following the post on Clustering the thoughts of Twitter users, it is time to look at another example where Twitter can be used. So I decided to analyze -just- 1054 tweets about Amazon's e-reader, the Kindle, to see what I could come up with.

My goal was not to classify tweets as having positive or negative sentiment but to extract the general "buzz" about the product by means of cluster analysis. After extracting the tweets that contain the word "kindle", I went on to remove non-relevant information (such as tinyurl links) by using regular expressions.

Next, it was time to understand the data, and a good way to do this is to look at word frequencies using TextStat. Here is what I came up with :
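The filtering and cleaning step could look roughly like the sketch below. The exact patterns used for the analysis are not given here, so the URL pattern, the function names and the sample tweets (including the placeholder links) are assumptions.

```python
import re

# Assumed pattern: anything that looks like an http(s) link.
URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE = re.compile(r"\s+")

def clean_tweet(text):
    """Remove links and collapse the whitespace they leave behind."""
    text = URL_PATTERN.sub("", text)
    return WHITESPACE.sub(" ", text).strip()

def kindle_tweets(tweets):
    """Keep tweets that mention 'kindle' (case-insensitive) and clean them."""
    return [clean_tweet(t) for t in tweets if "kindle" in t.lower()]

# Invented sample tweets with placeholder links.
raw = [
    "Just ordered a Kindle http://tinyurl.com/abc123 can't wait",
    "Lunch time",
    "kindle2 looks nice   http://tinyurl.com/xyz",
]
cleaned = kindle_tweets(raw)
print(cleaned)
```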



At the top of the word frequency list are the usual suspects : "I", "and", "to", but also "kindle", "kindle2" and "amazon", which was to be expected. Now, let's see some of the words that do not occur frequently :



Here a fact appears that requires attention : text miners use stop-word lists to remove the most frequent words, but they often also remove words that occur infrequently. The table above shows that one infrequently occurring word is "disappointed", and if we had chosen to omit words in a specific frequency range -such as fewer than 3 occurrences- we could lose this important information. So caution is needed.
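The risk of pruning rare words can be shown with a few lines of Python. This is a sketch on invented tweets, not the TextStat output. The `word_frequencies` and `prune_rare` functions are hypothetical helpers.

```python
import re
from collections import Counter

def word_frequencies(tweets):
    """Count how often each word occurs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(re.findall(r"[a-z0-9']+", tweet.lower()))
    return counts

def prune_rare(counts, min_count):
    """Keep only the words that occur at least min_count times."""
    return {word: c for word, c in counts.items() if c >= min_count}

# Invented sample tweets.
tweets = [
    "i love the kindle",
    "the kindle is here",
    "i am disappointed with the kindle",
]
counts = word_frequencies(tweets)
kept = prune_rare(counts, min_count=3)
print(counts.most_common())
print(kept)
```

With a cut-off of 3, the word "disappointed" disappears together with the noise, even though it is the most informative word in the sample.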

After running the analysis, I came up with 20 different clusters of similar "thinking". Note that we are interested not only in what those clusters are but also -more importantly- in the proportion of cases that each cluster contains (see previous post). Some examples of the clusters found are :

1) A cluster of users that are questioning the usefulness of the product
2) Excited users
3) Users that are happy about the text-to-speech feature of the product
4) Text-to-speech and potential copyright issues


Twitter is a great source for sentiment extraction, but one problem is that people re-tweet the same news ("The new Kindle 2 is out") or tweet similar information from various tech news websites.
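One possible mitigation, sketched here as an assumption rather than something applied in the analysis above, is to normalize each tweet and keep only the first occurrence of each normalized text. The patterns, the function names and the sample tweets are invented. Note that this only catches near-verbatim repeats, not differently worded reports of the same news.

```python
import re

# Assumed conventions: "RT @user:" prefixes and http(s) links.
RT_PREFIX = re.compile(r"^(rt\s+@\w+:?\s*)+", re.IGNORECASE)
URL = re.compile(r"https?://\S+")
NON_WORD = re.compile(r"[^a-z0-9\s]")

def normalize(tweet):
    """Strip re-tweet prefixes, links, case and punctuation."""
    text = RT_PREFIX.sub("", tweet.strip())
    text = URL.sub("", text).lower()
    text = NON_WORD.sub("", text)
    return " ".join(text.split())

def drop_repeats(tweets):
    """Keep the first tweet for every distinct normalized text."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize(tweet)
        if key and key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique

# Invented sample tweets with placeholder user names and link.
tweets = [
    "The new Kindle 2 is out",
    "RT @someuser: The new Kindle 2 is out!",
    "rt @another The new Kindle 2 is out http://example.com/news",
    "Not sure the Kindle is worth it",
]
unique = drop_repeats(tweets)
print(unique)
```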