Monday, 28 December 2009

Building a Knowledge Hub


The web is a huge source of information. It stores the facts, thoughts, feelings and intentions of people. It also records, in an indirect way, what people like and what they don't - something we will be looking at shortly. Some examples of harnessing this information were shown previously in this blog, such as :

  • Extraction of user opinions, beliefs and values from Twitter
  • Prediction of popular stories on Digg
  • Prediction of popular Tweets

Consider the following snapshot from a BBC webpage :


The table above shows the most popular business stories on the BBC on 22 December 2009. Even though we have no explicit metrics, we intuitively understand that the order in which the stories are listed also tells us the popularity of each post. Notice that the first of the most read stories is about the British economy while the last one is a title regarding football.

This is knowledge that we can harness. It is, no doubt, a very specific kind of knowledge because it tells us only what -mostly- British readers of the BBC have found interesting. In other words this is knowledge for a specific population : most likely in another country -say France- the title about the UK still being in recession would not be so interesting, but a title about France being in the same situation would. Subject, Time and Location are all important parameters that need to be captured and taken into account.

Let's consider the idea of creating a Knowledge Hub : This could be done by collecting massive amounts of information from Social Media, blogs, comments from forums and news titles (and their popularity). Techniques such as Information Extraction with concept annotation, Data and Text Mining could be used to extract knowledge by combining incidents, opinions, intentions and emotions found from different sources.

I have been monitoring and collecting, for the past 3 months, news and forum posts generated from/for a specific country. The information collected is then annotated in such a way as to extract concepts : the text annotation is matched against keywords of concepts, incidents and intentions. Over the past month there has been a considerable increase in negative economic sentiment, crime-related incidents and terms that communicate future social instability and uneasiness.
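A minimal sketch of this kind of keyword-to-concept matching follows. The concept lexicon below is purely illustrative - it is not the one used in the actual analysis, which matched far richer keyword lists :

```python
# Minimal keyword-to-concept annotation : map each document to the
# concepts whose keyword lists it matches. Lexicon is illustrative only.
CONCEPTS = {
    "NegativeEconomy": {"recession", "layoffs", "unemployment", "debt"},
    "CrimeIncident": {"robbery", "theft", "assault"},
    "SocialUnrest": {"protest", "strike", "riot"},
}

def annotate(text):
    """Return the set of concepts whose keywords appear in the text."""
    tokens = set(text.lower().split())
    return {concept for concept, keywords in CONCEPTS.items()
            if tokens & keywords}

print(annotate("New layoffs announced as recession deepens"))  # → {'NegativeEconomy'}
```

A real system would of course use stemming, multi-word phrases and synonym lists rather than exact token matches.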

It is a very interesting fact that our behavior is recorded -up to a point- by the web. Again, the key is the way that we are able to organize this information into logical chunks and then use this representation to find possible insights.

2009 has been a year of big changes. Best wishes for a Happy and Prosperous New Year for everyone.

Saturday, 31 October 2009

The sentiment on US Economy from Twitter

Is the economic crisis over? What is the sentiment of people regarding the US Economy and the future? These are some of the questions that many people are asking these days, and the signs are somewhat mixed. The Dow Jones is close to the 10000 mark and some US Economy indices show that the worst is behind us. But do people feel the same?

To answer these questions, 10000 Tweets containing the word economy were collected with the purpose of finding out what people think and how they feel about the US Economy and the economic crisis. The following web chart shows some of the results :



PositiveSentiment is an annotation type that includes all words that suggest positivity, such as good, better, advances, while the opposite annotation (NegativeSentiment) exists for all keywords that suggest negativity.

The bolder the line between two words, the stronger the association. To get an idea of how people feel, look at the line that connects NegativeSentiment and the word still, which implies that the strongest sentiment is that the US Economy is still in serious trouble.
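As a rough sketch of how such annotation-word associations can be counted, consider the following. The keyword lists are illustrative stand-ins for the real annotation types, and raw co-occurrence counts stand in for whatever association metric drives the line weights in the chart :

```python
from collections import Counter
from itertools import combinations

# Illustrative keyword lists; the real annotation types cover many more words.
POSITIVE = {"good", "better", "advances"}
NEGATIVE = {"bad", "worse", "crisis"}

def annotate_tokens(tweet):
    """Replace sentiment keywords with their annotation type."""
    mapped = []
    for tok in tweet.lower().split():
        if tok in POSITIVE:
            mapped.append("PositiveSentiment")
        elif tok in NEGATIVE:
            mapped.append("NegativeSentiment")
        else:
            mapped.append(tok)
    return mapped

def cooccurrence(tweets):
    """Count how often two tokens/annotations appear in the same tweet;
    higher counts correspond to bolder lines in the chart."""
    counts = Counter()
    for tweet in tweets:
        for pair in combinations(sorted(set(annotate_tokens(tweet))), 2):
            counts[pair] += 1
    return counts

tweets = ["economy still bad", "the economy is still in crisis"]
print(cooccurrence(tweets)[("NegativeSentiment", "still")])  # → 2
```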

Some other findings :

- The US President says that the economy is getting better, but people don't feel the same.

- The economy cannot be getting better while, at the same time, there are layoffs.

- People express very negative feelings after losing their jobs.


Notice also the association between NegativeSentiment and people, job, money, sales. Interesting insights can also be found if brand names and product categories are taken into account : in this analysis a specific brand was found to be associated with the word sales and a good overall sentiment. Buying behavior can also be inferred from expressed consumer intentions.

You will also find an association between finance_institution keywords (implying the keyword Fed) and PositiveSentiment. This association exists because a number of Re-Tweets are about the Fed signaling the start of the exit from recession and its impact on housing. Also interesting is the association between the word fool and the annotation PositiveSentiment (...)

Specific Tweets were removed, such as spam Tweets (that try to sell investing products). Re-Tweets were kept intact since we make the assumption that if someone Re-Tweets -say- a positive sentiment Tweet then he/she also feels the same -positive- sentiment. Tweets that were jokes were identified, marked accordingly and removed.

As with many examples in the past, the software used consisted of GATE (for annotating the unstructured text of Tweets) and SPSS Clementine (now PASW Modeler). Here is the setup from GATE :




Specific (JAPE) rules were used to identify and annotate negative and positive sentiment accordingly. Consider the following sentences :

- The economy is most likely bad at the moment
- If the economy is great then why so many people can't find a job?

The first sentence clearly has a negative sentiment since the word bad is present. The second phrase, however, contains the word great, so a specific matching rule should take into consideration the word If and annotate this phrase as having negative sentiment despite the presence of the word great.
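The JAPE grammar itself is GATE-specific, but the logic of such a rule can be sketched in plain Python. The keyword lists are illustrative, and the "if" check is a deliberately simplified stand-in for the real pattern-matching rule :

```python
import re

NEGATIVE = {"bad", "worse", "terrible"}
POSITIVE = {"good", "great", "better"}

def sentiment(sentence):
    """Annotate a sentence; a positive word inside an 'If ...' construction
    is flipped to negative, mimicking the matching rule described above."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    has_if = "if" in tokens
    if any(t in NEGATIVE for t in tokens):
        return "NegativeSentiment"
    if any(t in POSITIVE for t in tokens):
        return "NegativeSentiment" if has_if else "PositiveSentiment"
    return "Neutral"

print(sentiment("The economy is most likely bad at the moment"))
# → NegativeSentiment
print(sentiment("If the economy is great then why so many people can't find a job?"))
# → NegativeSentiment, despite the word 'great'
```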

After running GATE, here is how the -now structured- data looks on a smaller sample of the original dataset (notice the highlighted record and the IfGood flag) :


With data in a structured form such as the one depicted above, we are ready to identify which Tweets were found to have a positive or negative sentiment, see erroneous annotations, take corrective actions and finally analyze the information and extract knowledge from it.

Monday, 12 October 2009

Mining the Tweets

I received through my Google Alerts a very interesting article : Twitter is in talks with Microsoft and Google regarding the use of Data Mining technology on user Tweets.

Despite the fact that Twitter execs do not appear so eager to close the deal as soon as possible, this news clearly shows where things are going. If and when the deal is finalized it will be very interesting to see :


1) What kind of Data and Text Mining techniques will be mostly used? Which of them will prove useful?

Many examples of what can be done in terms of Data and Text Mining applications on Twitter were given in this blog (starting from January 2009). In my opinion, the types of analysis that will prove interesting -apart from Sentiment Mining for Products and Services, which is already taking place- are Cluster Analysis (see the post "Clustering the Thoughts of Twitter Users" here) and Prediction of Virality.

Although Twitter will be able to monetize the insights extracted from Cluster Analysis and Opinion - Sentiment Mining, perhaps the most important analysis is finding patterns in user emotional states. Recall that everything needed for such an analysis exists in user Tweets : life events, thoughts and their associated emotional states. What emotions drive people when making decisions such as which Product to buy or which Politician to support? What kind of feelings are generated during a bad economy? Perhaps by analyzing Tweets we could understand people (and thus consumers) in entirely new ways, since this is the first time that this information is available to us.

2) How will Twitter users react when they know their Tweets are being analyzed?

My first impression is that Twitter users do not care too much if companies extract the insights discussed above; however, this does not mean that people's opinion will stay this way. User reaction on this matter is something that could change at any time and should be watched closely.

3) Which other technologies will be mostly sought?

Although no one can give a definitive answer, I would expect Natural Language Processing (NLP) and Ontologies to also be heavily used and sought as expertise.

Sunday, 09 August 2009

Surviving Cancer, Happiness and Twitter

Twitter is a great source of information on how people feel and how they behave. In previous posts we have discussed several examples of extracting from Twitter posts the feelings of Twitter users, their beliefs and values.

My latest analysis goal was to extract specific life events (such as the birth of a child) and the associated feelings and emotions of such an event.

First I wanted to identify life events associated with happiness. To do this I used text classification and a great piece of software called GATE. The data used originated from the tweets of 60K Twitter Users and their biographies.

After completing the analysis, several "patterns of happiness" emerged, but I believe there is one that deserves a post of its own and should be disclosed : one of the happiest groups of people on Twitter is cancer survivors. I was really amazed to find out that these people, who faced -and possibly still face- this life threatening disease, were amongst the happiest people on Twitter and very frequently used words expressing happiness, satisfaction and blessedness.

I do believe that Twitter is a huge source of information and insights for Marketing, Branding and PR. It also appears that by analyzing Tweets we could learn some important life lessons as well.

More to come soon.

Tuesday, 04 August 2009

A computer program predicts Viral Tweets

In the previous post we saw that the author of a Tweet is the most important factor in making a Tweet viral. This time we will use Text Mining to score Tweets and see how viral they could become. Each Tweet is fed to a computer program (an algorithm) and the algorithm responds with the probability that the Tweet will become viral (we assume that when a Tweet receives more than 30 Re-Tweets it is considered viral).



The information given to the algorithm is the text of the Tweet and its author. Many other parameters could be taken into consideration, such as the time the Tweet was posted, the type of the Tweet (i.e. politics, technology, health, etc.) or even whether the Tweet is part of a novel subject. Here is the output of the software that performs the predictions :




The number of Re-Tweets is shown in squares. Pay close attention also to the circled text shown above. For each Tweet the most probable outcome is given ('t' = the Tweet will become viral, 'f' = otherwise), along with a confidence for each prediction as a number from 0 to 1. As an example, the first Tweet shown above was posted by Paula Abdul, saying that she will not return to American Idol. The algorithm predicts with a confidence of 63.38% that what Paula Abdul posted will be interesting (and it actually was).

The predictive model has an overall accuracy of 72.88% in predicting which Tweets will be viral, over a total of 59 Tweets. An example of an incorrect prediction can be seen at the 4th circle from the top : the algorithm gave a 53.66% confidence that this Tweet would not become viral, but it actually was a viral Tweet.

You can find the text file of the actual run from the algorithm here.

By looking at the text file, result metrics such as TP (True Positives) versus FP (False Positives) can be calculated. It is also interesting to see how the algorithm switches to negative predictions as the number of Re-Tweets of each Tweet falls below 30.
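The TP/FP bookkeeping over such an output file can be sketched as follows, with the actual label derived from the 30 Re-Tweet viral cutoff assumed in this post :

```python
def confusion(rows):
    """rows: (predicted, retweets) pairs, predicted being 't' or 'f'.
    The actual label is derived from the 30 Re-Tweet viral cutoff."""
    tp = fp = tn = fn = 0
    for predicted, retweets in rows:
        actual = "t" if retweets > 30 else "f"
        if predicted == "t":
            tp, fp = (tp + 1, fp) if actual == "t" else (tp, fp + 1)
        else:
            tn, fn = (tn + 1, fn) if actual == "f" else (tn, fn + 1)
    accuracy = (tp + tn) / len(rows)
    return tp, fp, tn, fn, accuracy

# toy rows, not the actual 59-Tweet run
rows = [("t", 45), ("t", 10), ("f", 5), ("f", 60)]
print(confusion(rows))  # → (1, 1, 1, 1, 0.5)
```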

Even though the example given here is very simplistic -and optimistic-, the application of a tool of this kind to PR, Marketing and Branding could prove very useful. Marketers can try different messages and see what impact each message is likely to have. Consider the following run, which shows that @mashable is more influential than @lifeanalytics :




The following run shows that specific keywords raise our chances in making a Viral Tweet :




In theory this information could provide the basis for performing A/B tests : one could simply use the 2 messages shown above and record the impact each one has using Google Analytics (a process which could prove whether this technology works or not).

Finding information that is interesting to the masses is actually a much harder problem. Twitter is a data source that is biased for many reasons : specific people can pass their messages with great ease, and Twitter is used by specific population segments. Almost a week ago I came across reddit and I believe that this site (and also Digg) is able to capture the preferences of the masses more efficiently than Twitter. The truth is that the information available from forums, blogs and many other websites can capture different aspects of human behavior. All that is needed to extract useful knowledge is an efficient blending of these facts, emotions and beliefs of people from different web sources.

Tuesday, 30 June 2009

Predicting the next Viral Tweet

It is time to use Twitter data for another purpose : can Predictive Analytics be used to identify which tweets have an increased probability of becoming viral?





First we have to define the problem and see what information we should consider. Every Tweet has an author and content, and is posted on a specific day and time. More specifically, for every tweet we can collect usage data such as :

  • Day of Post
  • Time of post
  • Elapsed minutes since tweet has been posted
  • Author of tweet (Twitter username)
  • Number of followers of the author
and also information such as :

  • Subject of post
  • Whether the tweet involves a question being asked
  • Whether the tweet contains hashtags
  • Whether the tweet contains a "Please Re-Tweet" directive (or variants)
  • Whether a user is mentioned
  • The text of the tweet itself.

Our goal then is to combine the information mentioned above and come up with a predictive model that, when given an author, day, time of post and the text of the tweet, will be able to tell us whether this tweet has an increased probability of becoming viral.

For this Data & Text Mining exercise (and keeping in mind that tweets have been sampled from one website and not Twitter itself) let's define what a viral tweet is : after collecting approx. 8000 tweets from dailyrt.com it was found that the median number of Re-tweets is 17. Here we make the assumption that a tweet exceeding 30 Re-tweets is considered viral (and this specific assumption actually makes the classification task much easier).
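Extracting the content flags listed earlier can be sketched like this. The regular expressions are assumptions for illustration, not the patterns actually used in the analysis :

```python
import re

VIRAL_CUTOFF = 30  # Re-Tweets, per the assumption above

def tweet_features(author, text, retweets):
    """Turn one tweet into the flags listed above, plus the target label."""
    return {
        "author": author,
        "is_question": "?" in text,
        "has_hashtag": "#" in text,
        "asks_for_rt": bool(re.search(r"please\s+(rt|re-?tweet)", text, re.I)),
        "mentions_user": bool(re.search(r"@\w+", text)),
        "viral": retweets > VIRAL_CUTOFF,
    }

f = tweet_features("lifeanalytics", "Please RT : great #analytics post @someone", 42)
print(f["asks_for_rt"], f["has_hashtag"], f["viral"])  # → True True True
```

Rows of this form are exactly what a classifier (or a Feature Selection step) consumes downstream.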

As discussed above, usage data does not tell us anything about the content of a tweet. Usage data tells us the name of the author, his/her followers, when the tweet was posted and how many minutes have elapsed since its posting. Can this information alone predict whether a tweet will become viral? A data mining model predicted (without using the elapsed time as an input field) with an overall accuracy of 75.03% whether a tweet can be viral and -perhaps as expected- showed that the most important factor in making a viral tweet is its author. Running a process called Feature Selection tells us just that :



But what we have seen so far tells only one side of the story - the Data Mining side. With Text Mining we can see the importance of words and authors. To do that, each author is appended to the end of each tweet (so, essentially, the author becomes part of the tweet text). Here is what Feature Selection tells us :
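The author-as-token trick can be sketched as follows. The score here - deviation of each token's viral rate from the base rate - is a simple stand-in for whatever metric the real Feature Selection step uses, and the data is made up :

```python
from collections import Counter

def token_scores(tweets):
    """tweets: (text, author, viral) triples. The author is appended as one
    more token, so words and authors compete in the same ranking."""
    total, viral = Counter(), Counter()
    n_viral = sum(1 for _, _, v in tweets if v)
    for text, author, is_viral in tweets:
        for tok in set(text.lower().split()) | {"@" + author.lower()}:
            total[tok] += 1
            if is_viral:
                viral[tok] += 1
    base = n_viral / len(tweets)
    # deviation of each token's viral rate from the base rate
    return {t: abs(viral[t] / total[t] - base) for t in total}

data = [
    ("michael jackson tribute", "mashable", True),
    ("breaking news update", "mashable", True),
    ("my lunch update today", "someone", False),
]
scores = token_scores(data)
print(scores["@mashable"] > scores["update"])  # → True
```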



A Tweet mentioning Michael Jackson has a great probability of becoming viral, but it should perhaps also be posted by a popular author to make a greater impact. Pay attention also to the fact that @mashable and @theonion are at the top of the feature selection list shown above.

The difficult -but also interesting- task is to predict a viral tweet that has an impact not because of its author but because of its content, and to do this the methodology of data collection and analysis differs significantly.

In the next post we will see a model predicting viral tweets in action : we will submit several tweets and their authors, and the model will tell us the probability that each submitted tweet will become viral.

Tuesday, 23 June 2009

How Habitat UK *should* have used Twitter

Following the great post by Tiphereth Gloria, I wanted to take the opportunity to show an example of how Habitat UK should be using Twitter.

My suggestion would be that instead of the "initiative" they took, they should identify the values, beliefs and needs of their customers by capturing and analyzing relevant tweets. And here is how they could do it :

First, they should capture all relevant Tweets every -say- month :



The second step would be to identify what people want when they talk about furniture. If they used Text Mining they would find the specific furniture products that customers want to buy and the values associated with these product types. As an example, look at the following table :



The table shows us (pay attention to the dark red cells) that customers looking to buy baby furniture have Safety as their number one associated value. With this knowledge, Habitat UK could make sure that when they advertise baby furniture they use this word in their advertisements to capture the interest of their customers. Of course, what was shown above is not new information; it is merely meant as an example.

Some more things that Habitat UK could have done with Text Mining would be to see :

  • How important it is to suggest solutions to customers.
  • Which rooms people want to re-furnish most often and -more importantly- why.
  • How problems (such as furniture arriving damaged or being difficult to assemble) affect their brand.
  • How excited people feel when they wait for their new furniture...and how bad they feel when furniture is not delivered on time.

There is much more that can be done : by running Cluster Analysis, many kinds of customer thoughts can be grouped together. One of the groups found showed how closely "feeling good" is related to new furniture and how furniture affects people's psyche.

By using Social Media Analytics, Habitat UK -and most other companies- would understand their customers better, see what is important to them and, with this knowledge, be able to take informed decisions that would -most likely- make a real difference.


Friday, 19 June 2009

How people use Twitter - 10 distinct usage groups

In this post we will be looking at another example of cluster analysis performed on Twitter. The analysis was performed on 17000 Twitter users with the goal of extracting distinct usage groups, which essentially show us the different types of usage behavior of Twitter users. The following parameters were taken into consideration :

  • Number of Followers
  • Number of Links posted per 20 Tweets (not during RT)
  • Number of Updates
  • Elapsed Days

The following table shows the results :


Note that each cluster has a specific number from 1 to 10. Clusters are listed according to their size, which means that cluster "10" is the largest usage group while cluster "5" is the smallest.

Let's see what the table tells us, starting with the first line : cluster 10 is the largest (= most frequent) type of usage behavior. Users of that group have an average number of followers, have been using Twitter for relatively many days (elapsedDays = high), have a high number of updates, while the number of links they provide per 20 tweets is average -say around 3 links.

Now consider the -highlighted- cluster 8, which we will call The Information Providers : notice that even though this group of users has relatively few elapsed days and an average number of updates, it achieves a high number of followers. The reason is that these users provide a large number of links per 20 Tweets (note that this confirms the findings of a previous analysis).

See also cluster 3 : even though this group of users has been on Twitter for many days and has a high number of updates, it appears to pay a price for not providing links.

Recall that the "#OfLinks" parameter counts only those links that are NOT part of a Retweet. This tells us that users who are able to find original content and provide it to the community tend to gain more followers.

This analysis is intended as a simple example and should not be considered a detailed analysis, since few parameters have been taken into account. Cluster Analysis on Twitter data (which includes things that people like doing, professions, interests, marital status and mentions of products or opinions, to name a few) can -potentially- give us excellent insights into different aspects of user behavior.
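For readers curious about the mechanics, here is a minimal k-means over usage-style feature vectors. Real tools use more robust initialization and the features should be standardized first; the numbers below are toy values on comparable scales :

```python
def kmeans(points, k, iters=20):
    """A tiny k-means; points are feature vectors such as
    [followers, links_per_20_tweets, updates, elapsed_days].
    For a sketch, the first k points serve as initial centers."""
    centers = [list(p) for p in points[:k]]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# two obvious usage groups (toy 2-D numbers, already on comparable scales)
users = [[1, 1], [9, 9], [1, 2], [9, 8], [2, 1], [8, 9]]
centers, clusters = kmeans(users, 2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```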

Wednesday, 03 June 2009

Social Media, Corporate Decisions and Analytics

Over the past 6 months we have seen real-world applications of Data and Text Mining applied on Social Media Data from Twitter. We went through many examples that look at Social Media Data in different ways :


  • We identified what Twitter users don't want, grouped their beliefs and also ordered all of this information accordingly

  • We identified which usage behavior increases our chances of having a large number of followers (if a large number of followers is our goal)



  • We found which words appear to be associated with a large number of followers. (We have seen that negative thinking and words in Tweets possibly drive people away)




  • We extracted segments of Twitter users with similar characteristics.


The list of possible applications does not end here. Over the next posts we will also discuss :


  • Predicting whether a Tweet has the potential to become "viral".

  • Associating specific events and user emotional states.

To recap : a computer program is able to monitor the words and phrases that you use and your emotions, flag them as positive or negative, track the rate at which you increase your follower count, track the number of updates, Re-Tweets, replies, hashtags, smileys and questions that you make, flag any mentions of products and services and assign you to a predefined segment of users sharing similar behavior and interests. Then, for each segment, its "social media fitness value" is identified (by looking at the follower count).

Usage of Google Wave will possibly reveal other insights : because the sequence of posts will be easily extracted, we could also take into consideration the number of consecutive posts that had a positive sentiment and whether these positive posts appeared at the beginning, center or end of each thread's sequence. We could also look at the number of posts -within the same thread- having videos or pictures attached and ultimately identify how all of this information may affect one's point of view. Of course I am not certain whether such a scenario would prove useful. I sure would like to try though.

We are presented with a unique opportunity to understand people much better than before, and with the examples shown so far this should be clearer by now. Predictive Analytics is about extracting knowledge and identifying what is more likely to work. As Ian Ayres put it in his book Super Crunchers, decisions are beginning to be based ever more on facts and less on intuition. It appears that Social Media Analytics will play an important role in Corporate decision-making for PR, Branding and Marketing, and this will happen through a better understanding of human behavior.

Thursday, 21 May 2009

Twitter Analytics : Cluster Analysis reveals similar Twitter Users

So far we have seen various examples of using analytics to gain insights from Twitter. Cluster analysis is a personal favorite : it enables us to identify common groups of users, and in this post we are going to look at a segmentation based on user biography keywords. This analysis was also presented in an older post, but some readers asked me to elaborate a bit more on this type of analysis.

Biography information allows us to segment Twitter users into groups of similar interests, professions and qualities. What is more interesting, however, is that we can identify the words that each segment appears to be associated with. Let's see an example of words that tend to co-exist with the phrase "social media" in the Biographies of Twitter users :




By looking at the column named "social_media" we see some associated keywords such as : addiction (a synonym for addicted, junkie etc.), evangelist, enthusiast, analytics etc.
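The core of such a keyword co-occurrence count can be sketched in a few lines. The biographies below are invented for illustration :

```python
from collections import Counter
import re

def cooccurring_with(bios, anchor="social media"):
    """Count words appearing in biographies that contain the anchor phrase."""
    counts = Counter()
    for bio in bios:
        low = bio.lower()
        if anchor in low:
            counts.update(w for w in re.findall(r"[a-z]+", low)
                          if w not in anchor.split())
    return counts

bios = [
    "Social Media evangelist and analytics junkie",
    "Social media enthusiast, marketing evangelist",
    "Linux developer and gamer",
]
print(cooccurring_with(bios)["evangelist"])  # → 2
```

A real analysis would also collapse synonyms (addicted, junkie → addiction) before counting.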

Other groups found and their associated words were :

The Geeks : Developer, Linux, Mac, gaming, photography
The Parents : married, boys, girls, christian, conservative
The business owners : CEO, entrepreneur, marketing, founder, lifestyle

Note that "The Geeks" have Mac as an associated keyword, which of course refers to the Apple Macintosh : an example suggesting a possible strong bond between a brand and a specific customer segment.

Now imagine running a similar analysis for other segments such as Single Dads and Mothers, Teenage Girls, Nice Guys, IT Developers, VIPs or any other "segment" you prefer (see this entry -posted Jan. 2009- for more).

On a personal note : having used Text Mining on Twitter over the past 6 months, I have realized that whenever a new cycle of analysis is made I come up, most of the time, with things that I already know. But apart from the expected results, some of the fine details of people's lives also appear, such as the implications of a life-changing event, the joy of owning something new or the plain fact of "watching TV and feeling bored". Many of the insights found during these months -although not discussed here on purpose- are highly thought-provoking.

Perhaps Twitter Analytics could also give us some possible clues on:

  • Whether a specific profession could be a risk factor for being single.

  • How important fashion is for girls.

  • How mobile phone user requirements change according to the "segment" users belong to.

  • What the most common things are that people don't want.

  • Finding individuals that do not fit any "segment".

But the list of potential applications does not end here : using a technique called Association Rule Learning (or Association Discovery) we can extract emotions or thoughts that appear to co-exist and also emotions that seem to be associated with specific events. Classification Analysis can also play an important part (more on these techniques soon).
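In its simplest one-to-one form, Association Rule Learning can be sketched as follows. The event/emotion annotations in the toy data are illustrative :

```python
from collections import Counter
from itertools import combinations

def rules(transactions, min_support=0.3, min_conf=0.6):
    """Emit (antecedent, consequent, support, confidence) for single-item
    rules A -> B over sets of annotations (events, emotions, thoughts)."""
    n = len(transactions)
    item_counts, pair_counts = Counter(), Counter()
    for t in transactions:
        items = set(t)
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    out = []
    for (a, b), c in pair_counts.items():
        for x, y in ((a, b), (b, a)):
            support, conf = c / n, c / item_counts[x]
            if support >= min_support and conf >= min_conf:
                out.append((x, y, round(support, 2), round(conf, 2)))
    return out

data = [{"job_loss", "sadness"}, {"job_loss", "sadness"}, {"new_baby", "joy"}]
print(("job_loss", "sadness", 0.67, 1.0) in rules(data))  # → True
```

Production implementations (e.g. Apriori) handle multi-item antecedents and prune the search space, but the support/confidence idea is the same.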

Each technique looks at the Social Media Data world from a different perspective. Usage behavior, cluster membership, emotions and thoughts and also the Tweets that users seem to prefer most (using data from sites such as repeets.com) may be combined. What we can potentially achieve from a combined analysis of this kind will be discussed in later posts.

As already stated in previous posts : the use of the methods described so far enables us to form hypotheses, but in no way is it assumed that the associations found are the definite cause of a specific event.

Thursday, 14 May 2009

Twitter Analytics : Bio information and popularity

In the previous post we identified words used in Tweets that appear to be associated with a low number of followers : we found that when someone uses foul or negative language, his/her follower count appears to be affected negatively (see here for more).

It is time to identify the words contained in the biographies of popular Twitter users and, to be more specific, the biographies of users in the top 30% (in terms of number of followers) of a random sample of 10000 users. As I have always stated in this series of posts : treat the results as possible clues only. Please also notice how I have used (in this and older posts) the words "appears" or "were found" when discussing correlation. The technique shown is the same as discussed in the previous post. The results are as follows :




  • Student appears to be correlated with low popularity accounts.

  • Engineer also appears often in low popularity accounts, although the correlation was not found to be as strong as for students.

  • Common words found in popular users' Bios appear to be the following : social, media, marketing, CEO, founder, author, entrepreneur, blog, twitter, news, writer, internet.

Some comments :


  • It is not suggested that by having specific words in your bio you will get more followers. Many other things are and could be important in achieving a high follower count. The same applies to unpopular accounts.

  • Looking at the results, I wondered why students were found to be associated with low follower numbers, and I think this requires more attention. One possible reason could be that students might spend most of their social media time on Facebook or other SM sites. There can be many pitfalls in performing random sampling from Twitter and "Students" could be one of these cases. Please share your comments.

  • Notice that some of the words that appear to be associated with high follower numbers are words that communicate authority (ex. founder, CEO).


To recap from the last 3 posts :

1) Do not use foul language - keep your conversations positive.
2) Use "Thank you" often. "Stay tuned" seems to work well also.
3) Post frequently. Posting some links is also important.
4) Make sure you have a good Bio filled in.


Finally, if you find the contents of this blog interesting, you can always look for more updates on my new Twitter account @lifeanalytics and also send me your suggestions and/or comments.

Tuesday, 05 May 2009

Twitter Analytics : These words may be affecting your popularity

Text Mining techniques can be used to identify specific words that are correlated with Twitter accounts having high or low popularity. This can be done in two ways : (1) by analyzing the text of the Tweets of each user and (2) by analyzing the text of the biography of each user.

Let's start with the results of the first type of analysis, with data originating from user Tweets. Pay attention only to the cells that are highlighted in red, their corresponding category column (LOWFOLLOWERS, HIGHFOLLOWERS) and the word at the beginning of each corresponding row. The results show which words appear to be important, especially because the affinity shown here is moderate. Use the results as possible clues only.



The results so far show us that :

hate, bed : are found to be correlated with low popularity
top, online, send, list, web, media, join : with high popularity


Here is another portion of the results table :




The pattern should be evident by now : words with a negative attitude appear to influence a user's follower count negatively. As also shown above, foul language appears to work negatively too. Several other insights were found, such as specific phrases that are correlated with low popularity ("watching TV") while other phrases ("stay tuned") correlate with popular accounts. The number shown in parentheses quantifies the magnitude of the association of each word and thus enables us to order words by their importance.

Some of the words -and their synonyms- that were found to be associated with very low follower counts are :

- Sleep, Hate, Damn, Feeling, Homework, Class, Boring, Stuck


A total of 63 words and 25 phrases were found to have either a positive or negative association with the follower count. Interestingly, specific phrases that communicate any kind of opportunity are also associated with a high number of followers. "Thank you" is highly related to a user's large popularity.

Here comes the interesting part : Once the Text Mining analysis is completed, a predictive model can be generated that may be used for scoring future Tweets. Let's assume that you are about to send the following 2 Tweets :

1) 'Today i feel like sleeping all day. Yawn...'
2) '@xyz Your website traffic can be increased with good marketing'

Before you post however, you decide to feed these 2 sentences to a predictive model. The predictive model returns for every Tweet the predicted result (GOOD or BAD) and the associated probability. Here are the results for these 2 examples from an actual run :





In other words :

1) The first Tweet may have a negative effect, with a probability of 83.5%
2) The second Tweet may have a positive effect, with a probability of 99.9%
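A scoring step of this kind can be sketched with a minimal Naive Bayes classifier. This is not the model actually used in the post (whose algorithm is not disclosed), and all training tweets and labels below are invented for illustration:

```python
import math
from collections import Counter

# Hypothetical training data: tweets labeled GOOD or BAD for follower impact
train = [
    ("feel like sleeping all day so bored", "BAD"),
    ("hate homework boring day", "BAD"),
    ("stuck in class again yawn", "BAD"),
    ("increase your website traffic with good marketing", "GOOD"),
    ("top online media tips join my list", "GOOD"),
    ("send me your web marketing questions", "GOOD"),
]

def train_nb(data):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = {"GOOD": Counter(), "BAD": Counter()}
    class_counts = Counter()
    vocab = set()
    for text, label in data:
        words = text.lower().split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab

def score(text, word_counts, class_counts, vocab):
    """Return (predicted label, probability) using Laplace smoothing."""
    total = sum(class_counts.values())
    logp = {}
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        logp[label] = lp
    best = max(logp, key=logp.get)
    # normalize log scores into a probability for the winning class
    z = sum(math.exp(v - logp[best]) for v in logp.values())
    return best, 1.0 / z

wc, cc, vocab = train_nb(train)
print(score("today i feel like sleeping all day yawn", wc, cc, vocab))
```

With real training data the returned label and probability would play the role of the (GOOD/BAD, probability) pairs shown in the run above.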


Note that :

  • A predictive model is able to consider combinations of words, not just single words. This considerably raises the accuracy of any prediction.
  • In any real-world application of Text Mining, 100% prediction accuracy cannot be achieved. Although the figure is application-specific, a 72-78% accuracy may be reached with considerable effort. Of course many more factors contribute to high popularity, and the example above is given merely to illustrate what techniques currently exist. A combination of analytical techniques is the best option and will be discussed in a future post.

Several other types of analysis can extract similarly interesting insights : Let's not forget that Twitter Tweets contain the emotions, beliefs and values of users. They contain what people want and what they don't want. See Clustering the thoughts of Twitter Users and Know your customers the Twitter way for a further discussion on this.

There will be more to say about Text Mining and how it can be put to use by PR Agencies and Marketing companies with practical examples shortly.

Minggu, 03 Mei 2009

Twitter Analytics : Which usage behavior attracts many followers?

This is the first part of a series of posts where Data Mining and Text Mining will be applied to extract potentially useful facts about the usage of Twitter and to draw some conclusions, such as what makes a Twitter account interesting to other users.

The conclusions that will be presented here are from the analysis of 3651 Twitter accounts and are meant to show how Predictive Analytics can help. Please note that results are shown for informational purposes only.


First, the data used can be summarized with the following table :





You can immediately see problems in the ranges of the data used, especially in the number of "followers" and "following". This is to be expected, since among the users captured were Jack Dorsey (founder of Twitter), Sen. McCain and George Stephanopoulos - users that obviously have a huge number of followers.

Before finding which usage behavior attracts many followers, one should be able to define what exactly a "popular Twitter account" is. Is it just the absolute number of followers? Perhaps it could be equally important -or at least interesting- to also look at :

1) The followers/following ratio

2) The number of followers per day

For our example the absolute number of followers was used as the only criterion of a successful Twitter account. The results can be summarized with the following decision tree :





Some usage patterns that raise the chance of having a successful Twitter account are the following :

  • Having a bio is an absolute must : 82.3% of unsuccessful Twitter accounts have their biography information missing.

  • You should provide more than 3 links per 20 tweets and also more than 0.960 updates per day

  • If you don't want to provide more than 3 links per 20 tweets, then try to post more than 5.857 updates per day.

  • Users that post more than 3 links per 20 tweets but post less than or equal to 0.960 updates per day, will need more than 222.5 days of usage to get an adequate amount of followers.
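For illustration, the rules above can be encoded directly as a function. The thresholds come from the decision tree quoted in the post; the function itself is a hypothetical simplification:

```python
def likely_successful(has_bio, links_per_20, updates_per_day, elapsed_days):
    """Hedged encoding of the decision-tree rules quoted in the post."""
    if not has_bio:
        # 82.3% of unsuccessful accounts have no biography
        return False
    if links_per_20 > 3:
        if updates_per_day > 0.960:
            return True
        # low update rate: needs a long account age to accumulate followers
        return elapsed_days > 222.5
    # few links: compensate with a high update rate
    return updates_per_day > 5.857

# An account with a bio, frequent links and frequent updates
print(likely_successful(True, 4, 1.2, 30))
```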

By using Feature Selection we are also able to look at the relative importance of each parameter in achieving many followers. Here are the results of Feature Selection using the ChiSquare, GainRatio and InfoGain attribute evaluators.



=== Attribute selection 10 fold cross-validation (stratified), seed: 1 ===

ChiSquare :

average merit average rank attribute
362.743 +-10.419 1 +- 0 4 numberOfLinks
319.397 +-10.133 2.4 +- 0.49 6 hasBlankProfile?
311.661 +- 8.612 2.6 +- 0.49 7 updatesPerDay
192.525 +- 7.481 4.1 +- 0.3 3 retweetsNumber
178.236 +- 5.963 4.9 +- 0.3 1 elapsedDays
36.148 +- 3.579 6 +- 0 2 otherUsersTalk
17.843 +- 4.475 7 +- 0 5 questionsAsked

GainRatio :

average merit average rank attribute
0.1 +- 0.003 1 +- 0 6 hasBlankProfile?
0.042 +- 0.001 2.4 +- 0.49 4 numberOfLinks
0.039 +- 0.002 3.2 +- 0.6 3 retweetsNumber
0.04 +- 0.004 3.4 +- 0.92 7 updatesPerDay
0.025 +- 0.001 5 +- 0 1 elapsedDays
0.011 +- 0.001 6 +- 0 2 otherUsersTalk
0.005 +- 0.001 7 +- 0 5 questionsAsked

InfoGain :

average merit average rank attribute
0.082 +- 0.002 1 +- 0 4 numberOfLinks
0.074 +- 0.003 2.1 +- 0.3 6 hasBlankProfile?
0.071 +- 0.002 2.9 +- 0.3 7 updatesPerDay
0.044 +- 0.002 4.1 +- 0.3 3 retweetsNumber
0.041 +- 0.001 4.9 +- 0.3 1 elapsedDays
0.008 +- 0.001 6 +- 0 2 otherUsersTalk
0.004 +- 0.001 7 +- 0 5 questionsAsked


We see that all three attribute evaluators agree that the number of links provided in Tweets and whether the user's profile is filled in are the two most important parameters for achieving many followers. Notice also that sending messages to other users (otherUsersTalk) and asking questions (questionsAsked) are not as important as one would expect.
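The ChiSquare evaluator above scores each attribute against the class via a contingency table. Here is a minimal sketch for a single binary attribute, using toy numbers rather than the post's data:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table [[a, b], [c, d]]:
    rows = attribute value (yes/no), columns = class (LOW/HIGH followers)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Toy table: of 100 blank-profile accounts, 80 have LOW followers;
# of 100 filled-profile accounts, only 30 do.
stat = chi_square_2x2(80, 20, 30, 70)
```

A large statistic means the attribute and the class are strongly dependent, which is why hasBlankProfile? ranks near the top in the tables above.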

The analysis shown above gives many insights, but it does not take into account what the users say and how that affects the popularity of a Twitter account. Text Mining will try to give some answers to this question and also identify which keywords on Twitter profiles seem to be associated with many followers.

Senin, 27 April 2009

Twitter Analytics : Words that make a difference

Predictive Analytics are already widely used on Twitter to extract -potentially- interesting insights. In previous posts we discussed :

  • Sentiment Analysis and Ontologies
  • Analyzing the biographies of Twitter users and identifying clusters of similar users.
  • Cluster Analysis on the thoughts of Twitter users
  • Identifying the values and beliefs of Twitter users.

One additional interesting insight is knowing what makes a Twitter user attract many followers. Consider the following questions :

  • Are there words that could potentially decrease the popularity of a Twitter account?
  • How important is it to have an actual photo (and not the default o_O photo)?
  • Which interests or professions tend to be associated with many followers?
  • How important is it to have at least a small amount of biography text?

To answer these questions, data from 100,000 Twitter users was collected over the past few weeks. The information collected includes the number of followers, number of friends, total updates, number of Retweets (per 20 tweets), number of replies to other users, number of links to external URLs, number of months the user has been on Twitter, etc. Here is what the data looks like :


You will notice that the caret '^' is used as the field separator. The first portion of each line contains the user name, date of account creation, months elapsed since account creation, number of friends, number of re-tweets etc.
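Parsing such caret-separated records is straightforward. The field names below are illustrative guesses based on the description above, not the actual schema:

```python
def parse_record(line):
    """Parse one caret-separated record into a dict.
    Field order follows the description in the post; names are illustrative."""
    fields = line.strip().split("^")
    keys = ["user", "created", "months_elapsed", "friends", "retweets"]
    rec = dict(zip(keys, fields))
    for k in ("months_elapsed", "friends", "retweets"):
        rec[k] = int(rec[k])
    return rec

# Invented sample line in the same shape
rec = parse_record("jdoe^2008-11-02^6^150^3")
```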

The first analysis performed was to identify whether specific keywords in user biographies seem to be associated with a large number of followers. A second type of analysis was performed only on numeric data (such as number of re-tweets, number of user replies, number of updates, etc). A third type of analysis uses both a vector of keywords and the numeric data. Since a lot of work is needed, the process (but not all results) will be presented over the next posts.

FYI : Users that use the words "boredom", "boring" or "bored" a lot tend to minimize their chances of being popular.


Kamis, 12 Maret 2009

Social Media Monitoring with ScoutLabs - Interview

In my previous posts I have shown some examples of Sentiment Analysis using Twitter. I came across a Sentiment Analysis product named ScoutLabs, which is able to give insights into what customers are saying on the Web about a product or service, and decided to interview ScoutLabs CEO Jennifer Zeszut :





- Please tell us about ScoutLabs and how companies may benefit from using it.

Scout Labs is a powerful, web-based application that finds signals in the noise of social media to help teams build better products and stronger customer relationships.

Scout Labs is a product company, not an agency. We provide cutting-edge technology and a collaborative platform for companies and their agents to listen to customers and engage with them out across the Internet. With Scout Labs, our users:

* Know when to tune in and what’s most important to pay attention to
* Hear what customers love and hate about brands
* Reach out to influential customers to build relationships
* Engage in proactive customer service
* Let the voice of the customer inspire new product and marketing ideas

Scout Labs has grown significantly since it was founded in 2006. With offices in San Francisco and users all over the world, the company currently employs over 20 professionals. Our CEO and product team guide the application with insight from the world of marketing, brand management and product management, but the majority of Scout Labs employees are senior engineers with expertise in search technology, high-performance systems, natural language processing, machine learning, web crawling and data visualization.


- How is ScoutLabs different from other sentiment analysis solutions?

Many sentiment analysis solutions are really human-powered, which is great if you've got a big budget and/or a lot of time. Ours is automated: we process millions of posts per day and score them for sentiment as it happens, with an accuracy rate (agreement with humans) of 73% - and it will get better, because as users (across our system) change sentiment values in our system, we aggregate that data and use it as labeled data to improve our algorithms even further. We can also "backfill" or back-score 3 months of previous data with sentiment scores in 20 minutes to (at most) 24 hours.


- What kind of information (product names, company names, areas, city names) can ScoutLabs identify in user conversations "Out of the Box" ?

Scout Labs lets you track anything you like. It is purely search based (not a database of x fixed company names). We see searches on everything! Company names, product names, people, industries, "the price of rubber", green energy, styrofoam, "tricked out shoe", "canceling cable", "favorite hotel in bali" -- you name it. For some of these, sentiment doesn't make a ton of sense, but we'll score it for you nonetheless (just in case).

- One interesting functionality is being able to identify keywords associated with products. For example, consider two different brands of running shoes such as Nike and Adidas. "Great design" might be associated with the first brand while the other brand is perceived as being "comfortable". Is ScoutLabs able to automatically identify such information for two or more similar brands/products?

We do offer an analysis of the top conversations associated with any search. This comes straight from the conversations themselves: they are the frequent words emerging from the conversations for the time period. And yes! Very often we find powerful adjectives that describe a brand (sometimes good, but not always). For example, when I did a search for Dora the Explorer (the popular cartoon character for the pre-school set), the words that emerged were "skank" and "sexy". Huh? Sure enough, parents are outraged by a Mattel announcement about how they are going to have Dora grow up, move to New York and get fashionable (with a short skirt). This is a true story: http://www.scoutlabs.com/2009/03/12/scandalous-doll-drama-scout-labs-style/



- How much should one expect to pay for such a service?
A single team with 25 searches (which is where most companies start) is only $249. For now. We have only committed to this low price for the first 1000 companies to try us. That price may go up one day soon. All our pricing plans are here: http://www.scoutlabs.com/plans/

Senin, 23 Februari 2009

Making more sense out of Twitter Tweets


Over the last 5 posts I have described how unstructured text information from Twitter can be used for Knowledge Extraction. Specific examples were given, such as Sentiment Analysis for products (Amazon's Kindle), segmentation of Twitter users, and finally cluster analysis of the emotions and thoughts expressed by Twitter users.

So far I have discussed some ways that text mining could give us more insight into how people think. Now it is time to bring Information Extraction and Ontologies into the equation.

Information Extraction (IE) is the automated extraction of information such as (to name a few) names (first names, city names, country names etc), facts or events from unstructured text. An example of IE was given in these posts, where thousands of adverts for flats were extracted and data mining analysis was then performed to identify which characteristics are important for achieving a high rental price.

Ontologies are used for knowledge representation and may also be used for structuring the information that exists on the web. To give an example, consider the following product keywords :
  • Coke
  • Sprite
  • Dr Pepper
If someone asks you what they have in common, your brain looks for generalizations and comes up with the following answers :

  • They are all Carbonated Drinks

  • (Possibly) they all contain sugar since the word "Diet" or "Zero" or "Light" is not mentioned.

Now let's assume an Ontology Engine that is able to do this - to infer automatically that all these products are sugared carbonated drinks. Such a capability enables us to extract facts in a more coherent way: we lessen the effect discussed in The Statistics of Everyday Talk and are thus able to capture growing trends, such as people expressing their thoughts regarding carbonated drinks in general, rather than matching "Coke", "Sprite" and "Dr Pepper" individually. Without Ontologies such a trend could easily be missed.

By using Ontologies, or taxonomies where applicable, an associations discovery algorithm can search at different levels of information detail. For example, data miners usually employ taxonomic information (e.g. Sprite, Coke, Pepsi = carbonated drinks) when performing associations discovery analysis on supermarket data, and the effort of applying taxonomies almost always pays off in terms of the knowledge extracted about consumer behavior.
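A taxonomy lookup of this kind can be sketched in a few lines: brand mentions are rolled up to their parent concept before counting, so a category-level trend split across individual brands is not missed. The taxonomy entries here are illustrative:

```python
from collections import Counter

# Minimal taxonomy sketch: brand -> parent concept (illustrative entries only)
taxonomy = {
    "coke": "carbonated drink",
    "sprite": "carbonated drink",
    "dr pepper": "carbonated drink",
    "pepsi": "carbonated drink",
}

def generalize(mentions, taxonomy):
    """Roll brand mentions up to their taxonomy category before counting."""
    return Counter(taxonomy.get(m.lower(), m.lower()) for m in mentions)

counts = generalize(["Coke", "Sprite", "Dr Pepper", "tea"], taxonomy)
```

Counted individually, no single brand stands out here; counted at the category level, "carbonated drink" dominates - which is exactly the effect described above.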

I have used Ontologies over the past 3 years and have seen them in action. The fact that Ontologies give access to inference and deductive reasoning techniques is of great use. The application of Information Extraction and Natural Language Processing, with subsequent insertion of the extracted information into an Ontological setting, has many potential applications.



Minggu, 15 Februari 2009

Know your customers - The Twitter way


The more I analyze tweets on Twitter, the more interesting I find the whole process. First it was cluster analysis of specific thoughts expressed by Twitter users, and then Sentiment Mining for Amazon's Kindle. It was only a matter of time before I felt the urge to analyze Tweets from a broader perspective.

So I decided to perform a segmentation of Twitter users : extracting common groups of users, but this time not for specific thoughts or specific products - a segmentation on a more generic basis.

I had two goals in this cluster analysis :

1) Cluster the biographies of users
2) Cluster the tweets of the users.

I then decided that the more information I could collect the better, so the first thing I did was write a 'spider' program to extract 10,000 Twitter user names. For each Twitter user, the software then visits his/her page and extracts :

a) The user's bio
b) Number of followers
c) Number of people following
d) Number of updates
e) 20 latest Tweets
f) Number of re-tweets
g) Number of replies to other users (e.g. when the @user directive exists)


Let's see now what we could -potentially- do with such information :

1) Cluster analysis on user bios

2) Cluster analysis on user tweets

3) Classification analysis for identifying the common characteristics of users with many followers

4) Associations discovery between products : Which products tend to be mentioned together in each user's tweets?

5) Identification of common keywords per cluster : If we identify a cluster of users that we characterize as the "Parents", what keywords do "Parents" tend to use more? What about the "Tech junkies" cluster?

But let's start with the first analysis : Clustering the biographies of Twitterers. The analysis generated 30 clusters of users. Some of them are :

1) The Parents
2) The computer Geeks
3) The students
4) The social media addicts
5) The entrepreneurs

I looked at the "Parents" cluster more closely and wanted to find the keywords this cluster is associated with : Single and Jesus were some of them.

So we immediately identify one of the many customer groups : the parents, a significant percentage of whom are single. The "Parents" cluster also expresses one of its values : Christianity.

By moving on to each generated cluster and finding the associated keywords, I was able to retrieve the values and beliefs of each cluster. Knowledge Extraction at its best.




Rabu, 11 Februari 2009

Sentiment Mining for Amazon's Kindle


Following the post on Clustering the thoughts of Twitter users, it is time to look at another example where Twitter can be used. So I decided to analyze -just- 1054 tweets about Amazon's e-reader Kindle to see what I could come up with.

My goal was not to classify positive versus negative sentiment but to extract the general "buzz" about the product by means of cluster analysis. After extracting the tweets containing the word "kindle", I removed non-relevant information (such as tinyurl links) using regular expressions.
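The URL-stripping step can be sketched with a regular expression. The exact pattern used in the post is not given, so this one is an approximation:

```python
import re

# Approximate pattern: strips full URLs and bare tinyurl links
URL = re.compile(r"(?:https?://\S+|tinyurl\.com/\S+)", re.IGNORECASE)

def clean_tweet(text):
    """Remove links, then collapse the leftover whitespace."""
    text = URL.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_tweet("Kindle 2 is out! http://tinyurl.com/abc123 check it")
```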

Next, it was time to understand the data, and a good way to do this is to look at word frequencies using TextStat. Here is what I came up with :



At the top of the word frequency list are the usual suspects : "I", "and", "to", but also "kindle", "kindle2" and "amazon", which was expected. Now, let's look at some of the words that do not occur frequently :



Here a fact appears that requires attention : Text miners use stop-word lists to remove the most frequent words, but they often also remove words that do not occur frequently. The table above shows that one non-frequently occurring word is "disappointed"; if we had chosen to omit words below a certain frequency -such as fewer than 3 occurrences- we could lose this important information. So caution is needed.
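The caution above can be demonstrated directly: on a toy corpus (invented for illustration), a naive minimum-frequency cutoff would discard "disappointed" along with the genuinely useless rare words:

```python
from collections import Counter

# Toy corpus - invented for illustration
tweets = [
    "love my kindle",
    "kindle 2 is out",
    "amazon kindle rocks",
    "disappointed with the kindle screen",
]
freq = Counter(w for t in tweets for w in t.split())

# Naive cutoff: words seen fewer than 3 times are dropped -
# which would also drop the rare but informative word "disappointed"
rare = {w for w, c in freq.items() if c < 3}
```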

After running the analysis, I came up with 20 different clusters of similar "thinking". Note that we are interested not only in what those clusters are but also -more importantly- in the proportion of cases that each cluster contains (see previous post). Some examples of the clusters found are :

1) A cluster of users that are questioning the usefulness of the product
2) Excited users
3) Users that are happy about the text-to-speech feature of the product
4) Text-to-speech and potential copyright issues


Twitter is a great source for sentiment extraction, but one problem is that people re-tweet the same news ("The new Kindle 2 is out") or tweet similar information from various tech news websites.

Selasa, 20 Januari 2009

Clustering the thoughts of Twitter Users

In the last two posts I presented the reasons for, and some problems of, analyzing the thoughts of users on the web and particularly on Twitter. (For more see Part1 and Part2.)

As an example, we are going to look at a specific kind of thought that Twitter users express : what they don't want. Using the Twitter API, I managed to extract all tweets containing the phrase "i don't want to". The following text file shows the results :




The next step is to remove all phrases that do not give us any information about what users do not want :



Finally we remove the phrase "i don't want to". However, consider the following example:

"I must go to Chicago. I don't want to do that"


The steps discussed above will discard the first sentence - which is actually what the user does not want to do - and leave only the phrase "i don't want to do that", which is not particularly informative. At this point we must quantify the problem -let's assume it involves 8.5% of our records- and recall what the Pareto principle is all about.
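The preprocessing steps described can be sketched as follows. The guard list for uninformative remainders is a hypothetical simplification of the additional work the post alludes to:

```python
import re

PHRASE = re.compile(r"i don'?t want to\s*", re.IGNORECASE)

def extract_unwanted(tweet):
    """Keep only what follows 'i don't want to'; return None when
    no informative remainder is left."""
    m = PHRASE.search(tweet)
    if not m:
        return None
    rest = tweet[m.end():].strip()
    # guard against uninformative anaphora like "do that" (cf. the Chicago example)
    if rest in ("do that", "do this", ""):
        return None
    return rest

print(extract_unwanted("i don't want to go to work today"))
```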


After some additional pre-processing steps, which are not discussed here, I fed the data to K-Means to see what clusters the algorithm comes up with. For a better presentation of the results, here is a screen capture from IBM's UI Modeler :




We immediately see -in descending order- what Twitter users do not want :

1) They do not want to go to work
2) They do not want to go to school
3) They do not want to hear about various issues
4) They do not want to buy things


Notice also the top two categories, named Miscellaneous and None. These categories contain thoughts whose frequency is too small to form a cluster. These two categories constitute 69.56% of our records, and at this point we should think again about the Pareto principle.

Please note that not all the necessary work is discussed here, and I had to omit several actions that have to take place. In trying to understand what people actually think, I am using an approach that combines Ontologies, Information Extraction, Clustering and Classification analysis, with the ultimate goal of minimizing the percentage of thoughts (69.56% in this example) that cannot form a cluster and of increasing the accuracy of the analysis.

It is also interesting that we could move further down the sentence branch (see this post) for even better insight. Here I presented a cluster analysis of what users do not want. As an example, we could apply clustering to user thoughts specifically for "I don't want to feel".



Kamis, 15 Januari 2009

The Statistics of Everyday Talk


As discussed in the previous post, the analysis of free text on the Web -and as an example the thoughts expressed by Twitter users- could extract very interesting insights on how users think and how they behave.

In 2001 I visited Trillium, where I attended a very useful seminar on Data Cleaning, Data Quality and Standardization, during which the Pareto principle became -once again- evident. When someone wishes to standardize entries in a database so that the word "Parkway" is written the same way across all records, he might find the following distribution of "parkway" entries :

15% of records contain the word "Parkway"
3% of records contain the word "Pkwy"
0.2% of records contain the word "Prkwy"
0.01% of records contain the word "Parkwy"

What that essentially means is that with a single SQL query one can find and correct 15% of the "parkway" synonyms to whatever standardized form is needed. But for each of the remaining variations, one query solves only a very small fraction of the problem, and this in turn increases the amount of work required, sometimes overwhelmingly.
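The same normalization can be sketched with a lookup table instead of one SQL query per variant - though, as noted above, every rare variant still needs its own entry, which is exactly the long-tail cost being described. The variant list here is illustrative:

```python
# Illustrative synonym map built from a frequency audit like the one above
CANONICAL = {"pkwy": "parkway", "prkwy": "parkway", "parkwy": "parkway"}

def standardize(token):
    """Map a known variant to its canonical form; pass other tokens through."""
    return CANONICAL.get(token.lower(), token.lower())

addresses = ["Lincoln Pkwy", "Grand Parkway", "Elm Prkwy"]
fixed = [" ".join(standardize(w) for w in a.split()) for a in addresses]
```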

In capturing and analyzing natural language we are confronted with the same problem : 60% of people might be using the same phrase for describing the fact that they don't want to go to sleep with a simple "I don't want to go to sleep". But another 20% might be using something like : "i don't feel like sleeping" and another 10% something like "i don't want to go to bed right now".

So we immediately see one of the issues that Text Miners face : The fact that we can use different phrases to communicate the same meaning. If we wish to analyze text information for classification purposes -say the sentiment of customers- we could achieve a 60-65% accuracy in our results with some effort. For a mere 4% increase in accuracy -from 65% to 69%- the amount of extra effort required could prove prohibitive.

Consider the following chart :




These are all examples of phrases people use in their everyday talk. We can visualize such phrases starting with "i don't want to", with each branch adding a new meaning to the phrase. The branches marked with numbers are the parts of speech that give us an idea of what a person doesn't want to do : to go, to feel, to visit, to know. Things get much more difficult, in terms of the effort required, if we wish to add more detail -and probably insight- to our analysis by moving further down the branches of our sentence tree.

Perhaps for marketeers, the ability to quantify the distribution of words at the 1st level of the tree depicted above could be enough. If we end up with the following word distribution :

To feel : 15%
To know : 7%
To go : 1%
To visit : 1%

Then we get an insight into which words to use to market products more efficiently.


On the next post we will go through a hands-on example of analyzing the thoughts of Twitter users and specifically what people seem to "don't want".

Senin, 05 Januari 2009

Emotions, Beliefs and Analytics


When I first came across Data Mining and Machine Learning in 1998, I had no idea of the kinds of applications this field could have. As time passes, the knowledge available to a data/text miner becomes more and more serious business... actually, a very serious one.

Not long ago I saw a presentation where a map of emotions from the web was created in real time by aggregating specific keywords from blogs and forum posts. Twistori is an example of such an application. Now, let's take this idea one step further.

Twitter is a "social messaging utility" in which users describe what they are doing -or what they are feeling/thinking- now. Users are able to send "tweets" even through SMS messages. The way that these messages are written is an ideal format for text mining : Short phrases that summarize what a user wants to say are a text miner's paradise.

It is logical to assume that Text mining and Information extraction techniques will become more important, since more data will be generated in the future. It is only a matter of time until the next "killer app" like FaceBook, YouTube and Twitter appears. Data/Text miners will be able to identify common "thought clusters" of people.

Now, consider the following example : By visiting this link you will get a list of people that have written in their "tweets" the phrase "I don't want to....".

Once this textual information is captured, preprocessed and then analyzed through cluster analysis, we could end up with the following clusters of "I don't want-ers" :


- The cluster of users that do not want to work again/tomorrow/today (18.5%)

- The cluster of users that do not want to go to sleep (6%)

- The cluster of users that do not want to hurt someone (4.2%)


What is also interesting is the ability to quantify the proportion of cases belonging to each cluster relative to the total number of tweets. As shown in the example above, the most frequently occurring thought comes from people that do not feel like working.


Now in the same way one could perform this type of analysis for :

"I Believe...."
"I wish i...."
"I want to buy..."

Essentially, what we are talking about is the extraction of the values, hopes and beliefs of hundreds of thousands -or even millions- of users... and in descending order. Once a first run is performed and clusters are extracted, one could run this process again every month and see the trends of those clusters over time. It would also be interesting to see how these thought clusters change after specific world events.

For some people, such as marketeers and social researchers -provided that results are accurate enough- this information is invaluable. Others might feel that such an analysis is bad practice. Of course, there are companies that already capture brand sentiment across the web : Crimson Hexagon and Twitrratr are just two examples.


This post is the first in a series discussing the application of Analytics to capture the thoughts that -as we speak- exist on the Web. We will go through ways to explore this information; more specifically, we will look at :


  • How clustering can group people's values, beliefs and emotions.

  • Why Ontologies and Natural Language Processing are needed for better results.

  • How classification analysis might give us knowledge of the common characteristics of various 'categories' of users.