Thursday, 21 May 2009

Twitter Analytics : Cluster Analysis reveals similar Twitter Users

So far we have seen various examples of using analytics to gain insights from Twitter. Cluster analysis is a personal favorite: it enables us to identify common groups of users, and in this post we are going to look at a segmentation based on user biography keywords. This analysis was also presented in an older post, but some readers asked me to elaborate a bit more on it.

Biography information allows us to segment Twitter users into groups of similar interests, professions and qualities. What is more interesting, however, is that we can identify the words that each segment appears to be associated with. Let's see an example of words that tend to co-occur with the phrase "social media" in the biographies of Twitter users:




By looking at the column named "social_media" we see some associated keywords such as: addiction (standing in for addicted, junkie etc.), evangelist, enthusiast, analytics etc.
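The idea of words co-occurring with a phrase can be sketched in a few lines. This is only a minimal illustration, not the tool used for the table above: the function name `cooccurring_words`, the stopword handling and the sample bios are all made up for the example.

```python
import re
from collections import Counter

def cooccurring_words(bios, phrase, stopwords=frozenset()):
    """Count the words found in bios that contain `phrase` (case-insensitive)."""
    counts = Counter()
    phrase = phrase.lower()
    for bio in bios:
        text = bio.lower()
        if phrase not in text:
            continue
        # Drop the phrase itself, then count each remaining word once per bio.
        remainder = text.replace(phrase, " ")
        words = set(re.findall(r"[a-z]+", remainder))
        counts.update(words - stopwords)
    return counts

# Made-up bios, for illustration only.
sample_bios = [
    "Social media evangelist and blogger",
    "Linux developer, gaming fan",
    "Analytics geek, social media enthusiast",
    "Social media enthusiast. Coffee addict.",
]
counts = cooccurring_words(sample_bios, "social media", stopwords=frozenset({"and"}))
print(counts.most_common(3))
```

A real analysis would also merge synonyms (addicted, junkie) into one keyword and compare these counts against how often each word appears in all bios, so that generally common words do not dominate.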

Other groups found and their associated words were :

The Geeks : Developer, Linux, Mac, gaming, photography
The Parents : married, boys, girls, christian, conservative
The business owners : CEO, entrepreneur, marketing, founder, lifestyle

Note that "The Geeks" have Mac as an associated keyword, which of course refers to the Apple Macintosh: an example suggesting a possibly strong bond between a brand and a specific customer segment.

Now imagine running a similar analysis for other segments such as Single Dads and Mothers, Teenage Girls, Nice Guys, IT Developers, VIPs or any other "segment" you prefer (see this entry, posted Jan. 2009, for more).

On a personal note: having used Text Mining on Twitter over the past 6 months, I realized that whenever a new cycle of analysis is made I come up, most of the time, with things that I already know. But apart from the expected results, some of the fine details of people's lives also appear, such as the implications of a life-changing event, the joy of owning something new or the plain fact of "watching TV and feeling bored". Many of the insights found during these months, although deliberately not discussed here, are highly thought provoking.

Perhaps Twitter Analytics could also give us some possible clues on:

  • Whether a specific profession could be a risk factor for being single.

  • How important fashion is for girls.

  • How mobile phone user requirements change according to the "segment" they belong to.

  • What are the most common things that people don't want.

  • Finding individuals that do not fit any "segment".

But the list of potential applications does not end here: using a technique called Association Rule Learning (or Association Discovery) we can extract emotions or thoughts that appear to co-occur, as well as emotions that seem to be associated with specific events. Classification Analysis can also play an important part (more on these techniques soon).
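For readers new to Association Rule Learning, a rule such as "tv -> bored" is usually judged by its support, confidence and lift. The sketch below computes these three standard measures; the function name `rule_metrics` and the tagged tweets are invented for illustration and do not come from the actual analysis.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    `transactions` is a list of sets, e.g. the tags extracted from each tweet.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    if n == 0:
        return 0.0, 0.0, 0.0
    both = sum(1 for t in transactions if antecedent | consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    cons = sum(1 for t in transactions if consequent <= t)
    support = both / n                            # how often the rule applies at all
    confidence = both / ante if ante else 0.0     # P(consequent | antecedent)
    lift = confidence / (cons / n) if cons else 0.0  # above 1 means positive association
    return support, confidence, lift

# Made-up tags per tweet, for illustration only.
tagged_tweets = [
    {"tv", "bored"},
    {"tv", "bored", "tired"},
    {"tv", "happy"},
    {"new_phone", "happy"},
    {"bored"},
]
support, confidence, lift = rule_metrics(tagged_tweets, {"tv"}, {"bored"})
print(support, confidence, lift)
```

A lift above 1 only says that the two items appear together more often than chance would suggest. It is a clue for a hypothesis, not evidence of cause.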

Each technique looks at the Social Media Data world from a different perspective. The usage behavior, cluster membership, the emotions and thoughts and also the Tweets that users seem to prefer most (using data from sites such as repeets.com) may be combined. What we can potentially achieve from a combined analysis of this kind will be discussed in later posts.

As already stated in previous posts: the methods described so far enable us to form hypotheses, but in no way is it assumed that the associations found are the definite cause of a specific event.

Thursday, 14 May 2009

Twitter Analytics : Bio information and popularity

In the previous post we identified words used in Tweets that appear to be associated with a low number of followers: we found that when someone uses foul or negative language, his/her follower count appears to be affected negatively (see here for more).

It is time to identify the words contained in the biographies of popular Twitter users, and more specifically the biographies of users in the top 30% (in terms of number of followers) of a random sample of 10000 users. As I have always stated in this series of posts: treat results as possible clues only. Please also notice how I used (in this and older posts) the words "appears" or "were found" when discussing correlation. The technique shown is the same as discussed in the previous post. Results are as follows:




  • Student appears to be correlated with low popularity accounts.

  • Engineer also appears to exist often in low popularity accounts although the correlation was not found to be as strong as for students.

  • Common words in the bios of popular users appear to be the following: social, media, marketing, CEO, founder, author, entrepreneur, blog, twitter, news, writer, internet.
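The setup behind these results (flag the top 30% of users by followers, then compare how often each word shows up in the two groups) can be sketched as follows. This is a simplified stand-in for the text mining tool actually used: the function names, the follower counts and the bios are all made up, and comparing plain shares is much cruder than a proper correlation measure.

```python
import math
import re
from collections import Counter

def label_top_fraction(follower_counts, fraction=0.30):
    """Return True for users whose follower count falls in the top `fraction`."""
    if not follower_counts:
        return []
    k = max(1, math.ceil(len(follower_counts) * fraction))
    threshold = sorted(follower_counts, reverse=True)[k - 1]
    return [count >= threshold for count in follower_counts]

def word_share_by_group(bios, is_popular):
    """Map each word to (share of popular bios, share of other bios) containing it."""
    popular, other = Counter(), Counter()
    for bio, flag in zip(bios, is_popular):
        words = set(re.findall(r"[a-z]+", bio.lower()))
        (popular if flag else other).update(words)
    n_pop = sum(is_popular) or 1
    n_other = (len(is_popular) - sum(is_popular)) or 1
    return {w: (popular[w] / n_pop, other[w] / n_other)
            for w in set(popular) | set(other)}

# Made-up sample of 10 users, for illustration only.
followers = [5000, 12, 40, 3000, 7, 900, 15, 60, 22, 8]
bios = [
    "CEO and founder", "student", "engineer", "author, entrepreneur",
    "student of life", "social media marketing", "student",
    "engineer and gamer", "music fan", "student",
]
labels = label_top_fraction(followers)
shares = word_share_by_group(bios, labels)
print(shares["student"], shares["ceo"])
```

Note that ties at the threshold are all counted as popular, so the flagged group can be slightly larger than 30%.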

Some comments :


  • It is not suggested that having specific words in your bio will get you more followers. Many other things are, or could be, important in achieving a high follower count. The same applies to unpopular accounts.

  • Looking at the results I wondered why students were found to be associated with low follower numbers, and I think that this requires more attention. One possible reason could be that students might be spending most of their social media time on Facebook or other social media sites. There can be many pitfalls in performing random sampling from Twitter, and "Students" could be one of these cases. However, please share your comments.

  • Notice that some words that appear to be associated with high follower numbers are words that communicate authority (e.g. founder, CEO).


To recap from the last 3 posts :

1) Do not use foul language - keep your conversations positive.
2) Use "Thank you" often. "Stay tuned" seems to work well also.
3) Post frequently. Posting some links is also important.
4) Make sure you have a good Bio filled in.


Finally, if you find the contents of this blog interesting you can always have a look for more updates on my new account on Twitter @lifeanalytics and also send me your suggestions and/or comments.

Tuesday, 05 May 2009

Twitter Analytics : These words may be affecting your popularity

Text Mining techniques can be used to identify specific words that are correlated with Twitter accounts having high or low popularity. This can be done in two ways : (1) By analyzing the text of the Tweets of each user and (2) By analyzing the text of the biography of each user.

Let's start with the results of the first type of analysis, with data originating from user Tweets. Pay attention only to the cells highlighted in red, their corresponding category column (LOWFOLLOWERS, HIGHFOLLOWERS) and the word at the beginning of each corresponding row. The results show which words appear to be important. Because the affinity shown here is only moderate, use the results as possible clues only.



The results so far show us that :

hate, bed : are found to be correlated with low popularity
top, online, send, list, web, media, join : with high popularity


Here is another portion of the results table :




The pattern should be evident by now: words of negative attitude appear to be influencing a user's follower count negatively. As also shown above, foul language appears to work negatively as well. Several other insights were found, such as the existence of specific phrases that are correlated with low popularity ("watching TV") and others ("stay tuned") that are correlated with popular accounts. The number shown in parentheses quantifies the magnitude of the association that each word has and thus enables us to order words by their importance.

Some of the words -and their synonyms- that were found to be associated with very low follower counts are :

- Sleep, Hate, Damn, Feeling, Homework, Class, Boring, Stuck


A total of 63 words and 25 phrases were found to have either a positive or a negative association with the follower count. Interestingly, specific phrases that communicate any kind of opportunity are also associated with a high number of followers. "Thank you" is strongly associated with high popularity.

Here comes the interesting part : Once the Text Mining analysis is completed, a predictive model can be generated that may be used for scoring future Tweets. Let's assume that you are about to send the following 2 Tweets :

1) 'Today i feel like sleeping all day. Yawn...'
2) '@xyz Your website traffic can be increased with good marketing'

Before you post however, you decide to feed these 2 sentences to a predictive model. The predictive model returns for every Tweet the predicted result (GOOD or BAD) and the associated probability. Here are the results for these 2 examples from an actual run :





In other words :

1) The first Tweet may have a negative effect with a probability of 83.5%
2) The second Tweet may have a positive effect with probability 99.9%


Note that :

  • A predictive model is able to consider combinations of words, not just single words. This considerably raises the accuracy of any prediction.
  • In any real world application of Text Mining a 100% prediction accuracy cannot be achieved: Although application-specific, a 72-78% accuracy may be achieved - with considerable effort. Of course many more things are important to achieve high popularity and the example above is given merely to discuss what techniques currently exist. A combination of analytical techniques is the best option and this will be discussed in a future post.
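To make the scoring step more concrete, here is a minimal sketch of a text classifier that returns a label probability for a new Tweet. It is a plain Naive Bayes model written from scratch, which is an assumption on my part for illustration: it is not the model that produced the results above, it treats words independently (so it does not capture word combinations), and the six training Tweets are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {label: Counter() for label in set(labels)}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict_proba(self, text):
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) /
                                      (total_words + len(self.vocab)))
            log_scores[label] = score
        # Turn log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}

# Made-up training Tweets, for illustration only.
train_texts = [
    "thank you for the great marketing tips",
    "stay tuned for our new website launch",
    "increase your traffic with good content",
    "feel like sleeping all day",
    "watching tv and feeling bored",
    "hate homework so boring",
]
train_labels = ["GOOD", "GOOD", "GOOD", "BAD", "BAD", "BAD"]
model = TinyNaiveBayes().fit(train_texts, train_labels)

for tweet in ["Today i feel like sleeping all day. Yawn...",
              "@xyz Your website traffic can be increased with good marketing"]:
    print(tweet, model.predict_proba(tweet))
```

With only six training Tweets the probabilities are not meaningful in themselves. The point is the workflow: train on labelled text once, then score any new sentence before posting it.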

Several other types of analysis can extract similarly interesting insights : Let's not forget that Twitter Tweets contain the emotions, beliefs and values of users. They contain what people want and what they don't want. See Clustering the thoughts of Twitter Users and Know your customers the Twitter way for a further discussion on this.

There will be more to say about Text Mining and how it can be put to use by PR Agencies and Marketing companies with practical examples shortly.

Sunday, 03 May 2009

Twitter Analytics : Which usage behavior attracts many followers?

This is the first part of a series of posts where Data Mining and Text Mining will be applied to extract potentially useful facts about the usage of Twitter and to draw some conclusions such as what makes a Twitter account interesting enough to other users.

The conclusions that will be presented here are from the analysis of 3651 Twitter accounts and are meant to show how Predictive Analytics can help. Please note that results are shown for informational purposes only.


First, the data used can be summarized with the following table :





You can immediately see problems in the ranges of the data used, especially in the number of "followers" and "following". This is to be expected, since among the users captured were Jack Dorsey (founder of Twitter), Sen. McCain and George Stephanopoulos: users that obviously have a huge number of followers.

Before finding which usage behavior attracts many followers, one should first define what exactly a "popular Twitter account" is. Is it just the absolute number of followers? Perhaps it could be equally important, or at least interesting, to also look at:

1) The followers/following ratio

2) The number of followers per day
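Both alternative measures are simple ratios. A small sketch, with a hypothetical function name and made-up numbers, could look like this:

```python
def popularity_metrics(followers, following, account_age_days):
    """Return the two alternative popularity measures for one account.

    A value of None means the measure is undefined (division by zero).
    """
    ratio = followers / following if following > 0 else None
    per_day = followers / account_age_days if account_age_days > 0 else None
    return {"followers_per_following": ratio, "followers_per_day": per_day}

# Made-up account: 1000 followers, following 250 users, 200 days old.
print(popularity_metrics(1000, 250, 200))
```

Either measure reduces the distortion caused by a handful of celebrity accounts less than one might hope, so the extreme ranges still need attention before modelling.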

For our example the absolute number of followers was used as the only criterion of a successful Twitter account. The results can be summarized with the following decision tree :





Some usage patterns that raise the chance of having a successful Twitter account are the following :

  • Having a bio is an absolute must : 82.3% of unsuccessful Twitter accounts have their biography information missing.

  • You should provide more than 3 links per 20 tweets and also more than 0.960 updates per day.

  • If you don't want to provide more than 3 links per 20 tweets, then try to post more than 5.857 updates per day.

  • Users that post more than 3 links per 20 tweets but no more than 0.960 updates per day will need more than 222.5 days of usage to get an adequate number of followers.
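Read as rules, the patterns above can be written down as a small function. This is only my reading of the bullets, not the decision tree itself: the function and parameter names are hypothetical, and treating a missing bio as an automatic "unsuccessful" is a simplification of the 82.3% figure, which is a statistic rather than a hard rule.

```python
def looks_successful(has_bio, links_per_20_tweets, updates_per_day, days_of_usage):
    """Apply the usage patterns listed above as simple if/else rules."""
    if not has_bio:
        # Simplification: most unsuccessful accounts have no bio.
        return False
    if links_per_20_tweets > 3:
        if updates_per_day > 0.960:
            return True
        # Many links but few updates: followers come only with time.
        return days_of_usage > 222.5
    # Few links: a high posting frequency is needed instead.
    return updates_per_day > 5.857

print(looks_successful(True, 5, 2.0, 30))     # many links, frequent updates
print(looks_successful(True, 1, 2.0, 30))     # few links, moderate updates
print(looks_successful(True, 5, 0.5, 300))    # many links, few updates, old account
```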

By using Feature Selection we are also able to look at the relative importance of each parameter in achieving many followers. Here are the results of Feature Selection using the ChiSquare, GainRatio and InfoGain attribute evaluators.



=== Attribute selection 10 fold cross-validation (stratified), seed: 1 ===

average merit        average rank   attribute
362.743 +- 10.419    1   +- 0       4 numberOfLinks
319.397 +- 10.133    2.4 +- 0.49    6 hasBlankProfile?
311.661 +-  8.612    2.6 +- 0.49    7 updatesPerDay
192.525 +-  7.481    4.1 +- 0.3     3 retweetsNumber
178.236 +-  5.963    4.9 +- 0.3     1 elapsedDays
 36.148 +-  3.579    6   +- 0       2 otherUsersTalk
 17.843 +-  4.475    7   +- 0       5 questionsAsked


average merit        average rank   attribute
0.1   +- 0.003       1   +- 0       6 hasBlankProfile?
0.042 +- 0.001       2.4 +- 0.49    4 numberOfLinks
0.039 +- 0.002       3.2 +- 0.6     3 retweetsNumber
0.04  +- 0.004       3.4 +- 0.92    7 updatesPerDay
0.025 +- 0.001       5   +- 0       1 elapsedDays
0.011 +- 0.001       6   +- 0       2 otherUsersTalk
0.005 +- 0.001       7   +- 0       5 questionsAsked

average merit        average rank   attribute
0.082 +- 0.002       1   +- 0       4 numberOfLinks
0.074 +- 0.003       2.1 +- 0.3     6 hasBlankProfile?
0.071 +- 0.002       2.9 +- 0.3     7 updatesPerDay
0.044 +- 0.002       4.1 +- 0.3     3 retweetsNumber
0.041 +- 0.001       4.9 +- 0.3     1 elapsedDays
0.008 +- 0.001       6   +- 0       2 otherUsersTalk
0.004 +- 0.001       7   +- 0       5 questionsAsked


We see that all three attribute evaluators agree that the number of links provided in Tweets and whether the profile of the user is filled in are the two most important parameters in achieving many followers. Notice also that sending messages to other users (otherUsersTalk) and asking questions (questionsAsked) are not as important as one would expect.
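For those curious about what the InfoGain and GainRatio merits measure, the sketch below computes both for a single discrete attribute using the standard entropy-based definitions. It is not the code that produced the tables above: the function names and the eight sample accounts are made up, and numeric attributes such as updatesPerDay would first have to be binned into ranges.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain_and_ratio(attribute_values, labels):
    """Return (information gain, gain ratio) of an attribute with respect to labels."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the labels that remains after splitting on the attribute.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(attribute_values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Made-up sample of 8 accounts, for illustration only.
has_blank_profile = [True, True, True, True, False, False, False, False]
many_followers = [False, False, False, True, True, True, True, False]
gain, ratio = info_gain_and_ratio(has_blank_profile, many_followers)
print(gain, ratio)
```

The gain ratio divides the information gain by the entropy of the attribute itself, which penalises attributes that split the data into many small groups. This is one reason why the evaluators can rank the same attributes in a slightly different order.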

The analysis shown above gives many insights, but it does not take into account what the users say and how this affects the popularity of a Twitter account. Text Mining will try to give some answers to this question and also identify which keywords in Twitter profiles seem to be associated with many followers.