Tuesday, 30 June 2009

Predicting the next Viral Tweet

It is time to use Twitter data for another purpose: can Predictive Analytics be used to identify which tweets have an increased probability of becoming viral?





First we have to define the problem and decide what information to consider. Every tweet has an author and some content, and is posted on a specific day and time. More specifically, for every tweet we can collect usage data such as:

  • Day of Post
  • Time of post
  • Elapsed minutes since tweet has been posted
  • Author of tweet (Twitter username)
  • Number of followers of the author
and also content information such as:

  • Subject of post
  • Whether the tweet involves a question being asked
  • Whether the tweet contains hashtags
  • Whether the tweet contains a "Please Re-Tweet" directive (or variants)
  • Whether a user is mentioned
  • The text of the tweet itself.
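
The yes/no content fields in the list above can be derived directly from the tweet text. The following is only a minimal sketch of how that might look: the regular expressions (in particular the "Please Re-Tweet" variants) and the name `content_features` are assumptions made for illustration, not the rules used for the analysis in this post.

```python
import re

# Assumed patterns: the post does not list which "Please Re-Tweet" variants were used.
PLEASE_RT = re.compile(r"\b(?:please|pls|plz)\s+(?:rt|re-?tweet)\b", re.IGNORECASE)
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")


def content_features(text: str) -> dict:
    """Derive the yes/no content flags from raw tweet text (hypothetical helper)."""
    return {
        "has_question": "?" in text,                    # a question is being asked
        "has_hashtag": bool(HASHTAG.search(text)),      # at least one #hashtag
        "has_please_rt": bool(PLEASE_RT.search(text)),  # "Please RT" or a variant
        "has_mention": bool(MENTION.search(text)),      # at least one @user
    }


example = content_features("Please RT: what do you think of #twitter, @mashable?")
```

A simple "?" check will also fire on question marks inside links, so a real pipeline would probably need something more careful.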

Our goal, then, is to combine the information above into a predictive model that, when given the author, day, time of post and text of a tweet, can tell us whether that tweet has an increased probability of becoming viral.

For this Data & Text Mining exercise (keeping in mind that the tweets were sampled from one website and not from Twitter itself), let's define what a viral tweet is. After collecting approx. 8000 tweets from dailyrt.com, the median number of Re-tweets was found to be 17. Here we make the assumption that a tweet is viral if it exceeds 30 Re-tweets (this specific assumption actually makes the classification task much easier).
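
The labelling rule above is easy to express in code. This is a small sketch with made-up Re-tweet counts (not the dailyrt.com sample); only the threshold of 30 comes from the post, and the names `label_viral` and `sample_counts` are hypothetical.

```python
from statistics import median

VIRAL_THRESHOLD = 30  # from the post: a tweet exceeding 30 Re-tweets is considered viral


def label_viral(retweet_counts):
    """Return True for every tweet whose Re-tweet count exceeds the threshold."""
    return [count > VIRAL_THRESHOLD for count in retweet_counts]


# Made-up counts for illustration only.
sample_counts = [3, 17, 12, 45, 30, 88, 17]
sample_median = median(sample_counts)
labels = label_viral(sample_counts)
```

Note that a tweet with exactly 30 Re-tweets is not labelled viral under this reading of "exceeds".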

As discussed above, usage data do not tell us anything about the content of a tweet. They tell us the name of the author, how many followers he/she has, when the tweet was posted and how many minutes have elapsed since then. Can this information alone predict whether a tweet will become viral? A data mining model (built without using the elapsed time as an input field) predicted whether a tweet would be viral with an overall accuracy of 75.03% and, perhaps as expected, showed that the most important factor in making a tweet viral is its author. Running a process called Feature Selection tells us just that:



But what we have seen so far tells only one side of the story, the Data Mining side. With Text Mining we can see the importance of both words and authors. To do that, the author's username is appended to the end of each tweet (so essentially the author becomes part of the tweet text). Here is what Feature Selection tells us:



A tweet mentioning Michael Jackson has a high probability of becoming viral, but it should perhaps also be posted by a popular author to make a greater impact. Note also that @mashable and @theonion are at the top of the feature selection list shown above.
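
To make the idea of treating the author as one more word concrete, here is a rough sketch. The post does not say which Feature Selection method was used, so information gain is swapped in here as one common choice. The toy tweets, the viral labels and the function names are all made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def tokens_with_author(text, author):
    """Lower-cased word set of a tweet, with the author appended as one more token."""
    return set(text.lower().split()) | {"@" + author.lower()}


def information_gain(docs, labels, token):
    """Drop in label entropy obtained by splitting the tweets on presence of `token`."""
    with_token = [lab for doc, lab in zip(docs, labels) if token in doc]
    without = [lab for doc, lab in zip(docs, labels) if token not in doc]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (with_token, without) if part
    )
    return entropy(labels) - remainder


# Made-up toy data: (text, author, viral?)
tweets = [
    ("breaking news about michael jackson", "mashable", True),
    ("michael jackson tribute tonight", "theonion", True),
    ("my lunch was great", "someuser", False),
    ("new gadget review", "mashable", True),
    ("waiting for the bus", "otheruser", False),
    ("bus is late again", "someuser", False),
]
docs = [tokens_with_author(text, author) for text, author, _ in tweets]
labels = [viral for _, _, viral in tweets]
vocabulary = set().union(*docs)
# Highest information gain first; ties are broken alphabetically.
ranking = sorted(vocabulary, key=lambda tok: (-information_gain(docs, labels, tok), tok))
```

Because the author token sits in the same vocabulary as ordinary words, authors and words end up competing in a single ranked list, which is the effect described above.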

The difficult (but also interesting) task is to predict a viral tweet that has an impact because of its content rather than its author. To do this, the methodology of data collection and analysis differs significantly.

In the next post we will see a model predicting viral tweets in action: we will submit several tweets along with their authors, and the model will tell us the probability of each submitted tweet becoming viral.

Tuesday, 23 June 2009

How Habitat UK *should* have used Twitter

Following the great post from Tiphereth Gloria, I wanted to take the opportunity to show an example of how Habitat UK should be using Twitter.

My suggestion is that, instead of the "initiative" they took, they should identify the values, beliefs and needs of their customers by capturing and analyzing relevant tweets. Here is how they could do it:

First, they should capture all relevant tweets at regular intervals, say every month:



The second step would be to identify what people want when they talk about furniture. With Text Mining they would find the specific furniture products that customers want to buy and the values associated with those products. For an example, look at the following table:



The table shows us (pay attention to the dark red cells) that customers looking to buy baby furniture have Safety as their number one associated value. With this knowledge, Habitat UK could make sure that their advertisements for baby furniture use this word to capture the interest of their customers. Of course, what is shown above is not new information; it is only meant as an example.
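
A product-by-value table like the one described above can be approximated by counting how often a product word and a value word appear in the same tweet. The sketch below assumes hand-made keyword lists and made-up tweets; the real analysis most likely relied on a proper text mining tool rather than simple word matching.

```python
import re
from collections import Counter

# Hypothetical keyword lists, for illustration only.
PRODUCTS = {"baby furniture": {"crib", "cot", "nursery"}, "sofa": {"sofa", "couch"}}
VALUES = {"safety": {"safe", "safety"}, "comfort": {"comfortable", "comfy", "comfort"}}


def cooccurrence(tweets):
    """Count tweets in which a product keyword and a value keyword appear together."""
    counts = Counter()
    for text in tweets:
        words = set(re.findall(r"[a-z']+", text.lower()))
        products = [name for name, keywords in PRODUCTS.items() if words & keywords]
        values = [name for name, keywords in VALUES.items() if words & keywords]
        for product in products:
            for value in values:
                counts[(product, value)] += 1
    return counts


# Made-up tweets.
sample_tweets = [
    "Looking for a safe crib for the nursery",
    "Is this cot safe for a newborn?",
    "Need a comfy couch for the living room",
    "New sofa arrived, so comfortable",
    "Crib shopping today",
]
table = cooccurrence(sample_tweets)
```

The largest count in each product's row plays the role of the dark red cells: it points to the value most often mentioned together with that product.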

Some more things that Habitat UK could have seen with Text Mining:

  • How important it is to suggest solutions to customers
  • Which rooms people want to re-furnish more often and -more importantly- why.
  • How problems (such as furniture that arrives damaged or is difficult to assemble) affect their brand.
  • How excited people feel while they wait for their new furniture...and how bad they feel when furniture is not delivered on time.

There is much more that can be done. By running Cluster Analysis, many kinds of customer thoughts can be grouped together. One such group showed how closely "Feeling good" is related to new furniture and how it affects people's psyche.

By using Social Media Analytics, Habitat UK (and most other companies) would understand their customers better and see what is important to them. With this knowledge they would be able to make informed decisions that would most likely make a real difference.


Friday, 19 June 2009

How people use Twitter - 10 distinct usage groups

In this post we will look at another example of cluster analysis performed on Twitter. The analysis was performed on 17000 Twitter users with the goal of extracting distinct usage groups, which essentially show us the different types of usage behavior of Twitter users. The following parameters were taken into consideration:

  • Number of Followers
  • Number of Links posted per 20 Tweets (not during RT)
  • Number of Updates
  • Elapsed Days

The following table shows the results :


Note that each cluster has a specific number from 1 to 10. Clusters are listed according to their size, which means that cluster "10" is the largest usage group, while cluster "5" is the smallest.

Let's see what the table tells us, starting with the first line: Cluster 10 is the largest (i.e. most frequent) type of usage behavior. Users in that group have an average number of followers, have been using Twitter for relatively many days (elapsedDays=high) and have a high number of updates, while the number of links they provide per 20 tweets is average (say around 3 links).

Now consider the highlighted cluster 8, which we will call The Information Providers: notice that even though this group of users has relatively few elapsed days and an average number of updates, it achieves a high number of followers. The reason is that these users provide a large number of links per 20 tweets (note that this confirms findings from a previous analysis).

See also cluster 3: even though this group of users has been on Twitter for many days and also has a high number of updates, it appears to pay a price for not providing links.

Recall that the "#OfLinks" parameter counts only those links that are NOT part of a Retweet. This tells us that users who are able to find original content and provide it to the community tend to gain more followers.
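
As a rough illustration of how such a parameter could be computed, the sketch below counts links in non-retweet tweets and scales the result to a window of 20 tweets. The retweet test ("RT @user" anywhere in the text), the URL pattern and the scaling by the total number of sampled tweets are all assumptions; the post does not spell out the exact rule.

```python
import re

URL = re.compile(r"https?://\S+")
RETWEET = re.compile(r"\bRT\s+@\w+", re.IGNORECASE)  # assumed retweet marker


def links_per_20_tweets(tweets):
    """Links found in non-retweet tweets, scaled to a 20-tweet window."""
    if not tweets:
        return 0.0
    original_links = sum(len(URL.findall(t)) for t in tweets if not RETWEET.search(t))
    return 20 * original_links / len(tweets)


# Made-up timeline: the link inside the retweet is not counted.
timeline = [
    "New post http://example.com/a",
    "RT @mashable: cool http://example.com/b",
    "Just coffee",
    "Two links http://example.com/c http://example.com/d",
]
rate = links_per_20_tweets(timeline)
```

In this made-up timeline three of the four links are original, so the rate works out to 15 links per 20 tweets.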

This analysis was given as a simple example and should not be considered a detailed analysis, since only a few parameters have been taken into account. Cluster Analysis on Twitter data (which include things that people like doing, professions, interests, marital status, and mentions of products or opinions, to name a few) can potentially give us excellent insights into different aspects of user behavior.
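
For readers who would like to try a similar grouping, here is a minimal sketch of k-means clustering on the four parameters used in this post. The post does not name the clustering algorithm behind the table, so k-means is an assumption, as are the made-up users and the deterministic farthest-point choice of starting centroids.

```python
import math
from collections import Counter


def standardize(rows):
    """Rescale each column to zero mean and unit variance so no parameter dominates."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, iterations=20):
    """Plain k-means; returns the cluster index assigned to each point."""
    # Deterministic start: first point, then repeatedly the point farthest from all centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Made-up users: (followers, links per 20 tweets, updates, elapsed days)
users = [
    (150, 1, 900, 400), (180, 0, 1100, 420), (130, 1, 950, 380),
    (5000, 12, 300, 60), (5200, 14, 280, 70), (4800, 11, 320, 65),
]
clusters = kmeans(standardize(users), k=2)
sizes = Counter(clusters).most_common()  # (cluster, size) pairs, largest first
```

With real data one would use k=10 to mirror the table, and then describe each cluster by whether its average for each parameter is low, average or high.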

Wednesday, 03 June 2009

Social Media, Corporate Decisions and Analytics

Over the past 6 months we have seen real-world applications of Data and Text Mining applied to Social Media data from Twitter. We went through many examples that look at Social Media data in different ways:


  • We identified what Twitter users don't want, grouped their beliefs and also ordered all of this information accordingly
  • We identified which usage behavior increases our chances of having a large number of followers (if a large number of followers is our goal)
  • We found which words appear to be associated with a large number of followers. (We have seen that negative thinking and words in Tweets possibly drive people away)
  • We extracted segments of Twitter users with similar characteristics.


The list of possible applications does not end here. In the next posts we will also discuss:


  • Predicting whether a Tweet has the potential to become "viral".

  • Associating specific events and user emotional states.

To recap: a computer program is able to monitor the words and phrases that you use and your emotions, flag them as positive or negative, track the rate at which your follower count increases, track the number of updates, Re-Tweets, replies, hashtags, smileys and questions that you make, flag any mentions of products and services, and assign you to a predefined segment of users sharing similar behavior and interests. Then, for each segment, its "social media fitness value" is identified (by looking at the follower count).

Usage of Google Wave will possibly reveal other insights. Because the sequence of posts will be easy to extract, we could also take into consideration the number of consecutive posts that had a positive sentiment and whether these positive posts appeared at the beginning, center or end of each thread's sequence. We could also look at the number of posts in the same thread that have videos or pictures attached, and ultimately identify how all of this information may affect one's point of view. Of course I am not certain whether such a scenario could prove useful. I sure would like to try though.
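
Purely as a thought experiment, the "consecutive positive posts" idea could be sketched as follows. It assumes each post in a thread has already been given a sentiment label ("pos", "neg" or "neu") by some other step, and the split of a thread into thirds for beginning/center/end is an arbitrary choice made here for illustration.

```python
def longest_positive_run(sentiments):
    """Return (length, start_index) of the longest run of 'pos' labels, or (0, None)."""
    best_len, best_start = 0, None
    run_len, run_start = 0, 0
    for i, label in enumerate(sentiments):
        if label == "pos":
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_len, best_start = run_len, run_start
        else:
            run_len = 0
    return best_len, best_start


def run_position(sentiments):
    """Say whether the longest positive run sits at the 'beginning', 'center' or 'end'."""
    length, start = longest_positive_run(sentiments)
    if length == 0:
        return None
    # Relative position of the middle of the run: 0.0 = first post, 1.0 = last post.
    midpoint = (start + (length - 1) / 2) / max(len(sentiments) - 1, 1)
    if midpoint < 1 / 3:
        return "beginning"
    if midpoint > 2 / 3:
        return "end"
    return "center"


# Made-up thread of nine posts.
thread = ["neg", "pos", "pos", "pos", "neg", "neu", "neg", "neg", "neg"]
run = longest_positive_run(thread)
position = run_position(thread)
```

Both values could then be used as extra input fields for a model, alongside the number of posts with videos or pictures attached.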

We are presented with a unique opportunity to understand people much better than before, and the examples shown so far should have made this clearer. Predictive Analytics is about extracting knowledge and identifying what is more likely to work. As Ian Ayres put it in his book Super Crunchers, decisions are beginning to be based more on facts and less on intuition. It appears that Social Media Analytics will play an important role in corporate decisions for PR, Branding and Marketing, and this will happen through a better understanding of human behavior.