Tuesday, 23 November 2010

Spam Detection in Social Data: A new business?

All of us who use Twitter know the problem of spam Tweets. Spamming on Twitter can happen in several ways. For example, spammers can attach a trending topic to their tweets to make them visible, even though the tweets have nothing to do with that topic. Other tweets do not contain misleading hashtags but still carry uninteresting information.

In a previous example, Tweets were used to analyze the sentiment of Twitter users on the U.S. economy. The study used several thousand Tweets to extract insights. However, among the tweets that genuinely discussed the economy there were several spam Tweets such as "make money online even if the economy is bad".

It is well known that the most time-consuming process in a Data / Text Mining project is pre-processing. Therefore, when one wants to analyze tweets and extract knowledge from them, an obvious step is to remove spam and uninteresting Tweets to minimize the chances of GIGO (garbage in, garbage out).

Spam detection in Tweets, and in unstructured Social Media data in general, is a difficult task. It requires "concept-aware" analysis of Text. One of the interesting facets of analytics is the ability to solve the same problem in several ways or, perhaps even better, to combine all available tools to reach a better solution.

There is an ever-growing number of companies that analyze Social Media Data, and erroneous data may be seriously distorting their insights, even when millions of records are available. Perhaps in the very near future, providing cleaned social media data to analytics companies or other information consumers could be a business in its own right.

Spam detection can be performed in many ways. One is to use machine learning: in other words, to train a classifier on, say, hundreds of thousands of tweets that have been marked as "spam" or "no-spam", and then let it sift through new tweets. Alternatively, we could use a more elaborate methodology and manually build and define rules that characterize spam Tweets. We could even consider other information such as who Tweeted, how many followers this user has or how often '@' is used to address other users. Once again, the problem representation and the choice and use of algorithms should be carefully considered.
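To make the machine learning option concrete, here is a minimal sketch of a classifier trained on tweets marked as "spam" or "no-spam". It assumes a multinomial Naive Bayes model with add-one smoothing, which is only one of many possible choices and is not necessarily what a production system would use. The class name, the helper function and the six training tweets are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into word tokens (hashtags and mentions kept)."""
    return re.findall(r"[#@]?\w+", text.lower())


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative only)."""

    LABELS = ("spam", "no-spam")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocabulary = set()

    def train(self, labelled_tweets):
        """Accumulate per-label word counts from (text, label) pairs."""
        for text, label in labelled_tweets:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)

    def log_score(self, text, label):
        """Log prior of the label plus the smoothed log likelihood of each token."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        vocab_size = len(self.vocabulary)
        for token in tokenize(text):
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def classify(self, text):
        """Return the label with the highest score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Hypothetical, hand-made training data; a real project would need far more.
TRAINING = [
    ("make money online even if the economy is bad", "spam"),
    ("earn money fast online click here", "spam"),
    ("free followers click this link now", "spam"),
    ("the economy is slowing and unemployment is rising", "no-spam"),
    ("worried about the economy and my job", "no-spam"),
    ("new report says the economy grew last quarter", "no-spam"),
]

spam_filter = NaiveBayesSpamFilter()
spam_filter.train(TRAINING)
print(spam_filter.classify("make money online now"))
print(spam_filter.classify("unemployment report worries me"))
```

Metadata such as follower counts or the number of '@' mentions could be added as extra features, but that would call for a different model or an additional rule layer on top of this one.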

Spam detection in Social Media Data is one of the problems that will become more important as more analytic companies are created. Detecting interesting information is another area to watch. People want real insights.

In the previous post, tweets were used to identify what people want / feel / don't like when they visit a shopping mall. While analyzing this information it was found that the word 'Omaha' was associated with the word "Mall". On closer inspection I realized that "Omaha Mall" is a song by Justin Bieber. Of course I am not suggesting that these Tweets about Justin's song were spam, but they had nothing to do with the purpose of the analysis. Could an automated technique identify this inconsistency and suggest filtering out this information? Being able to automatically select the right information will probably become more important as the volume of text grows and fast, correct and actionable intelligence becomes a necessity.

Tuesday, 02 November 2010

Mining consumer behavior in Tweets


In the previous post we discussed the first steps necessary to understand what consumers write in their Tweets regarding their recent visit to a shopping Mall. In this post we will see how Marketers can use this information to understand spending patterns, know what consumers liked about their visit to a shopping Mall and know what is important for consumers. According to Dr Dimitriadis, with whom I teamed up for this analysis, important things to look at include (list not exhaustive):

- spending patterns and situations (when, what and with whom people spend money)
- tenant mix preference (which products people like to buy and what else / new they want)
- experience evaluation (safety, availability of stores / products, cleanliness etc)
- perception of shopping mall communications (what people think about mall ads / messages)

Although there are many behaviors and opinions we could look for, let's first identify what makes a consumer happy. To find this out we can analyze all Tweets containing a :-) (smiley) and find which keywords co-occur in these Tweets. Here are some of the results:

Apart from some typical words that suggest positive feelings, we also identify that 'friend' and 'birthday' are commonly found with smileys. It was found that consumers who shop for a birthday present or outfit use smileys often. Let's see what happens with tweets that contain negative feelings :-( (frownie):





A frown appears more often when consumers do not find what they were looking for and also when they are at the mall alone. But what about what people hate when they visit a mall? A similar statistical test is performed to identify words which co-occur with the phrase "I hate it when." These words are:

- Park
- People
- Walk


By looking at the actual tweets we can identify that many people hate it when:

1. a mall is very busy
2. it is difficult to park at the mall
3. people in front walk at a much slower pace (particularly older people)
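The co-occurrence analysis used above for smileys, frownies and "I hate it when" can be sketched as follows. The post does not say which statistical test was applied, so this sketch assumes a simple lift ratio: the share of marker tweets that contain a word, divided by the share of all tweets that contain it. The function names, stop word list and sample tweets are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "the", "i", "it", "at", "to", "is", "my", "for", "with", "and", "when", "in"}


def words_in(tweet):
    """Return the set of distinct non-stop-word tokens in a tweet."""
    return {w for w in re.findall(r"[a-z]+", tweet.lower()) if w not in STOP_WORDS}


def cooccurring_words(tweets, marker, min_count=2):
    """Rank words by lift: P(word | tweet has marker) / P(word | any tweet)."""
    marked = [t for t in tweets if marker in t.lower()]
    overall = Counter(w for t in tweets for w in words_in(t))
    with_marker = Counter(w for t in marked for w in words_in(t))
    ranked = []
    for word, count in with_marker.items():
        if count < min_count:
            continue  # ignore words seen too rarely to say anything about
        lift = (count / len(marked)) / (overall[word] / len(tweets))
        ranked.append((word, round(lift, 2)))
    # Highest lift first, ties broken alphabetically
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Made-up sample tweets, for illustration only.
TWEETS = [
    "Birthday shopping with a friend at the mall :-)",
    "Found a birthday outfit at the mall :-)",
    "Lunch with a friend at the mall :-)",
    "Alone at the mall and could not find my shoes :-(",
    "Mall parking is full again :-(",
    "Going to the mall later",
]

for word, lift in cooccurring_words(TWEETS, ":-)"):
    print(word, lift)
```

In this toy data 'birthday' and 'friend' rank above 'mall', because 'mall' appears in every tweet and therefore says nothing specific about the happy ones. When the marker is a phrase such as "i hate it when", the words of the phrase itself should be added to the stop list so they do not dominate the ranking.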

Next we can perform a cluster analysis for these tweets to identify common "thought clusters" of the consumers and their behavior. As an example I have used Rapid-I to generate these clusters using the following setup:


Without getting into technical details (such as tokenization, stop word removal and optimization of the process), executing the stream shown above runs a cluster analysis that identifies common consumer thoughts about their visit to the shopping Mall. Some of the clusters found are:

- People who state their intent to buy something
- Consumers who eat a meal and then go to the movies
- "saw a cute guy / girl looking at me"
- "I had a good time at the mall"

As discussed in previous posts, cluster analysis not only allows us to find common groups of behavior and thoughts but also to identify the frequency with which these behaviors and thoughts appear in consumer Tweets.
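For readers without Rapid-I at hand, the idea can be approximated in a few lines of Python. The sketch below is not the Rapid-I process used for this post: it assumes a greedy single-pass ("leader") clustering on Jaccard similarity between token sets, chosen only because it is short and deterministic. The threshold, stop word list and sample tweets are made up.

```python
import re

# Tiny illustrative stop word list.
STOP_WORDS = {"a", "an", "the", "i", "to", "at", "and", "then", "my", "we", "of", "for"}


def tokens(tweet):
    """Tokenize, lowercase and drop stop words; returns a set of terms."""
    return {w for w in re.findall(r"[a-z]+", tweet.lower()) if w not in STOP_WORDS}


def jaccard(a, b):
    """Share of terms two token sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_clusters(tweets, threshold=0.3):
    """Put each tweet into the first cluster whose leader is similar enough,
    otherwise start a new cluster with that tweet as its leader."""
    clusters = []  # list of (leader_terms, member_tweets)
    for tweet in tweets:
        terms = tokens(tweet)
        for leader_terms, members in clusters:
            if jaccard(terms, leader_terms) >= threshold:
                members.append(tweet)
                break
        else:
            clusters.append((terms, [tweet]))
    # Largest clusters first, so the most frequent thoughts surface at the top
    return sorted((members for _, members in clusters), key=len, reverse=True)


# Made-up sample tweets, for illustration only.
SAMPLE = [
    "I want to buy new shoes at the mall",
    "want to buy a new jacket at the mall",
    "dinner and then a movie at the mall",
    "had dinner then saw a movie at the mall",
    "had a good time at the mall",
    "I need to buy new shoes",
]

for members in leader_clusters(SAMPLE):
    print(len(members), "tweets, e.g.:", members[0])
```

The cluster sizes give the frequency of each thought, which is the second benefit mentioned above. A single-pass method depends on the order of the tweets, so a more robust algorithm such as k-means would be preferable for a real study.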

This behavior mining seems endless. In the same manner we can look for mentions of food (for example, how often 'Chinese', 'Indian' or 'Pizza' appear in Tweets), buying patterns (which items are discussed most frequently in "i want to buy" Tweets), or whether users feel happier when they buy gifts for themselves or for others.
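Counting keyword mentions of this kind takes very little code. The following sketch assumes whole-word, case-insensitive matching and counts each tweet at most once per keyword. The function name and the sample tweets are hypothetical.

```python
import re
from collections import Counter


def mention_counts(tweets, keywords):
    """Count how many tweets mention each keyword (case-insensitive, whole words)."""
    counts = Counter({keyword: 0 for keyword in keywords})
    for tweet in tweets:
        words = set(re.findall(r"[a-z]+", tweet.lower()))
        for keyword in keywords:
            if keyword.lower() in words:
                counts[keyword] += 1
    return counts


# Made-up sample tweets, for illustration only.
FOOD_TWEETS = [
    "Pizza at the food court, then shopping",
    "Craving Chinese food at the mall",
    "pizza again today",
    "i want to buy new shoes",
]

counts = mention_counts(FOOD_TWEETS, ["Chinese", "Indian", "Pizza"])
for keyword, n in counts.most_common():
    print(keyword, n)
```

The same function can be applied to only those tweets containing "i want to buy" to see which items are discussed most often in that context.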