Wednesday, 08 December 2010

What Women Want - As seen in Tweets

This is probably a post that will interest many of us. 15,000 Tweets containing the phrase "women want" were collected. What do women value most when it comes to how they want to feel? What do women really love? How important is it for women to feel special? And finally, can Tweets really tell us all this when Text Mining techniques are applied to them?

Normally at this point I would describe technical details such as how I pre-processed the Tweets and the problems I ran into while trying to analyze this information. Thanks to @nathalief, I was advised to focus not only on what women want but also on their feelings.

First, let's look at the results from Tweets that, apart from the phrase "women want", also contain words such as "feel", "feeling", "feels" or "felt". The following chart shows which words were frequently found in these Tweets (and thus which feelings a woman wants to experience):
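Before the chart, here is a minimal sketch of this filtering-and-counting step in Python, assuming the collected Tweets are available as plain strings (the stop word list is illustrative):

```python
import re
from collections import Counter

FEEL_WORDS = {"feel", "feeling", "feels", "felt"}
STOPWORDS = {"the", "a", "to", "and", "of", "that", "is", "it", "women", "want"}

def feeling_word_counts(tweets):
    # Keep Tweets containing "women want" plus a "feel" word, then
    # count the remaining words to see which feelings are discussed
    counts = Counter()
    for tweet in tweets:
        tokens = re.findall(r"[a-z']+", tweet.lower())
        if "women want" in tweet.lower() and FEEL_WORDS & set(tokens):
            counts.update(t for t in tokens
                          if t not in STOPWORDS | FEEL_WORDS)
    return counts

tweets = ["Women want to feel safe and loved",
          "All women want is to feel special"]
print(feeling_word_counts(tweets).most_common(5))
```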




So it appears that one of the first priorities in terms of how women want to feel is security (shown as safe and secure in the chart). Notice how important it also is for women to feel special and to feel that someone loves them (words love, loved, like).

How about words that frequently occur with the word "Love" :



It appears that women want "to love and to be loved", with "respect", "affection" and "sex" coming next.

Surely there must be men who also give their opinions on "what women want" within these Tweets. Quite possibly, many guys would say that "women just love money". To capture those who believe that women want money, let's see which words occur frequently with the word "money" within these Tweets:



Notice how the landscape of keywords changes here: apart from "love", "secure" and "hurt" we see words labeled as "censored" (for obvious reasons), "shoes" and "future". These words communicate a more materialistic and logical point of view on what women want. Unfortunately for this analysis there was no way to identify which Tweets originated from women and which from men. Also, at the time these Tweets were collected, a heavily circulated Re-Tweet promoted a more 'materialistic' profile of women (hence the words multiple and shoes). I decided to keep this Re-Tweet in the analyzed data because, since it was so heavily re-tweeted, it was presumably also endorsed by a large audience.

Perhaps these results show once again that "Men are from Mars and Women are from Venus".

Tuesday, 23 November 2010

Spam Detection in Social Data : A new business?

All of us who use Twitter know the problem of spam Tweets. Spamming on Twitter can happen in several ways: for example, spammers can attach a trending topic to Tweets that have nothing to do with that topic, just to make them visible. Other Tweets contain no erroneous hash tags but still carry uninteresting information.

In a previous example, Tweets were used to analyze the sentiment of Twitter users on the U.S. Economy. The study used several thousand Tweets to extract insights. However, among all the Tweets that genuinely discussed the economy, there were several spam Tweets such as "make money online even if the economy is bad".

It is well known that the most time-consuming part of a Data / Text Mining project is pre-processing. Therefore, when one wants to analyze Tweets and extract knowledge from them, one obvious step is to remove spam and uninteresting Tweets, to minimize the chance of GIGO (garbage in, garbage out).

Spam detection in Tweets -and Social Media unstructured data in general- is a difficult task. It requires "concept-aware" analysis of Text. One of the interesting facets of analytics is the ability to solve the same problem in several ways, or -perhaps even better- to combine all available tools to reach a better solution.

There is an ever-growing number of companies that analyze Social Media Data, and erroneous data may be seriously altering their insights - even if millions of records are available. Perhaps in the very near future, providing cleaned social media data to analytics companies and other information consumers could be a business in its own right.

Spam detection can be performed in many ways. Using machine learning methods is one: training a classifier on -say- hundreds of thousands of Tweets that are marked as "spam" or "no-spam". A more elaborate methodology would be to build and define rules, by non-automatic means, that characterize spam Tweets. We could even consider additional information such as who Tweeted, how many followers the user has, or how often '@' is used to address other users. Once again, the problem representation and the choice of algorithms should be made carefully.
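A minimal sketch of the machine learning approach, using scikit-learn; the four sample Tweets and their labels are of course hypothetical, and a real classifier would be trained on a much larger labeled sample:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative sample of Tweets marked "spam" / "no-spam"
tweets = ["make money online even if the economy is bad",
          "the economy shrank again this quarter",
          "work from home and earn cash fast",
          "unemployment figures released today"]
labels = ["spam", "no-spam", "spam", "no-spam"]

# Bag-of-words features (with bigrams) feeding a Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(tweets, labels)

print(model.predict(["earn money fast online"]))  # expected: ['spam']
```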

Spam detection in Social Media Data is one of the problems that will become more important as more analytic companies are created. Detecting interesting information is another area to watch. People want real insights.

In the previous post, Tweets were used to identify what people want / feel / don't like when they visit a shopping mall. While analyzing this information, it was found that the word 'Omaha' was associated with the word "Mall". Upon closer inspection I realized that "Omaha Mall" is a song by Justin Bieber. Of course I am not suggesting that the Tweets about Justin's song were spam, but they had nothing to do with the purpose of the analysis. Could an automated technique identify this inconsistency and suggest filtering out this information? Being able to automatically select the right information will only become more important as text information increases and fast, correct and actionable intelligence becomes a necessity.

Tuesday, 02 November 2010

Mining consumer behavior in Tweets


In the previous post we discussed the first steps necessary to understand what consumers write in their Tweets regarding a recent visit to a shopping Mall. In this post we will see how, from this information, Marketers can understand spending patterns, learn what consumers liked about their visit to a shopping Mall and learn what is important to consumers. According to Dr Dimitriadis, with whom I teamed up for this analysis, important things to look at include (list not exhaustive):

- spending patterns and situations (when, what and with whom people spend money)
- tenant mix preference (which products people like to buy and what else / new they want)
- experience evaluation (safety, availability of stores / products, cleanliness etc)
- perception of shopping mall communications (what people think about mall ads / messages)

Although there are many behaviors and opinions we could look for, let's first identify what makes a consumer happy. To find this out we can analyze all Tweets containing a :-) (smiley) and find which keywords co-exist with it in these Tweets. Here are some of the results:
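Before the chart, as an aside, here is a minimal sketch of this smiley co-occurrence computation, assuming Tweets as plain strings; comparing each word's count inside smiley Tweets against its overall count is one plausible way (an assumption, not necessarily the method used here) to keep generic words from dominating:

```python
import re
from collections import Counter

def tokenize(tweet):
    return re.findall(r"[a-z']+", tweet.lower())

def smiley_cooccurrence(tweets, min_count=5):
    # Count words inside smiley Tweets and normalize by overall counts
    in_smiley, overall = Counter(), Counter()
    for t in tweets:
        tokens = tokenize(t)
        overall.update(tokens)
        if ":-)" in t or ":)" in t:
            in_smiley.update(tokens)
    lift = {w: in_smiley[w] / overall[w]
            for w in in_smiley if overall[w] >= min_count}
    return sorted(lift.items(), key=lambda kv: -kv[1])

print(smiley_cooccurrence(["buying a birthday gift :-)",
                           "stuck in traffic at the mall"], min_count=1))
```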

Apart from some typical words that suggest positive feelings, we also see that 'friend' and 'birthday' are commonly found alongside smileys. It was found that consumers who shop for a birthday present or outfit use smileys often. Let's see what happens with Tweets that contain negative feelings, expressed by a :-( (frownie):





A frown appears more often when consumers do not find what they were looking for, and also when they are at the mall alone. But what about the things people hate when they visit a mall? A similar statistical test is performed to identify words that co-occur with the phrase "I hate it when" (a sketch of such a test follows the list below). These words are:

-Park
-People
-Walk
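The post does not say which statistical test was used; a chi-square test on a 2x2 contingency table (phrase present/absent versus word present/absent) is one common choice. A minimal sketch:

```python
from scipy.stats import chi2_contingency

def cooccurrence_test(tweets, phrase, word):
    # Build a 2x2 contingency table: rows = phrase present/absent,
    # columns = word present/absent
    table = [[0, 0], [0, 0]]
    for t in tweets:
        t = t.lower()
        row = 0 if phrase in t else 1
        col = 0 if word in t else 1
        table[row][col] += 1
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p

tweets = ["i hate it when i can't park at the mall",
          "i hate it when people walk so slow",
          "found a great spot to park today",
          "a lovely walk around the mall today"]
# Tiny sample, so the result will not be significant
print(cooccurrence_test(tweets, "i hate it when", "walk"))
```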


By looking at the actual tweets we can identify that many people hate it when:

1. a mall is very busy
2. it is difficult to park at the mall
3. people in front walk at a much slower pace (particularly older people)

Next, we can perform a cluster analysis on these Tweets to identify common "thought clusters" of the consumers and their behavior. As an example, I have used Rapid-I to generate these clusters using the following setup:


Without getting into technical details (such as the use of tokenization, stop word removal and optimization of the process), executing the stream shown above runs a cluster analysis that identifies common consumer thoughts on their visit to the shopping Mall (an equivalent sketch in Python follows the list below). Some of the clusters found are:

- People who state their intent to buy something
- Consumers who eat a meal and then go to the movies
- "saw a cute guy / girl looking at me"
- "I had a good time at the mall"

As discussed in previous posts, cluster analysis not only allows us to find common groups of behavior and thoughts but also to identify the frequency with which these behaviors and thoughts appear in consumer Tweets.

This behavior mining seems endless: in the same manner we can look for mentions of food (for example, how often 'Chinese', 'Indian' or 'Pizza' appear in Tweets), buying patterns (which items are discussed most frequently in "i want to buy" Tweets), or whether users feel happier when they buy gifts for themselves or for others.

Monday, 27 September 2010

Inside a consumer's mind with Text Analytics



So far we have seen several examples of how Predictive Analytics applied to Social Media and Blog posts can help us suggest better strategies in Marketing, Branding, Sales and PR. This post is a walk-through example of how we can choose a concept, extract what users write about it on Twitter, gain insights into how consumers think and behave about it, and finally group similar consumer thoughts and experiences using Cluster Analysis. A "concept" could be:

- Any activity
- A Brand (e.g. Apple Inc.)
- A Product / Service
- A Politician


and -almost- anything else discussed in user Tweets.

What we will look at is work that was done specifically to understand what consumers think, liked or disliked while visiting a shopping Mall. What do people feel when visiting a Mall? Which words are associated with a positive experience, or with the presence of a smiley, in Tweets about Malls? Using the Twitter API, approximately 36,000 distinct Tweets were collected on consumer experiences of visiting a shopping Mall (the sample below shows an example of a consumer's negative sentiment):



So how can an analyst get into a consumer's mind by analyzing Tweets, and how would this information be useful? To find some answers I teamed up with Marketing Strategist Dr Nikos Dimitriadis, who assisted me in assessing the actionability and interestingness of each extracted insight. Note that we capture thoughts from a biased sample, which means that we cannot make inferences about the general population. However, this work can be a great additional tool for finding new ideas and insights for Marketing initiatives -on top of more traditional methods such as focus groups- and it also enables us to form several hypotheses as to what could likely work.

After a number of pre-processing steps -cleaning the captured Tweets of irrelevant information (such as links), replacing words with their synonyms, removing frequently occurring words such as 'and', 'to', 'at', 'in' and 'mall', and filtering out very short Tweets (a sketch of these steps follows below)- I started computing frequency counts of the words contained in Tweets about Malls:
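Here is a minimal sketch of those pre-processing steps; the synonym mapping and stop word list are illustrative (the real ones were larger):

```python
import re

STOPWORDS = {"and", "to", "at", "in", "the", "mall"}
SYNONYMS = {"ive": "i've", "bday": "birthday"}

def preprocess(tweet, min_tokens=3):
    tweet = re.sub(r"https?://\S+", "", tweet)           # drop links
    tokens = re.findall(r"[a-z':()\-]+", tweet.lower())  # keep words and smileys
    tokens = [SYNONYMS.get(t, t) for t in tokens]        # replace with synonyms
    tokens = [t for t in tokens if t not in STOPWORDS]   # remove frequent words
    return tokens if len(tokens) >= min_tokens else None # filter short Tweets

print(preprocess("ive got my nails done at the mall :-) http://t.co/x"))
# -> ["i've", 'got', 'my', 'nails', 'done', ':-)']
```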


We immediately notice how often LOL and :-) (smiley) appear in Tweets about being at, going to or returning from the Mall, which also gives us examples of consumers being in a specific mood. Here is what happens when we look at the most frequently occurring 2-word phrases:


and 3-word phrases (Note : ive = i've) :





Looking at the two charts, we also notice that the following phrases occur frequently:

- My best friend : consumers Tweet the fact that they are visiting a Mall with their best friend.

- My nails done : appears to be one of the most frequently discussed activities among women.

We could then look at Words and Phrases that seem interesting for understanding consumer experiences and values when visiting a Mall, such as:

- Shop
- Shoes
- Parking Lot
- Food Court
- Need / Want
- Walk around
- Made my Day
- Post Picture FaceBook

and mine through all these words / phrases to understand what consumers think: what exactly made the day of consumers who used the phrase "Made my Day" in their Tweets? How do consumers feel when they visit the Mall with their best friend? When they are alone? Which activities trigger positive feelings? But more importantly: how could one use this information to better understand consumers and market a concept? More in the next post.

Tuesday, 21 September 2010

Social Media Insights from Predictive Analytics



Here is one more example of how Predictive Analytics may help professionals make better decisions. For this post, a total of 3,000 Social Media post titles were analyzed to gain -hopefully- important insights for Social Media professionals. To achieve this, Text Mining was used to analyze the text of the titles, identify the most important subjects (do posts about Personal Branding tend to be re-tweeted more than posts about Social Media Monitoring?) and also try to prioritize the various areas of Social Media.

We start with the basics. Many Social Media pros read (and write) about various subjects: how-to's, things to avoid, the adoption of Social Media, and so on. The first goal was to identify the most frequently occurring subject areas in Social Media posts using simple keyword frequencies. The following chart shows this information:


Although it is not much of an insight that Social and Media top the list, or that Twitter appeared in posts more frequently than FaceBook, we do see that Brand is found more frequently than Marketing or Strategy.

However, there is a slight problem: the chart shown above counts single words. Measuring how often 2 adjacent words occur together in Social Media posts could be more useful, with the phrase Social Media itself omitted (click to enlarge):



This leads us to the fact that most Social Media posts were found to be about how-to's (note that the phrases How to and ways to have a similar meaning). One could dig deeper to identify the concepts to which these how-to's apply (How to monetize, How to be successful, How to avoid mistakes, etc.)


The next goal was to find words and phrases that are commonly found in posts with a high number of re-tweets (>40). To get this insight, various Text Mining techniques were used. The following features were taken into consideration:

- Author of Post
- Title of Post
- Number of Retweets


and here are some of the results :



Words that have a negative weight tend to be found in SM posts with a low number of re-tweets (write, talk, trust, sentiment), while launch and America were commonly found in popular posts. Please notice (the reason will be explained later) that personal is one of the hot words, but so are link and increase.
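The post does not name the model behind these weights, but a linear classifier over word features produces exactly this kind of signed weight per word. A minimal sketch with hypothetical titles and re-tweet counts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical sample: post titles and their re-tweet counts
titles = ["How to launch your personal brand in America",
          "Why I write about sentiment and trust",
          "Ways to increase the links to your posts",
          "Let's talk about talking"]
retweets = [120, 3, 55, 1]
y = [1 if r > 40 else 0 for r in retweets]  # popular = more than 40 re-tweets

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(titles), y)

# Negative weights: words typical of low re-tweet posts;
# positive weights: words typical of popular posts
for word, w in sorted(zip(vec.get_feature_names_out(), clf.coef_[0]),
                      key=lambda p: p[1]):
    print(f"{word:12s} {w:+.2f}")
```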

With this information, an analyst may then identify why such words tend to appear in popular Social Media posts. Here are some insights:

  • Personal Branding appears to be a hot area. People are primarily interested in the various ways they can increase their "personal worth" in the Social Media arena.
  • IWOM : Internet Word Of Mouth is also a concept that frequently occurs in SM posts with many re-tweets.
  • Positive & Possible : it appears that posts that discuss various possibilities in a positive way (use of the word could) were found to be re-tweeted more (recall the link and increase keywords discussed previously).

Sunday, 05 September 2010

"Ways to stop Social Media and Sentiment Mining"



While looking at my Google Analytics account, I came across a keyword search, originating from Australia, which was different from the keywords that usually drive traffic to my blog. The keywords were the following:

"Ways to stop Social Media and Sentiment Mining"

I decided to write this post assuming that the person who submitted this search does not like the fact that machines are mining his or her points of view about people or products, or "understanding", to some degree, whether he/she feels happy or not.

Among the many interesting aspects of being a Data Miner is explaining to other people what a Data Miner does (this was also discussed by G. Piatetsky-Shapiro, if my memory serves me well). When asked, I sometimes say that I also "analyze emotions as these are expressed on the Web". At first people are very interested, but after a short while the responses almost always go along these lines:

- Are you allowed to do this?
- Is this legal?
- Have you ever heard about Big Brother?

It's no big secret that emotions play a major role in our lives and drive our decisions. Many people are starting to realize that companies already use Information Extraction and Data / Text Mining techniques to extract the things we discuss about various products or people and to better understand our behavior. I believe that the most important thing in this area is not just Sentiment Mining -in other words, whether we feel positive or negative about a Person, Product or Brand- but the ability of Analytics to extract our core values and analyze our emotions.



When applying Text Mining, or a mixture of Data and Text Mining methods, to -for example- Twitter, we are not only able to see the sentiment for a product. We can identify a user who is alone, feeling bored and watching television. We can form hypotheses on whether users who survived Cancer express more positive thoughts than other user groups (see Surviving Cancer, Happiness and Twitter), find what sort of lifestyle makes a CEO happy, or ask whether a specific profession increases your chances of being single (see Twitter Analytics : Cluster Analysis reveals similar users). Cluster Analysis can also identify people's core values and what they want or are trying to avoid.

Some of the examples discussed above have clear business value while others don't. The important fact, however, is that analysts now have the data to analyze our emotions, and our responses to the events of our lives, on a much deeper level. This information has never before been available at this scale.

Should we stop extracting these insights and how dangerous can these insights become?

Tuesday, 31 August 2010

Banks, Risk Disclosure and Text Analytics



A UK-based MSc student at Kingston Business School, Christos Gkemitzis, had an idea for his MSc project which immediately caught my attention: apply Text Analytics methods to the annual reports published by Banks, extract metrics on how these Banks handle their Credit and Interest Rate risk as explained in those reports, then test several hypotheses (do Banks with a higher risk profile disclose a greater amount of risk-related information than those with a lower risk profile?) and also identify any correlations:
  • between the size of the Bank and volume of risk disclosures
  • between the risk of the Bank's profile and volume of risk disclosures
  • between the profitability of the Bank and volume of risk disclosures

Essentially, the problem is to -automatically- identify mentions of credit risk, but in a specific way:

1) Identify whether sentences mentioning risk refer to the present, past or future
2) Identify positive, negative or neutral sentiment mentions about Credit Risk
3) Identify qualitative versus quantitative information regarding the Bank's Credit Risk


For example consider the following text which is part of an actual Bank report :

"A substantial increase of credit risk and provisions is also expected, as from 2009 on, theeconomy will be entering a period of low growth."

The sentence above contains qualitative information ("substantial increase of credit risk and provisions") and negative Sentiment referring to the future ("also expected" and "will be entering a period of...").

while the following sentence :

"The Group’s ongoing efforts to manage efficiently credit risk led the level of loan losses to 3.3% in December 2008"

contains quantitative information ("level of loan losses to 3.3%") with a positive sentiment about Credit Risk handling in the past.

After receiving some sample PDF Bank reports from Christos, I began feeding them to the GATE Text Analysis toolkit in order to assess the feasibility of such an analysis. After a few tutorials over Skype, Christos -who had no prior knowledge of programming- started using the toolkit on his own within a very short amount of time. Here is a snapshot of GATE in action for this analysis:






The snapshot shows how GATE correctly identified a part of the text that communicates negative sentiment about Credit Risk, in a qualitative manner, for the future (notice that "QualitativeBadNewsFuture" is checked).

After running GATE on many documents, Christos had the necessary metrics (= how many mentions of different Risk types exist in each document) to test his hypotheses using a 2-tailed Wilcoxon test. To identify correlations, the Spearman coefficient was also used.
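A rough sketch of this testing step with SciPy, using hypothetical metric values in place of the actual data (ranksums is the two-sided Wilcoxon rank-sum test):

```python
from scipy.stats import ranksums, spearmanr

# Hypothetical GATE output: number of risk mentions per annual report
high_risk_banks = [42, 37, 55, 61, 48]
low_risk_banks = [25, 31, 22, 40, 28]

# Do higher-risk banks disclose more risk-related information?
stat, p = ranksums(high_risk_banks, low_risk_banks)
print(f"rank-sum statistic = {stat:.2f}, p = {p:.3f}")

# Correlation between bank size and volume of risk disclosures
bank_size = [310, 120, 450, 600, 280, 150, 90, 400, 510, 200]
disclosures = [44, 20, 52, 70, 35, 25, 15, 49, 66, 30]
rho, p = spearmanr(bank_size, disclosures)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```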

Since this work has not been submitted yet, I am not permitted to post its findings. The post does, however, show another application of Text Analytics and one more of the many sources of unstructured information that can be mined for knowledge.

Sunday, 08 August 2010

Interview with Kaggle CEO Anthony Goldbloom


For those who haven't heard of Kaggle before, Kaggle is a team of people who provide the functionality and support to host Data Mining contests. Here is how it works: suppose you are working for a Telco and wish to implement a new Churn prediction model. Rather than running this project in-house, you submit your data to Kaggle. What happens next is that -hopefully- many statisticians around the globe will each analyze your dataset, produce a model and then submit their prediction model(s) to Kaggle. The best model (and hence its creator) wins the prize, which is put up by the Telco company. Here is the interview with Kaggle CEO, Anthony Goldbloom:




- What is Kaggle and what new ideas does it bring to the predictive analytics arena?

Kaggle aims to help companies and researchers make predictions more precise by providing a platform for data prediction competitions. Competitions turn out to be a great way to get the most out of a dataset. This is because there are infinitely many approaches to any data modeling problem. By opening up a data prediction problem to a wide audience, a competition makes it possible to get to the frontier of what is possible given a dataset's inherent noise and richness.

- Can you tell us more about "real-time science" and how it could help Research globally?

Data modeling competitions can facilitate real-time science. Consider the recent announcement about the discovery of genetic markers that correlate with extreme longevity. Work on the study began in 1995, with results published in 2010. Had the study been run as a data modeling competition, the results would have been generated in real time and insights available much sooner (and with a higher level of precision).

Data modeling competitions also benchmark, in real time, new techniques against old. A technique that performs well in competitions can prove its mettle long before any paper can be published, helping the science to progress more quickly.

Competitions also help to avoid situations where valuable techniques are overlooked by the scientific establishment. This aspect of the case for competitions is neatly illustrated by Ruslan Salakhutdinov, now a postdoctoral fellow at the Massachusetts Institute of Technology, who had a new algorithm rejected by the NIPS conference. According to Ruslan, the reviewer ‘basically said “it’s junk and I am very confident it’s junk”’. It later turned out that his algorithm was good enough to make him an early leader in the Netflix Prize (he called his Netflix Prize team NIPS_reject).

- How can companies benefit through Kaggle?

Companies can use Kaggle to gain an advantage over their competitors. Consider a bank that wants to improve the algorithms that vet loan applicants. If a bank can develop a more effective algorithm they will have fewer defaults and can charge lower interest rates than their competitors. Kaggle has proven to be an effective way to improve existing models very quickly.

Competitions are also really useful to companies that want to develop new products and capabilities. Consider a hedge fund that wants to be able to generate long-range weather forecasts in key agriculture regions. They can attempt to hire a weather forecasting expert or they can use Kaggle to throw the problem open to a wide audience. Using Kaggle they can be sure they'll get great results very quickly.

- How is the best model selected?

The competition host will typically split their dataset into two parts - a training dataset and a test dataset. The training dataset includes all explanatory variables as well as the dependent variable (or the answer). The test dataset also includes all the explanatory variables but the dependent variable (or answer) is withheld.

Participants train their models on the training dataset. They then apply their models to generate predictions on the test dataset. Those predictions are then scored on-the-fly against the actual answers (using one of several evaluation methods). Once the competition deadline passes, the team that generates the most accurate predictions gives the winning methodology to the competition host in exchange for the prize money.
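A minimal sketch of this train/test protocol, with hypothetical numbers and RMSE standing in for whichever evaluation method a given competition actually uses:

```python
import math

def rmse(predictions, answers):
    # Root mean squared error, one common competition metric
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predictions, answers))
                     / len(answers))

# The host withholds the test answers...
withheld_answers = [3.0, 1.5, 4.2, 2.8]
# ...participants submit predictions made from the test explanatory variables
submission = [2.9, 1.7, 4.0, 3.1]

print(f"leaderboard score: {rmse(submission, withheld_answers):.3f}")
```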

Tuesday, 27 July 2010

Summarization of Blog posts with "Web Pulse" Reports

In the past couple of months I was looking for a way to best capture and understand what happens on the Web -and more specifically what people write in blogs- in terms of sentiment and emerging trends. The first thing I came up with was the idea of creating a "Web Pulse" Report: a way to summarize what people are discussing on the web. Although the implementation was not as complex as I expected, I was pleased to find that the knowledge that can be extracted is -to say the least- very useful and interesting. Before looking at an actual Report example, here are the elements that comprise it:


1) Concept Frequencies : Identifies the concepts that bloggers most frequently write about

2) Global co-occurrence Matrix : Identifies the most frequent word bigrams

3) Keyword Associations for Concepts : Which keywords tend to co-exist with a specific concept?

4) Most frequent n-grams associated with a given Concept (where n=2,3,4,5)
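As a minimal sketch of element (2), here is one way to build such a matrix in Python, counting how often two words appear in the same blog title (the titles below are illustrative):

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_matrix(titles):
    # Count how often each unordered pair of words shares a title
    pairs = Counter()
    for title in titles:
        words = sorted(set(re.findall(r"\w+", title.lower())))
        pairs.update(combinations(words, 2))
    return pairs

titles = ["Turkish vessel Piri Reis enters the Aegean again",
          "Piri Reis: the Turkish research vessel returns"]
for (w1, w2), n in cooccurrence_matrix(titles).most_common(5):
    print(f"{w1},{w2} : {n}")
```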


As an example, we will identify what bloggers were discussing in Greek blogs on July 27th, 2010, looking specifically at the Blog titles of more than 300 Greek blogs.

Here are the concept frequencies found (in descending order) on that date :


[Turkey]=178
[Politics]=128
[Economy]=101
[International Monetary Fund - IMF]=62
[Banking]=61
[Public Sector]=50
[Negative Characterizations]=30
[Political Parties]=29
[George Papandreou]=29 (= Prime Minister of Greece)
[Loans]=22
[Society]=20


The first interesting fact was that "Turkey" appears at the top of the list of Greek blog articles, even though the Greek mass media did not place much weight on the latest Turkish behavior in the Aegean Sea that day. The second concept is Politics, with the Economy following next.

Here is the top part of the Global Co-occurrence Matrix found (in Greek, with English glosses added):


ΣΚΑΦΟΣ, ΤΟΥΡΚΙΚΟ (vessel, Turkish) : 25
ΡΕΙΣ, ΤΟΥΡΚΙΚΟ (Reis, Turkish) : 25
ΠΙΡΙ, ΤΟΥΡΚΙΚΟ (Piri, Turkish) : 25
ΡΕΙΣ, ΣΚΑΦΟΣ (Reis, vessel) : 24
ΕΛΛΑΔΑ, ΧΩΡΑ (Greece, country) : 23
ΥΠΟΥΡΓΕΙΟΥ, ΟΙΚΟΝΟΜΙΚΩΝ (Ministry, of Finance) : 22
ΡΕΙΣ, ΕΡΕΥΝΗΤΙΚΟ (Reis, research) : 22
ΠΙΡΙ, ΕΡΕΥΝΗΤΙΚΟ (Piri, research) : 22
ΡΕΙΣ, ΠΙΡΙ (Reis, Piri) : 21
ΑΝΑΜΕΝΕΤΑΙ, ΣΥΜΦΩΝΑ (is expected, according to) : 21
ΠΟΛΙΤΙΚΗ, ΧΩΡΑΣ (politics, of the country) : 19
ΟΙΚΟΝΟΜΙΑ, ΕΛΛΗΝΙΚΗ (economy, Greek) : 19
ΜΟΝΑΔΩΝ, ΔΕΗ (units, Public Power Corporation) : 19
ΚΥΒΕΡΝΗΣΗ, ΠΑΠΑΝΔΡΕΟΥ (government, Papandreou) : 19
ΗΓΕΣΙΑ, ΠΟΛΙΤΙΚΗ (leadership, political) : 19
ΕΡΕΥΝΗΤΙΚΟ, ΤΟΥΡΚΙΚΟ (research, Turkish) : 19
ΥΠΟΧΡΕΩΣΕΙΣ, ΜΝΗΜΟΝΙΟ (obligations, Memorandum) : 16
ΧΩΡΑ, ΜΝΗΜΟΝΙΟ (country, Memorandum) : 15

The top 4 frequent keyword associations are -again- about the latest problems between Greece and Turkey, and more specifically about the fact that a Turkish vessel named "Piri Reis" (in Greek: ΠΙΡΙ ΡΕΙΣ) has repeatedly been entering a Greek part of the Aegean Sea without permission.

Let's look at the association frequencies found between specific Concepts. The following is an example of the concepts associated with "Giorgos Papandreou" (Greek Prime Minister):

International Monetary Fund - IMF=32
Politics=28
Political Reform=6
Nea Dimokratia=3 (=Oppositional Political Party)
Politics, International Monetary Fund, Loans, Political Parties=2
Negative Sentiment=2
Public Sector=2
Uncertainty=2

It appears that George Papandreou is frequently mentioned where the IMF is involved, and also that a political reform might be on its way.

The fourth element of the report shows phrases that are commonly found across Blog posts. Since many blogs tend to use the same titles, this functionality allows one to look at how information spreads from one blog to another.

The report can be enhanced in various ways. For example, by tokenizing Blog posts into sentences, I have added the option of performing chi-square tests to identify co-occurrences in a more rigorous way, rather than using strictly absolute term frequencies. Through different types of analysis and knowledge representation we are able to look at our subject(s) of interest in different ways, which -hopefully- leads us to better insights.

From my experience so far, this type of report is a simple but efficient way to summarize the content of Blogs and to show what is 'hot' at the moment and why.

Wednesday, 26 May 2010

Concept Trending : A Glimpse into the future?

In the previous post, some ideas were presented on the trends of Text Analytics. Analyzing and extracting knowledge from text is hard, whether it involves Sentiment Analysis, Text Classification, Cluster Analysis or Information Extraction.

A particularly interesting application of Text Analytics is the identification of trends for specific concepts. In contrast to simple keyword trending, this type of trending attempts to disambiguate keywords according to their context and uses co-reference resolution to identify the subjects to which the sentiment relates.

To better understand concept trending, let's look at an example (a rough sketch follows). Suppose that one wishes to identify the trend of negative characterizations -and even swear words- on the Greek web. The first step would be to collect the information from various blogs and forums whenever a negative keyword is found. A Text Analysis toolkit could then provide the means of identifying the subject(s) of the negative characterizations on the Greek web, such as Politicians, the Economy or the International Monetary Fund, which recently came to the rescue.
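A rough sketch of the counting part of this process, with a hypothetical negative lexicon and subject list; real concept trending would add the context disambiguation and co-reference resolution described above:

```python
from collections import Counter
from datetime import date

NEGATIVE = {"crisis", "fraud", "useless"}  # hypothetical negative lexicon
SUBJECTS = {"imf": "IMF", "economy": "Economy", "parliament": "Politicians"}

def concept_trend(posts):
    # posts: iterable of (date, text) pairs;
    # count negative mentions per subject per day
    trend = Counter()
    for day, text in posts:
        words = set(text.lower().split())
        if words & NEGATIVE:
            for key, concept in SUBJECTS.items():
                if key in words:
                    trend[(day, concept)] += 1
    return trend

posts = [(date(2009, 12, 28), "the economy is in crisis"),
         (date(2009, 12, 29), "useless measures from the imf")]
print(concept_trend(posts))
```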

From a post dated December 28th, 2009 :

"Over the past month there has been a considerable amount of increase in negative economy sentiment, crime-related incidents and/or terms that communicate future social instability and uneasiness."

Although deliberately left unstated, the country the article addressed was Greece, and the increase in negative sentiment was found to start at the beginning of December 2009. This is a photo of a Greek newspaper taken on February 4, 2010:





The title shown speaks of a "Fear of Social Explosion". On May 6th, 2010, after clashes in the center of Athens, mentions of a "Social Explosion" in Greece started appearing on the Web. The following Google search uses a timeline for "Social Unrest"; the increase in mentions appears to start in February 2010.



Although concept trending has significant challenges, it is a process which, in my experience, has proven itself many times. A recent article in NewScientist suggests that by capturing the sentiment of the crowds we are able to predict the moves of the S&P 500, or that by looking at keyword searches such as "job search engine" we can predict coming changes in the US unemployment rate.

Monday, 17 May 2010

The future and trends of Text Analytics

I recently attended a GATE seminar at the University of Sheffield. Having used GATE for quite some time now, I was happy to see that the GATE team remains committed to developing the GATE Text Analysis Workbench by constantly adding more functionality.

Although many of the participants were PhD students, I was also happy to see people from companies that now wish to leverage the hidden knowledge that exists in unstructured text. Whether it was the analysis of Patent text, intelligent search over the text of Photo Captions for a large News Agency, or understanding what a customer wants, Text Analytics is becoming an important tool for making better decisions.

I also had the opportunity to speak with several people about the future of Text Analytics. What are we likely to see happening in the next few years in Information Extraction and Text Analytics?



First we have to understand how Text Analytics delivers results. For a computer to 'understand' unstructured text, it must be 'taught' that the word 'Dollar' is the currency of a country called the 'US', and also that US, United States, USA and U.S.A are the same concept. This means that hundreds of thousands of concepts and synonyms have to be specified so that a computer can identify them in unstructured text. This process is called Text Annotation.
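A minimal sketch of the synonym-lookup part of annotation, with a tiny hypothetical gazetteer; real toolkits such as GATE use far larger gazetteers plus grammar rules:

```python
# Map surface forms to a canonical concept
GAZETTEER = {
    "united states": "COUNTRY_US",
    "u.s.a": "COUNTRY_US",
    "usa": "COUNTRY_US",
    "dollar": "CURRENCY_USD",
}

def annotate(text):
    # Naive substring lookup over the gazetteer
    lowered = text.lower()
    annotations = []
    for surface, concept in GAZETTEER.items():
        start = lowered.find(surface)
        if start != -1:
            annotations.append((surface, concept, start))
    return annotations

print(annotate("The Dollar fell while the United States debated"))
```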

The gold standard of Text Annotation is annotation done by humans: a computer sifts through the text of a web page and annotates it with concepts, and these annotations are then checked against annotations made by humans on the same text, to assess the accuracy with which the computer 'understands' the text and the concepts and entities that exist in it.

So what does the future hold? First of all, as unstructured text becomes more plentiful, there will be a greater need for 'annotation farms': groups of people manually annotating free text, identifying an ever-growing number of Companies, Managers and Politician names, or anything else that has to be 'taught' to a computer. Note that Annotation Farms already exist, but the need for this service will only grow.

A second trend in Text Analytics could be something equivalent to what we have seen happening with NetFlix. Suppose that you own a company that produces Brand 'X' and you wish to track the reputation of your product online. You would then submit a sample of your product's mentions to various companies that analyze text and have them compete against each other in terms of -for example- Precision and Recall. The one that consistently produces the best metrics (whether Precision / Recall, the Kappa statistic or the F-Measure) also gets the job.
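For reference, a minimal sketch of how Precision, Recall and the F-Measure would be computed over sets of annotations (the gold and system annotations below are made up):

```python
def precision_recall_f1(predicted, actual):
    # predicted / actual: sets of (mention, label) annotations
    tp = len(predicted & actual)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Brand X", "PRODUCT"), ("Acme", "COMPANY")}
system = {("Brand X", "PRODUCT"), ("Acme", "PERSON")}
print(precision_recall_f1(system, gold))  # -> (0.5, 0.5, 0.5)
```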

A third trend could be the development of text analytics for specific concepts. Sentiment Analysis and Named Entity Recognition are hard work if one wants to produce sound and accurate results, so it is quite possible that Text Analytics experts will choose a specific concept -for example, the reputation of Banks- and then work on the analysis of this -very specific- concept so that they achieve better metrics.

Tuesday, 23 March 2010

Predictive Analytics and Politics - Part 2

In the previous post we saw an example of analyzing messages sent by citizens regarding a new taxation plan. We identified some correlations between keywords and concepts, but there are more ways to gain knowledge from such unstructured information.

Using Cluster Analysis we can extract groups of similar concepts among thousands of comments written by citizens, and also establish an ordering within them. Let's assume that Cluster Analysis reveals the following clusters (of similar concepts) within the submitted messages:

- battling tax fraud

- requests for a fair tax plan

- requests for less taxation for large families

- various incentives for citizens

Our problem is finding the order of importance that people place on the various concept categories shown above: is battling tax fraud considered more important (= discussed more frequently by citizens) than requesting a fair tax plan? How about taxation for larger families?

A cluster analysis can reveal the size of each cluster and -as a consequence- how important each cluster is:



We make the assumption that, in the text representation shown above, Cluster 5 (which contains 329 citizen messages) is about requests for a fair tax plan and Cluster 10 contains messages requesting that tax fraud be minimized. It appears that significantly fewer people are concerned with the battle against fraudulent activity; they instead request the -more immediate- benefits of a fair tax plan.

Collecting and analyzing information found in blogs and forum entries is another area of analysis that could prove very interesting. Let's see an example with the Political / Social / Economic situation in Greece. The goal is to identify and extract trends and co-occurrences of key concepts from blog titles and forum posts, such as:

- Names of major Political parties
- Names of Politicians
- Economy (words/phrases such as "austerity plan")
- Negative characterizations
- Company Names
...etc

Several applications can emerge from this kind of data. We could track specific concepts through time and observe their trends. We can also identify which concepts are discussed together. As an example, we could identify the reasons why Giorgos Papandreou (PM of Greece) is characterized in a bad way in blog posts (= which other concepts are found in Blog posts containing the keywords 'Giorgos Papandreou' AND Bad Characterizations?):


(Note: PASOK = the Governing Political Party)

Politics = 120
Economy = 72
Economy, Politics = 40
PASOK = 24
Politics, PASOK, Referendum = 8
Economy, Politics, PASOK, Referendum, Immigrants = 8
Economy, Politics, Society = 8
Society, PASOK = 4


In other words: Giorgos Papandreou is criticized mainly for his Political decisions and the Economy, followed by criticism of PASOK. Negative sentiment also exists because a percentage of Greek citizens demand that a referendum take place concerning the Greek government's recent decision to grant Greek citizenship to a large proportion of Immigrants.

Friday, 12 March 2010

Predictive Analytics and Politics - Part 1

One of the most interesting applications of Data/Text Mining and Information Extraction is Politics. I started collecting information from various blogs, websites and forums and applying Information Extraction and Data/Text Mining techniques to extract potentially useful knowledge in this area. By combining different pieces of information, one could come up with trends that may tell us what lies ahead.

The latest developments in Greece are more or less known to most people who follow International News. The situation is difficult, and the voice of citizens in various blogs and forums can give us the sentiment of Greek Web Users. For example:

- Which are the most frequently occurring words?

- Which are the most frequently occurring thoughts?

- What are the things that have to be changed by Greek politicians?

To answer these questions, I started collecting information found in the top 120 Greek blogs, the OpenGov website (a state-run website where Greek citizens express their opinions) and a couple more Greek sites with economic content. For blogs and forums, a Java program scans for new information every 20 minutes.

This information is then sent to an annotation engine which analyzes the textual content. Once the text is analyzed, we can -for example- produce a keyword vector that we can later use to understand what citizens are saying on the Web. We can then find answers to many interesting questions, such as:

- With which words is Mr George Papandreou (PM of Greece) associated?

- When there are some very negative words (such as swearing), what other words are found in the same text?

- What does keyword trending tell us? (For example, do we identify an increasing number of swear words in citizen posts?)


First, let's see some examples regarding the OpenGov website, where thousands of citizens have expressed their opinions on the tax policy of the Greek state. The following chart shows a number of pairwise correlations between the words written in these comments:



Under the red rectangle appear two words (dikigoros, iatros) which in Greek mean "Lawyer" and "Medical Doctor" respectively. This essentially tells us that these two professions are frequently mentioned together in citizen discussions. Looking closely at these messages reveals that professionals in these two sectors are said to avoid taxes by not issuing receipts.
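A minimal sketch of such a pairwise word correlation, assuming each comment is a plain string and using the Pearson correlation between word-presence indicators (one plausible choice; the post does not say which correlation measure was used):

```python
import numpy as np

def word_correlation(messages, w1, w2):
    # Pearson correlation between the presence indicators of two words
    a = np.array([w1 in m.lower() for m in messages], dtype=float)
    b = np.array([w2 in m.lower() for m in messages], dtype=float)
    return np.corrcoef(a, b)[0, 1]

messages = ["lawyers and doctors issue no receipts",
            "doctors avoid taxes",
            "a fair tax plan for families",
            "monitor lawyers and doctors closely"]
print(word_correlation(messages, "lawyer", "doctor"))  # ~0.58
```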

Next, we could use association rule learning to look for some more -potentially interesting- rules:


The highlighted rule, although one of low support, could prove interesting: a subset of citizens request that freelancers and the self-employed be more closely monitored for tax fraud.
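A minimal sketch of the support and confidence behind such rules, with hypothetical concept "transactions" standing in for the real citizen messages:

```python
# Each message reduced to the set of concepts it mentions (hypothetical)
transactions = [{"tax_fraud", "freelancers", "monitoring"},
                {"fair_tax_plan"},
                {"tax_fraud", "freelancers", "monitoring"},
                {"fair_tax_plan", "large_families"}]

def support(itemset):
    # Fraction of all messages containing every concept in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the messages containing the antecedent, how many
    # also contain the consequent?
    return support(antecedent | consequent) / support(antecedent)

rule = ({"freelancers"}, {"monitoring"})
print("support:", support(rule[0] | rule[1]))  # 0.5
print("confidence:", confidence(*rule))        # 1.0
```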

Apart from rule learning, it is interesting to identify the proportion of the total dataset for which each rule holds. That also gives us a sense of the order in which different ideas and thoughts exist in the minds of citizens.

In the next post: what does the Voice of the Citizen tell us in Blogs and forums?

Monday, 04 January 2010

Detecting Novelty in Twitter posts

A question one might ask is the following: how can we easily identify and extract novel information from the web? Although we could apply this "novelty detection" in many areas, for now I would like to discuss the idea of semi-automatically identifying novelty among posts on Twitter.

Let's take the IPhone for example. Thousands of Tweets are generated every day regarding the Apple IPhone. These Tweets mainly discuss:

  • Which new apps are available / used / liked.
  • New accessories (cases, chargers, etc)
  • User Experiences and sentiment (such as blaming IPhone's short battery life)
  • Pros and cons of the IPhone vs other similar devices
  • Upgrading / hacking etc.

So the problem is: how can we identify novel information among thousands of Tweets? Some would argue that we should first define what "novelty" is, such as finding a new application or a new accessory for the famous mobile device. Others might argue that novelty is a customer idea about the IPhone that not many people have thought of, and which Apple would be interested in identifying among thousands of Tweets. As an example, consider the following Tweets:



A subset of users experiences problems with the automatic orientation of the IPhone. This subset of IPhone users is perhaps very small, but identifying these Tweets could give Apple some ideas to work on.

Here is another subset of Tweets that talk about the charger's cable length :



In the example shown above, notice that using just "iphone cable" as search terms would return a large number of Tweets, making it hard to identify novelty among them.

Searching for novelty and identifying new ideas among Tweets is not an easy task. The problem is that we do not know what we are looking for in the first place: we can define the general context -such as wanting to identify novelty in user experience- but then we come to a halt in terms of which techniques to use (with a possible exception being cluster analysis).

The potential of using semi-automatic novelty detection on Twitter and other websites -such as delicious links- is very big. Although this is still work in progress, the general methodology for novelty detection in Twitter could be to (a rough sketch follows the steps below):

1) Collect a large subset of Tweets mentioning IPhone and a keyword that identifies context (such as the word charger).

2) Identify keyword frequencies

3) Generate search queries using a subset of keywords chosen in an "intelligent" way; otherwise the number of search queries would be practically impossible to evaluate.

4) Test these combinations of keywords by submitting them to Twitter search and evaluating the results.
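A rough sketch of steps (1) to (3), with two illustrative Tweets; the keyword sets and thresholds are assumptions:

```python
import re
from collections import Counter
from itertools import combinations

def candidate_queries(tweets, context="charger", top_words=8, max_terms=2):
    # Steps (1)-(2): keep Tweets matching the context and count keywords
    counts = Counter()
    for t in tweets:
        t = t.lower()
        if "iphone" in t and context in t:
            counts.update(w for w in re.findall(r"[a-z]+", t)
                          if w not in {"iphone", context, "the", "is", "my"})
    keywords = [w for w, _ in counts.most_common(top_words)]
    # Step (3): combine only the most frequent keywords so that the
    # number of queries stays small enough to evaluate in step (4)
    return [("iphone", context) + combo
            for n in range(1, max_terms + 1)
            for combo in combinations(keywords, n)]

tweets = ["my iphone charger cable is way too short",
          "iphone charger cable too short again"]
for query in candidate_queries(tweets)[:5]:
    print(" ".join(query))
```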

Steps (3) and (4) shown above are of course the key to success. In our example about the IPhone cable being too short, results were returned because the submitted combination of keywords makes sense. Trying out IPhone, cable, snow tells us that such a keyword combination is not a valid one and -hence- not an "intelligent" keyword subset: