Wednesday, 26 May 2010

Concept Trending: A Glimpse into the Future?

In the previous post some ideas were presented on the trends of Text Analytics. Analyzing and extracting knowledge from text is hard, whether this involves Sentiment Analysis, Text Classification, Cluster Analysis or Information Extraction.

A particularly interesting application of Text Analytics is the identification of trends for specific concepts. In contrast with simple keyword trending, this type of trending attempts to disambiguate keywords according to their context and uses co-reference resolution to identify the subjects to which the sentiment relates.

To better understand concept trending, let's look at an example. Suppose that one wishes to identify the trend of negative characterizations (and even swear words) on the Greek web. The first step would be to collect text from various blogs and forums whenever a negative keyword is found. A Text Analysis toolkit could then provide the means of identifying the subject(s) of these negative characterizations, such as Politicians, the Economy or the International Monetary Fund, which recently came to the rescue.
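The example above can be sketched in a few lines of Python. This is only a toy illustration: the lexicons `NEGATIVE_TERMS` and `SUBJECT_ALIASES` are hypothetical (and in English rather than Greek), and plain sentence-level co-occurrence stands in for the disambiguation and co-reference resolution that a real toolkit would perform.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy lexicons; a real system would need far larger resources.
NEGATIVE_TERMS = {"corrupt", "useless", "disaster"}
SUBJECT_ALIASES = {
    "imf": "IMF",
    "international monetary fund": "IMF",
    "politicians": "Politicians",
    "economy": "Economy",
}

def concept_trend(posts):
    """Count negative mentions per subject and month.

    posts: iterable of (month, text) pairs, with month given as 'YYYY-MM'.
    A subject is counted once for every sentence that contains both one of
    its aliases and at least one negative term.
    """
    trend = defaultdict(Counter)
    for month, text in posts:
        for sentence in re.split(r"[.!?]+", text.lower()):
            words = set(re.findall(r"[a-z]+", sentence))
            if not words & NEGATIVE_TERMS:
                continue  # no negative keyword in this sentence
            subjects = {
                concept
                for alias, concept in SUBJECT_ALIASES.items()
                if re.search(r"\b" + re.escape(alias) + r"\b", sentence)
            }
            for concept in subjects:
                trend[month][concept] += 1
    return {month: dict(counts) for month, counts in trend.items()}

posts = [
    ("2009-12", "The politicians are corrupt. The weather is nice."),
    ("2010-05", "The IMF plan is a disaster! The International Monetary "
                "Fund is useless. The economy is fine."),
]
trend = concept_trend(posts)
```

Note that both "IMF" and "International Monetary Fund" are counted under the same concept, which is the difference between concept trending and plain keyword trending.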

From a post dated December 28th, 2009 :

"Over the past month there has been a considerable amount of increase in negative economy sentiment, crime-related incidents and/or terms that communicate future social instability and uneasiness."

Although this was deliberately not stated, the country the post addressed was Greece, and the increase in negative sentiment was found to start at the beginning of December 2009. This is a photo of a Greek newspaper taken on February 4, 2010.





The headline shown speaks of the "Fear of Social Explosion". On May 6th, 2010, after clashes in the center of Athens, mentions of "Social Explosion" in Greece started appearing on the Web. The following Google search uses a timeline for "Social Unrest". The increase in mentions appears to start in February 2010.



Although concept trending has significant challenges, it is a process which in my experience has proven itself many times. A recent article in New Scientist suggests that by capturing the sentiment of the crowds we are able to predict the moves of the S&P 500, and that by looking at keyword searches such as "job search engine" we can predict coming changes in the US unemployment rate.

Monday, 17 May 2010

The future and trends of Text Analytics

I recently attended a GATE seminar at the University of Sheffield. Having used GATE for quite some time now, I was happy to see that the GATE team is committed to developing the GATE Text Analysis Workbench by constantly adding more functionality.

Although many of the participants were PhD students, I was also happy to see people from companies that now wish to leverage the hidden knowledge that exists in unstructured text. Whether it is analysis of patent text, intelligent search over photo captions for a large news agency or understanding what a customer wants, Text Analytics is becoming an important tool for making better decisions.

I also had the opportunity to speak with several people about the future of Text Analytics. What are we likely to see happening in the coming years in Information Extraction and Text Analytics?



First we have to understand how Text Analytics delivers results. In order for a computer to 'understand' unstructured text, it has to be 'taught' that the 'Dollar' is the currency of a country called 'US', and also that US, United States, USA and U.S.A are the same concept. This means that hundreds of thousands of concepts and synonyms have to be specified so that a computer can identify them in unstructured text. This process is called Text Annotation.
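A minimal sketch of this kind of dictionary-based annotation is shown below, in plain Python. It does not use GATE's API. The `GAZETTEER` contents and the concept labels (`Country:US`, `Currency:USD`) are assumptions made up for the illustration, and matching is case-sensitive so that the pronoun "us" is not mistaken for the country.

```python
import re

# Hypothetical gazetteer: concept label -> surface forms (synonyms).
GAZETTEER = {
    "Country:US": ["US", "USA", "U.S.A", "United States"],
    "Currency:USD": ["Dollar", "dollar"],
}

def build_matcher(gazetteer):
    """Return a compiled pattern and a surface-form -> concept lookup."""
    lookup = {}
    for concept, forms in gazetteer.items():
        for form in forms:
            lookup[form] = concept
    # Try longer forms first so "USA" is not cut short to "US".
    forms = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(form) for form in forms)
    # The lookarounds stop a match from starting or ending inside a word.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return pattern, lookup

def annotate(text, gazetteer=GAZETTEER):
    """Return (start, end, surface form, concept) for every match."""
    pattern, lookup = build_matcher(gazetteer)
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0)])
        for m in pattern.finditer(text)
    ]

text = ("The Dollar is the currency of the US, also written USA, "
        "U.S.A or United States.")
annotations = annotate(text)
```

Every variant of the country name ends up with the same concept label, which is what allows later analysis to treat them as one entity.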

The gold standard of Text Annotation is annotation done by humans: a computer sifts through the text of a web page and annotates it with concepts, and these annotations are then checked against annotations made by humans on the same text, to assess how accurately the computer 'understands' the text and the concepts and entities that exist in it.
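The comparison against human annotations can be sketched as follows. It assumes that each annotation is a (start, end, type) triple and that only exact matches count as correct; real evaluation tools often also give credit for partial overlaps. The offsets in the sample data are invented for the example.

```python
def evaluate(system, gold):
    """Precision, recall and F1 of system annotations against gold ones.

    Both arguments are collections of (start, end, type) triples.
    Only exact matches are counted as correct.
    """
    system, gold = set(system), set(gold)
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented sample data: four human annotations, three machine annotations.
gold = {(0, 6, "Currency"), (30, 32, "Country"),
        (40, 43, "Country"), (50, 63, "Country")}
system = {(0, 6, "Currency"), (30, 32, "Country"), (70, 75, "Country")}

precision, recall, f1 = evaluate(system, gold)
```

Here two of the three machine annotations are correct (precision 2/3) and two of the four human annotations were found (recall 1/2).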

So what does the future hold? First of all, as more unstructured text becomes available there will be a greater need for 'annotation farms': groups of people who manually annotate free text, identifying an ever-growing number of companies, managers, politicians' names, or anything else that has to be 'taught' to a computer. Note that annotation farms already exist, but the need for this service will become greater.

The second trend in Text Analytics could be something equivalent to what we have seen happening with the Netflix Prize. Suppose that you own a company that produces Brand 'X' and you wish to track the reputation of your product online. You would then submit a sample of your product's mentions to various companies that analyze text and have them compete against each other in terms of, for example, Precision and Recall. The one that consistently produces the best metrics (whether Precision and Recall, the Kappa statistic or F-Measure) will get the job.
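As an illustration of how such a competition could be scored, here is a small sketch that ranks vendors by Cohen's kappa, which measures agreement with the human labels beyond what chance would produce. The vendor names and the sentiment labels are hypothetical, and kappa is only one of the metrics mentioned above.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences of equal length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label at random.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1:
        return 1.0  # both sequences use one and the same label throughout
    return (observed - expected) / (1 - expected)

def rank_vendors(gold, submissions):
    """Return (vendor, kappa) pairs, best agreement with gold first."""
    scores = {name: cohen_kappa(gold, labels)
              for name, labels in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample: human labels for six mentions of Brand 'X'.
gold = ["pos", "neg", "neg", "pos", "neu", "neg"]
submissions = {
    "vendor_a": ["pos", "neg", "neg", "pos", "neu", "neg"],
    "vendor_b": ["pos", "neg", "pos", "pos", "neu", "pos"],
}
ranking = rank_vendors(gold, submissions)
```

In practice one would repeat this over several samples, since the point is to find the vendor whose metrics are consistently the best.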

A third trend could be the development of text analytics for specific concepts. Sentiment Analysis and Named Entity Recognition are hard work if one wants to produce sound and accurate results. So it is quite possible that Text Analytics experts will choose a specific concept, for example the reputation of banks, and then concentrate on the analysis of this very specific concept in order to achieve better metrics.