Thursday, 18 December 2008

Personalizing your RSS Feeds


78.8%. That is the estimated accuracy with which an algorithm can predict what kind of information I like to read, and thus which news items are more interesting to me.

The system proves itself every day and it actually helps a lot, because it can instantly spot the information I like among hundreds of RSS news headers. Such a service is much more than simple keyword matching: it takes the combination of words into account, so it can differentiate news items (in terms of how interesting they are) even when those items are about similar concepts.

Personalization of RSS feeds is a well-known application of text classification. The unit of information -the header- is almost always 2-3 sentences long, which makes it ideal input for a classifier. The software I built is quite simple: first, I have a list of about 10 RSS sources - Financial, Medical, International News, Tech News etc. The application scans the RSS feeds every 20 minutes and each new header is appended to a text file on my hard disk.
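The scan-and-append step can be sketched with the standard library alone. A minimal version, assuming the feed has already been fetched as an XML string (the scheduling every 20 minutes and the actual HTTP fetch are omitted; the file name is illustrative):

```python
# Sketch: parse an RSS document, skip headers we have already seen,
# and append the new ones to a text file on disk.
import xml.etree.ElementTree as ET

def new_headers(rss_xml, seen):
    """Return item titles from an RSS document that are not in `seen`."""
    root = ET.fromstring(rss_xml)
    titles = [item.findtext("title") for item in root.iter("item")]
    return [t for t in titles if t and t not in seen]

def append_headers(rss_xml, seen, path="headers.txt"):
    """Append unseen headers to `path` and record them as seen."""
    fresh = new_headers(rss_xml, seen)
    with open(path, "a", encoding="utf-8") as f:
        for title in fresh:
            f.write(title + "\n")
    seen.update(fresh)
    return fresh
```

Keeping a `seen` set (persisted between runs) is what prevents the same header from being appended twice when the feed is polled repeatedly.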

When I ran the application for the first time, I simply saved all headers to the hard disk and built my first text classifier. But that was back then. Today the classifier scans the RSS feeds and automatically appends each RSS header to either the "Interesting" text file or the "Uninteresting" text file...and it does so correctly most of the time.

When I have some spare time, I look through the classified headers and correct the errors my classifier made by moving the misclassified headers to the right place. I then re-train the classifier and everything is ready for the next run.
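The retrain step needs nothing more than a word-count classifier rebuilt from the two text files. The post does not say which algorithm is used, so as an illustration here is a minimal multinomial Naive Bayes with Laplace smoothing, a standard choice for short-text classification:

```python
# A minimal multinomial Naive Bayes over word counts. The two-class
# setup ("interesting" vs "uninteresting") mirrors the two text files;
# the sample headers in the test are made up.
import math
from collections import Counter

class NaiveBayes:
    def fit(self, texts, labels):
        self.classes = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict(self, text):
        def score(c):
            # log prior + smoothed log likelihood of each word
            total = sum(self.counts[c].values()) + len(self.vocab)
            s = math.log(self.prior[c])
            for w in text.lower().split():
                s += math.log((self.counts[c][w] + 1) / total)
            return s
        return max(self.classes, key=score)
```

Retraining then amounts to re-reading the corrected "Interesting" and "Uninteresting" files and calling `fit` again.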


Also interesting is the fact that the more I use the classifier, the less frequently it needs re-training. During the first week of usage I had to produce a model almost every day...but not any more.

So RSS personalization is an application to look at. First of all, it saves a lot of time. Second, really useful applications can emerge. For example, consider an investor who wishes to know whether something significant enough appears in the news that might affect the markets for better or for worse. Note that in a previous post I described how I flag every news header as important or unimportant. If a classifier can differentiate accurately enough what is important, then an investor can receive e-mail alerts -or even SMS messages if he is not online- about the event. Perhaps the message might even include how much a stock or an index is likely to be affected by the breaking news.

There is also the personalization that results from collaborative filtering. However, I believe that a "personal news classifier" -if I may call it that- does a much better job in terms of predictive accuracy after sufficient training time.



Monday, 08 December 2008

When Telecom customers complain-Pt. 2

In the previous post I explained the first steps in deploying Information Extraction, Text Mining and Computational Linguistics to capture the essence of Telecom customers' complaints.

We have already discussed the big picture: retrieve data (essentially user messages from forums) and then use Information Extraction to transform unstructured information into a structured form. This transformation is done by building a set of matching rules for specific phrases or keywords such as

-signal
-antenna
-customer care

and words of sentiment such as

-worse
-worst
-better
-best
-outraged

among many others.
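A minimal sketch of such matching rules, pairing the telecom keywords above with the sentiment words whenever both occur in the same message (real rule engines are far richer; this only shows the idea):

```python
# Match each telecom keyword against each sentiment word in a message.
# The keyword and sentiment lists are the ones given in the post.
import re

KEYWORDS = ["signal", "antenna", "customer care"]
SENTIMENT = ["worse", "worst", "better", "best", "outraged"]

def match_rules(message):
    """Return (keyword, sentiment_word) pairs found in the message."""
    text = message.lower()
    hits = []
    for kw in KEYWORDS:
        for sw in SENTIMENT:
            # both the keyword and the whole sentiment word must appear
            if re.search(re.escape(kw), text) and re.search(r"\b%s\b" % sw, text):
                hits.append((kw, sw))
    return hits
```

Note the word boundaries around the sentiment words, so that "worst" does not also fire the "worse" rule.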


So here comes the interesting part: suppose that a telecom company has in its possession an application that is able to search and extract sentiment from unstructured information. Having such a tool means that:

  • A user can query user forums directly for -for example- specific network problems and break those problems down by area name.
  • A user can directly query for hot phrases such as "canceling my subscription" and cluster keywords around those messages. If the telecom company is also running (and it most likely is) churn prediction models, then analysts have yet another source with which to cross-check and/or enhance the conclusions of their churn models.
  • Special matching rules can be applied to extract why users prefer company XYZ over company ABC.
  • This technology can be applied to e-mails and/or free-text complaints sent to the customer care center, which means that analysts can further enhance their churn models with additional data.
  • Matching rules can be built that associate keywords with Telecom companies in terms of their co-occurrence. So telecom company XYZ has the phrase "good signal" associated with its brand, whilst company ABC has "bargain" as its associated keyword.
  • Match billing plan keywords and then cluster them with sentiment keywords. In other words, how do customers perceive the new billing plan and what is the sentiment about it?
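The brand-association bullets above boil down to co-occurrence counting: tally how often a brand name and a phrase appear in the same message, then report the strongest phrase per brand. A sketch, with made-up brand names and messages:

```python
# Count brand/phrase co-occurrences across messages and return the
# most frequently associated phrase for each brand.
from collections import Counter
from itertools import product

def associate(messages, brands, phrases):
    co = {b: Counter() for b in brands}
    for msg in messages:
        text = msg.lower()
        for b, p in product(brands, phrases):
            if b.lower() in text and p in text:
                co[b][p] += 1
    # most common phrase per brand (None if the brand never co-occurs)
    return {b: (co[b].most_common(1)[0][0] if co[b] else None) for b in brands}
```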


It is easy to see that Information Extraction combined with Text Mining and linguistics is a powerful combination that can extract many "knowledge nuggets". The fact that such an application cannot be 100% accurate may raise acceptance problems, but it is well worth the effort in the end if the potential problems are clearly presented before the application is implemented.

Let us not forget that a complaint given by a customer to the customer center remains there - within the boundaries of the company. A complaint posted on a forum can be seen by hundreds of thousands of others (and it will most likely stay there for a long time), influencing potential and existing customers negatively.

A Sentiment analysis application may be also used for :

  • Banking
  • Pharmaceuticals
  • Insurance
  • Consumer Products (Customer Reviews)
and of course for capturing the sentiment of citizens for politicians (...)


Sunday, 07 December 2008

When Telecom customers complain

Probably one of the best uses of Information Extraction, Text Mining and Computational Linguistics combined is their ability to show us the sentiment of customers. Today we are going to see an example of capturing the sentiment of Telecom customers.

When a customer writes his/her opinion on a forum, a wealth of information is generated, most importantly because the customer uses words and phrases that cannot be elicited in a controlled study. These words, phrases and expressions are far more emotionally powerful than a Likert-scale answer of the "Totally disagree / agree" type.

So let us see the steps required:


First Step: The first thing, of course, is to actually find the data. User forums where people talk about mobile phones and mobile companies are obviously the place to look, and there are lots of them. Sometimes the volume of messages on a topic is low, but usually the available information is more than enough. Special code can be written to extract the text of posts without losing the context of the posting. For example, the fact that a post has generated 20 replies is valuable information: the more replies a post gets, the more sentiment it carries, and this has to be taken into consideration.
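The reply-count signal mentioned above can be folded in as a weight when aggregating sentiment scores across posts. The weighting scheme below is just one plausible choice, shown for illustration:

```python
# Aggregate per-post sentiment scores, weighting each post by its
# number of replies so that heavily discussed posts count for more.
def weighted_sentiment(posts):
    """posts: list of (sentiment_score, reply_count) tuples.

    Returns the reply-weighted average sentiment (0.0 if empty)."""
    num = sum(score * (1 + replies) for score, replies in posts)
    den = sum(1 + replies for _, replies in posts)
    return num / den if den else 0.0
```

The `1 + replies` term keeps posts with zero replies from vanishing entirely; a log-scaled weight would be an equally reasonable alternative.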

Second Step: Deploy information extraction techniques to identify phrases of good or bad sentiment (and actually many other things) about Telecom keywords such as:

- Signal
- Customer Care
- Billing

....etc
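One simple way to implement this step is to look for a sentiment word within a few tokens of a telecom keyword, so that praise or complaints are tied to the right feature rather than to the post as a whole. The word lists and window size below are illustrative:

```python
# Pair each telecom keyword with any sentiment word found within a
# small token window around it.
GOOD = {"flawless", "excellent", "great"}
BAD = {"terrible", "dropped", "worst"}

def extract(text, keywords=frozenset({"signal", "billing", "care"}), window=3):
    """Return (keyword, sentiment_word, polarity) triples."""
    tokens = text.lower().split()
    found = []
    for i, tok in enumerate(tokens):
        if tok in keywords:
            nearby = tokens[max(0, i - window):i + window + 1]
            for w in nearby:
                if w in GOOD:
                    found.append((tok, w, "positive"))
                elif w in BAD:
                    found.append((tok, w, "negative"))
    return found
```

The window keeps "flawless signal" attached to the signal keyword while ignoring unrelated praise elsewhere in a long post.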

The following screen capture shows an example. It is in Greek, but I will provide all the necessary explanation - please also note that this is a simplified version of the process:





Notice that on the right-hand side there are some bars that denote the types of keywords found. The first category is called "Characterization"; if it is checked (as it is in the above screen capture), the software will highlight posts that contain some kind of characterization, whether good or bad. Notice also the yellow bar named "Network". Because it is checked, words that are synonyms of "Network" are highlighted, and indeed this is the case because

Signal = σήμα (in Greek) and
Flawless = άψογο

so the highlighted phrase άψογο σήμα means "flawless signal", which is a good characterization of the signal of two particular telecom companies. Notice also the line under the "Features" tab which says that between positions 3425 and 3429 there is a mention of signal ("mentionsSignal = true").
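Annotations of the "mentionsSignal = true between positions X and Y" kind can be reproduced with plain span matching over a synonym list. The feature name mirrors the one in the screen capture; the synonym list is assumed:

```python
# Emit (feature, start, end, True) annotations for every occurrence of
# a synonym in the text, using inclusive character positions.
import re

SYNONYMS = {"mentionsSignal": ["signal", "σήμα"]}

def annotate(text):
    spans = []
    lowered = text.lower()
    for feature, words in SYNONYMS.items():
        for w in words:
            for m in re.finditer(re.escape(w), lowered):
                spans.append((feature, m.start(), m.end() - 1, True))
    return spans
```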

Again, I have to point out that this is a simplified version of the process. Text Mining and Information Extraction are actually very hard work, but they are also very rewarding for those who ultimately deploy and use them. In the next post we will see the problems (and there are many of them), but also how this unstructured information is turned into "nuggets of gold".


Wednesday, 03 December 2008

Analytics and the Financial Markets

In previous posts, I explained ways to analyze the financial markets using data mining and text mining techniques. I also went through some potential pitfalls and perils of this type of analysis.

By combining different data sources (worldwide indices, moving averages, oscillators, clustering or categorization of financial news) an investor could make better decisions on where and when to invest. The goal of such an analysis is to make predictions that are sufficiently better than mere chance.

Some days ago I came across a website called Inner8. Inner8 is a really interesting idea: collaborative filtering of stock picking. Combine this with analytics and an investor has yet another tool in his arsenal. Imagine thousands of Inner8 subscribers making stock predictions and sharing their ideas, insights and sentiment about the stock market. After a few months some users will be "prediction super stars" by mere chance, so one has to proceed with caution. Nevertheless, it is a website to keep an eye on, especially if the subscriber volume increases significantly.

So let us go back to our problem: we have to think of a good way to combine the information in our possession (aka problem representation) and feed this data into one or more algorithms, with the goal of obtaining models of high predictive value.

Some of the things to consider :

1) Should the "sliding window" technique be used? Could the repetition of training data (sliding-window training inevitably repeats data) affect the predictive power of the model?

2) How many variables? Which are good predictors?

3) Do we care only about the predictive power of the model? What about interpreting why a stock behaves as it does?

4) How can we represent the "additive effect" of two straight days of bad market news if a sliding window is not used?

5) Prediction goal: are we after price prediction (regression) or price limits (classification)?
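Question (1) and the additive effect in (4) both come down to how training rows are built from the series. A sliding window of width w turns each run of w consecutive days into the features and the following day into the target, which is also where the data repetition comes from, since adjacent rows share w-1 values:

```python
# Build (features, target) training rows from a time series using a
# sliding window. The example series in the test is made up.
def sliding_window(series, window):
    """Each row pairs `window` consecutive values with the next value."""
    rows = []
    for i in range(len(series) - window):
        rows.append((series[i:i + window], series[i + window]))
    return rows
```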

Unfortunately the list does not end here. Since I am after predictions of stock prices on the Greek Stock Exchange, the data should be presented to the learning algorithm in a coherent way. European markets are affected by the closing of the US and Asian markets, and the US markets open during Greek trading hours (approx. 45 mins before the end of trading, at 16:30 EET), a fact that should also be taken into account.

I am sure there are many users out there who have read a couple of data mining books, downloaded an open-source data mining tool, fed some data in, and expect to see results. My only advice to them, without the slightest hint of criticism: paper-trade first...