Tuesday, 25 November 2008

Predicting popular stories on Digg

In its latest news, KDnuggets mentions a paper from HP Labs that outlines a process for analyzing and predicting the popularity of a Digg story or a YouTube video submission.

No question that this is interesting material. In my post dated October 16th, 2007, I presented an analysis of which keywords seem to play a part in a post becoming popular on Digg. You can find all 3 parts of the post here.

So I made a new run to collect stories from Digg, and this is an example of what I came up with (please note: for illustrative purposes only):

[image]
The paper from HP Labs takes a different route and makes its predictions based on the popularity of a submitted story in the first few hours, rather than after some days. The authors also conclude that after a Digg story is out, users tend to vote for it heavily in the beginning, but once a certain threshold time has passed, the rate at which the story is dugg fades away. In contrast, videos submitted to YouTube continue to be viewed at a roughly linear rate after submission.

It is true that there is an inherent seasonality to the news and to the way users 'digg' stories. It is also interesting to look at buzzwords that seem to keep repeating (in terms of how interesting they are, or are not) over time.

Between the previous runs I have made and the current one, I have seen some repeating patterns. One of these patterns shows Microsoft on a declining trend in terms of how interesting a subject it appears to be for Digg users. Here is what Google Trends shows for the term 'Microsoft':

[image]
Could such a trend be a glimpse of Microsoft's 'future' somehow?

I have already built a text classifier which accepts phrases and returns the probability of a phrase becoming highly 'dugg' based on the keywords it contains. More on this in a future post.
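The classifier itself will be covered in a future post; in the meantime, here is a minimal sketch of how such a keyword-based probability estimate could work. It is a naive Bayes over word presence; the training phrases and labels are made up for illustration and are not the actual model.

```python
from collections import defaultdict
import math

# Toy training data: (phrase, was_popular) pairs -- illustrative only.
training = [
    ("google releases new search feature", True),
    ("apple iphone review", True),
    ("local council meeting minutes", False),
    ("obscure enterprise middleware update", False),
]

def train(data):
    """Count word occurrences per class and class totals."""
    counts = {True: defaultdict(int), False: defaultdict(int)}
    totals = {True: 0, False: 0}
    for phrase, label in data:
        totals[label] += 1
        for word in set(phrase.lower().split()):
            counts[label][word] += 1
    return counts, totals

def popularity_probability(phrase, counts, totals):
    """Return P(popular | words) via naive Bayes with add-one smoothing."""
    vocab = set(counts[True]) | set(counts[False])
    scores = {}
    for label in (True, False):
        # log prior
        score = math.log(totals[label] / sum(totals.values()))
        for word in set(phrase.lower().split()):
            # add-one smoothed likelihood
            score += math.log((counts[label][word] + 1) / (totals[label] + len(vocab)))
        scores[label] = score
    # convert log scores back to a probability
    m = max(scores.values())
    exp = {k: math.exp(v - m) for k, v in scores.items()}
    return exp[True] / (exp[True] + exp[False])

counts, totals = train(training)
print(popularity_probability("new google feature", counts, totals))
```

With a real training file, the phrases and labels would of course come from collected Digg stories rather than a hand-written list.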


Friday, 21 November 2008

Reality Mining vs Life Analytics

We will be taking a short break from making predictions for the financial markets, because I just came across a novel term (at least to me) for an analytics application: it is called Reality Mining.

In short, Reality Mining is about using smart phones to record a user's interactions with the device and with other cell phone users. By analyzing this data, patterns of behavior can be extracted that can potentially be interesting to social researchers. For more information on Reality Mining, see here.

What made me start this blog was the following question: if there were a way to capture, and thus record, what a person feels, thinks and otherwise experiences every day, what kind of patterns might emerge from analyzing this information? I think this is one step (or even several steps) beyond Reality Mining. What would happen if this kind of information were recorded for a vast number of people and then analyzed? What if we could predict how the thoughts we have and the things we experience might affect our lives, and the decisions we make, later on? This is full-blown 'Life Mining'.

I get e-mails from readers asking how this life analytics project is going; there will be some posts on this subject very soon.




Sunday, 16 November 2008

Text Mining on Financial News

As discussed previously, an analyst should pay specific attention to problem representation, particularly when dealing with text data. One way to do this is discussed below; however, something has to give, and there is no perfect solution for such a task.

First of all, we have to find a source for the news: it could be financial news sites such as Bloomberg and the Financial Times, or RSS feed URLs such as those provided by MarketWatch. RSS feeds might be the better option because they already come with a predetermined categorization of the news according to feed type, and this can be a great help to some analysts.

After finding the news sources and writing the necessary code to fetch the actual information, we could end up with a text file like the following:

[image]
You can see that I use a '^' separator to differentiate between:

1) A date stamp
2) A date string
3) The news string
4) A characterization of the news (important or unimportant)
5) A categorization of the financial news


This simple file could provide the basis for a training set for text categorization. Assuming that we have trained algorithms to classify news automatically, we could use one classifier to first categorize news as important or unimportant, and pass only the important news to a second classifier, which would perform the detailed classification.
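As a rough sketch of how such a file could be read, here is a small Python snippet that parses the '^'-separated layout into records and applies the first-stage importance filter. The field names and the sample line are hypothetical; the real file's exact contents are not shown in the post.

```python
# Parse one line of the '^'-separated news file into a record,
# following the five-field layout described above.
FIELDS = ["date_stamp", "date_string", "news", "importance", "category"]

def parse_line(line):
    values = line.rstrip("\n").split("^")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))

# Hypothetical sample line for illustration.
sample = "20081116^Sun 16 Nov 2008^Oil prices hit another record high^important^commodities"
record = parse_line(sample)

# Only records flagged 'important' would be passed on to the second,
# detailed classifier.
if record["importance"] == "important":
    print(record["category"], "-", record["news"])
```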

Another option is to use clustering: you can imagine that the solution detailed above involves a tremendous amount of work, depending on how much data you are planning to collect. Too much data means too much work, while less data usually (but not always) means less accuracy.

But how could clustering be performed on such data? Simply, we use field number (4) of our training text file to train a clustering algorithm and then see what 'classes' the algorithm comes up with.


So let's look at a small example of clustering. This is a capture from WEKA just before the clustering process:

[image]
I produced a training file which essentially contains the 'buzzwords' of financial news: barrel, recession, Yen, Euro, ECB, price, consumer, etc. The file is then analyzed by the K-means algorithm to extract clusters of the same 'buzzwords'. Each cluster is assigned a number, so each news header ultimately falls into one cluster number.
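The clustering above was done in WEKA; as a rough, self-contained Python sketch of the same idea, the snippet below turns buzzword lists into binary term-presence vectors and clusters them with a plain k-means. The headers are invented stand-ins for the real data, and the tiny hand-rolled k-means is only a stand-in for WEKA's implementation.

```python
import random

# Toy news headers reduced to buzzword tokens (made-up data, mirroring
# the buzzword setup described above).
headers = [
    ["fear", "decrease", "us", "economy", "futures"],
    ["fear", "decrease", "us", "price", "oil", "recession"],
    ["euro", "ecb", "rate", "increase"],
    ["ecb", "euro", "rate"],
]

# Build a binary term-presence vector for each header.
vocab = sorted({w for h in headers for w in h})
vectors = [[1.0 if w in h else 0.0 for w in vocab] for h in headers]

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means with squared Euclidean distance; returns a cluster id per point."""
    rnd = random.Random(seed)
    centroids = [list(p) for p in rnd.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iterations):
        # assignment step: nearest centroid per point
        for i, p in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

clusters = kmeans(vectors, k=2)
print(clusters)
```

On this toy data, the two ECB/Euro headers end up in the same cluster, much like the 'fear/decrease' instances falling under cluster '6' below.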


After running the K-means algorithm, I ended up with 16 clusters. Let's look at two instances that K-means decided should fall under cluster '6':


Instance_number : 130.0

Fear
Decrease
US
Economy
Futures

and


Instance_number : 174.0

Fear
Decrease
US
Price
Oil
Banking
Recession


So the first instance is about fears for the US economy resulting in US futures dropping, and the second instance must be, roughly, about a decrease in oil prices and banking stocks because of fears of a US recession. Not bad at all...

But not so fast: clustering presents a lot of problems later in the process. Remember that what we are after is combining text mining and data mining to better understand how the markets react. Should one use classification or clustering? There are many more things to take into consideration, and for obvious reasons I cannot disclose all the details of such a project... but I am hoping to give the interested reader a good enough introduction to the subject.

Friday, 14 November 2008

Capturing the Financial Facts

So far, we have seen the data mining part of analyzing the financial markets and some of the problems that arise during such an analysis: data have to be collected and pre-processed accordingly; there is a danger of over-fitting, and the analyst must make sure that the model(s) created have the expected quality; the analyst also has to choose the relevant attributes with which the analysis will be performed and decide how the algorithms will be trained.

The markets react to financial news; there is no question about that. Of course, there are other factors that make people buy or sell: for example, if a stock price has hit a support or resistance level, then some investors are going to either buy or sell when that price level is reached. Investors are also going to buy or sell when specific technical indicators, such as MACD or oscillators, give signals to do so. And even when bad news is out, markets will, after an unknown number of consecutive drops, go up by an unknown percentage, and vice versa.

People who are involved with Machine Learning know that the representation of the problem at hand is highly important, so first we are going to see how financial news can be represented in a way that is helpful for the analysis.

We have to see what we are dealing with here. To do this, we have to analyze and categorize the financial information as it is created. Financial news can be news about any number of things:

1) The number of jobless claims in the US is higher than last year.
2) Automotive company XYZ's sales dropped by 15%.
3) Oil prices hit yet another record high.
4) The dollar is dropping.


....and the list goes on.


So the first problem arises: should we categorize the information according to its content and present it to the algorithms? We could do that by having a boolean field for each type of news in our training file and setting it to TRUE or FALSE accordingly. Using this method, we could easily reach thousands of input fields, since for the "jobless claims" news type alone we could have the following variants:

-A specific country for the jobless claims report (not only the US; it could be any country)

-Jobless claims could be higher than expected, higher than last year, or the highest in the last decade.


It is easy to see that this gets out of control far too fast. Perhaps a better solution would be to try to create clusters of (more or less) the same news. The idea of clustering the financial news might seem an interesting one: an analyst could define a number of clusters, say 100, and let the clustering process categorize all the news accordingly. But is clustering the solution? More on this in the next post...
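To get a feel for how quickly the boolean-field representation explodes, here is a quick sketch that enumerates the fields needed for the "jobless claims" news type alone, across a handful of countries and comparison variants. Both lists are made up for the example; real news would need far more of each, and one such set per news type.

```python
from itertools import product

# Illustrative variants for a single news type.
news_type = "jobless_claims"
countries = ["US", "UK", "DE", "FR", "JP"]
comparisons = ["higher_than_expected", "higher_than_last_year", "highest_in_decade",
               "lower_than_expected", "lower_than_last_year"]

# One boolean field per (country, comparison) combination.
fields = ["%s_%s_%s" % (news_type, c, v) for c, v in product(countries, comparisons)]
print(len(fields))   # 5 countries x 5 variants = 25 fields for one news type
print(fields[0])
```

Multiply 25 fields by hundreds of news types and the training file quickly reaches the thousands of input fields mentioned above.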

Wednesday, 12 November 2008

Model Testing

Once a model has been created (such as the decision tree in our example), the analyst is required to test it. During model testing, the analyst performs specific tests that show the actual predictive power of the model.

Many methods can be used for model testing, depending on the problem. For our example, and since the available volume of data was sufficiently large, the model training and testing methodology I used was as follows:

1) 50% of the data were used for model training
2) 25% of the data were used for model validation and fine-tuning
3) 25% of the data were used for testing the model

In other words, 75% of the data were used for training the algorithm and for assessing the impact that changes in algorithm parameters have on the accuracy of the model. For a decision tree algorithm (and depending on the type of decision tree used), an analyst might try different settings for the splitting criteria and/or the minimum number of cases per branch, etc.
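The 50/25/25 split described above can be sketched in a few lines of Python, independent of any particular mining tool. The function and seed are illustrative, not the actual code used in the analysis.

```python
import random

def split_50_25_25(records, seed=42):
    """Shuffle and split records into train (50%), validation (25%), test (25%)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n // 2
    valid_end = train_end + n // 4
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]

train, valid, test = split_50_25_25(range(1000))
print(len(train), len(valid), len(test))  # 500 250 250
```

Fixing the shuffle seed keeps the split reproducible, so the same held-out 25% is used to compare all candidate models.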

Unfortunately, an analyst often finds that the accuracy of the model estimated during the training and validation phases (i.e. steps 1 and 2 above) is in no way representative of its performance when the model is tested on unseen cases (step 3).

During my analysis, numerous models showed an estimated accuracy of 85% or more, but when they were presented with actual data, the accuracy dropped to 50-53%, suggesting that overfitting was present. Consequently, using these biased models to predict new cases would have detrimental effects in actual stock trading.

When all models are built, the analyst should choose a model (when there is a requirement to use only one model) according to:

1) (Statistically significant) best accuracy
2) Misclassification costs, if these are not taken into account during the model building process


In the next post, we will see how text mining may help us make better predictions for the markets.