Tuesday, 31 August 2010

Banks, Risk Disclosure and Text Analytics



Christos Gkemitzis, an MSc student at Kingston Business School in the UK, had an idea for his MSc project which immediately caught my attention: apply Text Analytics methods to the annual reports published by banks, extract metrics on how these banks handle their credit and interest rate risk as described in those reports, and then test several hypotheses (for example: do banks with a higher risk profile disclose more risk-related information than those with a lower risk profile?). He also wanted to identify any correlations:
  • between the size of the Bank and the volume of risk disclosures
  • between the Bank's risk profile and the volume of risk disclosures
  • between the profitability of the Bank and the volume of risk disclosures

Essentially, the problem is to automatically identify mentions of credit risk, but in a specific way:

1) Identify whether sentences mentioning risk refer to the present, past or future
2) Identify positive, negative or neutral sentiment in mentions of Credit Risk
3) Identify qualitative versus quantitative information regarding the Bank's Credit Risk
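In GATE this kind of annotation is typically built from gazetteer lists and JAPE grammar rules. As a rough illustration of the idea only (not the actual rules used in the project), a keyword-based sentence classifier might look like the Python sketch below; all cue lists are invented assumptions:

```python
import re

# Illustrative cue lists (assumptions for this sketch, not the project's gazetteers)
FUTURE_CUES = {"will", "expected", "expect", "forecast", "anticipate"}
PAST_CUES = {"was", "were", "led", "had", "reported"}
POSITIVE_CUES = ["efficiently", "improvement", "successfully"]
NEGATIVE_CUES = ["increase of credit risk", "losses", "defaults", "deterioration"]

def classify_sentence(sentence):
    """Assign coarse tense, sentiment and information-type labels to a sentence."""
    lower = sentence.lower()
    words = set(re.findall(r"[a-z]+", lower))

    # 1) Tense: future cues take priority over past cues
    if words & FUTURE_CUES:
        tense = "future"
    elif words & PAST_CUES:
        tense = "past"
    else:
        tense = "present"

    # 2) Sentiment: positive cues checked first (the ordering is an arbitrary choice)
    if any(cue in lower for cue in POSITIVE_CUES):
        sentiment = "positive"
    elif any(cue in lower for cue in NEGATIVE_CUES):
        sentiment = "negative"
    else:
        sentiment = "neutral"

    # 3) A percentage figure marks the mention as quantitative
    info_type = "quantitative" if re.search(r"\d+(\.\d+)?\s*%", sentence) else "qualitative"

    return {"tense": tense, "sentiment": sentiment, "info_type": info_type}
```

A real system needs far richer rules (negation, scope, verb-tense parsing), which is exactly what a toolkit like GATE provides.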


For example, consider the following text, which is part of an actual bank report:

"A substantial increase of credit risk and provisions is also expected, as from 2009 on, the economy will be entering a period of low growth."

The sentence above contains qualitative information ("substantial increase of credit risk and provisions") and negative sentiment referring to the future ("also expected" and "will be entering a period of...").

while the following sentence:

"The Group’s ongoing efforts to manage efficiently credit risk led the level of loan losses to 3.3% in December 2008"

contains quantitative information ("level of loan losses to 3.3%") with a positive sentiment about Credit Risk handling in the past.

After receiving some PDF samples of bank reports from Christos, I began feeding these reports to the GATE Text Analysis toolkit to assess the feasibility of such an analysis. After a few tutorials over Skype, Christos -who had no prior knowledge of programming- started using the toolkit on his own within a very short amount of time. Here is a snapshot of GATE in action during the analysis:






The snapshot shows how GATE correctly identified a part of text that communicates a negative sentiment for Credit Risk in a qualitative manner for the future (notice that "QualitativeBadNewsFuture" is checked).

After running GATE on many documents, Christos had the necessary metrics (i.e., how many mentions of the different risk types exist in each document) to test his hypotheses using a two-tailed Wilcoxon test. The Spearman coefficient was also used to identify correlations.
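In practice a statistics package would handle this step (e.g., `scipy.stats.wilcoxon` and `scipy.stats.spearmanr`). To make the correlation side concrete, here is a minimal pure-Python Spearman coefficient; the bank sizes and mention counts below are invented purely for illustration:

```python
def rank(values):
    """Average 1-based ranks, handling ties by assigning the group mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation computed on the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: bank size (assets, bn) vs. number of risk mentions
sizes = [12, 45, 7, 88, 30]
mentions = [40, 85, 25, 160, 70]
rho = spearman(sizes, mentions)  # here the orderings agree perfectly, so rho = 1.0
```

Because Spearman works on ranks rather than raw values, it captures any monotonic relationship between size and disclosure volume, not only a linear one.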

Since this work has not been submitted yet, the findings of the research cannot be posted here. The post does, however, show another application of Text Analytics and one of the many sources of unstructured information that can be mined for knowledge.

Sunday, 08 August 2010

Interview with Kaggle CEO Anthony Goldbloom


For those who haven't heard of Kaggle before, Kaggle is a team of people who provide the functionality and support to host Data Mining contests. Here is how it works: suppose you work for a Telco and wish to implement a new churn prediction model. Rather than running this project in-house, you submit your data to Kaggle. What happens next is that -hopefully- many statisticians around the world will each analyze your dataset, produce a model, and submit their prediction model(s) to Kaggle. The best model (and hence its creator) wins the prize, which is put up by the Telco. Here is the interview with Kaggle CEO Anthony Goldbloom:




- What is Kaggle and what new ideas does it bring to the predictive analytics arena?

Kaggle aims to help companies and researchers make predictions more precise by providing a platform for data prediction competitions. Competitions turn out to be a great way to get the most out of a dataset. This is because there are infinitely many approaches to any data modeling problem. By opening up a data prediction problem to a wide audience, a competition makes it possible to get to the frontier of what is possible given a dataset's inherent noise and richness.

- Can you tell us more about "real-time science" and how it could help Research globally?

Data modeling competitions can facilitate real-time science. Consider the recent announcement about the discovery of genetic markers that correlate with extreme longevity. Work on the study began in 1995, with results published in 2010. Had the study been run as a data modeling competition, the results would have been generated in real time and insights available much sooner (and with a higher level of precision).

Data modeling competitions also benchmark, in real time, new techniques against old. A technique that performs well in competitions can prove its mettle long before any paper can be published, helping the science to progress more quickly.

Competitions also help to avoid situations where valuable techniques are overlooked by the scientific establishment. This aspect of the case for competitions is neatly illustrated by Ruslan Salakhutdinov, now a postdoctoral fellow at the Massachusetts Institute of Technology, who had a new algorithm rejected by the NIPS conference. According to Ruslan, the reviewer ‘basically said “it’s junk and I am very confident it’s junk”’. It later turned out that his algorithm was good enough to make him an early leader in the Netflix Prize (he called his Netflix Prize team NIPS_reject).

- How can companies benefit through Kaggle?

Companies can use Kaggle to gain an advantage over their competitors. Consider a bank that wants to improve the algorithms that vet loan applicants. If a bank can develop a more effective algorithm they will have fewer defaults and can charge lower interest rates than their competitors. Kaggle has proven to be an effective way to improve existing models very quickly.

Competitions are also really useful to companies that want to develop new products and capabilities. Consider a hedge fund that wants to be able to generate long-range weather forecasts in key agriculture regions. They can attempt to hire a weather forecasting expert or they can use Kaggle to throw the problem open to a wide audience. Using Kaggle they can be sure they'll get great results very quickly.

- How is the best model selected?

The competition host will typically split their dataset into two parts - a training dataset and a test dataset. The training dataset includes all explanatory variables as well as the dependent variable (or the answer). The test dataset also includes all the explanatory variables but the dependent variable (or answer) is withheld.

Participants train their models on the training dataset. They then apply their models to generate predictions on the test dataset. Those predictions are then scored on-the-fly against the actual answers (using one of several evaluation methods). Once the competition deadline passes, the team that generates the most accurate predictions gives the winning methodology to the competition host in exchange for the prize money.
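The host-side mechanics described above can be sketched in a few lines. Everything here is an assumption for illustration - the split function, the toy data, and RMSE as the evaluation method (Kaggle supports several evaluation methods; this is just one common choice):

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Split rows into a public training set and a held-out test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def rmse(predictions, answers):
    """Root mean squared error: one common on-the-fly scoring metric."""
    n = len(answers)
    return (sum((p - a) ** 2 for p, a in zip(predictions, answers)) / n) ** 0.5

# Hypothetical host-side workflow with toy (feature, answer) pairs
rows = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(rows)
public_test = [x for x, _ in test]   # released to participants
withheld = [y for _, y in test]      # answers kept by the host

# A participant trains on `train`, then submits predictions for `public_test`;
# the host scores the submission against the withheld answers.
submission = [2 * x for x in public_test]
score = rmse(submission, withheld)
```

Keeping the test answers withheld is what makes the leaderboard trustworthy: participants can only be scored on data their models have never seen.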