Tuesday, 23 March 2010

Predictive Analytics and Politics - Part 2

In the previous post we saw an example of analyzing messages sent by citizens about a new taxation plan. We identified some correlations between keywords and concepts, but there are more ways to gain knowledge from such unstructured information.

By using Cluster Analysis we can extract groups of similar concepts from thousands of comments written by citizens and also establish an order among those groups. Let's assume that Cluster Analysis reveals the following clusters (groups of similar concepts) within the submitted messages:

- battling tax fraud

- requests for a fair tax plan

- requests for less taxation for large families

- various incentives for citizens

Our problem is finding the order of importance that people place on the concept categories listed above: is battling tax fraud considered more important (that is, discussed more frequently by citizens) than requesting a fair tax plan? How about taxation for large families?

A cluster analysis can reveal the size of each cluster and, as a consequence, how important each cluster is:



We assume that, in the text representation shown above, Cluster 5 (which contains 329 citizen messages) is about requests for a fair tax plan and Cluster 10 contains messages requesting that tax fraud be minimized. It appears that significantly fewer people are concerned with battling fraudulent activity; instead they ask for more immediate benefits through a fair tax plan.
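The size-based ranking described above can be sketched in a few lines of Python. This is only an illustration: the function name `rank_clusters_by_size` and the sample labels are hypothetical, and the clustering step itself (whichever tool produced the cluster assignments) is assumed to have run already.

```python
from collections import Counter


def rank_clusters_by_size(labels):
    """Return (cluster_id, message_count, share) tuples, largest cluster first.

    `labels` holds one cluster id per citizen message, as produced by
    whatever clustering tool was used (an assumption of this sketch).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return [(cid, n, n / total) for cid, n in counts.most_common()]


# Hypothetical cluster assignments, not the data from the post.
example_labels = [5, 5, 5, 10, 5, 10, 3]
ranking = rank_clusters_by_size(example_labels)

for cluster_id, size, share in ranking:
    print(f"Cluster {cluster_id}: {size} messages ({share:.0%})")
```

Treating cluster size as a proxy for importance is the same assumption the post makes: a concept counts as more important when more messages discuss it.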

Collecting and analyzing information found in blogs and forum entries is another area of analysis that could prove very interesting. Let's see an example based on the political, social and economic situation in Greece. The goal is to identify and extract trends and co-occurrences of key concepts from blog titles and forum posts, such as:

- Names of major Political parties
- Names of Politicians
- Economy (words/phrases such as "austerity plan")
- Negative characterizations
- Company Names
...etc

Several applications can emerge from this kind of data. We could track specific concepts through time and see their trends. We can also identify which concepts are discussed together. As an example, we could identify the reasons why Giorgos Papandreou (PM of Greece) is characterized negatively in blog posts, that is, which other concepts are found in blog posts containing the keywords 'Giorgos Papandreou' AND a negative characterization:


(Note: PASOK = the governing political party)

Politics = 120
Economy = 72
Economy, Politics = 40
PASOK = 24
Politics, PASOK, Referendum = 8
Economy, Politics, PASOK, Referendum, Immigrants = 8
Economy, Politics, Society = 8
Society, PASOK = 4


In other words: Giorgos Papandreou is criticized mainly for his political decisions and for the economy, followed by criticism of PASOK. Negative sentiment also exists because some Greek citizens are demanding a referendum on the Greek government's latest decision to grant Greek citizenship to a large proportion of immigrants.
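A tally like the one above could be produced roughly as follows. This is a minimal sketch, assuming each post has already been annotated with a set of concept tags; the function name `cooccurring_concepts`, the tag "Negative" and the sample posts are hypothetical and do not reproduce the real data.

```python
from collections import Counter


def cooccurring_concepts(posts, required):
    """Count the other concept combinations in posts tagged with all `required` concepts.

    `posts` is an iterable of concept-tag sets, one per blog post.
    Posts missing any required concept are ignored.
    """
    required = frozenset(required)
    counts = Counter()
    for concepts in posts:
        concepts = frozenset(concepts)
        if required <= concepts:
            others = concepts - required
            if others:
                # Sort so that the same combination always maps to the same key.
                counts[tuple(sorted(others))] += 1
    return counts


# Hypothetical annotated posts.
posts = [
    {"Giorgos Papandreou", "Negative", "Politics"},
    {"Giorgos Papandreou", "Negative", "Economy", "Politics"},
    {"Giorgos Papandreou", "Negative", "Politics"},
    {"Giorgos Papandreou", "Economy"},  # no negative characterization: skipped
]
result = cooccurring_concepts(posts, {"Giorgos Papandreou", "Negative"})

for combo, n in result.most_common():
    print(", ".join(combo), "=", n)
```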

Friday, 12 March 2010

Predictive Analytics and Politics - Part 1

One of the most interesting applications of Data/Text Mining and Information Extraction is Politics. I started collecting information from various blogs, websites and forums and applying Information Extraction and Data/Text Mining techniques to extract potentially useful knowledge in this area. By combining different pieces of information one could come up with trends that may tell us what lies ahead of us.

The latest developments in Greece are more or less known to most people who read international news. The situation is difficult, and the voice of citizens in various blogs and forums could show us the sentiment of Greek Web users. For example:

- Which are the most frequently occurring words?

- Which are the most frequently occurring thoughts?

- What are the things that have to be changed by Greek politicians?

To answer these questions I have started collecting information from the top 120 Greek blogs, the OpenGov website (a state-run website where Greek citizens express their opinions) and a couple more Greek sites covering the economy. For blogs and forums, a Java program scans for new information every 20 minutes:

This information is then sent to an annotation engine, which analyzes the textual content. Once the text is analyzed we can, for example, produce a keyword vector that we can later use to understand what citizens are saying on the Web. We can then answer many interesting questions, such as:

- With which words is Mr George Papandreou (PM of Greece) associated?

- When there are some very negative words (such as swearing), what other words are found in the same text?

- What does keyword trending tell us? (For example, we identify an increasing number of swear words in citizen posts)
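To make the idea of a keyword vector concrete, here is a minimal Python sketch. It is not the annotation engine used in these posts: the function name `keyword_vector`, the plain regex tokenizer and the stopword set are assumptions made for illustration only.

```python
import re
from collections import Counter


def keyword_vector(text, stopwords=frozenset()):
    """Map each keyword in `text` to the number of times it occurs.

    Tokenization is a simple lowercase split on word characters; a real
    annotation engine would also handle stemming, phrases and entities.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


# Hypothetical citizen comment.
vec = keyword_vector("Tax fraud hurts everyone; stop tax fraud now", {"now"})
print(vec.most_common(2))
```

Comparing such vectors across posts, or across time windows, is one simple way to approach the association and trending questions listed above.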


First let's see some examples from the OpenGov website, where thousands of citizens have expressed their opinions on the tax policy of the Greek state. The following chart shows a number of pairwise correlations between words used in these comments:



Under the red rectangle appear two words (dikigoros, iatros), which in Greek mean "Lawyer" and "Medical Doctor" respectively. This essentially tells us that these two professions are frequently mentioned together in citizen discussions. Looking closely at these messages reveals that professionals in these two sectors are said to avoid taxes by not issuing receipts.
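The post does not say which correlation measure produced the chart. As one possible illustration, the sketch below computes the phi coefficient (Pearson correlation on presence/absence indicators) for a pair of words; the choice of measure, the function name `phi_coefficient` and the sample comments are all assumptions.

```python
from math import sqrt


def phi_coefficient(docs, word_a, word_b):
    """Phi coefficient between the presence of two words across documents.

    `docs` is a list of token sets, one per comment. Returns 0.0 when the
    coefficient is undefined (a word appears in all or none of the documents).
    """
    n11 = n10 = n01 = n00 = 0
    for tokens in docs:
        has_a, has_b = word_a in tokens, word_b in tokens
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


# Hypothetical tokenized comments.
docs = [
    {"dikigoros", "iatros"},
    {"dikigoros", "iatros", "foros"},
    {"foros"},
    {"misthos"},
]
phi = phi_coefficient(docs, "dikigoros", "iatros")
print(round(phi, 2))
```

A value near 1 means the two words tend to appear in the same comments, a value near 0 means their occurrences are unrelated, and a negative value means they tend to avoid each other.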

Next we could use association rule learning to look for some more, potentially interesting, rules:


The highlighted rule, although it has low support, could prove interesting: a subset of citizens is requesting that freelancers and the self-employed be monitored more closely for tax fraud.

Apart from rule learning, it is interesting to identify the proportion of the total dataset for which each rule holds. That also gives us a sense of how different ideas and thoughts rank in the minds of citizens.
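The proportion of the dataset for which a rule holds is usually called its support, and the fraction of matching antecedents for which the consequent also appears is its confidence. A minimal sketch of both measures follows; the function name `rule_support_confidence` and the sample messages are hypothetical and not taken from the OpenGov data.

```python
def rule_support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    `transactions` is a list of keyword sets, one per citizen message.
    support    = share of all messages containing antecedent and consequent
    confidence = share of antecedent messages that also contain the consequent
    """
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n if n else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Hypothetical keyword sets extracted from messages.
transactions = [
    {"freelancer", "monitoring"},
    {"freelancer", "receipts"},
    {"pension"},
    {"freelancer", "monitoring", "fraud"},
]
support, confidence = rule_support_confidence(
    transactions, {"freelancer"}, {"monitoring"}
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Sorting rules by support gives the kind of ordering described above: rules that hold for a larger share of messages reflect ideas that more citizens raise.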

In the next post: what does the Voice of the Citizen tell us in blogs and forums?