In the previous post we saw an example of analyzing messages sent by citizens about a new taxation plan. We identified some correlations between keywords and concepts, but there are more ways to gain knowledge from such unstructured information.
By using Cluster Analysis we can extract groups of similar concepts from thousands of comments written by citizens, and also establish an order among those groups. Let's assume that Cluster Analysis reveals the following clusters (groups of similar concepts) within the submitted messages:
- battling tax fraud
- requests for a fair tax plan
- requests for less taxation for large families
- various incentives for citizens
Our problem is finding the order of importance that people place on the concept categories shown above: is battling tax fraud considered more important (that is, discussed more frequently by citizens) than requesting a fair tax plan? How about taxation for larger families?
A cluster analysis can reveal the size of each cluster and, as a consequence, how important each cluster is:
We assume that in the text representation shown above Cluster 5 (which contains 329 citizen messages) is about requests for a fair tax plan and Cluster 10 contains messages requesting that tax fraud be minimized. It appears that significantly fewer people are concerned with battling fraudulent activity; instead they request the more immediate benefits of a fair tax plan.
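The idea of reading importance from cluster size can be sketched in a few lines of Python. This is only an illustration: the function name `rank_clusters_by_size` and the tiny list of cluster assignments are hypothetical, and in practice the assignments would come from whatever clustering tool produced the clusters.

```python
from collections import Counter

def rank_clusters_by_size(assignments):
    """Return (cluster_id, size, share) tuples, largest cluster first.

    `assignments` holds one cluster id per message.
    Ties in size are broken by the smaller cluster id.
    """
    sizes = Counter(assignments)
    total = sum(sizes.values())
    ranked = sorted(sizes.items(), key=lambda item: (-item[1], item[0]))
    return [(cluster_id, n, n / total) for cluster_id, n in ranked]

# Hypothetical toy data: one cluster id per citizen message.
assignments = [5, 5, 5, 10, 5, 10, 3, 5]

for cluster_id, n, share in rank_clusters_by_size(assignments):
    print(f"Cluster {cluster_id}: {n} messages ({share:.0%})")
```

The share of messages per cluster is often easier to compare than the raw count, especially when the total number of messages changes between analyses.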
Collecting and analyzing information found in blogs and forum entries is another area of analysis that could prove very interesting. Let's see an example with the political, social and economic situation in Greece: the goal is to identify and extract trends and co-occurrences of key concepts from blog titles and forum posts, such as:
- Names of major Political parties
- Names of Politicians
- Economy (words/phrases such as "austerity plan")
- Negative characterizations
- Company Names
...etc
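One simple way to extract such concepts is a keyword lexicon that maps each category to the words and phrases that signal it. The sketch below assumes lower-cased, whole-word matching on English text; the `LEXICON` contents and the function name `tag_concepts` are illustrative assumptions, and a real lexicon would be far larger and would need to handle Greek text and inflected forms.

```python
import re

# Hypothetical lexicon: concept category -> keywords or phrases.
LEXICON = {
    "Political party": ["pasok"],
    "Politician": ["giorgos papandreou", "papandreou"],
    "Economy": ["austerity plan", "economy"],
}

def tag_concepts(text, lexicon=LEXICON):
    """Return the set of categories whose keywords occur in `text`."""
    lowered = text.lower()
    found = set()
    for category, keywords in lexicon.items():
        for keyword in keywords:
            # \b keeps "economy" from matching inside a longer word.
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                found.add(category)
                break
    return found

print(tag_concepts("Papandreou defends the austerity plan"))
```

Each blog title or forum post then becomes a small set of concept tags, which is the form needed for trend and co-occurrence counting.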
Several applications can emerge for this kind of data. We could track specific concepts through time and see their trends. We can also identify which concepts are discussed together. As an example, we could identify the reasons why Giorgos Papandreou (PM of Greece) is characterized negatively in blog posts (that is, which other concepts are found in blog posts containing the keywords 'Giorgos Papandreou' AND negative characterizations?):
(Note: PASOK = governing political party)
Politics = 120
Economy = 72
Economy, Politics = 40
PASOK = 24
Politics, PASOK, Referendum = 8
Economy, Politics, PASOK, Referendum, Immigrants = 8
Economy, Politics, Society = 8
Society, PASOK = 4
In other words: Giorgos Papandreou is criticized mainly for his political decisions and the economy, followed by criticism of PASOK. Negative sentiment also exists because a percentage of Greek citizens demand a referendum on the latest decision of the Greek government to grant Greek citizenship to a large proportion of immigrants.
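A co-occurrence count of this kind can be sketched as follows, assuming each blog post has already been reduced to a set of concept tags. The function name `cooccurring_concepts`, the tag "Negative" and the sample posts are hypothetical and only show the mechanics; they are not the data behind the counts above.

```python
from collections import Counter

def cooccurring_concepts(posts, required):
    """Count combinations of other concepts in posts that contain
    every concept in `required`. Returns (combination, count) pairs,
    most frequent first."""
    required = frozenset(required)
    counts = Counter()
    for concepts in posts:
        concepts = frozenset(concepts)
        if required <= concepts:          # post mentions all required concepts
            others = concepts - required  # what else is discussed alongside
            if others:
                counts[tuple(sorted(others))] += 1
    return counts.most_common()

# Hypothetical tagged posts (one set of concept tags per post).
posts = [
    {"Giorgos Papandreou", "Negative", "Politics"},
    {"Giorgos Papandreou", "Negative", "Politics"},
    {"Giorgos Papandreou", "Negative", "Economy", "Politics"},
    {"Giorgos Papandreou", "Economy"},   # no negative tag: skipped
    {"PASOK", "Negative"},               # different subject: skipped
]

for combo, n in cooccurring_concepts(posts, {"Giorgos Papandreou", "Negative"}):
    print(", ".join(combo), "=", n)
```

Counting whole combinations rather than single concepts keeps entries such as "Economy, Politics" separate from "Politics" alone, which matches the way the counts are listed above.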



