Tuesday, 27 July 2010

Summarization of Blog posts with "Web Pulse" Reports

In the past couple of months I have been looking for a way to capture and understand what happens on the Web, and more specifically what people write in blogs, in terms of sentiment and emerging trends. The first thing I came up with was the idea of a "Web Pulse" Report : a way to summarize what people are discussing on the web. The implementation was not as complex as I expected, and I was pleased to find that the knowledge that can be extracted is, to say the least, very useful and interesting. Before looking at an actual example report, here are the elements that comprise it :


1) Concept Frequencies : Identifies the concepts that bloggers most frequently write about

2) Global Co-occurrence Matrix : Identifies the most frequent word bigrams

3) Keyword Associations for Concepts : Which keywords tend to co-exist with a specific concept?

4) Most frequent n-grams associated with a given Concept (where n=2,3,4,5)


As an example, we will look at what Greek bloggers were discussing on July 27th, 2010, based specifically on the post titles of more than 300 Greek blogs.

Here are the concept frequencies found (in descending order) on that date :


[Turkey]=178
[Politics]=128
[Economy]=101
[International Monetary Fund - IMF]=62
[Banking]=61
[Public Sector]=50
[Negative Characterizations]=30
[Political Parties]=29
[George Papandreou]=29 (=Prime Minister of Greece)
[Loans]=22
[Society]=20


The first interesting fact is that "Turkey" is at the top of the list for Greek blog articles, even though the Greek mass media did not place much weight on the latest Turkish behavior in the Aegean Sea on that day. The second concept is Politics, with the Economy following next.
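A minimal sketch of how such concept frequencies could be counted is shown here. It assumes a keyword-to-concept lexicon and counts each title once per concept it mentions. The lexicon, the sample titles and the function name `concept_frequencies` are hypothetical illustrations, not the actual data or code behind the report.

```python
from collections import Counter

# Hypothetical lexicon: each concept is triggered by a set of keywords.
# The lexicon actually used for the report is not given in this post.
CONCEPT_KEYWORDS = {
    "Turkey": {"turkey", "turkish"},
    "Economy": {"economy", "economic"},
    "Banking": {"bank", "banks"},
}


def concept_frequencies(titles, lexicon=CONCEPT_KEYWORDS):
    """Count in how many titles each concept is mentioned at least once."""
    counts = Counter()
    for title in titles:
        tokens = set(title.lower().split())
        for concept, keywords in lexicon.items():
            if tokens & keywords:  # at least one keyword appears in the title
                counts[concept] += 1
    return counts


# Made-up sample titles, for illustration only.
titles = [
    "Turkish vessel enters the Aegean",
    "Banks and the economy under pressure",
    "Turkey and the economy",
]
freqs = concept_frequencies(titles)
for concept, n in freqs.most_common():
    print(f"[{concept}]={n}")
```

Splitting on whitespace is a simplification; real Greek titles would need punctuation stripping and probably stemming before the lookup.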

Here is the top part of the Global Co-occurrence Matrix found (in Greek) :


ΣΚΑΦΟΣ,ΤΟΥΡΚΙΚΟ : 25
ΡΕΙΣ,ΤΟΥΡΚΙΚΟ : 25
ΠΙΡΙ,ΤΟΥΡΚΙΚΟ : 25
ΡΕΙΣ,ΣΚΑΦΟΣ : 24
ΕΛΛΑΔΑ,ΧΩΡΑ : 23
ΥΠΟΥΡΓΕΙΟΥ,ΟΙΚΟΝΟΜΙΚΩΝ : 22
ΡΕΙΣ,ΕΡΕΥΝΗΤΙΚΟ : 22
ΠΙΡΙ,ΕΡΕΥΝΗΤΙΚΟ : 22
ΡΕΙΣ,ΠΙΡΙ : 21
ΑΝΑΜΕΝΕΤΑΙ,ΣΥΜΦΩΝΑ : 21
ΠΟΛΙΤΙΚΗ,ΧΩΡΑΣ : 19
ΟΙΚΟΝΟΜΙΑ,ΕΛΛΗΝΙΚΗ : 19
ΜΟΝΑΔΩΝ,ΔΕΗ : 19
ΚΥΒΕΡΝΗΣΗ,ΠΑΠΑΝΔΡΕΟΥ : 19
ΗΓΕΣΙΑ,ΠΟΛΙΤΙΚΗ : 19
ΕΡΕΥΝΗΤΙΚΟ,ΤΟΥΡΚΙΚΟ : 19
ΥΠΟΧΡΕΩΣΕΙΣ,ΜΝΗΜΟΝΙΟ : 16
ΧΩΡΑ,ΜΝΗΜΟΝΙΟ : 15

The four most frequent keyword associations are, again, about the latest problems between Greece and Turkey, and more specifically about the fact that a Turkish boat named "Piri Reis" (in Greek : ΠΙΡΙ ΡΕΙΣ) has been repeatedly entering a Greek part of the Aegean Sea without permission.
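A matrix like the one above could be built roughly as follows. This sketch assumes that two words co-occur whenever they appear in the same title, regardless of adjacency, and that each pair is counted once per title. The function name `cooccurrence_counts` and the sample titles are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(titles):
    """Count how many titles contain each unordered pair of words."""
    pair_counts = Counter()
    for title in titles:
        # Deduplicate and sort so that (A, B) and (B, A) map to the same key.
        words = sorted(set(title.upper().split()))
        pair_counts.update(combinations(words, 2))
    return pair_counts


# Made-up sample titles, for illustration only.
sample_titles = [
    "PIRI REIS TURKISH VESSEL",
    "TURKISH VESSEL PIRI REIS AGAIN",
    "GREEK ECONOMY",
]
pairs = cooccurrence_counts(sample_titles)
for (first, second), n in pairs.most_common(3):
    print(f"{first},{second} : {n}")
```

In practice a stop-word list would be applied before counting, otherwise articles and prepositions dominate the top of the matrix.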

Let's look at the association frequencies found between specific Concepts : The following is an example of the concepts associated with "George Papandreou" (Greek Prime Minister)

International Monetary Fund - IMF=32
Politics=28
Political Reform=6
Nea Dimokratia=3 (=Opposition Political Party)
Politics, International Monetary Fund, Loans, Political Parties=2
Negative Sentiment=2
Public Sector=2
Uncertainty=2

It appears that George Papandreou is most frequently mentioned where the IMF is involved, and also that a political reform might be on its way.
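One way such association counts could be derived is sketched here, assuming every title has already been tagged with the set of concepts it mentions. The function name `concept_associations` and the tagged sample data are hypothetical.

```python
from collections import Counter


def concept_associations(tagged_titles, target):
    """Count the concepts that appear in the same title as `target`."""
    counts = Counter()
    for concepts in tagged_titles:
        if target in concepts:
            counts.update(c for c in concepts if c != target)
    return counts


# Made-up concept tags, one set per blog title, for illustration only.
tagged = [
    {"George Papandreou", "International Monetary Fund - IMF", "Politics"},
    {"George Papandreou", "International Monetary Fund - IMF"},
    {"Turkey", "Politics"},
]
assoc = concept_associations(tagged, "George Papandreou")
for concept, n in assoc.most_common():
    print(f"{concept}={n}")
```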

The fourth element of the report shows phrases that are commonly found in Blog posts. Since many blogs tend to use the same titles, this makes it possible to see how information spreads from one blog to another.
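The phrase extraction can be sketched as plain n-gram counting over the titles that mention a concept, for n=2,3,4,5. In this sketch a concept is represented simply by a set of trigger keywords. That representation, the function name `ngram_counts` and the sample titles are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(titles, keywords, n_values=(2, 3, 4, 5)):
    """Count word n-grams in the titles that contain any of `keywords`."""
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()
        if not keywords & set(tokens):
            continue  # this title does not mention the concept
        for n in n_values:
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


# Made-up sample titles, for illustration only.
phrase_titles = [
    "piri reis enters the aegean",
    "piri reis enters greek waters",
    "economy news",
]
phrases = ngram_counts(phrase_titles, keywords={"piri"})
for ngram, n in phrases.most_common(3):
    print(" ".join(ngram), n)
```

An n-gram that shows up with a high count across many different blogs is a hint that the same title is being copied from one blog to another.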

The report can be enhanced in various ways : For example, by tokenizing Blog posts into sentences, I have added the option of performing chi-square tests to identify co-occurrences in a more reliable way than by relying strictly on absolute term frequencies. Through different types of analysis and knowledge representation we are able to look at our subject(s) of interest in different ways, which hopefully leads us to better insights.
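As a rough illustration of the chi-square idea, the sketch below computes Pearson's chi-square statistic (without continuity correction) for a 2x2 table of sentences that contain or do not contain each of two words. The function name `chi_square` and the sample sentences are hypothetical, and this is not necessarily how the test is set up in the actual report.

```python
def chi_square(sentences, word_a, word_b):
    """Pearson chi-square statistic for the co-occurrence of two words.

    Builds a 2x2 contingency table over sentences:
        a = both words, b = only word_a, c = only word_b, d = neither.
    """
    a = b = c = d = 0
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        has_a, has_b = word_a in tokens, word_b in tokens
        if has_a and has_b:
            a += 1
        elif has_a:
            b += 1
        elif has_b:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # one of the words never (or always) occurs
    return n * (a * d - b * c) ** 2 / denominator


# Made-up sample sentences, for illustration only.
associated = ["piri reis", "piri reis", "economy news", "economy news"]
independent = ["x y", "x", "y", "z"]
strong = chi_square(associated, "piri", "reis")
weak = chi_square(independent, "x", "y")
print(strong, weak)
```

With one degree of freedom, a value above roughly 3.84 corresponds to p < 0.05, so word pairs can be ranked or filtered by this score instead of by raw counts. With very small counts the approximation is unreliable, so a minimum frequency cut-off would still be needed.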

From my experience so far, this type of report is a simple but efficient way to summarize the content of Blogs and also show what is 'hot' at the moment and why.