Tuesday, 09 October 2012

Personal Data Mining

I believe that I have an overactive immune system. I get recurring bouts of perennial conjunctivitis, and at times I also experience pain in my neck lymph nodes and my right maxillary sinus. I always believed that all of these symptoms were somehow related. Because of my conjunctivitis I was not able to wear contact lenses. My ophthalmologist confirmed that my eye problems were "allergy-related".

On September 22nd, 2011 I began a personal experiment. I decided to keep a detailed record of various elements of my everyday life : whether I had a good night's sleep and spent time outdoors, what I ate and how much stress I felt.

I almost always carry my smartphone with me, so I used a text-editing application to record as much detailed information as I could every day. Here is an example of two consecutive dates as they appear in my daily log :

10/11/12, slept/bad, vitamin_c/0, coffee/1, self/ok, stress/low, sausages,cholesterol_food, sugar/5, pasta, tomatoes, mushrooms,  next_sleep/ok 
10/12/12, slept/ok, vitamin_c/500, coffee/2, self/ok, stress/high, bread, honey, milk, sugar/10, meat, garlic, yoghurt, conjunctivitis,  icecream,  next_sleep/ok  

So on the first example date, I had not slept well the previous night. I had one coffee and roughly 5 teaspoons of sugar over the whole day. I was feeling ok with myself. I had sausages and eggs (tagged as cholesterol_food) for breakfast and pasta with mushroom (red) sauce for lunch, but no dinner. I managed to sleep well at night. I did not have any signs of an overactive immune system. However, the next day I had conjunctivitis.

I then had to transform the entries into a suitable format, a .csv file, which could then be used by Data Mining Software (such as R and WEKA) for analysis. To do that, I used a simple Java program that transforms all log entries to .csv format using the following rules :


1) Each line represents a day.
2) Each entry is separated by a comma (",")
3) If an entry does not contain a forward slash character ("/") then it is treated as a Boolean feature. 
4) If an entry contains a forward slash then it is treated as a Numerical or Categorical feature.
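The four rules above can be sketched in a few lines. This is a Python illustration, not the Java program mentioned in the post (which is not shown). The function names `parse_log_line` and `to_csv` are hypothetical, as are the choices to treat the first entry as the date and to write a missing Boolean feature as False.

```python
import csv
import io


def parse_log_line(line):
    """Split one daily log line into (date, features) using the rules above."""
    entries = [e.strip() for e in line.split(",")]
    entries = [e for e in entries if e]  # drop empty entries left by stray commas
    date, features = entries[0], {}     # assumption: the first entry is the date
    for entry in entries[1:]:
        if "/" in entry:
            name, value = entry.split("/", 1)
            # Numerical if the value looks like a non-negative number, else Categorical
            is_number = value.replace(".", "", 1).isdigit()
            features[name] = float(value) if is_number else value
        else:
            features[entry] = True       # no slash: Boolean feature
    return date, features


def to_csv(lines):
    """Turn log lines into CSV text with one row per day."""
    rows = [parse_log_line(line) for line in lines if line.strip()]
    columns = sorted({name for _, feats in rows for name in feats})
    boolean = {name for _, feats in rows for name, v in feats.items() if v is True}
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date"] + columns)
    for date, feats in rows:
        # Absent Boolean features become False, other absent features stay empty
        writer.writerow([date] + [feats.get(c, False if c in boolean else "") for c in columns])
    return out.getvalue()
```

Calling `to_csv` on the two example log lines gives a header row plus one row per day, with one column for every feature seen on either day.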

So our two example dates are transformed like this (not all features are shown) :




R was used to perform several pre-processing steps, such as coding a function called addfeature which I use to derive new variables from old ones :


data.df<-addfeature("fiber",c("beans","stringbeans","oats","okra","lentils"),data.df)
data.df<-addfeature("cholesterol_food",c("eggs","mayo","octopus","squid"),data.df)
data.df<-addfeature("nuts",c("hazelnuts","walnuts","peanuts","cashews","almonds"),data.df)
data.df<-addfeature("immunity",c("itchyeyes","lymphpain","sinuspain","conjunctivitis"),data.df)   
  
So if on any day I had eggs, mayo(nnaise), octopus, squid or any combination of these foods, a single cholesterol_food entry replaces those entries.
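The body of addfeature is not shown in the post, so the following is only a guess at its behaviour based on the description above, written in Python rather than R. The name `add_feature` and the representation of each day as a dictionary are assumptions made for illustration.

```python
def add_feature(new_name, source_names, days):
    """Replace any of source_names with a single Boolean feature new_name.

    `days` is a list of dicts, one per day, mapping feature name to value.
    Returns new dicts; the input is left untouched.
    """
    result = []
    for day in days:
        day = dict(day)  # work on a copy
        # Remove every source feature and remember whether any was present
        found = [day.pop(name, False) for name in source_names]
        # Keep the derived feature if it was already tagged directly in the log
        if any(found) or day.get(new_name):
            day[new_name] = True
        result.append(day)
    return result
```

For example, a day logged with both eggs and mayo ends up with a single cholesterol_food entry, while a day with none of the listed foods is returned unchanged.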

With the log transformed to the format shown above, I was ready to analyze one year's worth of data (in this case IMMUNITY is the target) and extract patterns and several hypotheses, for example that "there appears to be a connection between high stress and over-activity of my immune system."

However, we must be aware of the dangers that might lead us to incorrect findings. For instance, we must take into account that conjunctivitis usually lasts more than one day, and that some features, like Vitamin C intake, are special in the sense that the representation shown above does not capture the compounding effect of Vitamin intake. In other words, I might have to take an amount x of Vitamin C for n consecutive days to see any effect. Furthermore, this analysis does not take into account the sequence of events. Should we remove foods/ingredients that normally co-occur or not? How would that affect results? The list of questions and considerations goes on (and then, when we finally have some results from the analysis, the first thing to do is to question them).
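Two of these concerns can be addressed with simple derived features. The sketch below is in Python and is not taken from the author's analysis: `rolling_sum` accumulates an intake over the last n days, and `onset_only` keeps only the first day of a multi-day episode. Both function names and the window size used in the example are assumptions.

```python
def rolling_sum(values, window):
    """Sum of the current day and the previous window-1 days."""
    return [sum(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


def onset_only(flags):
    """Keep only the first day of each run of consecutive symptom days."""
    return [bool(f) and (i == 0 or not flags[i - 1]) for i, f in enumerate(flags)]


# Hypothetical five-day stretch: daily Vitamin C in mg and a symptom flag
vitamin_c = [0, 500, 500, 0, 500]
conjunctivitis = [False, True, True, False, True]

vitamin_c_3d = rolling_sum(vitamin_c, 3)   # cumulative intake over 3 days
onsets = onset_only(conjunctivitis)        # one flag per episode, not per day
```

Whether three days is a sensible window is exactly the kind of question that has to be tested rather than assumed.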


Using Data Science I was able to almost stop getting sinus and lymph pain and to wear my contact lenses again (I still get symptoms, but very rarely). Two foods appeared to be moderately correlated with signs of an overactive immune system, one of them being garlic. One could argue that I could have found that using a simple food diary. I doubt it, since things were not so evident. Once this information was found, these two foods were eliminated from my diet to see the outcome.

The analysis has also identified a particular Vitamin that was able, in my opinion, to regulate my immune system response so that I could have no food restrictions. Several other patterns (or hypotheses) emerged that could be used for further evaluation by specialized personnel. Whatever I tried, I tried under the close supervision and with the consent of specialist Doctors.

It's a logical next step to imagine the potential knowledge and hypotheses that could be extracted by running the same experiment on a wider scale (for example by using Kaggle).

In the next post : More thoughts, results and warnings




Tuesday, 19 June 2012

Food Data : The Next Target of Massive Analytics

It has been a very busy period since my last post but also a very interesting one.  At the Social Media Analytics panel of the European Text Analytics Summit there was a question on "What would you suggest to new Entrepreneurs when it comes to Text Analytics". The answer from most of us was "Specialize" :  Build an Exceptionally Good vertical solution.   

Text Analytics has been put to use for several verticals : Finance, Telecommunications, Pharmaceuticals to name a few. Perhaps the next important vertical for Text Analytics is something as Basic -but necessary- as Food. 

Using Analytics for the Food Market is not just about analyzing millions of Tweets to understand and detect Trends in Food consumption, identify ingredient associations that are liked by Consumers (e.g. Olive Oil => Garlic) and measure the sentiment that a Food experience creates.

The Food Sector is a tremendous Market : Super Markets, Restaurants, Chefs, Books, Magazines, Television Series and Consumers. Insights from Food Data Analytics could be used by all the "knowledge consumers" mentioned above.

In other words :

- Can we identify emerging trends in the Food Market? And if we can, who are the possible recipients of this knowledge?

- Can we understand and suggest new Food Experiences according to several metrics found whenever Food is discussed in Social Media?

- What other potential sources can be used to collect and then analyze Food Data?

- Can we understand how consumers make choices when it comes to Food?

- Can we Predict Popular Recipes? And how can we monetize this knowledge?

Text Analytics is a key technology for transforming all the unstructured information on Food found on the Web into a form that can be analyzed. Predictive Analytics can be put to use if we can combine unstructured information with a target variable that we wish to predict.

One of the interesting tasks of a Data Miner is to be able to identify several -actionable and interesting- applications of both Data Mining and Text Mining given some Data. Of importance is also to find and/or to create new Data sources that can help in making better predictions. This is a challenging task but with careful considerations and lots of testing it may well prove to be a worthwhile and rewarding experience.

Coming back to Food Data, we could potentially use mentions from Tweets, FB Posts and "Likes", Blog and Website Posts to capture unstructured information. The hardest part is to somehow incorporate more information about Consumer Behavior, since this knowledge (and the ability to predict Consumer Behavior) would be particularly interesting.

There are limits to what Analytics can do, especially when we are talking about Predicting Consumer Behavior. As always, proper Data Collection, Pre-processing and thorough Testing are required to reach consistent results.





Monday, 19 March 2012

Text Analytics in Telecommunications - Part 3

It is well known that FaceBook contains a multitude of information that can be potentially analyzed.  A FaceBook page contains several entries (Posts, Photos, Comments, etc) which in turn generate Likes. This data can be analyzed to better understand the behavior of consumers towards a Brand, Product or Service.

Let's look at the analysis of the three FaceBook pages of the MT:S, Telenor and VIP Mobile Telcos in Serbia as an example. The question that this analysis tries to answer is whether we can identify words and phrases that frequently appear in posts that generate any kind of reaction (a "Like" or a Comment) vs words and topics that do not tend to generate reactions. If we are able to differentiate these words then we get an idea of what consumers tend to value more : If a post is of no value to us then we will not tend to Like it and/or comment on it.

To perform this analysis we need a list of several thousand posts (their text) and also the number of Likes and Comments that each post has received. If a post has generated a Like and/or a Comment then we flag that post as having generated a reaction. The next step is to feed that information to a machine learning algorithm to identify which words have discriminative power (i.e. which words appear more frequently in posts that are liked and/or commented on, and which words do not produce any reaction).
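To make the labelling step concrete, here is a small Python sketch. It is not the method used in the post, whose results come from a decision tree. As a simpler stand-in it scores each word by a smoothed log-odds ratio of appearing in posts with a reaction versus posts without one. The field names `text`, `likes` and `comments` and both function names are hypothetical.

```python
import math
import re
from collections import Counter


def label_reaction(post):
    """A post 'generated a reaction' if it has at least one Like or Comment."""
    return post["likes"] > 0 or post["comments"] > 0


def word_scores(posts):
    """Smoothed log-odds of each word appearing in reacted vs non-reacted posts.

    Positive scores lean towards posts with reactions, negative scores
    towards posts without.
    """
    reacted, ignored = Counter(), Counter()
    n_reacted = n_ignored = 0
    for post in posts:
        words = set(re.findall(r"\w+", post["text"].lower()))  # count each word once per post
        if label_reaction(post):
            reacted.update(words)
            n_reacted += 1
        else:
            ignored.update(words)
            n_ignored += 1
    scores = {}
    for word in set(reacted) | set(ignored):
        # Add-one smoothing keeps both probabilities strictly between 0 and 1
        p_r = (reacted[word] + 1) / (n_reacted + 2)
        p_i = (ignored[word] + 1) / (n_ignored + 2)
        scores[word] = math.log(p_r / (1 - p_r)) - math.log(p_i / (1 - p_i))
    return scores
```

Sorting the resulting dictionary by score gives a ranked word list. A tree-based learner additionally captures combinations of words, which this per-word score does not.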

After performing this analysis we essentially come up with a list of words and a metric which tells us the discriminative power of each word. Here is an example of identifying these words :


(Note : Results based on a very limited Data Sample)

Keeping in mind that results shown are extracted from a very limited amount of data, the decision tree depicted above shows us that :

The presence of the word Dragi (which means "Dear" in Serbian) means that a post usually does not receive reactions. This makes sense, as many posts that reply to subscriber questions start with the word "Dear" followed by the first name of the subscriber.

novo (="new") is a word that receives a lot of reactions, along with hocu (="I want") and dopuna (=recharging credit for prepaid subscriptions). In the same manner we identify more words that are important in discriminating interesting vs non-interesting posts. Note that we have to identify the correct context. For example, we have to identify what the word novo refers to most of the time : A new cell phone or a new promotion? From the sample analyzed it appears that :


1) Subscribers "like" posts that discuss New devices such as Cell phones and tablets (The Next Step could be the identification of these devices)

2) Subscribers want new promotions (but we then need to find which types of promotions exactly)

3) Issues with incorrect re-charging are creating a very negative sentiment (but then we need to find which operator co-occurs with this sentiment and for which cases)


In this way we are able to better understand subscribers, extract the Topics that they are interested in and take all this information into account when creating future initiatives. Note that in this way we can get hints about several potential "hot" topics such as Cell Phone and Tablet Brands, Tariffs, Services and Marketing campaigns, and that this can be performed for each Telco Provider page, which means that we can identify the "hot topics" applicable to each Telco provider.

I will present all of the above, along with several other uses of Predictive and Text Analytics for Telecommunications, at the upcoming European Text Analytics Summit in London, UK.

If a Marketing or PR Agency uses Social Media Analytics, in the ways shown above, to identify hot topics in News, Sports, TV, Banking and Consumer Goods, a Knowledge Base is created which has many uses : Imagine a scenario where a Telecommunications provider wishes to use a Sports event for a Marketing campaign. We could take into account the hot topics found from a "Sports" analysis and suggest ideas in a much more informed way. More on this in the next post.

Monday, 13 February 2012

Text Analytics for Telecommunications - Part 2

In the previous post we saw the problems that a highly inflected language creates and also a very basic example of Competitive Intelligence. The Case Study that I will present at the forthcoming European Text Analytics Summit is about the analysis of Telco Subscriber conversations on FaceBook and Twitter that involve Telenor, MT:S and VIP Mobile, located in Serbia.

It is time to see what Topics are found in subscriber conversations. Each Telco has its own FaceBook page which contains posts and comments generated by page curators and subscribers. Each post and comment also generates "Likes" and "Shares". Several types of analysis can be performed to find out :

1) What kind of Topics are discussed in posts and comments of each Telco FaceBook page?
2) What is the sentiment?
3) Which posts (and comments) tend to be liked and shared (=generate Interest and reactions)?



For each FaceBook page post, an identifier is added to the post text which designates the page the post originated from (MT:S, Telenor or VIP Mobile). Prior to the analysis of a FaceBook Post which says "We want more promotions", we need to be aware that this text originated, for example, from the MT:S FaceBook Page and not Telenor's.

Identifying the topics discussed in Telco subscribers' posts and comments has a number of benefits. We gain a better understanding of the areas that a Telco should focus on. If we find that the topic of INTERNET is repeatedly at the top of the list of discussions then this is where a Telco should pay attention. If we find Network mentions to be repeatedly associated with a competitor Telco (which most likely is not good for that competitor) we can choose the right time to air commercials implying that we are constantly working for better Network coverage. We can also identify how much "buzz" was created by a new marketing campaign or a new phone offer and the sentiment associated with it.

Let's have a look at the first type of analysis, namely Topic Detection. Using Information Extraction we  identify the Topics mentioned in thousands of FaceBook posts and comments for a particular period. Here are the results :




In terms of user engagement on FaceBook, MT:S is the winner since in absolute numbers, its FaceBook page contains more posts and comments than the other two FaceBook Telco Pages. Notice the frequencies of other topics found such as SMILEYs, INTERNET, PROMOCIJA (= Promotion), MREZA (=Network), POSTPAID and ANDROID.

With the chart shown above we become aware of the distribution with which Topics are discussed on all three FaceBook Pages. We do not know what is being discussed for now but we know that subscribers  talked more about Internet, then Promotions (=PROMOCIJA), then Network (=MREZA) and so on.

Let's look at which Topics co-occur with PROMOCIJA (=Promotion). In other words, which other Topics are found in FaceBook Posts when Promotions are mentioned? Here are the results :




Most of the collected posts that discuss Promotions are actually posts found on the MT:S FaceBook page. Notice also the presence of Topic HOCU (=I want), which tells us that subscribers simply state that they want new promotions. Here is what the picture looks like for topic INTERNET :




So Telenor is found more frequently in INTERNET mentions. However, caution is required since we do not know whether these Topic distributions and their associations with specific Telcos can be attributed to pure chance.

It is very important to be confident enough to communicate to any Telco that being associated with Network mentions or any other Topic  is -or is not- simply a random event.
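One standard way to check whether a Topic is associated with a Telco beyond chance is a chi-square test of independence on a table of post counts. The post does not say which test was used, so the Python sketch below is only an illustration. The function names are hypothetical and the counts in the example are invented, not taken from the case study.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    Assumes every row and every column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_dof2(stat):
    """Upper-tail probability of a chi-square distribution with exactly 2 degrees of freedom."""
    return math.exp(-stat / 2)


# Invented counts: rows are three Telco pages,
# columns are [posts mentioning INTERNET, posts not mentioning it]
counts = [[120, 880], [90, 510], [40, 360]]
stat, dof = chi_square_independence(counts)
p = p_value_dof2(stat)  # valid here because a 3 x 2 table has 2 degrees of freedom
```

A small p-value suggests the Topic is not spread evenly across the three pages. It says nothing about why, and with many Topics tested the usual multiple-comparison caveats apply.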

Wednesday, 25 January 2012

Text Analytics for Telecommunications - Part 1

As discussed in the previous post, performing Text Analytics for a language for which no tools exist is not an easy task. The Case Study which I will present at the European Text Analytics Summit is about analyzing and understanding thousands of Non-English FaceBook posts and Tweets for Telco Brands and their Topics, leading to what is known as Competitive Intelligence.


The Telcos used for the Case Study are Telenor, MT:S and VIP Mobile, which are located in Serbia. The analysis aims to identify how Customers perceive each of the three Companies and to understand the Positive and Negative elements of each Telco as captured from the Voice of the Customers (the Subscribers).


By analyzing several thousand Tweets and FaceBook posts and comments we can have a first glimpse of Competitive Intelligence. For example, when we wish to identify which words frequently occur with mentions of postpaid packages, this is what we find :




Red boxes show Telco Brands - notice "mts" and "mtsa" which point to the same Telco, namely mt:s.  Blue boxes indicate similar words that should be merged.  From a first look at the results above we see that : 

a) mt:s is found more frequently when users mention PostPaid packages.

b) Telenor and VIP Mobile are not found as frequently as MT:S in PostPaid package conversations.

c) We see several problems from insufficient pre-processing : Kredit and Kredita (=credit) should merge into one word; the same applies to telefona - telefon, internet - interneta and mts - mtsa.
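The kind of count behind these observations can be sketched as follows. This is a Python illustration with invented example texts, not the tooling used for the case study. The function name `cooccurring_words` is hypothetical, and co-occurrence is taken to mean "appears in the same post or Tweet".

```python
import re
from collections import Counter


def cooccurring_words(texts, target, top=10):
    """Count words that appear in the same text as `target`, most frequent first."""
    counts = Counter()
    for text in texts:
        words = set(re.findall(r"\w+", text.lower()))  # each word counts once per text
        if target in words:
            counts.update(words - {target})
    return counts.most_common(top)


# Invented examples; note that "mts"/"mtsa" and "kredit"/"kredita"
# are counted separately because no normalization has been applied yet
texts = [
    "mts postpaid kredit",
    "postpaid telefon mts",
    "telenor internet",
    "mtsa postpaid kredita",
]
top_words = cooccurring_words(texts, "postpaid")
```

The split counts for inflected forms are the pre-processing problem described in point c).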



Notice that we can perform the same High-level analysis for several Telco Topics such as Network, Billing, Customer Care, Promotions, Questions of subscribers and so on. The next task is to identify the reason(s) why MT:S was found to have more mentions about PostPaid packages. Note that at this point we do not know why this is so : It could be the fact that MT:S prices of prepaid packages are high, very cheap or something else is happening that needs to be identified.


The Serbian Language poses extra work because it is a highly inflected language : Even the endings of Brand names change according to usage. Consider the following :

U mts-u (at mts)
Sa mts-om (With mts)
Bez mts-a (Without mts)


It is evident that a highly inflected language explodes our feature space, and this is where R can come to the rescue with some success. We can use R for mapping several synonyms to one word, removing (Serbian) stop words, removing URLs and performing several other pre-processing steps that are necessary prior to an extensive analysis. More in the next post.
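The pre-processing steps listed here can be illustrated with a short sketch. It is written in Python, not R, and is not the author's code. The variant map and stop-word list are tiny hypothetical samples built from the examples in this post, whereas a real analysis would need much longer lists.

```python
import re

# Hypothetical lookup tables; the lists actually used are not given in the post
VARIANTS = {
    "mtsa": "mts", "mtsu": "mts", "mtsom": "mts",
    "kredita": "kredit", "telefona": "telefon", "interneta": "internet",
}
STOP_WORDS = {"u", "sa", "bez", "i", "je"}
URL = re.compile(r"https?://\S+")


def normalize(text):
    """Lower-case, strip URLs, merge inflected variants and drop stop words."""
    text = URL.sub(" ", text.lower())
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)   # "mts-om" becomes "mtsom"
    tokens = re.findall(r"\w+", text)
    tokens = [VARIANTS.get(t, t) for t in tokens]  # map each variant to one form
    return [t for t in tokens if t not in STOP_WORDS]
```

With these tables, "U mts-u", "Sa mts-om" and "Bez mts-a" all reduce to the single token mts, which is how the feature space shrinks.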

Monday, 09 January 2012

Case Study : Competitive Intelligence for Telecommunications

Telcos are a good example of a fast moving business environment and a good candidate for using Competitive Intelligence analysis from Social Media sources. The Case Study involves three major Telcos located in an Eastern European Country and shows the results from the analysis of thousands of Tweets and FaceBook wall posts to understand the following :


- How do subscribers perceive each Telco Brand?

- Which information do subscribers tend to Re-Tweet and "Like" on FaceBook Wall Posts? 

- Which words and Topics are commonly found with Intense feelings / thoughts?

- Which topics are mostly discussed when subscribers compare two or more Telco operators?

- What do subscribers say about Network Quality and Speed, Billing, Promotions, Marketing Events, Customer Care, TV Commercials etc.?

- How do they prioritize these topics and which of them are interesting and why?  

- What do subscribers talk about in general (i.e. without any Telco Brand being mentioned) regarding Internet speed and Charges, and what would they like to see more of?

I will present the Case Study mentioned above at the forthcoming 9th Annual European Text Analytics Summit in April in London, UK. The Case Study is an example of applying Text Analytics to a language for which no tools currently exist, so the difficulties and possible solutions will also be discussed. Examples will also be given of analyzing information at different conceptual levels and of how this technique provides even more insight into consumer behavior.

The following tools were used for the analysis : 

- GATE to annotate all Topics that occur within Telco conversations (such as "sms", "internet", "dropped call", "network","promotion") and for setting up Conceptual Levels.

- R for pre-processing Text and performing Text Classification, Topic Detection and Cluster Analysis.

- WEKA  for Feature Selection and Text Classification.

- Finally, Java is used to manage the information that is generated from GATE, such as understanding how subscribers prioritize various Telco Concepts and Topics and identifying important phrases and/or words that frequently occur when these Topics are being discussed.