Thursday, 16 October 2014

Sequence Data Mining for Health Applications

An often overlooked type of Analysis is Sequence Data Mining (also known as Sequential Pattern Mining).


Sequence Data Mining is a type of Analysis which aims at extracting frequent sequences of Events. We can also see Sequence Data Mining as Associations Discovery Analysis with a Temporal Element.

Sequence Data Mining has many potential applications (Web Page Analytics, Complaint Events, Business Processes), but today we will show an application in Health. I believe that this type of Analysis will become even more important as wearable technology sees wider use and more Data of this kind is generated.

Consider the following hypothetical scenario : 

A 30-year-old Male patient complains of several symptoms which -for simplicity- we will call Symptom1, Symptom2, Symptom3, etc.

His Doctor tries to identify what is going on; the patient takes all the necessary Blood work, which finds no problems. After a thorough evaluation the Doctor believes that his patient suffers from Chronic Fatigue Syndrome. Under the Doctor's supervision the patient will record his symptoms, along with the different supplements he takes, to understand more about his condition. Several events (e.g. a Visit to the Gym, a stressful Event) will also be taken into consideration to see if any patterns emerge.

-How can we easily record Data for the scenario above?
-Can we extract sequences of events that occur more frequently than mere chance?
-Can we identify which sequences of Events / Food / Medication may potentially lead to specific Symptoms or to a lack of Symptoms?


Looking at the problem through the eyes of a Data Scientist, we have :

A series of Events that happen during a day : A Stressful event, A sedentary day, Cardio workouts, Weight Lifting, Abrupt Weather Deterioration, etc

A Number of Symptoms : Headaches, "Brain Fog", Mood problems, Insomnia, Arthralgia, etc.


Let's begin with Data Collection. We first suggest that the patient use an Android app called MyLogsPro (or some other equivalent application) to easily input information as it happens :


  
So if the patient feels a specific Symptom, he presses the relevant Symptom button on his mobile device. The same applies to any events that have happened and any Food or Medication taken. As the day passes, we have the following data collected :



The snapshot shows what happened starting on the 20th of August 2014 : our patient logged the intake of Medication (at 08:22 AM) and/or Supplements upon waking up, then a Food entry was added at 08:47. At 11:06 the patient had a Symptom and immediately reached for his phone and pressed the relevant Symptom button (Symptom No 4).

After many days of Data Collection we decide that it's time to analyze this information. We export the data from the application as a csv file, which looks as follows :



We will use KNIME to read the csv file, change the contents of the entries so that an Algorithm can read the events, and then perform Sequence Data Mining. We have the following layout :



The File Reader reads the .csv file. The Pre-processing block (shown in yellow) then contains a String Manipulation node, which removes the colon (:) from the time field (e.g. 12:10 becomes 1210); a Sorter, which sorts the data by date and then by time; and a Java snippet, which uses the replaceAll() function to remove all leading zeros from the time field (e.g. 0010 becomes 10).
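For readers who prefer code to a KNIME workflow, the same pre-processing can be sketched in plain Python (the day/month/year date format and the column layout are assumptions about the exported csv):

```python
from datetime import datetime

def preprocess(rows):
    """Mimic the KNIME pre-processing block: strip the colon from the
    time field, sort by date then time, and drop leading zeros."""
    cleaned = []
    for date, time, event in rows:
        hhmm = time.replace(":", "")              # "08:22" -> "0822"
        cleaned.append((date, hhmm, event))
    # sort by parsed date, then by time (numerically, so "0822" < "1106")
    cleaned.sort(key=lambda r: (datetime.strptime(r[0], "%d/%m/%y"), int(r[1])))
    # remove leading zeros, e.g. "0010" -> "10"
    return [(d, t.lstrip("0") or "0", e) for d, t, e in cleaned]

rows = [
    ("20/08/14", "11:06", "Symptom4"),
    ("20/08/14", "08:22", "Medication1"),
    ("20/08/14", "08:47", "Food1"),
]
print(preprocess(rows))
```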

The R Snippet loads the cSPADE Algorithm and then uses it to extract sequential patterns.


After executing the stream we get the following output :


The information consists of two outputs : the first is a list of sequences along with their support, and the second contains the output of rule induction, which gives us two more useful metrics (namely the lift and the confidence of each rule).

We immediately notice an interesting entry on the first output :

Medication1->Symptom2

and on the second output we see that this particular rule has a lift of 1.4 and a confidence of 0.8.
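As a sanity check on such rules, the three metrics can be computed by hand from per-day event sequences. The sketch below uses made-up data; it is not the cSPADE implementation, only the metric definitions:

```python
def rule_metrics(sequences, a, b):
    """Support, confidence and lift for the sequence rule a -> b,
    counted over per-day event sequences (a followed by b within a day)."""
    n = len(sequences)
    has_a = sum(1 for s in sequences if a in s)
    has_b = sum(1 for s in sequences if b in s)
    has_ab = sum(1 for s in sequences
                 if a in s and b in s[s.index(a) + 1:])   # a before b
    support = has_ab / n
    confidence = has_ab / has_a          # of the days with a, how many led to b
    lift = confidence / (has_b / n)      # confidence vs. the base rate of b
    return support, confidence, lift

days = [
    ["Medication1", "Food1", "Symptom2"],
    ["Medication1", "Symptom2"],
    ["Food1", "Symptom2"],
    ["Medication1", "Food1"],
]
print(rule_metrics(days, "Medication1", "Symptom2"))
```

A lift above 1 means the symptom follows the medication more often than its base rate alone would predict.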

However, as Data Scientists we should always double-check the extracted knowledge and be aware of pitfalls. Let's see some examples (the list is not exhaustive) :

1) The algorithm does not account for time as it should. As an example, consider the following entries :

10/09/14,08:00,Medication1
10/09/14,08:05,Symptom2

We assume that Medication1 is taken by mouth and needs 60 minutes to be properly dissolved, and that these entries occur frequently enough in that order in our data set. Even though the algorithm might show a statistically significant pattern, it is not logical to hypothesize that Medication1 could be related to a Symptom2 occurring only five minutes later. The Analyst should first examine each of these entries to see what proportion of the records has a time difference of at least -say- 60 minutes.
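Such a plausibility check could look like the following Python sketch (the timestamp format is an assumption based on the entries above):

```python
from datetime import datetime

def plausible_pairs(entries, cause, effect, min_minutes=60):
    """Proportion of cause -> effect pairs (same day, nearest following
    effect) whose time gap is at least `min_minutes`."""
    fmt = "%d/%m/%y %H:%M"
    times = [(datetime.strptime(d + " " + t, fmt), ev) for d, t, ev in entries]
    gaps = []
    for i, (t0, ev) in enumerate(times):
        if ev != cause:
            continue
        for t1, ev2 in times[i + 1:]:
            if ev2 == effect and t1.date() == t0.date():
                gaps.append((t1 - t0).total_seconds() / 60)
                break
    if not gaps:
        return 0.0
    return sum(1 for g in gaps if g >= min_minutes) / len(gaps)

log = [
    ("10/09/14", "08:00", "Medication1"),
    ("10/09/14", "08:05", "Symptom2"),
    ("11/09/14", "08:00", "Medication1"),
    ("11/09/14", "09:30", "Symptom2"),
]
print(plausible_pairs(log, "Medication1", "Symptom2"))
```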

Apart from the example shown above we must consider the opposite effect. Consider these entries :

10/09/14,08:00,Medication1
...
...
...
10/09/14,21:05,Symptom2

In other words : is it possible for a Medication taken in the morning to generate a Symptom 12 hours later?


2) The algorithm is not able to account for the compounding effect of a Medication. For example, the patient might have low levels of Taurine, and for this level to be replenished, some number of days of Taurine supplementation is needed. The algorithm cannot account for this possibility.


3) The patient should also input "No Symptoms" entries. It is not clear, however, when this should be done (e.g. at the end of each day? assess every 6 hours and add 2 entries accordingly?)


However, this does not mean that a Sequence Mining algorithm should not be used under these circumstances. This technique can generate several potentially interesting hypotheses which Doctors and/or Researchers may wish to pursue further.
 




Thursday, 03 July 2014

Becoming a Data Scientist : A RoadMap

I receive a lot of questions about which books one should read to become a Data Miner / Data Scientist. Here is a suggested reading list and a proposed RoadMap (apart from the requirement of having an appropriate University degree) for becoming a Data Scientist.

Before going further : it appears that a Data Scientist should possess an awful lot of skills : Statistics, Programming, Databases, Presentation Skills, and knowledge of Data Cleaning and Transformations.
 

The skills that ideally you should acquire are as follows :

1) Sound Statistical Understanding and Data Pre-Processing
2) Know the Pitfalls : you must be aware of the Biases that could affect you as an analyst and also of the common mistakes made during Statistical Analysis
3) Understand how several Machine Learning / Statistical Techniques work.
4) Time Series Forecasting
5) Computer Programming (R, Java, Python, Scala)
6) Databases (SQL and NoSQL Databases)
7) Web Scraping (Apache Nutch, Scrapy, JSoup)
8) Text Data




Statistical Understanding : a good introductory book is Fundamental Statistics for the Behavioral Sciences by Howell, along with IBM SPSS for Introductory Statistics - Use and Interpretation and IBM SPSS for Intermediate Statistics by Morgan et al. Although all of these books (especially the latter two) lean heavily on the IBM SPSS Software, they provide a good introduction to key statistical concepts, while the books by Morgan et al. give a methodology to use, with a practical example of analyzing the High School and Beyond Dataset.

Data Pre-Processing : I must reiterate the importance of thoroughly checking and identifying problems within your Data. Data Pre-processing guards against the possibility of feeding erroneous data to a Machine Learning / Statistical Algorithm, but also transforms data in such a way that an algorithm can extract/identify patterns more easily. Suggested Books :

  •  Data Preparation for Data Mining by Dorian Pyle
  • Mining Imperfect Data: Dealing with Contamination and Incomplete Records by Pearson
  • Exploratory Data Mining and Data Cleaning by Johnson and Dasu


Know the Pitfalls : there are many cases of Statistical Misuse and biases that may affect your work, even when you are not consciously aware of it. This has happened to me on various occasions. In fact, this blog contains a couple of examples of Statistical Misuse, even though I tried (and keep trying) to highlight limitations due to the nature of the Data as much as I can. Big Data is another technology where caution is warranted. For example, see : Statistical Truisms in the Age of Big Data and The Hidden Biases of Big Data.

Some more examples :

-Quora Question : What are common fallacies or mistakes made by beginners in Statistics / Machine Learning / Data Analysis

-Identifying and Overcoming Common Data Mining Mistakes by SAS Institute

The following Book is suggested :

  • Common Errors in Statistics (and How to Avoid Them) by P. Good and J. Hardin

In case you are into Financial Forecasting, I strongly suggest reading Evidence-Based Technical Analysis by David Aronson, which focuses heavily on how Data Mining Bias (and several other cognitive biases) may affect your Analysis.


Understand how several Machine Learning / Statistical Algorithms work : you must be able to understand the pros and cons of each algorithm. Does the algorithm that you are about to try handle noise well? How does it scale? What kinds of optimizations can be performed? Which Data transformations are necessary? Here is an example for fine-tuning Regression SVMs :

Practical Selection of SVM Parameters and Noise Estimation for SVM Regression 

Another book which deserves attention is Applied Predictive Modeling by Kuhn and Johnson, which also gives numerous examples of using the caret R Package, which -among other things- has extensive Parameter Optimization capabilities.


When it comes to getting to know Machine Learning/ Statistical Algorithms I'd suggest the following books  :

  • Data Mining : Practical Machine Learning Tools and Techniques by Witten and Frank
  • The Elements of Statistical Learning by Hastie, Tibshirani, Friedman


Time Series Forecasting : in many situations you might have to identify and predict trends from Time Series Data. A very good introductory book is Forecasting : Principles and Practice by Hyndman and Athanasopoulos. Time Series Analysis and Its Applications: With R Examples by Shumway and Stoffer is another book with practical examples and R code, as the title suggests.

In case you are interested in more about Time Series Forecasting, I would also suggest the ForeCA (Forecastable Component Analysis) R package, written by Georg Goerg -working at Google at the moment of writing- which tells you how forecastable a Time Series is (Ω = 0 : white noise, therefore not forecastable; Ω = 100 : a sinusoid, perfectly forecastable).

Computer Programming Knowledge : this is another essential skill. It allows you to use several Data Science Tools/APIs that require -mainly- Java and Python skills. Scala also appears to be becoming an important Programming Language for Data Science, and R knowledge is considered a "must". Having prior knowledge of Programming gives you the edge if you wish to learn a new Programming Language. You should also constantly look for trends in programming-language requirements (see Finding the right Skillset for Big Data Jobs). It appears that -currently- Java is the most sought-after Computer Language, followed by Python and SQL. It is also useful to look at Google Trends, although interestingly "Python" is not available as a Programming Language Topic at the moment of writing.

Database Knowledge : in my experience this is a very important skill to have. More often than not, the Database Administrators (or other IT Engineers) who are supposed to extract Data for you are just too busy to do it. That means you must have the knowledge to connect to a Database, optimize a Query and perform several Queries/Transformations to get the Data that you want, in the format that you want.
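As a minimal illustration, here is what connecting, indexing and querying looks like with Python's built-in sqlite3 module (the table and its columns are hypothetical, echoing the symptom-log example from the first post):

```python
import sqlite3

# In-memory database with a hypothetical patient-events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, time TEXT, event TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("20/08/14", "08:22", "Medication1"),
    ("20/08/14", "11:06", "Symptom4"),
])
# An index on the filter column is a basic query optimization.
conn.execute("CREATE INDEX idx_event ON events(event)")
rows = conn.execute(
    "SELECT day, time FROM events WHERE event = ? ORDER BY day, time",
    ("Symptom4",),
).fetchall()
print(rows)
```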

Web Scraping : a useful skill to have. There are tons of useful Data which you can access if you know how to write code to extract information from the Web. You should get to know HTML Elements and XPath. Some examples of Software that can be used for this purpose :

-Scrapy
-Apache Nutch
-JSoup
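To get a feel for the HTML-element handling these frameworks automate, here is a minimal link extractor using only Python's standard library (the sample page is made up):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags -- the kind of low-level
    HTML-element handling that scraping frameworks automate."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = '<html><body><a href="/about">About</a><a href="/data.csv">Data</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```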

Text Data : Text Data contains valuable information : Consumer Opinions, Sentiment, and Intentions, to name just a few. Information Extraction and Text Analytics are important Technologies that a Data Scientist should ideally know.

Information Extraction :

-GATE
-UIMA

Text Analytics

-The "tm" R Package
-LingPipe
-NLTK

The following Books are suggested :

  • Introduction to Information Retrieval by Manning, Raghavan and Schütze
  • Handbook of Natural Language Processing by Indurkhya, Damerau (Editors)
  • The Text Mining HandBook - Advanced Approaches in Analyzing Unstructured Data by Feldman and Sanger

Finally here are some Books that should not be missed by any Data Scientist :

  • Data Mining and Statistics for Decision Making by Stéphane Tufféry (A personal favorite)
  • Introduction to Data Mining by Tan, Steinbach, Kumar 
  • Applied Predictive Modeling by Kuhn, Johnson
  • Data Mining with R - Learning with Case Studies by Torgo
  • Principles of Data Mining by Bramer


Thursday, 13 February 2014

Analyzing PubMed Entries with Python and NLTK

I decided to take my first steps in learning Python with the following task : retrieve all entries from PubMed and then analyze those entries using Python and the Text Mining library NLTK.

We assume that we are interested in learning more about a condition called Sudden Hearing Loss. Sudden Hearing Loss is considered a medical emergency and has several causes, although it is usually idiopathic ("a disease or condition the cause of which is not known or that arises spontaneously", according to Wikipedia).

At the moment of writing, the PubMed Query for sudden hearing loss returns 2919 entries :

  


To collect this information from PubMed we can use the Entrez library from BioPython, which enables us to save the Title (Column A) and the Abstract (Column B) of each PubMed entry to a csv file :


Now that we have the PubMed entries in place, we can start the Text Analysis with the NLTK Toolkit. The Python code simply reads the csv file which was created, removes stop words and uses a simple function to search for and replace specific words. For example, for this type of Data it is a good idea to replace occurrences of :

genes with gene,
induced, induces, inducing with induction,
antagonists with antagonist,
agonists with agonist,
etc.
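A minimal sketch of such a search-and-replace normalization step (the replacement map below is hypothetical and would be extended for a real run):

```python
# Hypothetical normalization map in the spirit of the replacements above.
REPLACEMENTS = {
    "genes": "gene",
    "induced": "induction",
    "induces": "induction",
    "inducing": "induction",
    "antagonists": "antagonist",
    "agonists": "agonist",
}

def normalize(tokens):
    """Map inflected variants to one canonical form so that collocation
    counts are not split across surface forms."""
    return [REPLACEMENTS.get(t.lower(), t.lower()) for t in tokens]

print(normalize(["Genes", "inducing", "hearing", "loss"]))
```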

This pre-processing work helps the retrieval of (possibly) interesting findings. For this example we want to find Collocations of terms, and to do this we will use the BigramCollocationFinder from the NLTK Toolkit. After running the Bigram Collocator, the program prints the top 100 most important word pairs, scored using Pointwise Mutual Information :
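To make the scoring concrete, here is a from-scratch sketch of PMI over adjacent word pairs; NLTK's BigramCollocationFinder computes an equivalent score with more machinery (the toy text and thresholds are assumptions):

```python
from collections import Counter
from math import log2

def pmi_bigrams(tokens, top=5, min_count=2):
    """Score adjacent word pairs with pointwise mutual information:
    PMI(x, y) = log2( N * c(x, y) / (c(x) * c(y)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens) - 1                       # number of bigram slots
    scored = {
        (x, y): log2(n * c / (unigrams[x] * unigrams[y]))
        for (x, y), c in bigrams.items() if c >= min_count
    }
    return sorted(scored, key=scored.get, reverse=True)[:top]

text = ("sudden hearing loss is idiopathic sudden hearing loss "
        "gene polymorphism sudden hearing loss").split()
print(pmi_bigrams(text, top=2))
```

The `min_count` filter matters in practice: PMI inflates the score of word pairs that happen to occur together only once.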


Let's try to "relax" our requirements by increasing the number of words that are fed to our analysis. Here are our new results :


We immediately notice the differences between the first analysis and this one : in the second instance we see many more potentially interesting word pairs (more Medical Conditions, Substances, etc. are shown) as opposed to the first set of results.

Let's suppose that we are interested in finding which gene(s) could be involved. To do this a Python function is used which scores the co-occurrence of a word of interest (in this case 'gene') with other words.
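A minimal sketch of such a scoring function, counting words that fall within a fixed window of the target word (the window size and tokenization are assumptions):

```python
from collections import Counter

def cooccurring(tokens, target, window=5):
    """Count words appearing within `window` tokens of each occurrence
    of `target` -- a simple stand-in for the scoring described above."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), i + window + 1
        for neighbour in tokens[lo:i] + tokens[i + 1:hi]:
            counts[neighbour] += 1
    return counts

text = "the mthfr gene polymorphism and the cyp gene variant".split()
print(cooccurring(text, "gene", window=2).most_common(3))
```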

Here are the results :



The result is not very useful. Nevertheless, it reminds us that we should probably replace polymorphisms with polymorphism.
We may then decide to relax the way in which bigrams are created : we increase the number of subsequent words that are searched and re-run the code :



The top-rated results from the 'gene' analysis return a term named MTHFR, which is in fact a gene called Methylenetetrahydrofolate reductase. The same happens with the occurrences of the genes in the bigram co-occurrence analysis just before our 'gene' inspection. We also notice that Co-enzyme Q10 (a well-known and popular supplement) shows at the top of the list. After a bit of searching within the PubMed entries, it was found that CoQ10 is used for the treatment of Sudden Hearing Loss, and also that CoQ10 was found in low concentrations in people having this condition.
We can use BioGraph to submit a query for sudden hearing loss and see which concepts are associated with this Condition :



So MTHFR was found on BioGraph as well; however, at the moment of writing CoQ10 was not in the list (not shown here because of its length). We submit a query to the same engine for CoQ10, filtering specifically for diseases :




Again -at the moment of writing- Sudden Hearing Loss was not found on BioGraph as an associated condition. Of course, it is not suggested here that BioGraph entries are incomplete. Different types of Analysis may be used, and the data that I used for this example were much more targeted to the specific problem (which in its own right should warrant extreme caution).

BioGraph is a wonderful resource (more on this later) which enables researchers to form several hypotheses (notice the Known and Inferred keywords in the results) with which new solutions to medical problems may be found.

The subject of the analysis was not random. For more than 2 years, a person whom I will not name had several incidents of Sudden Hearing Loss which -luckily- were not permanent. Several ENTs examined him and dismissed the events as "Too much Stress" and "Idiopathic", after making sure that no other problems (e.g. an acoustic neuroma) were present.

Upon further investigation the person was found to have an MTHFR C677T homozygous polymorphism, and additional testing revealed elevated Homocysteine levels. After administration of 5-Methyltetrahydrofolate (an activated form of Folic Acid - Levomefolic Acid) there were no further incidents of Sudden Hearing Loss.

The solution was originally found using BioGraph.