An often overlooked type of Analysis is Sequence Data Mining (or Sequential Pattern Mining).
Sequence Data Mining is a type of Analysis that aims to extract patterns from sequences of Events. We can also see Sequence Data Mining as Association Discovery with a Temporal Element.
Sequence Data Mining has many potential applications (Web Page Analytics, Complaint Events, Business Processes), but today we will show an application for Health. I believe that this type of Analysis will become even more important as wearable technology becomes more widely used and more Data of this kind is generated.
Consider the following hypothetical scenario :
A 30-year-old Male patient complains about several symptoms which, for simplicity, we will name Symptom1, Symptom2, Symptom3, etc.
His Doctor tries to identify what is going on, but the patient's Blood work shows no problems. After thorough evaluation the Doctor believes that his patient suffers from Chronic Fatigue Syndrome. Under the Doctor's supervision the patient will record his symptoms, along with the different supplements he takes, to understand more about his condition. Several events (e.g. a Visit to the Gym, a stressful Event) will also be taken into consideration to see if any patterns emerge.
-How can we easily record Data for the scenario above?
-Can we extract sequences of events that occur more frequently than we would expect by mere chance?
-Can we identify which sequences of Events / Food / Medication may potentially lead to specific Symptoms or to a lack of Symptoms?
Looking at the problem through the eyes of a Data Scientist, we have :
A series of Events that happen during a day : A Stressful event, A sedentary day, Cardio workouts, Weight Lifting, Abrupt Weather Deterioration, etc
A Number of Symptoms : Headaches, "Brain Fog", Mood problems, Insomnia, Arthralgia, etc.
Let's begin with Data Collection. We first suggest that the patient use an Android app called MyLogsPro (or some other equivalent application) to easily input information as it happens :
So if the patient feels a specific Symptom, he will press the relevant Symptom button on his mobile device. The same applies to any events that have happened and any Food or Medication taken. As the day passes we have the following data collected :
The snapshot shows what happened starting on the 20th of August 2014. Upon waking up, our patient logged the intake of Medication and/or Supplements (at 08:22), and a Food entry was added at 08:47. At 11:06 the patient had a Symptom, immediately reached for his phone and pressed the relevant Symptom button (Symptom No 4).
After many days of Data Collection we decide that it is time to analyze this information. We export the data from the application as a csv file which looks as follows :
We will use KNIME to read the csv file, transform the entries so that an Algorithm can read the events, and then perform Sequence Data Mining. We have the following layout :
The File Reader reads the .csv file. Then, in the Pre-processing block (shown in yellow), a String Manipulation node removes the colon (:) from the time field (e.g. 12:10 becomes 1210). The Sorter sorts the data by date first and by time second, and a Java snippet uses the replaceAll() function to remove all leading zeros from the Time field (e.g. 0010 becomes 10).
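For readers who do not use KNIME, the same three pre-processing steps can be sketched in plain Python. This is only an illustration, not the workflow used here: the sample rows are hypothetical, and the day/month/year date format is an assumption about the export.

```python
import csv
import io
from datetime import datetime

# Hypothetical rows in the exported layout: date, time, event.
RAW = """\
10/09/14,08:05,Symptom2
10/09/14,00:10,Food1
09/09/14,21:30,Medication1
10/09/14,08:00,Medication1
"""


def preprocess(text, date_format="%d/%m/%y"):
    """Return (date, time_as_int, event) rows sorted by date, then time.

    The date format is an assumption; adjust it to match the real export.
    """
    rows = []
    for date_str, time_str, event in csv.reader(io.StringIO(text)):
        day = datetime.strptime(date_str, date_format).date()
        # "08:05" -> "0805" -> 805. int() drops the leading zeros and,
        # unlike a plain regex replacement, turns "0000" into 0.
        time_key = int(time_str.replace(":", ""))
        rows.append((day, time_key, event))
    # Sort by date first and by time second, as the Sorter node does.
    rows.sort(key=lambda row: (row[0], row[1]))
    return rows


rows = preprocess(RAW)
for day, time_key, event in rows:
    print(day.isoformat(), time_key, event)
```

Parsing the date before sorting matters: sorting the raw text "10/09/14" alphabetically would not give chronological order across months.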
The R Snippet loads the cSPADE Algorithm and then uses it to extract sequential patterns.
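To make the idea of "support" for a sequence concrete, here is a deliberately naive Python sketch. It is not cSPADE: it only counts ordered pairs of events by brute force, and it assumes that each day forms one sequence. The log entries and the 0.5 threshold are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical pre-processed log: (day, time, event).
EVENTS = [
    ("2014-09-09", 800, "Medication1"),
    ("2014-09-09", 1105, "Symptom2"),
    ("2014-09-10", 800, "Medication1"),
    ("2014-09-10", 830, "Food1"),
    ("2014-09-10", 2105, "Symptom2"),
    ("2014-09-11", 900, "Food1"),
    ("2014-09-11", 1300, "Symptom2"),
]


def daily_sequences(events):
    """Group events into one time-ordered sequence per day (an assumption)."""
    by_day = defaultdict(list)
    for day, _time_key, event in sorted(events):
        by_day[day].append(event)
    return dict(by_day)


def occurs_in_order(sequence, first, second):
    """True if `first` appears somewhere before `second` in `sequence`."""
    if first not in sequence:
        return False
    start = sequence.index(first)
    return second in sequence[start + 1:]


def frequent_pairs(sequences, min_support):
    """Support of every ordered pair, keeping those at or above the threshold."""
    items = sorted({event for seq in sequences.values() for event in seq})
    total = len(sequences)
    result = {}
    for first, second in permutations(items, 2):
        hits = sum(
            occurs_in_order(seq, first, second) for seq in sequences.values()
        )
        support = hits / total
        if support >= min_support:
            result[(first, second)] = support
    return result


pairs = frequent_pairs(daily_sequences(EVENTS), min_support=0.5)
for (first, second), support in sorted(pairs.items()):
    print(f"{first}->{second}: support {support:.2f}")
```

A real sequence miner avoids this exhaustive enumeration and handles longer patterns, which is why a dedicated algorithm is used in the workflow.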
After executing the stream we get the following output :
We immediately notice an interesting entry on the first output :
Medication1->Symptom2
and on the second output we see that this particular rule has a lift of 1.4 and a confidence of 0.8.
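As a reminder of how to read these two numbers, the short Python sketch below uses the usual definitions: confidence is the support of the rule divided by the support of its left-hand side, and lift is the confidence divided by the support of the right-hand side. The day counts are hypothetical, chosen only to reproduce the reported values; they are not taken from the actual data set.

```python
def confidence(support_rule, support_antecedent):
    """Share of sequences containing the antecedent that also match the rule."""
    return support_rule / support_antecedent


def lift(rule_confidence, support_consequent):
    """How much more often the consequent follows than its base rate suggests."""
    return rule_confidence / support_consequent


def implied_consequent_support(rule_confidence, rule_lift):
    """Base rate of the consequent implied by a confidence/lift pair."""
    return rule_confidence / rule_lift


# Hypothetical counts over 35 logged days (illustration only).
days = 35
days_with_medication1 = 25
days_with_symptom2 = 20
days_with_medication1_then_symptom2 = 20

conf = confidence(days_with_medication1_then_symptom2 / days,
                  days_with_medication1 / days)
rule_lift = lift(conf, days_with_symptom2 / days)
print(round(conf, 2), round(rule_lift, 2))

# Working backwards from the reported figures:
print(round(implied_consequent_support(0.8, 1.4), 3))  # 0.571
```

Read this way, a confidence of 0.8 with a lift of 1.4 implies that Symptom2 appears in roughly 57% of all sequences, and in 80% of those where Medication1 came first.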
However, as Data Scientists we should always double-check the extracted knowledge and must be aware of pitfalls. Let's see some examples (list not exhaustive) :
1) The algorithm does not account for time as it should : As an example, consider the following entries :
10/09/14,08:00,Medication1
10/09/14,08:05,Symptom2
We assume that Medication1 is taken by mouth and needs 60 minutes to be properly dissolved, and that these entries occur frequently enough in that order in our data set. Even though the algorithm might show a statistically significant pattern, it is not plausible to hypothesize that Medication1 is related to a Symptom2 that appears only 5 minutes later. The Analyst should first examine these entries to see what proportion of them has a time difference of, say, 60 minutes or more.
Apart from the example shown above, we must also consider the opposite effect. Consider these entries :
10/09/14,08:00,Medication1
...
...
...
10/09/14,21:05,Symptom2
In other words : Is it possible for a Medication taken in the morning to generate a Symptom about 13 hours later?
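Both checks in this pitfall come down to measuring the time gap between the two events. A minimal Python sketch of such a check follows. The log entries, the day/month/year date format and the 720-minute upper bound are all hypothetical; only the 60-minute lower bound comes from the example above.

```python
from datetime import datetime

# Hypothetical log entries: (date, time, event).
LOG = [
    ("10/09/14", "08:00", "Medication1"),
    ("10/09/14", "08:05", "Symptom2"),
    ("11/09/14", "08:00", "Medication1"),
    ("11/09/14", "09:30", "Symptom2"),
    ("12/09/14", "07:45", "Medication1"),
    ("12/09/14", "21:05", "Symptom2"),
]


def gaps_in_minutes(log, first, second, fmt="%d/%m/%y %H:%M"):
    """Minutes from each `first` event to the next `second` event that day."""
    events = sorted(
        (datetime.strptime(f"{day} {time}", fmt), name)
        for day, time, name in log
    )
    gaps = []
    pending = None  # timestamp of the most recent unmatched `first` event
    for stamp, name in events:
        if name == first:
            pending = stamp
        elif (name == second and pending is not None
              and pending.date() == stamp.date()):
            gaps.append((stamp - pending).total_seconds() / 60)
            pending = None
    return gaps


def share_within(gaps, min_minutes, max_minutes=None):
    """Proportion of gaps that fall inside the plausible time window."""
    if not gaps:
        return 0.0
    kept = [
        gap for gap in gaps
        if gap >= min_minutes and (max_minutes is None or gap <= max_minutes)
    ]
    return len(kept) / len(gaps)


gaps = gaps_in_minutes(LOG, "Medication1", "Symptom2")
print(gaps)
print(share_within(gaps, min_minutes=60))
print(share_within(gaps, min_minutes=60, max_minutes=720))
```

If only a small share of the matched pairs falls inside a plausible window, the extracted pattern deserves extra scepticism. The appropriate window itself is a question for the Doctor, not for the algorithm.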
2) The algorithm is not able to account for the compounding effect of a Medication. For example, the patient might have low levels of Taurine, and a certain number of days of Taurine supplementation may be needed before this level is replenished. The algorithm cannot account for this possibility.
3) The patient should also input entries of "No Symptoms". It is not clear, however, when this should be done (e.g. at the end of each day? Or assess every 6 hours and add entries accordingly?)
However, this does not mean that a Sequence Mining algorithm should not be used under these circumstances. This technique can generate several potentially interesting hypotheses which Doctors and/or Researchers may wish to pursue further.












