I decided to take my first steps in learning Python with the following task: retrieve all PubMed entries that match a query and then analyze those entries using Python and the text-mining library NLTK.
Let's assume that we are interested in learning more about a condition called Sudden Hearing Loss. Sudden Hearing Loss is considered a medical emergency and has several possible causes, although usually it is idiopathic (according to Wikipedia, a disease or condition whose cause is not known or that arises spontaneously).
At the time of writing, the PubMed query for sudden hearing loss returns 2919 entries:
To collect this information from PubMed we can use the Entrez module from Biopython, which lets us save the Title (column A) and the Abstract (column B) of each PubMed entry to a CSV file:
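A minimal sketch of this retrieval step, assuming Biopython is installed and that titles and abstracts are read from the MEDLINE fields `TI` and `AB`. The function names, the `max_results` default and the e-mail argument are illustrative choices, not taken from the original script, and a very large result set may need to be fetched in batches.

```python
import csv


def fetch_medline_records(query, email, max_results=3000):
    """Search PubMed and return an iterator over MEDLINE records (dict-like)."""
    # Imported lazily so the CSV helper below also works without Biopython.
    from Bio import Entrez, Medline

    Entrez.email = email  # NCBI asks callers to identify themselves
    search = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    ids = Entrez.read(search)["IdList"]
    search.close()
    fetch = Entrez.efetch(db="pubmed", id=",".join(ids),
                          rettype="medline", retmode="text")
    return Medline.parse(fetch)


def write_title_abstract_csv(records, path):
    """Write Title (column A) and Abstract (column B); return rows written."""
    written = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for record in records:
            title = record.get("TI", "")
            abstract = record.get("AB", "")
            if not title and not abstract:
                continue  # skip records that carry neither field
            writer.writerow([title, abstract])
            written += 1
    return written
```

A run could then look like `write_title_abstract_csv(fetch_medline_records("sudden hearing loss", "you@example.org"), "pubmed.csv")`, where the e-mail address and file name are placeholders.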
Now that we have the PubMed entries in place, we can start the text analysis with NLTK. The Python code reads the CSV file that was created, removes stop words, and uses a simple function to search for and replace specific words. For this type of data it is a good idea to replace occurrences of:
genes to gene,
induced, induces, inducing to induction,
antagonists to antagonist,
agonists to agonist,
etc.
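The pre-processing described above can be sketched as follows. The replacement table contains only the examples listed in the text, and the stop list is a tiny hand-written stand-in (an assumption) for a fuller one such as NLTK's English stop words; the names `REPLACEMENTS`, `STOP_WORDS` and `preprocess` are hypothetical.

```python
import re

# Replacement table built from the examples in the text; extend as needed.
REPLACEMENTS = {
    "genes": "gene",
    "induced": "induction",
    "induces": "induction",
    "inducing": "induction",
    "antagonists": "antagonist",
    "agonists": "agonist",
}

# Tiny illustrative stop list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "by", "in", "is", "of", "the", "to",
              "was", "were", "with"}

# Lower-case words and numbers, keeping internal hyphens (e.g. "co-enzyme").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def preprocess(text):
    """Lower-case, tokenize, drop stop words and normalize selected words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [REPLACEMENTS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]
```

Replacing whole tokens rather than substrings means that "antagonists" is never touched by the rule for "agonists".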
This pre-processing makes it easier to retrieve (possibly) interesting findings. For this example we want to find collocations of terms, and to do this we use the BigramCollocationFinder from NLTK. After running the collocation finder, the program prints the top 100 word pairs scored using Pointwise Mutual Information (PMI):
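To make the scoring concrete, here is a hand-rolled sketch of PMI over adjacent word pairs. It is a plain-Python stand-in for what a collocation finder computes, not NLTK's own implementation, and the `min_freq` filter is an assumption added because PMI tends to favor very rare pairs. It uses the common definition PMI(x, y) = log2(count(x, y) * N / (count(x) * count(y))), with N the number of tokens.

```python
import math
from collections import Counter


def top_bigrams_by_pmi(tokens, min_freq=2, top_n=100):
    """Return up to top_n ((w1, w2), pmi) tuples for adjacent word pairs."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (first, second), joint in pair_counts.items():
        if joint < min_freq:
            continue  # ignore pairs seen too rarely to be trusted
        pmi = math.log2(joint * total / (word_counts[first] * word_counts[second]))
        scored.append(((first, second), pmi))
    # Highest PMI first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]
```

Lowering `min_freq` lets more word pairs into the ranking, at the cost of noisier results. When scoring many abstracts, it would be cleaner to count pairs per abstract so that no pair spans two documents.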
Let's try to "relax" our requirements by increasing the number of words that are fed to our analysis. Here are our new results:
We immediately notice the difference between the first analysis and this one: the second run shows many more potentially interesting word pairs (more medical conditions, substances, etc.) than the first set of results.
Let's suppose that we are interested in finding which gene(s) could be involved. To do this, a Python function scores the co-occurrence of a word of interest (in this case 'gene') with other words.
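One way such a function could look is sketched below. It is an assumption about the approach, not the original code: the score is a raw count of how often each word appears within `window` tokens on either side of the word of interest, and `cooccurring_words` is a hypothetical name.

```python
from collections import Counter


def cooccurring_words(documents, target, window=1):
    """Count words found within `window` tokens of `target`.

    `documents` is an iterable of token lists (one list per abstract).
    Returns (word, count) pairs, most frequent first.
    """
    counts = Counter()
    for tokens in documents:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(w for w in neighbours if w != target)
    return counts.most_common()
```

Raising `window` widens the span of words searched around each occurrence, which surfaces more candidate terms but also more noise.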
Here are the results:
The result is not very useful. Nevertheless, it reminds us that we should probably replace polymorphisms with polymorphism.
We may then decide to relax the way in which bigrams are created once more: we increase the number of subsequent words that are searched and re-run the code:
The top-rated results from the 'gene' analysis include a term named MTHFR, which is the gene encoding methylenetetrahydrofolate reductase. MTHFR also appears in the bigram collocation results produced just before our 'gene' inspection. We also notice that Co-enzyme Q10 (a well-known and popular supplement) shows up at the top of the list. A bit of searching within the PubMed entries showed that CoQ10 has been used for the treatment of Sudden Hearing Loss and that low concentrations of CoQ10 were found in people with this condition.
We can use BioGraph to submit a query for sudden hearing loss and see which concepts are associated with this condition:
So MTHFR was found on BioGraph as well; however, at the time of writing CoQ10 was not in the list (the full list is not shown because of its length). We submit a query to the same engine for CoQ10, this time filtering specifically for diseases:
Again, at the time of writing, Sudden Hearing Loss was not found on BioGraph as an associated condition. Of course, it is not suggested here that BioGraph entries are incomplete. Different types of analysis may be used, and the data that I used for this example were much more targeted to the specific problem (which in its own right should warrant extreme caution).
BioGraph is a wonderful resource (more on this later) that enables researchers to form hypotheses (notice the Known and Inferred keywords in the results) from which new solutions to medical problems may be found.
The subject of the analysis was not chosen at random. For more than 2 years a person whose identity I will not disclose had several incidents of Sudden Hearing Loss which, luckily, were not permanent. He consulted several ENT specialists, who dismissed these events as "Too much Stress" and "Idiopathic" after making sure that no other problems (e.g. acoustic neuroma) were present.
Upon further investigation the person was found to have a homozygous MTHFR C677T polymorphism, and additional testing revealed elevated homocysteine levels. After administration of 5-methyltetrahydrofolate (levomefolic acid, an activated form of folic acid) there were no further incidents of Sudden Hearing Loss.
The solution was originally found using BioGraph.







