Wednesday, 11 December 2013

Venture Capitals in an Age of Algorithms (Revisited)


Some time ago I wanted to explore the idea of analyzing several kinds and sources of Information (e.g. TechCrunch, TheNextWeb, news sites and Twitter) to identify promising Investment opportunities in Technology and, more specifically, in Startups.

Here is a snapshot of a Webpage from TechCrunch:



In many posts on this Blog I have discussed how our Reactions to almost any kind of information are recorded. This was not possible when everyone read newspapers in paper form, whereas now any kind of Text is associated with a number of Views, Re-Tweets, Comments and Facebook "Likes".

The second important kind of information being generated is our Emotions about any Topic, as these are expressed in Comments, Twitter and Facebook posts. The intensity of our emotions is also captured, and this information is very important, since whatever we associate with intense emotions really stays within our psyche, fuels our interest and (usually) drives our purchase decisions.

We may then continue with some Exploratory work as follows: we can collect Posts from various Tech sources along with their associated Reactions, annotate the text with Sentiment, Events and Topics, and analyze this information to understand which Topics and/or Events appear to have an affinity for a high number of Reactions, or for high Sentiment intensity, around Startups or Tech Topics.

As an example, 10K posts from various Tech sources were collected and each post was marked as generating either HIGH or LOW interest, based on the amount of Reactions (Re-Tweets, FB Likes, Comments) that it generated. Special filtering is applied to the frequencies of the words that appear in each post:
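The labeling step described above could be sketched as follows. This is an illustrative Python analogue, not the author's actual pipeline (which used KNIME), and the field names (`retweets`, `fb_likes`, `comments`) are hypothetical; here a simple median split over total reactions decides HIGH vs LOW.

```python
# Label posts HIGH or LOW interest via a median split on total reactions.
# Field names are hypothetical; the original pipeline used KNIME, not Python.
from statistics import median

def label_posts(posts):
    """posts: list of dicts with 'retweets', 'fb_likes' and 'comments' counts."""
    totals = [p["retweets"] + p["fb_likes"] + p["comments"] for p in posts]
    cutoff = median(totals)
    return [
        {**p, "interest": "HIGH" if total > cutoff else "LOW"}
        for p, total in zip(posts, totals)
    ]

sample = [
    {"title": "A", "retweets": 120, "fb_likes": 300, "comments": 45},
    {"title": "B", "retweets": 3, "fb_likes": 10, "comments": 1},
    {"title": "C", "retweets": 40, "fb_likes": 60, "comments": 12},
]
labeled = label_posts(sample)  # A -> HIGH; B and C -> LOW
```

Any other cutoff rule (a fixed threshold, a top-quartile split) would slot in the same way; the median is used only to keep the sketch self-contained.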




Then this information is fed to KNIME for further analysis. The implementation shown here is rather naive and simplistic for several reasons: only keywords are used as input (as opposed to Topics and Events), and many other relevant parameters are ignored; these will be discussed later, but for our example we will keep things simple.

The workflow uses 3 algorithms, namely PART (so that some rules are generated), SMO and Random Forest:


 
This, again, is a very naive approach, which gave an F-Measure of 61.9% in identifying keywords that commonly appear in posts that generate Interest versus posts that do not. Keep in mind that this knowledge alone is not enough to base a decision on, but we can explore things a little further.
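For readers unfamiliar with the F-Measure quoted above, a minimal sketch of how it is computed (the harmonic mean of precision and recall) is shown below; the HIGH/LOW labels follow the example in the text, and the toy predictions are invented for illustration.

```python
# F-Measure: harmonic mean of precision and recall for the positive class.
def f_measure(y_true, y_pred, positive="HIGH"):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = f_measure(["HIGH", "HIGH", "LOW", "LOW"],
                  ["HIGH", "LOW", "HIGH", "LOW"])  # 0.5
```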

We may find that some words we expected do appear in posts of high interest (such as Google, Apple, Pinterest). There could, however, be some words that deserve more of our attention, such as Education and Schools, which during the analysis appeared more frequently in high-interest posts.

So how can this information be used for a potential investment in a Startup, and is there really a way to model new ideas and predict their performance? Again, it is not suggested here that if you come across a startup aimed at Education you should immediately put your money in, but this observation could be one parameter to consider. There are many other considerations, such as whether the idea is novel, how many competitors exist, who the people behind the Startup are, whether its founders have created a successful Startup in the past, which people have already invested in the particular Startup, what "buzz" the Startup has generated so far, and so on.

Whenever we read about a new startup there are some immediate thoughts going through our minds: Does this sound like a good idea? Is it applicable to me, and would it make my life easier? Is this idea truly disruptive or not? What does our "gut feeling" tell us?

We should always keep in mind that there are limitations to what Predictive Analytics can do but perhaps we can extract some hints that we may then use to make better decisions.

It was also interesting to read this post (hence the word "Revisited" in this post's title) on Gigaom regarding the same Subject. This is a fascinating area that I have started looking at, and there will be similar posts on this Subject in the future.


Thursday, 06 June 2013

Finding the Right Skillset for Big Data Jobs

Perhaps one of the key skills of a Data Scientist is the ability to collect and access data that are not readily available.

I was wondering about the trends in Job Postings, and more specifically which skills and qualities employers (or agencies) look for in a candidate for a job in "Big Data", so I decided to use R to answer this question.

Of course, one must first find Data (in this case Job Postings) so that they may be analyzed. This is possible using the scrapeR library in R to scrape content from websites that contain Job Postings. Once this is done, the tm package can be used to analyze thousands of Job Advertisements so that we may extract useful knowledge.

The analysis which you will see below is based on around one thousand Job Postings that contain the phrase "Big Data". Better pre-processing could help in getting better term co-occurrences, but here I aimed at presenting the application. Once the data are collected, we can start by looking at the frequency distribution of the words found (after removal of stop words):
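The counting step could be sketched as below. The post used R's tm package; this is a Python analogue, and the stop-word list is a tiny illustrative subset (note that "big" is filtered out, as in the chart that follows).

```python
# Word-frequency distribution over postings, after stop-word removal.
# Python analogue of the tm-based step; STOP_WORDS is a toy subset.
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "big"}

def word_frequencies(documents):
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

postings = [
    "Big Data engineer with Hadoop experience",
    "Experience with Java and Hadoop required",
]
freqs = word_frequencies(postings)  # e.g. freqs["hadoop"] == 2
```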

Note that the word 'big' is removed from the bar chart. Notice also how the term "experience" (which also includes occurrences of term "experienced") was frequently found in Big Data Job Postings.

Interestingly, the term "skill" (which also counts the term "skilled") is found way below in the frequency diagram.

Next we can use Text Analytics to find which words co-occur with topics of interest. We start by looking at which terms co-occur in Job Postings where Hadoop is mentioned:
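A minimal version of this co-occurrence count (the spirit of tm's association functions, though not the author's actual R code) could look like this: for every posting containing the target term, count every other term found alongside it.

```python
# Count which terms appear in the same posting as a target term.
# A simple document-level co-occurrence count; illustrative only.
import re
from collections import Counter

def cooccurring_terms(documents, target):
    counts = Counter()
    for doc in documents:
        tokens = set(re.findall(r"[a-z]+", doc.lower()))
        if target in tokens:
            counts.update(tokens - {target})
    return counts

postings = [
    "Hadoop and Java experience required",
    "Hadoop cluster administration, Python a plus",
    "Front-end developer, JavaScript and CSS",
]
assoc = cooccurring_terms(postings, "hadoop")
```

Dividing each count by the number of postings containing the target would turn these raw counts into co-occurrence rates, which are easier to compare across targets.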


Suppose that one wishes to better understand which skills are discussed along with the Java programming language :


When it comes to skills, it appears that communication skills are the important ones (as expected):

In the same manner we can:

- Find the frequencies of skills of interest (e.g. Java, Python, Ruby, NoSQL, Oracle DB) and generate trend charts for each of them.

- Run term co-occurrence analysis on the skills which are "good to have" or "preferred".

- Capture early trends on emerging skills (in the "Big Data" case, this could be Pig).
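The first of these ideas, counting skill mentions per month to feed a trend chart, could be sketched as follows. The `month` field and the skill list are hypothetical, and naive substring matching is used (so, for example, "java" would also match "javascript"; real matching would need word boundaries).

```python
# Count mentions of a fixed skill list per month, as input for trend charts.
# 'month' field and skill list are hypothetical; matching is naive substring.
from collections import defaultdict

SKILLS = ["java", "python", "ruby", "nosql", "oracle"]

def skill_trends(postings):
    """postings: list of dicts with 'month' and 'text' keys."""
    trends = defaultdict(lambda: defaultdict(int))
    for p in postings:
        text = p["text"].lower()
        for skill in SKILLS:
            if skill in text:
                trends[skill][p["month"]] += 1
    return trends

data = [
    {"month": "2013-01", "text": "Java developer, NoSQL experience"},
    {"month": "2013-02", "text": "Python and Java for Big Data"},
]
trends = skill_trends(data)
```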

The idea of analyzing Job Postings and CVs using Information Extraction (and then using Predictive Analytics once this information becomes structured) is quite interesting. The ability to extract inferred knowledge is also quite challenging: for example, could we infer from the text found in CVs:

- The total number of years of an Applicant's experience in Project Management, in case this has not been explicitly stated in his/her CV?

- Whether an Applicant shows coherent Career growth through the years?

- The years needed for an Applicant to move to a Managerial Position?


Tuesday, 19 February 2013

Personal Data Mining - (Part 2)

In the previous post I described how I captured a year's worth of personal data using my Smartphone, with the purpose of identifying trends in my immune system, which at times gave me perennial conjunctivitis and also swollen lymph nodes. Now it was time to analyze all of this data in the hope that some useful knowledge could be found.

I had to decide which tools to use. I used WEKA and also decided to give KNIME a try, so here is an example of a KNIME workflow:






I first use the File Reader to read in my year's worth of life data, followed by an R Script node that performs several data transformations. I then send one stream of Data to an R Script node which runs the FSelector package, applying several Feature Selection algorithms (about 10 of them) to get an understanding of which Features are important for the problem at hand.

Then another stream sends the Data to an R node which creates dummy Variables and then sends the transformed Features to a Linear Correlation node for further inspection.

A third stream (not shown) sends the data to 3 Machine-Learning algorithms (namely an SVM, Decision Tree and Random Forest) and the Scorer shows how each algorithm performed.

I first executed the FSelector node using 10-fold Cross-Validation, because I wanted to get a first feel for the features that are important in identifying some patterns about my perennial conjunctivitis. 7 out of 10 of the FSelector algorithms agreed that:

1) Vitamin D3
2) Garlic
3) Yoghurt

...appear to have the most predictive power. The problem is that at this point we do not know whether any of these features actually help or aggravate my condition. However, the output of FSelector gives an idea of which features should be looked at more closely.
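A much simpler stand-in for FSelector's ranking methods can illustrate the idea: score each binary feature by how much its presence shifts the rate of the outcome, and sort. The feature names and the toy daily records below are hypothetical, and this univariate filter is not one of the actual FSelector algorithms.

```python
# Rank binary features by the absolute shift in outcome rate when the
# feature is present vs absent. Toy stand-in for FSelector-style filters.
def rank_features(rows, features, outcome="conjunctivitis"):
    scores = {}
    for f in features:
        with_f = [r[outcome] for r in rows if r[f]]
        without_f = [r[outcome] for r in rows if not r[f]]
        rate = lambda xs: sum(xs) / len(xs) if xs else 0.0
        scores[f] = abs(rate(with_f) - rate(without_f))
    return sorted(scores, key=scores.get, reverse=True)

days = [  # hypothetical daily records: 1 = taken/present, 0 = not
    {"vitamin_d3": 1, "garlic": 0, "yoghurt": 0, "conjunctivitis": 0},
    {"vitamin_d3": 0, "garlic": 1, "yoghurt": 1, "conjunctivitis": 1},
    {"vitamin_d3": 0, "garlic": 1, "yoghurt": 0, "conjunctivitis": 1},
    {"vitamin_d3": 1, "garlic": 0, "yoghurt": 1, "conjunctivitis": 0},
]
ranking = rank_features(days, ["vitamin_d3", "garlic", "yoghurt"])
```

Like any univariate filter, this ignores interactions between features, which is exactly why running several different selection algorithms (as FSelector allows) and looking for agreement is a sensible approach.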

Then the second stream was run, namely the one which sends the data to the 3 machine learning algorithms, so that I could get a first feel for how the algorithms perform. All three algorithms gave an F-Score of around 59-62%.

By looking at the results, some patterns appeared to arise (note the word "appeared"):

1) A rather large daily dose (>5200 IU) of Vitamin D3 appears to be associated with fewer incidences of conjunctivitis.
2) Garlic consumption appears to increase my conjunctivitis incidences.
3) Yoghurt consumption appears to increase my conjunctivitis incidences.


For Pattern (1) we need to be aware that Vitamin D3 dosage has a compounding effect, so it is rather naive to think that Boolean logic applies (see the previous post for more).

Next I had to look at patterns (2) and (3). One of the things I realized when searching the web for the effects of various nutrients on the functions of the human body is that for any nutrient you can find several entries that sometimes contradict each other. My very brief web search found Garlic and Yoghurt to be "immune boosters". Of course, caution should be exercised in drawing any conclusions, because of the way the data have been collected and the problematic origin of the analysis. Moreover, I am not a doctor and I cannot possibly know whether Garlic or Yoghurt can aggravate an immune response in such a way.

I began taking Vitamin D3 and eliminated Garlic and Yoghurt from my diet. The result was that over a period of one month I stopped getting bouts of conjunctivitis and incidences of swollen lymph nodes. So has Vitamin D3 acted as an "immune response regulator", and Garlic and Yoghurt as "immune boosters"?


Although my bouts of conjunctivitis have ceased, I am not in any position to make any claims, because there are a lot of uncontrolled variables:

- It could be a placebo effect.
- There may be unknown hidden variables that are important.
- (My) Genetics.
- Environment.
- Variations in Dosage and Nutrient Content.
- Interactions between nutrients.

and lots of others that could not possibly be accounted for under these circumstances.

What I can say (and this is the reason for writing this post) is that analytics may help us identify several patterns that may then be used to guide a sound knowledge discovery process. If people had the ability to collect data on a daily basis (see Quantified Self) and then analyze it on a massive scale, several unknown patterns that call for closer investigation could emerge.