Thursday, 06 June 2013

Finding the Right Skillset for Big Data Jobs

Perhaps one of the key skills of a Data Scientist is the ability to collect and access data that are not readily available.

I was wondering about the trends in Job Postings, and more specifically which skills and qualities employers (or agencies) look for in a candidate for a job in "Big Data", so I decided to use R to answer this question.

Of course, one must first find data (in this case Job Postings) to analyze. This is possible with the R library scrapeR, which scrapes content from websites that contain Job Postings. Once this is done, the tm package can be used to analyze thousands of Job Advertisements and extract useful knowledge.
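The scraping step boils down to turning each downloaded page into plain text before any text mining happens. The post did this in R with scrapeR; the sketch below is a Python stand-in that uses only the standard library, and the class name, function name and sample HTML are all hypothetical, not taken from the original analysis.

```python
from html.parser import HTMLParser


class PostingTextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def posting_text(html):
    """Return the visible text of one job posting page as a single string."""
    parser = PostingTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content; a real run would download this from a job site.
sample_html = """<html><head><style>p {color: red}</style></head>
<body><h1>Big Data Engineer</h1>
<p>Experience with <b>Hadoop</b> and Java required.</p>
<script>track();</script></body></html>"""

print(posting_text(sample_html))
```

Real job sites usually need site-specific selectors to isolate the advertisement from navigation and footer text, which this sketch does not attempt.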

The analysis below is based on around one thousand Job Postings that contain the phrase "Big Data". Better pre-processing could yield better term co-occurrences, but here I aimed at presenting the application. Once the data are collected, we can start by looking at the frequency distribution of the words found (after removal of stop words):

Note that the word 'big' is removed from the bar chart. Notice also how the term "experience" (which also includes occurrences of the term "experienced") was frequently found in Big Data Job Postings.

Interestingly, the term "skill" (which also counts the term "skilled") appears much lower in the frequency diagram.
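For readers who want to reproduce this kind of frequency count, here is a minimal Python sketch of the idea (the original charts were made in R with tm). The stop-word list, the merge table that folds "experienced" into "experience" and "skilled" into "skill", and the three sample postings are illustrative assumptions, much smaller than what a real run would use.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list; 'big' is dropped as in the bar chart.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "with", "for", "to", "big"}

# Assumption: a hand-written merge table standing in for real stemming.
MERGE = {"experienced": "experience", "skilled": "skill", "skills": "skill"}


def term_frequencies(postings):
    """Count terms across all postings after merging variants and dropping stop words."""
    counts = Counter()
    for text in postings:
        for token in re.findall(r"[a-z]+", text.lower()):
            token = MERGE.get(token, token)
            if token not in STOP_WORDS:
                counts[token] += 1
    return counts


# Hypothetical postings, for illustration only.
postings = [
    "Big Data engineer with Hadoop experience",
    "Experienced Java developer for Big Data team",
    "Skilled analyst, experience with Hadoop and data pipelines",
]

freq = term_frequencies(postings)
for term, n in freq.most_common(3):
    print(term, n)
```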

Next we can use Text Analytics to find which words co-occur with topics of interest. We start by looking at which terms co-occur in Job Postings where Hadoop is mentioned :
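One simple way to approximate such a co-occurrence view is to take only the postings that mention the target term and count in how many of them each other term appears. The Python sketch below does exactly that; it is a plain document-level count and not necessarily the measure behind the original charts. The stop-word list and sample postings are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "with", "for", "to"}


def tokenize(text):
    """Return the set of distinct lower-cased terms in one posting."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS


def cooccurring_terms(postings, target):
    """For postings that mention `target`, count how many contain each other term."""
    target = target.lower()
    counts = Counter()
    for text in postings:
        terms = tokenize(text)
        if target in terms:
            counts.update(terms - {target})
    return counts


# Hypothetical postings, for illustration only.
postings = [
    "Hadoop and Hive experience required",
    "Java developer with Hadoop and MapReduce experience",
    "Java web developer with Spring",
]

co = cooccurring_terms(postings, "hadoop")
print(co.most_common(5))
```

Calling `cooccurring_terms(postings, "java")` or `cooccurring_terms(postings, "skill")` gives the same kind of view for other topics of interest.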


Suppose that one wishes to better understand which skills are discussed along with the Java programming language :


When it comes to skills, it appears that communication skills are the important ones (as expected):

In the same manner we can:

-Find the frequencies of skills of interest (e.g. Java, Python, Ruby, NoSQL, Oracle DB) and generate trend charts for each of them.

-Run term co-occurrence analysis on the skills which are "good to have" or "preferred".

-Capture early trends on emerging skills (in the "Big Data" case, this could be Pig)
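The first and last ideas in this list could be sketched as follows, assuming each posting comes with a posting date (something the scraper would have to capture). The skill list, the function name and the sample data are hypothetical. Matching on word boundaries keeps "JavaScript" from being counted as "Java".

```python
import re
from collections import defaultdict

# Assumption: the skills to track; extend as needed.
SKILLS = ["java", "python", "ruby", "nosql", "pig"]


def skill_trends(postings, skills=SKILLS):
    """Return {skill: {"YYYY-MM": share of that month's postings mentioning it}}."""
    totals = defaultdict(int)
    hits = {s: defaultdict(int) for s in skills}
    patterns = {
        s: re.compile(r"\b" + re.escape(s) + r"\b", re.IGNORECASE) for s in skills
    }
    for posted_on, text in postings:
        month = posted_on[:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        totals[month] += 1
        for skill, pattern in patterns.items():
            if pattern.search(text):
                hits[skill][month] += 1
    return {
        s: {m: hits[s][m] / totals[m] for m in sorted(totals)} for s in skills
    }


# Hypothetical (date, text) pairs, for illustration only.
postings = [
    ("2013-04-02", "Java and Hadoop engineer"),
    ("2013-04-15", "JavaScript front-end developer"),
    ("2013-05-03", "Hadoop engineer with Pig and Java"),
    ("2013-05-20", "Data analyst, Python and Pig scripts"),
]

trends = skill_trends(postings)
print(trends["pig"])
```

A share that rises from month to month for a rarely mentioned skill is the kind of early signal the last bullet refers to, although with few postings per month such shares are noisy.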

The idea of analyzing Job Postings and CVs using Information Extraction (and then applying Predictive Analytics once this information becomes structured) is quite interesting. Extracting inferred knowledge is also quite challenging. For example, could we infer from the text found in CVs:

-The total number of years of Project Management experience of an Applicant, in case this has not been explicitly stated in his/her CV?

-Whether an Applicant shows coherent Career growth through the years?

-The years needed for an Applicant to move to a Managerial Position?
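To make the first and third questions concrete, here is a deliberately naive Python sketch. It assumes a CV has already been reduced to lines of the form "start year - end year: job title", which is a strong assumption, since real CVs need proper Information Extraction to get that far. The pattern, function names and sample CV are hypothetical.

```python
import re

# Assumption: each role line looks like "2006 - 2009: Project Manager, Acme Ltd".
ROLE_LINE = re.compile(
    r"(?P<start>\d{4})\s*-\s*(?P<end>\d{4})\s*[:,]?\s*(?P<title>.+)"
)


def parse_roles(cv_lines):
    """Return (start_year, end_year, title) for every line matching the assumed format."""
    roles = []
    for line in cv_lines:
        match = ROLE_LINE.search(line)
        if match:
            roles.append(
                (int(match.group("start")), int(match.group("end")), match.group("title"))
            )
    return roles


def years_in_role(cv_lines, keyword):
    """Sum the durations of roles whose title contains `keyword` (overlaps are not merged)."""
    return sum(
        end - start
        for start, end, title in parse_roles(cv_lines)
        if keyword.lower() in title.lower()
    )


def years_until_first(cv_lines, keyword):
    """Years from the earliest role to the first role containing `keyword`, or None."""
    roles = parse_roles(cv_lines)
    matching = [start for start, _, title in roles if keyword.lower() in title.lower()]
    if not roles or not matching:
        return None
    return min(matching) - min(start for start, _, _ in roles)


# Hypothetical CV, for illustration only.
cv = [
    "2003 - 2006: Software Developer, Acme Ltd",
    "2006 - 2009: Project Manager, Acme Ltd",
    "2009 - 2013: Senior Project Manager, Example Corp",
]

print(years_in_role(cv, "project manager"))
print(years_until_first(cv, "manager"))
```

The hard part in practice is everything this sketch skips: recognizing roles written in free text, handling "present" as an end date, and deciding which titles actually count as management.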

