Wednesday, February 16, 2011

7th Annual Text Analytics Summit



I would like to say a few words about an upcoming major event for everyone interested in Text Analytics and its various uses in Social Media, Marketing, and Business Intelligence. Starting on May 18th, the annual Text Analytics Summit (the only conference dedicated entirely to Text Mining) will take place in Boston, MA, with a total of 28 speakers presenting material on applications of Text Analytics, including:




  • Social Media Analytics
  • Sentiment Analysis
  • Voice of the Customer
  • Marketing
  • Semantics


Well-known names in the industry will be there (Seth Grimes, Tom Anderson, Gregory Piatetsky-Shapiro, Ronen Feldman), as well as experts from companies such as SAS, IBM, Forrester Research, Attensity, Adobe, J.D. Power & Associates, Clarabridge, and others.


My presentation will be about Behavior Mining in Social Media using Text Analytics, and I will be giving a step-by-step tutorial on the analysis of data originating from Twitter regarding a major electronics brand in the USA. More specifically, I will:

  • Show how Tweets can be transformed and then analyzed using various statistical NLP techniques and software.
  • Discuss the various problems encountered when analyzing Text data.
  • Discuss and introduce new ways of seeking out valuable information and extracting insights when it comes to Mining Behavior.

Although the Case Study will use data from Twitter, the techniques shown can be applied to any other text data, such as Facebook posts, blog posts, user comments, etc.

I am looking forward to seeing the work done by others, learning about successful applications of Text Analytics and the knowledge gained from them, and also seeing the issues that professionals come across and how they address them.

Thursday, February 10, 2011

Forex Trading with R: Part 2

In the previous post, the first steps were taken toward building the basis for trading forex. Now it is time to build the actual classifiers that can give us future buy / hold / sell signals.

Assuming that everything is in working order and the instructions given in the previous post were followed, we can start building these classifiers.

First, let's train a Neural Network. The following commands train a Neural Network, apply the trained model to our test data, and output the predictions for buy/sell/hold signals:

library(nnet)

set.seed(134)
nn <- nnet(class ~ ., traindata, size = 3, rang = 0.1, decay = 0.001, maxit = 3000, trace = FALSE)
table(actual = testdata$class, predicted = predict(nn, newdata = testdata, type = "class"))

Note that a seed number was used. You should either try different seed numbers (so that the network weights are re-initialized) or omit the set.seed() directive. You should also experiment with other Neural Net parameters, such as the number of iterations (maxit), the weight decay (decay), etc.
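As a minimal sketch of that seed experiment, the loop below retrains the network under several seeds and compares test accuracy. The built-in iris data set stands in for the forex features here, and the seed values are arbitrary:

```r
library(nnet)

# Hypothetical stand-in data: iris in place of the forex feature set.
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Retrain under several seeds so the network weights are re-initialized,
# then record out-of-sample accuracy for each run.
accuracies <- sapply(c(134, 271, 8, 42), function(s) {
  set.seed(s)
  fit <- nnet(Species ~ ., train, size = 3, rang = 0.1,
              decay = 0.001, maxit = 3000, trace = FALSE)
  mean(predict(fit, newdata = test, type = "class") == test$Species)
})
accuracies
```

The spread of these accuracies gives a rough feel for how sensitive the network is to its random initialization.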


The confusion matrix gives us the information needed to calculate the TP, FP, TN, and FN rates for each class (i.e., for each signal type).
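For instance, the per-class counts and rates can be read off a confusion matrix like this (the signal labels below are made up for illustration, not real forex output):

```r
# Hypothetical actual and predicted signal labels.
actual    <- factor(c("buy", "buy", "hold", "sell", "hold", "sell", "buy", "hold"))
predicted <- factor(c("buy", "hold", "hold", "sell", "hold", "buy", "buy", "sell"))
cm <- table(actual = actual, predicted = predicted)

TP <- diag(cm)          # correct predictions per class
FN <- rowSums(cm) - TP  # instances of the class that were missed
FP <- colSums(cm) - TP  # instances wrongly assigned to the class

recall    <- TP / (TP + FN)
precision <- TP / (TP + FP)
round(rbind(recall, precision), 2)
```

The same arithmetic applies directly to the `table(actual = ..., predicted = ...)` output above.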


Similarly, we can train and test a Random Forest:


library(randomForest)

rf.model <- randomForest(class ~ ., data = traindata, nodesize = 40, importance = FALSE, mtry = 3, ntree = 100)
table(actual = testdata$class, predicted = predict(rf.model, newdata = testdata, type = "class"))



Now let's train an SVM on our data. We can issue the following commands:

library(e1071)

### train an SVM
sv <- svm(class ~ ., traindata, gamma = 0.01, cost = 5, kernel = "radial")

To see how the classifier did on the test set, we enter :

table(actual = testdata$class, predicted = predict(sv, newdata = testdata))


Next, we can try to optimize the parameters of the SVM classifier as follows:


# find optimal values of gamma and cost for an RBF SVM classifier
tuned <- tune(svm, class ~ ., data = traindata,
              ranges = list(gamma = c(0.0001, 0.001, 0.05, 0.1, 0.2, 0.3),
                            cost = c(1, 5, 10, 20, 50, 100, 120, 130)),
              tunecontrol = tune.control(sampling = "cross", cross = 10))


tuned


The first command uses 10-fold cross-validation to identify the best gamma and cost parameters among some predetermined values. We then issue the command tuned to see which combination of parameters gives us the lowest classification error. Knowing these values, we can train an SVM classifier with them and see how this model performs (as was shown previously).
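A compact sketch of that retraining step, run end-to-end on the built-in iris data as a hypothetical stand-in for the forex set (with a deliberately small parameter grid so it runs quickly):

```r
library(e1071)

# Tune over a small hypothetical grid using cross-validation.
set.seed(1)
tuned <- tune(svm, Species ~ ., data = iris,
              ranges = list(gamma = c(0.01, 0.1), cost = c(1, 10)),
              tunecontrol = tune.control(sampling = "cross", cross = 5))

tuned$best.parameters   # gamma / cost combination with the lowest error

# Retrain an SVM with the winning parameters.
best.sv <- svm(Species ~ ., iris,
               gamma  = tuned$best.parameters$gamma,
               cost   = tuned$best.parameters$cost,
               kernel = "radial")
table(actual = iris$Species, predicted = predict(best.sv, newdata = iris))
```

Note that tune() also stores a ready-made fit in tuned$best.model, which can be used directly instead of retraining by hand.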

Be aware of the following key points:


  1. Three sets of data should be used: Training, Test, and Validation. The Validation set should not be part of the optimization (i.e., finding the best algorithm parameters) process.
  2. Make sure that you create classifiers for several time periods. Test the performance of each classifier according to the percentage of available data you use for training / testing / validation and the number of periods you use for the sliding window.
  3. Also make sure that, once you have chosen your model, you test your system correctly by simulating buy / hold / sell signals and taking into account all associated trading costs.
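The sliding-window idea in point 2 can be sketched in a few lines of base R. The function name, window sizes, and row counts below are all hypothetical choices for illustration:

```r
# Hypothetical walk-forward split: each step trains on train_n consecutive
# rows and tests on the next test_n rows, sliding forward by step rows.
walk_forward <- function(n, train_n, test_n, step) {
  starts <- seq(1, n - train_n - test_n + 1, by = step)
  lapply(starts, function(s) list(
    train = s:(s + train_n - 1),
    test  = (s + train_n):(s + train_n + test_n - 1)
  ))
}

splits <- walk_forward(n = 100, train_n = 60, test_n = 10, step = 10)
length(splits)     # number of retraining periods
splits[[1]]$test   # first out-of-sample window
```

Each element of splits gives row indices for one retraining period; a final, untouched slice of the data should still be held out as the Validation set.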