Senin, 28 November 2011

New Insights from Text Analytics

Text Analytics has gained the attention it deserves in the past few years. Sentiment Analysis is perhaps the most frequently discussed type of analysis, but there will always be new ways to analyze and gain insights from text data.

Two examples of new types of analysis - and they have vast potential, in my opinion - are Sequence Detection and Concept Mining. I am not aware of these types of analysis currently being implemented by any Text Mining practitioner; if you know of one, feel free to add your comments below.


So what are Sequence Detection and Concept Mining? Some examples:

Suppose that you receive several e-mails from customers similar to the one seen below:

"I have been trying repeatedly to solve my billing problem through customer care. I first talked with someone called Mrs Jane Doe. She said she would transfer my call to another representative from the sales department. Yet another rep from the sales department informed me that I should be talking with the Billing department instead. Unfortunately my bad experience of being transferred through various representatives was not over, because the Billing department informed me that I should speak to the......"

Current Text Analytics software will identify key elements of the above text, but a very important piece of information goes unnoticed: the sequence of events that takes place:

 (Jane Doe => Sales Dept => Billing Dept => ...)


Being able to detect the sequence of events is an important element in understanding customer interaction. In our example above, imagine the possibility of detecting similar sequences through thousands of e-mails or call center transcripts and running a sentiment analysis, a process which then could correlate sentiment with specific event sequences.
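The kind of sequence extraction described above could be sketched as follows. The touchpoint list and the e-mail text are illustrative assumptions; a real system would rely on the entities identified by the Text Analytics step rather than fixed strings:

```python
import re

# Illustrative list of touchpoints to detect; in practice these would come
# from the entity annotations produced by the Text Analytics software.
TOUCHPOINTS = ["customer care", "sales department", "billing department"]

def extract_sequence(text):
    """Return the ordered sequence of touchpoint mentions found in an e-mail."""
    text = text.lower()
    hits = []
    for name in TOUCHPOINTS:
        for match in re.finditer(re.escape(name), text):
            hits.append((match.start(), name))
    # Sort by position in the text to recover the order of events.
    return [name for _, name in sorted(hits)]

email = ("I have been trying to solve my billing problem through customer care. "
         "My call was transferred to the sales department, which then sent me "
         "to the billing department instead.")
print(extract_sequence(email))
# → ['customer care', 'sales department', 'billing department']
```

Running this over thousands of e-mails would yield one event sequence per message, which could then be grouped and correlated with sentiment scores.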

Next is the usage of Concept Mining (a phrase I coined for this post): being able to analyze information at different conceptual levels. It is a very powerful technique indeed, and let's see why.

People who attended the 7th annual Text Analytics Summit in Boston had the opportunity to listen to several presentations regarding Semantics. The discussions between the experts on the Semantics Panel and the attendees revealed that people could not find Semantics practical for several reasons. Yet in Semantics lies the power of finding patterns at different conceptual levels.

As a - very basic - example, if we use Information Extraction to annotate - say - Tweets containing mentions of American Telcos, we can tag each one with a more general category called TELCOS. We can also tag individual prepaid packages with a more general category called PREPAID_PACKAGES. By doing that, we can search for patterns at a more general conceptual level instead of only at the level of a single Telco brand or a specific Telco's prepaid package. For example, we can run sentiment analysis on all prepaid package mentions, identify patterns of negative or positive sentiment and see which Telco is the winner of positive sentiment at a conceptual level.
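A very small sketch of this idea follows. The brand names, package names and sentiment scores below are all invented for illustration:

```python
from collections import defaultdict

# Hypothetical two-level concept hierarchy: specific entities roll up
# to more general concepts.
HIERARCHY = {
    "TelcoA": "TELCOS", "TelcoB": "TELCOS",
    "TelcoA Prepaid 10": "PREPAID_PACKAGES", "TelcoB SmartPack": "PREPAID_PACKAGES",
}

def tag_concepts(mention):
    """Return every (entity -> concept) annotation that applies to a mention."""
    return {entity: concept for entity, concept in HIERARCHY.items() if entity in mention}

# Toy (text, sentiment) pairs; the +1 / -1 scores are invented.
mentions = [
    ("TelcoA Prepaid 10 is overpriced", -1),
    ("Loving the TelcoB SmartPack", 1),
    ("TelcoA coverage is great", 1),
]

# Aggregate sentiment at the general concept level rather than per brand.
by_concept = defaultdict(list)
for text, sentiment in mentions:
    for concept in set(tag_concepts(text).values()):
        by_concept[concept].append(sentiment)

avg = {concept: sum(v) / len(v) for concept, v in by_concept.items()}
print(avg)
```

The same mention contributes to both the brand-level and the concept-level aggregates, which is exactly what lets us compare Telcos at the PREPAID_PACKAGES level.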

The possibilities are endless.

Senin, 19 September 2011

Big Data : Case Studies, Best Practices and Why America should care

We know that Knowledge is Power. Due to the Data Explosion, more Data Scientists will be needed, and being a Data Scientist is increasingly becoming a "cool" profession. Needless to say, America should be preparing for the increased need for Predictive Analytics professionals in Research and Business.

Being able to collect, analyze and extract knowledge from a huge amount of Data is not only about Businesses making the right decisions; it is also critical for a Country as a whole. The more efficient and fast this cycle is, the better for the Country that puts Analytics to work.

This Blog post is actually about the words and phrases used for this post: all words and phrases in the title (and the introductory text) were carefully selected to produce specific thoughts, which can be broken down into three parts:

  •  Being a Data Scientist has high value. 
  • "Case Studies" and "Best Practices" communicate to readers successful applications and knowledge worth reading.
  • "America should". This phrase obviously creates specific emotions and feelings in Americans.

"Case Study" and "Best Practices" were phrases found to be commonly associated with posts of high visibility. You might also get many views if you create a post which proves that whatever concept you are writing about is the right thing to do (for example, write a post that clearly demonstrates yet another reason to use Social Media and have it shown to Social Media Professionals). Regarding our example: it is very probable (and logical) for Data Miners to look at and then re-tweet (or otherwise share) information which is "proof" that Data Mining is useful and also a "cool" profession. The higher concept / motive which works behind the scenes is: "I am doing the right job and this post proves it".
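To make the phrase / visibility association concrete, here is a toy calculation. All titles and share counts below are invented purely for illustration, not taken from the actual analysis:

```python
# Toy data: (post title, number of shares). Numbers are invented.
posts = [
    ("Social Media Case Studies That Worked", 240),
    ("Best Practices for Text Analytics", 310),
    ("My Thoughts on Conferences", 40),
    ("A Case Study in Churn Prediction", 180),
]

def avg_shares(posts, phrase):
    """Average share count of posts whose title contains the phrase."""
    shares = [s for title, s in posts if phrase.lower() in title.lower()]
    return sum(shares) / len(shares) if shares else 0.0

print(avg_shares(posts, "case stud"))       # matches "Case Studies" and "Case Study"
print(avg_shares(posts, "best practices"))
```

Run at scale, this kind of comparison is what surfaces phrases that are over-represented in highly shared posts.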

You might also get many views by submitting a post which disproves well-accepted concepts, or a post that demonstrates the difficulties that well-accepted concepts face: for example, if you were a Data Scientist or a BI Professional, you would be inclined to read a post titled "Big Data is a Big Hype". Whether you will re-tweet or share the post is of course at your discretion. At this point it should be noted that there is a big difference between the number of clicks a post gets and the number of shares (Retweets, Likes, etc.), because sharing a post means that it is considered worthwhile to read.

All of the above (and much more) was found by analyzing thousands of Blog posts along with the number of clicks and shares they got (through RTs, FaceBook "Likes", etc.), and this is what I will be presenting at Text Analytics World in New York this October. It was also very interesting to see that some findings are in tandem with findings discussed by Joseph Carrabis during the Text Analytics Summit 2011 in Boston back in May.

Of course, it is not suggested that by using specific words and phrases you are guaranteed a successful post re-tweeted by thousands of people; there are many reasons for this which I will not get into here. Additionally, Text Analytics cannot infer the higher meanings and concepts suggested within Text, a problem that deserves a post of its own. This analysis, however, identifies concepts and/or phrases that point Bloggers and Marketers in a specific direction, and with this knowledge they have increased probabilities of a successful Web presence. Again, this is an example of true Social Media Intelligence. Not (just) Reports.

So, if this post's title immediately got your attention over other posts, you've just had a little taste of Predictive Analytics in action.

Jumat, 09 September 2011

Do Social Media Monitoring tools provide True Intelligence?

Having recently read a report from WebLiquid, one of the interesting facts to consider is that around 70% of Marketers replied that they find the insights gleaned from Social Media Monitoring tools "Somewhat Valuable". Slightly more than 20% of them found these insights "Extremely Valuable". The report also shows that most Marketers plan to invest more in SMM tools, with few of them retreating from any further investment.

This is Big News. 70% of Marketers finding insights gleaned from SMM tools only "Somewhat" valuable is not a good thing, and perhaps there are reasons for this. It would be very interesting to know what Marketers consider Insights, how they prioritize those Insights and how easily they can act once they have them. The problem can be summarized in one sentence:


- Marketers do not want (just) Reports.


There is a lot of useful information provided by many Social Media Monitoring tools: the number of mentions of a Brand (or Product or Service) per channel, and which users talk frequently about your Brand (and which of them are considered influential). Sentiment Analysis provides Marketers with the perception of a Brand but also the perception of competitive Brands, leading to what is known as Competitive Intelligence. Perhaps Social Media Monitoring platforms have many types of metrics still to offer: for example, a potentially useful metric could be the ability to identify Consumer Intentions ("I will definitely buy...") and how these intentions differentiate - such as "I would buy 'ABC' if it was cheaper" or "I would buy 'ABC' if I hadn't purchased 'XYZ' already".
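A toy sketch of how such intention detection might start out follows. The regular expressions are illustrative assumptions, far from a production-grade solution:

```python
import re

# Illustrative patterns for classifying purchase intent; a real system
# would need much richer linguistic rules or a trained classifier.
PATTERNS = [
    ("conditional_intent", re.compile(r"\bi would buy\b.*\bif\b", re.I)),
    ("firm_intent",        re.compile(r"\bi will (definitely )?buy\b", re.I)),
]

def classify_intent(text):
    """Label a mention with the first intent pattern it matches."""
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return "no_intent"

print(classify_intent("I will definitely buy the new ABC phone"))   # firm_intent
print(classify_intent("I would buy ABC if it was cheaper"))         # conditional_intent
```

Counting these labels per Brand over time would give Marketers a metric of firm versus conditional purchase intent.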


Notice that SMM tools provide metrics: number of mentions per channel, top influential users, percentage of positive / negative / neutral sentiment and sentiment intensity, how mentions of a new product disperse through different social media channels, etc.


But what is considered Intelligence in Social Media? Would someone identify as intelligence the fact that during the past 2 months there was an increase in specific Brand Mentions on Twitter but not on YouTube? Or is it Intelligence when we notice that there has been a decline in positive sentiment about a product? All of this information is Reporting and Feedback. This is not to say it is not useful: it is important to know what is happening and why.

So what is True Intelligence all about?

True Intelligence is about knowing how to successfully Promote and Market a Brand, Product or Service. To do that, a Marketer wants to know the Best Practices: with Social Media Reports, Marketers know what is happening (a decline in positive mentions of our new smartphone) and why it is happening (a potential hardware problem). Social Media Analytics can identify the right strategies to make things happen. True Social Media Intelligence is about knowing which parameters (channels, number of mentions) are important in achieving a result. Is it important to have a product associated with intense (positive) sentiment? Or could it be more important to have a Product highly associated with Rumors?

There is still a long way to go in terms of Insights from Social Media Monitoring tools. There are many processes and parameters that will eventually be used for deriving more Insights and better Strategies. The answer to true Social Media Intelligence is the use of Predictive Analytics (Data and Text Mining) applied to Social Data: an area that is currently untouched by most Social Media Monitoring tools.

Jumat, 15 Juli 2011

More Trends of the Greek Debt Crisis

Here are some more results on mentions of various Concepts discussed in Greek Blogs about the Greek Debt Crisis. Using Text Analytics, thousands of Greek Blogs are annotated on a daily basis with the purpose of identifying the frequency with which several aspects of the Greek Debt Crisis are discussed.
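A minimal sketch of this daily frequency counting, assuming posts have already been annotated with concepts (the dates and annotations below are invented):

```python
from collections import Counter

# Hypothetical annotated blog posts: (date, concepts found by the annotator).
annotated = [
    ("2011-07-11", ["Greek default"]),
    ("2011-07-12", ["Greek default", "Selective Default"]),
    ("2011-07-12", ["Greek default"]),
    ("2011-07-13", ["Greek default", "US default"]),
]

def daily_counts(annotated, concept):
    """Count how many times a concept was mentioned on each day."""
    counts = Counter()
    for date, concepts in annotated:
        counts[date] += concepts.count(concept)
    return dict(counts)

print(daily_counts(annotated, "Greek default"))
# → {'2011-07-11': 1, '2011-07-12': 2, '2011-07-13': 1}
```

Plotting these per-day counts over time is what produces the trend lines shown below.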

First let's have a look at the trend line of the Indignant Citizens Movement :






We can see that there is a clear down-trend in the number of Blog Mentions. This is also supported by a very significant reduction of the total Tweets found for this subject.


Next, let's see what the trend of mentions of "Greek default" looks like over the past month:



We notice a severe spike beginning on July 12th, because several Blogs and News sites mentioned a possible "Selective Default" which could happen to Greece.

Interestingly, the trend in mentions of a US Default is also rising in Greek Blogs, but with much smaller frequency:





Jumat, 17 Juni 2011

The Greek Debt crisis - Some Trends

Several friends and blog readers frequently ask me what I think about Greece and the problems that Greece has on a Social and Economic level. Since this is not a blog about Politics or the Economy, I will try to give my point of view with some analytics added.

It is always interesting to know how people feel and what they think about the economy, their future and the politicians, and what the general sentiment is. Also of great importance is the trend of all opinions and/or sentiment as recorded in Blog posts and other Social Media sources.

Here are some examples from data that I collect from Greek blogs on a daily basis, several times a day. Hundreds of Concepts are annotated within thousands of Blog entries and collected for further analysis.


The results that I will show here are for:

- the latest Government Reform

- words that communicate Negative Sentiment

- the "Indignants Movement": citizens who do not agree with the practices of the two largest Greek political parties during the past 30 years and with the spending cuts directed by the IMF

- the Debt Crisis

Let us begin with the trend of "Government Reform" which, at the time of writing (17/06/11 - note that the date format is DD/MM/YY), has just happened. Here is the trend of mentions:





Notice how few mentions were captured during the previous days, and how much the trend increases up to June 17th, when the reform took place.


Next, let's look at entries that communicate "economic default" and their trend :





Again, notice how mentions of a Greek default start to rise on previous days (starting from June 3rd) and how the trend gradually appears to fade out (French and German leaders said they would back Greek debt on June 17th). It was no surprise that on June 8th and 9th (yet more) Greeks rushed to Banks to withdraw their money.

Here is the Trend of "The Indignants" movement :



Notice the dates May 29th-30th, June 5th and June 12th-13th. All of these dates are Sundays (or close to Sundays), which is the day most people gather in Syntagma Square to express their anger at the IMF and Government practices. The trend, however, appears to be falling, but this may well change in the coming days. Time will tell.

How about the words that communicate Negative Sentiment? Here is the trend :





Negative sentiment words appear to have risen somewhat after 31/05 but are coming back down to previous levels.

FYI, words that frequently occur with the concept "Politicians" are: "leaders", "cheats" and "traitors".


More on the next post.

Kamis, 16 Juni 2011

Apple Products on Twitter - A Text Analytics example


My presentation at the 7th Annual Text Analytics Summit was a tutorial on one of the methodologies one could use to analyze unstructured text. The sample consisted of 365,000 tweets that contained keywords of Apple products and concepts such as iPad, iPhone, iPod, Apple Store, Mac and Steve Jobs, and the goal was to get an understanding of what people were tweeting about each product or concept.

The first step is to use a text analysis toolkit (I used GATE) to annotate the tweets and identify which concepts and keywords occur within them. But this is not always easy. Take the word Mac for example: according to the context, Mac could be a computer type, a burger type, the MAC beauty products or Mac Arthur airport. So when a query containing the word Mac is sent to the Twitter API, we end up with lots of erroneous information.

So one of the things that has to be done to ensure good results is word sense disambiguation. We know, for example, that if a tweet contains a word such as fries, lettuce and/or salad, then quite likely the word Mac found within this tweet was about the Big Mac (even though the word Big may not be present). If we find the word Arthur next to the word Mac, then the tweet is about the Mac Arthur airport, and so on. Here is GATE in action, identifying different keywords and concepts in Tweets:
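As a toy sketch of such context-based disambiguation (the cue words below are illustrative assumptions only, not a complete rule set):

```python
# Toy word-sense disambiguation for "Mac", scoring each sense by how many
# of its context cue words appear in the tweet.
SENSE_CUES = {
    "burger":   {"fries", "lettuce", "salad", "big"},
    "airport":  {"arthur", "flight"},
    "computer": {"apple", "osx", "macbook", "keyboard"},
}

def disambiguate_mac(tweet):
    """Pick the sense whose cue words overlap most with the tweet."""
    words = set(tweet.lower().split())
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(disambiguate_mac("Had a Mac with extra fries and a salad"))   # burger
print(disambiguate_mac("My new Mac from the Apple store rocks"))    # computer
```

Tweets scoring "unknown" (no cue evidence either way) would be flagged for manual review or discarded.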







Now we can see which concepts and keywords appear frequently in Re-Tweets ('USER' denotes that a '@' was present in the Tweet, 'URL' that a URL link was found in the Tweet, etc.):





We can also see which words frequently occur with iPhone5:
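A toy sketch of this kind of co-occurrence counting (the tweets below are invented for illustration):

```python
from collections import Counter
import re

# Toy set of tweets; we count which words co-occur with a given keyword.
tweets = [
    "iPhone5 rumors say a bigger screen is coming",
    "waiting for the iPhone5 release date",
    "iPhone5 release rumors everywhere",
]

def cooccurring_words(tweets, keyword):
    """Count words appearing in the same tweet as the keyword."""
    counts = Counter()
    key = keyword.lower()
    for tweet in tweets:
        words = re.findall(r"[A-Za-z0-9]+", tweet.lower())
        if key in words:
            counts.update(w for w in words if w != key)
    return counts

print(cooccurring_words(tweets, "iPhone5").most_common(2))
```

In the real analysis the same counting runs over the full annotated sample, with stop words removed first.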

















Selasa, 26 April 2011

Event Detection: Analytics becoming more personal

Sentiment Analysis is a hot technology at the moment. Marketers are interested in the perception that consumers have about  a specific brand, product or service as this is found in unstructured text. Some people claim that Sentiment Analysis does not meet their expectations but also that it is not straightforward for a company to find the "right" solution.  Comparing different Sentiment Analysis solutions could prove a difficult task.

Marketers and Decision makers need insights with which they can make better decisions - They need both Reports and Intelligence. Therefore the question that always follows the finding that "Your product has a 35% negative sentiment in the past 10 days" is "Why".  Social Media Monitoring tools must also  provide actionable Intelligence. 

All this is important information as it shows why your Brand / Product / Service could be losing customers. You monitor what is being said, identify whether a negative or positive Sentiment Trend is declining or rising and take necessary actions accordingly.

One of the questions I often get is what other applications can emerge from using Text Analytics and Data Mining. With Text Analytics and Data Mining we can find behavior patterns on many levels, and - assuming that information such as Tweets keeps coming - the understanding of consumers can go to the next - and sometimes more personal - level.

One of these applications is Event Detection. I am not aware of Event Detection being provided by any tool at the moment, but I believe that this type of analysis could become the next major source of consumer insights. But what exactly is "Event Detection"?

Since we are able to have a computer automatically identify whether a phrase contains positive, negative or neutral sentiment, perhaps we could use Text Analytics and Machine Learning to detect that a specific event has occurred to an individual from the Tweets they posted, such as "i've just returned from holidays". But that's not all. We can mine for patterns of consumer behavior given that an event has occurred, and the potential knowledge from such an analysis could be very powerful. Because apart from the emotions that a product / service / person generates, the same applies to events happening in our lives. These events and the emotions they create can sometimes change our lives and also drive our decisions. A logical next step is to collect several kinds of behavioral Data and use Data Mining to analyze this information.
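A minimal rule-based sketch of Event Detection follows. The event names and patterns are my own illustrative assumptions; a real system would learn such patterns with Machine Learning rather than hand-writing them:

```python
import re

# Illustrative life-event patterns; invented for this sketch.
EVENT_PATTERNS = {
    "returned_from_holidays": re.compile(r"(returned|back) from (my )?holidays?", re.I),
    "got_married":            re.compile(r"(just )?got married", re.I),
    "new_job":                re.compile(r"(started|starting) (my |a )?new job", re.I),
}

def detect_events(tweet):
    """Return every life event whose pattern matches the tweet."""
    return [event for event, pattern in EVENT_PATTERNS.items() if pattern.search(tweet)]

print(detect_events("i've just returned from holidays, feeling great"))
# → ['returned_from_holidays']
```

Once events are tagged per person, the interesting step is mining how consumer behavior changes after each event type.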

I will discuss an example of using Event Detection towards the end of my presentation at the 7th Annual Text Analytics Summit in Boston this May, along with the reasons such an analysis is important, and I am looking forward to the reactions.

The fact is that with more insights, privacy issues arise even more, and I get an increasing number of people asking me about privacy. I was also interviewed by a major British newspaper last month on what companies can learn by applying "Super Crunching" to Tweets. I tried to show both sides of "Super Crunching", but the truth is that consumer insights become more personal as companies understand the value of structured and (more recently) unstructured information.

Selasa, 15 Maret 2011

Social Media Data and what analysts can do with it

It is worth looking at what having our lives "digitalized" means, since all of the information currently generated from the usage of Social Media is available for analysis: "Collective Intelligence" and "Behavior Mining" are terms that are becoming increasingly well known.


But what exactly is Social Media Data? Here are some examples :

  • The number of followers you have on Twitter and number of friends on FaceBook.
  • The number of links you provide, groups you join, retweets you make and how often you talk with other friends / followers.
  • The number of re-tweets, FaceBook "likes", comments and views that a blog post generates.
  • The personal information you provide (such as your Twitter Bio).
  • The concepts being discussed in Tweets and FaceBook walls.



By applying Predictive Analytics to all of this information an impressive number of applications arises such as :

- Analysis of your Twitter Bio and the words contained in your Tweets. For example, we can identify what people who state in their Bio that they are "Computer Geeks" discuss most frequently (in terms of Electronics Brands, technology trends, etc). (See more here)

- Analyzing thousands of Twitter accounts to find words that could make a difference in your follower count. (It appears that you should keep things positive - at least most of the time. See why here.)

- Identifying best practices on how to use Social Media: when to post your new blog post, which words and concepts to avoid writing about and, ultimately, what concepts (such as Personal Branding) you should focus on. (See more here.)

- Understanding consumer behavior: what people liked, how they feel and what they would like to see in upcoming products and/or experiences. See this example of how different aspects of consumer behavior in shopping malls are "mined".

Note that these are just some examples. The list goes on.

There is no doubt that exciting new Social Media apps will become available. These in turn will produce even more Social Media data (such as data containing location information). Being able to combine Data Mining and Text Mining techniques to extract insights from Social Media Data will become a very important skill to have.

Rabu, 16 Februari 2011

7th Annual Text Analytics Summit



I would like to say a few words about an upcoming major event for all of those interested in Text Analytics and its various uses in Social Media, Marketing and Business Intelligence. Starting on May 18th, the annual Text Analytics Summit (the only conference dedicated completely to Text Mining) will take place in Boston, MA, with a total of 28 speakers presenting material on applications of Text Analytics including:




  • Social Media Analytics
  • Sentiment Analysis
  • Voice of the Customer
  • Marketing
  • Semantics


Well-known names in the industry will be there (Seth Grimes, Tom Anderson, Gregory Piatetsky-Shapiro, Ronen Feldman) as well as experts from companies such as SAS, IBM, Forrester Research, Attensity, Adobe, J.D. Power & Associates, Clarabridge and others.


My presentation will be about Behavior Mining in Social Media using Text Analytics, and I will be giving a step-by-step tutorial on the analysis of data originating from Twitter regarding a major Electronics Brand in the USA. More specifically, I will:

  • Show how Tweets can be transformed and then analyzed using various statistical NLP techniques and software.
  • Discuss the various problems that arise when one wants to analyze Text data.
  • Discuss and introduce new ways of seeking valuable information and extracting insights when it comes to Mining Behavior.

Although the Case Study will be using data from Twitter, the techniques shown can be applied to any other Text Data such as those found in FaceBook, Blog posts, User comments, etc.

I am looking forward to seeing the work done by others, learning about successful applications of Text Analytics and the knowledge gained, and also seeing the issues that professionals come across and how they face them.

Kamis, 10 Februari 2011

Forex Trading with R : Part 2

In the previous post, the first steps were given for building the basis of a forex trading system. Now it is time to build the actual classifiers that can give us future buy / hold / sell signals.

Assuming that everything is in working order and the instructions given in  the previous post were followed we can start building these classifiers.

First, let's train a Neural Network. The following commands train a Neural Network, apply the trained model to our test data and output the predictions for buy / sell / hold signals:

set.seed(134)
nn <- nnet(class ~ ., traindata, size = 3, rang = 0.1, decay = 0.001, maxit = 3000, trace = FALSE)
table(actual = testdata$class, predicted = predict(nn, newdata = testdata, type = "class"))

Note that a seed number was used. You should either try different seed numbers (so that the network weights are re-initialized) or omit the set.seed() directive. You should also experiment with other Neural Net parameters such as the number of iterations (maxit), the weight decay (decay), etc.


The confusion matrix shows us the necessary information for calculating TP, FP, TN and FN rates for each class (i.e. for each signal type).


Similarly we can train and test a Random Forest :


rf.model <- randomForest(class ~ ., data = traindata, nodesize = 40, importance = FALSE, mtry = 3, ntree = 100)
table(actual = testdata$class, predicted = predict(rf.model, newdata = testdata, type = "response"))



Now let's train an SVM on our data. We can issue the following command:

### train SVM
sv <- svm(class ~ ., traindata, gamma = 0.01, cost = 5, kernel = "radial")

To see how the classifier did on the test set, we enter :

table(actual = testdata$class, predicted = predict(sv, newdata = testdata))


Next we can try to optimize parameters of the SVM classifier as follows :


# find optimal values of Gamma and Cost for an RBF SVM classifier
tuned <- tune(svm, class ~ ., data = traindata,
              ranges = list(gamma = c(0.0001, 0.001, 0.05, 0.1, 0.2, 0.3),
                            cost = c(1, 5, 10, 20, 50, 100, 120, 130)),
              tunecontrol = tune.control(sampling = "cross", cross = 10))


tuned


The first command uses 10-fold cross validation to identify the best gamma and cost parameters among some predetermined values. We then issue the command tuned to see which combination of parameters gives us the lowest classification error. Knowing these parameters, we can then train an SVM classifier with them and see how this model performs (as was shown previously).

Be aware of the following key points :


  1. Three sets of data should be used: Training, Test and Validation. The Validation set should not be part of the optimization (= finding the best algorithm parameters) process.
  2. Make sure that you create classifiers for several time periods. Test the performance of each classifier according to the percentage of available data you use for training / testing / validation and the number of periods you use for the sliding window.
  3. Also make sure that, once you have chosen your model, you test your system correctly by simulating buy / hold / sell signals and taking into consideration all associated trading costs.


Senin, 10 Januari 2011

Forex Trading with R : Part 1

I recently started learning R - probably something I should have done a long time ago - and since learning by doing is the best way to learn something, I decided to use R to generate buy/sell/hold signals for the EUR/USD pair. For those who wish to use R for making Trading decisions, this series of posts is a short introduction from which one can pursue the subject further. By no means is it implied that this post's methodology is the one you should use to trade: different response variables, signal thresholds, technical indicators and classifiers than the ones presented here should be tried. Then, elaborate testing methods should be put to use to assess the performance of each classifier and the worth of your trading strategy.

First, the resources that need to be downloaded and installed:

1) Packages: quantmod, nnet, e1071, tseries, randomForest
2) A file that contains OHLC Data. For this example, EUR/USD Data are used. You can download an example file here. I downloaded the EUR/USD historical data here.

I would also highly recommend getting Data Mining with R: a concise book that goes through an introduction to using R and then presents various case studies, one of which is about using R to predict variations of the S&P Index. The author also provides the package "DMwR", which includes all the necessary functionality for generating signals, extracting precision / recall metrics of generated models, performing Monte Carlo estimates and evaluating trading strategies.

After downloading the file from step (2), place it in a directory of your choice. Now copy and paste the following commands into R:

library(e1071)
library(nnet)
library(randomForest)
library(quantmod)
library(tseries)


Next, we import the csv file from the directory where it was originally saved (change "YOUR_PATH" to the directory path where you saved the csv file):

#get data OHLC from csv file
raw<- read.delim2("/YOUR_PATH/EURUSD.csv",header=TRUE,sep=",")


Now, paste the following to R (change again YOUR_PATH accordingly) :


#convert date
stripday<-strptime(raw$DATE,format="%Y%m%d")
fxdata<-data.frame(stripday,raw)


fxdata$TIME<-NULL
fxdata$TICKER<-NULL
fxdata$DATE<-NULL
colnames(fxdata)<-c("Date","Open","Low","High","Close")

#write data to .csv
write.table(fxdata,"/YOUR_PATH/eurusd.csv",quote=FALSE,sep=",",row.names=FALSE)

##transform to an xts object
EURUSD <- as.xts(read.zoo("/YOUR_PATH/eurusd.csv", sep = ",", format = "%Y-%m-%d", header = T))


Now we define some technical indicators, the model to work on and a function that generates our trading signals  :

#setup Technical Indicators


myATR <- function(x) ATR(HLC(x))[,'atr']
mySMI <- function(x) SMI(HLC(x))[,'SMI']
myADX <- function(x) ADX(HLC(x))[,'ADX']
myAroon <- function(x) aroon(x[,c('High','Low')])$oscillator
myBB <- function(x) BBands(HLC(x))[,'pctB']
myChaikinVol<-function(x)Delt(chaikinVolatility(x[,c("High","Low")]))[,1]
myCLV <- function(x) EMA(CLV(HLC(x)))[,1]
myMACD <- function(x) MACD(Cl(x))[,2]
mySAR <- function(x) SAR(x[,c('High','Close')]) [,1]
myVolat <- function(x) volatility(OHLC(x),calc="garman")[,1]
myEMA10 <- function(x) EMA(Cl(x),n=10)[,1]
myEMA20 <- function(x) EMA(Cl(x),n=20)[,1]
myEMA30 <- function(x) EMA(Cl(x),n=30)[,1]
myEMA50 <- function(x) EMA(Cl(x),n=50)[,1]
myEMA60 <- function(x) EMA(Cl(x),n=60)[,1]

data.model <- specifyModel(Delt(Cl(EURUSD)) ~
myATR(EURUSD) + mySMI(EURUSD) + myADX(EURUSD) + myAroon(EURUSD) +
myBB(EURUSD) + myChaikinVol(EURUSD) + myCLV(EURUSD) +myEMA10(EURUSD) +myEMA20(EURUSD) +myEMA30(EURUSD) +myEMA50(EURUSD) + myEMA60(EURUSD) +
CMO(Cl(EURUSD)) + EMA(Delt(Cl(EURUSD))) +
myVolat(EURUSD) + myMACD(EURUSD) + RSI(Cl(EURUSD)) +
mySAR(EURUSD) + runMean(Cl(EURUSD)) + runSD(Cl(EURUSD)))

Tdata.train <- as.data.frame(modelData(data.model,
data.window=c('2008-01-01','2010-01-01')))

Tdata.eval <- na.omit(as.data.frame(modelData(data.model,
data.window=c('2010-01-02','2010-11-01'))))



# a very simple signal function
signals <- function(x) {
  if (x >= -0.005 && x <= 0.005) {
    result <- "hold"
  } else if (x > 0.005) {
    result <- "buy"
  } else {
    # note the space in "x < -0.005": writing "x<-0.005" would be parsed as an assignment
    result <- "sell"
  }
  result
}

#create class vector that holds TRAINING buy,sell,hold signals

class<-sapply(Tdata.train$Delt.Cl.EURUSD,signals)

#bind both into a new data frame that holds everything
traindata<-cbind(Tdata.train,class)

#remove Delt.Cl.EURUSD - not needed anymore.
traindata$Delt.Cl.EURUSD<-NULL



#create class vector that  holds TESTING buy,sell,hold signals

class<-sapply(Tdata.eval$Delt.Cl.EURUSD,signals)


#bind into a new data frame that holds everything
testdata<-cbind(Tdata.eval,class)
testdata$Delt.Cl.EURUSD<-NULL

#get a summary of our traindata
summary(traindata)


The last command prints out some summary statistics of the traindata sample. Notice the 'class' attribute and the distribution of buy, hold and sell signals.

Now we are ready to apply some modeling techniques using traindata and testdata as the datasets to work with. Although I would suggest also using a third sample for validation (remember the "elaborate testing" discussed at the beginning), for this example we will keep things simple. More in the next post.