Kamis, 18 Desember 2008

Personalizing your RSS Feeds


78.8%. This is the estimated accuracy with which an algorithm is able to predict what kind of information I like to read, and thus which news items are more interesting to me.

The system proves itself every day and it actually helps a lot because it can instantly spot the information I like among hundreds of RSS news headers. Such a service is much more than simple keyword matching because it takes word combinations into account, so it can differentiate news items (in terms of how interesting they are) even when they are about similar concepts.

Personalization of RSS feeds is a well-known application of text classification. The amount of information -the header- is almost always 2-3 sentences long, which makes it ideal to feed to a classifier. The software that I built is quite simple: first, I have a list of about 10 RSS sources: Financial, Medical, International News, Tech News, etc. The application scans the RSS feeds every 20 minutes and each new header is appended to a text file on my hard disk.

When I ran the application for the first time, I simply saved all headers on the hard disk and built my first text classifier. But that was back then. Today the classifier scans the RSS feeds and automatically appends each RSS header to either the "Interesting" text file or the "Uninteresting" text file...and it does so correctly most of the time.

When I have some spare time, I look at the classified headers and correct the errors my classifier made by moving each header to the right place. I then re-train the classifier and everything is ready for the next run.
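
The daily loop above can be sketched with a tiny naive Bayes text classifier. This is only an illustrative stand-in, not my actual implementation; the class labels and sample headers are made up:

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class HeaderClassifier:
    """Minimal multinomial naive Bayes for two-class RSS header filtering."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.class_counts = Counter()            # label -> number of headers

    def train(self, header, label):
        self.class_counts[label] += 1
        self.word_counts[label].update(tokenize(header))

    def classify(self, header):
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.class_counts:
            # Log prior plus Laplace-smoothed log likelihood of each word.
            score = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(vocab)
            for word in tokenize(header):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = HeaderClassifier()
clf.train("new CPU architecture announced by Intel", "interesting")
clf.train("machine learning library released", "interesting")
clf.train("celebrity wedding photos leaked", "uninteresting")
clf.train("football transfer gossip roundup", "uninteresting")
print(clf.classify("Intel announced a new machine learning CPU"))
```

Each mis-filed header that gets corrected simply becomes another `train()` call before the next run, which is exactly the retraining cycle described above.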


Also interesting is the fact that the more I use the classifier, the better it gets in terms of the training frequency it requires. During the first week of usage I had to produce a new model almost every day...but not any more.

So RSS personalization is an application worth looking at. First of all, it saves a lot of time. Second, really useful applications can emerge. For example, consider an investor who wishes to know whether something significant enough is occurring in the news that might affect the markets for better or for worse. Notice that in a previous post I described how I flag every news header as important or unimportant. Therefore, if a classifier is able to differentiate accurately enough what is important, then an investor can receive e-mail alerts -or even SMS messages if he is not online- about the event. Perhaps the message might even include how much a stock or an index is likely to be affected by the breaking news.

There is also the personalization that results from collaborative filtering. However, I believe that a "personal news classifier" -if I may call it that- does a much better job in terms of predictive accuracy after sufficient training time.



Senin, 08 Desember 2008

When Telecom customers complain-Pt. 2

In the previous post I explained the first steps in deploying Information Extraction, Text Mining and Computational Linguistics to capture the essence of Telecom customers' complaints.

We have already discussed the big picture: retrieve data (essentially user messages from forums) and then use Information Extraction to transform unstructured information into a structured form. This transformation is done by building a set of matching rules for specific phrases or keywords such as

-signal
-antenna
-customer care

and words of sentiment such as

-worse
-worst
-better
-best
-outraged

among many others.
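
A minimal sketch of such matching rules, assuming a small hand-built lexicon (the sentiment scores below are made up for illustration):

```python
import re

# Hypothetical rule sets; a real deployment would use larger curated lexicons.
TOPIC_KEYWORDS = {"signal", "antenna", "customer care"}
SENTIMENT_WORDS = {"worse": -1, "worst": -2, "better": +1, "best": +2, "outraged": -2}

def extract_mentions(post):
    """Return (topic, sentiment_word, score) triples for every sentence
    that mentions both a telecom topic and a sentiment word."""
    results = []
    for sentence in re.split(r"[.!?]+", post.lower()):
        topics = [t for t in TOPIC_KEYWORDS if t in sentence]
        sentiments = [w for w in SENTIMENT_WORDS if re.search(r"\b%s\b" % w, sentence)]
        for topic in topics:
            for word in sentiments:
                results.append((topic, word, SENTIMENT_WORDS[word]))
    return results

print(extract_mentions("The signal keeps getting worse. Customer care was the best part!"))
```

Real rule sets would be much larger and would also handle negation ("not better"), but even this toy version turns free text into structured (topic, sentiment) pairs.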


So here comes the interesting part: suppose that a telecom company has in its possession an application that is able to search and extract sentiment from unstructured information. Having such a tool means that:

  • A user can query user forums directly for -say- specific network problems and break those problems down by area name.
  • A user can directly query for hot phrases such as "canceling my subscription" and cluster keywords around those messages. If the telecom company is also running churn prediction models (and it most likely is), then analysts have yet another source with which to cross-check and/or enhance the conclusions of their churn models.
  • Special matching rules can be applied to extract why users prefer company XYZ over company ABC.
  • This technology can be applied to e-mails and/or free-text complaints sent to the customer care center, which means that analysts can further enhance their churn models with additional data.
  • Matching rules can be built that associate keywords with Telecom companies in terms of their co-occurrence. So telecom company XYZ has the phrase "good signal" associated with its brand whilst company ABC has "bargain" as its associated keyword.
  • Billing plan keywords can be matched and then clustered with sentiment keywords. In other words, how do customers perceive the new billing plan and what is the sentiment about it?


It is easy to realize that Information Extraction combined with Text Mining and linguistics is a powerful combination that can extract many "knowledge nuggets". The fact that such an application cannot be 100% accurate may raise acceptance problems, but it is surely worth the effort in the end if the potential problems are clearly presented before the application is implemented.

Let us not forget that a complaint given by a customer to the customer center remains there, within the boundaries of the company. A complaint posted on a forum can be seen by hundreds of thousands of others (and it will most likely stay there for a long time), influencing potential and existing customers in a negative way.

A sentiment analysis application may also be used for:

  • Banking
  • Pharmaceuticals
  • Insurance
  • Consumer Products (Customer Reviews)
and of course for capturing the sentiment of citizens about politicians (...)


Minggu, 07 Desember 2008

When Telecom customers complain

Probably one of the best uses of Information Extraction, Text Mining and Computational Linguistics combined is their ability to show us the sentiment of customers. Today we are going to see an example of capturing the sentiment of Telecom customers.

When a customer writes his/her opinion on a forum, a wealth of information is generated because -more importantly- a customer uses words and phrases that cannot be captured in a controlled study. The words, phrases and expressions are far more emotionally powerful than a Likert-scale answer of the type "Totally disagree / agree".

So let us see the steps required :


First Step: The first thing, of course, is to actually find the data. User forums where people talk about mobile phones and mobile companies are obviously the place to look, and there are lots of them. The volume of messages on a single forum may not always be enough, but usually the available information is more than sufficient. Special code can be written to extract text from posts without losing the nature of the posting. For example, the fact that a post has generated 20 replies is considered valuable information: the more posted replies, the more sentiment exists, and this information has to be taken into consideration.
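
One way to keep that information is to attach a reply-count weight to every thread. The log damping below is my own hedged choice for illustration, not a standard formula:

```python
import math

def post_weight(replies):
    """More replies imply more sentiment around a thread, so weight the
    thread accordingly; log1p damps very large threads so they do not
    dominate the analysis."""
    return 1.0 + math.log1p(replies)

# A thread with 20 replies counts roughly 4x a thread with no replies.
print(round(post_weight(20), 2), round(post_weight(0), 2))
```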

Second Step: Deploy information extraction techniques to identify phrases of good or bad sentiment (and actually many other things) about Telecom keywords such as:

- Signal
- Customer Care
- Billing

....etc

The following screen capture shows an example. It is in Greek, but I will provide all the necessary explanation. Please also note that this is a simplified version of the process:





Notice that on the right-hand side there are some bars that denote the types of keywords found. The first category is called "Characterization" and if it is checked (as it is in the screen capture above) the software will highlight posts that contain some kind of characterization, whether good or bad. Notice also the yellow bar named "Network". Because it is checked, words that are synonyms of "Network" are highlighted, and indeed this is the case because

Signal = σήμα (in Greek) and
Flawless = άψογο

so the highlighted phrase "άψογο σήμα" means "flawless signal", which is a good characterization of the signal of two particular telecom companies. Notice also a line under the "Features" tab which says that between positions 3425 and 3429 there is a mention of signal ("mentionsSignal = true").
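
The "mentionsSignal = true" feature rows with character positions can be reproduced with a simple span annotator; the synonym list below is a made-up stand-in for the real gazetteer:

```python
import re

# Hypothetical synonym list for the "Network" category.
SIGNAL_SYNONYMS = ["signal", "σήμα", "reception"]

def annotate_signal_mentions(text):
    """Emit (start, end, feature) annotations, mirroring the
    'mentionsSignal = true' rows under the Features tab."""
    annotations = []
    for synonym in SIGNAL_SYNONYMS:
        for m in re.finditer(re.escape(synonym), text, re.IGNORECASE):
            annotations.append((m.start(), m.end(), "mentionsSignal=true"))
    return sorted(annotations)

print(annotate_signal_mentions("The new network has a flawless signal in my area."))
```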

Again, I have to point out that this is a simplified version of the process. Text Mining and Information Extraction are actually very hard work, but they are also very rewarding for those who ultimately deploy and use them. In the next post we will see the problems (and there are many of them) but also how this unstructured information is turned into "nuggets of gold".


Rabu, 03 Desember 2008

Analytics and the Financial Markets

In previous posts I explained ways to analyze the financial markets using data mining and text mining techniques. I also went through some potential pitfalls and perils of this type of analysis.

By combining different data sources (worldwide indices, moving averages, oscillators, clustering or categorization of financial news) an investor could make better decisions on where and when to invest. The goal of such an analysis is to make predictions sufficiently better than mere chance.

Some days ago I came across a website called Inner8. Inner8 is a really interesting idea: collaborative filtering of stock picking. Combine this with analytics and an investor has yet another tool in his arsenal. Imagine thousands of Inner8 subscribers making stock predictions and sharing their ideas, insights and sentiment about the stock market. After a few months some users will be "prediction superstars" by mere chance, so one has to proceed with caution. Nevertheless, it is a website to keep watching in the future, especially if the subscriber volume increases significantly.

So let us go back to our problem: we have to think of a good way to combine the information in our possession (aka problem representation) and feed this data to one or more algorithms with the goal of achieving models of high predictive value.

Some of the things to consider :

1) Should the "sliding window" technique be used? Could the repetition of training data (because sliding-window training repeats data) affect the predictive power of the model?

2) How many variables? Which are good predictors?

3) Do we care only about predictive power of the model? How about the interpretation of why a stock behaves as it does?

4) How can we represent the "additive effect" of 2 straight days of bad market news if a sliding window is not used?

5) Prediction goal: are we after price prediction (regression) or price limits (classification)?

Unfortunately the list does not end here. Since I am after predictions of stock prices on the Greek Stock Exchange, the data should be presented to the learning algorithm in a coherent way. European markets are affected by the closing of the US and Asian markets. During Greek trading hours the US markets open (approximately 45 minutes before the end of trading, at 16:30 EET), a fact that should also be taken into account.
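
To make consideration (1) above concrete, here is a sketch of how sliding-window training instances are built, showing the data repetition in question (consecutive instances share all but one value):

```python
def sliding_window_instances(series, window, horizon=1):
    """Turn a price series into supervised instances: the features are
    `window` consecutive values and the target is the value `horizon`
    steps after the window ends. Consecutive instances overlap in
    window-1 values, which is the repetition of training data at issue."""
    instances = []
    for i in range(len(series) - window - horizon + 1):
        features = series[i:i + window]
        target = series[i + window + horizon - 1]
        instances.append((features, target))
    return instances

# Made-up closing prices, only to show the construction.
closes = [10.0, 10.2, 10.1, 10.4, 10.3, 10.6]
for features, target in sliding_window_instances(closes, window=3):
    print(features, "->", target)
```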

I am sure that there are many users out there who have read a couple of data mining books, downloaded an open-source data mining tool, fed some data in and expect to see results. My only advice to them, without the slightest sign of criticism: paper-trade first...

Selasa, 25 November 2008

Predicting popular stories on Digg

In its latest news, KDnuggets mentions a paper from HP Labs that outlines a process for analyzing and predicting the popularity of a Digg story or a YouTube video submission.

No question that this is interesting material. In my post dated October 16th, 2007 I presented an analysis I performed on which keywords seem to play a part in a post becoming popular on Digg. You can find all 3 parts of the post here.

So I made a new run to collect the stories from Digg and this is an example of what I came up with (please note: for illustrative purposes only):


The paper from HP Labs takes a different route and makes its predictions based on the popularity of a submitted story in the first few hours rather than after some days. The authors also conclude that after a Digg story is out, users tend to vote for it in the beginning, but once a threshold time has passed, the rate at which the story is dugg fades away. On the contrary, videos submitted to YouTube are viewed by users on a linear trend after submission.

It is true that there is an inherent seasonality in the news and in the way users 'digg' stories. It is also interesting to look at buzzwords that seem to keep repeating (in terms of how interesting they are -or not) over time.

Between the previous runs I have made and the current one, I have seen some repeating patterns. One of these patterns shows Microsoft on a declining trend in terms of how interesting a subject it appears to be for Digg users. Here is what Google Trends shows about the term 'Microsoft':


Could such a trend be a glimpse of Microsoft's 'future' somehow?

I have already built a text classifier which accepts phrases and shows the probability of a phrase being highly 'dugg' based on the keywords it contains. More on this in a future post.


Jumat, 21 November 2008

Reality Mining vs Life Analytics

We will be taking a short break from making predictions for the financial markets because I just came across a novel term (at least for me) for an analytics application: it is called Reality Mining.

In short, Reality Mining is about using smart phones to record the interaction of a user with his device and with other cell phone users. By analyzing this data, patterns of behavior can be extracted that are potentially interesting to social researchers. For more information on Reality Mining, see here.

What made me start this blog was the following question: if there were a way to capture -and thus record- what a person feels, thinks and otherwise experiences every day, what kind of patterns might emerge from analyzing this information? I think that this is one step (or even more steps) beyond Reality Mining. What would happen if this kind of information was recorded for a vast number of people and then analyzed? What if we could predict how the thoughts we have and the things we experience might affect our life and our decisions later on? This is full-blown 'Life Mining'.

I get e-mails from readers asking how this life analytics project is going; there will be some future posts on the subject very soon.




Minggu, 16 November 2008

Text Mining on Financial News

As discussed previously, an analyst should pay specific attention to problem representation, particularly when dealing with text data. A way to do this is discussed below; however, something has to give, and there is no perfect solution for such a task.

First of all we have to find the source of the news. It could be financial news sites such as Bloomberg or the Financial Times, or RSS feed URLs such as the ones provided by MarketWatch. RSS feeds might be the better solution because there is already some predetermined categorization of news according to feed type, and this can be a great help for some analysts.

After finding the news sources and writing the necessary code to get the actual information, we could end up with the following text file:



You can see that I use a '^' separator to differentiate between:

1) A date stamp,
2) A date string,
3) The news string,
4) A characterization of the news (important or unimportant), and
5) A categorization of the financial news.


This simple file could provide the basis for a text categorization training file. Assuming that we have trained algorithms to automatically classify news, we could use one classifier to first categorize news as important or unimportant, and pass only the important news to a second classifier which will do the detailed classification.
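
The two-stage idea can be sketched as a small cascade over the '^'-separated file. The stand-in classifiers below are trivial keyword rules for illustration, not the trained models:

```python
def parse_news_file(lines):
    """Parse the '^'-separated training file described above into records."""
    records = []
    for line in lines:
        stamp, date_str, header, importance, category = line.rstrip("\n").split("^")
        records.append({"stamp": stamp, "date": date_str, "header": header,
                        "important": importance == "important",
                        "category": category})
    return records

def cascade_classify(header, importance_clf, category_clf):
    """Two-stage cascade: only headers flagged important reach the
    detailed categorizer; everything else is dropped early."""
    if importance_clf(header) != "important":
        return None
    return category_clf(header)

# Hypothetical stand-in classifiers so the sketch runs end to end.
importance_clf = lambda h: "important" if "rate" in h.lower() else "unimportant"
category_clf = lambda h: "monetary-policy" if "ecb" in h.lower() else "other"

sample = ["20081116^Nov 16 2008^ECB hints at rate cut^important^monetary-policy"]
record = parse_news_file(sample)[0]
print(cascade_classify(record["header"], importance_clf, category_clf))
```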

Another option is to use clustering. You can imagine that the solution detailed above involves a tremendous amount of work depending on how much data you are planning to collect...too much data means too much work, while less data usually -but not always- means less accuracy.

But how could clustering be performed on such data? Simply: we use field number (4) of our training text file to train a clustering algorithm and then see what 'classes' the algorithm comes up with.


So let's see a small example of clustering. This is a capture from WEKA just before the clustering process:


I have produced a training file which essentially contains the 'buzzwords' of financial news: barrel, recession, Yen, Euro, ECB, price, consumer, etc. The file is then analyzed by the K-means algorithm to extract clusters of the same 'buzzwords'. Each cluster is assigned a number, so each news header ultimately falls into one cluster.
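
A deterministic toy version of this clustering step can be sketched as follows (WEKA's SimpleKMeans did the real work; the buzzword list and headers below are made up):

```python
# Buzzword presence vectors; the list order fixes the vector positions.
BUZZWORDS = ["fear", "decrease", "us", "economy", "oil", "recession"]

def vectorize(words):
    present = {w.lower() for w in words}
    return [1.0 if b in present else 0.0 for b in BUZZWORDS]

def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's algorithm, deterministically seeded with the first
    k points, just to illustrate the step; a real run needs better
    initialization."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        for i, v in enumerate(vectors):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
        for c in range(k):
            members = [v for i, v in enumerate(vectors) if assignment[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment

headers = [
    ["Fear", "Decrease", "US", "Economy"],
    ["Fear", "Decrease", "US", "Oil", "Recession"],
    ["Oil", "Decrease"],
]
print(kmeans([vectorize(h) for h in headers], k=2))
```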


After running the K-means algorithm I ended up with 16 clusters. Let's see two instances that K-means decided should fall under cluster '6':


Instance_number : 130.0

Fear
Decrease
US
Economy
Futures

and


Instance_number : 174.0

Fear
Decrease
US
Price
Oil
Banking
Recession


So the first instance is about fears for the US economy which result in US futures dropping, and the second instance must be -something about- a decrease in oil prices and banking stocks because of the fear of a US recession. Not bad at all...

But not so fast: clustering presents a lot of problems later in the process. Remember that what we are after is to combine text mining and data mining to better understand how the markets react. Should one use classification or clustering? There are many more things to take into consideration and, for obvious reasons, I cannot disclose all the details of such a project...but I hope to give the interested reader a good enough introduction to the subject.

Jumat, 14 November 2008

Capturing the Financial Facts

So far, we have seen the data mining part of analyzing the financial markets and some of the problems that arise during such an analysis: data have to be collected and pre-processed accordingly; there are dangers of over-fitting, and the analyst must make sure that the model(s) created have the expected quality. The analyst also has to choose the relevant attributes for the analysis and decide how the training of the algorithms will be done.

The markets react to financial news; there is no question about this. Of course there are other factors that make people buy or sell. For example, if a stock price has hit a support or resistance level, then some investors are going to either buy or sell when that price level is reached. Investors are also going to buy or sell when specific technical indicators, such as MACD or oscillators, give the signals to do so. Even when bad news is out, markets will, after an -unknown- number of consecutive drops, go up by an -unknown- percentage, and vice versa.

People who are involved with Machine Learning know that the representation of the problem at hand is of high importance...so first we are going to see how financial news can be represented in a way that is helpful for the analysis.

We have to see what we are dealing with here. To do this, we have to analyze and categorize the financial information as it is created. Financial news can be about a number of things:

1) The number of jobless claims in the US is higher than last year.
2) Automotive company XYZ's sales dropped by 15%.
3) Oil prices hit -yet- another record high.
4) The dollar is dropping.


....and the list goes on.


So the first problem arises: should we categorize the information according to its content and present it to the algorithms? We could do that by having a boolean field for each type of news in our training file and setting it to TRUE or FALSE accordingly. Using this method we could easily reach thousands of input fields, since for the "jobless claims" news type alone we could have the following variants:

-A specific country for the jobless claims report (not only the US; it could be any country)

-Jobless claims could be higher than expected, higher than last year, or the highest in the last decade.


It is easy to see that this gets out of control far too fast. Perhaps a better solution would be to try to create clusters of (more or less) the same news. The idea of clustering the financial news might seem an interesting one: an analyst could define a number of clusters -say he is after 100- and let the clustering process categorize all the news accordingly. But is clustering the solution? More on this in the next post...

Rabu, 12 November 2008

Model Testing

Once a model has been created (such as the decision tree in our example), the analyst is required to test it. During model testing, the analyst performs specific tests that show the actual predictive power of the model.

Many methods can be used for model testing, depending on the problem. For our example, and since the available volume of data is sufficiently large, the model training and testing methodology I used was as follows:

1) 50% of data were used for model training
2) 25% of data were used for model validation - fine tuning
3) 25% were used for testing of the model.

In other words, 75% of the data were used for training the algorithm and for assessing the impact that changes in algorithm parameters have on the accuracy of the model. For a decision tree algorithm (and depending on the type of decision tree used) an analyst might try different settings for the splitting criteria and/or the minimum number of cases per branch, etc.
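
The 50/25/25 methodology above can be sketched as a deterministic split (in practice the rows should be shuffled first, or split chronologically for time-series data):

```python
def split_50_25_25(rows):
    """Split rows into 50% training, 25% validation and 25% test sets."""
    n = len(rows)
    train_end = n // 2
    valid_end = train_end + n // 4
    return rows[:train_end], rows[train_end:valid_end], rows[valid_end:]

rows = list(range(100))  # stand-in for the actual instances
train, valid, test = split_50_25_25(rows)
print(len(train), len(valid), len(test))
```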

Unfortunately, an analyst often finds that the predicted accuracy of the model estimated during the training and validation phases (i.e. steps 1 and 2 above) is in no way representative when the model is tested on unseen cases (step 3).

During my analysis, numerous models showed an estimated accuracy of 85% or more, but when they were presented with actual data the accuracy dropped to 50-53%, suggesting that overfitting was present. Consequently, using these biased models to predict new cases would have had detrimental effects in actual stock trading.

When all models are built, the analyst should choose a model (when there is a requirement to use only one model) according to:

1) (Statistically significant) best accuracy.
2) Misclassification costs, if these are not taken into account during the model building process.


In the next post we will see how text mining may help us make better predictions for the markets.



Kamis, 30 Oktober 2008

Decision Tree Interpretation

In the previous post I went through some basic steps required for predicting the price changes of a specific stock on the Greek stock exchange. As a result of this process, the following decision tree was generated:





To interpret a decision tree, the analyst starts from the root of the tree and reads through it until a leaf node is reached. For example, one rule that can be extracted from the decision tree above is the following:

"IF aseStockExchange > 0.360 AND aseStockExchange > 1.985 THEN price>+2"

The rule above is found by starting from the root of the tree, moving along the left branch and then continuing to the right sub-branch. In the same way an analyst is able to find the rest of the rules identified by the decision tree.
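
This root-to-leaf reading can be automated. The nested-dict tree below is a hypothetical stand-in that reproduces the quoted rule, not the actual trained model:

```python
def extract_rules(node, path=()):
    """Walk a decision tree (nested dicts) root-to-leaf and emit one
    IF ... THEN rule per leaf."""
    if not isinstance(node, dict):
        yield "IF %s THEN %s" % (" AND ".join(path), node) if path else "THEN %s" % node
        return
    feature, threshold = node["split"]
    yield from extract_rules(node["le"], path + ("%s <= %.3f" % (feature, threshold),))
    yield from extract_rules(node["gt"], path + ("%s > %.3f" % (feature, threshold),))

# Hypothetical tree consistent with the rule quoted above.
tree = {"split": ("aseStockExchange", 0.360),
        "le": "price<=+2",
        "gt": {"split": ("aseStockExchange", 1.985),
               "le": "price<=+2",
               "gt": "price>+2"}}

for rule in extract_rules(tree):
    print(rule)
```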

When using decision tree learners or rule extractors, analysts record the precision and recall of each rule, which are not shown in the decision tree above. For simplicity, however, I will omit this information and describe the insights provided by the analysis. Decision trees possess the two following qualities:


1) They provide easy model interpretation

and

2) They show us the relative importance of the variables

When confronted with many variables, analysts usually start by building a decision tree and then feed the variables that the decision tree algorithm has selected into other methods that suffer from the complexity of many variables, such as neural networks. However, decision trees perform worse when the problem at hand is not linearly separable. For the purposes of our example, though, a decision tree 'explains' the behavior of the stock nicely.

It should be noted that during the feature selection analysis of our stock example we found that the features 'aseStockExchange' and 'DAX' are important. Other features, such as 'xaaPersonalHouseProducts', were flagged as important by the feature selection algorithm but were not used in the decision tree. Different feature selection methods produce different results (and one might say that this is not very reassuring) but usually most methods agree on a common feature subset that is of high predictive value.

The importance of the attributes can be seen from the level at which they appear in the decision tree (the higher the level, the greater the predictive power of the attribute). So in our example, the 'aseStockExchange' feature is the most important (since it is the attribute with which the decision tree starts), while less important attributes seem to be 'xaaLeisure' and 'xaaBenefit'.

Rabu, 15 Oktober 2008

Insights from a Decision Tree

Assuming that an analyst has completed all the necessary pre-processing tasks prior to the data mining phase, we are ready to deploy analytical methods such as decision tree learners that can classify unseen cases. For the goal of stock prediction we assume that we have collected the following data:




The column named XAACLASS is the target column that we wish to classify. Essentially we have the following classes here:

-price change percentage greater than +2%
-price change percentage less than -2%
-price change percentage between 0% (exclusive) and +2% (inclusive)
-price change percentage between -2% (inclusive) and 0% (inclusive)

In other words, each line shows us the state of the stock we wish to predict, given the rest of the market indices (such as realTimeFTSE, realTimeDAX, etc).

So, let us assume that we are ready to build such a model. However, we have to decide the time window our predictions will be made for...do we wish to predict what the stock price change will be 2 hours ahead? How about 1 day ahead?

Before dealing with this issue, I wanted to see how good a predictive model can be at predicting the stock price percentage change right now, based on the current market conditions. Here is a decision tree created from such data:





More to come in the next post, where the model seen above will be explained in detail. Until then, please read the post from this blog about the same problem. If you can, also read Fooled By Randomness...

Kamis, 09 Oktober 2008

So...What's important??

One step of a Knowledge Discovery process is to perform what is known as Feature Selection, which is essentially the identification of a subset of features with high predictive value.

Feature selection can potentially help increase the accuracy of prediction models. Methods such as Naive Bayes can perform better when presented with a subset of selected features rather than the whole feature set (because of feature redundancy).

Even if feature selection does not prove to help much, it is important to know the predictive power of each feature. There are numerous methods for doing this and -as is normally the case- there is no universally better method for performing optimal feature selection. The following is a view of the Feature Selection methods available in WEKA:




Let us stick to our stocks example to make things clearer. Suppose that I would like to know which features seem to be important for predicting the behavior of a stock. For our example, we will try to find out how the stock of NBG reacts.

By using a feature selection method we extract the following information :



The feature selection method above shows us how many times each attribute was selected during a 10-fold cross-validation. We can see that some attributes are selected more often than others in the cross-validation. For example:

realTimeDax
aseStockExchangeIndex
xaaPersonalHouseProducts
xaaTechnology
bankAgrotiki
bankAlpha
bankPiraeus
bankEuro


are present in all 10 folds of our cross-validation, hence the 10(100%) entry. The xaaFinancialServices index has been selected fewer times (8 out of 10), hence the 8(80%) entry. Other features never appear in any of the cross-validation folds.

Of course, feature selection does not stop here and there are many ways to enhance the process; data mining is both an art and a science. For our purpose, however, we were able to identify the attributes that seem to be important for predicting the NBG stock. We immediately see, for example, that the DAX index and the Athens Stock Exchange index are two important features, plus the stocks of four specific banks. Other feature selection methods produce weights that essentially rank the importance of each attribute for class prediction.
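
The 10(100%) / 8(80%) style report is just a per-attribute count over the cross-validation folds; a sketch with made-up fold results:

```python
from collections import Counter

def selection_frequency(folds_selected):
    """For each attribute, count in how many cross-validation folds the
    selector kept it, formatted like WEKA's 10(100%) output."""
    counts = Counter()
    for selected in folds_selected:
        counts.update(selected)
    n = len(folds_selected)
    return {attr: "%d(%d%%)" % (c, 100 * c // n) for attr, c in counts.items()}

# Made-up fold results: two attributes kept in every fold, one in 8 of 10.
folds = []
for i in range(10):
    selected = {"realTimeDax", "aseStockExchangeIndex"}
    if i < 8:
        selected.add("xaaFinancialServices")
    folds.append(selected)

print(selection_frequency(folds))
```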


Senin, 06 Oktober 2008

Always know your data!

Before rushing into analyzing and predicting the financial markets (or anything else, for that matter), it is essential to get an idea of the data at hand. So after data collection (i.e. getting the values of different market indices), I first wanted to understand what is going on in the markets. A correlation matrix tells us just that. Let's see what happens on the Greek Stock Exchange:


By looking at the matrix we can immediately see some interesting things :

1) There is a high correlation (0.847) between the DAX index and the Greek stock exchange index (marked as aseStockExchangeIndex)

2) The Insurance sector index (xaaInsurance) and the Media sector (xaaMedia) have a low correlation with the aseStockExchangeIndex. Consider the following scatter chart, which shows the poor correlation between Insurance sector stocks and the aseStockExchangeIndex:




These two facts alone can help significantly in trading. For example, if an investor's trading decisions are heavily based on the aseStockExchangeIndex, then the investor should also keep a close eye on the DAX index as opposed to other European indices (such as the FTSE, CAC40, etc).

A lot of problems later in the analysis can be prevented if one pays attention to the "Data Understanding" phase. Plus, we also get an insight into what kind of results we should expect from the learning algorithms.
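
The correlation values in the matrix are plain Pearson coefficients; a sketch with made-up index values (not the actual series):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two index series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical daily closes, only to show the computation.
ase = [3100, 3150, 3120, 3200, 3250, 3230]
dax = [6400, 6480, 6430, 6600, 6700, 6650]
print(round(pearson(ase, dax), 3))
```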

Kamis, 25 September 2008

Predicting the Financial markets

After a very long break I decided to start writing again. It is very interesting to see that people are still answering my questionnaire (see the Links area) and every week I see more answers coming in...but more data is always welcome!

Since February I have been involved with yet another Data & Text Mining application, namely predicting the financial markets using financial and world news (the Text Mining side) and key financial indices (the Data Mining side). There are numerous blogs and entries I have found about this problem. One blog example is Neural Market Trends, and there is also a series of articles which I originally found on KDnuggets.

There is no question that such an application of Predictive Analytics to the financial markets is interesting, but it can also be (potentially) dangerous. However, my experience so far on the subject has shown me that by a) getting a grip on the risk of a trading decision and b) using Predictive Analytics to make the trading decision, a user has a better chance of making a successful trade.

In the next post, we will go through the data mining side of the problem.

Senin, 11 Februari 2008

Analyzing the Real Estate Market - Part 2

In the previous part I listed the first steps required to turn the unstructured information of flat adverts for rent into a form suitable for further analysis of the Greek real estate market.

Once the Information Extraction step is finished, the characteristics of each flat advert (price, square meters, type of heating, age in years, etc.) are inserted into a database. Once the flat advert data are inserted, we are able to extract key information about price trends for specific areas of Athens, such as Nea Smyrni. The following screen capture shows a portion of the records that exist in the database after the information extraction phase:



With the advert data in place, we are ready to deploy data mining algorithms that can reveal potentially useful patterns. For example, a classification analysis aimed at finding which characteristics are important for obtaining a high rental price produces the following decision tree:




The decision tree depicted above essentially gives us the following information :

  • The most important characteristic for obtaining a high rental value (in terms of Euros per square meter) is the provision of a parking space with the flat.
  • If a flat provides a parking space, has a storage area and has up to two bedrooms, then it obtains the highest rental rate (i.e. 7.54 Euros per square meter).
  • If a flat does not provide a parking space but has at least one bedroom and is located on the fourth floor or higher, then it also obtains the highest rental rate per square meter.



Minggu, 06 Januari 2008

Analyzing the Real Estate Market - Part 1

Over the next few days I will present an example of using Data Mining and Information Extraction techniques to analyze real estate in the Greek market.

The problem is as follows: in a specific suburb of Athens, Greece (let's say Nea Smyrni), what are the key factors (or characteristics) that contribute to a high rental price for a flat? Which is more important: having a parking space, or the house being less than 5 years old?

In my experience, this piece of information is particularly valuable for flat owners, real estate investors and real estate agents, to name a few.

I really like this example of analysis because it shows the power of Information Extraction and Data Mining combined, and the insight that these techniques can reveal.

In order to implement this analysis, the first required action is the collection of information. For this reason, special software collects flat adverts for rent from Greek websites. The next step is to extract each flat's information from each advert. Information Extraction is used to extract these characteristics, as shown below:


The goal of Information Extraction is to transform unstructured information into a form suitable for further analysis. More specifically, after the Information Extraction phase, the characteristics of each flat advert are inserted into a database. More on this in Part 2...
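
A minimal sketch of this extraction step, assuming English-like adverts for readability (the real adverts are free Greek text and need far more robust patterns):

```python
import re

# Hypothetical advert patterns; each maps a database field to a regex.
PATTERNS = {
    "price_eur": r"(\d+)\s*euro",
    "sqm": r"(\d+)\s*sq\.?\s*m",
    "bedrooms": r"(\d+)\s*bedroom",
    "parking": r"(parking)",
}

def extract_flat(advert):
    """Turn a free-text flat advert into a structured record."""
    record = {}
    text = advert.lower()
    for field, pattern in PATTERNS.items():
        m = re.search(pattern, text)
        if field == "parking":
            record[field] = m is not None
        elif m:
            record[field] = int(m.group(1))
    return record

ad = "Nea Smyrni: 85 sq.m flat, 2 bedrooms, parking space, 700 euro per month"
print(extract_flat(ad))
```

Each resulting record maps directly onto one database row, which is the structured form the analysis in Part 2 starts from.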