A data science approach to trend investing

Updated: May 17



In an earlier article, I highlighted that a semi-passive investing methodology might provide better returns than a simple buy-and-hold strategy. This got me thinking about trend investing and how we could apply data science to build a model for it.


Trend investing has been around for a long time, with many momentum traders advocating it. “Buy the trend, not the dip” has always been the mantra. No matter how good the fundamentals are or how undervalued the company is, trend investors do not believe in buying a stock unless the trend says so.


On the other side of the camp, you have value investors who claim that the PE ratio, PB ratio, etc. are the most important elements in the world besides air, water and fire. Throwing any concept of momentum or trend to the wind, they will only buy a stock if it’s undervalued.

Benjamin Graham and Warren Buffett are two of the best-known value investors in the world. They are considered the fathers of value investing. And this got me thinking: do they follow value investing strictly and ignore everything about trend when they invest?


An article here answers this. While there were many occasions in history when Graham thought that the stock market was overvalued, he did not actually liquidate his stocks then. Instead, he held on to them. As a result, more than half of the returns of Graham's portfolio came from following the trend through the bull market of the 1950s and 1960s. Had he decided to sell his stocks in 1953, when they were overvalued by his own metrics, he would have missed out on these two decades of bull market and would probably not have beaten the market.


So, I’m convinced trend investing is important. The next question is: how do we apply data science to trend investing?


Perhaps with a data science model that recommends “Buy”, “Hold” or “Sell” based on key technical indicators such as the 20-day, 50-day and 200-day simple moving averages (SMA 20, SMA 50 and SMA 200)?


Let’s give it a try.


To create any data science model, we first need data. Thankfully, some historical stock data can be easily downloaded online. From Yahoo Finance, I exported the daily prices of the S&P 500 from 28th Oct 2008 to 6th April 2019.


Next, I created an Excel sheet with the following columns:

- S&P 500 (reflecting the daily price)

- SMA 20

- SMA 50

- SMA 200

- Price above SMA 200?

- SMA 20 above SMA 50?

- SMA 50 above SMA 200?



The last three columns represent some of the technical strategies which people use to understand if it’s the right time to buy a stock. For example, some investors will choose to buy a stock only when SMA 50 crosses above SMA 200 as it indicates a positive trend or momentum.


These various columns represent features which I will use in the model.
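
As a rough illustration, the same columns could be built in Python instead of a spreadsheet. This is only a minimal sketch using pandas: the function name, the column names and the synthetic rising price series are assumptions made for the example, not the actual spreadsheet or S&P 500 data.

```python
import pandas as pd

def add_sma_features(prices: pd.Series) -> pd.DataFrame:
    """Build the moving-average columns from a series of daily closing prices."""
    df = pd.DataFrame({"close": prices})
    for window in (20, 50, 200):
        # rolling().mean() is NaN until a full window of prices is available
        df[f"sma_{window}"] = df["close"].rolling(window).mean()
    # Keep only the days on which all three averages exist
    df = df.dropna().copy()
    # The three yes/no columns used as trend signals
    df["price_above_sma_200"] = df["close"] > df["sma_200"]
    df["sma_20_above_sma_50"] = df["sma_20"] > df["sma_50"]
    df["sma_50_above_sma_200"] = df["sma_50"] > df["sma_200"]
    return df

# Synthetic, steadily rising prices purely for illustration (not S&P 500 data)
demo_prices = pd.Series([100.0 + i for i in range(250)])
features = add_sma_features(demo_prices)
print(features.tail(3))
```

In a steadily rising series like this one, all three yes/no columns come out as True, which is the pattern a trend follower would read as positive momentum.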


With any data science model, you will need a target variable which represents the prediction/classification outcome of your model. What do you want to know here?

In this case, I want to know if I should “Buy”, “Hold” or “Sell” based on the various features listed above. And here are my criteria.


- “Buy” when the price has a 10% rise in the next 50 trading days

- “Sell” when the price has a 10% drop in the next 50 trading days

- “Hold” for all other scenarios


Using historical data, I then created an additional column in the spreadsheet with the actual values of this target variable for all the trading days from 28th Oct 2008 to 6th April 2019.
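
The labelling rule can be sketched in a few lines of plain Python. One assumption here is mine: a "10% rise in the next 50 trading days" is read as comparing the price exactly 50 trading days ahead with today's price, which is one plausible interpretation of the criteria. The function name and the toy prices are hypothetical.

```python
def label_targets(prices, horizon=50, threshold=0.10):
    """Return one label per day, or None where the future price is not yet known."""
    labels = []
    for i, price in enumerate(prices):
        if i + horizon >= len(prices):
            # The look-ahead window runs past the end of the data
            labels.append(None)
            continue
        # Assumed reading of the rule: compare the price `horizon` days ahead with today's
        change = prices[i + horizon] / price - 1.0
        if change >= threshold:
            labels.append("Buy")
        elif change <= -threshold:
            labels.append("Sell")
        else:
            labels.append("Hold")
    return labels

# Toy prices with a 2-day horizon so the example stays readable
toy_labels = label_targets([100, 101, 115, 90, 118], horizon=2)
print(toy_labels)
```

Note that the last `horizon` days of any dataset cannot be labelled, since their future prices are not available yet.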


Using all this information, I could then proceed to create a model. I’m interested to know how well my model's predictions for the target variable fare against the actual values. But before that, let me dive a bit deeper into the construction of this data science model.


The methodology I adopted for my model is random forest. Random forest is usually used as a classification algorithm whose building blocks are decision trees, so to understand random forest, you must first understand decision trees. A decision tree is a way of asking a series of yes/no questions with the aim of arriving at a classification. For instance, if the answers to the questions “Is this an animal?”, “Does it have four legs?” and “Does it purr?” are all “Yes”, you could be fairly certain that it is a cat.

The tricky thing though is how to come up with the questions for the decision tree. A common algorithm used to form these questions is CART (Classification and Regression Trees). Essentially, CART builds a binary decision tree by repeatedly splitting a node into two child nodes, starting from the root node, which contains the whole training dataset. The splitting ends in leaf nodes, each of which gives an output value that is used as the prediction. Basically, the objective at each node is to split the data into two groups that are as different from each other as possible, while the data points within each group are as similar to each other as possible.
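
To make "as similar to each other as possible" concrete: for classification, CART commonly scores a candidate question by the Gini impurity of the two groups it produces, and picks the question with the lowest weighted impurity. The sketch below shows that idea for yes/no features only. The function names, the feature names and the four hand-made rows are assumptions for illustration, not the model's actual data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for a more mixed one."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_impurity(rows, labels, feature):
    """Weighted Gini impurity of the two child nodes after a yes/no split."""
    yes = [lab for row, lab in zip(rows, labels) if row[feature]]
    no = [lab for row, lab in zip(rows, labels) if not row[feature]]
    total = len(labels)
    return len(yes) / total * gini(yes) + len(no) / total * gini(no)

def best_split(rows, labels):
    """Pick the feature whose question leaves the purest child nodes."""
    return min(rows[0], key=lambda feature: split_impurity(rows, labels, feature))

# Hand-made toy rows (hypothetical values)
rows = [
    {"price_above_sma_200": True, "sma_20_above_sma_50": True},
    {"price_above_sma_200": True, "sma_20_above_sma_50": False},
    {"price_above_sma_200": False, "sma_20_above_sma_50": True},
    {"price_above_sma_200": False, "sma_20_above_sma_50": False},
]
labels = ["Buy", "Buy", "Sell", "Sell"]
print(best_split(rows, labels))
```

In this toy data, asking "Is the price above SMA 200?" separates the labels perfectly, so it is chosen as the first question. A full CART implementation would repeat the same search inside each child node.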


Now, a random forest uses many such decision trees, each functioning as a model of its own, operating together as an ensemble to produce predictions that are more accurate than those of any single tree. Because the trees have low correlation with one another, the final model is protected from depending on a single decision tree that makes a wrong prediction and having that error propagate through the model.


To build your own machine learning model, you could either code it in Python (e.g. scikit-learn is a free machine learning library you can use) or use existing analytical software such as Mind Foundry.
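
For readers who go the Python route, a minimal scikit-learn sketch might look like the following. Everything in it is an assumption for illustration: the features and labels are randomly generated stand-ins (not S&P 500 data), the number of trees and the 75/25 split are arbitrary choices, and it is not necessarily how the model in this article was built.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for seven feature columns (random numbers, not market data)
X = rng.normal(size=(600, 7))
# Synthetic labels driven by the first column, so there is a pattern to learn
y = np.where(X[:, 0] > 0.5, "Buy", np.where(X[:, 0] < -0.5, "Sell", "Hold"))

# shuffle=False keeps the rows in order, so the test rows come after the training rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are actual classes, columns are predicted classes
class_names = ["Buy", "Hold", "Sell"]
matrix = confusion_matrix(y_test, predictions, labels=class_names)
print(matrix)
```

With real data, the feature matrix would hold the spreadsheet columns and the labels would be the "Buy", "Hold" and "Sell" target column.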


How do we then measure the success of the model? A common diagnostic tool is the confusion matrix. Here are the results.

(Diagram shows that the model is able to make a correct "Buy" call 89.5% of the time, a correct "Hold" call 87.7% of the time, and a correct "Sell" call 88.5% of the time)


A confusion matrix allows you to measure two main components of any classifier model: Recall and Precision.


Recall is the true positive rate and is measured by dividing True Positive by (True Positive + False Negative). Precision is the positive predictive value and is measured by dividing True Positive by (True Positive + False Positive). Basically, you want these values to be as high as possible. Besides Recall and Precision, you could also measure two other values, Accuracy and F-measure. I shall not bore you with the details here; you can read up more on the technical details here:


True Positive/False Negative

Recall, Precision etc
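
As a small worked example of the two formulas, the sketch below computes Precision and Recall for each class from a three-class confusion matrix. The counts in the matrix are made up for illustration and are not the model's actual results. The function name and the layout (rows as actual classes, columns as predicted classes) are assumptions as well.

```python
def precision_recall(matrix, class_names):
    """matrix[i][j] counts days whose actual class is i and predicted class is j."""
    scores = {}
    for k, name in enumerate(class_names):
        tp = matrix[k][k]
        # False negatives: actually this class, but predicted as another
        fn = sum(matrix[k]) - tp
        # False positives: predicted as this class, but actually another
        fp = sum(row[k] for row in matrix) - tp
        scores[name] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return scores

# Made-up counts for illustration only
toy_matrix = [
    [8, 2, 0],  # actual "Buy"
    [1, 7, 2],  # actual "Hold"
    [1, 1, 8],  # actual "Sell"
]
scores = precision_recall(toy_matrix, ["Buy", "Hold", "Sell"])
print(scores)
```

For the "Buy" class in this toy matrix, 8 of the 10 actual "Buy" days were caught (Recall 0.8) and 8 of the 10 "Buy" calls were right (Precision 0.8).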


In short, my model is able to achieve a value of 0.878 for Recall and 0.879 for Precision. I personally think these values are high enough to be meaningful, and I am definitely giving my model a try to aid me in making "Buy", "Hold" or "Sell" decisions in my future investments/trades. Of course, the use of machine learning here is not to replace the other analytical work which you need to do. Rather, I think it's an excellent tool to assist you in your investing decisions.

In case you are wondering what recommendations my model has been churning out for the S&P 500 of late, please take note of the following.


This model's recommendations will be updated on a regular basis for patrons (https://www.patreon.com/datascienceinvestor)


Of course, big disclaimer ahead: Do remember this is not professional financial advice.


©2019 by datascienceinvestor