A Peek at How Statistical Analysis and Machine Learning Work in Constructing a Quantitative Strategy

I remember talking about the statistical arbitrage strategy with a peer years ago. The concept was simple, but I did not know the specific steps to implement it on financial data because my knowledge of programming and machine learning models was limited. In this blog, I will briefly introduce what a statistical arbitrage strategy is, and then show how data analysis and machine learning techniques are applied in a practical case study.

Statistical arbitrage, also referred to as stat arb, is a type of investing strategy that involves trading thousands of stocks over a short period of time by applying mean reversion analysis to stock prices.

This strategy aims to reduce beta exposure as much as possible. Beta is a measure of a stock’s volatility in relation to the overall market, such as the S&P 500. For example, if the price of SPY (an S&P 500 ETF) went up 1%, a stock with a beta of 2 would be expected to go up 2%. The same holds in the opposite direction: if SPY went down 1%, this stock would be expected to fall 2%. Theoretically speaking, if a stock has a nonzero beta, it is exposed to market risk, meaning its price movement is correlated with that of the broad market. Stat arb, however, tries to achieve a market-neutral stance by holding long (buy) and short (sell) positions simultaneously, taking advantage of pricing inefficiencies in correlated securities. Market neutral means the strategy has a beta of zero, so its returns are not affected by the market’s price movement.
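To make the beta idea concrete, here is a minimal Python sketch. It assumes beta is estimated as the covariance of stock and market returns divided by the variance of market returns, which is the usual regression-slope definition. The return series are synthetic, and the function names `estimate_beta` and `portfolio_beta` are only illustrative.

```python
import numpy as np

def estimate_beta(stock_returns, market_returns):
    """Beta = Cov(stock, market) / Var(market), estimated from return samples."""
    stock_returns = np.asarray(stock_returns, dtype=float)
    market_returns = np.asarray(market_returns, dtype=float)
    cov = np.cov(stock_returns, market_returns, ddof=1)  # 2x2 covariance matrix
    return cov[0, 1] / cov[1, 1]

def portfolio_beta(weights, betas):
    """The beta of a portfolio is the weighted sum of its positions' betas."""
    return float(np.dot(weights, betas))

# Synthetic daily returns (illustrative only, not real market data).
rng = np.random.default_rng(0)
market = rng.normal(0.0, 0.01, size=1000)
stock_a = 2.0 * market + rng.normal(0.0, 0.002, size=1000)  # built with beta 2
stock_b = 1.0 * market + rng.normal(0.0, 0.002, size=1000)  # built with beta 1

beta_a = estimate_beta(stock_a, market)
beta_b = estimate_beta(stock_b, market)

# Long $1 of A and short beta_a / beta_b dollars of B so the betas cancel.
weights = np.array([1.0, -beta_a / beta_b])
net_beta = portfolio_beta(weights, np.array([beta_a, beta_b]))
print(round(beta_a, 2), round(beta_b, 2), round(net_beta, 6))
```

The long and short legs are sized so that their betas offset each other, which is one simple way to read "market neutral". A real strategy would estimate beta on actual return data and re-estimate it over time.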


A Simple Example of How Stat Arb Works

Suppose this is the chart of the price action of Coca-Cola vs. Pepsi. The blue line represents the price movement of Coca-Cola, and the yellow line is that of Pepsi. Because these are two similar companies in the same industry that produce very similar products, we assume that their prices are correlated. Traders would enter a pair trade in the green area by buying Coca-Cola and shorting the same dollar value of Pepsi stock.

After January, Pepsi stock fell more than Coca-Cola, and a big gap opened between the pair’s prices in the red circle. Traders would likely close their positions there, where they have profited more from the short position than they lost on the long position. The pair trade is profitable.

What if the prices move the other way, for example if Coca-Cola went down more than Pepsi? Then the investor would lose more money on the long position than the short position gains, and the stop-loss system would kick in to exit the trade.
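The arithmetic behind both scenarios can be sketched in a few lines of Python. The returns, the $10,000 position size, and the 5% stop-loss threshold are hypothetical numbers chosen for illustration, and the function names are made up for this sketch.

```python
def pair_trade_pnl(long_return, short_return, notional):
    """P&L of a dollar-neutral pair: long `notional` of one stock,
    short the same dollar value of the other."""
    long_pnl = notional * long_return
    short_pnl = -notional * short_return  # a short gains when the price falls
    return long_pnl + short_pnl

def hit_stop_loss(pnl, notional, max_loss_fraction=0.05):
    """True when the loss reaches the allowed fraction of the position size."""
    return pnl <= -max_loss_fraction * notional

notional = 10_000.0

# Scenario 1: Coca-Cola (long) falls 2%, Pepsi (short) falls 8%.
winning = pair_trade_pnl(long_return=-0.02, short_return=-0.08, notional=notional)

# Scenario 2: Coca-Cola (long) falls 8%, Pepsi (short) falls 2%.
losing = pair_trade_pnl(long_return=-0.08, short_return=-0.02, notional=notional)

print(winning, hit_stop_loss(winning, notional))
print(losing, hit_stop_loss(losing, notional))
```

In the first scenario the short leg earns more than the long leg loses, so the pair makes money even though both stocks fell. In the second the signs flip, and the loss is large enough to trip the assumed stop-loss.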




Application of Data Analysis and Machine Learning in Building Stat Arb

These are the general steps in building this strategy.

The research steps:

  • Categorizing stocks into different groups (aggressive vs. defensive)
  • Identifying pairs with the cointegration test
  • Constructing the portfolio
  • Forecasting stock prices with a machine learning algorithm (LSTM)
  • Calculating trading profits

First, we need to perform some statistical analysis. In practice, we need to decide when to enter a trade and when to exit. We calculate the spread between the two stock prices and normalize it with a Z-score. In the chart below, the black line is the mean spread over a 250-trading-day period. If the spread departs from the normal range (the space between the red and green lines), trading starts, and when it reverts to the normal range, trading stops.
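The entry and exit rule described above can be sketched as follows. This is a minimal illustration, not the exact rule from the case study: the band of 2 standard deviations is an assumed value, the spread is synthetic, and the function names are made up. Only the 250-day window comes from the text.

```python
import numpy as np

def rolling_zscore(spread, window=250):
    """Z-score of each point relative to the trailing `window` observations."""
    spread = np.asarray(spread, dtype=float)
    z = np.full(spread.shape, np.nan)  # undefined until a full window exists
    for t in range(window - 1, len(spread)):
        hist = spread[t - window + 1 : t + 1]
        std = hist.std(ddof=1)
        if std > 0:
            z[t] = (spread[t] - hist.mean()) / std
    return z

def positions_from_zscore(z, band=2.0):
    """+1 = long the spread, -1 = short the spread, 0 = no position."""
    pos = np.zeros(len(z), dtype=int)
    current = 0
    for t, value in enumerate(z):
        if np.isnan(value):
            current = 0
        elif current == 0:
            if value > band:
                current = -1   # spread unusually wide: short it
            elif value < -band:
                current = 1    # spread unusually narrow: long it
        elif abs(value) <= band:
            current = 0        # spread back inside the normal range: close
        pos[t] = current
    return pos

# Synthetic mean-reverting spread (illustrative only).
rng = np.random.default_rng(1)
spread = np.zeros(1000)
for t in range(1, 1000):
    spread[t] = 0.95 * spread[t - 1] + rng.normal(0.0, 1.0)

z = rolling_zscore(spread, window=250)
pos = positions_from_zscore(z, band=2.0)
print("days in a trade:", int(np.count_nonzero(pos)))
```

Shorting the spread means selling the stock that looks expensive and buying the one that looks cheap. Going long the spread does the opposite.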

 

Quick Concepts: Stationary vs Non-Stationary in Time Series

  • Time series: a set of observations for a variable over successive periods of time.
  • Stationary: a series whose statistical properties, like mean, variance, and covariance, do not vary with time.
  • Non-Stationary: a series whose statistical properties, like mean, variance, and covariance, change over time, for example by showing an increasing or decreasing trend.


For a pair to be selected, each stock’s price series must be non-stationary, while the series of the price spread must be stationary (mean-reverting). We need the spread to be stationary because both a mean-reversion trade and a model that forecasts the spread rely on statistical properties that stay stable over time. Stationarity is checked with the Augmented Dickey-Fuller (ADF) test. Candidate pairs are then put through the Augmented Engle-Granger cointegration test; cointegration means that the two time series are linked, follow the same trend, and cannot deviate from their equilibrium in the long term. Both tests can be run with the statsmodels package in Python.


Machine learning is a branch of artificial intelligence and computer science that focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

The primary role of ML in this strategy is to train a model on the input data for price prediction and then test the accuracy of its predictions. To achieve this, we must split our dataset into two groups: one to train and find an optimal model, and another to test the model’s prediction accuracy. Once a machine learning algorithm learns the underlying patterns of the training dataset, it needs to be tested on new data that it has never seen before.

Machine learning algorithms used for predictive modeling in stat arb:

  • Random Forest (RF)
  • Adaptive Neuro-Fuzzy Inference System (ANFIS)
  • Convolutional Neural Networks (CNN)
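A minimal sketch of such a split is shown below. It assumes a chronological split, which is the common choice for time series because it keeps future data out of the training set. The 80/20 ratio, the lookback length, and the function names are illustrative assumptions.

```python
import numpy as np

def chronological_split(series, train_fraction=0.8):
    """Earliest observations go to training, the most recent to testing."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    series = np.asarray(series)
    split = int(len(series) * train_fraction)
    return series[:split], series[split:]

def make_windows(series, lookback):
    """Build (X, y): each row of X holds `lookback` past values,
    and y holds the value that immediately follows them."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i : i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X, y

# Toy series standing in for a price or spread history.
series = np.arange(10)
train, test = chronological_split(series, train_fraction=0.8)
X_train, y_train = make_windows(train, lookback=3)
print(train, test)
print(X_train.shape, y_train.shape)
```

The windows are built after the split, so no training sample contains a value from the test period.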



The machine learning tools used in this case are based on Recurrent Neural Networks (RNN). However, a plain RNN is not very good at learning long-term temporal dependencies, which here means predicting prices over longer horizons. This is why we use the long short-term memory network (LSTM).

LSTM is a subtype of RNN. Compared to a plain RNN, LSTM includes a “memory cell” that can maintain information for long periods of time. This architecture lets it learn longer-term dependencies. The algorithm is shown in the chart below.
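To show what the memory cell does, here is one time step of a standard LSTM cell written in plain NumPy. It is a sketch of the textbook formulation with random, untrained weights, not the model from the paper. The gate ordering inside the stacked parameter matrices is a convention chosen for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x: (input_dim,)   h_prev, c_prev: (hidden_dim,)
    W: (4*hidden_dim, input_dim)   U: (4*hidden_dim, hidden_dim)   b: (4*hidden_dim,)
    Stacked gate order (assumed here): forget, input, candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:hidden])                 # forget gate: how much old memory to keep
    i = sigmoid(z[hidden:2 * hidden])        # input gate: how much new info to write
    g = np.tanh(z[2 * hidden:3 * hidden])    # candidate values to write
    o = sigmoid(z[3 * hidden:4 * hidden])    # output gate: how much memory to expose
    c = f * c_prev + i * g                   # updated memory cell
    h = o * np.tanh(c)                       # updated hidden state
    return h, c

# Run the cell over a short synthetic sequence (untrained random weights).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 1, 4
W = rng.normal(0.0, 0.1, size=(4 * hidden_dim, input_dim))
U = rng.normal(0.0, 0.1, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
sequence = rng.normal(0.0, 1.0, size=20)
for value in sequence:
    h, c = lstm_step(np.array([value]), h, c, W, U, b)
print(h.shape, c.shape)
```

The line `c = f * c_prev + i * g` is the key difference from a plain RNN: when the forget gate stays close to 1, the cell carries information across many steps almost unchanged. In practice the weights would be trained with a deep learning library rather than written by hand.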




Risks of the Strategy

  • The strategy is based on studying historical price action, and past movements cannot guarantee future movements. The longer a trade stays open, the more factors can affect the correlation between the stocks. Events like bankruptcies and black swan events can break the correlation of a pair, and the strategy would lose its footing.
  • Short interest cost. Because the strategy involves short selling, it is exposed to situations where the interest charged on borrowed stock increases sharply.
  • Market impact caused by HFT. Because more investors and major players, plus high-frequency traders, are using the strategy, trades can be executed at prices that are not ideal, further squeezing the strategy’s profit.


Conclusion

Thanks to Victor, Xiaowen, Qianwen, and Robert for their fascinating paper, and to the informative contributors online. I learned a ton from them and formed this framework for how to research a potential strategy using data analysis and machine learning techniques. This is just the start of much more complicated but fascinating research. I plan to dig deeper into machine learning algorithms and neural networks in the future.

 

 

 

 

 

References:

Victor Chang, Xiaowen Man, Qianwen Xu, and Robert Hsu. (2020). Pairs trading on different portfolios based on machine learning.

https://www.investopedia.com/terms/s/statisticalarbitrage.asp#:~:text=Statistical%20arbitrage%20is%20a%20group,risk%20as%20much%20as%20possible.

https://analyzingalpha.com/statistical-arbitrage

https://www.ibm.com/cloud/learn/machine-learning

https://www.youtube.com/watch?v=nPYPyh20gGo

https://corporatefinanceinstitute.com/resources/knowledge/other/cointegration/

https://stats.stackexchange.com/questions/222584/difference-between-feedback-rnn-and-lstm-gru#:~:text=LSTM%20networks%20are%20a%20type,output%2C%20and%20when%20it's%20forgotten.


