A Peek at How Statistical Analysis and Machine Learning Work in Constructing a Quantitative Strategy
I remember talking about a statistical arbitrage strategy with a peer years back. The concept was simple, but I did not know the specific steps for implementing it on financial data, as my knowledge of programming and machine learning models was limited. In this blog, I will briefly introduce what a statistical arbitrage strategy is and how data analysis and machine learning techniques are applied in a practical case study.
Statistical arbitrage, also referred to as stat arb, is a type of investment strategy that trades a large number of stocks, often thousands, over short periods of time by applying mean reversion analysis to stock prices.
This strategy aims to reduce beta exposure as much as possible. Beta is a measure of a stock’s volatility relative to the overall market, such as the S&P 500. For example, if the price of SPY (an S&P 500 ETF) went up 1%, a stock with a beta of 2 would be expected to go up about 2%. The opposite also holds: if SPY went down 1%, the stock would be expected to fall about 2%. Theoretically speaking, if a stock has a non-zero beta, it is exposed to market risk, meaning its price movement is correlated with that of the broad market. Stat arb, however, tries to achieve a market-neutral stance by holding long (buy) and short (sell) positions simultaneously, taking advantage of pricing inefficiencies between correlated securities. A market-neutral strategy has a beta of zero, which means its returns are not affected by the market’s overall price movement.
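To make the beta idea concrete, here is a minimal sketch that estimates beta as the covariance of stock and market returns divided by the variance of market returns, and then checks that hedging with the market brings beta close to zero. The returns are synthetic numbers I made up for illustration (not real SPY data), and the function name `estimate_beta` is my own.

```python
import numpy as np

def estimate_beta(stock_returns, market_returns):
    """Estimate beta as Cov(stock, market) / Var(market)."""
    stock = np.asarray(stock_returns, dtype=float)
    market = np.asarray(market_returns, dtype=float)
    cov = np.cov(stock, market, ddof=1)[0, 1]
    return cov / np.var(market, ddof=1)

# Synthetic daily returns (assumption: made up, true beta set to 2).
rng = np.random.default_rng(0)
market = rng.normal(0.0, 0.01, size=1000)
stock = 2.0 * market + rng.normal(0.0, 0.005, size=1000)

beta = estimate_beta(stock, market)

# Shorting `beta` dollars of the market per dollar of stock
# removes the market component from the position's returns.
hedged = stock - beta * market
hedged_beta = estimate_beta(hedged, market)

print(f"estimated beta: {beta:.2f}, beta after hedging: {hedged_beta:.4f}")
```

The estimated beta should land near 2, and the hedged position's beta should be essentially zero, which is the market-neutral stance described above.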
A Simple Example of How Stat Arb Works
Suppose this is the chart of the price action of Coca-Cola vs. Pepsi. The blue line represents the price movement of Coca-Cola, and the yellow line is that of Pepsi. Because these are two similar companies in the same industry selling very similar products, we assume that their prices are correlated. Traders would enter a pair trade in the green area by buying Coca-Cola and shorting the same dollar value of Pepsi stock.
After January, Pepsi stock fell more than Coca-Cola, leaving a big gap between the pair’s prices in the red circle. Traders would likely close their positions there, having profited more from the short position than they lost on the long position. The pair trade is profitable.
What if prices move the other way, for example Coca-Cola falling more than Pepsi? The investor would then lose more on the long position than they gain on the short. This is where a stop-loss kicks in to exit the trade.
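The arithmetic behind both outcomes can be sketched in a few lines. The $10,000 position size and the percentage moves are hypothetical figures I picked for illustration, and `pair_trade_pnl` is a made-up helper name.

```python
def pair_trade_pnl(notional, long_return, short_return):
    """P&L of a pair trade: long one stock, short the same dollar value of the other."""
    long_pnl = notional * long_return
    short_pnl = -notional * short_return  # a short position gains when the price falls
    return long_pnl + short_pnl

# Hypothetical figures: $10,000 on each leg.
# Favourable case: Coca-Cola (long) falls 2%, Pepsi (short) falls 5%.
win = pair_trade_pnl(10_000, -0.02, -0.05)

# Adverse case: Coca-Cola (long) falls 5%, Pepsi (short) falls 2%.
loss = pair_trade_pnl(10_000, -0.05, -0.02)

print(round(win, 2), round(loss, 2))
```

With these made-up numbers, the first case gains about $300 and the second loses about $300. The second case is the kind of loss a stop-loss is meant to cap.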
Application of Data Analysis and Machine Learning in Building Stat Arb
This is an overview of the general steps in building this strategy.
The research steps:
- Categorizing stocks into different groups (aggressive vs. defensive)
- Identifying pairs with the cointegration test
- Constructing the portfolio
- Forecasting stock prices with a machine learning algorithm (LSTM)
- Calculating trading profits
First, we need to perform some statistical analysis. In practice, we need to decide when to enter a trade and when to exit. We calculate the spread between the two stock prices and normalize it with a Z-score. In the chart below, the black line is the mean spread over a 250-trading-day period. If the spread departs from the normal range (the space between the red and green lines), the trade starts, and when it reverts to the normal range, the trade stops.
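The entry and exit rule above can be sketched with pandas. This is only a rough illustration: the band of 2 standard deviations, the function name `zscore_positions`, and the synthetic prices are my own assumptions, while the 250-day window follows the text.

```python
import numpy as np
import pandas as pd

def zscore_positions(price_a, price_b, window=250, band=2.0):
    """Return the rolling z-score of the spread and a position in the spread.

    Position is -1 (short A, long B) when z > band, +1 (long A, short B)
    when z < -band, and 0 while the spread is inside the band.
    """
    spread = price_a - price_b
    mean = spread.rolling(window).mean()
    std = spread.rolling(window).std()
    z = (spread - mean) / std
    # Comparisons with NaN (the warm-up period) are False, so those days stay flat.
    position = pd.Series(
        np.where(z > band, -1, np.where(z < -band, 1, 0)), index=z.index
    )
    return z, position

# Synthetic prices for illustration only: a shared trend plus noise on stock A.
rng = np.random.default_rng(1)
common = 50 + np.cumsum(rng.normal(0, 1, 600))
stock_a = pd.Series(common + rng.normal(0, 1, 600))
stock_b = pd.Series(common)

z, position = zscore_positions(stock_a, stock_b)
print(position.value_counts())
```

A position of 0 corresponds to the spread sitting inside the normal range, and a non-zero position to the spread having left it.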
Quick Concepts: Stationary vs. Non-Stationary in Time Series
- Time series: a set of observations of a variable over successive periods of time.
- Stationary: a series whose statistical properties, such as mean, variance, and covariance, do not vary with time.
- Non-Stationary: a series whose statistical properties, such as mean, variance, and covariance, change over time, for example by trending upward or downward.
To be selected as a pair, the two stocks’ price series should each be non-stationary, while the series of their price spread should be stationary (mean-reverting). We need the spread to be stationary because the strategy bets on the spread returning to its mean, and forecasting models are generally unreliable on non-stationary data. Such a pair is said to be cointegrated, which means that the two time series are linked, or follow the same trend, and cannot deviate from equilibrium in the long term. Stationarity is checked with the Augmented Dickey-Fuller test and cointegration with the Augmented Engle-Granger test. Both can be run with the statsmodels package in Python.
Machine learning is a branch
of artificial intelligence and computer science that focuses on the use of data
and algorithms to imitate the way that humans learn, gradually improving its
accuracy.
The primary role of ML in this strategy is to use the input data to train a model for price prediction and then test the accuracy of its predictions. To achieve this, we must separate our dataset into two groups: one is used to train and find an optimal model, and the other is used to test the model’s prediction accuracy. Once a machine learning algorithm has learned the underlying patterns of the training dataset, it needs to be tested on new data that it has never seen before.
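For time series, one common way to make the split (my assumption here, not necessarily what the paper does) is to cut the data chronologically rather than shuffling it, so that the test set only contains dates after the training period. A minimal sketch with placeholder data and a made-up helper name:

```python
import numpy as np

def chronological_split(series, train_fraction=0.8):
    """Split a time series into train and test sets without shuffling."""
    values = np.asarray(series)
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]

prices = np.arange(100, dtype=float)  # placeholder data, not real prices
train, test = chronological_split(prices, train_fraction=0.75)
print(len(train), len(test))
```

Keeping the order intact means the model is never trained on information from the period it is later tested on.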
Machine learning algorithms used for predictive modeling in stat arb:
- Random Forest (RF)
- Adaptive Neuro-Fuzzy Inference System (ANFIS)
- Convolutional Neural Networks (CNN)
The basis of the machine learning tools used in this case is the Recurrent Neural Network (RNN). However, a plain RNN is not very good at capturing long-term temporal dependencies, which here means predicting prices over longer horizons. This is where the long short-term memory network (LSTM) comes in.
LSTM is a subtype of RNN. Compared to a plain RNN, an LSTM includes a “memory cell” that can maintain information for long periods of time. This architecture lets it learn longer-term dependencies. The algorithm is shown in the chart below.
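As a rough illustration of the memory cell idea, here is a single LSTM time step written in plain NumPy, following the standard textbook formulation (forget, input, and output gates plus a candidate value). It is not the model from the paper: the weights are random and untrained, the layer sizes and input values are made up, and in practice one would use a deep learning library instead.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step of a standard LSTM cell.

    W has shape (4 * hidden, hidden + inputs); b has shape (4 * hidden,).
    Rows are ordered: forget gate, input gate, output gate, candidate.
    """
    hidden = h_prev.shape[0]
    gates = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(gates[:hidden])                # what to keep from the old cell state
    i = sigmoid(gates[hidden:2 * hidden])      # how much new information to write
    o = sigmoid(gates[2 * hidden:3 * hidden])  # how much of the cell to expose
    g = np.tanh(gates[3 * hidden:])            # candidate values
    c = f * c_prev + i * g                     # memory cell update
    h = o * np.tanh(c)                         # hidden state / output
    return h, c

# Random, untrained weights and a made-up input sequence.
rng = np.random.default_rng(0)
inputs, hidden = 1, 4
W = rng.normal(0, 0.5, size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in [0.1, -0.2, 0.05]:
    h, c = lstm_step(np.array([x_t]), h, c, W, b)
print(h.shape, c.shape)
```

The cell state `c` is carried from step to step, and the forget gate `f` decides how much of it survives. That carried state is what lets the network hold on to information over longer stretches than a plain RNN.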
Risks of the Strategy
- The strategy is based on studying historical price action, and past movements cannot guarantee future movements. The longer a trade stays open, the more factors can affect the correlation between the stocks. Events like bankruptcies and black swan events can break the correlation of a pair, and the strategy would lose its footing.
- Short interest cost. Because the strategy involves short selling, it is exposed to situations where the interest charged on borrowed stock rises sharply.
- Market impact caused by HFT. Because more investors and major players, along with high-frequency traders, are using the strategy, trades can be executed at prices that are not ideal, further squeezing the strategy’s profit.
Conclusion
Thanks to Victor, Xiaowen, Qianwen, and Robert for their fascinating paper, and to the informative contributors online. I learned a ton from them and formed this framework for researching a potential strategy using data analysis and machine learning techniques. This is just the start of much more complicated but fascinating research. I plan to dig deeper into machine learning algorithms and neural networks in the future.
References:
Victor Chang, Xiaowen Man, Qianwen Xu, and Robert Hsu. (2020). Pairs trading on different portfolios based on machine learning.
https://analyzingalpha.com/statistical-arbitrage
https://www.ibm.com/cloud/learn/machine-learning
https://www.youtube.com/watch?v=nPYPyh20gGo
https://corporatefinanceinstitute.com/resources/knowledge/other/cointegration/
https://stats.stackexchange.com/questions/222584/difference-between-feedback-rnn-and-lstm-gru#:~:text=LSTM%20networks%20are%20a%20type,output%2C%20and%20when%20it's%20forgotten.
