A comprehensive comparison of the predictive models
Volatility prediction is a cornerstone of risk management and options pricing in financial markets. Traditionally, models like GARCH have been the go-to for forecasting volatility, relying on past price movements and volatility patterns. However, with the rise of machine learning (ML) and deep learning, these conventional methods are being challenged by more advanced models that claim to capture non-linear relationships and complex market behavior.
In this two-part series, we will dive deep into three approaches — GARCH, ML models, and Artificial Neural Networks (ANNs) — for predicting volatility. This first part focuses on the theoretical foundations behind these models, along with the process of data extraction and preparation using ORATS’ API endpoints. In the second part, we will implement these models, evaluate their performance, and compare their predictive capabilities.
Let’s start by understanding the fundamental concepts and preparing the data needed to determine if the classic econometric models still hold their ground or if ML and ANNs offer a more robust alternative.
Theoretical Foundations of Volatility Prediction
Volatility prediction is crucial in financial markets, especially in options pricing, risk management, and portfolio optimization. It reflects the magnitude of price fluctuations in an asset, providing insight into the underlying risk and potential for price swings.
Let’s break down the theoretical foundations behind the three approaches we’ll explore: GARCH, Machine Learning models, and Artificial Neural Networks (ANNs).
Implied Volatility vs. Historical Volatility
Before diving into the models, it’s essential to distinguish between two key types of volatility: Implied Volatility (IV) and Historical Volatility (HV).
Implied Volatility (IV) is forward-looking and derived from option prices. It represents the market’s expectations of future volatility.
Historical Volatility (HV), on the other hand, is backward-looking, based on the asset’s past price movements. It measures the actual volatility observed over a specific time period, typically using standard deviation.
Mathematically, historical volatility over a period of T returns can be expressed as:

σ = √( (1 / (T − 1)) · Σₜ₌₁ᵀ (rₜ − r̄)² )

Where:
rₜ is the asset’s return at time t,
r̄ is the mean return over the period,
T is the total number of periods.
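To make this concrete, here is a minimal sketch of how annualized historical volatility could be computed in Python. The prices Series and the 252-trading-day annualization factor are assumptions for illustration only; ORATS already supplies precomputed HV fields, which is what we rely on later in this article.

import numpy as np
import pandas as pd

def historical_volatility(prices: pd.Series, window: int = 30) -> pd.Series:
    """Annualized historical volatility from daily closing prices."""
    # Daily log returns: r_t = ln(P_t / P_{t-1})
    log_returns = np.log(prices / prices.shift(1))
    # Sample standard deviation over a rolling window, annualized with sqrt(252)
    return log_returns.rolling(window).std() * np.sqrt(252)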
Implied Volatility is not calculated directly from historical returns but is instead inferred from the prices of options using pricing models like the Black-Scholes formula. Since implied volatility reflects market expectations, it’s a valuable predictor in many volatility forecasting models.
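For intuition, the sketch below numerically inverts the Black-Scholes call price to back out an implied volatility from a single quoted option price. All inputs (option price, strike, maturity, rate) are made-up illustrative values; in practice we simply use the IV fields that ORATS provides.

import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def bs_call_price(S, K, T, r, sigma):
    # Black-Scholes price of a European call option
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

def implied_volatility(market_price, S, K, T, r):
    # Find the sigma at which the model price matches the observed option price
    return brentq(lambda sigma: bs_call_price(S, K, T, r, sigma) - market_price, 1e-6, 5.0)

# Illustrative, made-up inputs: a 30-day at-the-money call on a $150 stock
print(implied_volatility(market_price=4.50, S=150, K=150, T=30 / 365, r=0.05))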
Volatility Clustering and Fat Tails
A key characteristic of financial markets is volatility clustering. Periods of high volatility tend to be followed by further high volatility, while calm periods tend to be followed by continued calm. This clustering is observed in many asset classes and is a cornerstone of volatility models like GARCH. Additionally, asset returns exhibit fat tails, meaning extreme price movements are more likely than predicted by a normal distribution.
GARCH (Generalized Autoregressive Conditional Heteroskedasticity)
The GARCH model is one of the most widely used econometric models for volatility prediction. It extends the ARCH (Autoregressive Conditional Heteroskedasticity) model by adding lagged conditional variances to the lagged squared returns that ARCH already uses. In GARCH, volatility depends on past squared returns (representing shocks) and past variances, which makes it particularly effective at capturing volatility clustering.
GARCH Formula:
The standard GARCH(1,1) model can be described with the following equations:

rₜ = μ + ϵₜ
ϵₜ = σₜ · zₜ
σₜ² = α₀ + α₁ · ϵₜ₋₁² + β₁ · σₜ₋₁²

Where:
rₜ is the asset return at time t,
μ is the mean return,
ϵₜ is the error term (shock),
σₜ² is the conditional variance (volatility),
zₜ is a standard normal random variable,
α₀ is the constant variance term,
α₁ and β₁ are parameters capturing past shocks and past volatility, respectively.
Explanation:
The GARCH model predicts that today’s volatility (σₜ²) depends on the previous period’s volatility and the squared residual from the previous period, meaning it combines both the memory of past volatility and the impact of recent market shocks.
The GARCH model is favored for its ability to model volatility clustering and handle leptokurtic distributions (those with fat tails). However, it has limitations when it comes to modeling non-linear relationships and sudden regime shifts, which is where ML models often perform better.
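As a quick illustration of what a GARCH(1,1) fit looks like in practice, here is a minimal sketch using the arch package. The return series below is a synthetic placeholder; fitting the model to real market returns is the subject of Part 2.

import numpy as np
import pandas as pd
from arch import arch_model

# Placeholder return series purely for illustration; in practice this would be
# the asset's daily returns (often scaled by 100 for numerical stability).
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 1, 1000))

model = arch_model(returns, mean='Constant', vol='GARCH', p=1, q=1, dist='normal')
result = model.fit(disp='off')

# Estimated mu, omega (alpha_0), alpha[1] and beta[1]
print(result.params)

# 5-step-ahead conditional variance forecast
forecast = result.forecast(horizon=5)
print(forecast.variance.iloc[-1])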
Machine Learning Models for Volatility Prediction
Unlike GARCH, Machine Learning (ML) models don’t rely on specific assumptions about market behavior. Instead, they leverage historical data to learn patterns and predict future volatility. ML models like Random Forest, XGBoost, and Support Vector Machines (SVMs) are particularly useful when volatility depends on complex, non-linear relationships between multiple variables.
Key Concepts in ML for Volatility:
Non-Linearity: Many financial markets exhibit relationships that are non-linear. ML models can capture these complex dependencies without requiring explicit functional forms, unlike GARCH, which imposes a linear structure on the conditional variance in terms of past squared residuals and past variances.
Feature Engineering: In ML models, feature engineering is critical. Features such as lagged returns, volatility, stock price changes, and macroeconomic variables provide the model with a diverse set of inputs to make accurate predictions.
Random Forest:
Random Forest is an ensemble method that creates a collection of decision trees, each trained on a random subset of the data. It averages the predictions of each tree to provide a more robust forecast.
XGBoost:
XGBoost goes further, using gradient boosting to iteratively improve model accuracy by focusing on the instances the model predicted poorly in previous iterations.
Support Vector Regression (SVR):
SVR is a powerful algorithm that uses kernel functions to map input data into higher-dimensional spaces, allowing it to capture complex patterns in volatility that linear models cannot.
Each of these ML models has its strengths and weaknesses. Random Forest is great at handling a large number of features, XGBoost is optimized for speed and accuracy, and SVR is powerful in finding non-linear relationships in the data.
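To show what this looks like in code, here is a hedged sketch of fitting these regressors with scikit-learn and xgboost. The column names mirror the fields we extract later in this article, and the target (the current 10-day HV predicted from the previous day's readings) is just one reasonable setup, not the definitive one used in Part 2.

from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor  # pip install xgboost

# `data` is assumed to be the preprocessed DataFrame built later in this article.
features = data[['hv_10d', 'hv_30d', 'iv_10d', 'iv_30d']].shift(1).dropna()
target = data['hv_10d'].loc[features.index]

# shuffle=False keeps the chronological order, which matters for time series
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, shuffle=False)

models = {
    'random_forest': RandomForestRegressor(n_estimators=300, random_state=42),
    'xgboost': XGBRegressor(n_estimators=300, learning_rate=0.05),
    'svr': SVR(kernel='rbf', C=1.0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))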
Artificial Neural Networks (ANNs) for Volatility Prediction
Artificial Neural Networks (ANNs), particularly Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) models, have gained significant popularity in financial time-series forecasting. These models are designed to capture temporal dependencies, making them ideal for volatility prediction where market movements over time are interconnected.
ANN Architecture:
ANNs are inspired by the structure of the human brain, consisting of neurons (nodes) that process inputs and generate outputs. Layers of neurons are connected, and the strength of these connections (weights) is learned during training. An ANN can model complex relationships through its multiple hidden layers.
LSTMs, a type of RNN, are particularly effective in volatility prediction due to their ability to retain information over long time horizons. Unlike traditional RNNs, LSTMs mitigate the vanishing gradient problem, allowing them to learn long-term dependencies in time-series data.
LSTM Equations:
The inner workings of an LSTM can be explained by the following set of equations that govern the forget gate, input gate, and output gate:
Forget Gate:
fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)

Input Gate:
iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)
C̃ₜ = tanh(W_C · [hₜ₋₁, xₜ] + b_C)
Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ

Output Gate:
oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)
hₜ = oₜ ⊙ tanh(Cₜ)

Where:
fₜ, iₜ, and oₜ are the forget, input, and output gates,
xₜ is the input at time t and C̃ₜ is the candidate cell state,
W and b are the learned weight matrices and bias vectors for each gate,
σ(·) is the sigmoid activation function and ⊙ denotes element-wise multiplication,
Cₜ is the cell state, and hₜ is the hidden state.
Explanation:
LSTMs use gates to regulate the flow of information, allowing them to remember or forget past information as necessary. This makes LSTMs particularly useful in predicting time-series data where past volatility patterns significantly affect future outcomes.
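As a concrete and deliberately minimal sketch, the Keras snippet below wires a single LSTM layer to a dense output for a one-step-ahead volatility estimate. The input shapes, hyperparameters, and random placeholder data are illustrative only; the real model, trained on our engineered features, comes in Part 2.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

# Placeholder data: 500 samples, each a 10-step window of 4 features
lookback, n_features = 10, 4
X = np.random.rand(500, lookback, n_features)
y = np.random.rand(500)

model = Sequential([
    Input(shape=(lookback, n_features)),
    LSTM(32),   # the forget, input, and output gates live inside this layer
    Dense(1)    # next-period volatility estimate
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, batch_size=32, verbose=0)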
Each of the models — GARCH, ML models, and ANNs — has its own strengths and weaknesses. GARCH excels at capturing volatility clustering, but it assumes linear relationships, making it less flexible in dynamic markets. ML models like Random Forest and XGBoost provide more flexibility, handling non-linear relationships and learning from complex datasets. ANNs, especially LSTMs, are powerful in handling long-term dependencies in time-series data, capturing patterns that traditional models might miss.
Data Extraction
To build and compare our volatility prediction models, we need to gather relevant data from ORATS. The focus here is on extracting key metrics such as historical volatility (HV), implied volatility (IV), and stock price data. These metrics form the foundation for modeling volatility, helping us understand both past price movements and market expectations for future fluctuations.
For this purpose, we use ORATS’ Core API endpoint, which simplifies the data extraction process by allowing us to retrieve all necessary fields in a single API call. By specifying the required volatility measures — such as 10-day, 30-day, 60-day, and 90-day historical and implied volatilities — we ensure that we capture a broad view of short- to medium-term market dynamics. The pxAtmIv field provides the underlying stock price used in ORATS’ at-the-money implied volatility calculations, further enriching our dataset with crucial price information.
import requests
import pandas as pd
from datetime import datetime, timedelta
import time

def generate_dates(start_date, end_date):
    """Return a list of 'YYYY-MM-DD' strings from start_date to end_date, inclusive."""
    date_list = []
    current_date = start_date
    while current_date <= end_date:
        date_list.append(current_date.strftime('%Y-%m-%d'))
        current_date += timedelta(days=1)
    return date_list

api_token = 'YOUR ORATS API KEY'
ticker = 'AAPL'
start_date = datetime(2021, 1, 1)
end_date = datetime(2023, 1, 1)

dates = generate_dates(start_date, end_date)
data = pd.DataFrame()
fields = "orHv10d,orHv30d,orHv60d,orHv90d,iv10d,iv30d,iv60d,iv90d,pxAtmIv"

# DATA EXTRACTION
for date in dates:
    try:
        url = f"https://api.orats.io/datav2/hist/cores?token={api_token}&ticker={ticker}&tradeDate={date}&fields={fields}"
        response = requests.get(url).json()
        if 'data' in response:  # non-trading days return no data
            date_df = pd.DataFrame(response['data'])
            date_df['date'] = date
            data = pd.concat([data, date_df], ignore_index=True)
        time.sleep(0.1)  # small pause between requests to respect the API
    except Exception as e:
        print(f"Error fetching data for {date}: {e}")

data.to_csv('volatility_data.csv', index=False)
The data extraction process starts by defining the date range for our analysis, from January 1, 2021, to January 1, 2023. Using a function called generate_dates(), we generate a list of daily dates within this range. For each date, we make an API request to the Core API endpoint to retrieve the specified fields, including various measures of historical and implied volatility as well as the stock price.
Each API call retrieves the volatility and stock price data for a specific date. The response is parsed into a Pandas DataFrame, and each daily DataFrame is appended to a master DataFrame named data. To prevent overwhelming the API with too many requests, we include a small delay (time.sleep(0.1)) between each API call.
The final result is saved as a CSV file, ‘volatility_data.csv’, which contains the daily data for all specified volatility metrics and stock prices over the chosen date range.
Data Preprocessing
After extracting the raw data, the next step is to prepare it for building our volatility prediction models. Data preprocessing ensures that the data is clean, well-structured, and ready for analysis. This involves renaming columns for better readability, handling missing values, and formatting the data index to ensure consistency.
# DATA PREPROCESSING
# Renaming columns
data = data.rename(columns={
    'orHv10d': 'hv_10d',
    'orHv30d': 'hv_30d',
    'orHv60d': 'hv_60d',
    'orHv90d': 'hv_90d',
    'iv10d': 'iv_10d',
    'iv30d': 'iv_30d',
    'iv60d': 'iv_60d',
    'iv90d': 'iv_90d',
    'pxAtmIv': 'stock_price'
})

# Dropping null values
data = data.dropna()

# Index formatting
data = data.set_index('date')
data.index = pd.to_datetime(data.index)

data.head()
The preprocessing process begins with renaming columns to make them more intuitive. For example, orHv10d becomes hv_10d, indicating 10-day historical volatility, and iv10d becomes iv_10d, representing 10-day implied volatility. This makes the data easier to interpret.
Next, we handle missing values by dropping rows with null entries. This step is essential to maintain the continuity of time-series data, ensuring that the models are not affected by gaps in the data.
Finally, we set date as the index and convert it to datetime format. This is crucial for time-series analysis, enabling efficient calculations like moving averages and time-based shifts.
This is what the output of data.head() looks like:
This cleaned and structured DataFrame will serve as the foundation for the feature engineering phase, which we’ll explore next.
Feature Engineering
A Gentle Intro to Feature Engineering
Feature engineering is one of the most crucial aspects of building robust machine learning models. It involves transforming raw data into meaningful features that better represent the problem at hand, thereby improving the model’s performance. The idea is to create new variables, or features, from the existing dataset that can reveal hidden patterns or relationships. By carefully crafting these features, we can help our AI models extract more insights from the data.
In the context of financial modeling, where the data can be highly complex and noisy, feature engineering becomes even more important. Whether it’s creating ratios, moving averages, or volatility spreads, these engineered features allow models to pick up nuances in market behavior that the raw data might not capture. With the right features, our models will be better equipped to forecast volatility in the next phase of this series.
Implementing Feature Engineering
In this section, we transform our dataset to create new features that will help our models capture complex relationships between historical and implied volatility measures. These features include volatility ratios, spreads, moving averages, and other relevant indicators.
# FEATURE ENGINEERING
import numpy as np

# 1. Volatility Ratios
data['hv_ratio_10_60'] = data['hv_10d'] / data['hv_60d']
data['iv_ratio_30_60'] = data['iv_30d'] / data['iv_60d']

# 2. Volatility Spread
data['hv_iv_spread_30'] = data['hv_60d'] - data['iv_30d']

# 3. Moving Averages for Historical and Implied Volatility
data['hv_rolling_mean_10'] = data['hv_10d'].rolling(window=5).mean()
data['iv_rolling_mean_30'] = data['iv_30d'].rolling(window=5).mean()

# 4. Price Features: Log returns of stock prices
data['log_return'] = np.log(data['stock_price'] / data['stock_price'].shift(1))

# 5. Implied Volatility (IV) Term Structure Slope
data['iv_term_structure'] = data['iv_90d'] - data['iv_30d']

# 6. Standard deviation of the stock's log returns (window of 5 days)
data['price_volatility'] = data['log_return'].rolling(window=5).std()

data = data.dropna()
data.to_csv('volatility_data_engr.csv')
data.tail()
In this feature engineering process, we derive new variables to enhance our dataset’s predictive power.
1. Volatility Ratio & Spread:
We start with volatility ratios like hv_ratio_10_60 and iv_ratio_30_60, comparing short-term and medium-term volatility. These ratios help capture shifts in market sentiment across different time horizons. The volatility spread, represented by hv_iv_spread_30, measures the difference between 60-day historical volatility and 30-day implied volatility, revealing discrepancies between actual market behavior and market expectations.
2. Moving Average & Log Returns
We then compute moving averages for historical and implied volatility using a 5-day rolling window to smooth out short-term fluctuations, making trends clearer. The features hv_rolling_mean_10 and iv_rolling_mean_30 thus give a smoothed view of the prevailing volatility trend. Additionally, log returns (log_return) of stock prices measure approximate daily percentage changes, a standard metric in financial analysis for understanding price movements over time.
3. IV term structure slope & Price Volatility
Other features include the IV term structure slope (iv_term_structure), which measures the difference between 90-day and 30-day implied volatility, capturing changes in market expectations over time. We also calculate price volatility as the rolling standard deviation of log returns over 5 days, providing a dynamic measure of price fluctuations.
After generating these features, we clean the dataset by removing rows with missing values, ensuring the data is ready for the model-building phase in Part 2.
Conclusion
In this first part of our two-part series on volatility prediction, we laid the groundwork for understanding the theoretical aspects of volatility forecasting and meticulously prepared our dataset. By exploring the concepts behind GARCH, Machine Learning models, and ANNs, we set the stage for a robust comparison of their capabilities.
Using ORATS’ Core API endpoint, we efficiently gathered essential data, including various measures of historical and implied volatility, and through feature engineering, we created variables that can better capture the nuances of market behavior.
With the data ready and the features in place, we’re now equipped to build, train, and evaluate the predictive models. Stay tuned for Part 2, where we will delve into the practical implementation of GARCH, ML models, and ANNs, assess their performance on real market data, and determine which model proves most effective in predicting volatility.