Nikhil Adithyan

Forecasting Trade Volume using AI & ML in Python

A deep dive into the intricacies of predictive analysis



Introduction: Forecasting using AI & ML

Data-driven trading has garnered significant interest over the past several years. As our understanding of statistical learning and programming grows, it becomes increasingly feasible and accessible to leverage data in developing sophisticated trading strategies.


We will be utilizing tick data from the US stock market via an API provided by the financial data provider EODHD. Their data includes both current and delisted companies, covering approximately 80,000 tickers and over 16,000 securities, with updates provided daily.


Tick data refers to a type of financial market data that records every individual transaction (trade) executed for a specific financial instrument, such as a stock, commodity, or currency pair.


We will start by learning how to make API calls and construct a dataset. Then, we will analyze our dataset and potentially make a forecast based on our observations.



Constructing our Dataset

Calling the EODHD API follows the same basic steps as any other REST API; however, there are a few details to keep in mind. An example URL is as follows:
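https://eodhd.com/api/ticks/?s=AAPL&from={start_ts}&to={end_ts}&limit=10000&api_token={api_token}&fmt=json

(This mirrors the request we assemble in the code below; the placeholders in curly braces are filled in with your UNIX timestamps and API key, and the ticker after s= can be swapped for any supported symbol.)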



The important parameters to keep in mind here are:


  1. from and to: The starting and ending dates of the data we request. Both parameters are UNIX timestamps in the UTC timezone, so we can use online converters or a short piece of code to find the human-readable equivalent. For our purposes, it is important to remember that the values may not exceed 10 digits (i.e., they are expressed in seconds, not milliseconds).


  2. limit: The number of records we wish to pull per call. This value caps out at 10,000.


  3. api_token: Your secret EODHD API key, which you can obtain by creating an EODHD developer account.


With this understanding, we can write a simple loop that pages through the entire timeframe (note that this code assumes an upgraded API plan):



import requests
import pandas as pd

start_ts = 1722499200 # August 1st, 2024, 8:00:00 UTC
end_ts = 1722585600 # August 2nd, 2024, 8:00:00 UTC
key =  'YOUR EODHD API KEY'
combined_df = pd.DataFrame()

while True:
    # Request up to 10,000 ticks for AAPL starting at the current start_ts
    url = f'https://eodhd.com/api/ticks/?s=AAPL&from={start_ts}&to={end_ts}&limit=10000&api_token={key}&fmt=json'
    data = requests.get(url).json()

    if not data:
        print("No more data available.")
        break

    df = pd.DataFrame(data)
    combined_df = pd.concat([combined_df, df], ignore_index=True)

    # 'ts' is in milliseconds; divide by 1000 to compare against end_ts (seconds)
    last_ts = df['ts'].iloc[-1]

    if last_ts // 1000 >= end_ts:
        print("Reached end_ts.")
        break

    # Start the next request from the timestamp of the last tick received
    start_ts = last_ts // 1000
    print(f"New start_ts: {start_ts}")

combined_df

After letting this run, we are left with the following dataset (csv also attached at the bottom):


We see that we have 8 columns and almost 900,000 rows, which amounts to roughly 7 million individual data points. The API documentation explains what each column represents:


  • mkt — market where trade took place

  • price — the price of the transaction

  • seq — trade sequence number

  • shares — shares in transaction

  • sl — sales condition

  • ts — timestamp (in milliseconds)

  • sub_mkt — sub-market where trade took place

  • ex — the exchange ticker


Now that we have constructed our dataset, we can begin cleaning and making analytical observations.


Data Cleaning and Analysis

Upon inspecting the dataset, one issue I immediately noticed was that it seemed to start looping toward the end, most likely because we hit the end of the requested time frame. We can remove this simply by finding the first instance of the repeated timestamp and deleting all the rows that follow it:



specific_ts = 1722556797393  # first timestamp at which the data begins to repeat

# Find the index of the first occurrence of the specific timestamp
index_of_first_occurrence = combined_df[combined_df['ts'] == specific_ts].index[0]

# Keep every row up to and including that first occurrence
filtered_df = combined_df.loc[:index_of_first_occurrence].copy()
filtered_df

This shaved roughly 200 rows from our data frame, rows which otherwise could have negatively impacted the performance of our later analysis.


We also notice that our ‘ts’ column, being raw millisecond timestamps, is not exactly helpful for telling us when observations occur. We can add readable time features with pandas' datetime functionality:



filtered_df['datetime'] = pd.to_datetime(filtered_df['ts'], unit='ms')
filtered_df['month'] = filtered_df['datetime'].dt.month
filtered_df['day_of_week'] = filtered_df['datetime'].dt.dayofweek
filtered_df['hour'] = filtered_df['datetime'].dt.hour
filtered_df['minute'] = filtered_df['datetime'].dt.minute
filtered_df['second'] = filtered_df['datetime'].dt.second
filtered_df['millisecond'] = filtered_df['datetime'].dt.microsecond // 1000

filtered_df.head()

Running this gives us a much more readable dataframe:



Next, we will scan our dataset for any null values. We can achieve this with a Seaborn heatmap:



import seaborn as sns
import matplotlib.pyplot as plt

null_values = filtered_df.isnull()
plt.figure(figsize=(12, 8))
sns.heatmap(null_values, cbar=False, cmap='viridis')
plt.title('Heatmap of Null Values in the Dataset')
plt.xlabel('Columns')
plt.ylabel('Rows')
plt.show()

We notice that the ‘sub_mkt’ column is primarily composed of null values, essentially making it worthless in the context of our analysis. We can thus drop this column.
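A minimal sketch of that drop is shown below. Note that in the pipeline as written, the column is actually removed later during feature selection, so if you drop it here as well, remove 'sub_mkt' from that later drop list (or pass errors='ignore' to it).

# Drop the mostly-null 'sub_mkt' column, which carries no usable information
filtered_df = filtered_df.drop(columns=['sub_mkt'])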


Now we can begin investigating each of our features more deeply. Stock price is notoriously difficult to forecast; trading volume, however, is far more manageable and has utility in its own right. Traders often use volume to confirm trends and gauge momentum, and many technical indicators and strategies incorporate it. For example, the volume-weighted average price (VWAP) is a common technical indicator that weights price by volume to provide a more representative average price.
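As a quick aside (not part of the forecasting pipeline), VWAP over our tick data can be computed directly from the price and shares columns; a minimal sketch:

# Volume-weighted average price: each trade's price weighted by its share count
vwap = (filtered_df['price'] * filtered_df['shares']).sum() / filtered_df['shares'].sum()
print(f"VWAP over the sampled period: {vwap:.2f}")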


Plotting our shares column, we get the following graph:



Notice there is a single outlier that makes it far harder to get a good look at the rest of the data. We can remove it by applying a threshold:



threshold = 100000 

filtered_df = filtered_df[filtered_df['shares'] <= threshold]
plt.figure(figsize=(14, 8))
plt.plot(filtered_df.index, filtered_df['shares'], color='tab:red')
plt.xlabel('Time')
plt.ylabel('Shares')
plt.title('Shares Over Time (Outliers Removed)')
plt.grid(True)
plt.show()

Running this gives us:



This plot is far more readable and gives us a better look at what exactly we are trying to forecast.


Modeling

Given the size of this dataset, it is important to select a model that can handle its complexity. A random forest fits the bill, given its flexibility and its resistance to letting a handful of features dominate the others.


To apply this model, we first do some feature selection, removing columns that would simply act as extra noise:



forecast_df = filtered_df.drop(['ex','mkt','ts','seq','sub_mkt','sl','price','month','day_of_week'],axis=1)

forecast_df

Features like the market, price, and timestamp will not be necessary given that we want a volume forecast. On the other hand, features like hour, minute, second, and millisecond describe exactly when each value in the shares column occurred, so they provide value for our forecast.



With our new data frame, we can apply our model after creating a train-test split:



from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

split_index = int(len(forecast_df) * 0.90)
train_data = forecast_df.iloc[:split_index]
test_data = forecast_df.iloc[split_index:]
X_train = train_data.drop('shares', axis=1)
y_train = train_data['shares']
X_test = test_data.drop('shares', axis=1)
y_test = test_data['shares']
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

Notice some key aspects of this code:


  1. Our train-test split is 90–10. Though the test portion may seem small, it actually covers around 4 hours, a substantial period given that we are handling tick data.


  2. We do not shuffle the data when constructing the split. Though time series problems resemble regression problems in many ways, preserving chronological order is imperative to prevent data leakage.


Given the size of the dataset, training takes a fairly long time, but it does eventually finish. Running some evaluations on the fitted model, we get the following results:
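The exact evaluation code isn't shown in the article, but a minimal sketch using the metrics imported earlier (plus a predicted-vs-actual scatter along the lines of the plot discussed below) might look like this:

# Quantitative evaluation of the random forest predictions
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R^2: {r2:.3f}")

# Predicted vs. actual share counts
plt.figure(figsize=(8, 8))
plt.scatter(y_test, y_pred, alpha=0.3)
plt.xlabel('Actual Shares')
plt.ylabel('Predicted Shares')
plt.title('Random Forest: Predicted vs. Actual Shares')
plt.show()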




Looking at this plot, we notice that our model performs well at low share values; however, as the share count increases, the model begins to struggle and misestimates many points. The same pattern shows up in the line plot:



To improve this model, we could look into methods for mitigating the outliers, as well as construct more informative features such as rolling means and standard deviations.
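For instance, rolling statistics over the preceding ticks could be added as extra predictors. A sketch is below; the 50-tick window is an arbitrary assumption, and the shift ensures each row only sees past information:

# Hypothetical rolling features over the previous 50 ticks' share counts
window = 50
forecast_df['rolling_mean'] = forecast_df['shares'].shift(1).rolling(window).mean()
forecast_df['rolling_std'] = forecast_df['shares'].shift(1).rolling(window).std()

# Drop the first rows, which have no complete window behind them
forecast_df = forecast_df.dropna()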


Neural Network Approach

Given the complexity and size of our data, it is not unreasonable to attempt a deep-learning approach to this problem. LSTMs (long short-term memory networks) are a common high-parameter approach to time series problems. Their main benefit over plain recurrent neural networks (RNNs) is that they preserve information over long spans inside memory cells, which the network draws on as it processes the sequence (hence the “long-term” in the name).


Similarly to the previous approach, we will create a train-test split, then build and fit the model, and finally evaluate it:
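The scaling and reshaping that produce X_train_reshaped and y_train_scaled are not shown in the original snippet. One plausible version, assuming Keras (TensorFlow) and a MinMaxScaler with a single timestep per sample, is:

from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

# Scale features and target to [0, 1], fitting the scalers on the training set only
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()
X_train_scaled = x_scaler.fit_transform(X_train)
X_test_scaled = x_scaler.transform(X_test)
y_train_scaled = y_scaler.fit_transform(y_train.values.reshape(-1, 1))
y_test_scaled = y_scaler.transform(y_test.values.reshape(-1, 1))

# Reshape to (samples, timesteps, features) with a single timestep per sample
X_train_reshaped = X_train_scaled.reshape(X_train_scaled.shape[0], 1, X_train_scaled.shape[1])
X_test_reshaped = X_test_scaled.reshape(X_test_scaled.shape[0], 1, X_test_scaled.shape[1])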



model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2])))
model.add(Dropout(0.2))
model.add(LSTM(units=50))
model.add(Dropout(0.2))
model.add(Dense(1))  

model.compile(optimizer='adam', loss='mean_absolute_error')
model.fit(X_train_reshaped, y_train_scaled, epochs=10, batch_size=32, validation_data=(X_test_reshaped, y_test_scaled), verbose=1)

Some important components of this code:


  1. Dropout is a regularization technique for preventing overfitting; it works by randomly “turning off” a fraction of the neurons during training.

  2. Adam is the optimization algorithm the model uses to update its parameters. It is a widely used refinement of gradient descent; plain stochastic gradient descent (SGD) is another common choice.

  3. Epochs are the number of times the entire dataset is passed through the model; too many epochs can lead to overfitting.


After allowing this to train for roughly 10 minutes, we land on the evaluation:
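The evaluation code is again not shown; a sketch that maps the scaled predictions back to share counts (using the hypothetical y_scaler from the preprocessing sketch above) and computes MAE could be:

# Predict on the test set and undo the target scaling
y_pred_scaled = model.predict(X_test_reshaped)
y_pred_lstm = y_scaler.inverse_transform(y_pred_scaled).flatten()

# Compare against the unscaled ground truth
lstm_mae = mean_absolute_error(y_test, y_pred_lstm)
print(f"LSTM MAE: {lstm_mae:.2f}")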




Similarly to the random forest, our LSTM was strong at capturing the general trend; however, it struggled immensely with capturing variability and higher values. Overall, the model performed slightly worse than the random forest, likely due to its inability to capture the more varied movements (its highest prediction was 75 shares).


Interestingly, the LSTM also never predicted a value of exactly 0, a characteristic not shared by the random forest model.

This illustrates why more complex models do not always outperform their simpler alternatives. Despite having a massive number of parameters, the LSTM was still unable to pick up on patterns that the random forest took into account.


Conclusion: Extending our Research to Practical Settings

How can we take our observations from exploring this dataset and apply them to a real-world problem?


Remember that our dataset, despite containing millions of observations, actually only spanned a single trading day (August 1, 8:00:00 UTC to roughly midnight UTC). Since the EODHD API is updated extremely frequently, we can gather fresh data and retrain our model daily, allowing for more accurate predictions. Over a longer span, we may even be able to observe some long-term patterns, perhaps suggesting new trading strategies to try.


As more data is captured, the potential to build a model that makes accurate and consistent forecasts grows, and educated, data-driven decision-making for trading becomes more viable.


With that being said, you’ve reached the end of the article. Hope you learned something new and useful. Thank you for your time.


Disclaimer: While we explore the exciting world of trading in this article, it’s crucial to note that the information provided is for educational purposes only. I’m not a financial advisor, and the content here doesn’t constitute financial advice. Always do your research and consider consulting with a professional before making any investment decisions.


