A hands-on guide to building a reinforcement learning trading model with Python and APIs
Today, we’re delving into real-time trading using Q-learning, a model-free reinforcement learning algorithm. This approach empowers an agent to discern optimal actions within a given environment by iteratively adjusting Q-values in response to observed rewards.
To facilitate this exploration, we’ll leverage Intrinio’s Nasdaq Basic API, a valuable resource providing real-time data for our analysis and decision-making processes in the dynamic realm of financial markets.
Setting up an Account & Acquiring Data
Before we start coding in Python, you’ll need an Intrinio account and access to the Nasdaq Basic API endpoints. To create an account, head over to Intrinio’s homepage and select the Sign Up button in the top-right corner, which takes you to the user registration page.
After creating an account, you can purchase the API endpoints from this product page, which contains all the details about the real-time stock data that comes with the API: https://intrinio.com/financial-market-data/nasdaq-basic
Now that you have an Intrinio account, which gives you your own API key (vital for extracting data), and access to the Nasdaq Basic API for real-time stock data, we can proceed to coding in Python and get our hands dirty.
Importing Packages & Extracting Data
Let’s start by importing the necessary libraries for our analysis. These libraries will provide the basic tools required to explore and implement our trading strategy. To extract the data, we’ll use Intrinio’s Python SDK, making the process straightforward and efficient.
But before moving further, let’s gain some background about Intrinio’s Nasdaq Basic API. The Nasdaq Basic API, facilitated by Intrinio, allows for efficient extraction of financial data in just a few lines of code. Its user-friendly interface is flexible, letting us tailor its use to our specific needs. Additionally, the API provides diverse and accurate information, offering a wide array of options for analysis and application.
The following code imports the necessary packages and extracts the data of Apple stock:
!pip install intrinio-sdk
import intrinio_sdk as intrinio
import numpy as np
import pandas as pd
import time
intrinio.ApiClient().set_api_key('YOUR API KEY') # setting up the API key
# NOTE: Replace 'YOUR API KEY' with your actual API key
identifier = 'AAPL' # Setting up the identifier
source = 'nasdaq_basic' # Setting up the source
response = intrinio.SecurityApi().get_security_realtime_price(identifier, source=source) # Extracting the data
print(response)
To retrieve the data, start by installing the Intrinio Python SDK with a straightforward pip command. After installation, set your API key for authorization. Next, input the identifier (e.g., AAPL) and the data source (in this case, nasdaq_basic). With just one line of code, you’ll receive the desired response. It’s a streamlined process, making data extraction quick and simple.
The output will look like this:
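The printed response is a full object, and the exact values will depend on when you run the query. If you only need a handful of fields, you can read them off the response as attributes. Here is a quick sketch using the same attributes we rely on later in this article:

# Quick sketch: pull out the individual fields we use later in this article
print(response.open_price)   # today's open
print(response.high_price)   # intraday high
print(response.low_price)    # intraday low
print(response.last_price)   # latest traded price
print(response.last_size)    # size of the latest trade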
Now, let’s step back to focus on understanding the Q-learning algorithm. After that, we’ll return to our data extraction process and identify the key features we need from the data.
Q-Learning
Reinforcement Learning (RL) is a subfield of machine learning dedicated to training agents to make decisions through iterative interactions with an environment. In the RL paradigm, agents navigate an environment by taking actions and subsequently receiving feedback in the form of rewards or penalties. This feedback guides the agent’s learning process, enabling it to develop a strategy that aims to maximize the cumulative reward over time. Q-learning is a specific type of reinforcement learning method that plays a crucial role in this process, helping agents optimize their decision-making strategies based on the environmental feedback received during the exploration and exploitation phases.
Q-learning stands out as a model-free reinforcement learning algorithm designed to empower an agent in learning optimal actions within a given environment. Through an iterative process, Q-learning updates Q-values, which serve as representations of the anticipated cumulative rewards linked to taking a particular action in a specific state. This method enables the agent to refine its decision-making strategy over time, honing in on the actions that yield the most favorable outcomes based on the learned Q-values. Now, let’s break down the key components of Q-learning.
Q-Table:
At the core of Q-learning, the Q-table assumes a pivotal role. This matrix features rows corresponding to diverse actions and columns corresponding to various states or environmental features. The Q-values within this table serve as the agent’s estimates for the expected cumulative rewards associated with taking a specific action in a particular state.
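To make this concrete, here is a minimal sketch of what such a Q-table looks like as a NumPy array, using the same layout we adopt later in the article: three actions (buy, sell, hold) as rows and five price-based features as columns.

import numpy as np

# Minimal sketch: a Q-table with 3 actions (buy, sell, hold) as rows
# and 5 state features (open, high, low, close, volume) as columns
num_actions, num_features = 3, 5
q_table = np.zeros((num_actions, num_features))
print(q_table.shape)  # (3, 5)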
Exploration vs. Exploitation:
An inherent challenge in reinforcement learning lies in striking a balance between exploration and exploitation. The agent must explore novel actions to uncover potentially superior strategies while exploiting familiar actions to maximize immediate rewards. The exploration probability plays a pivotal role in mitigating this challenge. At each decision point, the agent faces a choice: either delve into unexplored actions with a certain probability or leverage its existing knowledge by selecting the action linked to the highest Q-value. This probability parameter allows the agent to balance between curiosity-driven exploration and knowledge-driven exploitation intelligently.
When the exploration probability is high, the agent is more likely to opt for random actions, fostering a broad exploration of the environment. This is crucial for discovering potentially superior strategies or navigating unfamiliar scenarios. On the other hand, when the exploration probability is low, the agent tends to exploit its existing knowledge, focusing on actions that have historically yielded higher rewards.
Effectively managing the exploration-exploitation trade-off is essential for the long-term success of reinforcement learning algorithms. Q-learning, with its incorporation of a dynamic exploration probability, exemplifies a strategy that enables agents to learn and adapt to their environments over time, achieving a fine-tuned balance between exploration and exploitation.
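As a quick illustration (a toy sketch with hypothetical Q-values, separate from the trading code), here is how an exploration probability of 0.2 splits decisions between random exploration and greedy exploitation over many steps:

import numpy as np

# Toy sketch: count how often an epsilon-greedy agent explores vs. exploits
exploration_prob = 0.2
q_values = np.array([0.1, 0.5, 0.3])  # hypothetical Q-values for 3 actions
explore_count = exploit_count = 0
for _ in range(10_000):
    if np.random.uniform(0, 1) < exploration_prob:
        action = np.random.choice(len(q_values))  # explore: random action
        explore_count += 1
    else:
        action = np.argmax(q_values)              # exploit: best known action
        exploit_count += 1
print(explore_count, exploit_count)  # roughly 2000 vs 8000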
Learning Rate (α) and Discount Factor (γ):
The learning rate (α) plays a pivotal role in determining how extensively the agent adjusts its Q-values based on newfound information. A higher learning rate places more emphasis on recent experiences in shaping the agent’s decisions.
Meanwhile, the discount factor (γ) modulates the significance of future rewards. A lower discount factor prompts the agent to prioritize short-term rewards, while a higher discount factor encourages thoughtful consideration of long-term rewards in the decision-making process.
Updating the Q-value:
The process of updating the Q-value function in Q-Learning is a crucial step in reinforcing the learning algorithm. The Q-values represent the expected cumulative rewards for taking a specific action in a particular state. The update is performed using a mathematical equation that accounts for the current Q-value, the immediate reward received, and the maximum Q-value for the next state. This equation iteratively refines the Q-values through the learning process, helping the agent make more informed decisions over time. Essentially, the update aims to balance the immediate rewards with the expected future rewards, guiding the agent to learn optimal strategies for navigating its environment.
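In plain terms, the rule we apply later in take_action is Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max over a' of Q(s', a')). A small numeric sketch (with made-up numbers) shows how a single update moves a Q-value toward the reward signal:

# Numeric sketch of one Q-value update (made-up numbers)
learning_rate = 0.1      # alpha
discount_factor = 0.9    # gamma
current_q = 0.5          # Q(s, a) before the update
reward = 1.0             # immediate reward observed
max_next_q = 0.8         # max over actions of Q(s', a')

new_q = (1 - learning_rate) * current_q + learning_rate * (reward + discount_factor * max_next_q)
print(new_q)  # ~0.622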
Let’s delve deeper into each step of the Q-learning algorithm.
Building the Q-Learning Algorithm
Step-1: Initialization:
At the outset, the Q-table is initialized with zeros. The Q-table is a matrix where each row corresponds to a different action, and each column corresponds to different states or features of the environment. The values in the Q-table represent the agent’s estimations of the expected cumulative rewards for taking a specific action in a specific state. Initialization sets the groundwork for the learning process to begin.
Creating a QLearningTrader class and defining the Q-table:
class QLearningTrader:
    def __init__(self, num_actions, num_features, learning_rate, discount_factor, exploration_prob):
        self.num_actions = num_actions
        self.num_features = num_features
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_prob = exploration_prob
        # Initialize Q-table with zeros
        self.q_table = np.zeros((num_actions, num_features))
        # Initialize state and action
        self.current_state = None
        self.current_action = None
Step-2: Action Selection:
The agent faces a crucial decision in choosing an action. This decision is influenced by the exploration-exploitation trade-off. With a certain probability, the agent may decide to explore new actions, selecting one at random. Alternatively, it may opt to exploit its existing knowledge by choosing the action associated with the highest Q-value. This step reflects the delicate balance between trying out new possibilities and leveraging what the agent has already learned.
Creating a function for choosing an action based on Exploration-Exploitation Trade-Off:
def choose_action(self, state):
    # Exploration-exploitation trade-off
    if np.random.uniform(0, 1) < self.exploration_prob:
        return np.random.choice(self.num_actions)  # Explore
    else:
        feature_index = np.argmax(state)
        return np.argmax(self.q_table[:, feature_index])  # Exploit
Step-3: Observation and Reward:
The chosen action is executed in the environment. The agent then observes the resulting state and receives a reward from the environment. This step represents the interaction between the agent and its surroundings, where the consequences of the chosen action become apparent.
Reward Function:
def calculate_reward(action, current_close, next_close):
    if action == 0:  # Buy
        return 1.0 if next_close > current_close else -1.0
    elif action == 1:  # Sell
        return 1.0 if next_close < current_close else -1.0
    else:  # Hold
        return 1.0 if next_close > current_close else -1.0 if next_close < current_close else 0.0
The ‘calculate_reward’ function determines the reward associated with a specific trading action in a Q-learning framework for real-time trading.
If the action is to buy (action == 0), the function assigns a positive reward of 1.0 if the next closing price is higher, indicating a profitable decision, and -1.0 if it is lower, signaling a loss.
Similarly, for a sell action (action == 1), a positive reward is assigned for a successful short position (next closing price lower) and a negative reward for an unsuccessful one.
In the case of holding (action != 0 and action != 1), the function evaluates whether the next closing price is higher, lower, or equal to the current closing price, assigning rewards of 1.0, -1.0, or 0.0, respectively, reflecting the outcome of holding the position.
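A quick usage example (with made-up prices) shows how the function scores each action:

# Usage sketch with made-up prices: current close 190.0, next close 190.5
print(calculate_reward(0, 190.0, 190.5))  # Buy before a rise  ->  1.0
print(calculate_reward(1, 190.0, 190.5))  # Sell before a rise -> -1.0
print(calculate_reward(2, 190.0, 190.0))  # Hold, price flat   ->  0.0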
Function to extract Real-time Data:
def fetch_real_time_data(identifier):
    source = 'nasdaq_basic'
    response = intrinio.SecurityApi().get_security_realtime_price(identifier, source=source)
    return {
        'open': response.open_price,
        'high': response.high_price,
        'low': response.low_price,
        'close': response.last_price,
        'volume': response.last_size
    }
We will be selecting five key features: 1) open price, 2) high price, 3) low price, 4) last price, and 5) last size (used as the volume feature).
Functions to observe the current real-time state and the next state:
def observe_real_time_data(self, identifier):
    # Fetch real-time data
    real_time_data = fetch_real_time_data(identifier)
    # Extract features from real-time data
    self.current_state = np.array([real_time_data['open'], real_time_data['high'],
                                   real_time_data['low'], real_time_data['close'],
                                   real_time_data['volume']])

def observe_next_state(self, identifier):
    # Update the current state with the observed next state, using the same feature layout
    real_time_data = fetch_real_time_data(identifier)
    self.current_state = np.array([real_time_data['open'], real_time_data['high'],
                                   real_time_data['low'], real_time_data['close'],
                                   real_time_data['volume']])
Step-4: Update Q-Values:
Building on the observed reward, the Q-value for the current state-action pair is updated. The new Q-value is a combination of the agent’s existing knowledge (current Q-value) and the new information gained (reward and the expected future rewards). This step is crucial for the Q-table to adapt and refine its estimations over time. The Q-value is updated according to the following function:
def take_action(self, action, reward):
    # Update Q-table based on the observed reward
    if self.current_action is not None:
        feature_index = np.argmax(self.current_state)
        current_q_value = self.q_table[self.current_action, feature_index]
        new_q_value = (1 - self.learning_rate) * current_q_value + \
                      self.learning_rate * (reward + self.discount_factor * np.max(self.q_table[:, feature_index]))
        self.q_table[self.current_action, feature_index] = new_q_value
    # Update current state and action
    self.current_state = None
    self.current_action = action
Step-5: Iteration:
The agent goes through a repeated process, moving to the next state, making choices, and adjusting Q-values. This loop continues until it reaches a predetermined number of iterations. Once done, the algorithm generates an output based on the current Q-table, showing the learned values for effective decision-making in the environment. We also create a function to calculate profit or loss based on actions and formulate a final function to give suggestions and the resulting profit.
Profit-Loss Function:
def calculate_profit_loss(initial_balance, suggested_action, current_close, next_close, quantity):
    if suggested_action == "Buy":
        return (next_close - current_close) * quantity
    elif suggested_action == "Sell":
        return (current_close - next_close) * quantity
    else:  # Hold
        return 0.0
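For example (made-up prices), buying 10 shares before a 0.5 move up yields 5.0, while selling before the same move loses 5.0:

# Usage sketch with made-up prices and a quantity of 10 shares
print(calculate_profit_loss(100, "Buy", 190.0, 190.5, 10))   #  5.0
print(calculate_profit_loss(100, "Sell", 190.0, 190.5, 10))  # -5.0
print(calculate_profit_loss(100, "Hold", 190.0, 190.5, 10))  #  0.0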
Iteration Function:
def calculate_final_profit(identifier, initial_balance, quantity, num_iterations, learning_rate, discount_factor, exploration_prob):
    num_actions = 3
    num_features = 5
    q_trader = QLearningTrader(num_actions, num_features, learning_rate, discount_factor, exploration_prob)
    for i in range(num_iterations):
        q_trader.observe_real_time_data(identifier)
        action = q_trader.choose_action(q_trader.current_state)
        current_close = q_trader.current_state[3]
        time.sleep(1)  # Introduce a delay before fetching the next real-time data
        q_trader.observe_next_state(identifier)
        next_close = q_trader.current_state[3]
        reward = calculate_reward(action, current_close, next_close)
        q_trader.take_action(action, reward)
    # Fetch real-time data just after the last iteration
    final_real_time_data = fetch_real_time_data(identifier)
    # Build the final state from the fetched data and get the suggested action from the Q-table
    final_state = np.array([final_real_time_data['open'], final_real_time_data['high'],
                            final_real_time_data['low'], final_real_time_data['close'],
                            final_real_time_data['volume']])
    final_suggested_action = ["Buy", "Sell", "Hold"][np.argmax(q_trader.q_table[:, np.argmax(final_state)])]
    # Calculate profit based on the final suggested action
    final_profit = calculate_profit_loss(initial_balance, final_suggested_action, current_close, final_real_time_data['close'], quantity)
    print(f"Final Suggested Action: {final_suggested_action}, Final Profit: {final_profit}")
Through these steps, the Q-learning algorithm continuously improves its grasp of the environment, allowing the agent to make more knowledgeable decisions as time progresses. The balance between exploration and exploitation, along with ongoing updates to the Q-table, lays the groundwork for the algorithm’s learning process.
Putting it all together
Now, combining all the functions we’ve developed so far, we get:
class QLearningTrader:
    def __init__(self, num_actions, num_features, learning_rate, discount_factor, exploration_prob):
        self.num_actions = num_actions
        self.num_features = num_features
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_prob = exploration_prob
        # Initialize Q-table with zeros
        self.q_table = np.zeros((num_actions, num_features))
        # Initialize state and action
        self.current_state = None
        self.current_action = None

    def choose_action(self, state):
        # Exploration-exploitation trade-off
        if np.random.uniform(0, 1) < self.exploration_prob:
            return np.random.choice(self.num_actions)  # Explore
        else:
            feature_index = np.argmax(state)
            return np.argmax(self.q_table[:, feature_index])  # Exploit

    def take_action(self, action, reward):
        # Update Q-table based on the observed reward
        if self.current_action is not None:
            feature_index = np.argmax(self.current_state)
            current_q_value = self.q_table[self.current_action, feature_index]
            new_q_value = (1 - self.learning_rate) * current_q_value + \
                          self.learning_rate * (reward + self.discount_factor * np.max(self.q_table[:, feature_index]))
            self.q_table[self.current_action, feature_index] = new_q_value
        # Update current state and action
        self.current_state = None
        self.current_action = action

    def observe_real_time_data(self, identifier):
        # Fetch real-time data
        real_time_data = fetch_real_time_data(identifier)
        # Extract features from real-time data
        self.current_state = np.array([real_time_data['open'], real_time_data['high'],
                                       real_time_data['low'], real_time_data['close'],
                                       real_time_data['volume']])

    def observe_next_state(self, identifier):
        # Update the current state with the observed next state, using the same feature layout
        real_time_data = fetch_real_time_data(identifier)
        self.current_state = np.array([real_time_data['open'], real_time_data['high'],
                                       real_time_data['low'], real_time_data['close'],
                                       real_time_data['volume']])

def fetch_real_time_data(identifier):
    source = 'nasdaq_basic'
    response = intrinio.SecurityApi().get_security_realtime_price(identifier, source=source)
    return {
        'open': response.open_price,
        'high': response.high_price,
        'low': response.low_price,
        'close': response.last_price,
        'volume': response.last_size
    }

def calculate_reward(action, current_close, next_close):
    if action == 0:  # Buy
        return 1.0 if next_close > current_close else -1.0
    elif action == 1:  # Sell
        return 1.0 if next_close < current_close else -1.0
    else:  # Hold
        return 1.0 if next_close > current_close else -1.0 if next_close < current_close else 0.0

def calculate_profit_loss(initial_balance, suggested_action, current_close, next_close, quantity):
    if suggested_action == "Buy":
        return (next_close - current_close) * quantity
    elif suggested_action == "Sell":
        return (current_close - next_close) * quantity
    else:  # Hold
        return 0.0

def calculate_final_profit(identifier, initial_balance, quantity, num_iterations, learning_rate, discount_factor, exploration_prob):
    num_actions = 3
    num_features = 5
    q_trader = QLearningTrader(num_actions, num_features, learning_rate, discount_factor, exploration_prob)
    for i in range(num_iterations):
        q_trader.observe_real_time_data(identifier)
        action = q_trader.choose_action(q_trader.current_state)
        current_close = q_trader.current_state[3]
        time.sleep(1)  # Introduce a delay before fetching the next real-time data
        q_trader.observe_next_state(identifier)
        next_close = q_trader.current_state[3]
        reward = calculate_reward(action, current_close, next_close)
        q_trader.take_action(action, reward)
    # Fetch real-time data just after the last iteration
    final_real_time_data = fetch_real_time_data(identifier)
    # Build the final state from the fetched data and get the suggested action from the Q-table
    final_state = np.array([final_real_time_data['open'], final_real_time_data['high'],
                            final_real_time_data['low'], final_real_time_data['close'],
                            final_real_time_data['volume']])
    final_suggested_action = ["Buy", "Sell", "Hold"][np.argmax(q_trader.q_table[:, np.argmax(final_state)])]
    # Calculate profit based on the final suggested action
    final_profit = calculate_profit_loss(initial_balance, final_suggested_action, current_close, final_real_time_data['close'], quantity)
    print(f"Final Suggested Action: {final_suggested_action}, Final Profit: {final_profit}")
Now, let’s conduct the ultimate experiment to analyze the results:
security_identifier = 'AAPL'
calculate_final_profit(security_identifier, initial_balance = 100, quantity = 10, num_iterations = 180, learning_rate = 0.1, discount_factor = 0.9, exploration_prob = 0.2)
Result:
Here, I’ve selected AAPL (Apple) as the identifier, set the initial balance to 100 and the quantity to 10, and run 180 iterations, equivalent to roughly 180 seconds or 3 minutes (one iteration per second). The strategy resulted in a profit of 0.3 units upon selling. It’s important to note that these parameters are adjustable based on specific preferences and market conditions. Feel free to modify them to better suit your needs and objectives.
NOTE: This strategy is intentionally kept simple, employing a straightforward reward function and incorporating randomization in exploration. It’s crucial to recognize that due to the inherent randomness and simplicity, different outcomes may arise for the same dataset snapshot. This article serves as an introductory and simplified overview of the strategy. It is recommended to tailor and adjust the approach based on specific requirements and objectives.
Final Thoughts
In conclusion, our exploration of Q-Learning, emphasizing key features such as open, high, low, and last prices, provides practical insights into the algorithm’s adaptability.
The iterative cycles striking a balance between exploration and exploitation contribute to its functionality. Additionally, leveraging Intrinio’s Nasdaq Basic API enhances our experiment, allowing us to incorporate diverse and accurate financial data. The performance evaluation, considering profits and losses, offers a hands-on perspective on the algorithm’s real-world applicability.
In essence, this journey not only sheds light on Q-Learning but also showcases its potential when coupled with powerful financial data extraction tools like Intrinio’s Nasdaq Basic API.
With that being said, you’ve reached the end of the article. Hope you learned something new and useful. If you’ve used Intrinio’s APIs for extracting data, let me know your user experience in the comments. Thank you very much for your time.