PettingZoo
Create a custom environment for parallelized trading.
Observation state:
The agents take in the metrics broadcast from the exchange analytics server, plus limit order updates from the exchange server for their own positions only, and build a state observation vector from them. The analytics metrics should arrive pre-normalized, but the limit order updates (quantity of shares owned, price, number of unfilled orders, etc.) are unique to each agent, so they need to be percent-change + z-score normalized here.
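A rough sketch of what that per-agent normalization could look like. The helper names and the percent-change-then-z-score-over-a-rolling-history recipe are assumptions of this sketch, not an existing API:

```python
import numpy as np

def pct_change_zscore(history):
    """Percent-change then z-score normalization for one raw per-agent series
    (shares owned, order price, unfilled order count, ...)."""
    history = np.asarray(history, dtype=np.float64)
    if history.size < 2:
        return 0.0
    changes = np.diff(history) / (np.abs(history[:-1]) + 1e-9)  # step-over-step pct change
    std = changes.std()
    if std < 1e-9:
        return 0.0
    return float((changes[-1] - changes.mean()) / std)

def build_observation(analytics_metrics, own_order_series):
    """analytics_metrics: pre-normalized vector broadcast by the analytics server.
    own_order_series: dict of raw per-agent histories that still need normalizing."""
    own_features = [pct_change_zscore(series) for series in own_order_series.values()]
    return np.concatenate([
        np.asarray(analytics_metrics, dtype=np.float32),
        np.asarray(own_features, dtype=np.float32),
    ])
```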
Action:
Three numbers: two floats for price and quantity, and one boolean for hold. This may be extended later to include more complex orders such as modifies, cancels, and stop losses.
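As a sketch, this could map onto a Gymnasium Dict space; the bounds and the sign-of-quantity convention for buy vs. sell are placeholder assumptions:

```python
from gymnasium import spaces
import numpy as np

# Placeholder bounds; real tick/lot limits would come from the exchange.
# Using the sign of quantity for buy vs. sell is an assumption of this sketch.
action_space = spaces.Dict({
    "price":    spaces.Box(low=0.0,     high=np.inf, shape=(1,), dtype=np.float32),
    "quantity": spaces.Box(low=-np.inf, high=np.inf, shape=(1,), dtype=np.float32),
    "hold":     spaces.Discrete(2),  # 1 = do nothing this step
})
```

A flat Box of shape (3,) with the hold flag thresholded at 0.5 would also work if the training library prefers flat action vectors.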
Step:
Step will place the order with the exchange, then fetch the next state from the exchange server and the analytics server and generate a new state observation. Done is triggered when losses exceed 10%, or after a specified time period since the first position was entered, at which point we exit all positions.
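A sketch of how that step could look in a PettingZoo ParallelEnv subclass. The exchange/analytics client objects and all of their method names are hypothetical stand-ins for the real servers:

```python
import numpy as np
from pettingzoo import ParallelEnv

class TradingEnv(ParallelEnv):
    # __init__, reset, and the space definitions are omitted here.

    def step(self, actions):
        # Place each agent's limit order. self.exchange and self.analytics are
        # hypothetical clients for the exchange and analytics servers; none of
        # these method names are an existing API.
        for agent, action in actions.items():
            if not int(np.asarray(action["hold"]).item()):
                self.exchange.place_limit_order(
                    agent,
                    price=float(np.asarray(action["price"]).item()),
                    quantity=float(np.asarray(action["quantity"]).item()),
                )

        metrics = self.analytics.latest_metrics()  # pre-normalized broadcast
        observations, rewards, terminations, truncations, infos = {}, {}, {}, {}, {}
        for agent in self.agents:
            fills = self.exchange.order_updates(agent)  # own positions only
            observations[agent] = self.build_observation(metrics, fills)
            rewards[agent] = 0.0  # per-step reward is deferred; see the Reward section

            pnl = self.exchange.unrealized_pnl(agent) / self.starting_equity[agent]
            hit_stop = pnl <= -0.10  # losses worse than 10%
            timed_out = self.steps_since_entry[agent] >= self.max_episode_steps
            if hit_stop or timed_out:
                self.exchange.close_all_positions(agent)  # exit all positions
            terminations[agent] = hit_stop
            truncations[agent] = timed_out
            infos[agent] = {}

        self.agents = [a for a in self.agents if not (terminations[a] or truncations[a])]
        return observations, rewards, terminations, truncations, infos
```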
Reward:
To determine reward:
- Treat the contribution of multiple trades over a stretch of time as one whole episode: sum the per-trade rewards, then use reward decomposition like RUDDER (https://ml-jku.github.io/rudder/) to determine the contribution of each trade action to the overall reward.
To do this, we examine the episode, look at individual entry or exit positions, wait until that position is exited or re-entered by the same number of shares, then analyze that stretch for price changes and volatility to produce a reward.
Example: let's say you sell 5 shares at $6; 2 get filled at $7 and 3 get filled at $6, taking your total shares owned from 7 to 2.
Then some time later you purchase 2 shares at $3, taking total shares owned from 2 to 4.
In between, the price fell from $100 to $80 and volatility was at 30%.
We determine the reward for the sale of the last 2 of those 5 shares. That's a partial decomposition of the action.
We don't know the contribution of the first 3 shares we sold until we own up to 7 shares again.
The reward for those last 2 out of 5 shares sold is:
2 * ($100 - $80) * 30% = $12
This is the number of shares times the price drop we avoided times the volatility, giving a kind of inverse Sharpe ratio.
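A small sketch of that matched-stretch reward, reproducing the arithmetic above (the function name and the side convention are just for illustration):

```python
def matched_stretch_reward(shares_matched, price_at_open, price_at_close, volatility, side):
    """Reward for the part of a position change that has since been offset.

    For a sell later offset by a buy, the reward is the drop we avoided;
    for a buy later offset by a sell, it is the rise we captured.
    """
    if side == "sell":
        move = price_at_open - price_at_close  # drop avoided while out of those shares
    else:
        move = price_at_close - price_at_open  # rise captured while holding those shares
    return shares_matched * move * volatility

# Worked example above: the 2 repurchased shares offset 2 of the 5 sold earlier,
# the price fell from $100 to $80 in between, and volatility was 30%.
print(matched_stretch_reward(2, 100.0, 80.0, 0.30, side="sell"))  # 12.0
```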
I think this reward model would work quite well for assessing individual actions, but it fails to assess inaction.
Inaction is still an action: choosing not to buy or sell and just waiting. Therefore, we determine the reward for individual position changes as above, sum them up, then apply RUDDER decomposition to determine the contribution of each individual action along the episode, including the inactions.
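A minimal sketch of that redistribution step, assuming we already have a trained return predictor (e.g. the LSTM used in the RUDDER paper) that estimates the final episode return after each step; the numbers in the usage line are hypothetical:

```python
import numpy as np

def rudder_redistribute(predicted_returns, episode_return):
    """RUDDER-style redistribution of a summed episode reward.

    predicted_returns[t] is a trained return predictor's estimate of the final
    episode return after seeing the state-action sequence up to step t. The
    redistributed reward at step t is the change in that prediction, so waiting
    steps that shift the expected outcome still receive credit. Any prediction
    error is spread uniformly so the per-step rewards sum to the true return.
    """
    g = np.asarray(predicted_returns, dtype=np.float64)
    redistributed = np.diff(np.concatenate([[0.0], g]))
    correction = (episode_return - redistributed.sum()) / len(redistributed)
    return redistributed + correction

# Hypothetical 5-step episode whose summed matched-stretch reward is 12.0:
print(rudder_redistribute([1.0, 4.0, 4.0, 9.0, 11.5], 12.0))
# -> [1.1 3.1 0.1 5.1 2.6], which sums to 12.0
```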