Reinforcement Learning – christina norwood

It’s Done

Finally got a very basic implementation of the RL model up and running, basic in that the input data is just a few lags of the cryptocap chart. Much work to be done, but I do have an XGBoost model predicting the prospects specifically for ADAUSDT over the next few days, and an RL model using broader market data to tell whether it’s ‘a good time to trade’. As I said, much work to be done, but the basic system is in place. From here on – refinement. I’ve been working on this for quite a while now (mostly learning how to implement Deep Reinforcement Learning in the context of trading) and I’m very happy to reach this point at last.

At the moment both models are telling me to trade, but my intuition says otherwise. I think I’ll give it another 12 hours or so to see which way the wind is blowing. Is there still room for intuition? Maybe.

Back to Square One

During the week I had a few small wins, and a larger loss (8%) that wiped out those wins and brought me back to square one. Almost. So, business as usual. Actually I’m 10 cents down on my initial 100USDT, but close enough.

I still haven’t got the RL model up and running, but I have been tweaking the XGBoost model a bit. I’m also changing my basic trading method, now using trailing stops that should prevent larger losses but will probably mean more small losses. I’ve come to hate stops over the years because they have always meant that the price drops to hit my stop, sells me out at a loss, then bounces back up again so I can’t buy back in at the lower price. Has this changed? Probably not, but I’ll see, I guess.
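As a rough sketch of how a percentage trailing stop behaves: the function name, the 5% trail and the prices below are all assumptions for illustration, not the settings or data actually used here. The stop follows the highest price seen since entry and never moves down.

```python
def trailing_stop_exit(prices, trail_pct=0.05):
    """Return (index, price) where a long opened at prices[0] is stopped out.

    Returns None if the trailing stop is never hit.
    """
    peak = prices[0]
    for i, price in enumerate(prices[1:], start=1):
        peak = max(peak, price)            # the reference high only ratchets upwards
        stop = peak * (1 - trail_pct)      # stop sits a fixed percentage below the peak
        if price <= stop:
            return i, price
    return None


# Made-up prices: the dip to 104 triggers the stop, then price recovers to 112
print(trailing_stop_exit([100, 104, 110, 108, 104, 112]))
```

The made-up series also shows the annoyance described above: the position is closed on the dip and the price is higher on the very next bar.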

A few insights are emerging from my explorations. XGBoost has a handy feature to print a chart of which features are most important in its decision making, and it seems recent price movements are the least important. The most important features seem to be the total change and the volatility over the past 90 days. I guess those are showing the overall trend. I’m not sure I need an ML model for that.
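For reference, here is one way those two features could be computed from a list of closing prices. This is only a sketch under assumptions: "total change" is taken to be the simple return over the window, and "volatility" the standard deviation of daily returns. The actual feature definitions may differ.

```python
from statistics import stdev


def trend_features(closes, window=90):
    """Return (total_change, volatility) over the last `window` days.

    Needs at least three closes so that two daily returns exist.
    """
    recent = closes[-(window + 1):]                 # window returns need window + 1 prices
    total_change = recent[-1] / recent[0] - 1       # simple return over the whole window
    daily = [b / a - 1 for a, b in zip(recent, recent[1:])]
    volatility = stdev(daily)                       # sample std dev of daily returns
    return total_change, volatility
```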

Basic State for RL

For the input to the RL model that I’ll use to ‘supervise’ the xgboost model I’ve gathered a crypto market cap index, SPY, Nasdaq, and a crypto Fear and Greed index which goes back to Feb 2018. The others go back further than that but I might have to use the Feb 18 as the start date. This should be enough to make a start on constructing state for the RL model and generating rewards.

My plan at the moment is simply to use the 1 day future return on the cryptomarketcap as the reward. I think I need episodes for the A2C model, so I guess I can just use a set number of days for an episode. I’m using daily data for this, not 6 hourly, which I think will work OK, just have to see. There won’t be any direct interaction between the RL model and the XGBoost model. The process will be that the RL model will predict whether it’s ‘a good time to trade’, i.e. positive return on the current day’s state, or not. If not, then whatever the XGBoost model says about trading ADA specifically will be ignored. If both are in agreement on green to trade, then do so. As the RL model is looking at broader market conditions (SPY, Nasdaq, broad crypto index) then the two models should complement each other, or at least that’s the plan. I think this qualifies as an implementation of meta-labelling, although a very minimal and rough one. Well, the proof is in the pudding, as they say.
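The reward and the gating rule described above can be sketched in a few lines. The function names are hypothetical, and the two boolean inputs stand in for whatever the real models output.

```python
def next_day_returns(closes):
    """1-day forward return for each day; the last day has no future, so None."""
    rewards = [(closes[i + 1] - closes[i]) / closes[i]
               for i in range(len(closes) - 1)]
    return rewards + [None]


def should_trade(rl_good_time: bool, xgb_says_buy: bool) -> bool:
    """Trade only when both models agree; the RL model acts as a veto."""
    return rl_good_time and xgb_says_buy


def split_episodes(rows, days_per_episode):
    """Cut daily rows into fixed-length episodes, dropping any short remainder."""
    n = len(rows) // days_per_episode
    return [rows[i * days_per_episode:(i + 1) * days_per_episode] for i in range(n)]
```

Because the two models never exchange information, the combination is just a logical AND of their signals.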

Actually, having done a quick review of meta-labelling I’m not sure that what I’m doing does qualify as such. However I did read somewhere some pundit opine that using a deep ML model to ‘critique’ a shallower ML model was a good procedure, and I can certainly make the RL model ‘deeper’ than the XGBoost model, if that actually helps.

Note to Self re Spy

For my RL model I’ve decided to use a crypto total market cap index, excluding BTC and ETH, as my main source of data, and will probably generate rewards based on that. All well and good.

So I’m also adding the SPY index, but have problems relating to dates and times. I downloaded data from TradingView, and all the dates have the time portion at 13:30 or maybe 14:30. I’m guessing this is the opening time of the US exchanges (probably NYSE), in UTC time. Also, no Saturday or Sunday prices. I’m sure there’s a proper way to convert the datetime to 00:00:00 hours so that I can concatenate the columns with the cryptocap data but I used a hack. Converting the datetime to just the date (with date() function), saving the dataframe to a CSV file, then loading that back in to a new dataframe, parsing the date column, gave me 2024-08-12 00:00:00 instead of the original 2024-08-12 13:30:00. Also, using the asfreq(‘d’) function on the dataframe filled in the missing weekend days, forward filling with Friday prices. All a bit of a hack but I can’t think of any better way to deal with this issue. Not all data I might want to use for this model will be conveniently in daily format, at UTC time.
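A possibly tidier route, sketched here with made-up prices: pandas has `DatetimeIndex.normalize()`, which resets the time portion to midnight, and `asfreq` accepts a `method` argument for filling the rows it inserts. As far as I know, `asfreq` on its own leaves the new weekend rows as NaN, so `method="ffill"` is passed explicitly.

```python
import pandas as pd

# Hypothetical closes stamped at the US market open (UTC), weekdays only
spy = pd.DataFrame(
    {"close": [100.0, 101.0, 102.0]},
    index=pd.to_datetime([
        "2024-08-09 13:30:00",   # Friday
        "2024-08-12 13:30:00",   # Monday
        "2024-08-13 13:30:00",   # Tuesday
    ]),
)

spy.index = spy.index.normalize()              # 13:30:00 -> 00:00:00
spy_daily = spy.asfreq("D", method="ffill")    # add Sat/Sun rows, carrying Friday forward

print(spy_daily)
```

Normalizing before calling `asfreq` also means it does not matter whether a given row was stamped 13:30 or 14:30.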

XGBoost

After considering what kind of ‘other model’ I could use to provide input to my RL agent, I think now that a supervised learning classification model might be most appropriate, and I’ve heard that XGBoost is one of the best. I’m also thinking of using different data for that than the instrument I’m actually trading with the RL algorithm. For example if I’m trading ADA or ETH I could use BTC or a crypto index as the source of data for the other model. All crypto coins are fairly highly correlated, although BTC seems to take the lead most of the time.

I guess it would be fairly straightforward to turn BTC prices into a supervised learning problem, with maybe a positive daily return indicating good to buy, a negative return as good to sell, and close to zero being good to hold. Actually, some threshold above and below zero would probably work better.
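A minimal sketch of that labelling, assuming a symmetric 1% threshold (an arbitrary value picked for illustration) and the 0 = sell, 1 = hold, 2 = buy encoding used elsewhere on this blog:

```python
def label_return(daily_return, threshold=0.01):
    """Map a daily return to a class label: 0 = sell, 1 = hold, 2 = buy."""
    if daily_return > threshold:
        return 2      # clearly positive: good to buy
    if daily_return < -threshold:
        return 0      # clearly negative: good to sell
    return 1          # inside the band around zero: hold


# Made-up returns of +3%, -2% and +0.4%
labels = [label_return(r) for r in [0.03, -0.02, 0.004]]
print(labels)
```

A wider threshold pushes more days into the hold class, so the class balance is worth checking before training a classifier on these labels.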

I could potentially use this approach with lots of different input data, as long as they have some correlation to crypto prices and therefore some predictive power. Anyway, time to get up to speed on using XGBoost effectively, so naturally I now have a small book on the subject, courtesy of machinelearningmastery, as usual with ML books. I’m sure I play a significant part in keeping the ML publishing industry afloat.

State

I did some reading on the use of dropout layers in RL Agents, and the general consensus is that it’s not such a good idea. So I removed them and ran the study again for 5000 episodes this time just to check for some kind of convergence to a reasonably stable result, but no luck.

This shouldn’t be too surprising really. If significant results could be extracted from a few lagged returns and a few returns over longer time periods then everyone would be doing it. My agent needs more intelligence, in the sense of information. Before I spend much time tuning the algorithm I need to ensure that there is some meaningful information to extract.

So, time to go back to working on input features, aka state. Apart from data such as other markets and sentiment, I must explore the possibility of using the output of other models as inputs to this one. One possibility is to use an unsupervised learning model to identify different clusters of trading conditions, and use this as input to my RL agent. I even have a course on the use of Unsupervised Learning in Trading. Perhaps it’s time to review it. Would this be a case of ensemble learning? Probably would.

First Pass

This plot shows the A2C algorithm with my largest dataset, approx 6 years of hourly data for ETHUSDT. If my logic is correct then the rewards are the 30 day returns, so maybe 5% per month, not too shabby. Of course very variable, and probably some overfitting as each 100 episodes runs through the entire dataset a couple of times. Anyway, something to work on.

A dropout layer in each network, with a modest dropout setting (0.2), has smoothed things out a lot. And done away with the optimism. I’m hoping that some other, more diverse, inputs will improve matters.

Overfitting?

I’ve got results similar to the above a couple of times now, where the results improve for a while and then get worse again. A Google search suggests the most likely cause is overfitting. This would not be too surprising given that each of the 50 episodes is a complete run through the data.

Suggested solutions include reducing the complexity of the network so it doesn’t learn the specifics of the data so well, and various approaches to regularization. I already have a couple of dropout layers, but I increased the percentage of one of them a bit. And also decreased the number of nodes on my layers. So I’m running it again, let’s see what happens this time. At about 3 hours per run-through, finding good values might take a while. For the chart above I also used RMSprop as the optimizer instead of Adam which I had used previously. I do get some more positive results, so that’s a good thing.

Later:

RMSprop started giving me strange results, a whole string of zero PnL. I think this has happened before with that optimizer. Anyway, back to Adam, with a slightly simpler network, and

Now that’s better. Still, it looks like 50 episodes is a bit more than I need, and I’ll reduce it in future. In fact I could save a model after about 10 episodes and then explore further variations using that as a starting point, avoiding the initial training period altogether except for the first time. Perhaps I’ll try that.

Learning

I’m now using my biggest data set, approx 48,000 instances for the ETHUSDT trading pair (one hour candles) over the past six or so years. I ran through this data 20 times and the above chart shows total PnL for each run, given each trade is $100. So if I used the trained model for predictions I would just about lose my $100 by the end of the six years. At least this chart does suggest some learning is taking place. This is using a Double Q Network (or Double Deep Q Network if you prefer).

So, I’ve got a fair amount of data, I’ve got some learning happening, now to get the predictions above zero! Options are to increase the number of input features, to try different network architectures, or to try different algorithms. I’m just about at the point of implementing A2C (Advantage Actor Critic) to work with my test harness. A bit more study needed first. I think I’ll try that first, as my intuition is that it is the option that promises the most improvement.

Random Actions

Average Total PnL	-402
Standard Deviation	135

Total PnL with Random Actions (average of 10 episodes)

Time to test out different algorithms. I’ve modified my code to use a larger dataset, both in number of items and size of the state. Also I modified the agent’s policy to always pick a random action, so the current HoldProcessor, described in the previous post, doesn’t actually do anything. I also ran the agent’s run method 10 times to get an average result (total reward, which is percentage profit or loss). I’m considering this a baseline upon which to improve.

The interpretation of the result is as follows: if I put $100 on every trade (when the agent received an action of 2 (buy) from the policy, followed sometime later by 0 (sell)), I would lose $402 over the six year period that the data represents.
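To make that interpretation concrete, here is a sketch of how a total PnL figure of this kind could be computed from a sequence of actions. It is not the code behind the numbers above. It assumes one open position at a time, ignores fees, ignores a buy while already in a position and ignores a sell with nothing to sell.

```python
def total_pnl(prices, actions, stake=100.0):
    """Dollar PnL of round trips: a buy (2) opens, the next sell (0) closes."""
    pnl = 0.0
    entry = None                      # entry price of the open position, if any
    for price, action in zip(prices, actions):
        if action == 2 and entry is None:
            entry = price             # open a position with a fixed stake
        elif action == 0 and entry is not None:
            pnl += stake * (price - entry) / entry   # percentage move times stake
            entry = None
    return pnl


# Made-up example: buy at 10, hold, sell at 12; the final buy is never closed
print(total_pnl([10, 11, 12, 9], [2, 1, 0, 2]))
```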

Hold It!

I’ve implemented the UML Sequence diagram I described in a previous post. The Processor I used, which I’ve called HoldProcessor, decides to hold (action 1) whatever the state is. And doesn’t learn anything.

from processor_base import Processor
import numpy as np


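class HoldProcessor(Processor):
    """Baseline processor: always holds and never learns."""

    def __init__(self):
        super().__init__()

    def get_action(self, state: np.ndarray) -> int:
        # Ignore the state entirely; 0 = sell, 1 = hold, 2 = buy
        return 1

    def learn(self, batch: tuple) -> None:
        # Nothing to update: this processor has no parameters
        pass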

Even with this primitive decision maker I usually make a profit when I run the app using the last year’s worth of daily data for ADAUSDT. Probably because ADA has generally been going up for the past year, so random trades are more likely to make a profit than a loss.

Also I have a policy that selects a random action half the time, and consults the processor the other half, so there is a chance of the occasional buy and sell. This is to implement the explore/exploit requirement for the processor to actually learn anything, if it could. I didn’t include the policy in the sequence diagram (an oversight) but it’s just a function that figuratively ‘tosses a coin’ and if it comes down heads it picks a random action (0 to sell, 1 to hold, 2 to buy) otherwise it gets the action from the processor as shown in the sequence diagram.
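The coin-toss policy described above amounts to something like the following sketch. The function name, the `explore_prob` and `rng` parameters and the stand-in `AlwaysHold` class are illustrative assumptions, not the actual code from the app.

```python
import random


class AlwaysHold:
    """Stand-in for any processor exposing get_action(state)."""

    def get_action(self, state):
        return 1   # hold


def coin_toss_policy(processor, state, explore_prob=0.5, rng=random):
    """Pick a random action with probability explore_prob, else ask the processor."""
    if rng.random() < explore_prob:
        return rng.choice([0, 1, 2])        # explore: 0 = sell, 1 = hold, 2 = buy
    return processor.get_action(state)      # exploit: defer to the processor


action = coin_toss_policy(AlwaysHold(), state=None)
print(action)
```

Making the exploration probability a parameter would allow it to be reduced over time once a processor that actually learns is plugged in.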

In the code shown the HoldProcessor inherits from Processor. This is an abstract base class that specifies that any subclass must have a get_action method and a learn method. So as long as my processor classes that use DDQNs or Actor-Critic networks implement these methods I should be able to swap this one for one of those with no other changes. The beauty of coding to an interface!

from abc import ABC, abstractmethod
import numpy as np


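class Processor(ABC):
    """Interface that every decision-making processor must implement."""

    @abstractmethod
    def get_action(self, state: np.ndarray) -> int:
        """Return 0 (sell), 1 (hold) or 2 (buy) for the given state."""
        pass

    @abstractmethod
    def learn(self, batch: tuple):
        """Update the processor from a batch of experience."""
        pass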

Sequence Diagram

I think I’ve done this correctly, both from a UML perspective and in terms of how my app should actually work. I don’t think I need to create frameworks and such, because the only variation I expect is the use of different processing algorithms. That would involve using different classes to implement the Processor in the diagram above, and nothing else needs to change. Well, I could change what data I’m working with, but that would only require that I pass a filename to the Environment constructor to tell it what data to load. Easy.

One thing I haven’t included is the possibility of running through the data several times. This would involve having the run method (at No. 6 in the diagram) itself being in a loop. This application is different from most of the book implementations in that it’s basically a continuous process, not episodic, so there isn’t really a terminal state. I guess I could consider running out of capital a terminal state, but I don’t think that’s useful at this stage.

I’ve been reading a more theoretical book on the subject, Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto. It doesn’t have actual code in it but it does use pseudocode to define various algorithms, and gives a more detailed discussion of all the ins and outs of the subject than the books that focus more on implementation in Python do. I think I’ll be re-reading it quite a few times.