The Grind

Training my model on 48,000 data items (hourly candles for ETHUSDT from the past six years, approximately) takes about 6 hours. Tuning the model is likely to take weeks. I’m not being very systematic about it, because I think I need to include some more input features, so it’s really just an initial exploratory phase.

I will be reading up on alternative algorithms as well. Currently it’s just a basic Double Deep Q Network. I’ll get my head around the Advantage Actor Critic model and give that a try too. Now that I’ve actually got code up and running, with enough understanding of how it works to make the changes I want, the long haul starts.

More Progress

I’ve restructured the RL trading app the way I like it, and sorted out the obvious bugs to the point where it runs without error. Of course there might still be logic errors in the code, and the performance is not great. There seems to be a lot of reshaping, squeezing and unsqueezing of tensors along the way, probably more than is actually necessary. I’m going to have to examine parts of the code in more detail. Anyway, running it on my 6hr data for ADAUSDT produced the following plot.

Each of the 50 episodes was one pass through the entire dataset. Probably some serious overfitting there, although no obvious learning took place. However, the average reward (return per trade) was greater than zero most of the time.

So writing this the way I want, and fixing all the errors, has improved my understanding of how it all works quite a bit, and put me in a position to explore different variations. I’m feeling pretty happy with progress. I’ve been working on this for several months now.

I added another layer to the neural network. There seem to be fewer results below zero, and those that are aren’t as far below. Probably a better result.

Sin Bin

Perhaps instead of getting frustrated by all the errors I’m making I should just celebrate them with a collection. So here’s the first.

 File "/home/christina/Pycharm Projects/RL Framework/ada/ada_agent.py", line 46, in run
    current_q_values = self.policy_net(states).gather(1, actions)
RuntimeError: Index tensor must have the same number of dimensions as input tensor

A bit of editing brings up a new error:

File "/home/christina/Pycharm Projects/RL Framework/ada/ada_agent.py", line 46, in run
    current_q_values = self.policy_net(states).gather(1, actions)
RuntimeError: Size does not match at dimension 2 expected index [1, 1, 64] to be smaller than self [1, 64, 3] apart from dimension 1

Am I getting any closer? I think I’m going to have to spend a month or two doing nothing but reshaping tensors, changing their datatypes, and converting them to and from anything else they can be converted to, until I can do it in my sleep. I thought I had the gather method under control a couple of days ago. Apparently not, since that’s what’s giving me all my current errors. Perhaps I should lay in large amounts of alcohol to see me through this tedious process.
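For future reference, the shape contract gather expects turned out to be the crux. A minimal sketch (toy numbers, not my actual network) of the reshaping that makes the first error go away:

```python
import torch

batch_size, n_actions = 4, 3

# Q-values from the network: one row per state, one column per action.
q_values = torch.randn(batch_size, n_actions)          # shape [4, 3]

# Actions stored in the replay buffer, one integer per transition.
actions = torch.tensor([0, 2, 1, 2])                   # shape [4]

# gather requires the index tensor to have the same number of dimensions
# as the input, so add a trailing dimension before gathering...
current_q = q_values.gather(1, actions.unsqueeze(1))   # shape [4, 1]

# ...and usually squeeze it away again for the loss calculation.
current_q = current_q.squeeze(1)                       # shape [4]
print(current_q.shape)
```

The second error (index [1, 1, 64]) suggests a stray extra dimension on the batch itself, which the same unsqueeze/squeeze discipline should flush out.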

And another:

Traceback (most recent call last):
  File "/home/christina/Pycharm Projects/RL Framework/app.py", line 8, in <module>
    agent.run()
  File "/home/christina/Pycharm Projects/RL Framework/ada/ada_agent.py", line 63, in run
    print(np.mean(rewards))
  File "/home/christina/.local/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 3502, in mean
    return mean(axis=axis, dtype=dtype, out=out, **kwargs)
TypeError: mean() received an invalid combination of arguments - got (dtype=NoneType, out=NoneType, axis=NoneType, ), but expected one of:
 * (*, torch.dtype dtype)
 * (tuple of ints dim, bool keepdim, *, torch.dtype dtype)
 * (tuple of names dim, bool keepdim, *, torch.dtype dtype)
Traceback (most recent call last):
  File "/home/christina/Pycharm Projects/RL Framework/app.py", line 8, in <module>
    agent.run()
  File "/home/christina/Pycharm Projects/RL Framework/ada/ada_agent.py", line 43, in run
    states, actions, rewards, next_states = self.extract_tensors(experiences)
  File "/home/christina/Pycharm Projects/RL Framework/ada/ada_agent.py", line 86, in extract_tensors
    next_states = torch.from_numpy(np.array([x.next_state for x in experiences])).float()
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (64,) + inhomogeneous part.

I think I’ll leave that last one for tomorrow.
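Both errors turned out to have fairly mechanical fixes. A sketch with toy values, assuming the rewards are scalar tensors and the states are equal-length numpy vectors:

```python
import numpy as np
import torch

# np.mean on a list of torch tensors dispatches to torch.Tensor.mean
# with numpy-style keyword arguments it doesn't accept. Converting to
# plain floats first sidesteps the mismatch.
rewards = [torch.tensor(0.5), torch.tensor(-0.1), torch.tensor(0.2)]
mean_reward = np.mean([r.item() for r in rewards])
print(mean_reward)

# The "inhomogeneous shape" error means the per-experience states don't
# all have the same shape. np.stack fails loudly on ragged input, which
# makes the offending element much easier to find than np.array's message.
states = [np.zeros(10, dtype=np.float32) for _ in range(64)]
next_states = torch.from_numpy(np.stack(states)).float()
print(next_states.shape)
```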

Moving Right Along

I’ve written code for my ADA trading environment, and an Agent that decides to sell, hold or buy using the integers 1, 0 and 2. Here’s the code that currently makes that important decision:

    def select_action(self, state):
        """select action and pass to environment"""
        action = random.randint(0, 2)
        self.current_action = action    # needed for replay buffer
        reward, next_state, trade_closed = self.env.receive_action(action)
        return reward, next_state, trade_closed

The third line just selects a number from 0 to 2 randomly! If I run my app with 8863 lines of data I usually get about a 0.02% average return, which might just cover the trading fees.

I need to replace that line with a neural network or two. I’m going to attempt to do that with as little reference to other people’s code as possible. A real test of my understanding of how these things work. What mark will I get for this assignment? Well, that could be the profit that my code manages to produce, if I ever use it for live trading. Learning Reinforcement Learning is itself an exercise in Reinforcement Learning. How meta!
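For what it’s worth, a minimal sketch of what that replacement might look like: an epsilon-greedy wrapper around a small Q-network. The layer sizes and epsilon value are placeholders, not my actual settings:

```python
import random
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Tiny Q-network: state vector in, one Q-value per action out."""
    def __init__(self, state_size: int, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def select_action(policy_net, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise take
    the action with the highest predicted Q-value (0=hold, 1=sell, 2=buy)."""
    if random.random() < epsilon:
        return random.randint(0, 2)
    with torch.no_grad():
        q_values = policy_net(state.unsqueeze(0))   # add batch dimension
    return int(q_values.argmax(dim=1).item())

net = PolicyNet(state_size=10)
action = select_action(net, torch.randn(10), epsilon=0.1)
print(action)
```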

ETA: Have an NN making decisions, but not actually learning yet. Sorted out numerous issues converting lists <-> np.ndarray <-> torch.tensor, all with the right number of dimensions! And no integers amongst the floats. Overall profits about the same as selecting random actions. Now, on to the actual learning.

Finding a Framework

The various books and courses I study all organize code differently. Even something as simple as the Replay Buffer I’ve seen four or five different implementations of, from a separate list for each item in the saved ‘experience’ (state, action, reward, next state, done) to named tuples stored in a deque. I’m hunting around to see where each required task actually gets done, and what kind of data is being used. If I want to rejig an example to work with different data, it can be a job to make sure that everything is in the same format. I need to know what is a list, what is a numpy array, what is a torch tensor (and what is a pandas DataFrame, my preferred type for the loaded data), and where the necessary conversions take place.
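The named-tuple-in-a-deque variant is the one I find most readable, so that’s the one going into my template. A bare sketch:

```python
import random
from collections import deque, namedtuple

Experience = namedtuple('Experience',
                        ['state', 'action', 'reward', 'next_state', 'done'])

class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences fall off the left end."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Experience(*args))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for i in range(150):
    buf.push([float(i)], 0, 0.0, [float(i + 1)], False)
print(len(buf))          # capped at capacity
batch = buf.sample(8)
```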

I need some kind of standard template, and to convert anything I’m studying to that template so I know that everything is covered and everything fits. I’m inspired by this basic image of what RL involves:

Simple and straightforward. So I should have an Environment class, an Agent class, and an App that creates each and sets the ball rolling. Everything else should be hidden inside those two classes. That way I can use different environments, such as games or trading environments or something else, and it will be transparent to the main App. And different agents using different strategies, but once again, transparent.

I’ll start with a couple of simple examples, maybe a GridWorld using state/action tables, and then move on to neural networks, while trying to maintain the same basic framework. Hopefully it will make my task easier.

So here’s a very bare-bones start. I decided to make Agent and Environment Abstract Base Classes, so I’ll have to subclass them to get concrete implementations. Having taught Java for 16 years, I feel right at home. I’ve kept all program logic out of the app that acts as the starting point. I guess the Agent will end up being pretty busy, as it has to make all the decisions (and improve its ability to make good decisions!). There’s only one point of interaction between Agent and Environment: the Agent provides an action, and the Environment provides the required responses – the next state, the reward, and whether the ‘game’ is complete.

For the sake of completeness I’ve added the above trivial concrete implementations of those two abstract base classes.
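Since the code itself isn’t reproduced here, the skeleton is roughly this (a sketch with placeholder names, not the exact classes):

```python
from abc import ABC, abstractmethod

class Environment(ABC):
    """Owns the data or game; responds to actions."""
    @abstractmethod
    def receive_action(self, action):
        """Return (reward, next_state, done) for the given action."""

class Agent(ABC):
    """Owns the decision-making (and, eventually, the learning)."""
    def __init__(self, env: Environment):
        self.env = env

    @abstractmethod
    def select_action(self, state):
        """Choose an action and pass it to the environment."""

# Trivial concrete implementations, just to prove the plumbing works.
class EchoEnvironment(Environment):
    def receive_action(self, action):
        return 0.0, action, True     # reward, next_state, done

class FixedAgent(Agent):
    def select_action(self, state):
        return self.env.receive_action(0)

env = EchoEnvironment()
agent = FixedAgent(env)
reward, next_state, done = agent.select_action(None)
```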

The Bee’s Knees

I’ve been looking at Ivan’s implementation of the A2C (Advantage Actor Critic) approach to deep reinforcement learning for trading, which he said (in 2022, when the book was published) was the bee’s knees (my term, not his). So I applied it to my recently downloaded data for ETHUSDT. Results shown above. Not great, but it’s a start.

His construction of state is fairly basic, just the last 10 closing prices as far as I can tell. I’m sure I can do something about that. Also, he’s using raw price data. Most people who talk about training models for trading recommend using returns (percent change) rather than actual prices, as the latter don’t have a constant mean or variance. I’m not sure if that’s relevant to these RL models, but I have a feeling that it is. Also the trades are simple: just buy (or short) at the start of the day and sell (or cover) at the end. No holding until a signal to close. The actual code is going to take some study. I get the general idea of what the actor-critic approach is trying to achieve, compared with the temporal difference approach, which is what I have been looking at up ’til now. The devil is in the details, however.
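The switch from prices to returns is a one-liner in pandas. A toy sketch, assuming the data arrives as a DataFrame with a close column:

```python
import pandas as pd

# Toy closing prices; in practice this would be the downloaded OHLC data.
df = pd.DataFrame({'close': [100.0, 102.0, 101.0, 103.0]})

# Percent change strips out the non-stationary price level, leaving a
# series with a roughly constant mean and variance for the model.
df['return'] = df['close'].pct_change()

# The first row has no prior price, hence NaN; drop it before training.
df = df.dropna().reset_index(drop=True)
print(df['return'].tolist())
```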

So, I’ve looked at the rather elaborate approach used by Quantra in their Deep Reinforcement Learning in Trading course, with state composed of OHLC data over several bars at three levels of granularity, plus technical indicators and calendar-related inputs. The approach is temporal difference (I think that’s what it’s called). I’ve looked at a similar approach from DeepLizard, created to solve a GridWorld environment, which I’ve rejigged to work with trading data. Not sure how successfully. And now this A2C approach from Ivan Gridin’s book.

I’m not at the point where I could write code to implement one of these without consulting references. Too many details that I haven’t totally internalized yet, especially concerning getting tensors into the right shape. It’s a language problem really, internalizing the grammar and vocabulary so that you can speak/write without thinking about it. I guess it’s just practice, practice, practice.

What’s Up, Ivan?

I have a book by Ivan Gridin on Reinforcement Learning, and he has a very interesting-looking chapter on using it for stock trading, with code versions in TensorFlow and PyTorch. It’s a later chapter and I’m going to have to study what comes before to follow the details, but it looks promising.

However, I recently came across this article on Medium written by the same Ivan, discussing the pitfalls of Reinforcement Learning in stock trading and explicitly warning against the very approach he used in his book (Actor-Critic). The wisdom of experience, perhaps. Anyway, as far as I understand what he’s saying, one could deal with the issues through proper risk management. I’ll be interested to see what Yves Hilpisch has to say on the issue in his forthcoming book on reinforcement learning in trading. I guess since I don’t intend to do any serious trading, but am just looking at this as a ‘hobby’, I won’t be risking anything significant.

More Data, Baby

I believe it was Enrico Fermi who once said that you won’t solve a problem in 1000 years if you don’t think about it for 5 minutes. So I finally decided to write a loop to download data from Binance, given the limit of 1000 bars for any one download, and now have 48,000 one hour bars (OHLC data). I’ve gone for ETHUSDT this time, not sure why, maybe just so it’s easier to distinguish the files.
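The paging logic is the only mildly fiddly part. A sketch of the loop, with the HTTP call injected as a function so the logic stands alone (the endpoint and parameter names are my assumptions about the public klines API, not code from this project):

```python
import time

BASE_URL = 'https://api.binance.com/api/v3/klines'   # public REST endpoint

def download_klines(fetch, symbol='ETHUSDT', interval='1h',
                    start_ms=0, limit=1000):
    """Page through the klines endpoint 1000 bars at a time.

    `fetch(params)` should GET BASE_URL with the given params and return
    the decoded JSON: a list of bars, each beginning with the bar's open
    time in milliseconds. Injecting it keeps the paging logic testable."""
    bars = []
    while True:
        chunk = fetch({'symbol': symbol, 'interval': interval,
                       'startTime': start_ms, 'limit': limit})
        if not chunk:
            break
        bars.extend(chunk)
        # Next request starts one millisecond after the last bar received.
        start_ms = chunk[-1][0] + 1
        if len(chunk) < limit:
            break               # final, partial page
        time.sleep(0.1)         # stay polite with the rate limiter
    return bars
```

With requests installed, fetch could be something like `lambda p: requests.get(BASE_URL, params=p, timeout=10).json()`.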

The training loop I have been using for the ADA data has cycled through the data quite a few times, due to there not being all that much of it (about 7000 bars), but I think that might have caused some of my problems. This time I’ll just go through it once. Perhaps I should add a couple more features, but getting data at the 1 hour granularity is not that easy. I could add in the BTC price (easy), or a general crypto index (but getting that at 1 hour for six years might prove challenging). Perhaps I could use some other data at the daily level (such as the S&P 500) and resample it to hourly. Must give it some thought.

I’ve hit a bit of a snag with the online course I’ve mentioned. It uses TensorFlow, which I can’t run on the GPU on my machine because of some conflicts that I don’t understand (probably because I’ve set up PyTorch to run on the GPU). This means it’s pretty slow. Moving from TF to PT is easy as far as creating the model is concerned, but the training is giving me some issues: PyTorch requires a custom training loop, whereas TF just calls a fit method. Not so hard with a standard supervised learning problem, but an RL problem is a bit different. I just don’t know TF well enough to work out the equivalents. I’ll have to give that aspect of the project a bit more thought. I’m probably making mountains out of molehills (or storms in a teacup, or whatever).

The main issue is understanding exactly what it is that I’m trying to optimize. I guess when updating the Q table one is working towards the point where the new value is close to the old value, and the difference between them (the loss) is minimal. Sometimes I think the fog is lifting, and sometimes not.
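To pin down what is being optimized, in the deep-Q case at least: the loss is the gap between the network’s current estimate Q(s, a) and the bootstrapped target r + γ·max Q(s′, a′). A sketch with toy numbers standing in for the network outputs:

```python
import torch
import torch.nn.functional as F

gamma = 0.99                                    # discount factor

# Pretend outputs for a batch of 4 transitions, 3 possible actions.
q_current_all = torch.randn(4, 3, requires_grad=True)
q_next_all = torch.randn(4, 3)                  # from the target network
actions = torch.tensor([[0], [2], [1], [2]])
rewards = torch.tensor([0.5, -0.1, 0.0, 0.2])
dones = torch.tensor([0.0, 0.0, 1.0, 0.0])      # 1 where the episode ended

# Q(s, a) that the policy network currently predicts for the taken actions.
q_current = q_current_all.gather(1, actions).squeeze(1)

# Bootstrapped target: r + gamma * max_a' Q(s', a'), zeroed at terminals.
q_target = rewards + gamma * q_next_all.max(dim=1).values * (1 - dones)

# The loss is the gap between estimate and target; training shrinks it.
loss = F.mse_loss(q_current, q_target.detach())
loss.backward()
```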

ETA: Facepalm! The Quantra course has about 10 years of data for the S&P500 at 5 minute granularity, as I’ve mentioned quite a few times. They even resample it to hourly as part of the state building process. Surely I can use that as an input feature for my crypto trading. What an idiot I am.

Validation?

I’m attempting to validate my models by running them on data that was not used to train them, but I’m getting very strange results. During training I saved some models that appeared to be giving decent results, but when loading them back again and running them ON THE ORIGINAL TRAINING DATA they give no results at all!! Not all of them though, only 10 out of 12. Something is happening here and you don’t know what it is…
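One round trip that has to be exactly right is saving and reloading. A sketch of the state-dict route, with the architecture rebuilt before loading (a toy network, not mine):

```python
import io
import torch
import torch.nn as nn

# A stand-in for the trading network; the real architecture doesn't matter.
def build_model():
    return nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))

model = build_model()

# Save the state dict (weights only), not the pickled module object.
buffer = io.BytesIO()                 # a file path in real use
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Reload into a freshly built network with the identical architecture.
restored = build_model()
restored.load_state_dict(torch.load(buffer))
restored.eval()                       # disable dropout/batch-norm updates

# The same input should now give identical outputs from both copies.
x = torch.randn(1, 10)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print(same)
```

Another suspect worth checking: if validation reuses the training-time action selection, a leftover exploration epsilon can drown out whatever the network learned.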

Well, I could validate two of the models. They didn’t perform as well on the test data as on the training data, but the difference was not so great. Some overfitting, or perhaps just regime change (of the fiscal, not political, variety). I’ll have to consider how to proceed. More data would be great, but the only way to get it is by going for shorter time periods. I want to end up with something usable, and preferably not a trading bot. Is it worth going for 4 hour data? That would increase the total number of periods by fifty percent, but I’m not sure it will make much difference. There’s obviously a good reason why the Quantra course on RL was producing worse results than a simple buy-and-hold strategy (on the S&P500), even with ten years of five minute data.

With the experience of working through an actual project I can go back to the books and have a better understanding of the issues being discussed. It’s a bit hard to do that ‘in a vacuum’ so to speak. There are still a lot of avenues to explore. I’ve read good things about LSTMs, and PPO, and stuff like that. I might even find out what those acronyms mean. Should keep me busy for a long time.

Feeding the Beast

I’m expanding the input features to my ADA neural network. I have added day of the week (one-hot encoded), a measure of the range of each period (low/high), and an RSI indicator courtesy of TA-Lib. Plus I updated the data from Binance and now have over 8000 six-hour periods. The spreadsheet with the data looks quite impressive. Below is a screenshot showing the first 80 periods, or 20 days, one percent of the total.

Some of the column headings are not quite accurate. 30dayret is actually 30periodret, where the period is 6 hours, not 1 day. In the past I’ve mostly worked with daily data so it’s a habit to refer to everything as 30day, 60day, etc.
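A sketch of the feature-building step with toy bars (the real RSI comes from TA-Lib; here it’s left as a comment so the sketch stays self-contained):

```python
import numpy as np
import pandas as pd

# Toy 6-hour OHLC bars; in practice this is the Binance download.
idx = pd.date_range('2024-01-01', periods=8, freq='6h')
df = pd.DataFrame({'high': np.linspace(101, 108, 8),
                   'low': np.linspace(99, 106, 8),
                   'close': np.linspace(100, 107, 8)}, index=idx)

# Day of the week, one-hot encoded into 0/1 columns.
dow = pd.get_dummies(pd.Series(df.index.dayofweek, index=df.index),
                     prefix='dow')
df = df.join(dow)

# Range of each period as the low/high ratio (close to 1 = quiet bar).
df['range'] = df['low'] / df['high']

# RSI would come from TA-Lib in the real pipeline, e.g.:
#   df['rsi'] = talib.RSI(df['close'].values, timeperiod=14)
print(df.shape)
```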

Running my training script on this data gave me an average return per trade of 0.5% (before transaction costs), and maybe 100 such trades per year. I guess if I cleared 0.3% per trade on 100 trades that would be about 30% per year, which is not too shabby. Still, rosy test results have cost me quite a bit in the past. Those trading gods are fickle, if not downright malicious.

I guess I’ll have to redo my hyperparameter tuning now that I’m using an altered data set, and some validation of course. And maybe explore different network topologies, more nodes, more layers, potentially different kinds of layer such as RNN or CNN.

So far I’m only looking at a long-only strategy. I could expand this to a long-short strategy, but that’s harder to actually trade now that Binance doesn’t allow margin trading (in Australia). Perhaps I should check out 1inch or similar. Binance was so convenient. Not going to get too excited. If all goes well I might put $100 into trading the strategy, just to maintain some interest.

An interesting possibility is that the model trained on ADA could be used on other coins. That seems to be a common practice, using pre-trained models for similar problems. It doesn’t take that long to train a model though. Currently about an hour for 1,000,000 episodes (each episode is one period of data).

Hyperparameter Tuning

Hyperparameter tuning sounds like such a fancy term, but in reality it’s just adjusting a couple of variables to get the best result possible. Like finding the perfect temperature to cook crepes (I’ve been seasoning a new carbon steel crepe pan lately, with mixed results).

I’ve been exploring various values of the learning rate. An ML algorithm starts off by ‘guessing’ how important any given input feature is in determining the final result, then adjusts the importance depending on how wrong the predictions are. The size of the adjustment is the learning rate, and the best rate has to be determined on a case-by-case basis. So: try a whole bunch of values within a range that ‘seems reasonable’, and find the best by trial and error. There’s a lot of that in machine learning.
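The sweep itself is just a loop. A toy sketch using a trivial supervised problem so it runs in seconds; the range of rates is exactly the ‘seems reasonable’ kind:

```python
import torch
import torch.nn as nn

# Toy supervised stand-in: fit y = 2x so runs are quick and comparable.
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 2 * x

def final_loss(lr, steps=200):
    """Train a tiny model from scratch at the given learning rate and
    report the loss it ends up with."""
    torch.manual_seed(0)                    # same start for every rate
    model = nn.Linear(1, 1)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Sweep a plausible range and keep the winner.
results = {lr: final_loss(lr) for lr in [1e-4, 1e-3, 1e-2, 1e-1]}
best_lr = min(results, key=results.get)
print(best_lr, results[best_lr])
```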

Another ‘hyperparameter’ commonly experimented with is the optimizer algorithm used to go from first guess to best result. I’ve tried Adam, SGD (Stochastic Gradient Descent) and RMSprop. Also AdamW, which is supposed to be an improved Adam but in my case gave worse results. I don’t intend to modify the actual network much until I get some more consistent results. So far they’re very variable. I think I need a wider range of inputs.

I haven’t found that using a GPU is faster than using the CPU, though I’ve seen charts showing that it is for large, complicated problems, if not necessarily for smaller, simpler ones. One big disadvantage of using the CPU for my machine learning problem, however, is that it takes all the processing power, and I can’t do anything else on the computer while waiting for training to complete. And when you’re doing trial and error, that can take a long time. Using the GPU for training leaves me with enough CPU headroom to do most of the other things I use my computer for. Definitely the way to go.
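The device plumbing that makes this work is small. A sketch:

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Model and data must live on the same device, or PyTorch raises an error.
model = torch.nn.Linear(10, 3).to(device)
batch = torch.randn(32, 10).to(device)
output = model(batch)
print(device, output.shape)
```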

Too Good to be True

My early tests of my RL trading app gave promising results, which at the time I thought were a little ‘too good to be true’. Well, that feeling was justified, as I later discovered that I had written the code in a way that repeatedly learned from a small subset of the data: essentially ‘rote learning’ (called overfitting in ML lingo) rather than learning general principles that would transfer to unseen data.

So after rewriting the code, and fixing many other errors besides (for which the logging I’ve incorporated has proved somewhat helpful), and also downloading 6 hour data from Binance for the entire 6 years that they’ve had ADAUSDT on their exchange, I’ve been running the app again with a variety of optimizers but always getting a similar result, which is that almost no learning takes place!! My average return over several thousand trades is approx 0.03%. Not enough to even cover the fees (which I haven’t included in the algorithm).

There’s not much point trying to do further optimization, or explore a range of different network topologies, when the baseline is so close to zero. I think I’m going to have to address the ‘what data to use’ issue first up, until I do actually get some learning, and then try to improve on it. That Quantra course used quite a lot of input features, including several technical indicators and what day of the week it was. I’m going to have to enlarge my ‘state space’ a bit. It’s interesting to see what other people (authors of books/courses) are using for their features. Ideally I should be using some measure of market sentiment. Perhaps I need to learn how to scrape X for tweets (?) relating to crypto. Another day, perhaps.