Overfitting?

I’ve got results similar to the above a couple of times now, where the results improve for a while and then get worse again. A search on Google suggests the most likely cause is overfitting. That would not be too surprising, given that each of the 50 episodes is a complete run through the data.

Suggested solutions include reducing the complexity of the network so it doesn’t learn the specifics of the data so well, and various approaches to regularization. I already have a couple of dropout layers, but I increased the percentage of one of them a bit, and also decreased the number of nodes in my layers. So I’m running it again; let’s see what happens this time. At about 3 hours per run-through, finding good values might take a while. For the chart above I also used RMSprop as the optimizer instead of Adam, which I had used previously. I do get some more positive results, so that’s a good thing.
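To illustrate the kind of network change I mean, here’s a minimal sketch of a small Q-network with dropout. The layer sizes and dropout rates are illustrative, not my actual values.

```python
import torch
import torch.nn as nn

# Illustrative only: a small Q-network with dropout for regularization.
# Layer sizes and dropout rates here are made up, not my actual values.
class QNetwork(nn.Module):
    def __init__(self, n_features=24, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Dropout(p=0.3),      # bumped up to fight overfitting
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(p=0.2),
            nn.Linear(32, n_actions),
        )

    def forward(self, x):
        return self.net(x)
```

Shrinking the hidden sizes and raising the dropout percentages are the two knobs I’m turning here.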

Later:

RMSprop started giving me strange results: a whole string of zero PnL. I think this has happened before with that optimizer. Anyway, back to Adam, with a slightly simpler network, and…

Now that’s better. Still, it looks like 50 episodes is a bit more than I need, and I’ll reduce it in future. In fact I could save a model after about 10 episodes and then explore further variations using that as a starting point, avoiding the initial training period altogether except for the first time. Perhaps I’ll try that.
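The save-after-10-episodes idea could look something like this in PyTorch. The file name and function names are hypothetical, just a sketch of the checkpoint/resume pattern.

```python
import torch

# Hypothetical sketch: checkpoint after the initial training period,
# then resume later experiments from the checkpoint instead of retraining.
CHECKPOINT = "warm_start.pt"  # file name is just an example

def save_checkpoint(policy_net, optimizer, episode):
    torch.save({
        "episode": episode,
        "model_state": policy_net.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, CHECKPOINT)

def load_checkpoint(policy_net, optimizer):
    ckpt = torch.load(CHECKPOINT)
    policy_net.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["episode"]   # so training can pick up where it left off
```

Saving the optimizer state as well as the model matters for Adam, since its per-parameter moment estimates are part of what the warm start buys you.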

New Adventure

I’ve signed up for Google Cloud Platform, and am working through a Udemy course on how to actually use it to run ML apps in Python. I find it all a bit intimidating, not sure why. My main motivation is to access more powerful resources, now that my needs are more demanding. I ran my trading app 50 times over my large dataset this morning and it took over 3 hours. That’s not such a big deal, except that I’ll be wanting to iterate over multiple variations (as in hyperparameter tuning) and that becomes unworkable on my local machine. I could upgrade, but there are still limits.

GCP gives a newcomer several hundred dollars of credit on signup, with 30 days to spend it. I had better not wait too long to give it a good tryout. My trials (on my own machine) indicate that using a GPU is no faster than a CPU, but apparently this depends on how big the dataset is, and particularly on how many features are being calculated at once, because that’s where the parallelism of the GPU comes into its own. I’ve currently only got 24 features, and the 4 cores on my CPU are running at about 70% when I’m running my trading app. I doubt one of those fancy new Nvidia GPUs would improve on that too much. So I’m not sure that GCP will actually speed up my code, but there is probably some way I could achieve that.

Learning

I’m now using my biggest data set: approx. 48,000 instances for the ETHUSDT trading pair (one-hour candles) over the past six or so years. I ran through this data 20 times and the above chart shows total PnL for each run, assuming each trade is $100. So if I used the trained model for predictions I would just about lose my $100 by the end of the six years. At least this chart does suggest some learning is taking place. This is using a Double Q Network (or Double Deep Q Network if you prefer).

So, I’ve got a fair amount of data, I’ve got some learning happening, now to get the predictions above zero! Options are to increase the number of input features, to try different network architectures, or to try different algorithms. I’m just about at the point of implementing A2C (Advantage Actor Critic) to work with my test harness. A bit more study needed first. I think I’ll try that first, as my intuition is that it is the option that promises the most improvement.

Random Actions

Total PnL with Random Actions (average of 10 episodes):

Average Total PnL: -402
Standard Deviation: 135

Time to test out different algorithms. I’ve modified my code to use a larger dataset, both in number of items and size of the state. Also I modified the agent’s policy to always pick a random action, so the current HoldProcessor, described in the previous post, doesn’t actually do anything. I also ran the agent’s run method 10 times to get an average result (total reward, which is percentage profit or loss). I’m considering this a baseline upon which to improve.

The interpretation of the result is as follows: if I put $100 on every trade (when the agent received an action of 2 (buy) from the policy, followed sometime later by 0 (sell)), I would lose $402 over the six-year period that the data represents.

Hold It!

I’ve implemented the UML Sequence diagram I described in a previous post. The Processor I used, which I’ve called HoldProcessor, decides to hold (action 1) whatever the state is, and doesn’t learn anything.

from processor_base import Processor
import numpy as np


class HoldProcessor(Processor):
    def __init__(self):
        super(HoldProcessor, self).__init__()

    def get_action(self, state: np.ndarray) -> int:
        return 1    # hold

    def learn(self, batch: tuple):
        pass

Even with this primitive decision maker I usually make a profit when I run the app using the last year’s worth of daily data for ADAUSDT. Probably because ADA has generally been going up for the past year, so random trades are more likely to make a profit than a loss.

Also I have a policy that selects a random action half the time, and consults the processor the other half, so there is a chance of the occasional buy and sell. This is to implement the explore/exploit requirement for the processor to actually learn anything, if it could. I didn’t include the policy in the sequence diagram (an oversight) but it’s just a function that figuratively ‘tosses a coin’ and if it comes down heads it picks a random action (0 to sell, 1 to hold, 2 to buy) otherwise it gets the action from the processor as shown in the sequence diagram.
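The coin-toss policy described above can be sketched as a short function; the names and the 50% split are as described, though my actual code may differ slightly.

```python
import random

# Sketch of the 'coin toss' policy described above.
# Actions: 0 = sell, 1 = hold, 2 = buy.
def select_action(processor, state, explore_prob=0.5):
    if random.random() < explore_prob:
        return random.randint(0, 2)        # explore: pick a random action
    return processor.get_action(state)     # exploit: ask the processor
```

With the HoldProcessor plugged in, this means half the actions are random and the other half are always "hold", which is exactly how the occasional buy and sell sneaks in.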

In the code shown the HoldProcessor inherits from Processor. This is an abstract base class that specifies that any subclass must have a get_action method and a learn method. So as long as my processor classes that use DDQNs or Actor-Critic networks implement these methods I should be able to swap this one for one of those with no other changes. The beauty of coding to an interface!

from abc import ABC, abstractmethod
import numpy as np


class Processor(ABC):
    @abstractmethod
    def get_action(self, state: np.ndarray) -> int:
        pass

    @abstractmethod
    def learn(self, batch: tuple):
        pass

Sequence Diagram

I think I’ve done this correctly, both from a UML perspective and in terms of how my app should actually work. I don’t think I need to create frameworks and such, because the only variations I expect are different processing algorithms. That would involve using different classes to implement the Processor in the diagram above, and nothing else needs to change. Well, I could change what data I’m working with, but that would only require that I pass a filename to the Environment constructor to tell it what data to load. Easy.

One thing I haven’t included is the possibility of running through the data several times. This would involve having the run method (at No. 6 in the diagram) itself being in a loop. This application is different from most of the book implementations in that it’s basically a continuous process, not episodic, so there isn’t really a terminal state. I guess I could consider running out of capital a terminal state, but I don’t think that’s useful at this stage.

I’ve been reading a more theoretical book on the subject, Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto. It doesn’t have actual code in it but it does use pseudocode to define various algorithms, and gives a more detailed discussion of all the ins and outs of the subject than the books that focus more on implementation in Python do. I think I’ll be re-reading it quite a few times.

Visual Paradigm

I need to have a very clear idea of the components of my RL Framework (or is Template a better term?) and what each one does, and how they interact. Rewriting someone else’s code to fit my structure is still a bit of a problem for me. So I’m turning to a tool that I used a bit when teaching Java, namely Visual Paradigm. Actually I did use it for Python a few years back when I was coding up a trading bot. The Sequence Diagram is very helpful for defining how the different parts interact.

There’s a free Community Edition which I’ve installed. Charts are watermarked but that’s not a problem for me. I’ll probably need to do a quick review of the docs, but it’s not entirely new to me.

I’ve been wondering whether to start putting some money down (as in $100 per trade, with maybe a dozen trades per year). Just a hobby. I’d like to work with long/short strategies but Binance has made that a lot more difficult than it used to be. Using DeFi requires using multiple platforms, but perhaps that’s not such a problem really. I might hold off a while longer. At the moment I’d be sure to lose money, but I think that’s a certainty however good my models are!

The Grind

Training my model on 48,000 data items (hourly candles for ETHUSDT from the past six years, approximately) takes about 6 hours. Tuning the model is likely to take weeks. I’m not being very systematic about it because I think I need to include some more feature inputs, so it’s really just an initial exploratory phase.

I will be reading up on alternative algorithms as well. Currently it’s just a basic Double Deep Q Network. I’ll get my head around the Advantage Actor Critic model and give that a try as well. So now that I’ve actually got code up and running, with enough understanding of how it works to make the changes I want, the long haul starts.

More Progress

I’ve restructured the RL trading app the way I like it, and sorted out the obvious bugs to the point where it runs without error. Of course there might still be logic errors in the code, and the performance is not great. There seems to be a lot of reshaping, squeezing and unsqueezing of tensors along the way, probably more than is actually necessary. I’m going to have to examine parts of the code in more detail. Anyway, running it on my 6hr data for ADAUSDT produced the following plot:

Each of the 50 episodes was once through the entire dataset. Probably some serious overfitting there, although no obvious learning took place. However the average reward (return from trade) was greater than zero most of the time.

So writing this the way I want, and fixing all the errors, has improved my understanding of how it all works quite a bit, and put me in a position to explore different variations. I’m feeling pretty happy with progress. I’ve been working on this for several months now.

I added another layer to the neural network. There seem to be fewer results below zero, and not so far below zero. Probably a better result.

Sin Bin

Perhaps instead of getting frustrated by all the errors I’m making I should just celebrate them with a collection. So here’s the first.

 File "/home/christina/Pycharm Projects/RL Framework/ada/ada_agent.py", line 46, in run
    current_q_values = self.policy_net(states).gather(1, actions)
RuntimeError: Index tensor must have the same number of dimensions as input tensor

A bit of editing brings up a new error:

File "/home/christina/Pycharm Projects/RL Framework/ada/ada_agent.py", line 46, in run
    current_q_values = self.policy_net(states).gather(1, actions)
RuntimeError: Size does not match at dimension 2 expected index [1, 1, 64] to be smaller than self [1, 64, 3] apart from dimension 1

Am I getting any closer? I think I’m going to have to spend a month or two doing nothing but reshape tensors and changing their datatypes, and converting to/from anything else that they can be converted from until I can do it in my sleep. I thought I had the gather method under control a couple of days ago. Apparently not, since that’s what’s giving me all my current errors. Perhaps I should lay in large amounts of alcohol to see me through this tedious process.
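For my own future reference, here’s a minimal demonstration of what gather expects; the shapes are illustrative, but the fix for my first error is the unsqueeze.

```python
import torch

# Minimal demonstration of torch.gather: the index tensor must have the
# same number of dimensions as the input. Shapes here are illustrative.
batch_size, n_actions = 4, 3
q_values = torch.randn(batch_size, n_actions)   # like policy_net(states)
actions = torch.tensor([0, 2, 1, 1])            # shape (4,): one dim too few

# actions.unsqueeze(-1) has shape (4, 1), matching q_values' two dimensions,
# so gather picks one Q-value per row; squeeze drops the extra dim again.
current_q = q_values.gather(1, actions.unsqueeze(-1)).squeeze(-1)
```

The second error (index [1, 1, 64] vs self [1, 64, 3]) smells like an extra batch dimension somewhere, which is the same disease in a different place.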

And another:

Traceback (most recent call last):
  File "/home/christina/Pycharm Projects/RL Framework/app.py", line 8, in <module>
    agent.run()
  File "/home/christina/Pycharm Projects/RL Framework/ada/ada_agent.py", line 63, in run
    print(np.mean(rewards))
  File "/home/christina/.local/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 3502, in mean
    return mean(axis=axis, dtype=dtype, out=out, **kwargs)
TypeError: mean() received an invalid combination of arguments - got (dtype=NoneType, out=NoneType, axis=NoneType, ), but expected one of:
 * (*, torch.dtype dtype)
 * (tuple of ints dim, bool keepdim, *, torch.dtype dtype)
 * (tuple of names dim, bool keepdim, *, torch.dtype dtype)
Traceback (most recent call last):
  File "/home/christina/Pycharm Projects/RL Framework/app.py", line 8, in <module>
    agent.run()
  File "/home/christina/Pycharm Projects/RL Framework/ada/ada_agent.py", line 43, in run
    states, actions, rewards, next_states = self.extract_tensors(experiences)
  File "/home/christina/Pycharm Projects/RL Framework/ada/ada_agent.py", line 86, in extract_tensors
    next_states = torch.from_numpy(np.array([x.next_state for x in experiences])).float()
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (64,) + inhomogeneous part.

I think I’ll leave that last one for tomorrow.
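A note for tomorrow-me: that ValueError generally means the next_state arrays in the batch don’t all have the same shape (a terminal state stored as None, or a stray scalar, is a common cause). A quick, hypothetical helper to spot the odd one out:

```python
import numpy as np

# Hypothetical diagnostic: report which elements of a batch have a shape
# different from the first element's. Useful before np.array(...) chokes.
def check_shapes(next_states):
    shapes = [np.shape(s) for s in next_states]
    odd = [i for i, s in enumerate(shapes) if s != shapes[0]]
    return odd  # indices whose shape differs from the first element's
```

np.stack also raises a clearer error than np.array when shapes disagree, which makes it a better choice for building the batch in the first place.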

Moving Right Along

I’ve written code for my ADA trading environment, and an Agent that decides to sell, hold or buy using the integers 1, 0 and 2. Here’s the code that currently makes that important decision:

    def select_action(self, state):
        """select action and pass to environment"""
        action = random.randint(0, 2)
        self.current_action = action    # needed for replay buffer
        reward, next_state, trade_closed = self.env.receive_action(action)
        return reward, next_state, trade_closed

The third line just selects a number from 0 to 2 randomly! If I run my app with 8863 lines of data I do usually get about 0.02% average return, which might just cover the trading fees.

I need to replace that line with a neural network or two. I’m going to attempt to do that with as little reference to other people’s code as possible. A real test of my understanding of how these things work. What mark will I get for this assignment? Well, that could be the profit that my code manages to produce, if I ever use it for live trading. Learning Reinforcement Learning is itself an exercise in Reinforcement Learning. How meta!

ETA: Have an NN making decisions, but not actually learning yet. Sorted out numerous issues converting lists <-> np.ndarray <-> torch.tensor, all with the right number of dimensions! And no integers amongst the floats. Overall profits about the same as selecting random actions. Now, on to the actual learning.
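For the record, here’s a condensed sketch of the conversion chain that kept tripping me up; the values and shapes are illustrative.

```python
import numpy as np
import torch

# The list <-> ndarray <-> tensor conversion chain, in one place.
# Values and shapes are illustrative.
state_list = [0.1, 0.2, 0.3]

state_np = np.array(state_list, dtype=np.float32)  # list -> ndarray, floats not ints
state_t = torch.from_numpy(state_np)               # ndarray -> tensor (shares memory)
batched = state_t.unsqueeze(0)                     # add batch dim: (3,) -> (1, 3)

back_np = batched.squeeze(0).numpy()               # tensor -> ndarray
back_list = back_np.tolist()                       # ndarray -> plain list
```

The two recurring gotchas for me were forgetting the explicit float32 dtype and forgetting which end of the pipeline needs the batch dimension.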

Finding a Framework

The various books and courses I study all organize code differently. I’ve seen four or five different implementations of even something as simple as the Replay Buffer, from a separate list for each item in the saved ‘experience’ (state, action, reward, next state, done) to named tuples stored in a deque. I’m hunting around to see where each required task actually gets done, and what kind of data is being used. If I want to rejig an example to work with different data it can be a job to make sure that everything is in the same format. I need to know what is a list, what is a numpy array, what is a torch tensor (and what is a pandas DataFrame, my preferred data type for the loaded data), and where the necessary conversions take place.
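The named-tuples-in-a-deque variant mentioned above is the one I lean towards; a minimal sketch (class and field names are my own choices):

```python
import random
from collections import deque, namedtuple

# One of the Replay Buffer variants mentioned above: named tuples in a deque.
Experience = namedtuple("Experience",
                        ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # old experiences fall off the end

    def push(self, *args):
        self.buffer.append(Experience(*args))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The deque’s maxlen gives you the fixed capacity for free, and the named tuple means batch code can say e.crtainreward instead of remembering tuple positions.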

I need some kind of standard template, and to convert anything I’m studying to that template so I know that everything is covered and everything fits. I’m inspired by this basic image of what RL involves:

Simple and straightforward. So I should have an Environment class, an Agent class, and an App that creates each and sets the ball rolling. Everything else should be hidden inside those two classes. So I can use different environments such as games or trading environments or some other, and it will be transparent to the main App. And different agents using different strategies, but once again, transparent.

I’ll start with a couple of simple examples, maybe a GridWorld using state/action tables, and then move on to neural networks, while trying to maintain the same basic framework. Hopefully it will make my task easier.

So here’s a very bare bones start. I decided to make Agent and Environment Abstract Base Classes, so I’ll have to subclass to get concrete implementations. Having taught Java for 16 years, I feel right at home. I’ve kept all program logic out of the app that acts as starting point. I guess the Agent will end up being pretty busy, as it has to make all the decisions (and improve its ability to make good decisions!) There is only one point of interaction of Agent with Environment: the Agent provides an action and the environment provides the required responses – the next state, the reward, and whether the ‘game’ is complete.

For the sake of completeness I’ve added the above trivial concrete implementations of those two abstract base classes.
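Roughly, the shape I have in mind looks like the sketch below; my actual class names, method signatures and trivial subclasses may differ slightly.

```python
from abc import ABC, abstractmethod

# A bare-bones sketch of the framework described above; my actual method
# names and signatures may differ slightly.
class Environment(ABC):
    @abstractmethod
    def step(self, action):
        """Return (next_state, reward, done) for the given action."""

class Agent(ABC):
    @abstractmethod
    def run(self, env):
        """Interact with the environment until the 'game' is complete."""

# Trivial concrete implementations, just enough to set the ball rolling.
class DummyEnvironment(Environment):
    def __init__(self, steps=3):
        self.remaining = steps

    def step(self, action):
        self.remaining -= 1
        return None, 0.0, self.remaining <= 0

class DummyAgent(Agent):
    def run(self, env):
        done, total = False, 0.0
        while not done:
            _, reward, done = env.step(1)  # always 'hold'
            total += reward
        return total
```

The App then just constructs one of each and calls run – everything else stays hidden behind the two interfaces.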