Nailing it Down

Decision time. After considering lots of options, here’s the plan. I intend to trade manually (not via a bot), and for that to be doable I think the shortest timeframe that makes sense is 6 hours. So: check the prices and the models four times a day.

Whatever I decide to trade, I’ll use XGBoost to generate signals, where the target is a profit over the following five days: that is, the cumulative return at some point over the next 20 six-hour periods exceeds some threshold. The state will be as many inputs related to the instrument as I can find that have some predictive capability.
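A minimal sketch of how that target could be computed, assuming a pandas Series of 6-hourly closes (the 2% threshold is just a placeholder):

```python
import numpy as np
import pandas as pd

def make_target(close: pd.Series, horizon: int = 20, threshold: float = 0.02) -> pd.Series:
    """1 if the cumulative return exceeds `threshold` at any point within the
    next `horizon` periods, else 0. Trailing bars without a full window stay NaN."""
    labels = pd.Series(np.nan, index=close.index)
    prices = close.to_numpy()
    for i in range(len(prices) - horizon):
        path = prices[i + 1 : i + 1 + horizon] / prices[i] - 1.0  # cumulative return path
        labels.iloc[i] = float((path > threshold).any())
    return labels
```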

I’ll then check the output of this model with an RL model that takes any non-endogenous data I can dream up, with returns from the first model as rewards. Basically, keep adding inputs until I get some result. Once that happens I’ll start trading a small amount. Long only. Probably on Binance. So, time to start putting all this together. The rubber hits the road. Almost.

Avenue to Explore

Transfer learning involves taking a model trained on one dataset and applying it to a different dataset. A major problem with crypto is lack of data, since many coins have not been around for very long, so one needs fairly high-frequency data to train a model. I’m not referring to high-frequency trading here, with timeframes in the milliseconds. For me hourly data is pretty high frequency, as I generally prefer to work with daily data.

So, if I can train a model on hourly data but then use it to make predictions on daily data, the job could be done. I clearly need to explore transfer learning more closely, but it seems to involve keeping the early layers of a neural network and replacing the final layer, or something like that. I’m encouraged about the possibility of all this by a comment I remember from one of the first books I read on price action trading, which said that chart patterns have a fractal nature, in that they look the same whatever timeframe one is looking at. Anyway, definitely an avenue to explore.
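If I’ve understood the mechanics correctly, it looks something like this toy PyTorch sketch (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# A small MLP trained on hourly data (training loop omitted)
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),               # original output head
)

# Transfer to daily data: freeze the early layers, replace the final layer
for param in model.parameters():
    param.requires_grad = False
model[-1] = nn.Linear(64, 2)        # new head, trainable by default

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
# ... fine-tune on daily data using this optimizer ...
```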

An Edge?

I ran xgboost on my large dataset as a binary classification problem. For the target I simply recorded whether the following time period’s return was positive or not. So basically: given current data, will the price go up in the next hour? Here’s the resulting confusion matrix:

                 Predicted Negative    Predicted Positive
True Negative          4046                  3834
True Positive          3678                  4203

These figures relate to the test set. The model was trained on the first 70% of the data and tested (above) on the remaining 30%. Overall accuracy was 52.34%, and precision was 52.30%.

I guess the most important numbers are the predicted positive values: these were correct more often than not. The proportion of actual positives in the test dataset is 50.00%, and with close to 16,000 entries in the test set, that 2% edge in precision is probably significant. But is it big enough to be exploitable?

I’ve used xgboost without any tuning. I’ve just skimmed through that reference on using xgboost that I mentioned in a previous post. More significantly, can I refine my signals by using an RL model to validate whether the market as a whole is in a good state to trade? Or to determine position size based on level of risk? At least I have something to work with. A slight edge.
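For reference, the experiment was set up roughly along these lines (a sketch; the feature matrix X and binary target y are assumed to already exist, in time order):

```python
import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# X: feature matrix, y: 1 if the next hour's return was positive, else 0
# Chronological split: train on the first 70%, test on the last 30%
split = int(len(X) * 0.7)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

model = xgb.XGBClassifier()      # default settings, no tuning yet
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(confusion_matrix(y_test, preds))
print("accuracy: ", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds))
```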

Meta Labelling

Marcos Lopez de Prado has championed the concept of meta-labelling, which, in the field of quantitative trading, means using an ML model to apply risk management to a different quantitative trading strategy. Ernie Chan has discussed this in various contexts, for example here.

So if I was to use this in my trading, how would it relate to what I’m currently doing? The suggestion seems to be that a deep neural network ‘scrutinizes’ the output of some more shallow ML model, or indeed any other strategy including simple discretionary (not quantitative) trading.

So I’m wondering if perhaps I should use something like XGBoost to identify signals for trading, and use my RL model to assess whether trading those signals is a good idea and if so, how much to put on them. That’s pretty much the reverse of what I’ve been contemplating for the past few days. However with some theoretical basis (and actual practice) supporting this approach, I think I’ll give it a go. As for features, I’m thinking that endogenous features (related to the asset I’m trading) should be used for the base model, XGBoost in this case, and exogenous data for the ‘supervisory’ model. That’s my intuition anyway, so I’ll go with that, at least for starters.
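A rough sketch of how those pieces might fit together, with a second XGBoost model standing in for the eventual ‘supervisory’ RL agent (X_endog, X_exog, y and forward_return are hypothetical placeholder arrays):

```python
import numpy as np
import xgboost as xgb

# Primary model: endogenous features -> long signal (1) or no trade (0)
primary = xgb.XGBClassifier()
primary.fit(X_endog, y)
signal = primary.predict(X_endog)

# Meta-labels: was acting on a primary signal actually profitable?
# forward_return is the return over the holding period following each bar
meta_y = ((signal == 1) & (forward_return > 0)).astype(int)

# Secondary "supervisory" model, trained only on bars where a signal fired,
# using exogenous features (in my plan, an RL agent would play this role)
secondary = xgb.XGBClassifier()
secondary.fit(X_exog[signal == 1], meta_y[signal == 1])

# Only take a primary signal when the secondary model agrees; its predicted
# probability could also be used to scale position size
take_trade = (signal == 1) & (secondary.predict(X_exog) == 1)
size = secondary.predict_proba(X_exog)[:, 1]   # candidate sizing input
```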

XGBoost

After considering what kind of ‘other model’ I could use to provide input to my RL agent, I now think a supervised learning classification model might be most appropriate, and I’ve heard that XGBoost is one of the best. I’m also thinking of feeding it different data from the instrument I’m actually trading with the RL algorithm. For example, if I’m trading ADA or ETH I could use BTC or a crypto index as the source of data for the other model. All crypto coins are fairly highly correlated, although BTC seems to take the lead most of the time.

I guess it would be fairly straightforward to turn BTC prices into a supervised learning problem, with maybe a positive daily return indicating good to buy, a negative return good to sell, and close to zero good to hold. Actually, thresholds somewhere above and below zero would probably work better.
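Something like this, perhaps (a sketch; the 0.5% threshold is a number plucked out of the air):

```python
import pandas as pd

def label_btc(close: pd.Series, threshold: float = 0.005) -> pd.Series:
    """Label each day buy / sell / hold from the next day's return,
    with a small dead zone around zero."""
    next_return = close.pct_change().shift(-1)   # return over the following day
    labels = pd.Series("hold", index=close.index)
    labels[next_return > threshold] = "buy"
    labels[next_return < -threshold] = "sell"
    return labels
```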

I could potentially use this approach with lots of different input data, as long as they have some correlation to crypto prices and therefore some predictive power. Anyway, time to get up to speed on using XGBoost effectively, so naturally I now have a small book on the subject, courtesy of machinelearningmastery, as usual with ML books. I’m sure I play a significant part in keeping the ML publishing industry afloat.

State

I did some reading on the use of dropout layers in RL agents, and the general consensus is that it’s not such a good idea. So I removed them and ran the study again, for 5000 episodes this time, just to check for some kind of convergence to a reasonably stable result, but no luck.

This shouldn’t be too surprising really. If significant results could be extracted from a few lagged returns and a few returns over longer time periods then everyone would be doing it. My agent needs more intelligence, in the sense of information. Before I spend much time tuning the algorithm I need to ensure that there is some meaningful information to extract.

So, time to go back to working on input features, aka state. Apart from data such as other markets and sentiment, I must explore the possibility of using the output of other models as inputs to this one. One possibility is to use an unsupervised learning model to identify different clusters of trading conditions, and use this as input to my RL agent. I even have a course on the use of Unsupervised Learning in Trading. Perhaps it’s time to review it. Would this be a case of ensemble learning? Probably would.
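One way the clustering idea could look, as a sketch (market_features is a placeholder for whatever regime-related inputs I end up using):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# market_features: one row per period (e.g. volatility, volume, trend measures)
scaled = StandardScaler().fit_transform(market_features)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
regime = kmeans.fit_predict(scaled)    # a "market regime" id for each period

# The regime id (or a one-hot encoding of it) then gets appended to the
# state vector passed to the RL agent.
```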

First Pass

This plot shows the A2C algorithm with my largest dataset, approximately 6 years of hourly data for ETHUSDT. If my logic is correct then the rewards are the 30-day returns, so maybe 5% per month, not too shabby. Of course it’s very variable, and there’s probably some overfitting, as each 100 episodes runs through the entire dataset a couple of times. Anyway, something to work on.

A dropout layer in each network, with a modest dropout setting (0.2), has smoothed things out a lot. And done away with the optimism. I’m hoping that some other, more diverse, inputs will improve matters.
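Roughly what I mean by a dropout layer in each network, as a sketch (the sizes are placeholders, not my actual architecture; the critic gets the same treatment):

```python
import torch.nn as nn

state_dim, n_actions = 16, 3     # placeholders

actor = nn.Sequential(
    nn.Linear(state_dim, 128), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(128, n_actions), nn.Softmax(dim=-1),
)
```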

The Road Ahead

I’ve pretty much got my head around the implementation of the Advantage Actor Critic (A2C) algorithm provided by Ivan Gridin in his book Practical Deep Reinforcement Learning with Python. There are a couple of lines of code I’m a bit uncertain about, but running them through the debugger and having a good look at what changes they bring about should help clarify.

Ivan’s example trades Microsoft stock, downloaded with yfinance (which gets data from Yahoo Finance). The networks he uses are pretty basic, as are the trading strategy and the state he constructs. So the task ahead is to modify it to work with any data (and specifically my crypto data), construct more elaborate state, use a more sophisticated trading strategy, and explore more complex network architectures, perhaps including networks other than MLPs. All without breaking the code!

These changes are not particularly challenging. Actually finding good solutions might be challenging, but it’s basically trial and error from here on out. I think at this stage I can say that I know what I’m doing. Probably not a good thing to say. Can I hear Nemesis winging my way?

Historical Reasons

I think I’ve worked out what has been a major source of confusion for me. The books and courses I’ve been reading/viewing on RL have made a big deal about policy gradient methods being ‘parametric’, yet it seems to me that the only real difference between using a NN for Q learning or policy gradient learning is whether or not there’s a softmax activation layer on the output. Everything a neural network does is parametric, if one considers the network weights to be parameters. I just can’t work out what the big deal is.

However, I think now that it’s for historical reasons. When I was teaching basic computer concepts to first-year students, one encountered terms like ROM and RAM in relation to types of memory, and yet those terms are meaningless. ROM is RAM, in that it’s random access memory, which is what RAM stands for. The real difference (back then) was that RAM is volatile and ROM is not. However, there was a time, in the dawn of computing, when tapes were used for storage, and those are sequential storage devices. That is the real alternative to random access. But the terms stuck, and were used in ways no longer appropriate to their actual meaning.

So it’s finally occurred to me that these approaches to reinforcement learning predate the use of neural networks as function approximators, and all the concepts around the different approaches reflect what were once real differences. The fact that one treats the output of the network a bit differently for a policy gradient strategy than for a value-based strategy might not seem like much of a difference, but before neural networks the differences were probably a lot greater. Just like once upon a time the web and email (and news) were all different services on the internet, with different servers and clients, but now that has all converged to ‘the web’. For a long time I read email with Thunderbird, and occasionally used RSS Owl to ‘consume’ newsgroups. And paid my utility bills in person.

How Good Was That?

I’ve been overthinking this whole Deep RL thing. Having got a basic Policy Gradient algorithm up and running, I think I can lay out a high level view of what it’s all about.

  • State (numbers) from the environment input to network
  • Network produces some other numbers (output)
  • Agent uses some process to decide how to translate those numbers into an action to send to environment
  • Agent sends action to Environment
  • Environment responds with new state and reward
  • Agent evaluates ‘how good was that response’, aka loss function
  • Agent instructs network to do better next time (minimize loss)
  • Network attempts to oblige until Agent is happy with the outcome

The third step is where all the theory comes in. The basic policy gradient algorithm takes the outputs (one value per available action) and scales them so that they are all positive, lie between 0 and 1, and add up to 1. Then the agent thinks, hey, these look like probabilities. And the action sent to the environment is based on that. If the numbers are 0.7, 0.2 and 0.1, then the agent chooses action A 70% of the time, action B 20% of the time, and action C 10% of the time. If the Agent is in a Q learning frame of mind, it uses the numbers to get to know the model of the environment a bit better. Whatever the process is, as long as there’s some measure of ‘how good was that’ the network can work towards improving the overall result.
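In code, that third step looks something like this (a minimal PyTorch sketch of the scale-and-sample part):

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([1.2, -0.05, -0.75])   # raw network outputs, one per action
probs = torch.softmax(logits, dim=-1)        # roughly [0.70, 0.20, 0.10]
action = Categorical(probs).sample()         # action 0 gets picked ~70% of the time
```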

The other aspect of all this is ‘improving’ the result, not in the sense of maximizing it but in the sense of smoothing it out. Hence a double Deep Q network will produce a smoother output than a single Deep Q network. And I imagine more elaborate variations of the Policy Gradient idea will produce smoother, maybe faster, results.

So what the books and courses are mostly about is how the Agent decides, on the basis of the numbers provided by the network, what to do in the Environment. As far as the network and the environment are concerned, it’s all just numbers.

Up and Running

I have a basic Policy Gradient algorithm up and running at last. The code comes from a Udemy course by Phil Tabor (specifically on AC methods), which I modified very slightly to use with my algorithm testing app. It’s not performing all that well, and I don’t fully understand the code, but it’s a starting point. From here I can study the ‘inner workings’ more closely with the help of the debugger to step through code and see what it actually does. Also my reading will have a solid point of reference. I find it hard to think about these things in purely abstract terms. And I can work on modifications and improvements.

I think I’ve reached the most important milestone, moving from ‘can I do this?’ to ‘this is doable’. I’ve certainly had huge doubts over the past couple of months, or longer. This field is more ‘math heavy’ than any I’ve previously tackled, and I find that very challenging. There’s the question of how much I need to understand the maths that underpins the algorithms. I’ve always found it easier to do something if I understand what I’m doing, rather than blindly following instructions, but many people don’t seem to labour under that constraint. Anyway, I now feel confident to move forward.

What am I Missing?

TL;DR: Feeling stupid today.

Not the first time I’ve asked myself this question, for sure. For the kind of algorithm I’m looking at now (policy gradient) a word keeps popping up – parameterized. The new policy is to be parameterized. By the weights of the network, apparently. But wait. The old policy had network weights. Wasn’t that ‘parameterized’? It’s not so much the use of the word, but how it differs from what went before. As a programmer I have an understanding of what parameters are in the programming context. Is it different in the mathematical context? I recall refreshing my memory of linear algebra a year or two ago, and that linear equations could be parameterized, though I don’t exactly recall the specifics. But they seem to be making a big deal about... what? No idea.

There seem to be some fine distinctions in this field of study that I’m not seeing. I guess for someone who understands it, those distinctions are not fine but broad. I’m trying to understand what the provided code is trying to achieve, without much success. The mathematical explanations don’t make much sense to me. For me, the question ‘What are you trying to achieve?’ is important when learning how that aim is to be realized. If the provided answer is too general, or expressed in a language I don’t understand very well, I get stuck.

The level of explanation I’m looking for is as follows, with regard to Q learning. From state A one can move to state B or state C. There’s an immediate reward for each, probably different. But Q learning provides an answer to the question: once I’m at B or C, what can I achieve from there? So the value, or Q value, of a state depends on what it can lead to, and this look-ahead functionality allows one, with sufficient exploration, to reach the final goal easily on later attempts at the task, because these ‘landmarks’ have been set up. Talking about the recursive nature of the Bellman equation is not so intuitive.
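In those terms, a toy tabular Q-learning update for the A/B/C example looks like this (the states, actions and rewards are all made up):

```python
alpha, gamma = 0.1, 0.99     # learning rate, discount factor

# One Q value per (state, action) pair
Q = {("A", "to_B"): 0.0, ("A", "to_C"): 0.0,
     ("B", "finish"): 0.0, ("C", "finish"): 0.0}

def update(state, action, reward, next_state, next_actions):
    # The value of (state, action) looks ahead to the best achievable from next_state
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Moving from A to B with an immediate reward of 1.0:
update("A", "to_B", reward=1.0, next_state="B", next_actions=["finish"])
```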