ML Experiments – christina norwood

State

From the 6-hour close prices of my synthetic data I have constructed the state (a set of features). Each row consists of the period return (fractional change from the previous period), the returns of the past 6 periods, and the total return over the past 7, 15, 30, 60, 90 and 120 days. I also include the day of the week each row occurred on, which can sometimes be significant in trading. Code and dataframe are shown below:
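As a minimal sketch (not the original code), a state like this could be assembled with pandas roughly as follows. The function name `build_state`, the column names and the toy price series are all assumptions made for illustration; the only thing taken from the text is the list of features and the 6-hour bar size, which gives 4 bars per day.

```python
import numpy as np
import pandas as pd

PERIODS_PER_DAY = 4  # assumption: 6-hour bars, so 4 bars per day


def build_state(close: pd.Series) -> pd.DataFrame:
    """Build one feature row per bar from a series of close prices."""
    state = pd.DataFrame(index=close.index)
    state["ret"] = close.pct_change()               # return for this period
    for lag in range(1, 7):                         # returns of the past 6 periods
        state[f"ret_lag{lag}"] = state["ret"].shift(lag)
    for days in (7, 15, 30, 60, 90, 120):           # total return over longer windows
        state[f"ret_{days}d"] = close.pct_change(days * PERIODS_PER_DAY)
    state["day_of_week"] = close.index.dayofweek    # 0 = Monday ... 6 = Sunday
    return state.dropna()                           # drop warm-up rows with missing lags


# Toy prices purely for illustration (a straight line, no noise)
idx = pd.date_range("2020-01-01", periods=600, freq=pd.Timedelta(hours=6))
close = pd.Series(100 + 0.1 * np.arange(600), index=idx)
state = build_state(close)
print(state.shape)  # (120, 14)
```

The first 480 rows are lost here because the 120-day return needs 120 x 4 earlier bars, which is worth remembering when only 8000 samples are available.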

So what is this data used for? Well, the ‘agent’ (in this case the DDQN) gets each row of data and has to work out whether to buy, hold or sell. If it buys then later sells, the profit (or loss) acts as a reward. It can also short sell, i.e. sell first and then buy later. With only this information, and initially acting completely at random, it learns how to make a profit!! Hopefully. Pretty neat, huh?
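The reward idea can be sketched in a few lines. This is only an illustration under simple assumptions: the reward is the fractional profit of a completed round trip, fees and slippage are ignored, and the function name `trade_reward` is made up for the example.

```python
def trade_reward(entry_price: float, exit_price: float, direction: int) -> float:
    """Fractional profit of a round trip; direction is +1 (long) or -1 (short)."""
    if direction not in (1, -1):
        raise ValueError("direction must be +1 (long) or -1 (short)")
    return direction * (exit_price - entry_price) / entry_price


long_reward = trade_reward(100.0, 110.0, direction=1)    # buy at 100, sell at 110
short_reward = trade_reward(100.0, 110.0, direction=-1)  # sell at 100, buy back at 110
print(long_reward, short_reward)
```

The same price move that rewards the long trade penalizes the short one by the same amount, which is what lets the agent learn to profit from downtrends as well.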

Synthetic Data

A couple of courses I’ve done have recommended testing an algorithm on synthetic data, especially data with a very simple form such as a fairly linear uptrend or a sine wave (to emulate a mean reverting asset). Each of these should be a bit noisy to be more ‘realistic’ but still, simple. This will show whether the algorithm can learn such simple patterns, and provide the opportunity to test out whether other factors are OK.

One of those ‘other factors’ is the amount of data. In the course I’m currently looking at, the data used is about 10 years’ worth of 5-minute data! And when they backtested the algorithm (a double deep Q network) it took about a year to become profitable. That’s a lot of data. Another factor is the state. The course used the whole OHLCV data, with so many lags, at different time granularities, with associated TA indicators, that it ended up with about 160 inputs. My current intention is to simplify, starting with how the input state is organized. But perhaps it’s a good idea to test it out on something easy, such as a noisy uptrend or a noisy sine wave, rather than go straight for real data.

So I made some synthetic data: 6-hourly samples of a noisy uptrend, 8000 samples in total. Will my DDQN (Double Deep Q Network) be able to solve this problem?
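One possible way to generate data of this kind is sketched below. The 8000 samples and the 6-hour spacing come from the text; the starting price, drift, noise level, sine period and start date are arbitrary assumptions, and the function names are made up.

```python
import numpy as np
import pandas as pd


def _six_hour_index(n: int) -> pd.DatetimeIndex:
    # The start date is arbitrary; only the 6-hour spacing matters here
    return pd.date_range("2020-01-01", periods=n, freq=pd.Timedelta(hours=6))


def noisy_uptrend(n=8000, start=100.0, drift=0.01, noise=1.0, seed=0) -> pd.Series:
    """Linear uptrend plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    prices = start + drift * np.arange(n) + rng.normal(0.0, noise, size=n)
    return pd.Series(prices, index=_six_hour_index(n), name="close")


def noisy_sine(n=8000, level=100.0, amplitude=10.0, period=200, noise=1.0, seed=0) -> pd.Series:
    """Sine wave around a fixed level plus Gaussian noise (a mean-reverting toy asset)."""
    rng = np.random.default_rng(seed)
    wave = level + amplitude * np.sin(2 * np.pi * np.arange(n) / period)
    return pd.Series(wave + rng.normal(0.0, noise, size=n), index=_six_hour_index(n), name="close")


uptrend = noisy_uptrend()
sine = noisy_sine()
```

Fixing the seed keeps the series reproducible, so differences between training runs come from the agent rather than from the data.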

Plan B

Or is that plan Z? Anyway, several months ago I purchased a course on Quantra about Reinforcement Learning in Trading. I found it pretty heavy going, especially as some of the explanations seemed a little ‘light on’. Well, I’ve just been going through it again (videos, text and code in the form of Jupyter Notebooks) and it makes a lot more sense now that I’ve filled in the details from other sources.

So my plan now is to recreate the template that it develops but in a simpler form. I can always add complexity later. I’ll be using PyTorch instead of Keras/Tensorflow as well, but that change should be pretty trivial now that I’m more familiar with both.

I plan to use ADA as the asset, having already settled on that for my previous plan (development of an ML-assisted momentum strategy). Once again I doubt that I’ll actually use this in trading, but it’s a field I’m familiar with so I can focus on setting up the Double Deep Q Network instead of concerning myself with how that relates to the task. I feel pretty confident that I can get something up and running, with lots of scope for improving it after that, and lots of opportunity to test it out on real world data once it’s working. Definitely seems like a plan.

ADA

Started a newish project, implementing the approach from my course using Cardano (ADA) as the asset. Focus at the moment is using the XGBoost ML algorithm to help determine whether to go long or short using a momentum strategy. I haven’t used XGBoost before, so something to learn I guess. If the strategy seems profitable, as all the strategies I’ve lost money on have, I might hazard a few digital dollars just to keep the interest going. So here’s the code to fetch the data (ADA-USD) from Yahoo Finance:

import yfinance as yf
import pandas as pd

ada_data = yf.download("ADA-USD", start="2017-01-01",  end="2024-06-02")
ada_data.index = pd.to_datetime(ada_data.index)
print(ada_data)

And here’s the beginning and end of the data received:

                Open      High       Low     Close  Adj Close     Volume
Date                                                                    
2017-11-09  0.025160  0.035060  0.025006  0.032053   0.032053   18716200
2017-11-10  0.032219  0.033348  0.026451  0.027119   0.027119    6766780
2017-11-11  0.026891  0.029659  0.025684  0.027437   0.027437    5532220
2017-11-12  0.027480  0.027952  0.022591  0.023977   0.023977    7280250
2017-11-13  0.024364  0.026300  0.023495  0.025808   0.025808    4419440
...              ...       ...       ...       ...        ...        ...
2024-05-28  0.467963  0.468437  0.453115  0.456990   0.456990  418594476
2024-05-29  0.456990  0.463107  0.450914  0.450995   0.450995  350482630
2024-05-30  0.450995  0.454546  0.443807  0.446581   0.446581  356151973
2024-05-31  0.446581  0.454957  0.444461  0.447461   0.447461  290913148
2024-06-01  0.447461  0.452584  0.445254  0.449975   0.449975  167918462

[2397 rows x 6 columns]

I guess I didn’t need that Adjusted Close column, as crypto doesn’t really do dividends and splits the way equities do. Earliest date on Yahoo Finance seems to be 2017-11-09. I wonder if that really was the date Cardano went live. Or perhaps there wasn’t a USD trading pair available.

Next up: Target

Bike Experiment

UCI has a dataset on bike sharing. This is for a regression problem, to predict the number of bikes shared on any given day. I cleaned it up a bit (removed index, date) and one-hot encoded features such as day of week and others. Then separated out the count as the label (target).
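The clean-up described here might look something like the sketch below. The column names (`instant`, `dteday`, `weekday`, `weathersit`, `temp`, `cnt`) are my assumption about the dataset's layout, and the four rows are made-up toy values, not real data.

```python
import pandas as pd

# Toy rows for illustration only; column names are assumed, values are invented
raw = pd.DataFrame({
    "instant": [1, 2, 3, 4],
    "dteday": ["2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04"],
    "weekday": [6, 0, 1, 2],
    "weathersit": [2, 2, 1, 1],
    "temp": [0.30, 0.35, 0.20, 0.25],
    "cnt": [1000, 1200, 1500, 1800],
})

# Separate the label, then drop the index and date columns from the features
target = raw["cnt"]
features = raw.drop(columns=["instant", "dteday", "cnt"])

# One-hot encode the categorical columns; numeric columns pass through unchanged
features = pd.get_dummies(features, columns=["weekday", "weathersit"], dtype=float)
print(features.columns.tolist())
```

Each categorical value that actually occurs becomes its own 0/1 column, which is how a handful of raw columns grows into a few dozen model inputs.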

I created a simple MLP consisting of a couple of layers, 35 inputs down to 1 output. I initially tried just a single layer for the simplest possible model, but got unexplained errors (all the predictions were null, i.e. NaN), so I made it a little more complex (35->2, ReLU, 2->1) and that worked.
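For reference, the 35->2, ReLU, 2->1 architecture can be written in PyTorch as below. This is a sketch rather than the original model: the random input batch is a stand-in for real features, and the NaN check is just one way to catch null predictions early.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the random weights and inputs reproducible

# 35 input features -> 2 hidden units -> ReLU -> 1 regression output
model = nn.Sequential(
    nn.Linear(35, 2),
    nn.ReLU(),
    nn.Linear(2, 1),
)

batch = torch.randn(8, 35)   # stand-in for 8 rows of 35 features
preds = model(batch)         # shape (8, 1)

# Flag null predictions as soon as they appear
has_nan = torch.isnan(preds).any().item()
print(preds.shape, has_nan)
```

Running a check like this on the inputs, the loss and the predictions after each epoch can help narrow down where NaN values first show up.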

I set up a grid search in Optuna (which is not documented as well as it might be) with a couple of optimizers and a range of learning rates. SGD gave the best and worst results, usually with the same learning rate! Adam and RMSprop were much more consistent, although not quite as good as SGD’s best. It’s hard to trust predictions when the optimizer has such huge variability.

A couple of the features had much higher values than the rest. Bike use counts vary from about 1000 to 5000 per day, and two features tracked how many were casual users and how many registered users, whereas everything else was in the range approx 0 – 1. So I produced a feature casual/registered and used that instead, but it made almost no difference. Maybe with a more complex model it might make a bigger difference, more testing needed.

Best result was 5330 which actually was produced by Adam, not SGD, and with the modified dataset, but it was something of an outlier. Still, that’s a pretty high value when very few days had a count that high.

ETA: I realized that the above result was achieved with only 10 epochs in the training loop. Increased to 1000 and the best result was below 2000 (average of 10 trials). 1000 epochs gave a significantly better result than 500 even, so that’s something to consider in future.

I decided to switch to median rather than mean to get the following results (third column is the RMSE – Root Mean Squared Error), where each row represents 100 trials. RMSprop is consistently better than Adam, and the higher learning rates give better results. It appears that a learning rate of 0.1 is the largest value the algorithms will accept.

params_lr params_optimizer        
0.01      Adam              4397.0
          RMSprop           4695.0
0.02      Adam              2598.0
          RMSprop           2477.0
0.03      Adam              2226.0
          RMSprop           1951.0
0.04      Adam              1964.0
          RMSprop           1475.0
0.05      Adam              1739.0
          RMSprop           1226.0
0.06      Adam              1557.0
          RMSprop           1159.0
0.07      Adam              1427.0
          RMSprop           1135.0
0.08      Adam              1338.0
          RMSprop           1127.0
0.09      Adam              1281.0
          RMSprop           1125.0
0.10      Adam              1245.0
          RMSprop           1126.0
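A table like the one above can be produced with a pandas groupby. The sketch below assumes a dataframe of trial results with columns `params_lr`, `params_optimizer` and `value`, which is the naming the table suggests; the numbers are invented purely to show why the median is less sensitive to an outlier trial than the mean.

```python
import pandas as pd

# Invented trial results for illustration; one trial at lr=0.01 is an outlier
trials = pd.DataFrame({
    "params_lr": [0.01, 0.01, 0.01, 0.02, 0.02, 0.02],
    "params_optimizer": ["Adam"] * 6,
    "value": [10.0, 12.0, 100.0, 8.0, 9.0, 10.0],
})

# Aggregate the metric per (learning rate, optimizer) combination
summary = (
    trials.groupby(["params_lr", "params_optimizer"])["value"]
    .agg(["median", "mean"])
)
print(summary)
```

With these toy numbers the single outlier pulls the mean for the first group well above its median, while the second group is unaffected.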

As a comparison I ran a Linear Regression from scikit-learn on the same data and got approx 1100 as a result, same as best results above.

Why Oh Why?

TL;DR Whinge, whinge, whinge.

I set up an ML experiment with Optuna using some bike sharing data from UCI. Despite using almost the same code as in previous, successful, experiments, I was getting errors everywhere. Well, I’ve made so many mistakes coding over the years that I’m pretty experienced at finding where the problem is, but in this case something that worked yesterday (on different data) didn’t work today.

I was using a simple model, only one layer with inputs equal to the number of features, and one output (being a regression problem). I tried this on my BTC data a while back and it worked fine. Now, however, the array of predictions was full of null values. If I made the model a little more complex, an extra layer or an activation layer, it worked. But I didn’t need those with the BTC example. So what gives? Usually when I find a problem I understand why it occurred, but not this time. Grr!!

Today’s Plan

Will focus on regression problems, perhaps using several datasets from the UCI repository, plus a few of my own, including synthetic data and some crypto data. Plan is to set up data with Dataset/Dataloader, to explore a broad range of parameters using Optuna, and then to fine-tune the more important parameters, again with Optuna. The aim is to have a practised pipeline so that I can run any data through the process without having to think too much.

Thinking about this a bit more, I can leverage the Dataset class by implementing different subclasses to perform different data preparation protocols. A simple way to have different ways to prepare the data, and keep them all separate and easy to work on, and then to use in the training by simply specifying which subclass to use. I think my background with Java is showing here.
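The subclass idea could be sketched like this. All class and function names here are hypothetical, and standardization is just one example of a preparation protocol; in practice the statistics would normally come from the training split only.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RegressionDataset(Dataset):
    """Base class: holds features and targets; subclasses override prepare()."""

    def __init__(self, features, targets):
        x = torch.as_tensor(features, dtype=torch.float32)
        self.x = self.prepare(x)
        self.y = torch.as_tensor(targets, dtype=torch.float32).reshape(-1, 1)

    def prepare(self, x: torch.Tensor) -> torch.Tensor:
        return x  # default protocol: use the features unchanged

    def __len__(self) -> int:
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]


class StandardizedDataset(RegressionDataset):
    """Alternative protocol: zero mean and unit variance per feature column."""

    def prepare(self, x: torch.Tensor) -> torch.Tensor:
        std = x.std(dim=0, unbiased=False).clamp_min(1e-8)  # avoid division by zero
        return (x - x.mean(dim=0)) / std


def make_loader(dataset_cls, features, targets, batch_size=2):
    # Choosing the preparation protocol is just choosing the subclass
    return DataLoader(dataset_cls(features, targets), batch_size=batch_size, shuffle=False)


features = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
targets = [10.0, 20.0, 30.0, 40.0]
loader = make_loader(StandardizedDataset, features, targets)
first_x, first_y = next(iter(loader))
```

The training loop only ever sees a `DataLoader`, so swapping `StandardizedDataset` for another subclass changes the data preparation without touching the rest of the pipeline.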

Interesting

I used Optuna to optimize my ETH experiment. I only optimized the learning rate of the Adam optimizer, and yet was able to achieve a precision of 0.65. Basically this means that if it predicts that today’s return will be greater than 1% it will be right twice as often as it is wrong, approximately. Maybe with some more input features, or with some optimization of the model itself, in terms of number of layers and number of nodes, number of dropout layers, etc, I could improve on this. Very interesting.

ETA: Alas, that result was something of an outlier. I repeated the experiment and out of 100 trials that best learning rate gave a value greater than 0.5 only about once, whereas many other trials using similar learning rates were below 0.5. That’s the problem with finding the best result: there’s obviously quite a lot of variance in this process. Back to the drawing board.

Accuracy, Precision, Recall

I reworked my experiment as a classification problem rather than a regression problem. Classification is predicting what group an instance belongs to, regression is predicting a number, such as today’s return. For a class I went with whether or not a return of at least 1% was achieved. So, binary classification, yes or no.

I was getting about 70% accuracy, because every time the result was no, the prediction was no, and this was most of the time. However nearly all the yes results were also classified as no.

This is where precision and recall come in. Precision – how many results that are classified positive actually are positive. And recall – how many examples that actually are positive are classified as such. So precision has to do with false positives, and recall with false negatives. Some results on my test data:

	Predicted No	Predicted Yes
True No	334	128
True Yes	146	73

Confusion Matrix

So, the worst possible result. Most of the true yes examples were predicted to be No, and most of the examples predicted to be Yes were not. I guess there’s plenty of room for improvement.
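Working the numbers through from the confusion matrix makes this concrete. The four counts are taken from the table; the "always predict No" baseline is an extra comparison added here for context.

```python
# Counts from the confusion matrix (rows are true classes, columns are predictions)
tn, fp = 334, 128   # True No:  predicted No, predicted Yes
fn, tp = 146, 73    # True Yes: predicted No, predicted Yes

total = tn + fp + fn + tp

accuracy = (tp + tn) / total    # share of all predictions that were correct
precision = tp / (tp + fp)      # share of predicted Yes that really were Yes
recall = tp / (tp + fn)         # share of true Yes that were predicted Yes

# Baseline: a model that always predicts No is right whenever the truth is No
always_no_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"always-No baseline accuracy={always_no_accuracy:.3f}")
# accuracy=0.598 precision=0.363 recall=0.333
# always-No baseline accuracy=0.678
```

So only about a third of the Yes predictions were right, only about a third of the true Yes cases were found, and the overall accuracy is below what always answering No would achieve.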

Dropout

I’ve discovered what Dropout layers are used for. Facing the problem of overfitting, a bit of research (Google) revealed that Dropout layers are good for reducing overfitting, so I tried them out. Indeed with a configuration that had been giving me considerable overfitting (training loss small, test loss large) a couple of Dropout layers brought them back into line. However the overall result was no better than a Linear Regression model.
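As a small illustration of how Dropout layers slot into an MLP in PyTorch, here is a sketch with made-up layer sizes and a dropout probability of 0.5 (both assumptions, not the configuration used in the experiment). It also shows that dropout is only active in training mode.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Layer sizes and p=0.5 are illustrative assumptions
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(32, 32), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(32, 1),
)

x = torch.randn(4, 10)

model.train()                            # dropout active: units zeroed at random
train_a, train_b = model(x), model(x)    # two passes use different random masks

model.eval()                             # dropout disabled for evaluation
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)  # two passes give identical outputs

print(torch.equal(train_a, train_b), torch.equal(eval_a, eval_b))
```

Forgetting to call `model.eval()` before computing the test loss leaves dropout switched on, which makes the test figures noisier than they should be.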

So what to do. I can try more variations of parameters, but I’ve covered a decent range already without being exhaustive. And truth is people like Ernie Chan say that ML, including NNs, can’t provide reliable signals, which my output is equivalent to I guess. I think this little exercise has actually served its purpose as a familiarization exercise. In my ongoing study I’ll have a much firmer basis to build new knowledge on. Anyway, I need to get back to finalizing my tax. Fortunately I’m making some progress on that and not letting myself be completely distracted by this ML stuff.

Overfitting

I’ve been playing around with my new, improved data set, and tried out a few configurations on Google Colab. Had my first real experience with overfitting. With one MLP configuration I was getting really good loss figures on the training data, and a really bad result on the test data. I believe this is a classic sign of overfitting. The losses on the training data were amazing compared with the losses using a less complex architecture (and also compared with simple Linear Regression), but it didn’t carry through. I guess this is where the trial and error starts. Find that best result that still works with test data.

The Plan

For my ETH experiment I’ve decided to concentrate on working with the MLP on a fixed set of data, rather than trying to vary both the data and the MLP settings, which is going to get a little too complicated (for me). Keep it simple.

So, I’ll include some more lags, and also the Fear and Greed Index from alternative.me. They have an API that allowed me to download all their data, which goes back further than the ETH/USDT data I got from Binance. I think I’ll include a 1 day lag of the BTC return, as altcoins tend to move in concert with BTC. I think I can still call ETH an altcoin. Another input will be day of the week, which has been discussed in various courses I’ve done as a potentially useful factor in crypto trading. That will probably require one-hot encoding, which is easy to do in pandas.
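A rough sketch of two of these inputs, the 1 day lag of the BTC return and the one-hot encoded day of the week, is shown below. The column names and the five rows of returns are invented for illustration; the Fear and Greed Index is left out.

```python
import pandas as pd

# Invented daily returns for illustration only
dates = pd.date_range("2024-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {
        "eth_ret": [0.01, -0.02, 0.03, 0.00, 0.01],
        "btc_ret": [0.02, -0.01, 0.01, 0.02, -0.03],
    },
    index=dates,
)

# Yesterday's BTC return becomes an input for today's row
df["btc_ret_lag1"] = df["btc_ret"].shift(1)

# Day of week as an integer (0 = Monday), then one-hot encoded
df["day_of_week"] = df.index.dayofweek
df = pd.get_dummies(df, columns=["day_of_week"], prefix="dow", dtype=float)

# Keep only the lagged BTC value and drop the first row, which has no lag
df = df.drop(columns=["btc_ret"]).dropna()
print(df)
```

Using `shift(1)` rather than the same-day value keeps the input restricted to information that would have been available before the day being predicted.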

That’s not a very comprehensive set of input factors, but it should do for exploring various settings of the MLP, such as number of layers, number of nodes per layer, activation functions, optimizer, learning rate, etc.

I’ve considered other inputs such as the S&P500 from the equities market, but that has the disadvantage that it only trades about 250 days per year, Monday to Friday excluding public holidays. Dealing with issues like that is important in data science, but at this stage it’s just easier to avoid the problem altogether.

Other possible inputs include various technical indicators. I have the TA library installed in one of my Docker containers, but I’m not actually too concerned at this point about getting the best set of data. I need to concentrate on exploring MLP architecture and not get too sidetracked by other issues.

Anyway, I have another pressing issue (tax) to deal with, so I probably won’t be getting back to this for a couple of weeks.