The Bee’s Knees

I’ve been looking at Ivan’s implementation of the A2C (Advantage Actor Critic) approach to deep reinforcement learning for trading, which he said (in 2022, when the book was published) was the bee’s knees (my term, not his). So I applied it to my recently downloaded ETHUSDT data. Results shown above. Not great, but it’s a start.

His construction of state is fairly basic, just the last 10 closing prices as far as I can tell. I’m sure I can do something about that. He’s also using raw price data, whereas most people who discuss training models for trading recommend using returns (percent change) rather than actual prices, since prices don’t have a constant mean or variance. I’m not sure whether that matters for these RL models, but I have a feeling it does. The trades are simple too: buy (or short) at the start of the day and sell (or cover) at the end, with no holding until a signal to close.

The actual code is going to take some study. I get the general idea of what the actor-critic approach is trying to achieve, compared with the temporal difference approach I have been looking at up ’til now. The devil is in the details, however.
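For instance, a returns-based state could be built along these lines (my own sketch, not Ivan’s code; make_state and the lookback value are illustrative):

```python
import numpy as np

def make_state(closes, lookback=10):
    """Build a state vector from the last `lookback` percent returns.

    Returns stay roughly centred near zero whatever the price level,
    unlike raw prices, which drift with the market.
    """
    closes = np.asarray(closes, dtype=np.float64)
    returns = closes[1:] / closes[:-1] - 1.0  # simple percent change
    return returns[-lookback:]

prices = [100.0, 102.0, 101.0, 103.0, 104.0]
state = make_state(prices, lookback=3)  # the three most recent returns
```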

So, I’ve looked at the rather elaborate approach used by Quantra in their Deep Reinforcement Learning in Trading course, with state composed of OHLC data over several bars at three levels of granularity, plus technical indicators and calendar-related inputs. The approach is temporal difference (I think that’s what it’s called). I’ve looked at a similar approach from DeepLizard, which was created to solve a GridWorld environment and which I’ve rejigged to work with trading data, though I’m not sure how successfully. And now this A2C approach from Ivan Gridin’s book.

I’m not at the point where I could write code to implement one of these without consulting references. Too many details that I haven’t totally internalized yet, especially concerning getting tensors into the right shape. It’s a language problem really, internalizing the grammar and vocabulary so that you can speak/write without thinking about it. I guess it’s just practice, practice, practice.
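The tensor-shape trap that catches me most often is (N,) versus (N, 1): nn.Linear with one output produces a column of shape (N, 1), so a 1-D target has to be reshaped to match before computing the loss. A minimal illustration:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 5)        # batch of 8 samples, 5 features each
layer = nn.Linear(5, 1)      # 5 inputs -> 1 output
out = layer(x)               # shape (8, 1), not (8,)

y = torch.randn(8)           # 1-D target, shape (8,)
y_col = y.unsqueeze(1)       # reshape to (8, 1) so MSELoss compares like with like
loss = nn.MSELoss()(out, y_col)
```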

Interesting

I used Optuna to optimize my ETH experiment. I only optimized the learning rate of the Adam optimizer, and yet was able to achieve a precision of 0.65. Basically this means that when it predicts today’s return will be greater than 1%, it is right roughly twice as often as it is wrong (0.65 vs 0.35). Maybe with some more input features, or with some optimization of the model itself (number of layers, number of nodes, number of dropout layers, etc.), I could improve on this. Very interesting.

ETA: Alas, that result was something of an outlier. I repeated the experiment, and out of 100 trials that best learning rate gave a value greater than 0.5 only about once, while many other trials using similar learning rates came in below 0.5. That’s the problem with picking the single best result: there’s obviously quite a lot of variance in this process. Back to the drawing board.

Dropout

I’ve discovered what Dropout layers are used for. Facing the problem of overfitting, I did a bit of research (Google), which revealed that Dropout layers are good at reducing overfitting, so I tried them out. Indeed, with a configuration that had been giving me considerable overfitting (training loss small, test loss large), a couple of Dropout layers brought the two back into line. However, the overall result was no better than a Linear Regression model.
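Concretely, adding Dropout to an nn.Sequential model is just a matter of interleaving nn.Dropout layers after the activations (p=0.2 is an illustrative starting value, not a tuned one):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(5, 50),
    nn.ReLU(),
    nn.Dropout(p=0.2),   # randomly zeroes 20% of activations during training
    nn.Linear(50, 20),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(20, 1),
)

model.eval()                     # disables Dropout for evaluation
out = model(torch.randn(4, 5))   # deterministic in eval mode
```

One detail to watch: Dropout is only active in train() mode, so remember to call model.eval() before computing the test loss.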

So what to do? I can try more variations of parameters, but I’ve covered a decent range already without being exhaustive. And the truth is, people like Ernie Chan say that ML, including NNs, can’t provide reliable trading signals, which is essentially what my output amounts to. I think this little exercise has actually served its purpose as a familiarization exercise. In my ongoing study I’ll have a much firmer basis to build new knowledge on. Anyway, I need to get back to finalizing my tax. Fortunately I’m making some progress on that and not letting myself be completely distracted by this ML stuff.

The Plan

For my ETH experiment I’ve decided to concentrate on working with the MLP on a fixed set of data, rather than trying to vary both the data and the MLP settings, which is going to get a little too complicated (for me). Keep it simple.

So, I’ll include some more lags, plus the Fear and Greed Index from alternative.me. They have an API that allowed me to download all their data, which goes back further than the ETH/USDT data I got from Binance. I think I’ll include a 1-day lag of the BTC return, as altcoins tend to move in concert with BTC (I think I can still call ETH an altcoin). Another input will be day of the week, which has been discussed in various courses I’ve done as a potentially useful factor in crypto trading. That will probably require one-hot encoding, which is easy to do in pandas.
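The day-of-week encoding can be done with pandas get_dummies; a small sketch on a made-up frame (the column names are illustrative):

```python
import pandas as pd

# Made-up daily frame with a datetime index, standing in for the Binance data.
idx = pd.date_range("2022-01-03", periods=7, freq="D")  # a Monday onwards
df = pd.DataFrame({"ret": [0.01, -0.02, 0.005, 0.0, 0.03, -0.01, 0.02]},
                  index=idx)

# dayofweek gives 0 = Monday ... 6 = Sunday; get_dummies makes one column per day.
dummies = pd.get_dummies(df.index.dayofweek, prefix="dow")
dummies.index = df.index  # get_dummies returns a fresh RangeIndex
df = df.join(dummies)
```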

That’s not a very comprehensive set of input factors, but it should do for exploring various settings of the MLP, such as number of layers, number of nodes per layer, activation functions, optimizer, learning rate, etc.

I’ve considered other inputs such as the S&P500 from the equities market, but that has the disadvantage that it only trades about 250 days per year, Monday to Friday excluding public holidays. Dealing with issues like that is important in data science, but at this stage it’s just easier to avoid the problem altogether.
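If I did want to mix in equity data later, the usual trick (as I understand it) is to reindex the weekday series onto the daily crypto calendar and forward-fill, so weekends carry Friday’s value. A toy sketch with made-up numbers:

```python
import pandas as pd

# Crypto trades every day; equities only on business days.
crypto_days = pd.date_range("2022-01-03", periods=7, freq="D")  # Mon-Sun
spx = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0],
                index=pd.date_range("2022-01-03", periods=5, freq="B"))  # Mon-Fri

# Align to the crypto calendar; Saturday and Sunday inherit Friday's close.
spx_daily = spx.reindex(crypto_days).ffill()
```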

Other possible inputs include various technical indicators. I have the TA library installed in one of my Docker containers, but I’m not actually too concerned at this point about getting the best set of data. I need to concentrate on exploring MLP architecture and not get too sidetracked by other issues.

Anyway, I have another pressing issue (tax) to deal with, so I probably won’t be getting back to this for a couple of weeks.

ETH

New experiment. Aim is to explore MLPs in greater depth. I’ve decided to use ETH, specifically the ETH/USDT trading pair, as my data. I’ve downloaded daily data going back more than three years from Binance.

I’ll use daily returns for the output (target, label, whatever) as before, and I plan to use more features than I did for the BTC experiment. I’ll start with a few more lags, and then see if I can improve on that by adding additional features. Not sure at this stage what those will be. Lots of trial and error coming up.

I’ve done some initial exploration. First, Linear Regression using 1 lag and then 5 lags. Next, an MLP using 5 lags as inputs and 1 output (which should be equivalent to the LR with 5 lags), and then a slightly more complex MLP with one hidden layer of 10 nodes and 1 output. I guess I should also have done an MLP with just 1 lag as input, but the BTC experiment showed me that this is practically identical to LR with 1 lag as input, so I didn’t bother.

Algorithm                                            RMSE
Linear Regression, 1 lag                             0.03479
Linear Regression, 5 lags                            0.03506
MLP, 5 lags, 1 output                                0.03617
MLP, 5 lags, 10 nodes in 1 hidden layer, 1 output    0.03523
MLP, 3 hidden layers, one with 50 nodes              0.03490

Using 5 lags doesn’t seem to make any difference, and using an MLP with a fairly simple structure (1 hidden layer with 10 nodes) doesn’t make much difference either. I tried a more complex MLP, 3 hidden layers, one with 50 nodes. A bit better than the other MLP results, but hardly different from the Linear Regression with only 1 lag as input.

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import mean_squared_error

# Daily ETH/USDT closes downloaded from Binance.
df = pd.read_csv('data/eth.csv', usecols=['close'])

# Convert prices to simple percent returns; 't' is today's return.
df_returns = df['close'].to_frame().pct_change()
df_returns.rename(columns={'close': 't'}, inplace=True)

# Add the previous five returns as lagged features
# (columns end up ordered t-5, t-4, ..., t-1, t).
for lag in range(1, 6):
    df_returns.insert(0, f't-{lag}', df_returns['t'].shift(lag))

df_returns.dropna(inplace=True)

y = df_returns['t'].to_numpy(dtype=np.float32).reshape(-1, 1)
X = df_returns.drop('t', axis=1).to_numpy(dtype=np.float32)

# Simple chronological split: first 1000 rows for training, rest for test.
train_limit = 1000
X_train, X_test = X[:train_limit], X[train_limit:]
y_train, y_test = y[:train_limit], y[train_limit:]

# MLP: 5 inputs -> 50 -> 20 -> 1.
model = nn.Sequential(
    nn.Linear(5, 50),
    nn.ReLU(),
    nn.Linear(50, 20),
    nn.ReLU(),
    nn.Linear(20, 1),
)

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Full-batch gradient descent over the whole training set each epoch.
inputs = torch.from_numpy(X_train)
targets = torch.from_numpy(y_train)
for epoch in range(1000):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

# Evaluate on the held-out data and report RMSE.
with torch.no_grad():
    y_pred = model(torch.from_numpy(X_test)).numpy()

print(np.sqrt(mean_squared_error(y_test, y_pred)))

I find the terminology for MLPs a little confusing. In the above code I think there are 3 hidden layers, not 1 input layer and 2 hidden layers. The ‘input layer’ is really just the number of inputs to the first hidden layer, and is not a discrete entity of its own. I think.
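One way to convince yourself of this is to list the model’s weight matrices: there are only three, and the 5 “input layer” units exist only as the first Linear’s input dimension.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(5, 50),
    nn.ReLU(),
    nn.Linear(50, 20),
    nn.ReLU(),
    nn.Linear(20, 1),
)

# nn.Linear stores its weight as (out_features, in_features), so three
# Linear layers means exactly three weight matrices and no separate
# parameter object for an "input layer".
shapes = [tuple(p.shape) for name, p in model.named_parameters()
          if name.endswith("weight")]
```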