Optimizing the number of iterations of the training loop, and getting strange results. A small change in the number of loops results in a large change in correct predictions, by orders of magnitude. And it’s not consistent. There seems to be no clear relationship between this hyperparameter and actual results. I’m going to have to work out how to get a little less variability in my results when trying to optimize.
Month: April 2024
Interesting
I used Optuna to optimize my ETH experiment. I only optimized the learning rate of the Adam optimizer, and yet was able to achieve a precision of 0.65. Basically this means that if it predicts that today’s return will be greater than 1% it will be right twice as often as it is wrong, approximately. Maybe with some more input features, or with some optimization of the model itself, in terms of number of layers and number of nodes, number of dropout layers, etc, I could improve on this. Very interesting.
ETA: Alas, that result was something of an outlier. I repeated the experiment and, out of 100 trials, that best learning rate gave a value greater than 0.5 only about once, while many other trials using similar learning rates were below 0.5. That’s the problem with finding the best result: there’s obviously quite a lot of variance in this process. Back to the drawing board.
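One way to see how much of a result is down to luck is to repeat each candidate setting over several random seeds and look at the mean and spread, rather than at a single best run. The following is only a minimal sketch under stated assumptions: the data is synthetic, RMSE stands in for the precision metric used above, and `train_and_score` and `score_over_seeds` are hypothetical helper names, not part of the ETH experiment.

```python
import numpy as np
import torch


def train_and_score(lr, seed, epochs=200):
    """Train a tiny MLP on synthetic data and return test RMSE (illustrative only)."""
    torch.manual_seed(seed)          # controls the weight initialisation for this run
    rng = np.random.default_rng(0)   # same synthetic data for every run
    X = torch.tensor(rng.normal(size=(300, 1)), dtype=torch.float32)
    noise = torch.tensor(rng.normal(scale=0.1, size=(300, 1)), dtype=torch.float32)
    y = 0.5 * X + noise
    X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

    model = torch.nn.Sequential(
        torch.nn.Linear(1, 5), torch.nn.ReLU(), torch.nn.Linear(5, 1)
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X_train), y_train)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return float(torch.sqrt(criterion(model(X_test), y_test)))


def score_over_seeds(lr, seeds=(0, 1, 2, 3, 4)):
    """Return the mean and standard deviation of the score across seeds."""
    scores = np.array([train_and_score(lr, seed) for seed in seeds])
    return scores.mean(), scores.std()


for lr in (0.001, 0.01):
    mean, std = score_over_seeds(lr)
    print(f"lr={lr}: RMSE {mean:.4f} +/- {std:.4f}")
```

If the spread across seeds is as large as the difference between two learning rates, the "best" value is probably not telling you much.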
Aaargh!!!
Why is everything so difficult? I’m sure I’ve asked that question a few times already. So, hyperparameter tuning. Requires lots of iterations to find the best values. Need for speed. So run it all on the GPU naturally. Except that the approach my current book uses, which is skorch to leverage sklearn’s grid search, won’t run on the GPU. I tried it on Google Colab with a GPU runtime, but it took 3 times longer than using the CPU on my local machine. I’m not sure what Google Colab actually does when it says it’s using a GPU, but I suspect it’s not doing Torch tensor operations there.
So I went looking for alternatives. Optuna is an open source hyperparameter tuning library that works with PyTorch (among other platforms) and doesn’t seem too complicated. I don’t want to have to spend a month learning a new sophisticated app. Most online tutorials I found were overly complicated. Nobody seems to have heard of the KISS principle. Anyway, it seems that about six months ago I actually bought a course on Udemy on hyperparameter tuning, and it does actually have a section on Optuna, so I’m looking at that. Maybe I’ll get it to work with PyTorch NNs running on a GPU. Who knows? And is it actually worth all the trouble? And will it run on Google Colab, with a GPU?
Well, I got a very simple example working, and with the GPU. Funny thing was the GPU took 3 times longer than the CPU did!! I think I read somewhere that GPUs work faster on large datasets, and this one was pretty small. I used the BTC MLP with a single input variable. Model was pretty simple, only a single layer in addition to the input and output layers.
import torch
from torch.autograd import Variable
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
import optuna
from optuna.trial import TrialState
DEVICE = torch.device("cpu")
df = pd.read_csv('data/btc.csv', usecols=['date', 'close'], index_col='date', parse_dates=True)
df_returns = df['close'].to_frame().pct_change()
df_returns.rename(columns={'close': 't'}, inplace=True)
df_returns.insert(0, 't-1', df_returns['t'].shift(1))
df_returns.dropna(inplace=True)
X = df_returns['t-1'].to_numpy(dtype=np.float32).reshape(-1, 1)
y = df_returns['t'].to_numpy(dtype=np.float32).reshape(-1, 1)
train_limit = 700
X_train, X_test = X[:train_limit], X[train_limit:]
y_train, y_test = y[:train_limit], y[train_limit:]
class MLP_LinReg(torch.nn.Module):
    def __init__(self):
        super(MLP_LinReg, self).__init__()
        self.fc1 = torch.nn.Linear(1, 5)
        self.act = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(5, 1)
    def forward(self, x):
        out = self.act(self.fc1(x))
        out = self.fc2(out)
        return out
# Optuna function defining model and ranges of its hyperparameters
# Returns the value to be optimized
def objective(trial):
    model = MLP_LinReg().to(DEVICE)
    criterion = torch.nn.MSELoss()
    lr = trial.suggest_float("lr", 0.001, 0.01)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    # train the model
    for epoch in range(100):
        inputs = Variable(torch.from_numpy(X_train)).to(DEVICE)
        targets = Variable(torch.from_numpy(y_train)).to(DEVICE)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
    # run model on test inputs to get predictions
    with torch.no_grad():
        input = Variable(torch.from_numpy(X_test)).to(DEVICE)
        y_pred = model(input).data.cpu().numpy()
    # compare predictions with actual returns (y_test)
    loss = mean_squared_error(y_test, y_pred)
    return np.sqrt(loss)
if __name__ == '__main__':
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=10)
    pruned_trials = study.get_trials(deepcopy=False, states=[TrialState.PRUNED])
    complete_trials = study.get_trials(deepcopy=False, states=[TrialState.COMPLETE])
    print("Study statistics: ")
    print(" Number of finished trials: ", len(study.trials))
    print(" Number of pruned trials: ", len(pruned_trials))
    print(" Number of complete trials: ", len(complete_trials))
    print("Best trial:")
    trial = study.best_trial
    print(" Value: ", trial.value)
    print(" Params: ")
    for key, value in trial.params.items():
        print(" {}: {}".format(key, value))
Grid Search
After trying out various versions of my models I’m becoming quite enamoured with the idea of automating that process. Luckily the book I’m currently working through, Deep Learning with PyTorch by Adrian Tam, has me covered. There’s a chapter on setting up a Torch model to work with Scikit-Learn’s GridSearchCV, an automated hyperparameter tuning facility. I’ll definitely be giving that a go shortly. I guess while I’m working on my financial issues I can have the grid search running in the background! Maybe it’s time to get that better computer, or else just do it in the cloud on a GPU from Google Colab.
Actually this heralds a change in focus, from becoming familiar with neural networks, picking up the jargon and a general idea of what’s going on, to actually using neural networks to solve problems. Not for the solutions themselves, but to become competent in the use of neural networks. Not sure if I’ll ever be using them to solve anything significant, but I’m enjoying the process.
Accuracy, Precision, Recall
I reworked my experiment as a classification problem rather than a regression problem. Classification is predicting what group an instance belongs to; regression is predicting a number, such as today’s return. For a class I went with whether or not a return of at least 1% was achieved. So, binary classification, yes or no.
I was getting about 70% accuracy, because every time the result was no, the prediction was no, and this was most of the time. However nearly all the yes results were also classified as no.
This is where precision and recall come in. Precision – how many results that are classified positive actually are positive. And recall – how many examples that actually are positive are classified as such. So precision has to do with false positives, and recall with false negatives. Some results on my test data:
| | Predicted No | Predicted Yes |
| True No | 334 | 128 |
| True Yes | 146 | 73 |
So, the worst possible result. Most of the true yes examples were predicted to be No, and most of the examples predicted to be Yes were not. I guess there’s plenty of room for improvement.
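To put numbers on that, here is a small sketch that computes accuracy, precision and recall directly from the four counts in the table above. The variable names are just my own labels for the cells; nothing else is assumed.

```python
# Counts from the confusion matrix above (rows: true class, columns: predicted class)
tn, fp = 334, 128   # true No:  predicted No, predicted Yes
fn, tp = 146, 73    # true Yes: predicted No, predicted Yes

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # of the predicted Yes, how many really were Yes
recall = tp / (tp + fn)             # of the real Yes, how many were predicted Yes
always_no = (tn + fp) / total       # accuracy of always predicting No

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"always-No baseline accuracy={always_no:.3f}")
```

On these counts precision is 73/201 and recall is 73/219, both well under 0.5, and simply predicting No every time would score a higher accuracy than the model does.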
Dropout
I’ve discovered what Dropout layers are used for. Facing the problem of overfitting, a bit of research (Google) revealed that Dropout layers are good for reducing overfitting, so I tried them out. Indeed with a configuration that had been giving me considerable overfitting (training loss small, test loss large) a couple of Dropout layers brought them back into line. However the overall result was no better than a Linear Regression model.
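For reference, a minimal sketch of what adding Dropout layers to an MLP can look like in PyTorch. The layer sizes and the dropout rate of 0.2 are assumptions chosen for illustration, not the configuration I actually used. The main thing to remember is that dropout is only active in training mode, so the model needs to be switched to `eval()` before predicting.

```python
import torch
import torch.nn as nn

# Hypothetical layer sizes and dropout rate, for illustration only
model = nn.Sequential(
    nn.Linear(5, 50),
    nn.ReLU(),
    nn.Dropout(p=0.2),   # randomly zeroes 20% of activations during training
    nn.Linear(50, 20),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(20, 1),
)

x = torch.randn(8, 5)    # dummy batch: 8 rows, 5 input features

model.train()            # dropout active: repeated passes generally differ
train_out_a = model(x)
train_out_b = model(x)

model.eval()             # dropout disabled: repeated passes are identical
with torch.no_grad():
    eval_out_a = model(x)
    eval_out_b = model(x)

print("eval outputs identical:", torch.allclose(eval_out_a, eval_out_b))
```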
So what to do. I can try more variations of parameters, but I’ve covered a decent range already without being exhaustive. And truth is people like Ernie Chan say that ML, including NNs, can’t provide reliable signals, which my output is equivalent to I guess. I think this little exercise has actually served its purpose as a familiarization exercise. In my ongoing study I’ll have a much firmer basis to build new knowledge on. Anyway, I need to get back to finalizing my tax. Fortunately I’m making some progress on that and not letting myself be completely distracted by this ML stuff.
Overfitting
I’ve been playing around with my new, improved data set, and tried out a few configurations on Google Colab. Had my first real experience with overfitting. With one MLP configuration I was getting really good loss figures on the training data, and a really bad result on the test data. I believe this is a classic sign of overfitting. The losses on the training data were amazing compared with the losses using a less complex architecture (and also compared with simple Linear Regression), but it didn’t carry through. I guess this is where the trial and error starts. Find that best result that still works with test data.
The Plan
For my ETH experiment I’ve decided to concentrate on working with the MLP on a fixed set of data, rather than trying to vary both the data and the MLP settings, which is going to get a little too complicated (for me). Keep it simple.
So, I’ll include some more lags, also the Fear and Greed Index from alternative.me. They have an API that allowed me to download all their data, which goes back further than the ETH/USDT data I got from Binance. I think I’ll include a 1 day lag of the BTC return, as altcoins tend to move in concert with BTC. I think I can still call ETH an altcoin. Another input will be day of the week, which has been discussed in various courses I’ve done as a potentially useful factor in crypto trading. That will probably require one hot encoding, which is easy to do in pandas.
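The day-of-week encoding mentioned above could be sketched like this, assuming the data sits in a DataFrame with a daily DatetimeIndex. The frame and the `ret` column here are hypothetical stand-ins, not the real ETH data.

```python
import pandas as pd

# Hypothetical frame with a daily DatetimeIndex, standing in for the ETH returns
dates = pd.date_range("2024-01-01", periods=14, freq="D")
df = pd.DataFrame({"ret": 0.0}, index=dates)

# dayofweek is 0 for Monday through 6 for Sunday
day = pd.Series(df.index.dayofweek, index=df.index)

# One column per weekday, with a 1.0 in the column matching that row's day
dow = pd.get_dummies(day, prefix="dow", dtype=float)
features = df.join(dow)

print(features.columns.tolist())
```

Each row ends up with exactly one of the seven `dow_` columns set to 1. For a plain linear model it may be worth dropping one of them (`drop_first=True`) since the seven columns always sum to 1.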
That’s not a very comprehensive set of input factors, but it should do for exploring various settings of the MLP, such as number of layers, number of nodes per layer, activation functions, optimizer, learning rate, etc.
I’ve considered other inputs such as the S&P500 from the equities market, but that has the disadvantage that it only trades about 250 days per year, Monday to Friday excluding public holidays. Dealing with issues like that is important in data science, but at this stage it’s just easier to avoid the problem altogether.
Other possible inputs include various technical indicators. I have the TA library installed in one of my Docker containers, but I’m not actually too concerned at this point about getting the best set of data. I need to concentrate on exploring MLP architecture and not get too sidetracked by other issues.
Anyway, I have another pressing issue (tax) to deal with, so I probably won’t be getting back to this for a couple of weeks.
ETH
New experiment. Aim is to explore MLPs in greater depth. I’ve decided to use ETH, specifically the ETH/USDT trading pair, as my data. I’ve downloaded daily data going back more than three years from Binance.
I’ll use daily returns for the output (target, label, whatever) as before, and I plan to use more features than I did for the BTC experiment. I’ll start with a few more lags, and then see if I can improve on that by adding additional features. Not sure at this stage what those will be. Lots of trial and error coming up.
I’ve done some initial exploration. First Linear Regression, using 1 lag and then 5 lags. Next, an MLP using 5 lags as inputs and 1 output (should be equivalent to the LR with 5 lags), and then a slightly more complex MLP with a hidden layer with 10 nodes and 1 output. I guess I should do an MLP with just 1 lag as input, but the BTC experiment showed me that this is practically identical to LR with one lag as input, so I didn’t bother.
| Algorithm | RMSE |
| Linear Regression, 1 lag | 0.03479 |
| Linear Regression, 5 lags | 0.03506 |
| MLP, 5 lags, 1 output | 0.03617 |
| MLP, 5 lags, 10 nodes in 1 hidden layer, 1 output | 0.03523 |
| MLP, 5 lags, 2 hidden layers (50 and 20 nodes), 1 output | 0.03490 |
Using 5 lags doesn’t seem to make any difference, and using an MLP with a fairly simple structure (1 hidden layer with 10 nodes) doesn’t make much difference either. I tried a more complex MLP, 2 hidden layers, one with 50 nodes. A bit better than the other MLP results but hardly different from the Linear Regression with only 1 lag for input.
import torch
import torch.nn as nn
from torch.autograd import Variable
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
df = pd.read_csv('data/eth.csv', usecols=['close'])
df_returns = df['close'].to_frame().pct_change()
df_returns.rename(columns={'close': 't'}, inplace=True)
df_returns.insert(0, 't-1', df_returns['t'].shift(1))
df_returns.insert(0, 't-2', df_returns['t'].shift(2))
df_returns.insert(0, 't-3', df_returns['t'].shift(3))
df_returns.insert(0, 't-4', df_returns['t'].shift(4))
df_returns.insert(0, 't-5', df_returns['t'].shift(5))
df_returns.dropna(inplace=True)
y = df_returns['t'].to_numpy(dtype=np.float32).reshape(-1, 1)
X = df_returns.drop('t', axis=1).to_numpy(dtype=np.float32)
train_limit = 1000
X_train, X_test = X[:train_limit], X[train_limit:]
y_train, y_test = y[:train_limit], y[train_limit:]
model = nn.Sequential(
    nn.Linear(5, 50),
    nn.ReLU(),
    nn.Linear(50, 20),
    nn.ReLU(),
    nn.Linear(20, 1)
)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
for epoch in range(1000):
    inputs = Variable(torch.from_numpy(X_train))
    targets = Variable(torch.from_numpy(y_train))
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
with torch.no_grad():
    y_pred = model(Variable(torch.from_numpy(X_test))).data.numpy()
loss = mean_squared_error(y_test, y_pred)
print(np.sqrt(loss))
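I find the terminology for MLPs a little confusing. In the above code there are three Linear layers. As far as I can tell, the usual convention counts the first two (50 and 20 nodes) as hidden layers and the last one as the output layer, so that’s 2 hidden layers, not 1 input layer and 2 hidden layers. I think the ‘input layer’ is just the number of inputs to the first hidden layer, and is not actually a discrete entity of its own. I think.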
Trial and Error
ML seems to be a very ‘trial and error’ discipline. My first serious study was a course provided by the University of Waikato in New Zealand, the developers of the WEKA ML platform. I was surprised to see that in tackling any given problem, the general approach was to try a bunch of algorithms and see which one gave the best results. Then, for any given algorithm, try a bunch of different values of hyperparameters and see which ones gave the best results.
And of course for gradient descent, the approach is to start at some random values and then try to improve on that, although in this case there actually is a process (differentiation) for working out in which direction to go to get a better result.
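As a toy illustration of that idea, here is a minimal sketch of gradient descent on a one-variable loss. The loss function, starting point, learning rate and step count are all made up for the example; real training does the same thing over many weights at once.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)   # the derivative says which direction is downhill
    return w


# Toy loss f(w) = (w - 3)^2, whose derivative is 2 * (w - 3); the minimum is at w = 3
w_final = gradient_descent(lambda w: 2 * (w - 3), w0=-5.0)
print(w_final)
```

Starting from -5, each step shrinks the distance to 3 by a fixed fraction, so the result ends up very close to 3.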
When it comes to neural networks there don’t seem to be too many guidelines for such things as number of layers, number of nodes, activation functions, etc. Just try out a bunch of stuff, see what works best.
As I’m considering using a greater variety of features for my next experiment I’m faced with the issue of scaling. With only one feature (input) in my last experiment it wasn’t an issue, but I believe the ranges of values for various features need to be pretty consistent for the optimization algorithms to work efficiently. So, do I use a Standard Scaler, or normalization, or MinMax scaling, or something else? Guess I’ll just have to try out a bunch of options and see what works best. Start at some random point, estimate how bad a result I get, and then try to improve on it. Way to go. I guess that’s why some authors devote some time to developing some kind of test harness so one can automate, to some extent, the process of trial and error.
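A quick sketch of two of those options using scikit-learn, assuming two made-up features on very different scales (a daily return and a 0 to 100 index). The data is synthetic. The one design choice worth noting is that the scaler is fitted on the training rows only and then applied unchanged to the test rows, so nothing about the test period leaks into training.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: a daily return and a 0-100 index
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 0.03, 100), rng.uniform(0, 100, 100)])
X_train, X_test = X[:70], X[70:]

for scaler in (StandardScaler(), MinMaxScaler()):
    # Fit on the training rows only, then apply the same transform to the test rows
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)
    print(
        type(scaler).__name__,
        "train min", X_train_s.min(axis=0).round(3),
        "train max", X_train_s.max(axis=0).round(3),
    )
```

StandardScaler gives each training column zero mean and unit variance, while MinMaxScaler squeezes each training column into the range 0 to 1. Test values can fall slightly outside those ranges, which is expected.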
Results for BTC Price Prediction
So I’ve tried a few algorithms on my simple data set. Raw data is 1000 daily close prices for BTC on Binance (BTC/USDT trading pair), up until about a week ago. I then processed this to daily returns, and used a one day lag as the only input (feature), and current daily return as the target. Training set was first 700 rows (after dropping NaNs), and test set was the remainder. I used Root Mean Squared Error (RMSE) as a measure of loss.
| Algorithm | RMSE |
| Persistence | 0.03616 |
| Linear Regression | 0.02478 |
| Support Vector Machine | 0.02548 |
| Multi Layer Perceptron | 0.02477 |
Apart from the persistence model these are all pretty close, which may be a reflection of using only a single input feature. I think further experiments will involve a more complex dataset, rather than trying other algorithms on this one. I’ll have to give some thought to what features I could include. Obviously I could include more lags, but there’s also more data such as high/low prices in addition to close, or perhaps returns over different timeframes. Exogenous data could include some measure of sentiment, perhaps the fear and greed index. I’ll have to give it some thought.
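For the persistence row in the table above, a minimal sketch of how such a baseline can be computed, assuming "persistence" means using yesterday’s return as the forecast for today’s. The function name and the sample returns are hypothetical; this is not the exact code behind the table.

```python
import numpy as np


def persistence_rmse(returns):
    """RMSE of forecasting each return with the previous day's return."""
    returns = np.asarray(returns, dtype=float)
    y_true = returns[1:]     # today's return
    y_pred = returns[:-1]    # yesterday's return, used as the forecast
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical returns, not the BTC data
sample = [0.01, -0.02, 0.03, 0.00]
print(persistence_rmse(sample))
```

Any model worth keeping should at least beat this number on the same test rows.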
Support Vector Machine
RMSE = 0.02548
I’m only slightly familiar with support vector machines, although I understand the general principle. Using this article as a guide I ran the algorithm on my data and got the above result. Here’s the code:
import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
df = pd.read_csv('data/btc.csv', usecols=['date', 'close'], index_col='date', parse_dates=True)
df_returns = df['close'].to_frame().pct_change()
df_returns.rename(columns={'close': 't'}, inplace=True)
df_returns.insert(0, 't-1', df_returns['t'].shift(1))
df_returns.dropna(inplace=True)
X = df_returns['t-1'].to_numpy(dtype=np.float32).reshape(-1, 1)
y = df_returns['t'].to_numpy(dtype=np.float32)
test_limit = 700
X_train, X_test = X[:test_limit], X[test_limit:]
y_train, y_test = y[:test_limit], y[test_limit:]
svr = SVR(kernel='linear')
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)
loss = mean_squared_error(y_test, y_pred)
print(np.sqrt(loss))
I suppose it would be best to summarise the results so far in a separate post.