Results for BTC Price Prediction

So I’ve tried a few algorithms on my simple data set. Raw data is 1000 daily close prices for BTC on Binance (BTC/USDT trading pair), up until about a week ago. I then processed this to daily returns, and used a one day lag as the only input (feature), and current daily return as the target. Training set was first 700 rows (after dropping NaNs), and test set was the remainder. I used Root Mean Squared Error (RMSE) as a measure of loss.

Algorithm               RMSE
---------------------   -------
Persistence             0.03616
Linear Regression       0.02478
Support Vector Machine  0.02548
Multi Layer Perceptron  0.02477

Apart from the persistence model these are all pretty close, which may be a reflection of using only a single input feature. I think further experiments will involve a more complex dataset rather than trying other algorithms on this one, so I'll have to give some thought to what features to include. Obviously I could add more lags, but there's also more price data, such as high/low prices in addition to close, or returns over different timeframes. Exogenous data could include some measure of sentiment, perhaps the fear and greed index.
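As a sketch of the extra-lags idea (hypothetical column names, following the 't-1' convention used in the code below; synthetic prices stand in for the real CSV):

```python
import numpy as np
import pandas as pd

# Build several lagged-return features from a close-price series.
rng = np.random.default_rng(0)
close = pd.Series(100 * np.cumprod(1 + rng.normal(0, 0.02, 50)), name='close')

returns = close.pct_change()
df = pd.DataFrame({'t': returns})
for lag in (1, 2, 3):
    df[f't-{lag}'] = returns.shift(lag)
df.dropna(inplace=True)  # first rows lack a full set of lags

print(df.columns.tolist())  # ['t', 't-1', 't-2', 't-3']
print(len(df))              # 50 prices -> 49 returns -> 46 complete rows
```

Each extra lag costs one more row to NaNs at the start, which is why 50 prices end up as 46 usable rows here.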

Support Vector Machine

RMSE = 0.02548

I’m only slightly familiar with support vector machines, although I understand the general principle. Using this article as a guide I ran the algorithm on my data and got the above result. Here’s the code:

import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

df = pd.read_csv('data/btc.csv', usecols=['date', 'close'], index_col='date', parse_dates=True)
df_returns = df['close'].to_frame().pct_change()
df_returns.rename(columns={'close': 't'}, inplace=True)
df_returns.insert(0, 't-1', df_returns['t'].shift(1))
df_returns.dropna(inplace=True)

X = df_returns['t-1'].to_numpy(dtype=np.float32).reshape(-1, 1)
y = df_returns['t'].to_numpy(dtype=np.float32)

test_limit = 700

X_train, X_test = X[:test_limit], X[test_limit:]
y_train, y_test = y[:test_limit], y[test_limit:]

svr = SVR(kernel='linear')
svr.fit(X_train, y_train)

y_pred = svr.predict(X_test)

loss = mean_squared_error(y_test, y_pred)
print(np.sqrt(loss))
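Out of curiosity, the same setup could be run with other kernels. A sketch with synthetic returns standing in for the CSV; note I've shrunk epsilon here (an assumption on my part, the code above uses the default of 0.1, which is wider than the daily returns themselves):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the BTC returns: 1000 draws with std ~2.5%.
rng = np.random.default_rng(0)
r = rng.normal(0, 0.025, 1000).astype(np.float32)
X, y = r[:-1].reshape(-1, 1), r[1:]  # lag-1 feature, current return target

X_train, X_test = X[:700], X[700:]
y_train, y_test = y[:700], y[700:]

results = {}
for kernel in ('linear', 'rbf'):
    # epsilon narrowed so the tube doesn't swallow all the targets
    svr = SVR(kernel=kernel, epsilon=0.001)
    svr.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, svr.predict(X_test)))
    results[kernel] = rmse
    print(kernel, round(float(rmse), 5))
```

With iid synthetic data neither kernel should beat the return volatility by much, but the loop shows the shape of a kernel comparison.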

I suppose it would be best to summarise the results so far in a separate post.

Uni Layer Perceptron

RMSE = 0.02477

I set up a simple MLP in PyTorch, but I only used one node in one hidden layer, so I’m calling this a Uni Layer Perceptron. I figure this should give me the same result as the previous trial with the LinearRegression class from Scikit-Learn, as it’s only using one weight and one bias, equivalent to y = mx + c. Indeed, with sufficient iterations of the training loop, the results were almost identical.

I made the model a bit more complex, with a couple of layers (5 nodes each) and a ReLU activation layer, but the result was almost identical. I guess with a single input feature there’s not much scope for improvement. The code (simple version):

import torch
from torch.autograd import Variable
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

df = pd.read_csv('data/btc.csv', usecols=['date', 'close'], index_col='date', parse_dates=True)
df_returns = df['close'].to_frame().pct_change()
df_returns.rename(columns={'close': 't'}, inplace=True)
df_returns.insert(0, 't-1', df_returns['t'].shift(1))
df_returns.dropna(inplace=True)

X = df_returns['t-1'].to_numpy(dtype=np.float32).reshape(-1, 1)
y = df_returns['t'].to_numpy(dtype=np.float32).reshape(-1, 1)

test_limit = 700

X_train, X_test = X[:test_limit], X[test_limit:]
y_train, y_test = y[:test_limit], y[test_limit:]

class MLP_LinReg(torch.nn.Module):
    def __init__(self):
        super(MLP_LinReg, self).__init__()
        self.fc1 = torch.nn.Linear(1, 1)

    def forward(self, x):
        out = self.fc1(x)
        return out

model = MLP_LinReg()

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(1, 10001):
    inputs = torch.from_numpy(X_train)
    targets = torch.from_numpy(y_train)
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print('epoch {}, loss {}'.format(epoch, loss.item()))

with torch.no_grad():
    y_pred = model(torch.from_numpy(X_test)).numpy()

loss = mean_squared_error(y_test, y_pred)
print(loss)
print(np.sqrt(loss))
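For reference, the 'more complex' variant mentioned above could look something like this. The layer sizes (two hidden layers of 5 nodes) are the ones described; the exact wiring, e.g. a ReLU after each hidden layer, is my guess:

```python
import torch

# Sketch of the more complex model: two hidden layers of 5 nodes with
# ReLU activations, still one input feature and one output.
class MLP_Small(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, 5),
            torch.nn.ReLU(),
            torch.nn.Linear(5, 5),
            torch.nn.ReLU(),
            torch.nn.Linear(5, 1),
        )

    def forward(self, x):
        return self.net(x)

model = MLP_Small()
out = model(torch.randn(4, 1))  # batch of 4 lag-1 returns
print(out.shape)                # torch.Size([4, 1])
```

The training loop from the simple version works unchanged with this class.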

It’s easy to get caught up on simple details. Initially I was getting an error running the code because apparently one of the arrays was float and the other was double, and PyTorch didn’t like that. With so many ways to create Torch tensors from NumPy arrays it can be hard to find the right syntax to specify the data type, but eventually I found one that didn’t give me further errors. One reason I’m writing this blog is so that I can find the correct syntax easily when I need it again in future. Not sure how well that will work, though.
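For that future reference, a few equivalent ways to pin the dtype. The mismatch comes from NumPy defaulting to float64 ('double') while Torch layers default to float32 ('float'):

```python
import numpy as np
import torch

a = np.arange(4, dtype=np.float64)  # numpy's default float is float64

t1 = torch.from_numpy(a.astype(np.float32))  # convert in numpy first
t2 = torch.from_numpy(a).float()             # convert after wrapping
t3 = torch.tensor(a, dtype=torch.float32)    # copy with explicit dtype

print(t1.dtype, t2.dtype, t3.dtype)  # all torch.float32
```

Any of the three avoids the float/double complaint when the tensor hits a `torch.nn.Linear` layer.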

Simple Linear Regression

RMSE = 0.02478

Using the same data as the persistence model (i.e. daily returns with one lag as the input variable) I trained a linear regression model (scikit-learn's LinearRegression) on the training set, made predictions on the test set, and calculated the mean squared error and then the root mean squared error (RMSE); the result is shown above. This is better than the persistence model, so that's progress. The code is as follows:

import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

df = pd.read_csv('data/btc.csv', usecols=['date', 'close'], index_col='date', parse_dates=True)

df_returns = df['close'].to_frame().pct_change()
df_returns.rename(columns={'close': 't'}, inplace=True)
df_returns.insert(0, 't-1', df_returns['t'].shift(1))

df_returns.dropna(inplace=True)

X = df_returns['t-1'].to_numpy().reshape(-1, 1) # matrix required
y = df_returns['t'].to_numpy()

test_limit = 700

X_train, X_test = X[:test_limit], X[test_limit:]
y_train, y_test = y[:test_limit], y[test_limit:]

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

loss = mean_squared_error(y_test, y_pred)
print(loss)
print(np.sqrt(loss))

Baseline Loss – Persistence Model

RMSE = 0.03616

So, using daily returns as data (where 0.01 is equivalent to 1%), and using a persistence model as my baseline (today’s predicted return is the same as yesterday’s return), we have the above value for the RMSE (root mean squared error) as calculated by scikit-learn. That is about 3.6%, which actually seems pretty big to me given that daily returns are usually smaller than that. I guess there must be a few big values in there somewhere. Hopefully it won’t be too hard to improve on that result. The code is:

import pandas as pd
from sklearn.metrics import mean_squared_error

df = pd.read_csv('data/btc.csv', usecols=['date', 'close'], index_col='date', parse_dates=True)

df_returns = df['close'].to_frame().pct_change()
df_returns.rename(columns={'close': 't'}, inplace=True)
df_returns.insert(0, 't-1', df_returns['t'].shift(1))
df_returns.dropna(inplace=True)

X = df_returns['t-1'].to_numpy()
y = df_returns['t'].to_numpy()

test_limit = 700

X_train, X_test = X[:test_limit], X[test_limit:]
y_train, y_test = y[:test_limit], y[test_limit:]

loss = mean_squared_error(y_test, X_test)
print(loss)

Because the usual procedure for developing an ML model is to train it on a portion of the data and then test it on the remainder, I have only used the test data here when calculating the MSE, even though the persistence model doesn’t require training. This keeps the result comparable to those produced by actual ML models.
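On the size of that RMSE: for returns with standard deviation sigma and lag-1 autocorrelation rho, the persistence error r_t - r_{t-1} has variance about 2*sigma^2*(1 - rho), so with little autocorrelation the RMSE lands around 1.4 times sigma even without outliers. A quick check on synthetic (iid, not BTC) returns:

```python
import numpy as np

# Persistence error on iid returns: Var(r_t - r_{t-1}) = 2 * sigma^2,
# so RMSE / sigma should come out near sqrt(2) ~ 1.414.
rng = np.random.default_rng(0)
r = rng.normal(0, 0.025, 100_000)

rmse = np.sqrt(np.mean((r[1:] - r[:-1]) ** 2))
print(round(float(rmse / r.std()), 3))  # close to 1.414
```

So a persistence RMSE noticeably larger than a typical daily move is expected, not necessarily a sign of a few extreme values.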

The Best Laid Plans…

Why is everything so difficult? I wrote a short script to get 1000 days of BTC price data from the Binance REST API.

from datetime import datetime, timezone
import requests        # for making http requests to binance
import json            # for parsing what binance sends back to us
import pandas as pd    # for representing/processing data

# NOTE: this won't run from Google Colab (see below)

root_url = 'https://api.binance.com/api/v1/klines'

symbol = 'BTCUSDT'
interval = '1d'
limit = '1000'

url = root_url + '?symbol=' + symbol + '&interval=' + interval + '&limit=' + limit
data = json.loads(requests.get(url).text)

df = pd.DataFrame(data)
df.columns = ['date', 'open', 'high', 'low', 'close', 'v', 'close_time',
                'qav', 'num_trades', 'taker_base_vol', 'taker_quote_vol',
                'ignore']
df['date'] = [datetime.utcfromtimestamp(x/1000.0) for x in df.date]
df = df.set_index('date')

But it won’t run. The error message suggests Binance blocked the request from Google Colab. No worries, I ran it on my local computer. But then I got warnings that datetime.utcfromtimestamp is deprecated and I need to use a timezone-aware function. That gives me dates with date, time and timezone offset, when all I need is the date; it just complicates my dataframe unnecessarily. Fortunately the ‘old’ code still works and gives me what I want, but for how much longer? I guess I’ll just have to use an old version of Python when the next one forces me to take what it thinks is good for me. I couldn’t find any straightforward approach to removing the offset, and frankly I don’t even want the time for daily candles.
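For future reference, two approaches that seem to strip the unwanted parts (example timestamp assumed, not from the real data):

```python
from datetime import datetime, timezone
import pandas as pd

ms = 1700000000000  # a Binance-style millisecond timestamp

# 1. timezone-aware datetime, then keep just the date part
d = datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc).date()
print(d)  # 2023-11-14

# 2. in pandas: parse as UTC, drop the tz info, floor to midnight
s = pd.to_datetime(pd.Series([ms]), unit='ms', utc=True)
ts = s.dt.tz_localize(None).dt.normalize().iloc[0]
print(ts)  # 2023-11-14 00:00:00
```

The pandas route is probably the better fit for a whole 'date' column in one go.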

Anyway, I uploaded the data to Google Drive and I guess I can now start running some ML algorithms on it. No doubt that will bring me more hassles. Tomorrow, maybe.

Start Again

I’ve started ML projects before, but, well, I’m starting again. As this is a learning exercise (which actually applies to this whole ML ‘thing’ I’m doing) I’ll pick a context that has some topical interest for me, with no expectation of actually discovering anything useful. So, it’s predicting the price of BTC.

Which brings me to the first issue. The ideal time series for ML is stationary, i.e. it has a constant mean and variance. Time series for financial assets are far from stationary, and I have seen recommendations to use the return rather than the actual price, i.e. the fractional change from period to period. I think I’ll go with daily data, so the daily return is what I’m looking at. Why daily? Why not? I have to pick something and that’s it. Maybe as a variation I can use a different timeframe later and see if I get a better result.
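Concretely, the price-to-return transformation is just the fractional change from one close to the next, which pandas computes directly:

```python
import pandas as pd

# Daily return = (close_t - close_{t-1}) / close_{t-1}, via pct_change.
close = pd.Series([100.0, 102.0, 99.96, 101.0])
returns = close.pct_change()
print(returns.round(4).tolist())  # [nan, 0.02, -0.02, 0.0104]
```

The first value is NaN since there is no prior close to compare against; it gets dropped before any modelling.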

So how much data? As this is an exercise it doesn’t really matter. ML is said to work better with more data, though I acquired a book recently (but haven’t read it yet) that challenges this idea. Can’t imagine how that could be correct, but I guess I’ll find out when I get around to reading it. Anyway, I can download 1000 data points from Binance in a single request to the REST API, so that’s easy.

What else do I need? To start with, a baseline. Jason Brownlee suggests a persistence model is appropriate for time series data, so I might start with predicting that tomorrow’s return will be the same as today’s. And since I’m predicting a continuous value it’s a regression problem, so I can use the Mean Squared Error as an appropriate measure of loss. I’ll probably get around to actually doing this tomorrow. Just thinking about it for the moment.
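That plan in miniature, on synthetic stand-in returns:

```python
import numpy as np

# Persistence baseline: predict that each day's return equals the
# previous day's, then score with (root) mean squared error.
rng = np.random.default_rng(0)
returns = rng.normal(0, 0.025, 10)  # stand-in daily returns

y_true = returns[1:]   # today's return
y_pred = returns[:-1]  # yesterday's return, reused as the prediction
mse = np.mean((y_true - y_pred) ** 2)
rmse = float(np.sqrt(mse))
print(round(rmse, 5))
```

The real version just swaps in the downloaded returns and restricts the scoring to a held-out test portion.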