RMSE = 0.03616
So, using daily returns as data (where 0.01 is equivalent to 1%), and using a persistence model as my baseline (today’s predicted return is the same as yesterday’s actual return), the above is the RMSE (root mean squared error) computed with scikit-learn. That is nearly 4%, which actually seems pretty big to me given that daily returns are usually smaller than that. I guess there must be a few big values in there somewhere. Hopefully it won’t be too hard to improve on that result. The code is:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# load daily closing prices, indexed by date
df = pd.read_csv('data/btc.csv', usecols=['date', 'close'], index_col='date', parse_dates=True)

# convert prices to daily returns (0.01 == 1%)
df_returns = df['close'].to_frame().pct_change()
df_returns.rename(columns={'close': 't'}, inplace=True)

# yesterday's return becomes the prediction for today
df_returns.insert(0, 't-1', df_returns['t'].shift(1))
df_returns.dropna(inplace=True)

X = df_returns['t-1'].to_numpy()
y = df_returns['t'].to_numpy()

# hold out everything after the first 700 rows as the test set
test_limit = 700
X_train, X_test = X[:test_limit], X[test_limit:]
y_train, y_test = y[:test_limit], y[test_limit:]

# RMSE = square root of the mean squared error
loss = np.sqrt(mean_squared_error(y_test, X_test))
print(loss)
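To check the guess that a few big values are inflating the RMSE, one could look at the largest absolute returns and compare them to the typical daily scale. This is only a sketch: it uses synthetic fat-tailed returns so it runs on its own; with the real data you would call the same methods on df_returns['t'].

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the BTC return series: Student-t returns are
# fat-tailed, so a few draws will be much larger than the typical move
rng = np.random.default_rng(42)
returns = pd.Series(rng.standard_t(df=3, size=1000) * 0.02)

top5 = returns.abs().nlargest(5)   # the few biggest absolute moves
print(top5)
print("typical daily scale (std):", returns.std())
```

If the top handful of absolute returns are several times the standard deviation, they will dominate the squared-error average, which would explain an RMSE near 4% even when most days move far less.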
The usual procedure for developing an ML model is to train it on one portion of the data and test it on the remainder, so even though the persistence model doesn’t require any training, I have calculated the RMSE on the test portion only. That way the result is directly comparable to results produced by actual ML models.
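As a sketch of how an actual model would slot into the same evaluation, here is a simple linear regression trained on the first 700 rows and scored on the rest. The synthetic returns below stand in for the real data; with the real series, X and y are the arrays built in the snippet above (X reshaped to a column, since scikit-learn expects 2-D features).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# placeholder daily returns so the example runs standalone
rng = np.random.default_rng(0)
r = rng.normal(scale=0.03, size=1000)

# yesterday's return (as a 2-D feature matrix) predicts today's return
X, y = r[:-1].reshape(-1, 1), r[1:]

# identical split to the baseline, so the RMSEs are comparable
test_limit = 700
X_train, X_test = X[:test_limit], X[test_limit:]
y_train, y_test = y[:test_limit], y[test_limit:]

model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"linear model RMSE: {rmse:.5f}")
```

Any model evaluated this way, on the same test rows with the same metric, can be compared directly against the persistence baseline.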