There’s a Fear and Greed Index for BTC available at Alternative.me. They have an API, so I downloaded data back to 2018. They don’t go into much detail about how it’s calculated, simply stating that they aggregate ‘emotions and sentiment from different sources’. Anyway, it’s a potentially useful input into my model, and it’s great that I was able to get historical data.
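For my notes, here’s a sketch of turning the API response into a DataFrame. The endpoint I used is https://api.alternative.me/fng/?limit=0, and the field names below are what I recall from their docs, so treat the JSON layout as an assumption; the sample values here are made up for illustration.

```python
import json
import pandas as pd

# Sample of the JSON shape returned by the fng endpoint
# (structure assumed from the docs; values invented for illustration)
sample = '''
{"name": "Fear and Greed Index",
 "data": [
   {"value": "25", "value_classification": "Extreme Fear", "timestamp": "1517443200"},
   {"value": "40", "value_classification": "Fear", "timestamp": "1517529600"}
 ]}
'''

payload = json.loads(sample)
fng = pd.DataFrame(payload['data'])
# timestamps arrive as Unix-epoch strings; values as strings too
fng['timestamp'] = pd.to_datetime(fng['timestamp'].astype(int), unit='s')
fng['value'] = fng['value'].astype(int)
fng = fng.set_index('timestamp').sort_index()
print(fng)
```

With a datetime index the series can be joined straight onto a daily price DataFrame.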
Month: February 2023
Data Preparation
Tackling the problem of incorporating the NASDAQ into my model at the moment. I downloaded some data from TradingView which seems relevant, but of course the NASDAQ trades only on a regular exchange schedule – Mon to Fri, excluding public holidays. I have worked out a way to provide a complete date index for the period I’m interested in, but now need to fill in the missing values. I’m going to have quite a few similar problems before I’m finished, so I decided to get yet another book by Jason Brownlee on Data Preparation. I have a vague feeling that I’ve seen this book before – some of the content seems familiar – but can’t find it anywhere.
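The gap-filling step can be sketched in a few lines – the tiny DataFrame here stands in for the TradingView download, and forward-filling the last session’s close over weekends and holidays is my assumption about the right fill rule:

```python
import pandas as pd

# Hypothetical weekday-only data, standing in for the TradingView download
nasdaq = pd.DataFrame(
    {'close': [12000.0, 12100.0, 11950.0]},
    index=pd.to_datetime(['2023-01-05', '2023-01-06', '2023-01-09']))  # Thu, Fri, Mon

# Build a complete daily index over the period of interest
full_index = pd.date_range(nasdaq.index.min(), nasdaq.index.max(), freq='D')

# Reindex and forward-fill: the weekend rows carry Friday's close
nasdaq = nasdaq.reindex(full_index).ffill()
print(nasdaq)
```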
Anyway I’m pretty happy with how this project is going, and I don’t expect real results for quite a while so plenty of time to work things out and do more relevant study.
Random Forest in Python
Here is an example of using a Random Forest Classifier with a dataset consisting of one feature (lagged MACD histogram value) and nominal target (positive or negative return) encoded as 1/0.
import numpy as np
import pandas as pd
import talib.abstract as ta
from datetime import datetime
from sklearn.ensemble import RandomForestClassifier
data = pd.read_csv('data/BTCUSDT.csv', index_col = 0, parse_dates=True)
data['return'] = data['close'].pct_change()
data['good'] = np.where(data['return'] > 0, 1, 0)
macd = ta.MACD(data)
# keep the datetime index so the train/test split can be located by date
df = pd.concat([macd.macdhist.shift(1), data['good']], axis=1)
df.columns = ['macd', 'target']
df.dropna(inplace=True)
# df.to_csv('data/macd.csv', index=False)
X = df.values
# train, test, etc. are numpy arrays
test_start = datetime(2023, 1, 1)
# locate the split in df, not data - dropna() removed the warm-up rows
train_size = df.index.get_loc(test_start)
train, test = X[0:train_size], X[train_size:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]
train_X = train_X.reshape(-1, 1)
test_X = test_X.reshape(-1, 1)
clf = RandomForestClassifier()
clf.fit(train_X, train_y)
predictions = clf.predict(test_X)
print(clf.score(train_X, train_y))
print(clf.score(test_X, test_y))
Baseline Models
While studying the WEKA application I encountered the concept of a baseline model – a very simple model that provides a prediction, used as a starting point for evaluating the effectiveness of other models. ZeroR seems to be the preferred baseline algorithm for supervised learning: for regression it simply predicts the mean of the target values in the training data for every test instance, and for classification it predicts the most frequent class for all test data.
Jason Brownlee argues that for time series forecasting it is better to use an algorithm that takes the sequential nature of time series into account. He proposes a persistence algorithm, which simply states that whatever happened yesterday will happen today. A bit like the weather. If you predict that today’s weather will be the same as yesterday’s weather you’ll be right more often than not.
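A persistence baseline is only a couple of lines – shift the series one step and score the shifted values against the actuals. Synthetic data here for illustration:

```python
import pandas as pd

# Synthetic daily series standing in for real price data
series = pd.Series([10.0, 11.0, 11.5, 11.2, 12.0, 12.3])

# Persistence forecast: the prediction for each step is the previous value
predictions = series.shift(1)

# Score with mean squared error, dropping the first (unpredictable) step
errors = (series - predictions).dropna()
mse = (errors ** 2).mean()
print(mse)
```

Any model worth keeping should beat this number.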
Extract Hour from Datetime Index
The following code creates a column in the DataFrame with the hour from the index as an integer:
import pandas as pd
data = pd.read_csv('data/BTCUSDT-4h.csv', index_col = 0,
parse_dates=True)
data['hour'] = data.index.hour
Data Cleaning
I downloaded a few years worth of four hour data from Binance recently, and noticed that the timestamp of the last item was not what I expected. I decided to go looking for the anomaly this morning, with 6000 observations to check. Anyway, it wasn’t too hard (modified binary search) and it turned out that three rows, two consecutive, one not, were missing. I’m guessing that the server was down during those periods, both were in 2018 when Binance had been operating for less than a year. Anyway it was fairly straightforward to get the prices for those periods from Poloniex. They should have been pretty close to Binance prices or someone would have grabbed an opportunity for arbitrage. I had to estimate the volumes, but three approximate values out of 6000 shouldn’t cause too many problems.
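Rather than searching by hand, the gaps can be found directly by comparing the file’s index against a complete four-hourly range. A sketch with a made-up index (one bar deliberately missing):

```python
import pandas as pd

# Hypothetical 4h index with one bar missing (2018-01-01 08:00)
idx = pd.to_datetime(['2018-01-01 00:00', '2018-01-01 04:00',
                      '2018-01-01 12:00', '2018-01-01 16:00'])
data = pd.DataFrame({'close': [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Every timestamp that should be present at a 4-hour frequency
expected = pd.date_range(data.index.min(), data.index.max(), freq='4h')

# The set difference is exactly the missing bars
missing = expected.difference(data.index)
print(missing)
```

This scales fine to 6000 observations and saves the binary search next time.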
Plot Train-Test Split Data
The following code shows how to plot the data for a train-test split of a time series data set:
# plot train-test split of time series data
import pandas as pd
import matplotlib.pyplot as plt
series = pd.read_csv('sunspots.csv', header=0, index_col=0,
parse_dates=True).squeeze('columns')
X = series.values
train_size = int(len(X) * 0.66)
train, test = X[0:train_size], X[train_size:len(X)]
plt.figure(figsize=(16,8))
plt.plot(train)
plt.plot([None for i in train] + [x for x in test])
plt.show()
Combine Columns in DataFrame
Here are a couple of ways to combine columns. Firstly, use concat()
# create lag features
import pandas as pd
series = pd.read_csv('data/ETHBTC.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
prices = pd.DataFrame(series.values)
df = pd.concat([prices.shift(3), prices.shift(2), prices.shift(1), prices], axis=1)
df.columns = ['t-2', 't-1', 't', 't+1']
Secondly, insert()
# create a lag feature
import pandas as pd
series = pd.read_csv('data/ETHBTC.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')
df = pd.DataFrame(series.values, columns=['t+1'])
df.insert(0, 't', df['t+1'].shift(1))
Machine Trading by E. Chan
Sometimes I’m looking for something to engage with that’s interesting but not too demanding, so this evening I picked up a copy (Kindle) of Ernie Chan’s Machine Trading. With books like this I don’t have to work too hard trying to understand how code works, and can just think about various ideas related to trading. For instance he’s now discussing performance metrics, and how he prefers the Calmar Ratio to the Sharpe Ratio because it deals better with the fat-tailed distribution of returns common in trading financial instruments. I must admit I use the Sharpe ratio a bit, but perhaps I should rethink that. Calmar Ratio takes into account maximum drawdown, and indeed large drawdowns are the bane of trading, especially crypto trading.
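For the record, the Calmar ratio is just annualised return divided by maximum drawdown. A rough sketch from a daily returns series – the returns are synthetic, and annualising with 365 days (crypto trades every day) is my assumption:

```python
import pandas as pd

# Synthetic daily strategy returns for illustration
returns = pd.Series([0.01, -0.02, 0.015, -0.005, 0.02, -0.03, 0.01])

# Equity curve and its running peak
equity = (1 + returns).cumprod()
peak = equity.cummax()

# Maximum drawdown as a positive fraction of the peak
max_dd = ((peak - equity) / peak).max()

# Annualised (geometric) return, assuming 365 trading days for crypto
ann_return = equity.iloc[-1] ** (365 / len(returns)) - 1

calmar = ann_return / max_dd
print(round(calmar, 3))
```

Unlike the Sharpe ratio, one deep drawdown punishes this number directly, which is the property Chan likes.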
Backtest 01
This code backtests a long-only strategy using the lower Bollinger Band for entry and the mean for exit. It follows the value of one unit of capital over multiple positions; no consideration is given to position size – just all in. The entry/exit decision is made on the close price, so the position column is shifted so that the position applies to the following day’s prices. The lookback period (for the rolling mean and standard deviation windows) and threshold (the number of standard deviations below the mean for entry) are set at the top of the script.
'''
Backtest of long only strategy using Bollinger Band(s).
Follows subsequent value of 1 unit of capital in opening position.
'''
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
lookback = 20
threshold = 1.0
# Read data and calculate statistics/indicators
df = pd.read_csv('data/ADA.csv', header=0, index_col=0, parse_dates=True)
df['mean'] = df['close'].rolling(lookback).mean()
df['stdev'] = df['close'].rolling(lookback).std()
df['bband'] = df['mean'] - threshold * df['stdev']
# Position - calculated on day's close so applies to following days
df['entry'] = df['close'] < df['bband'] # condition - T/F
df['exit'] = df['close'] >= df['mean'] # condition - T/F
df['position'] = np.nan # create and initialise the column
df.loc[df['entry'], 'position'] = 1 # set value according to condition
df.loc[df['exit'], 'position'] = 0 # set value according to condition
df['position'] = df['position'].ffill().fillna(0)  # hold position; flat before first entry
# Returns
df['diff'] = df['close'] - df['close'].shift(1)
df['daily_returns'] = df['diff'] / df['close'].shift(1)
# position was entered at END of previous day
df['strategy_returns'] = df['position'].shift(1) * df['daily_returns']
df['cumret'] = (df['strategy_returns'] + 1).cumprod()
# Plot
df.cumret.plot(label='ADAUSDT', figsize=(12, 6))
plt.xlabel('Date')
plt.ylabel('Cumulative Returns')
plt.legend()
plt.show()
Here’s a slightly different version using BBANDS from ta-lib:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import talib.abstract as ta
lookback = 20
threshold = 1.0
# Read data and calculate statistics/indicators
df = pd.read_csv('data/ADA.csv', header=0, index_col=0, parse_dates=True)
bb = ta.BBANDS(df, timeperiod=lookback, nbdevdn=threshold)
df['mean'] = bb.middleband
df['bband'] = bb.lowerband
# Position - calculated on day's close so applies to following days
df['entry'] = df['close'] < df['bband'] # condition - T/F
df['exit'] = df['close'] >= df['mean'] # condition - T/F
df['position'] = np.nan # create and initialise the column
df.loc[df['entry'], 'position'] = 1 # set value according to condition
df.loc[df['exit'], 'position'] = 0 # set value according to condition
df['position'] = df['position'].ffill().fillna(0)  # hold position; flat before first entry
# Returns
df['diff'] = df['close'] - df['close'].shift(1)
df['daily_returns'] = df['diff'] / df['close'].shift(1)
# position was entered at END of previous day
df['strategy_returns'] = df['position'].shift(1) * df['daily_returns']
df['cumret'] = (df['strategy_returns'] + 1).cumprod()
# Plot
df.cumret.plot(label='ADAUSDT', figsize=(12, 6))
plt.xlabel('Date')
plt.ylabel('Cumulative Returns')
plt.legend()
plt.show()
Python Standard Library
Python Standard Library Documentation.
Create DataFrame column based on Condition(s)
One can create a column in a DataFrame based on the values of existing columns using np.where(). Note that each comparison needs its own parentheses, because & binds more tightly than the comparison operators – leave them out and you get errors.
import numpy as np
import pandas as pd
import talib.abstract as ta
df = pd.read_csv('backtest_data.csv', index_col=0, parse_dates=True)
df['atr'] = ta.ATR(df.high, df.low, df.close, timeperiod=20)
macd = ta.MACD(df)
df['macd'] = macd.macd
df['macd_signal'] = macd.macdsignal
df['macd_hist'] = macd.macdhist
df['hist_prev'] = df['macd_hist'].shift(1)
df['entry'] = np.where(((df['macd_hist'] > 0) & (df['hist_prev'] < 0)), 1, 0)
Also note in the code that, instead of referring to a previous day’s value by accessing the previous row, one can use shift(1) to create a column with ‘yesterday’s’ value in ‘today’s’ row.