Documenting the Process

My current attempts at pair trading are not working so well, and I’m intending to make one more serious effort to somehow harness ML to help with that. It might be a good idea to actually document this process to help keep me on track, and this is as good a place as any to do that.

Over the last couple of posts I mentioned a book by Jason Brownlee of machinelearningmastery.com – Deep Learning Time Series Forecasting. Jason follows a rigorous process of incremental improvement in performance by starting with classical models (of supervised learning) and then trying to improve on ‘the best so far’ with other models, such as various deep learning algorithms. He actually has another book on time series forecasting that does not include (primarily) deep learning, called Time Series Forecasting with Python, so I’ve decided to review that first so I can better follow his incremental improvement approach. I don’t remember much about ARIMA, for example, which he described at length in that earlier work.

I’ll probably stick with pair trading because the whole stationarity thing is easier for any model to work with. Pair trading also has the huge advantage that it’s nearly cost neutral: the shorts pay for the longs, and I don’t need to actually invest any further capital, just use the margin I already have. The general context for this is trading crypto in the Binance cross margin wallet, using my BTC as margin (collateral).

Anyway, I first need to spend some time sorting out my tax from last financial year, so progress on the ML might be a bit slow.

I’m Going to Need a Better Computer

Reading through that ebook I mentioned in my previous post, I’m once again reminded that machine learning is an experimental science. The general approach to solving a problem is to throw everything at it and see what works best. And by ‘everything’ I mean multiple variations of multiple algorithms within multiple families, plus a few ensemble solutions for dessert.

That’s a lot of processing. My most powerful PC is out of action. It has an RTX2080Ti graphics card which, while a bit old now, is still well regarded for ML processing. I’m thinking of building a new machine. In the past I’ve had a couple of machines built by friends who are more hardware savvy than I am, including the Linux box I’m currently using. Over the years I have built a lot of different things (not computers), some of them quite challenging. It should be easy.

So, build a dedicated ML PC. It won’t be for a while because I still have so much to learn. And I’m getting older. Maybe it will happen one day. Or perhaps I can just get someone to do a custom build for me. Do I really need to do it myself? I’m getting to that stage in my life where it’s easier just to pay someone else to do things for me. Pity.

Reading a bit further, Jason suggests cloud-based resources, such as AWS. Perhaps I should give that serious consideration.

Hidden Gem?

I’ve been clearing out old emails and came across one that I had completely forgotten about – receipt and download link for Deep Learning Time Series Forecasting by Jason Brownlee, of machinelearningmastery.com. I have several of Jason’s books (pdfs plus code) and I find him very readable and usually pitched at just the right level for me. I like his style.

So this one is about using deep learning for time series forecasting, obviously, with an emphasis on CNNs and RNNs for supervised learning problems, plus some hybrid systems. He’s very big on data preparation, and I must admit this is an area I struggle with. For trading there are so many possible inputs, and Ernie Chan has even suggested a system that uses hundreds or thousands of inputs, without however being very specific about what those inputs are.

So, another detour. I’ll put the unsupervised learning study aside for a while and have a quick read through this book. Perhaps it will give me some ideas, and it is at least directly relevant to my main concern, i.e. time series forecasting.

Sleeping on it

After sleeping on it I’ve decided to stick with my original resolve. The book I’m working through is Hands-On Unsupervised Learning using Python by Ankur Patel (an O’Reilly book) via Kindle. At least with unsupervised learning I don’t have to make decisions about what to use as targets (classes, labels, etc), and with PCA I can relax a bit about my feature selection/engineering. All very good for a person who has trouble making decisions (and sticking with them). So, a good book, an IDE and environment setup that works for this kind of stuff, example code downloads as usual with this kind of book, and I’m set for a couple of months.

or Third…

I mentioned in a previous post that I had a Quantra course on Unsupervised Learning in Trading. Fact is that it attempted to cluster different financial instruments, with a view to pair trading instruments in the same cluster. It is generally recommended that one pair trade with instruments that have some fundamental feature in common. However I was unable to group crypto instruments in any way that produced good results. Sharpe ratios from backtesting were abysmal. I did try.

So, should I try again? Can I improve my understanding of unsupervised learning to the point where I can achieve useful results? Seems I’ve failed too many times to achieve any significant results over the past few years. What to do?

On Second Thought…

I’m already questioning my decision from the previous post, to focus on Unsupervised Learning. In the back of my mind is the advice from Ernie Chan, that ML is most useful for metalabelling – deciding whether to trade (or not) where actual signals have come from a non-ML strategy. I guess this is a supervised classification problem. My new book on Unsupervised Learning has started with a chapter on Supervised Learning, which is what has reminded me of all this.

What I’m actually thinking of is some kind of weird hybrid system, where a classification of ‘good time to trade’ is in fact a buy signal for whatever data feed into the trained model. So combining strategy with metalabelling. What could go wrong?

Focus

I frequently feel overwhelmed by the sheer volume of tutorial material I now have. I need to focus on one thing, and stick with it long enough that I don’t need to come back to it again. So for better or worse I’ve decided to focus on unsupervised learning. I even bought a shiny new (Kindle) book despite already having a couple on this subject. Retail therapy.

So this one uses TensorFlow for its examples, and I set up a new Docker environment with TensorFlow and the standard data science libraries. All the code files that come with the book are in Jupyter Notebook format, and I don’t like that much because I can’t easily inspect the values of variables (and check the types) as I go. Currently my PyCharm is not set up to run notebooks from Docker. I believe it’s possible (something about exposing ports) but not a high priority for me at the moment. I can just use Google Colab if I really want to run the notebooks.

Anyway, it’s pretty easy to open the notebook in PyCharm even if I can’t run it there, and copy the code into a .py file with minor adjustments where needed to get that to work, such as inserting print statements rather than just having the notebook print the last returned value in a cell.
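
For anyone who would rather not copy cells by hand, here is a minimal sketch of the same idea using only the standard library. It relies on the fact that an .ipynb file is JSON with a top-level "cells" list; the function names `notebook_to_script` and `convert_file` are just illustrative, not part of any library. It only extracts the code cells, so the manual adjustments mentioned above (adding print statements, removing notebook magics) are still needed afterwards.

```python
import json


def notebook_to_script(notebook):
    """Return the source of all code cells in a parsed .ipynb dict as one string."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            chunks.append(source.rstrip() + "\n")
    # A blank line between cells keeps the script readable
    return "\n".join(chunks)


def convert_file(ipynb_path, py_path):
    """Read a notebook from disk and write its code cells to a .py file."""
    with open(ipynb_path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(py_path, "w", encoding="utf-8") as f:
        f.write(notebook_to_script(notebook))
```

Jupyter’s own `jupyter nbconvert --to script` command does a similar job if Jupyter happens to be installed in the environment.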

Will Unsupervised Learning help me with my trading? Who can say? I do actually have a Quantra course on that very subject, but I’m hoping to gain a more holistic view of unsupervised learning so that I can explore it in more depth.

Docker

When I started programming in Python about four years ago I used the Anaconda distribution for package and environment management. This worked well most of the time, but I had some issues, notably with TA-Lib. I gradually discovered that there weren’t as many packages (or specific versions of packages) available through conda as there were on PyPI using pip. While some people used both conda and pip with gay abandon, I did see some warnings not to do that. Very confusing. Perhaps the issue was related to using Spyder as my IDE. It had to be installed in each environment, and perhaps had problems with pip-installed packages. I switched to PyCharm as an IDE but still had problems once I started on Machine Learning, especially conflicts between PyTorch/TensorFlow and CPU/GPU installs.

Lately I’ve been using Docker for environment management, which seems to work well although I’m not attempting to use GPU versions of the ML libraries. The biggest problem is that one can’t install new packages in an image without rebuilding the image. Anyway, the process at the moment is to use a Docker container as a remote interpreter, with the actual Python project (or tutorial) files in some directory outside the container. I have to use PyCharm Pro for this, as the Community Edition doesn’t support Docker-based remote interpreters.

Giving Up on Pair Trading

Over the past few months I’ve tried to tighten up my pair trading. Firstly, I’m picking pairs that are cointegrated at the 99% confidence level by two different tests and that, on visual inspection of the charts, reach trigger conditions and actually revert to the mean often enough to be worth trading. Also, I’m using many small positions (approx. US$500 each, long and short). Most such pairs go way off pattern as soon as I put a trade on, and usually don’t return again. At least with the smaller position sizes I’m not losing too much, but it’s certainly time to reconsider using this strategy. Pity, as I’ve put a lot of work (and money) into it over the past few years.
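
To make “trigger conditions” a bit more concrete, here is a rough sketch of one common way to define them: a rolling z-score of the spread between the two instruments. This is an illustration only, not the exact rule used for these trades. The spread is assumed to be the log price ratio (i.e. a hedge ratio of 1), and the window length and the entry threshold of 2.0 are arbitrary placeholder values; `spread_zscores` and `signal` are made-up names.

```python
import math
import statistics


def spread_zscores(prices_a, prices_b, window):
    """Rolling z-score of the log price ratio.

    Each z-score compares the current spread with the mean and standard
    deviation of the previous `window` observations (the current value is
    excluded from its own baseline). Assumes window >= 2.
    """
    # Assumption: log ratio as the spread, i.e. a hedge ratio of 1
    spread = [math.log(a / b) for a, b in zip(prices_a, prices_b)]
    zscores = []
    for i in range(len(spread)):
        if i < window:
            zscores.append(None)  # not enough history yet
            continue
        history = spread[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        zscores.append((spread[i] - mean) / stdev if stdev > 0 else None)
    return zscores


def signal(z, entry=2.0):
    """Map a z-score to a trade direction; the entry level is a placeholder."""
    if z is None:
        return "flat"
    if z > entry:
        return "short A / long B"  # spread unusually wide
    if z < -entry:
        return "long A / short B"  # spread unusually narrow
    return "flat"
```

A pair that keeps triggering on this measure but never comes back towards zero is the “goes way off pattern” case: the historical mean the z-score is measured against has simply stopped being relevant.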

Finding the Path

I’ve accumulated a lot of instructional material (for Python) over the past couple of years, and it’s proving a challenge to develop a doable self-instruction course. I don’t want to waste time on stuff I mostly know already, and also don’t want to get bogged down in material that is a bit too challenging at my current skill level. Also, should I concentrate on pure Python programming, or on trading strategies, using Python to help make decisions? And what about web scraping or GUI development, which might help with getting information from the world at large and organizing my scripts in an easy-to-use graphical interface?

Anyway, I’ve recently decided to spend time with an actual book, Python Cookbook by David Beazley and Brian Jones. It’s not specifically geared towards trading, or even data science, but seems a good fit to improve my basic understanding of Python. The issue that I face is not that I need to develop sophisticated code, but that I need to be able to understand the sometimes quite sophisticated code that some authors/trainers/course creators use in their examples. So I need to be able to read code at a much higher level than I am ever likely to write.

I’ve pretty much decided to spend nearly all my time on this enterprise. I’ll be moving soon to a probably more expensive home and feel the need to actually generate some income, so trading is becoming a job rather than just a hobby. I’m still intending to let statistics (and maybe ML) inform my decisions, although I don’t plan to automate anything just yet.

Data Preparation

I’m currently tackling the problem of incorporating the NASDAQ into my model. I downloaded some data from TradingView which seems related, but of course it is only traded on a regular exchange, so there is data only for Mon – Fri, excluding public holidays. I have worked out a way to provide a complete date index for the period I’m interested in, but now need to fill in the missing values. I’m going to have quite a few similar problems before I’m finished, so I decided to get yet another book by Jason Brownlee, on Data Preparation. I have a vague feeling that I’ve seen this book before, as some of the content seems familiar, but I can’t find it anywhere.
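
As a note to self, here is a minimal sketch of one way to fill those gaps: carry the last known close forward over weekends and holidays (forward fill). That choice is an assumption made for illustration and no fill method has been settled on here; the function name `fill_calendar_days` and the prices in the usage example are made up.

```python
from datetime import date, timedelta


def fill_calendar_days(closes):
    """Expand trading-day closes to every calendar day using forward fill.

    closes: dict mapping datetime.date -> closing price (trading days only).
    Returns a list of (date, price) tuples covering every day from the first
    to the last date, repeating the most recent close on non-trading days.
    """
    if not closes:
        return []
    days = sorted(closes)
    filled = []
    current = days[0]
    last_value = closes[current]
    while current <= days[-1]:
        # Use the day's own close if there is one, else carry the last one forward
        last_value = closes.get(current, last_value)
        filled.append((current, last_value))
        current += timedelta(days=1)
    return filled


# Usage: a Friday and the following Monday, with the weekend missing
example = {date(2023, 1, 6): 100.0, date(2023, 1, 9): 103.0}
for day, price in fill_calendar_days(example):
    print(day.isoformat(), price)
```

With the data in a pandas DataFrame, reindexing on the complete date index and then calling `ffill()` should give the same result. Forward fill at least avoids leaking Monday’s price back into the weekend, which interpolation would do.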

Anyway I’m pretty happy with how this project is going, and I don’t expect real results for quite a while so plenty of time to work things out and do more relevant study.

Data Cleaning

I downloaded a few years’ worth of four-hour data from Binance recently, and noticed that the timestamp of the last item was not what I expected. I decided to go looking for the anomaly this morning, with 6000 observations to check. Anyway, it wasn’t too hard (modified binary search) and it turned out that three rows were missing: two consecutive, one not. I’m guessing that the server was down during those two periods; both were in 2018, when Binance had been operating for less than a year. Anyway, it was fairly straightforward to get the prices for those periods from Poloniex. They should have been pretty close to Binance prices or someone would have grabbed an opportunity for arbitrage. I had to estimate the volumes, but three approximate values out of 6000 shouldn’t cause too many problems.
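
For the record, the binary search idea can be sketched roughly as below. This is a reconstruction of the approach, not the script that was actually run. It assumes timestamps are integers (e.g. milliseconds), strictly increasing, with no duplicates, and always a whole number of steps apart; the names `first_gap_index` and `find_missing` are illustrative. The trick is that on a gap-free stretch, the timestamp at index i must equal the starting timestamp plus i steps, so any index that is “late” has a gap somewhere before it.

```python
def first_gap_index(ts, step, start=0):
    """Return the first index after `start` whose timestamp is later than
    expected, or None if ts[start:] has no gaps."""
    base = ts[start]
    lo, hi = start, len(ts) - 1
    if ts[hi] - base == (hi - start) * step:
        return None  # last row is on schedule, so nothing is missing
    # Invariant: ts[lo] is on schedule, ts[hi] is not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ts[mid] - base == (mid - start) * step:
            lo = mid
        else:
            hi = mid
    return hi


def find_missing(ts, step):
    """Return every timestamp missing from a regularly spaced series."""
    if len(ts) < 2:
        return []
    missing = []
    start = 0
    while True:
        i = first_gap_index(ts, step, start)
        if i is None:
            return missing
        # Everything between the row before the gap and row i is missing
        t = ts[i - 1] + step
        while t < ts[i]:
            missing.append(t)
            t += step
        start = i  # continue searching from the far side of the gap


# Usage: 6000 four-hour bars with three rows removed (two consecutive, one not)
STEP = 4 * 60 * 60 * 1000  # four hours in milliseconds
full = [k * STEP for k in range(6000)]
observed = [t for k, t in enumerate(full) if k not in (100, 101, 3000)]
print(find_missing(observed, STEP))
```

Each gap costs about log2(6000), or roughly 13 comparisons, to locate, which is why a hand-run version of this is quite manageable.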