I ran xgboost on my large dataset, framed as a binary classification problem. For the target I simply recorded whether the following time period’s return was positive or not. So basically: given current data, will the price go up in the next hour? Here’s the resulting confusion matrix:
| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| Actual Negative | 4046 | 3834 |
| Actual Positive | 3678 | 4203 |
These figures relate to the test set. The model was trained on the first 70% of the data and tested (above) on the remaining 30%. Overall accuracy was 52.34%, and precision (the share of predicted positives that really were positive) was 52.30%.
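As a sanity check, the headline figures can be recomputed directly from the four counts in the confusion matrix. This is just a minimal sketch of the arithmetic; the variable names are my own and nothing here comes from the xgboost run itself.

```python
# Counts copied from the confusion matrix above.
tn, fp = 4046, 3834  # actual negatives: predicted negative / predicted positive
fn, tp = 3678, 4203  # actual positives: predicted negative / predicted positive

total = tn + fp + fn + tp

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / total

# Precision: share of predicted positives that were actually positive.
precision = tp / (tp + fp)

# Base rate: share of actual positives in the test set.
base_rate = (tp + fn) / total

print(f"test set size = {total}")
print(f"accuracy      = {accuracy:.2%}")
print(f"precision     = {precision:.2%}")
print(f"base rate     = {base_rate:.2%}")
```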
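I guess the most important numbers are the predicted positives, since those are the ones I would trade on. These were correct more often than not. Actual positives make up 50.00% of the test set, so a precision of 52.30% is an edge of roughly 2.3 percentage points over the base rate. With close to 16,000 entries in the test set, about 8,000 of them predicted positive, that edge is probably statistically significant. But is it big enough to be exploitable?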
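One rough way to put a number on "probably significant" is a one-sided z-test on the predicted positives, asking how likely this many correct calls would be if each call were a 50/50 coin flip. The sketch below is my own choice of test, not something from the xgboost output, and it assumes the predictions are independent of each other, which hourly market data may well violate. Treat the result as an optimistic indication rather than proof.

```python
import math

# Counts copied from the confusion matrix above.
tp, fp = 4203, 3834

n = tp + fp   # number of predicted positives
p0 = 0.5      # null hypothesis: precision equals the 50% base rate

# Normal approximation to the binomial distribution.
# ASSUMPTION: predictions are independent, which time series may not satisfy.
expected = n * p0
std_dev = math.sqrt(n * p0 * (1 - p0))
z = (tp - expected) / std_dev

# One-sided p-value: chance of doing at least this well by luck alone.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"predicted positives = {n}")
print(f"z-score             = {z:.2f}")
print(f"one-sided p-value   = {p_value:.2e}")
```

The z-score comes out a little above 4, so under the independence assumption this is unlikely to be luck. It says nothing about whether the edge survives trading costs, which is the real question.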
So far I’ve used xgboost without any tuning; I’ve only skimmed through the reference on using xgboost that I mentioned in a previous post. More significantly, can I refine my signals by using an RL model to judge whether the market as a whole is in a good state to trade? Or to determine position size based on the level of risk? At least I have something to work with: a slight edge.