UCI has a dataset on bike sharing. It poses a regression problem: predict the number of bikes shared on any given day. I cleaned it up a bit (removed the index and date columns) and one-hot encoded features such as day of week. Then I separated out the count as the label (target).
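As a rough illustration of that preparation step, here is a minimal sketch in plain Python. The rows are made up, and the column names ("weekday", "temp", "cnt") are assumptions for the example rather than something taken from the post.

```python
# Made-up rows for illustration; the column names are assumptions.
rows = [
    {"weekday": 0, "temp": 0.30, "cnt": 1500},
    {"weekday": 6, "temp": 0.45, "cnt": 2400},
    {"weekday": 2, "temp": 0.25, "cnt": 1100},
]

WEEKDAYS = list(range(7))  # 0..6, one output column per day of the week


def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == category else 0.0 for category in categories]


# Features: seven one-hot weekday columns followed by the numeric column.
features = [one_hot(row["weekday"], WEEKDAYS) + [row["temp"]] for row in rows]

# Label (target): the daily count, kept separate from the features.
labels = [float(row["cnt"]) for row in rows]
```

Each row ends up with eight feature columns here; the real dataset would reach 35 once every categorical column is expanded the same way.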
I created a simple MLP consisting of a couple of layers, going from 35 inputs down to 1 output. I initially tried just a single layer for the simplest possible model, but got unexplained errors (all the predictions were NaN), so I made it a little more complex (35->2, ReLU, 2->1) and that worked.
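To make the 35->2, ReLU, 2->1 shape concrete, here is a sketch of the forward pass in plain Python. The post does not say which framework was used, so this is only an illustration of the architecture; the weight initialisation and the helper names (`linear`, `relu`, `init_layer`, `predict`) are arbitrary choices for the example.

```python
import random


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
w1, b1 = init_layer(35, 2, rng)  # 35 inputs -> 2 hidden units
w2, b2 = init_layer(2, 1, rng)   # 2 hidden units -> 1 output


def predict(x):
    """Forward pass: linear, ReLU, linear."""
    hidden = relu(linear(x, w1, b1))
    return linear(hidden, w2, b2)[0]


example = [rng.random() for _ in range(35)]  # one made-up input row
prediction = predict(example)
```

This covers only the forward pass; training (the loss, the optimizer and the epoch loop) is left out.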
I set up a grid search in Optuna (which is not documented as well as it might be) with a couple of optimizers and a range of learning rates. SGD gave both the best and the worst results, usually with the same learning rate! Adam and RMSprop were much more consistent, although not quite as good as SGD’s best. It is hard to trust predictions when the optimizer has such huge variability.
A couple of the features had much higher values than the rest. Bike use counts vary from about 1000 to 5000 per day, and two features tracked how many of the users were casual and how many were registered, whereas everything else was in the range of approximately 0 to 1. So I produced a single casual/registered ratio feature and used that instead, but it made almost no difference. Maybe it would matter more with a more complex model; more testing is needed.
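The replacement feature could be computed along these lines. This is a sketch with made-up rows; the key names ("casual", "registered", "casual_ratio") and the choice to return 0.0 when there are no registered riders are assumptions for the example.

```python
def casual_ratio(casual, registered):
    """Ratio of casual to registered riders; 0.0 if there were no registered riders."""
    return casual / registered if registered else 0.0


# Made-up rows for illustration.
rows = [
    {"casual": 100, "registered": 400},
    {"casual": 300, "registered": 1200},
    {"casual": 50, "registered": 0},
]

# Replace the two large-valued columns with the single ratio column.
for row in rows:
    row["casual_ratio"] = casual_ratio(row.pop("casual"), row.pop("registered"))
```

As long as registered riders outnumber casual ones, the ratio stays between 0 and 1, in line with the other features.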
The best result was 5330, which was actually produced by Adam rather than SGD, using the modified dataset, but it was something of an outlier. Still, that’s a pretty high value when very few days had a count that high.
ETA: I realized that the above result was achieved with only 10 epochs in the training loop. I increased this to 1000 and the best result dropped below 2000 (average of 10 trials). Even compared with 500 epochs, 1000 epochs gave a significantly better result, so that’s something to consider in future.
I decided to switch from the mean to the median to get the following results, where each row represents 100 trials and the third column is the RMSE (Root Mean Squared Error). RMSprop is better than Adam at every learning rate except 0.01, and higher learning rates generally give better results, although RMSprop levels off at around 1125 from 0.08 upward. It appears that a learning rate of 0.1 is the largest value the algorithms will accept.
params_lr  params_optimizer  median RMSE
0.01       Adam              4397.0
0.01       RMSprop           4695.0
0.02       Adam              2598.0
0.02       RMSprop           2477.0
0.03       Adam              2226.0
0.03       RMSprop           1951.0
0.04       Adam              1964.0
0.04       RMSprop           1475.0
0.05       Adam              1739.0
0.05       RMSprop           1226.0
0.06       Adam              1557.0
0.06       RMSprop           1159.0
0.07       Adam              1427.0
0.07       RMSprop           1135.0
0.08       Adam              1338.0
0.08       RMSprop           1127.0
0.09       Adam              1281.0
0.09       RMSprop           1125.0
0.10       Adam              1245.0
0.10       RMSprop           1126.0
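A summary like the table above can be produced by grouping trial results by learning rate and optimizer and taking the median of each group. The sketch below uses only the standard library and made-up trial values; it is not the code used for the post. It also shows why the median is less affected than the mean by a single outlier run.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up trial results: (learning rate, optimizer, RMSE).
trials = [
    (0.05, "Adam", 1700.0),
    (0.05, "Adam", 1750.0),
    (0.05, "Adam", 5300.0),  # one outlier run
    (0.05, "RMSprop", 1200.0),
    (0.05, "RMSprop", 1230.0),
    (0.05, "RMSprop", 1260.0),
]

# Collect the RMSE values for each (learning rate, optimizer) pair.
groups = defaultdict(list)
for lr, optimizer, rmse in trials:
    groups[(lr, optimizer)].append(rmse)

# Summarise each group with both statistics for comparison.
summary = {
    key: {"mean": mean(values), "median": median(values)}
    for key, values in groups.items()
}

for (lr, optimizer), stats in sorted(summary.items()):
    print(f"{lr:<6} {optimizer:<8} mean={stats['mean']:.1f} median={stats['median']:.1f}")
```

In this made-up data the single outlier pulls the Adam mean well above 2900 while its median stays at 1750.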
As a comparison, I ran a linear regression from scikit-learn on the same data and got approximately 1100, about the same as the best results above.
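For reference, the RMSE used to compare models can be computed as in the sketch below, written in plain Python. The function name `rmse` and the sample numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up daily counts and predictions, each off by 100 bikes.
actual = [1000.0, 2000.0, 3000.0]
predicted = [1100.0, 1900.0, 3100.0]
error = rmse(actual, predicted)
```

Because the RMSE is in the same units as the target, a value of about 1100 means predictions are typically off by roughly that many bikes per day.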