I think I’ve worked out what has been a major source of confusion for me. The books and courses I’ve been reading and watching on RL make a big deal of policy gradient methods being ‘parametric’, yet it seems to me that the only real architectural difference between using a neural network for Q-learning and using one for policy gradient learning is whether or not there’s a softmax activation on the output layer. Everything a neural network does is parametric, if one considers the network weights to be parameters. I just couldn’t work out what the big deal was.
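To make that concrete, here’s a minimal sketch of what I mean, using PyTorch; the class name, layer sizes, and state/action dimensions are just illustrative, not from any particular book or course. The same network can be read as a Q-function or as a policy, depending only on what you do with its outputs:

```python
import torch
import torch.nn as nn

class ActionNet(nn.Module):
    """One network body; only the interpretation of its outputs changes."""
    def __init__(self, n_states: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_states, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.body(state)

net = ActionNet(n_states=4, n_actions=2)
state = torch.randn(1, 4)  # a dummy observation

# Value-based reading (Q-learning): the raw outputs are Q(s, a) estimates,
# and the greedy policy just takes the argmax over actions.
q_values = net(state)
greedy_action = q_values.argmax(dim=-1)

# Policy-gradient reading (e.g. REINFORCE): a softmax over the very same
# outputs yields action probabilities, and an action is sampled from them.
probs = torch.softmax(net(state), dim=-1)
sampled_action = torch.distributions.Categorical(probs).sample()
```

The training signal is where the two really part ways, of course: Q-learning regresses those outputs towards bootstrapped TD targets, while a policy gradient method like REINFORCE nudges the log-probability of sampled actions in proportion to the return received. But the function approximator itself is identical either way.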
However, I now think it’s for historical reasons. When I was teaching basic computer concepts to first-year students, we’d cover terms like ROM and RAM in relation to types of memory, and yet those names are misnomers. ROM is RAM, in the sense that it’s random-access memory, which is what RAM stands for. The real difference (back then) was that RAM is volatile and writable while ROM is neither. But there was a time, in the dawn of computing, when tapes were used for storage, and tapes are sequential storage devices. Sequential access is the real alternative to random access. The terms stuck, though, and came to be used in ways no longer tied to their literal meaning.
So it’s finally occurred to me that these approaches to reinforcement learning predate the use of neural networks as function approximators, and the concepts around the different approaches reflect what were once real differences. Treating the network’s output a bit differently for a policy gradient strategy than for a value-based strategy might not seem like much of a difference now, but before neural networks the gap was a lot wider: tabular Q-learning stores a separate value for every state-action pair, whereas policy gradient methods always required an explicitly parameterised policy, such as a linear softmax over state features. It’s just like how the web, email, and news were once all different services on the internet, with different servers and clients, but have now all converged into ‘the web’. For a long time I read email with Thunderbird, and occasionally used RSS Owl to ‘consume’ newsgroups. And paid my utility bills in person.