I’ve been overthinking this whole Deep RL thing. Having got a basic Policy Gradient algorithm up and running, I think I can lay out a high-level view of what it’s all about.
1. State (numbers) from the environment is fed into the network
2. The network produces some other numbers (its output)
3. The Agent uses some process to decide how to translate those numbers into an action to send to the environment
4. The Agent sends the action to the Environment
5. The Environment responds with a new state and a reward
6. The Agent evaluates ‘how good was that response’, aka the loss function
7. The Agent instructs the network to do better next time (minimize the loss)
8. The network attempts to oblige until the Agent is happy with the outcome (the whole loop is sketched in code below)
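Here’s a minimal sketch of that loop, assuming a Gymnasium-style environment (CartPole and the random stand-in ‘network’ are just my choices to show the shape of the thing):

```python
import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")
state, _ = env.reset()

for _ in range(500):
    # Steps 1-3: a real network would map the state to output numbers;
    # random values stand in for it here.
    logits = np.random.randn(env.action_space.n)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    action = int(np.random.choice(env.action_space.n, p=probs))

    # Steps 4-5: send the action, get the new state and reward back.
    state, reward, terminated, truncated, _ = env.step(action)

    # Steps 6-8 (loss and network update) would go here in a real agent.
    if terminated or truncated:
        state, _ = env.reset()

env.close()
```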
Step 3 is where all the theory comes in. The basic policy gradient algorithm takes the output (the number of output values equals the number of available actions) and scales it so that the values are all positive, all lie between 0 and 1, and all add up to 1 (this scaling is the softmax function). Then the agent thinks, hey, these look like probabilities, and the action sent to the environment is sampled accordingly. If the numbers are 0.7, 0.2 and 0.1, then the agent chooses action A 70% of the time, action B 20% of the time, and action C 10% of the time. If the Agent is in a Q-learning frame of mind, it instead treats the numbers as estimates of how much future reward each action is worth, and mostly picks the one with the highest estimate (with a bit of exploration mixed in). Whatever the process is, as long as there’s some measure of ‘how good was that’, the network can work towards improving the overall result.
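That scaling step is the softmax function, and the sampling is a single NumPy call. A quick sketch, with made-up raw outputs chosen to land near the 0.7 / 0.2 / 0.1 example:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is all
    # positive, lies between 0 and 1, and sums to 1.
    z = np.exp(x - x.max())
    return z / z.sum()

raw_output = np.array([1.5, 0.25, -0.45])  # made-up network outputs
probs = softmax(raw_output)                # roughly [0.7, 0.2, 0.1]
action = np.random.choice(len(probs), p=probs)  # A ~70%, B ~20%, C ~10%
```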
The other aspect of all this is ‘improving’ the result, not just in the sense of maximizing it but in the sense of making it more stable. Hence a double Deep Q network produces steadier value estimates than a single one: the second network is there to damp the overestimation that makes a single DQN’s learning jumpy. And I imagine more elaborate variations of the Policy Gradient idea aim at the same thing: smoother, maybe faster, learning.
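To make the ‘double’ part concrete, here’s roughly how the target value gets computed, sketched in PyTorch (the network sizes and names are mine, purely for illustration): one network chooses the best next action, the other evaluates it.

```python
import torch
import torch.nn as nn

# Stand-in networks (hypothetical sizes: 4 state numbers, 2 actions).
online_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)

def double_dqn_target(next_states, rewards, dones, gamma=0.99):
    # The online net *chooses* the best next action; the target net
    # *evaluates* it. Decoupling choice from evaluation damps the
    # overestimation that makes a single DQN's estimates jumpy.
    with torch.no_grad():
        best = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q

# A made-up batch of three transitions:
targets = double_dqn_target(
    next_states=torch.randn(3, 4),
    rewards=torch.tensor([1.0, 0.0, 1.0]),
    dones=torch.tensor([0.0, 0.0, 1.0]),
)
```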
So what the books and courses are mostly about is how the Agent decides, on the basis of the numbers provided by the network, what to do in the Environment. As far as the network and the environment are concerned, it’s all just numbers.