Solved – specific example of reinforcement learning using linear function approximation

machine-learning, reinforcement-learning

I can think of two cases where linear function approximation is useful in reinforcement learning: when the state space is too large to store a value for every state, or when the state space is continuous.

I have been reading a lot of material on this topic, and there are certain concepts I would rather see illustrated in an example than in proofs. Is there a gentler introduction, or a worked example, of how linear function approximation is applied in reinforcement learning?

Currently I am lost as to how to formulate it in a program to solve a particular problem. I am looking at Q-learning right now and trying to extend the framework to accommodate linear function approximation, but it seems difficult.

Best Answer

For a gentle introduction, see the Georgia Tech & Udacity course on reinforcement learning. The early videos in section 8, "Generalization", walk through a simple example of how one might formalize such a problem.

For an example, start with the classic mountain car problem. The full details are nicely spelled out in the technical details section of the Wikipedia article, but here's a brief, informal summary:

  • A driver in a car with a weak motor wants to reach the top of the hill on the right side of a valley.
  • The car lacks the horsepower to drive straight up, but the driver can back up the left hill first and then accelerate to the right, using the extra momentum to make it up the right side.
  • States are formalized as pairs of real numbers, the car's position and velocity, each bounded within a fixed range.
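To make that formalization concrete, here is a minimal sketch of the state and dynamics in plain Java (not BURLAP); the constants follow the standard mountain car formulation described in the Wikipedia article, and the class and method names are just illustrative.

```java
// Minimal mountain car sketch: a state is (position, velocity), both bounded.
// Constants follow the standard formulation; names here are illustrative only.
public class MountainCar {
    public static final double MIN_POS = -1.2, MAX_POS = 0.6;
    public static final double MIN_VEL = -0.07, MAX_VEL = 0.07;
    public static final double GOAL_POS = 0.5;

    public double position = -0.5;  // start near the bottom of the valley
    public double velocity = 0.0;

    /** Apply one action (-1 = reverse, 0 = coast, +1 = forward); returns the reward. */
    public double step(int action) {
        velocity += 0.001 * action - 0.0025 * Math.cos(3 * position);
        velocity = Math.max(MIN_VEL, Math.min(MAX_VEL, velocity));
        position += velocity;
        position = Math.max(MIN_POS, Math.min(MAX_POS, position));
        if (position <= MIN_POS) velocity = 0.0;   // hit the left wall: stop
        return isGoal() ? 0.0 : -1.0;              // -1 per step until the goal
    }

    public boolean isGoal() { return position >= GOAL_POS; }
}
```

Because a state is just these two numbers, a linear Q-function only needs a small feature vector built from position and velocity (for example, a handful of basis functions), rather than a table entry for every possible state.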

The developers of the BURLAP (Brown-UMBC Reinforcement Learning and Planning) library have a tutorial on solving this problem with least-squares policy iteration (LSPI), which includes this helpful description of how LSPI relies on function approximation:

LSPI starts by initializing a random policy and then uses the collected SARS (state, action, reward, next state) samples to approximate the Q-value function of that policy over the continuous state space. Afterwards, the policy is updated by choosing the actions with the highest Q-values (this change in policy is known as policy improvement). This process repeats until the approximate Q-value function (and consequently the policy) stops changing much.

LSPI approximates the Q-value function for its current policy by fitting a linear function of state basis features to the SARS data it has collected, much like an ordinary regression problem; this is what lets the value function generalize to unseen states.
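To see how those two paragraphs translate into code, here is a stripped-down sketch in plain Java rather than BURLAP's own API. The names (`Sample`, `phi`, `fit`, and so on) are mine, and the tiny Gaussian-elimination solver stands in for a proper linear-algebra library. The fitting step builds the usual LSTDQ system A w = b from the SARS samples, where phi(s, a) copies the state's basis features into the block belonging to action a, and the outer loop repeats fit-then-improve until the weights stop changing much.

```java
import java.util.List;

// Illustrative LSPI sketch (not BURLAP's API). Q(s, a) is approximated as
// w · phi(s, a); each iteration fits w to the collected SARS samples for the
// current greedy policy (LSTDQ), and acting greedily on the new w is the
// policy-improvement step.
public class LspiSketch {

    /** One collected transition: state features, action index, reward, next-state features. */
    public record Sample(double[] s, int a, double r, double[] sNext, boolean terminal) {}

    /** phi(s, a): copy the state's basis features into the block owned by action a. */
    static double[] phi(double[] stateFeatures, int action, int nActions) {
        double[] f = new double[stateFeatures.length * nActions];
        System.arraycopy(stateFeatures, 0, f, action * stateFeatures.length, stateFeatures.length);
        return f;
    }

    static double qValue(double[] w, double[] stateFeatures, int action, int nActions) {
        double[] f = phi(stateFeatures, action, nActions);
        double q = 0;
        for (int i = 0; i < w.length; i++) q += w[i] * f[i];
        return q;
    }

    /** Policy improvement: act greedily with respect to the current approximate Q. */
    static int greedy(double[] w, double[] stateFeatures, int nActions) {
        int best = 0;
        double bestQ = Double.NEGATIVE_INFINITY;
        for (int a = 0; a < nActions; a++) {
            double q = qValue(w, stateFeatures, a, nActions);
            if (q > bestQ) { bestQ = q; best = a; }
        }
        return best;
    }

    /** Policy evaluation (LSTDQ): build and solve A w = b from the samples. */
    static double[] fit(List<Sample> data, double[] w, int nActions, double gamma) {
        int n = w.length;
        double[][] A = new double[n][n];
        double[] b = new double[n];
        for (int i = 0; i < n; i++) A[i][i] = 1e-6;              // small ridge term for stability
        for (Sample smp : data) {
            double[] f = phi(smp.s(), smp.a(), nActions);
            double[] fNext = smp.terminal() ? new double[n]
                    : phi(smp.sNext(), greedy(w, smp.sNext(), nActions), nActions);
            for (int i = 0; i < n; i++) {
                b[i] += f[i] * smp.r();
                for (int j = 0; j < n; j++) A[i][j] += f[i] * (f[j] - gamma * fNext[j]);
            }
        }
        return solve(A, b);
    }

    /** Outer LSPI loop: evaluate, improve, repeat until the weights settle. */
    public static double[] lspi(List<Sample> data, int nStateFeatures, int nActions,
                                double gamma, double tol) {
        double[] w = new double[nStateFeatures * nActions];      // zero weights = arbitrary start policy
        while (true) {
            double[] wNew = fit(data, w, nActions, gamma);
            double change = 0;
            for (int i = 0; i < w.length; i++) change += Math.abs(wNew[i] - w[i]);
            w = wNew;
            if (change < tol) return w;
        }
    }

    /** Plain Gaussian elimination with partial pivoting; fine for small feature sets. */
    static double[] solve(double[][] A, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            int p = col;
            for (int row = col + 1; row < n; row++)
                if (Math.abs(A[row][col]) > Math.abs(A[p][col])) p = row;
            double[] tmpRow = A[col]; A[col] = A[p]; A[p] = tmpRow;
            double tmpVal = b[col]; b[col] = b[p]; b[p] = tmpVal;
            for (int row = col + 1; row < n; row++) {
                double factor = A[row][col] / A[col][col];
                for (int c = col; c < n; c++) A[row][c] -= factor * A[col][c];
                b[row] -= factor * b[col];
            }
        }
        double[] x = new double[n];
        for (int row = n - 1; row >= 0; row--) {
            double sum = b[row];
            for (int c = row + 1; c < n; c++) sum -= A[row][c] * x[c];
            x[row] = sum / A[row][row];
        }
        return x;
    }
}
```

With the mountain car sketch above, the state features could be as simple as a few radial basis functions over (position, velocity), and `data` could be gathered by driving the car with random actions before calling `lspi`.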

BURLAP is in Java, but the prose of the tutorial can be followed without much reference to the code.