First Steps with Q Learning in Python

Josh Price
5 min read · Dec 6, 2020

The most interesting subset of machine learning to me has always been what I’ve learned to call Reinforcement Learning: the process of letting software explore an environment and adapt its behavior to achieve the highest level of success.

For this first attempt, I began with this tutorial.

Mountain Car environment from the Gym library

The goal is to teach the car to reach the flag as efficiently as possible. A push in any single direction won’t do the job, so we must build a Q table that adapts to the reward from each attempt at swinging back and forth.

The environment provides three possible actions: Push Left, Inaction, and Push Right, which are mapped to 0, 1, and 2, respectively.
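As a sketch of what that looks like in code (assuming the classic `gym` API from this era, before the Gymnasium rename):

```python
import gym

# Mountain Car with the classic Gym API.
env = gym.make("MountainCar-v0")

print(env.action_space.n)          # 3 -> 0: push left, 1: do nothing, 2: push right
print(env.observation_space.low)   # minimum position and velocity
print(env.observation_space.high)  # maximum position and velocity
```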

Some of the constants and variables we can set include learning rate, discount, number of attempts (episodes), and how often to visually display our current progress.
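The exact names and numbers vary, but a representative set of those knobs might look like this (the values here are placeholders, not necessarily the tutorial’s):

```python
LEARNING_RATE = 0.1  # how strongly each new experience overwrites the stored Q value
DISCOUNT = 0.95      # how much future rewards count relative to immediate ones
EPISODES = 4000      # how many attempts the agent gets at the environment
SHOW_EVERY = 500     # render/report progress every N episodes
```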

One of the aspects of training I hadn’t considered was environmental granularity. The environment tracks its state with two floating point numbers representing the position and velocity of the car. That state information is what the agent uses to judge the success of each attempt, and therefore how to adjust our Q table to maximize rewards in the future.

If we were to use the raw values, down to the 8-decimal-place precision, our Q table would be unreasonably massive due to the need to store every possible combination of states. Instead, we chop the environment into a series of “buckets” of variable granularity to keep our table to a manageable size.
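A minimal sketch of that bucketing for a 20x20 grid, reusing the `env` from the snippet above (the helper name `get_discrete_state` is my own, not necessarily the tutorial’s):

```python
import numpy as np

DISCRETE_OS_SIZE = [20, 20]  # 20 buckets for position, 20 for velocity
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE

def get_discrete_state(state):
    """Map a continuous (position, velocity) pair to integer bucket indices."""
    buckets = (state - env.observation_space.low) / discrete_os_win_size
    buckets = np.clip(buckets.astype(int), 0, np.array(DISCRETE_OS_SIZE) - 1)  # stay in range
    return tuple(buckets)

# One Q value per (position bucket, velocity bucket, action): 20 x 20 x 3 entries.
q_table = np.random.uniform(low=-2, high=0, size=DISCRETE_OS_SIZE + [env.action_space.n])
```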

Another “lightbulb moment” for me was the application of epsilon in the learning process. I hadn’t previously considered exactly how the agent would transition from random actions to using table-based decision making, and the gradually decreasing epsilon value makes a lot of sense.
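In code, that transition boils down to an epsilon-greedy choice: with probability epsilon, act randomly; otherwise, trust the table. A rough sketch, building on the pieces above (the function name `choose_action` is mine):

```python
epsilon = 0.5  # exploration rate; starts high-ish and decays toward 0 over training

def choose_action(discrete_state):
    """Random exploration with probability epsilon, otherwise greedy exploitation."""
    if np.random.random() < epsilon:
        return np.random.randint(0, env.action_space.n)  # explore: random push
    return int(np.argmax(q_table[discrete_state]))       # exploit: best known action
```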

The tutorial author’s original results seemed less than impressive to me at first glance. He was able to achieve reasonable success, but it was inconsistent.

4000 training episodes is clearly not enough

This graph is what the author generated with a bucket count of 20x20 and 4000 training episodes. It clearly wasn’t enough: there are certainly improvements, but the results are still inconsistent and the minimum rewards are too low.

I experimented with increasing the bucket count, but found that it didn’t provide much improvement. It did, however, greatly increase simulation and rendering time, especially when performing the graphing step after the Q table had been trained. This makes sense, as a 20x20x3 table has 1200 possible combinations, whereas a 40x40x3 table has 4800.
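The table-size math is easy to sanity-check directly:

```python
for buckets in (20, 40):
    entries = buckets * buckets * env.action_space.n
    print(f"{buckets}x{buckets} grid -> {entries} Q values")
# 20x20 grid -> 1200 Q values
# 40x40 grid -> 4800 Q values
```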

Eventually I wound up leaving the 20x20 bucket resolution and increasing the training to 40,000 episodes to get an idea of whether and when we could achieve more consistent results.

Clearly I get better “convergence” with more episodes, but there’s a catch

As you can see in the graph, we achieve very consistent positive results around 25k episodes. If I were doing something like a linear regression, my instinct would be to reduce my training to around 25k episodes, expecting that the result would resemble the above graph with the last 15k episodes chopped off. However, as noted in the caption, I expect there’s a catch.

The epsilon.

Currently, epsilon is set to decay to 0 halfway through the training process, which happens to be when the reliability of my table begins to stabilize. If I reduce my training to 25k episodes, I suspect the result won’t much resemble the above graph with its end chopped off, because epsilon will have fully decayed halfway through THAT range. Let’s find out!
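For context, here is roughly what that schedule looks like in code, under my assumption of a simple linear ramp to zero at the halfway mark (the variable names are mine):

```python
EPISODES = 40_000
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2  # epsilon reaches 0 halfway through training
epsilon = 0.5
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

for episode in range(EPISODES):
    # ... run one full episode here, updating the Q table as we go ...
    if START_EPSILON_DECAYING <= episode <= END_EPSILON_DECAYING:
        epsilon = max(0.0, epsilon - epsilon_decay_value)  # linear decay, floored at 0
```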

Eureka!

As expected, the epsilon decay determines exactly how soon we achieve a stable Q table. Our final result after 25k episodes isn’t quite as stable as it was after 40k, but it is what I personally would call reliable and consistent enough.

So… can we go lower? Is 15k enough?

Eeeeeehhhhhhhh…….

We get fairly stable max and average rewards, but the min is still too chaotic for me. Let’s see how we do with 20k.

That’s better

Interestingly, this particular run through the environment resulted in a generally lower maximum/average reward than previous runs. Based on the full renders I watched, I believe it’s because the initial random values in the Q table nudged it toward a slightly-less-than-optimal solution that was still good enough to learn from.

The next entry in this series will focus on my attempts to improve this application’s efficiency and performance.

What I Learned

Quite a bit! I like the fact that Q tables aren’t “mysterious” like deep neural networks. Everything is plain code and plain math. No high-level APIs to complex libraries were required, and there were no hidden layers that ask you to simply trust that they’re doing something useful.

As noted above, the gradually decaying epsilon value was a particularly useful concept for me in understanding reinforcement learning.

I also learned that in this type of example at least, generating hundreds of graphs with thousands of data points while stuck behind Python’s GIL is no fun, but more on that in the next entry.

What’s Left to Learn

The piece I continue to struggle with concerns the rewards of this particular example. The reward in the Mountain Car environment, as I understand it, is only given when the flag is reached. If that’s the case, I don’t see what would encourage the car to move toward the flag for the first few thousand episodes. If it hasn’t yet reached the reward, how can the table know how to adjust itself to maximize future chances?

Is it about “back propagation”? That’s a concept I’m still working on. It seems I need to look deeper into that and the environment’s rewards system to understand.
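For reference, the tabular update at the heart of this approach looks roughly like this (the function and variable names are my own placeholders, reusing the constants from earlier); my hunch is that the `max_future_q` term is where that backward flow of value happens, but that’s exactly what I want to verify:

```python
def q_update(discrete_state, action, reward, new_discrete_state, reached_goal):
    """One tabular Q-learning update step (a sketch, not the tutorial's exact code)."""
    current_q = q_table[discrete_state + (action,)]
    # The best value available from the *next* state; this is how good values near
    # the flag could gradually propagate backward to earlier states over many episodes.
    max_future_q = 0.0 if reached_goal else np.max(q_table[new_discrete_state])
    new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
    q_table[discrete_state + (action,)] = new_q
```

If that intuition holds, it would explain how the table improves long before the flag is being reached consistently.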
