Reinforcement Learning for Robotics: Q-Learning in Python

Reinforcement learning with Q-learning in Python is one of the most exciting frontiers in autonomous systems today. Unlike traditional rule-based programming, reinforcement learning lets a robot learn good behaviours through trial and error: it explores its environment, receives rewards for correct actions, and gradually improves its decision-making policy. For Indian makers and engineering students, this approach opens the door to building genuinely intelligent machines at minimal cost.

Reinforcement Learning Fundamentals
Understanding Q-Learning
Setting Up Your Python Environment
Implementing a Q-Table from Scratch
Applying Q-Learning to a Real Robot
Hardware Recommendations for RL Robotics
Frequently Asked Questions

Reinforcement Learning Fundamentals

Reinforcement learning (RL) is a type of machine learning where an agent learns by interacting with an environment. The agent observes the current state, takes an action, receives a reward signal, and transitions to a new state. The goal is to learn a policy — a mapping from states to actions — that maximises cumulative reward over time.

The key components of any RL system are:

Agent: The decision-maker (your robot or software controller)
Environment: The world the agent interacts with (physical space or simulation)
State (S): A representation of the current situation (sensor readings, position, etc.)
Action (A): What the agent can do (move forward, turn left, stop)
Reward (R): A scalar signal indicating how good the action was
Policy (π): The learned strategy — which action to take in each state
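
The interaction between these components can be sketched as a short loop. The example below is a minimal illustration, not part of any library: CorridorEnv, random_policy and run_episode are hypothetical names invented here, and the environment is a toy five-cell corridor with the goal in the last cell.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells 0..4, goal at cell 4."""

    N_CELLS = 5
    ACTIONS = ("LEFT", "RIGHT")

    def reset(self):
        # Every episode starts in the leftmost cell; the state is the cell index.
        self.position = 0
        return self.position

    def step(self, action):
        # Apply the action, clamping the position to the corridor bounds.
        move = 1 if action == "RIGHT" else -1
        self.position = min(max(self.position + move, 0), self.N_CELLS - 1)
        done = self.position == self.N_CELLS - 1
        reward = 1.0 if done else 0.0  # reward only when the goal is reached
        return self.position, reward, done


def random_policy(state):
    """A policy maps a state to an action; this one ignores the state."""
    return random.choice(CorridorEnv.ACTIONS)


def run_episode(env, policy, max_steps=100):
    """Run one episode: observe state, act, collect reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

Calling run_episode(CorridorEnv(), random_policy) returns the total reward for one episode. Learning means replacing random_policy with one that improves from the rewards it sees.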

RL is particularly powerful for robotics because robots operate in uncertain, dynamic environments where it is often impossible to hand-code every possible scenario.

Recommended: Waveshare General Driver Board for Robots (ESP32) — An ideal platform for deploying trained RL policies on a physical robot, with built-in WiFi, motor drivers, and sensor interfaces.

Understanding Q-Learning

Q-learning is a model-free RL algorithm, meaning it does not require prior knowledge of the environment’s dynamics. It learns a Q-function (also called the action-value function): Q(s, a) — the expected cumulative reward for taking action a in state s and following the optimal policy thereafter.

The Q-learning update rule is:

Q(s, a) ← Q(s, a) + α × [r + γ × max_a'(Q(s', a')) - Q(s, a)]

Where:

α (alpha): Learning rate (0 to 1) — how quickly to update Q-values
γ (gamma): Discount factor (0 to 1) — how much to value future rewards vs. immediate ones
r: Immediate reward received after taking action a in state s
s': The next state reached after taking the action; max_a' picks the largest Q-value over the actions a' available in s'

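Over many episodes, provided every state-action pair keeps being visited and the learning rate is reduced appropriately, the Q-values converge towards the optimal values, and the agent learns the best action to take in each state.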
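
To make the update rule concrete, here is a single update worked through in Python. The function name q_update and the numbers are illustrative assumptions chosen for this example, not values from a real training run.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Apply one Q-learning update to a single Q(s, a) value."""
    td_target = reward + gamma * max_q_next  # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_sa              # how far the current estimate is off
    return q_sa + alpha * td_error           # move a fraction alpha towards the target


# Illustrative numbers: Q(s, a) = 0.5, r = 1.0, best next-state value = 2.0
new_q = q_update(q_sa=0.5, reward=1.0, max_q_next=2.0, alpha=0.8, gamma=0.95)
print(round(new_q, 4))  # 2.42
```

The target is 1.0 + 0.95 × 2.0 = 2.9, the error is 2.9 - 0.5 = 2.4, and the estimate moves 80% of the way: 0.5 + 0.8 × 2.4 = 2.42.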

Setting Up Your Python Environment

Install the necessary libraries:

pip install numpy matplotlib gym

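We will use OpenAI Gym for simulation. The code in this article uses the Gym 0.26+ API, in which env.reset() returns (observation, info) and env.step() returns five values. Gym is no longer actively maintained; its successor, Gymnasium (pip install gymnasium), exposes the same API and can be used by changing the import to "import gymnasium as gym". The FrozenLake-v1 environment is an excellent starting point before moving to robot hardware.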

import numpy as np
import gym
import matplotlib.pyplot as plt

# Create the environment
env = gym.make('FrozenLake-v1', is_slippery=False)
n_states = env.observation_space.n
n_actions = env.action_space.n

print(f"States: {n_states}, Actions: {n_actions}")

Implementing a Q-Table from Scratch

For small state/action spaces, we can store Q-values in a table (matrix). Here is a complete Q-learning implementation in Python:

import numpy as np
import gym

# Hyperparameters
ALPHA = 0.8       # Learning rate
GAMMA = 0.95      # Discount factor
EPSILON = 1.0     # Exploration rate (starts high)
EPS_DECAY = 0.995 # Decay epsilon each episode
EPS_MIN = 0.01    # Minimum exploration
N_EPISODES = 5000

env = gym.make('FrozenLake-v1', is_slippery=False)
n_states = env.observation_space.n
n_actions = env.action_space.n

# Initialize Q-table with zeros
Q = np.zeros((n_states, n_actions))

rewards_per_episode = []

for episode in range(N_EPISODES):
    state, _ = env.reset()
    total_reward = 0
    done = False
    
    while not done:
        # Epsilon-greedy action selection
        if np.random.uniform(0, 1) < EPSILON:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(Q[state, :])     # Exploit
        
        # Take action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        
        # Q-learning update
        Q[state, action] += ALPHA * (
            reward + GAMMA * np.max(Q[next_state, :]) - Q[state, action]
        )
        
        state = next_state
        total_reward += reward
    
    # Decay epsilon
    EPSILON = max(EPS_MIN, EPSILON * EPS_DECAY)
    rewards_per_episode.append(total_reward)

print(f"Average reward (last 1000 episodes): {np.mean(rewards_per_episode[-1000:]):.3f}")
print("Q-Table:")
print(Q)
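
A raw Q-table is hard to read, so it helps to turn it into a policy and to smooth the reward history before plotting it. The sketch below is one way to do that. The helper names greedy_policy and moving_average are assumptions made for this example, and demo_q is a made-up stand-in for the trained Q array so the snippet runs on its own. The action names follow FrozenLake's action order (0 = LEFT, 1 = DOWN, 2 = RIGHT, 3 = UP).

```python
import numpy as np

ACTION_NAMES = ["LEFT", "DOWN", "RIGHT", "UP"]  # FrozenLake action indices 0..3


def greedy_policy(q_table):
    """Return the index of the highest-valued action for every state."""
    return np.argmax(q_table, axis=1)


def moving_average(values, window):
    """Smooth a per-episode reward list with a simple sliding mean."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


# Stand-in Q-table with two states, used only to demonstrate the helpers
demo_q = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.2, 0.0, 0.7, 0.0],
])
policy = greedy_policy(demo_q)
print([ACTION_NAMES[a] for a in policy])          # ['DOWN', 'RIGHT']
print(moving_average([0, 0, 1, 1], 2).tolist())   # [0.0, 0.5, 1.0]
```

With the training script above, greedy_policy(Q) gives the learned action per grid cell, and moving_average(rewards_per_episode, 100) can be passed to plt.plot to show how the success rate rises as epsilon decays.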

Applying Q-Learning to a Real Robot

Moving from simulation to a physical robot requires careful state and action space design. Here is how to structure it for a wheeled robot with IR sensors:

State design: Discretise each sensor's distance reading into bins. For example, a reading of 0–20 cm = “CLOSE”, 20–60 cm = “MEDIUM”, and 60 cm or more = “FAR”. With 3 sensors, you get 3³ = 27 possible states.

Action design: Keep the action space small — FORWARD, TURN_LEFT, TURN_RIGHT, STOP (4 actions).

Reward function: +10 for reaching the goal, -10 for collision, +1 for each step without collision, -1 for turning (encourages forward progress).
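
The reward scheme above can be written as a small function. This is a sketch under stated assumptions: the name compute_reward and its arguments are hypothetical, and the text does not say how the per-step and turning terms combine, so this version assumes they add up (a safe turn nets 0, a safe forward step nets +1) and that a collision or goal ends the step with its own reward.

```python
GOAL_REWARD = 10
COLLISION_PENALTY = -10
STEP_REWARD = 1      # each step without a collision
TURN_PENALTY = -1    # discourages spinning in place


def compute_reward(action, collided, reached_goal):
    """Return the reward for one step (assumed combination of the terms)."""
    if collided:
        return COLLISION_PENALTY
    if reached_goal:
        return GOAL_REWARD
    reward = STEP_REWARD
    if action in ("TURN_LEFT", "TURN_RIGHT"):
        reward += TURN_PENALTY  # assumption: the turn penalty offsets the step reward
    return reward


print(compute_reward("FORWARD", collided=False, reached_goal=False))    # 1
print(compute_reward("TURN_LEFT", collided=False, reached_goal=False))  # 0
```

Keeping the constants at the top makes it easy to tune them while watching how the robot's behaviour changes.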

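# Simple state encoder for 3 IR distance sensors (readings in cm)
def encode_state(left_ir, centre_ir, right_ir):
    """Map three distance readings to a single state ID in the range 0-26."""
    def discretise(reading):
        if reading < 20:
            return 0  # CLOSE
        elif reading < 60:
            return 1  # MEDIUM
        else:
            return 2  # FAR

    left = discretise(left_ir)
    centre = discretise(centre_ir)
    right = discretise(right_ir)
    # Treat the three bins as the digits of a base-3 number
    return left * 9 + centre * 3 + right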
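
With 27 states and 4 actions, the robot's Q-table is a 27 × 4 array indexed by the encoded state ID. The sketch below shows one possible way to set it up and pick actions with an epsilon-greedy rule. The names choose_action and decode_state are assumptions introduced here; decode_state simply reverses the base-3 encoding, which is handy when debugging.

```python
import numpy as np

ACTIONS = ["FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]
N_STATES = 27  # 3 sensors x 3 bins each -> 3**3 states

rng = np.random.default_rng(seed=0)        # seeded for repeatable runs
Q = np.zeros((N_STATES, len(ACTIONS)))     # one row per state, one column per action


def choose_action(state_id, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(ACTIONS)))  # random action index
    return int(np.argmax(Q[state_id]))          # best known action index


def decode_state(state_id):
    """Recover the (left, centre, right) bins from a state ID."""
    left, rest = divmod(state_id, 9)
    centre, right = divmod(rest, 3)
    return left, centre, right


print(decode_state(15))                       # (1, 2, 0)
print(ACTIONS[choose_action(15, epsilon=0.0)])  # FORWARD (all Q-values are still zero)
```

On the robot, each control step would read the sensors, encode the state, call choose_action, drive the motors accordingly, and then apply the same Q-learning update used in the simulation.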

Recommended: Waveshare AlphaBot2 Robot Building Kit for Raspberry Pi — Includes IR sensors and motor control, making it an ideal hardware platform for applying Q-learning algorithms.

Hardware Recommendations for RL Robotics

Building an RL robotics platform in India? Here is what you need:

Compute: Raspberry Pi 4 (4GB) for running Python Q-learning code. Handles real-time inference easily. Cost: ₹5,000–₹6,000.
Robot base: Wheeled platforms like AlphaBot2 or custom builds with differential drive motors. Cost: ₹2,000–₹8,000.
Sensors: Ultrasonic (HC-SR04, ₹50–₹100 each), IR proximity sensors (₹30–₹80 each), IMU for orientation.
Motor drivers: L298N or TB6612FNG for DC motor control. Cost: ₹150–₹400.

Recommended: Waveshare ESP32 Servo Driver Expansion Board — WiFi-enabled servo control board for building agile RL-trained robotic platforms.

Recommended: Waveshare DDSM115 Direct Drive Servo Motor — High-torque hub motor with low noise, excellent for building precision RL robot platforms.

Frequently Asked Questions

What is the difference between Q-learning and Deep Q-Learning (DQN)?

Q-learning uses a table to store Q-values — practical for small state spaces. Deep Q-Learning replaces the table with a neural network, enabling it to handle continuous or very large state spaces like camera images. Start with tabular Q-learning, then graduate to DQN using PyTorch or TensorFlow.

How long does a Q-learning robot take to train?

In simulation (Gym environments), training takes seconds to minutes on a modern CPU. On real hardware, each episode takes physical time, so real-world training of even simple behaviours may require hours of interaction. Pre-training the Q-table in simulation and then fine-tuning it on the robot can cut this time considerably.

Can Q-learning run on an Arduino?

Tabular Q-learning with a small state/action space can run on Arduino with careful memory management, but Raspberry Pi or ESP32 are far more practical. Inference (applying a trained Q-table) is very fast even on microcontrollers.
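
A quick calculation shows why a small table fits on a microcontroller. The sketch below assumes the 27-state, 4-action design from earlier in this article. The helper to_c_array is a hypothetical name, and the all-zero table is a placeholder for a trained one. For scale, an Arduino Uno (ATmega328P) has 2 KB of SRAM.

```python
import numpy as np

n_states, n_actions = 27, 4

# Full Q-table stored as 32-bit floats (needed if the board also learns)
q_table = np.zeros((n_states, n_actions), dtype=np.float32)
print(q_table.nbytes)  # 432 bytes

# Inference only needs the best action per state: one byte each
policy = np.argmax(q_table, axis=1).astype(np.uint8)
print(policy.nbytes)   # 27 bytes


def to_c_array(policy, name="POLICY"):
    """Format a policy as a C array literal to paste into a sketch."""
    values = ", ".join(str(int(a)) for a in policy)
    return f"const uint8_t {name}[{len(policy)}] = {{{values}}};"


print(to_c_array(np.array([0, 2, 1], dtype=np.uint8)))
# const uint8_t POLICY[3] = {0, 2, 1};
```

Training on a Raspberry Pi or PC and exporting only the policy array keeps the microcontroller code to a simple table lookup.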

What are good Python libraries for robotics RL?

OpenAI Gym and its maintained successor Gymnasium (environments), Stable-Baselines3 (pre-implemented RL algorithms), PyBullet and MuJoCo (physics simulators), and ROS 2 (Robot Operating System) with RL integrations are among the most widely used tools in India and globally.

Is reinforcement learning used in Indian robotics competitions?

Increasingly, yes. IIT hackathons and the Smart India Hackathon now feature RL-based robotics challenges more often, and the WRO Future Engineers category involves autonomous vehicles, where RL-based approaches can excel.
