RL Car Racing — DQN Agent — Oscar de Francesca

Overview

Final project for the Reinforcement Learning Practical course at the University of Groningen, supervised by Prof. Matthia Sabatelli. Built with my partner Konstantinos Chasiotis.

The challenge: teach an agent to drive a car around a procedurally generated track using only pixel observations — no access to car physics, no hand-crafted features. The agent must learn to steer, accelerate, and brake purely from visual input through trial and error.

Approach

We implemented a Deep Q-Network (DQN) for the CarRacing-v2 environment. The key challenge is that the observation space is high-dimensional (96x96 RGB pixels) and the environment's native action space is continuous (steering, gas, and brake). That combination makes it significantly harder than a classic benchmark like CartPole, with its low-dimensional state and two discrete actions, and it adds a control problem that Atari games, whose actions are already discrete, do not have.

Because DQN needs a finite set of actions, we discretize the action space and use a convolutional neural network to map raw pixel frames to one Q-value per action. The agent learns which actions maximize cumulative reward (staying on track, maintaining speed) through experience replay and temporal difference learning.
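The discretization and the temporal difference target described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the five-entry action set, the helper names `greedy_action` and `td_target`, and the discount factor of 0.99 are assumptions chosen for the example. The only environment detail relied on is that a continuous CarRacing action is a (steering, gas, brake) triple, with steering in [-1, 1] and gas and brake in [0, 1].

```python
# Hypothetical discrete action set (an assumption for illustration).
# Each entry is a continuous (steering, gas, brake) command.
DISCRETE_ACTIONS = [
    (0.0, 0.0, 0.0),   # coast
    (-1.0, 0.0, 0.0),  # steer left
    (1.0, 0.0, 0.0),   # steer right
    (0.0, 1.0, 0.0),   # accelerate
    (0.0, 0.0, 0.8),   # brake
]


def greedy_action(q_values):
    """Return the index of the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q_target(s', a').

    `next_q_values` stands for the target network's outputs for the
    next state. Terminal transitions have no bootstrap term.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Example with made-up Q-values for a single transition.
q_next = [0.2, 1.5, -0.3, 0.9, 0.0]
index = greedy_action(q_next)        # discrete action chosen by the network
command = DISCRETE_ACTIONS[index]    # continuous command sent to the env
target = td_target(reward=1.0, next_q_values=q_next, done=False)
```

In a full agent the network would be trained to move its Q-value for the action actually taken toward `target`, typically with a squared or Huber loss.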

Technical Details

State representation: Raw pixel observations from the environment (96x96 RGB), preprocessed and stacked for temporal context
Network architecture: CNN processing visual input → fully connected layers mapping to discrete action Q-values
Training: Experience replay buffer for sample decorrelation, target network for training stability, epsilon-greedy exploration
Hyperparameter tuning: Systematic exploration of batch sizes, learning rates, replay buffer sizes, and network depth documented in analysis notebooks
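The training components listed above can be sketched with the standard library alone. Everything here is an illustrative assumption, not the project's implementation: the class and function names (`FrameStack`, `ReplayBuffer`, `epsilon_at`, `select_action`, `should_sync_target`), the linear epsilon schedule, and all default values are hypothetical. Frames are treated as opaque objects; in practice they would be preprocessed image arrays.

```python
import random
from collections import deque


class FrameStack:
    """Keep the last `num_frames` observations to give temporal context."""

    def __init__(self, num_frames):
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # Fill the stack with copies of the first frame of an episode.
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return tuple(self.frames)

    def step(self, frame):
        self.frames.append(frame)  # the oldest frame drops out
        return tuple(self.frames)


class ReplayBuffer:
    """Fixed-size FIFO store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling decorrelates consecutive transitions.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def epsilon_at(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps`."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def should_sync_target(step, period=1_000):
    """True on steps where the target network would copy the online weights."""
    return step > 0 and step % period == 0
```

Batch size, buffer capacity, decay length, and sync period are exactly the kind of values the hyperparameter exploration would vary; the defaults shown are placeholders.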

Results

The trained agent successfully navigates the procedurally generated track, handling curves, maintaining speed on straights, and recovering from minor deviations:

Figure: Trained DQN agent autonomously navigating the CarRacing-v2 track

What I Learned

Visual RL is a fundamentally different challenge from state-based RL — the representation learning problem dominates
Experience replay and target networks aren't just theoretical improvements; without them the agent completely fails to converge
Hyperparameter sensitivity in DQN is extreme — small changes in batch size or learning rate can mean the difference between a competent driver and an agent that drives in circles
Closing the gap between "agent sometimes completes a lap" and "agent reliably drives well" takes careful reward shaping and stable training