Navigating the real world is hard. Traditional obstacle avoidance algorithms often rely on rigid, sensor-based logic or simple distance measurements. These approaches suffer from significant limitations: they require manual parameter tuning, struggle to recognize complex or unknown environments, and are prone to failure due to sensor noise. To overcome these constraints, this project combines Convolutional Neural Networks (CNNs) with Reinforcement Learning (RL) to create a robust and adaptive navigation system.
Training a robot in the real world is slow, expensive, and dangerous. If the robot crashes into a wall during early training, it breaks; it costs money to fix and time to reset. The solution is simulation. In simulation, I can crash the robot a million times at zero cost. For this project, I set out to train an NVIDIA JetBot mobile robot to navigate from a random starting point to a target coordinate while avoiding 10 randomized obstacles. To achieve high-fidelity physics and rapid training, I used the NVIDIA Isaac Sim and Isaac Lab simulation environments.
NVIDIA Isaac Sim is the foundational application and acts as the engine of the operation. It handles the heavy lifting: calculating complex physics (using PhysX) and rendering photorealistic visuals (using RTX ray tracing). It ensures that light, shadows, friction, and collisions behave as they would in the real world. NVIDIA Isaac Lab is a specialized framework built on top of Isaac Sim specifically for robot learning. While Isaac Sim simulates the world, Isaac Lab manages the training and integrates natively with popular RL libraries like SKRL, RL-Games, and RSL-RL. Training vision-based policies requires millions of interactions, which is impractical with physical robots; Isaac Lab provides high-fidelity physics plus massive parallelization for efficient learning.
Then I used the SKRL library, a lightweight and flexible RL framework that integrates seamlessly with Isaac Lab, to implement Proximal Policy Optimization (PPO) and manage the agent's neural architecture. PPO is the gold standard for continuous control.
The goal of this project is simple: navigate from Point A to Point B without hitting obstacles.
- The Robot: A JetBot with two velocity-controlled wheels
- The Sensors: A 64x64 RGB camera and an IMU.
- The Environment: A 300m x 300m plane filled with heavy (50kg) cuboid obstacles that the robot cannot simply push out of the way.
Nvidia Isaac Lab offers two primary workflows for environment design:
- Manager-Based Workflow: A modular approach where the environment is decomposed into separate managers for observations, rewards, actions, and terminations.
- Direct Workflow: A monolithic approach where the environment logic is encapsulated within a single class. The Direct Workflow centralizes the MDP (Markov Decision Process) logic: instead of separate managers for rewards or observations, everything is handled within one class, which also eliminates the overhead of Python function calls between managers.
This project uses the Direct Workflow, prioritizing computational performance and low-level control.
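To make that structure concrete, here is a minimal skeleton of a Direct-workflow environment. The method names follow Isaac Lab's DirectRLEnv pattern, but the import path varies between Isaac Lab versions and the bodies here are illustrative placeholders, not my actual implementation:

from isaaclab.envs import DirectRLEnv  # older versions expose this as omni.isaac.lab.envs

class JetbotDirectRLEnv(DirectRLEnv):
    def _setup_scene(self):
        # spawn the ground plane, the JetBot, the obstacles, and the sensors
        ...

    def _pre_physics_step(self, actions):
        # cache the latest wheel-velocity actions before the physics step
        self.actions = actions.clone()

    def _apply_action(self):
        # write the (scaled) wheel velocities to the robot's joints
        ...

    def _get_observations(self):
        # pack camera frames and goal information into a single tensor
        ...

    def _get_rewards(self):
        # compute the weighted reward terms described later in this post
        ...

    def _get_dones(self):
        # success / crash / timeout termination flags
        ...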
Observation and action spaces
The action space consists of two continuous values ranging from -1.0 to 1.0, representing the velocities for the left and right wheels.
- Action 0: Left Wheel Velocity.
- Action 1: Right Wheel Velocity.
I scale these values up (multiplying by 5.0) to translate the network's normalized output into actual physical velocity (rad/s) for the joints.
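As a rough sketch of that mapping (the scale constant matches the text above; the function and variable names are just illustrative):

import torch

WHEEL_SPEED_SCALE = 5.0  # rad/s at full network output

def scale_actions(raw_actions: torch.Tensor) -> torch.Tensor:
    """Map the policy's normalized outputs in [-1, 1] to joint velocities in rad/s.

    raw_actions has shape (num_envs, 2): column 0 = left wheel, column 1 = right wheel.
    """
    clipped = torch.clamp(raw_actions, -1.0, 1.0)
    return clipped * WHEEL_SPEED_SCALE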
For the Observation Space, I used a bit of a trick to keep the data pipeline clean. The environment returns a tensor of shape (3, 64, 64, 6). The first 3 channels are the RGB camera frames. The last 3 channels contain the scalar goal information (Goal X, Goal Y, Distance), which I expanded to match the image dimensions so they could be passed in a single container. My model unpacks this immediately during the forward pass.
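Here is a minimal sketch of that packing and unpacking, assuming a per-environment tensor of shape (3, 64, 64, 6); the function and tensor names are illustrative, not the exact ones in my code:

import torch

def pack_observation(frames: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
    """Combine stacked camera frames and goal scalars into one tensor.

    frames: (num_envs, 3, 64, 64, 3)  -- last 3 RGB frames per environment
    goal:   (num_envs, 3)             -- (goal_x, goal_y, distance)
    returns (num_envs, 3, 64, 64, 6)
    """
    # broadcast each scalar across the 64x64 grid so it can ride along with the image
    goal_planes = goal.view(-1, 1, 1, 1, 3).expand(-1, 3, 64, 64, 3)
    return torch.cat([frames, goal_planes], dim=-1)

def unpack_observation(obs: torch.Tensor):
    """Invert the packing inside the model's forward pass."""
    frames = obs[..., :3]       # (num_envs, 3, 64, 64, 3)
    goal = obs[:, 0, 0, 0, 3:]  # the scalars are constant over the grid, so read one pixel
    return frames, goal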
The primary sensor is a TiledCamera producing RGB images. The resolution is set to 64x64 pixels. Lower resolution significantly reduces the memory footprint and accelerates the CNN forward pass.
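Configured this way, the sensor looks roughly like the sketch below. The fields follow Isaac Lab's TiledCameraCfg; the import path, prim path, and spawn settings are placeholders that may differ between Isaac Lab versions:

import isaaclab.sim as sim_utils            # older versions: omni.isaac.lab.sim
from isaaclab.sensors import TiledCameraCfg  # older versions: omni.isaac.lab.sensors

# Illustrative configuration: 64x64 RGB frames rendered through the tiled pipeline.
camera_cfg = TiledCameraCfg(
    prim_path="/World/envs/env_.*/Robot/chassis/camera",  # placeholder prim path
    width=64,
    height=64,
    data_types=["rgb"],
    spawn=sim_utils.PinholeCameraCfg(),  # default pinhole intrinsics
)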
I designed a custom network that splits the problem into two distinct modalities: Vision (what the robot sees) and Proprioception (where the robot is relative to the goal).
The brain of my agent is a custom shared model that splits the workload between CNN and a Multi-Layer Perceptron (MLP). The CNN acts as the eyes of the JetBot.
CNNs are deep learning models designed to process data with a grid-like topology, such as images. The CNN processes the RGB images from the robot's camera; its job is to extract spatial features from the raw image data provided by the TiledCamera. I built a custom 4-layer CNN (using 2D convolutions) to process the visual data. Single images lack motion cues, so I stack the last 3 frames (64x64 RGB) to allow the network to estimate velocity and acceleration, and I apply the same CNN to each frame individually.
CNNs use kernels (or filters) to detect features in images. My CNN architecture consists of 4 convolutional layers:
- Layer 1: 5x5 kernels with stride 2 to aggressively downsample the image from 64x64.
- Layers 2 and 3: These extract deeper geometric features like edges and obstacle corners.
- Layer 4: A final 3x3 convolution that outputs high-level abstract features.
In deep learning, the receptive field refers to the region of the input image that a particular neuron in the network sees or analyzes.
The convolution operation extracts different features from the input. The first convolutional layer captures low-level features such as edges, lines, and corners; in these early layers, small receptive fields focus on fine details. As the network goes deeper, higher-level layers extract more complex features: larger receptive fields enable the model to capture more abstract, high-level patterns, such as whole obstacles. If the receptive field never grows sufficiently large, the CNN may fail to integrate information from distant parts of an obstacle, which leads to poor performance on tasks where global context or structure is key (e.g., object detection, semantic segmentation, or navigation).
At the final layer, one neuron sees a 41x41 patch of the original 64x64 image.
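To verify that number, here is the standard receptive-field recursion applied to the four layers (a quick standalone calculation, not part of the training code):

# Receptive field r and jump j after each layer: r_out = r_in + (k - 1) * j_in, j_out = j_in * s
layers = [(5, 2), (5, 2), (4, 2), (3, 2)]  # (kernel, stride) of the 4 conv layers

r, j = 1, 1
for k, s in layers:
    r = r + (k - 1) * j
    j = j * s
    print(f"kernel={k}, stride={s} -> receptive field {r}x{r}")
# final line prints: kernel=3, stride=2 -> receptive field 41x41

The full CNN, as defined in the model, is shown below.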
self.cnn = nn.Sequential(
    nn.Conv2d(3, 32, 5, stride=2), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, 5, stride=2), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2), nn.BatchNorm2d(128), nn.ReLU(),
    nn.Conv2d(128, 256, 3, stride=2), nn.BatchNorm2d(256), nn.ReLU(),
    nn.Flatten()  # -> 1024 features per frame
)
You will see nn.BatchNorm2d immediately after every nn.Conv2d layer. This is standard practice in modern CNN design to ensure deep networks can be trained effectively; Batch Normalization acts like a stabilizer between layers. Every convolutional layer uses stride=2, which down-samples the feature maps. ReLU sets negative values to zero and keeps positive values unchanged; this non-linearity allows CNNs to approximate complex functions and tackle tasks like image recognition and object detection.
After passing through these layers (and their Batch Norm and ReLU activations), the spatial data is flattened into a compact feature vector. Finally, the MLP takes the flattened visual features from the CNN and concatenates them with the raw goal vector (which I extracted back out of the observation tensor).
This fused vector passes through a 512-unit linear layer with Layer Normalization, followed by a 256-unit hidden layer. Finally, the network splits into two heads: the Policy Head, which outputs the wheel velocities, and the Value Head, which helps the critic estimate how good the current state is during training.
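Putting the pieces together, the fused network looks roughly like this sketch. Layer sizes follow the description above, but the class name, the SKRL wrapping, and the exact forward logic are simplified assumptions rather than my exact model:

import torch
import torch.nn as nn

class JetbotPolicySketch(nn.Module):
    """Illustrative vision + proprioception network, not the exact training model."""

    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2), nn.BatchNorm2d(256), nn.ReLU(),
            nn.Flatten(),  # -> 1024 features per frame
        )
        fused_dim = 3 * 1024 + 3  # 3 stacked frames + (goal_x, goal_y, distance)
        self.mlp = nn.Sequential(
            nn.Linear(fused_dim, 512), nn.LayerNorm(512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.policy_head = nn.Linear(256, 2)  # left/right wheel velocities
        self.value_head = nn.Linear(256, 1)   # critic's state-value estimate

    def forward(self, obs: torch.Tensor):
        # obs: (batch, 3, 64, 64, 6) -> split image channels and goal scalars
        frames, goal = obs[..., :3], obs[:, 0, 0, 0, 3:]
        # run the shared CNN on each of the 3 stacked frames
        feats = [self.cnn(frames[:, i].permute(0, 3, 1, 2)) for i in range(3)]
        fused = torch.cat(feats + [goal], dim=-1)
        hidden = self.mlp(fused)
        return self.policy_head(hidden), self.value_head(hidden)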
In RL, the reward function is the only way to communicate good and bad behavior to the robot. If the rewards are poorly balanced, the robot might spin in circles or refuse to move to avoid crashing.
I constructed the total reward as a weighted sum of six different behaviors.
Total_Reward = (Progress_Reward) + (Distance_Reward) + (Alignment_Reward)
- (Backward_Penalty) - (Collision_Penalty) + (Success_Bonus)
Where:
Progress_Reward = 2.5 * (Robot_Velocity dot_product Goal_Direction)
Distance_Reward = 0.5 * (Previous_Distance - Current_Distance)
Alignment_Reward = 0.5 * (Robot_Forward dot_product Goal_Direction)
Backward_Penalty = 0.5 * abs(Negative_Wheel_Velocity)
Collision_Penalty = 5.0 * Collision_Force_Intensity
Success_Bonus = 100.0 (if Distance < 0.4m)
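Translated into vectorized PyTorch, the reward computation looks roughly like this (a sketch using the weights above; the tensor names are illustrative):

import torch

def compute_rewards(robot_vel, robot_fwd, goal_dir, prev_dist, curr_dist,
                    wheel_vel, collision_force):
    """Illustrative per-environment reward; all inputs are batched over num_envs."""
    progress  = 2.5 * (robot_vel * goal_dir).sum(dim=-1)                  # velocity toward goal
    distance  = 0.5 * (prev_dist - curr_dist)                             # getting closer
    alignment = 0.5 * (robot_fwd * goal_dir).sum(dim=-1)                  # facing the goal
    backward  = 0.5 * torch.clamp(wheel_vel, max=0.0).abs().sum(dim=-1)   # penalize reversing
    collision = 5.0 * collision_force                                     # impact intensity
    success   = (curr_dist < 0.4).float() * 100.0                         # reached the goal
    return progress + distance + alignment - backward - collision + success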
The environment loop terminates and resets the episode if any of the following logical conditions become true.
Episode_Ends = (Success) OR (Crash) OR (Timeout)
Where:
Success = Distance_to_Goal < 0.4 meters
Crash = Impact_Force_on_Sensors > 0.1
Timeout = Episode_Duration > 60.0 seconds

The Simulation: Isaac Lab & Direct Workflow
The Direct Workflow in Isaac Lab keeps everything in one class (JetbotDirectRLEnv), simplifying development compared to the manager-based approach.
To start training 100 robots in parallel:
python scripts/reinforcement_learning/skrl/train.py \
--task Jetbot-Nav-Direct-v0 \
--num_envs=100 \
--enable_cameras

Command Breakdown:
- --task Jetbot-Nav-Direct-v0: Specifies the registered Jetbot navigation task.
- --num_envs 100: Runs 100 Jetbots in parallel to speed up data collection.
- --enable_cameras: Activates the onboard camera sensors for perception-based RL.
To train the policy, I used a high-performance workstation equipped with an NVIDIA RTX 5090 GPU. I configured the PPO agent with a learning rate of 3.0e-4 and a rollout length of 256 steps, allowing the model to gather sufficient interaction data before each update. A KL-divergence scheduler automatically lowers the learning rate if the policy changes too drastically, ensuring stable convergence. The training pipeline ran at approximately 18.60 iterations per second, and I paused the session at step 122,366 of the planned 500,000 steps, roughly 2 hours and 34 minutes into training.
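For orientation, the relevant knobs in the SKRL PPO configuration look roughly like this; only the values mentioned above are filled in, and everything else is assumed to stay at SKRL's defaults:

from skrl.agents.torch.ppo import PPO_DEFAULT_CONFIG

# Illustrative subset of the agent configuration.
ppo_cfg = PPO_DEFAULT_CONFIG.copy()
ppo_cfg["rollouts"] = 256          # environment steps collected before each PPO update
ppo_cfg["learning_rate"] = 3.0e-4
# A KL-adaptive learning-rate scheduler (SKRL ships one) is attached in the real config;
# it lowers the learning rate whenever the policy update drifts too far from the old policy.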
122366/500000 [2:34:11<5:38:22, 18.60it/s]

Even with just over 2.5 hours of training on the RTX 5090, the PPO agent successfully converged, learning to reliably navigate toward the red goal marker while consistently detecting and avoiding the blue obstacles in its path. No overfitting. No spinning. Just robust, visual navigation.
The core of my methodology is domain randomization. In every training episode, I randomize the robot's starting position, the goal location, and, most importantly, the positions of 10 different obstacles. This domain randomization forces the robot to learn robust navigation strategies rather than memorize a fixed path.
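At reset time, that randomization amounts to something like the sketch below (ranges and names are illustrative placeholders, not my exact values):

import torch

ARENA_HALF_EXTENT = 5.0  # placeholder: sampling range around each environment origin
NUM_OBSTACLES = 10

def sample_episode_layout(num_envs: int, device: str = "cuda"):
    """Draw a fresh robot start, goal, and obstacle layout for every environment."""
    def random_xy(n):
        return (torch.rand(num_envs, n, 2, device=device) * 2.0 - 1.0) * ARENA_HALF_EXTENT

    robot_start = random_xy(1).squeeze(1)    # (num_envs, 2)
    goal_pos = random_xy(1).squeeze(1)       # (num_envs, 2)
    obstacle_pos = random_xy(NUM_OBSTACLES)  # (num_envs, 10, 2)
    return robot_start, goal_pos, obstacle_pos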
I set up 100 parallel environments that run simultaneously on the GPU. At every step, 100 instances of my JetBot are trying to reach a target while avoiding 10 randomly placed obstacles. Running 100 environments with 100 cameras would normally overwhelm the GPU, but Isaac Lab’s TiledCamera solves this by rendering all camera views into a single large texture, drastically reducing memory and compute overhead.
I implemented the environment using Isaac Lab’s Direct Workflow.
I spawn two distinct 3D arrows floating above every robot instance:
- The Green Arrow (Forward): This represents the robot's actual heading (where it is currently facing).
- The Red Arrow (Command): This represents the ideal path (the direct vector to the goal).
Vision isn't enough: the robot needs to know exactly when it has failed. I attached Contact Sensors to three specific parts of the robot: the chassis, the left wheel, and the right wheel. These sensors are configured to trigger only when they physically collide with the specific prim paths of the 10 obstacles, so ground contacts are ignored and only impacts with obstacles are detected.
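In Isaac Lab, that filtering is expressed through the contact-sensor configuration. A rough sketch for the chassis sensor (the prim paths are placeholders, and the import path depends on the Isaac Lab version):

from isaaclab.sensors import ContactSensorCfg  # older versions: omni.isaac.lab.sensors

# Illustrative: one sensor per monitored body, reporting contacts only against the obstacles.
chassis_contact_cfg = ContactSensorCfg(
    prim_path="/World/envs/env_.*/Robot/chassis",               # placeholder prim path
    filter_prim_paths_expr=["/World/envs/env_.*/Obstacle_.*"],  # placeholder obstacle prims
)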
Once training is complete, you can visualize the agent's performance using the play.py script:
python scripts/reinforcement_learning/skrl/play.py \
--task Jetbot-Nav-Direct-v0 \
--num_envs 1 \
--checkpoint /path/to/checkpoints/best_agent.pt \
--enable_cameras

Below is a demo screenshot from Isaac Lab showing the agent successfully navigating the obstacle field.
Here is a demo video of the final policy in action:
In the video, the NVIDIA JetBot navigates dynamically, adjusting its path in real time based on camera input and the goal position.
By combining NVIDIA Isaac Lab's high-fidelity simulation, domain randomization, and a vision-proprioception-fused PPO agent, I trained a robust navigation policy in just 2.5 hours that drives from a random starting position to a target coordinate while avoiding 10 randomly placed obstacles in each episode. This system demonstrates a practical approach to vision-based navigation without explicit mapping or SLAM.
Future improvements:
- Add dynamic obstacles of different sizes using curriculum learning.
- Deploy the trained policy to a real NVIDIA JetBot using the NVIDIA Jetson platform.
- Incorporate depth or LiDAR for better 3D awareness.
- Use rsl_rl instead of the SKRL library for PPO.
All code is open-source and available on GitHub (link soon!).
The future of robot learning is sim-first, and tools like Nvidia Isaac Sim and Lab are making it accessible to everyone.