Since its initial launch in late 2023, the JetRover has undergone a significant evolution, transitioning from ROS 1 to ROS 2 and expanding its capabilities from basic Inverse Kinematics (IK) to sophisticated Large Language Model (LLM) integration. As of May 2026, we have successfully deployed multimodal LLMs on the platform, utilizing APIs like Qwen to enable advanced features such as scene understanding, visual tracking, and natural language voice control.
By merging traditional SLAM (Simultaneous Localization and Mapping) with these high-level "brains," the JetRover has moved beyond simple automation into the realm of Embodied AI. This combination allows the robot not only to navigate a physical space but also to understand the intent behind complex, multi-step instructions.
The Challenge: From Verbal Command to Physical Execution

To demonstrate this synergy, we designed a maze-based scenario. The JetRover is first tasked with mapping the environment to create a high-fidelity "memory" of the maze. Various colored blocks and corresponding boxes are then placed at random, with their general areas marked on the map.
A user provides a single, complex command: "Put the red, green, and blue blocks into their matching colored boxes and return to the starting point."
Once the command is received, the JetRover initiates a sophisticated six-step execution loop:
- Intent Analysis: The LLM parses the sentence, identifying the specific objects (blocks), the target locations (matching boxes), and the final return command.
- Autonomous Navigation: Utilizing SLAM, the robot plans and executes a path to the block zone using fused data from its LiDAR, high-precision encoders, and IMU.
- Spatial Perception: A 3D depth camera identifies the exact coordinates and orientation of the blocks in three-dimensional space.
- Precision Manipulation: The robot utilizes Inverse Kinematics (IK) to coordinate its robotic arm for a successful grasp.
- Task Sequencing: The robot repeats the navigation and placement process for each color-coded pair until the mission is complete.
- Homecoming: After the final drop-off, the JetRover autonomously returns to its designated starting point.
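The six steps above can be pictured as one orchestration loop. Here is a minimal Python sketch of that loop; the structured plan format and the subsystem calls (`navigate_to`, `grasp`, `place`) are hypothetical placeholders for illustration, not real JetRover APIs:

```python
# Minimal sketch of the six-step execution loop.
# All subsystem calls below are hypothetical stubs, not JetRover APIs.

def navigate_to(goal, log):   # stand-in for the SLAM/navigation stack
    log.append(f"navigate to {goal}")

def grasp(obj, log):          # stand-in for depth perception + IK grasp
    log.append(f"grasp {obj}")

def place(target, log):       # stand-in for the IK place motion
    log.append(f"place in {target}")

def execute_mission(plan):
    """Run pick-and-place for each color pair, then return home."""
    log = []
    for task in plan["tasks"]:                      # task sequencing
        navigate_to(f"{task['color']} block zone", log)
        grasp(f"{task['color']} block", log)
        navigate_to(f"{task['color']} box", log)
        place(f"{task['color']} box", log)
    if plan.get("return_home"):                     # homecoming
        navigate_to("start", log)
    return log

# A plausible plan the LLM's intent analysis might emit for the
# command "Put the red, green, and blue blocks into their matching
# colored boxes and return to the starting point."
plan = {"tasks": [{"color": "red"}, {"color": "green"}, {"color": "blue"}],
        "return_home": True}
mission_log = execute_mission(plan)
```

The key design point is the separation of concerns: the LLM only produces the structured plan, while navigation and manipulation stay deterministic and local.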
Ready to build? Check out the official JetRover Tutorials now.

The Tech Stack: A Super-Brain for the Edge
Integrating a multimodal LLM essentially provides the JetRover with a "Super Brain" capable of synthesizing text, voice, and visual data. Because it leverages cloud-based APIs, the robot can perform complex reasoning—such as automatically prioritizing tasks and dynamically planning paths—without the need for massive local training sets. Even if you don't explicitly tell the robot the order of operations, the LLM infers the hidden requirements of the task and sequences them logically.
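One way to picture that inferred ordering: even though the command never says so, a valid plan must grasp each block before placing it, and the gripper must be empty at the end. A small Python sketch of a sanity check the robot could run on an LLM-proposed step list (the `(verb, color)` step format is an assumption for illustration):

```python
def plan_is_ordered(steps):
    """Validate a list of (verb, color) steps: every 'place' must be
    preceded by a 'grasp' of the same color, the gripper can hold only
    one block at a time, and nothing is held when the mission ends."""
    holding = None
    for verb, color in steps:
        if verb == "grasp":
            if holding is not None:   # can't grasp with a full gripper
                return False
            holding = color
        elif verb == "place":
            if holding != color:      # must place what was grasped
                return False
            holding = None
    return holding is None            # gripper must end empty

good = [("grasp", "red"), ("place", "red"),
        ("grasp", "green"), ("place", "green")]
bad = [("place", "red"), ("grasp", "red")]
```

A check like this lets the robot reject a malformed LLM plan and re-prompt, rather than attempting an impossible motion.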
Underpinning this intelligence is a robust hardware foundation. The SLAM system serves as the robot’s "sensory eye," relying on LiDAR and IMU fusion to prevent the robot from losing its way, even in dynamic environments with moving obstacles. The addition of a 3D depth camera elevates this perception, allowing the robot to "see" depth and volume, which is critical for the hand-eye coordination required for precise pick-and-place tasks.
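To make the depth-camera step concrete: a standard pinhole-model deprojection turns a detected pixel plus its depth reading into a 3D point in the camera frame, which the arm can then target. A sketch with illustrative intrinsics (the `fx`, `fy`, `cx`, `cy` values are made up, not JetRover calibration data):

```python
def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Map pixel (u, v) with a depth reading in meters to a 3D point
    (X, Y, Z) in the camera frame, using the standard pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# Example: a block center detected at pixel (400, 300), 0.5 m away,
# with illustrative intrinsics for a 640x480 depth stream.
point = deproject(400, 300, 0.5, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
```

The resulting camera-frame point would still need a hand-eye calibration transform into the arm's base frame before the IK solver can use it.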
Conclusion

Embodied AI is about more than just stacking complex technologies; it is about creating robots that can interact with the world in a way that is both useful and intuitive. By combining SLAM with the cognitive power of LLMs, JetRover offers a glimpse into a future where robots truly understand the world around them.