This article is the fourth in a series exploring how to control robots with LLMs:
Part 1 — The Genius Taxi Driver
Part 2 — OLLAMA — Getting Started Guide for AMD GPUs
Part 3 — LANGCHAIN — Getting Started Guide for AMD GPUs
Part 4 — Implementing Agentic AI in ROS2 for Robotics Control
Part 5 — Evaluating Tool Awareness of LLMs for robotic control
Evaluating the Tool Awareness of our Agents

In a previous article, Part 2, we identified which open-source models were good candidates as subject matter experts. In other words, which LLMs had enough ROS2 knowledge to control a robot.
We now want to know if these same models can act on this knowledge, and use tools.
To do this, we will build on the work from the previous articles, Part 3 and Part 4, where we implemented ROS2 nodes that had agentic AI functionality implemented with LANGCHAIN.
DeepSeek-R1, although reported by OLLAMA as supporting tools, reported otherwise when executed. It will, therefore, not be evaluated.
In order to allow the LLM used by the agent to be changed dynamically, I also implemented a service that lets the ros2_ai_eval node re-initialize the LLM used in the ros2_ai_agent node.
The presence of the set_llm_mode service (of type srv/SetLlmMode) can be validated with the following command:
ros2 interface show ros2_ai_agent/srv/SetLlmMode

The set_llm_mode service can be invoked as follows:
ros2 service call /set_llm_mode ros2_ai_agent/srv/SetLlmMode "{enable: true, llm_api: 'ollama', llm_model: 'gpt-oss:20b'}"
ros2 service call /set_llm_mode ros2_ai_agent/srv/SetLlmMode "{enable: true, llm_api: 'ollama', llm_model: 'qwen2.5-coder:7b'}"

The "disable" option for the service is not yet implemented, so it will have no effect:
ros2 service call /set_llm_mode ros2_ai_agent/srv/SetLlmMode "{enable: false, llm_api: '', llm_model: ''}"

An additional ROS2 node, created specifically for evaluation, re-configures the agent with a specific LLM (if needed), provides the input prompt, monitors the agent's tool calls, then captures the final output in order to evaluate its success or failure at a given task.
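The pass/fail logic of such an evaluation node can be sketched in plain Python. This is an illustrative sketch, not the actual implementation: the `Task` and `evaluate_task` names are my own, and I assume each task has one or more accepted tool-call sequences that are compared against what is observed on /llm_tool_calls.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One evaluation task: the prompt sent to the agent and the
    tool-call sequence(s) accepted as a correct answer."""
    task_id: str
    prompt: str
    accepted_solutions: list = field(default_factory=list)


def normalize(call: str) -> str:
    """Strip whitespace so 'move_forward( -2.0 )' matches 'move_forward(-2.0)'."""
    return call.replace(" ", "").strip("'\"")


def evaluate_task(task: Task, observed_calls: list) -> bool:
    """Return True if the observed tool-call sequence matches any
    accepted solution for this task."""
    observed = [normalize(c) for c in observed_calls]
    return any([normalize(c) for c in sol] == observed
               for sol in task.accepted_solutions)


task2 = Task("Task-02", "Backup turtle 2.0 units.",
             accepted_solutions=[["move_forward(-2.0)"],
                                 ["rotate(180.0)", "move_forward(2.0)"]])

print(evaluate_task(task2, ["move_forward( -2.0 )"]))  # True
```

Keeping this check outside of ROS2 (pure Python on captured call lists) makes it easy to unit-test the scoring independently of the live agent.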
Here is our final ROS2 graph for evaluating the turtlesim agent:
Here is our final ROS2 graph for evaluating the robotic arm agent:
In order to test the evaluation ROS2 node, I started with the turtlesim agent, with the following tasks:
- [Task-01] Advance turtle 2.0 units.
- [Task-02] Backup turtle 2.0 units.
- [Task-03] Rotate 90 degrees.
- [Task-04] Draw a 5-point star of size 3.0 units.
The first test run allowed me to confirm that “Tool Awareness” via monitoring of /llm_tool_calls is working.
Another aspect I had not considered, “Self Awareness”, is not occurring: despite having a get_pose() tool, none of the models are using it, and “hit the wall” events are occurring quite often.
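The missing self-awareness step can be sketched as a simple pre-flight check: query the pose, then predict whether the requested move stays inside the field. The bounds and helper name below are my own assumptions (the default turtlesim field spans roughly 0.0 to 11.09 units on both axes), not part of the agent's API.

```python
import math

# Assumed turtlesim field bounds (default window is ~11.09 x 11.09 units).
FIELD_MIN, FIELD_MAX = 0.0, 11.09


def would_hit_wall(x: float, y: float, theta: float, distance: float) -> bool:
    """Predict whether moving `distance` units from pose (x, y, theta)
    would leave the turtlesim field."""
    nx = x + distance * math.cos(theta)
    ny = y + distance * math.sin(theta)
    return not (FIELD_MIN <= nx <= FIELD_MAX and FIELD_MIN <= ny <= FIELD_MAX)


# The turtle spawns near the center (~5.54, 5.54) facing +x:
print(would_hit_wall(5.54, 5.54, 0.0, 2.0))  # False: stays inside
print(would_hit_wall(5.54, 5.54, 0.0, 8.0))  # True: crosses the right wall
```

A model that called get_pose() before each move could apply exactly this kind of reasoning and avoid the wall.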
The following screen capture (played back at 8x speed) illustrates the cumulative effect of each LLM, resulting in “hit the wall” events:
Based on the /llm_tool_calls monitoring, the following LLM models have passed the “Tool Awareness” evaluation.
The six (6) LLM models that have passed this Tool Awareness test are:
- openai — gpt-4o-mini, gpt-4o
- ollama — qwen2.5:32b, qwen2.5:72b
- ollama — qwen3:8b, qwen3:32b
OpenAI’s results are the ones that we are trying to reproduce with open-source models.
The Qwen 2.5 results indicate that there is no advantage in using the 72b parameter model, since the 32b parameter model performs just as well. The 7b parameter model almost succeeded, but failed on the fourth task.
If we exclude the double results (taking action twice), the best open-source model (favoring fewer parameters) is:
- qwen3:8b
The following task is a trick question, since the system prompt does not explicitly state that the move_forward() function can be used to back up by specifying a negative value.
- [Task-02] Backup turtle 2.0 units.
Most solutions were:
- ['move_forward(-2.0)']
However, openai’s gpt-4o got creative with this correct solution:
- ['rotate(180.0)', 'move_forward(2.0)']
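We can verify that gpt-4o's answer ends at the same position as the expected one by replaying both call sequences on a toy 2D pose. This replay is my own sketch (it only models the two tool names shown above), not part of the evaluation node.

```python
import math


def replay(calls, x=0.0, y=0.0, theta=0.0):
    """Apply move_forward/rotate tool calls to a simple (x, y, theta) pose
    and return the final position."""
    for call in calls:
        name, arg = call.split("(")
        value = float(arg.rstrip(")"))
        if name == "rotate":
            theta += math.radians(value)  # rotate() assumed to take degrees
        elif name == "move_forward":
            x += value * math.cos(theta)
            y += value * math.sin(theta)
    return x, y


a = replay(["move_forward(-2.0)"])
b = replay(["rotate(180.0)", "move_forward(2.0)"])
print(a, b)  # both end 2.0 units behind the start point
```

A pose-based comparison like this is more forgiving than exact string matching, since it accepts any sequence that reaches the correct final position.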
The following task is a stretch goal, taken from NASA JPL’s ROSA turtlesim demo:
- [Task-04] Draw a 5-point star of size 3.0 units.
I had to specify a max size, since some LLMs would attempt to draw a star that exceeded the boundaries of the turtlesim simulation.
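For reference, a correct 5-point star requires a 144-degree turn at each point (the five turns sum to 720 degrees, i.e. two full revolutions), which is exactly where many of the drawings go wrong. A sketch of the expected tool-call sequence, assuming the same move_forward()/rotate() tools shown earlier:

```python
def star_calls(points: int = 5, size: float = 3.0) -> list:
    """Tool-call sequence for an n-point star of a given edge length."""
    turn = 360.0 * 2 / points  # 144.0 degrees for a 5-point star
    calls = []
    for _ in range(points):
        calls.append(f"move_forward({size})")
        calls.append(f"rotate({turn})")
    return calls


print(star_calls())  # 10 alternating move/rotate calls
```

An LLM that instead turns 72 degrees (the pentagon angle) draws a pentagon, not a star, which matches the kind of incorrect shapes observed.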
However, in most cases, what is actually drawn by the turtle is incorrect.
In order to better reconcile the LLMs’ intent (their tool calls) with the actual turtlesim visual result, I had to implement the following additional features:
- reset robot (for every new configuration of the agent with a new LLM)
- wait for current tool calls to finish before continuing … implemented with programmable delays (task_delay, tool_delay)
- add self awareness instruction in system prompt
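The article's fix uses fixed programmable delays (task_delay, tool_delay); a timeout-based poll is a common alternative worth sketching. The helper below and its arguments are illustrative, not the node's API.

```python
import time


def wait_until(condition, timeout: float = 5.0, poll: float = 0.1) -> bool:
    """Poll `condition()` until it returns True or `timeout` seconds elapse.
    Returns the final value of the condition."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll)
    return condition()


# e.g. wait for the agent to report an empty tool-call queue:
print(wait_until(lambda: True))  # True immediately
```

Polling a "tool calls finished" condition adapts to slow models automatically, whereas a fixed delay must be tuned to the slowest one.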
After fixing the previous issues, I re-ran the evaluation session with similar results, except that the turtlesim now correctly reflected the tool calls generated by the LLMs.
Oddly, despite improving the visual results of turtlesim, the evaluation results were slightly lower, with only four (4) LLM models passing the Tool Awareness test:
- openai — gpt-4o-mini
- ollama — qwen2.5:32b
- ollama — qwen3:8b, qwen3:32b
I will still make note of the following LLMs which “almost” passed:
- openai — gpt-4o
- ollama — qwen2.5:7b
The following screen capture (played back at 8x speed) illustrates the correct results for two LLMs (qwen2.5:32b and qwen3:8b):
The following video captures the full session (34 minutes):
The following animation captures the results of each model:
Additional Learning Resources

I want to take the time to acknowledge the following learning resources that were indispensable for this exploration journey.
Andrew Ng’s “Agentic AI” course is phenomenal!
Agentic AI with Andrew Ng
Although DeepLearning.AI will ask for a membership, you can sign up for a trial and take the course for free. You just won’t get the certificate at the end, but you will benefit from the learning.
My main take-away from this course was the following quote from Andrew Ng:
“One thing that distinguishes teams that are able to execute agentic workflows really well, versus those who are not as efficient at it, is their ability to drive a disciplined evaluation process.”
Conclusion

In this article, we implemented a strategy to evaluate our agentic AI ROS2 nodes automatically. If we implement updates, we will be able to quickly evaluate whether performance has improved or degraded.
Version History

2025/12/01 — Initial version