This article is the fourth in a series exploring how to control robots with LLMs:
Part 1 — The Genius Taxi Driver
Part 2 — OLLAMA — Getting Started Guide for AMD GPUs
Part 3 — LANGCHAIN — Getting Started Guide for AMD GPUs
Part 4 — Implementing Agentic AI in ROS2 for Robotics Control
Part 5 — Evaluating Tool Awareness of LLMs for robotic control
Evaluating the Tool Awareness of our Agents

In a previous article, Part 2, we identified which open-source models were good candidates as subject matter experts. In other words, which LLMs had enough ROS2 knowledge to control a robot.
We now want to know if these same models can act on this knowledge, and use tools.
To do this, we will build on the work from the previous articles, Part 3 and Part 4, where we implemented ROS2 nodes that had agentic AI functionality implemented with LANGCHAIN.
DeepSeek-R1, although reported by OLLAMA as supporting tools, reported otherwise when executed. It will, therefore, not be evaluated.
In order to allow the LLM used by the agent to be changed dynamically, I also implemented a service that lets the ros2_ai_eval node re-initialize the LLM used in the ros2_ai_agent node.
The presence of the set_llm_mode service (of type srv/SetLlmMode) can be validated with the following command:
ros2 interface show ros2_ai_agent/srv/SetLlmMode

The set_llm_mode service can be invoked as follows:
ros2 service call /set_llm_mode ros2_ai_agent/srv/SetLlmMode "{enable: true, llm_api: 'ollama', llm_model: 'gpt-oss:20b'}"
ros2 service call /set_llm_mode ros2_ai_agent/srv/SetLlmMode "{enable: true, llm_api: 'ollama', llm_model: 'qwen2.5-coder:7b'}"

The "disable" option for the service is not yet implemented, so it will have no effect:
ros2 service call /set_llm_mode ros2_ai_agent/srv/SetLlmMode "{enable: false, llm_api: '', llm_model: ''}"

An additional ROS2 node, created specifically for evaluation, re-configures the agent with a specific LLM (if needed), provides the input prompt, monitors the agent's tool calls, then captures the final output in order to evaluate its success or failure at a given task.
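The pass/fail logic of such an evaluation node can be sketched in plain Python. This is an illustrative sketch, not the actual implementation: the `Task` and `evaluate_task` names are my own, and I assume each task has one or more accepted tool-call sequences that are compared against what is observed on /llm_tool_calls.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One evaluation task: the prompt sent to the agent and the
    tool-call sequence(s) accepted as a correct answer."""
    task_id: str
    prompt: str
    accepted_solutions: list = field(default_factory=list)


def normalize(call: str) -> str:
    """Strip whitespace so 'move_forward( -2.0 )' matches 'move_forward(-2.0)'."""
    return call.replace(" ", "").strip("'\"")


def evaluate_task(task: Task, observed_calls: list) -> bool:
    """Return True if the observed tool-call sequence matches any
    accepted solution for this task."""
    observed = [normalize(c) for c in observed_calls]
    return any([normalize(c) for c in sol] == observed
               for sol in task.accepted_solutions)


task2 = Task("Task-02", "Backup turtle 2.0 units.",
             accepted_solutions=[["move_forward(-2.0)"],
                                 ["rotate(180.0)", "move_forward(2.0)"]])

print(evaluate_task(task2, ["move_forward( -2.0 )"]))  # True
```

Keeping this check outside of ROS2 (pure Python on captured call lists) makes it easy to unit-test the scoring independently of the live agent.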
Here is our final ROS2 graph for evaluating the turtlesim agent:
Here is our final ROS2 graph for evaluating the robotic arm agent:
In order to test the evaluation ROS2 node, I started with the turtlesim agent, with the following tasks:
- [Task-01] Advance turtle 2.0 units.
- [Task-02] Backup turtle 2.0 units.
- [Task-03] Rotate 90 degrees.
- [Task-04] Draw a 5-point star of size 3.0 units.
The first test run allowed me to confirm that “Tool Awareness” via monitoring of /llm_tool_calls is working.
Another aspect I had not considered, “Self Awareness”, is not occurring: despite having a get_pose() tool, none of the models are using it, and “hit the wall” events are occurring quite often.
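The missing self-awareness step can be sketched as a simple pre-flight check: query the pose, then predict whether the requested move stays inside the field. The bounds and helper name below are my own assumptions (the default turtlesim field spans roughly 0.0 to 11.09 units on both axes), not part of the agent's API.

```python
import math

# Assumed turtlesim field bounds (default window is ~11.09 x 11.09 units).
FIELD_MIN, FIELD_MAX = 0.0, 11.09


def would_hit_wall(x: float, y: float, theta: float, distance: float) -> bool:
    """Predict whether moving `distance` units from pose (x, y, theta)
    would leave the turtlesim field."""
    nx = x + distance * math.cos(theta)
    ny = y + distance * math.sin(theta)
    return not (FIELD_MIN <= nx <= FIELD_MAX and FIELD_MIN <= ny <= FIELD_MAX)


# The turtle spawns near the center (~5.54, 5.54) facing +x:
print(would_hit_wall(5.54, 5.54, 0.0, 2.0))  # False: stays inside
print(would_hit_wall(5.54, 5.54, 0.0, 8.0))  # True: crosses the right wall
```

A model that called get_pose() before each move could apply exactly this kind of reasoning and avoid the wall.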
The following screen capture (played back at 8x speed) illustrates the cumulative effect of each LLM, resulting in “hit the wall” events:
Based on the /llm_tool_calls monitoring, the following LLM models have passed the “Tool Awareness” evaluation.
The six (6) LLM models that have passed this Tool Awareness test are:
- openai — gpt-4o-mini, gpt-4o
- ollama — qwen2.5:32b, qwen2.5:72b
- ollama — qwen3:8b, qwen3:32b
OpenAI’s results are the ones that we are trying to reproduce with open-source models.
The Qwen 2.5 results indicate that there is no advantage in using the 72b parameter model, since the 32b parameter model performs just as well. The 7b parameter model almost succeeded, but failed on the fourth task.
If we exclude the double results (taking action twice), the best open-source model (favoring fewer parameters) is:
- qwen3:8b
The following task is a trick question, since the system prompt does not explicitly state that the move_forward() function can be used to back up by specifying a negative value.
- [Task-02] Backup turtle 2.0 units.
Most solutions were:
- ['move_forward(-2.0)']
However, openai’s gpt-4o got creative with this correct solution:
- ['rotate(180.0)', 'move_forward(2.0)']
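We can verify that gpt-4o's answer ends at the same position as the expected one by replaying both call sequences on a toy 2D pose. This replay is my own sketch (it only models the two tool names shown above), not part of the evaluation node.

```python
import math


def replay(calls, x=0.0, y=0.0, theta=0.0):
    """Apply move_forward/rotate tool calls to a simple (x, y, theta) pose
    and return the final position."""
    for call in calls:
        name, arg = call.split("(")
        value = float(arg.rstrip(")"))
        if name == "rotate":
            theta += math.radians(value)  # rotate() assumed to take degrees
        elif name == "move_forward":
            x += value * math.cos(theta)
            y += value * math.sin(theta)
    return x, y


a = replay(["move_forward(-2.0)"])
b = replay(["rotate(180.0)", "move_forward(2.0)"])
print(a, b)  # both end 2.0 units behind the start point
```

A pose-based comparison like this is more forgiving than exact string matching, since it accepts any sequence that reaches the correct final position.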
The following task is a stretch goal, taken from NASA JPL’s ROSA turtlesim demo:
- [Task-04] Draw a 5-point star of size 3.0 units.
I had to specify a max size, since some LLMs would attempt to draw a star that exceeded the boundaries of the turtlesim simulation.
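For reference, a correct 5-point star requires a 144-degree turn at each point (the five turns sum to 720 degrees, i.e. two full revolutions), which is exactly where many of the drawings go wrong. A sketch of the expected tool-call sequence, assuming the same move_forward()/rotate() tools shown earlier:

```python
def star_calls(points: int = 5, size: float = 3.0) -> list:
    """Tool-call sequence for an n-point star of a given edge length."""
    turn = 360.0 * 2 / points  # 144.0 degrees for a 5-point star
    calls = []
    for _ in range(points):
        calls.append(f"move_forward({size})")
        calls.append(f"rotate({turn})")
    return calls


print(star_calls())  # 10 alternating move/rotate calls
```

An LLM that instead turns 72 degrees (the pentagon angle) draws a pentagon, not a star, which matches the kind of incorrect shapes observed.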
However, in most cases, what is actually drawn by the turtle is incorrect.
In order to better reconcile the LLMs’ intent (their tool calls) with the actual turtlesim visual result, I had to implement the following additional features:
- reset robot (for every new configuration of the agent with a new LLM)
- wait for current tool calls to finish before continuing … implemented with programmable delays (task_delay, tool_delay)
- add self awareness instruction in system prompt
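The article's fix uses fixed programmable delays (task_delay, tool_delay); a timeout-based poll is a common alternative worth sketching. The helper below and its arguments are illustrative, not the node's API.

```python
import time


def wait_until(condition, timeout: float = 5.0, poll: float = 0.1) -> bool:
    """Poll `condition()` until it returns True or `timeout` seconds elapse.
    Returns the final value of the condition."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll)
    return condition()


# e.g. wait for the agent to report an empty tool-call queue:
print(wait_until(lambda: True))  # True immediately
```

Polling a "tool calls finished" condition adapts to slow models automatically, whereas a fixed delay must be tuned to the slowest one.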
After fixing the previous issues, I re-ran the evaluation session with similar results, except that the turtlesim now correctly reflected the tool calls generated by the LLMs.
Oddly, despite improving the visual results of turtlesim, the evaluation results were slightly lower, with only four (4) LLM models passing the Tool Awareness test:
- openai — gpt-4o-mini
- ollama — qwen2.5:32b
- ollama — qwen3:8b, qwen3:32b
I will still make note of the following LLMs which “almost” passed:
- openai — gpt-4o
- ollama — qwen2.5:7b
The following screen capture (played back at 8x speed) illustrates the correct results for two LLMs (qwen2.5:32b and qwen3:8b):
The following video captures the full session (34 minutes):
The following animation captures the results of each model:
Additional Learning Resources

I want to take the time to acknowledge the following learning resources that were indispensable for this exploration journey.
Andrew Ng’s “Agentic AI” course is phenomenal!
Agentic AI with Andrew Ng
Although DeepLearning.AI will ask for a membership, you can sign up for a trial and take the course for free. You just won’t get the certificate at the end, but you will benefit from the learning.
My main take-away from this course was the following quote from Andrew Ng:
“One thing that distinguishes teams that are able to execute agentic workflows really well, versus those who are not as efficient at it, is their ability to drive a disciplined evaluation process.”
Conclusion

In this article, we implemented a strategy to evaluate our agentic AI ROS2 nodes automatically. If we implement updates, we will be able to quickly evaluate whether performance has improved or degraded.
Version History

2025/12/01 — Initial version