With the recent TikTok wave of being "performative" and our mutual hatred for a coffee shop chain at UIUC (Espresso Royale), we decided to develop a matcha-making robot as a potential MVP for a fully autonomous coffee shop.
Given the constraints of this hackathon, our stack primarily consists of the following:
- Modified XLeRobot for the hardware
- NVIDIA GR00T N1.5 vision-language-action model for decision making
- NVIDIA Jetson Thor for inference and processing
- 3 USB 2.0 Cameras for vision data
We have three main tasks for which we fine-tune our model:
- Pouring the matcha powder
- Pouring the water
- Whisking the concoction
The instructions for LeRobot installation are linked here for reference.
*Note that we worked in a Conda environment with Python 3.10.
Clone the repository and navigate into the directory:

```bash
git clone https://github.com/huggingface/lerobot.git
cd lerobot
```

Install the LeRobot library and related dependencies:

```bash
pip install -e .
pip install lerobot
pip install 'lerobot[feetech]'
```

Install our custom pip packages for bimanual control:

```bash
pip install lerobot-teleoperator-bi-so101-leader
pip install lerobot-robot-bi-so101-follower
```

Finding Ports
To find the USB port connection for each motor controller, run the following:

```bash
lerobot-find-port
```

You can find more comprehensive instructions in the SO-101 tab of the guide we linked above.
*Note that it is possible to set up static ports on Ubuntu; otherwise, every time you disconnect, run the command again to verify that your ports have stayed the same.
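As a quick sanity check after reconnecting, a short script like the one below lists the currently enumerated serial devices (a minimal sketch, assuming the pyserial package is installed; it is not part of our pipeline, and the example ports are just the ones from our setup):

```python
# List currently enumerated serial ports to verify the arm controllers
# re-appeared on the expected /dev paths after a reconnect.
# Assumes `pip install pyserial`.
from serial.tools import list_ports

expected = {
    "/dev/tty.usbmodem58CD1767761",  # left follower arm (replace with your port)
    "/dev/tty.usbmodem58760432021",  # right follower arm (replace with your port)
}

found = {port.device for port in list_ports.comports()}
for port in sorted(expected):
    status = "OK" if port in found else "MISSING"
    print(f"{port}: {status}")
```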
Finding Cameras
To find the USB port connection for each camera, run the following:

```bash
lerobot-find-cameras opencv
# or realsense for Intel RealSense cameras
```

You can find more comprehensive instructions in the Cameras tab of the guide we linked above.
*Note that it is theoretically possible to run more than two USB 2.0 camera streams on the NVIDIA Jetson hardware, but it is extremely difficult to implement in practice.
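If you want to double-check which index maps to which physical camera, a small OpenCV loop like the one below grabs a test frame from each index (a minimal sketch, assuming opencv-python is installed; the indices are just examples):

```python
# Grab one frame from each candidate camera index to confirm which physical
# camera (top, left, right) it corresponds to.
# Assumes `pip install opencv-python`.
import cv2

for index in (0, 1, 2):  # example indices; adjust to your setup
    cap = cv2.VideoCapture(index)
    ok, frame = cap.read()
    if ok:
        print(f"camera {index}: captured a {frame.shape[1]}x{frame.shape[0]} frame")
        cv2.imwrite(f"camera_{index}.jpg", frame)  # open the saved image to identify the view
    else:
        print(f"camera {index}: no frame (not connected or busy)")
    cap.release()
```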
Data Collection
The following is the command we used to run the data collection process:

```bash
# Replace the arm ports, camera indices, repo ID, and task string with your own values.
# Another task string we used is "Grab whisk and whisk green water".
lerobot-record --robot.type=bi_so101_follower \
    --robot.left_arm_port=/dev/tty.usbmodem58CD1767761 \
    --robot.right_arm_port=/dev/tty.usbmodem58760432021 \
    --robot.id=follower_arm \
    --robot.cameras='{
        top_rgb: {"type": "opencv", "index_or_path": 2, "width": 640, "height": 480, "fps": 30},
        left: {"type": "opencv", "index_or_path": 1, "width": 640, "height": 480, "fps": 30},
        right: {"type": "opencv", "index_or_path": 0, "width": 640, "height": 480, "fps": 30}
    }' \
    --teleop.type=bi_so101_leader \
    --teleop.left_arm_port=/dev/tty.usbmodem58FD0169181 \
    --teleop.right_arm_port=/dev/tty.usbmodem5A7A0591211 \
    --teleop.id=leader_arm \
    --dataset.repo_id=SIGRoboticsUIUC/matcha-making-hopeful \
    --dataset.num_episodes=5 \
    --dataset.single_task="Pour matcha powder" \
    --display_data=true \
    --resume=true  # resume flag not used for the first run (see below)
```

*Note that if you delete the backslashes and put everything on one line, there must be spaces between each argument (we used four), otherwise the command will error.
Before collecting data on a different device, the following command must be run before each session to prevent overwriting the repo's data:

```bash
rm -rf /Users/steph/.cache/huggingface/lerobot/SIGRoboticsUIUC/improved-matcha-dataset  # REPLACE THIS WITH ACTUAL FILEPATH
```

For each of the three tasks we collected roughly 30 episodes, but the exact number is up to you and how much overfitting you are willing to accept.
Remember to log in to Hugging Face and create your dataset repo before you start collecting data. Note that when you have a brand new dataset, do not use --resume=true on the first run of episodes to be uploaded to the dataset.
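If you prefer to script this cleanup, a small helper like the one below removes the locally cached copy of a dataset before re-recording (a minimal sketch, assuming the default Hugging Face cache location under ~/.cache/huggingface/lerobot; the repo ID is a placeholder):

```python
# Remove the locally cached copy of a LeRobot dataset before recording on a
# different device, so stale local data doesn't overwrite the repo's data.
# Assumes the default cache location under ~/.cache/huggingface/lerobot.
import shutil
from pathlib import Path

repo_id = "SIGRoboticsUIUC/matcha-making-hopeful"  # replace with your dataset repo
cache_dir = Path.home() / ".cache" / "huggingface" / "lerobot" / repo_id

if cache_dir.exists():
    shutil.rmtree(cache_dir)
    print(f"Removed cached dataset at {cache_dir}")
else:
    print(f"No cached dataset found at {cache_dir}")
```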
Dataset Collection Errors and Work-Arounds
The most significant thing to remember is that GR00T cannot remember its past actions (it lacks memory), so it does not know what it did previously. Therefore, for actions that require repeated motion or symmetrical movements, you may need to collect pristine episodes with a distinct start and end state for each task, or alter the trajectory in your recordings so that the states are asymmetric and do not overlap. Otherwise, the model will get confused, which can lead to inconsistent movements. The model does attempt to account for this by predicting many steps into the future (via the action horizon setting), but changing this argument, with inference duration in mind as a constraint, did not fix our consistency issues, perhaps due to the complexity of our task.
Dataset Conversion
Recently, GR00T has added support for LeRobot (and presumably its 3.0 dataset file format), but when we started, there was no such support. Therefore, we needed to convert our data using a custom script, which you can view in the conversion.py file in our forked Isaac-GR00T repo attached to this page. Essentially, this code converts the episode structure in the LeRobot dataset from 3.0 to the supported 2.1 (we developed it based on existing code; citation in the GitHub repo). Again, you may not need to do this, as the GR00T N1.5 Policy tab in the guide we linked at the beginning contains all of the necessary instructions.
*Note that GR00T allows you to define your own embodiment configuration, which is what our script uses.
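To give a rough idea of what the conversion involves, the sketch below splits consolidated data files back into one parquet file per episode, which is the shape the 2.1 layout expects. The file paths and layout names here are assumptions based on the public dataset formats, not an excerpt from conversion.py; the real script also rewrites metadata, stats, and videos.

```python
# Rough sketch of the core idea behind a 3.0 -> 2.1 conversion: regroup the
# consolidated data into one parquet file per episode. Paths and layout names
# are assumptions; see conversion.py in our fork for the full conversion.
# Assumes pandas and pyarrow are installed.
from pathlib import Path
import pandas as pd

src = Path("my_v30_dataset/data")         # hypothetical input path
dst = Path("my_v21_dataset/data/chunk-000")
dst.mkdir(parents=True, exist_ok=True)

# Load every consolidated parquet file and concatenate the frames.
frames = pd.concat(
    [pd.read_parquet(p) for p in sorted(src.rglob("*.parquet"))],
    ignore_index=True,
)

# Write one parquet per episode, keyed on the episode_index column.
for episode_index, episode in frames.groupby("episode_index"):
    out = dst / f"episode_{int(episode_index):06d}.parquet"
    episode.reset_index(drop=True).to_parquet(out)
    print(f"wrote {len(episode)} frames to {out}")
```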
Dataset Preparation
For the actual model, we have our own custom fork of NVIDIA GR00T N1.5 that supports our custom XLeRobot bimanual SO101 arm system, since this embodiment is not natively supported by GR00T. We have a triple-camera setup (at the hackathon, we had to fall back to a dual-camera setup after one of our cameras shorted out), which streams different perspectives of the scene to our inference service. As stated above, dataset conversion was necessary to make the data we collected work with our inference setup and the GR00T model architecture.
A major problem with general foundation models for robot manipulation is that they do not have a strong concept of memory. It is difficult for them to distinguish between visually similar states that are completely different in context (for example, moving a salt shaker to a position and moving it back look very similar, but are contextually very different). To counter this, we structured our data collection around encoding context in the scene itself. For example, for the water-pouring subtask of the general matcha-making process, we let the water container fall at the end of the task in our recordings so that the model knew the task was finished, creating a clear distinction from the state before the water had been poured. We achieved a similar state distinction for the matcha-pouring task through visual separation of the states (bowl empty vs. full). For the whisking task, which was our final task, we ensured that whenever the whisk was inside the bowl it was moving in a clockwise motion, and we ended our recordings abruptly without stopping the motion, so the model would not be confused about whether to stop or stir while in the bowl.
Rest of Steps
Bimanual control was the big change we needed to make; for the rest of the steps, refer to the process linked here.
Model Training
We trained several models using the NVIDIA Brev service and an H100. We ran into several issues with the flash-attention package on newer architectures like the RTX 5090, which required its own workaround. Depending on our model hyperparameters, training on the H100 took anywhere from 10 minutes to 2 hours. We found that a medium batch size (32-64) produced the best models, together with enough steps for that batch size to let the model train for at least 1 epoch (one pass over all your data). When calculating this, keep in mind that batch size refers to individual frames, not entire episodes (i.e., a batch of 32 covers 32 frames from the recorded episodes, not 32 episodes).
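To make the "at least one epoch" rule concrete, the back-of-the-envelope calculation looks like this (a sketch with example numbers; the frames-per-episode figure is an assumption, not our exact dataset size):

```python
# Estimate how many training steps are needed for one epoch, given that the
# batch size counts individual frames rather than episodes.
# Example numbers only; substitute your own episode count and lengths.
import math

num_episodes = 30          # episodes recorded for one task
frames_per_episode = 450   # assumed, e.g. ~15 s at 30 fps
batch_size = 32

total_frames = num_episodes * frames_per_episode
steps_per_epoch = math.ceil(total_frames / batch_size)

print(f"{total_frames} frames total -> at least {steps_per_epoch} steps for 1 epoch")
```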
Inference
Clone our Isaac-GR00T repository on both the host computer (which runs the model; ideally an Ubuntu system with a decent GPU) and the client computer, which runs the actual eval/inference script and is physically connected to the arms via cables (both can be the same computer, just using separate terminal windows). Run the corresponding inference script on the host computer, following the documentation in the tutorial linked earlier. Note the host's IP address, which is entered as an argument on the client-side computer connected to the arms. Then run the corresponding eval/inference script on the client-side computer/terminal, after first ensuring the connection is established.
We have a low-latency system in which the Jetson Thor runs the GR00T N1.5 model and sends actions over an Ethernet cable to the MacBook. The advantage is that we can stream all 3 USB 2.0 cameras concurrently to our inference service without having to connect all of the hardware to the NVIDIA Jetson or worry about common bandwidth issues.
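To illustrate the host/client split at a high level, here is a toy sketch of requesting action chunks from a host and applying them on a client over TCP. This is purely illustrative and is not the actual GR00T inference service (our fork's scripts handle that); the names, ports, and JSON message format are made up for the example.

```python
# Toy illustration of the host/client split: the "host" serves an action chunk
# per request, and the "client" applies it to the robot. This is NOT the real
# GR00T inference service; the message format and names are hypothetical.
import json
import socket
import threading

HOST, PORT = "127.0.0.1", 5555  # in the real setup this would be the Jetson's Ethernet IP

def host_loop(server: socket.socket) -> None:
    # Pretend to run the policy: reply to each request with a fixed action chunk.
    while True:
        conn, _ = server.accept()
        with conn:
            conn.recv(65536)  # read the (placeholder) observation
            chunk = {"actions": [[0.0] * 12 for _ in range(16)]}  # 16-step chunk
            conn.sendall(json.dumps(chunk).encode())

def main() -> None:
    server = socket.create_server((HOST, PORT))
    threading.Thread(target=host_loop, args=(server,), daemon=True).start()

    for step in range(3):  # client side: request and "apply" action chunks
        with socket.create_connection((HOST, PORT)) as conn:
            conn.sendall(json.dumps({"observation": "placeholder"}).encode())
            conn.shutdown(socket.SHUT_WR)
            actions = json.loads(conn.makefile().read())["actions"]
        print(f"step {step}: received a {len(actions)}-step action chunk")

if __name__ == "__main__":
    main()
```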
Web Interface
We also developed a prototype web interface between a user and the inference running on the Jetson Thor. To deploy it, clone the repository locally on the machine that will run the interface. If you are running client-server inference, with a computer such as a MacBook running the inference client and the Jetson Thor acting as the server, run the website locally. Currently, we have a prototype pipeline that captures audio input, saves it to a .wav file, and transcribes it with an offline Vosk model. In the future, we plan to parse the transcribed text for keywords and, depending on what is said, run different tasks. For example, we would say "Pour matcha powder," and the system would recognize that instruction and run the model with the appropriate text input.
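As a sketch of the keyword-parsing step we plan to add, mapping a transcription to one of the three task instructions could look something like this (a hypothetical helper, not yet in the repo; the "Pour water" instruction string is assumed, only the other two appear in this writeup):

```python
# Hypothetical sketch of the planned keyword parsing: map a Vosk transcription
# to one of the three task instructions the model was fine-tuned on.
# Not yet part of the web interface; "Pour water" is an assumed task string.
TASKS = {
    "matcha": "Pour matcha powder",
    "whisk": "Grab whisk and whisk green water",
    "water": "Pour water",
}

def parse_task(transcription: str) -> str | None:
    """Return the task instruction whose keyword appears in the transcription."""
    text = transcription.lower()
    for keyword, instruction in TASKS.items():
        if keyword in text:
            return instruction
    return None

print(parse_task("please pour matcha powder"))  # -> "Pour matcha powder"
print(parse_task("now whisk it"))               # -> "Grab whisk and whisk green water"
```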




