VR Teleoperated Robotic Arms are systems that combine virtual reality technology with robotic arm control, allowing users to manipulate and interact with real robotic arms through virtual reality devices. This technology has applications in various fields, including remote operation, training, and work in hazardous environments.
Dual-arm humanoid robots are designed to replicate the structure of the human upper body, including the head, torso, and arms. They mimic the movements and functions of the upper body to perform a variety of tasks, which gives them a wide range of applications in the field of virtual reality (VR).
The integration of virtual reality and humanoid robots can bring innovation and improvement to many fields, enhance user experience, expand application areas, and promote technological development. This combination blurs the boundary between the virtual and the real, creating exciting possibilities for the future.
Hardware Introduction

Mercury X1

The Mercury X1 is a wheeled humanoid robot meticulously developed by Elephant Robotics. It features 19 degrees of freedom (DOF), giving it high flexibility and adaptability. Its wheeled mobile base is powered by high-performance direct-drive motors, operates at 48 V, and is equipped with a 9-inch quantum dot touch screen, combining modernity and efficiency. The Mercury X1 offers a maximum battery life of 8 hours, a working radius of 450 millimeters, and a maximum payload capacity of 1 kilogram, making it highly suitable for lightweight operation tasks.
Its repeatability is ±0.05 millimeters, indicating its reliability in precision tasks. The robot has an expected lifespan of 5,000 hours, demonstrating excellent durability. The main control unit is an NVIDIA Jetson Xavier, and auxiliary control is handled by an ESP32, providing powerful data processing capability and stable control performance. The arms use a carbon fiber shell, making them not only sturdy and durable but also lightweight. The robot is equipped with an Orbbec Deeyea 3D camera and a linear four-microphone array, enabling efficient visual and audio processing. For communication, it supports Wi-Fi, Ethernet, Bluetooth, USB ports, and RS485, and it is compatible with ROS1/ROS2, so it can be easily integrated into various smart manufacturing and research environments.
The Mercury series, a pioneer in MVRM (Mercury Visual Reality Manipulation), represents a significant leap in combining VR with robotic technology.
VR Oculus Quest 3

The Oculus Quest 3 (officially the Meta Quest 3) is a virtual reality (VR) headset developed by Meta, the company formerly known as Facebook. The Quest 3 offers a developer mode, which allows developers to install and test their own applications directly on the device, and development is primarily supported through the Unity and Unreal Engine game engines.
The primary issue VR teleoperation needs to address is communication between the operator and the robot. For this, we chose HTTP-based communication, with a Flask server. Although choosing Python may cost some performance, pymycobot, the SDK for the Mercury robot, is a Python library, so Python is the most convenient choice for development and integration with the robot and is well suited to verifying the project's feasibility.
On the VR side, we chose Unity3D, whose rich community resources make this project possible. The service model follows the classic C/S structure. The VR side sends different commands to the robot by accessing specific URLs, which are then relayed to pymycobot by the server to execute the actual actions.
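To illustrate this client/server structure, the sketch below shows what such a Flask-based RPC server might look like. The endpoint name, port, and the `dispatch_to_robot()` placeholder (standing in for the actual pymycobot calls) are assumptions for illustration, not the project's real code.

```python
# Minimal sketch of a Flask RPC server for VR teleoperation (illustrative only).
import json
from flask import Flask, request

app = Flask(__name__)

def dispatch_to_robot(pose):
    # Placeholder standing in for the pymycobot call that moves the arm toward `pose`.
    pass

@app.route("/update", methods=["POST"])
def update():
    # The VR side posts "[x, y, z, rx, ry, rz, timestamp]" to this specific URL.
    pose = json.loads(request.get_data(as_text=True))
    dispatch_to_robot(pose[:6])
    return "ok"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```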
In terms of interaction design, we adopted a relative movement approach: pressing the A button on the controller marks the start of a movement, and releasing it marks the end.
One of the key technical challenges in VR teleoperation is ensuring low-latency video streaming. We adopted the Accelerated GStreamer plugins provided by the NVIDIA Jetson Xavier platform to achieve GPU-accelerated video encoding and decoding, aiming to optimize bandwidth utilization while ensuring real-time performance.
The Accelerated GStreamer plugins on the Jetson Xavier platform are a powerful tool that leverages the GPU's computing power for accelerated processing of video data. This innovative approach not only enhances the efficiency of video encoding and decoding but also significantly reduces latency while optimizing the use of network bandwidth.
Accelerated GStreamer is a set of GStreamer plugins provided by NVIDIA for its Jetson platform, designed to enhance performance and reduce latency by using the GPU to accelerate multimedia processing tasks. These plugins are built specifically for the Jetson platform, making full use of its GPU resources, and are suitable for video encoding/decoding, image processing, deep learning inference, and other multimedia applications.
Implementation Process

First, download and compile the RTSP server example source code (test-launch) provided by the official GStreamer project.
After compiling, you will get a file named "test-launch". The streaming pipeline is constructed using GStreamer commands. The test setup used is as follows:
```bash
nvarguscamerasrc sensor-id=0 ! video/x-raw(memory:NVMM), format=NV12, width=3264, height=2464, framerate=21/1 ! nvv4l2h264enc insert-sps-pps=true ! h264parse ! rtph264pay name=pay0 pt=96
```
- `nvarguscamerasrc sensor-id=0`
  - `nvarguscamerasrc` is a GStreamer plugin provided by NVIDIA for handling Argus cameras.
  - `sensor-id=0` specifies the camera sensor with ID 0; in multi-camera systems, a different sensor can be selected.
- `video/x-raw(memory:NVMM), format=NV12, width=3264, height=2464, framerate=21/1`
  - `video/x-raw(memory:NVMM)` indicates that the output is raw video data managed by NVIDIA's memory management.
  - `format=NV12` specifies the video frame format as NV12, a type of YUV format.
  - `width=3264, height=2464` specify the width and height of the video frame.
  - `framerate=21/1` specifies a frame rate of 21 frames per second.
- `nvv4l2h264enc insert-sps-pps=true`
  - `nvv4l2h264enc` is an H.264 encoding plugin provided by NVIDIA.
  - `insert-sps-pps=true` inserts the SPS (Sequence Parameter Set) and PPS (Picture Parameter Set) into the output stream, which is necessary for decoding H.264 video streams.
- `h264parse`
  - `h264parse` is a plugin used to parse H.264 data streams.
- `rtph264pay name=pay0 pt=96`
  - `rtph264pay` packages the H.264 data stream into RTP packets.
  - `name=pay0` assigns a name to this RTP payloader.
  - `pt=96` sets the RTP payload type to 96.
To combine the GStreamer pipeline with the RTSP server, run:
```bash
./test-launch "nvarguscamerasrc sensor-id=0 ! video/x-raw(memory:NVMM), format=NV12, width=3264, height=2464, framerate=21/1 ! nvv4l2h264enc insert-sps-pps=true ! h264parse ! rtph264pay name=pay0 pt=96"
```
This allows another host on the same local area network to access the stream over RTSP and verify the result; by default, test-launch serves the stream at rtsp://<host-ip>:8554/test, and a player such as VLC can open the RTSP URL directly.
Finally, rename the URLs bound in test-launch.c to "left" and "right" (building a separate RTSP server binary for each camera), and write start and stop scripts for convenient execution:
```bash
# launch-rtsp.sh
width=1920
height=1080
framerate=28
./left-rtsp-server "nvarguscamerasrc sensor-id=1 ! video/x-raw(memory:NVMM), format=NV12, width=$width, height=$height, framerate=$framerate/1 ! nvv4l2h264enc insert-sps-pps=true maxperf-enable=1 bitrate=8000000 ! h264parse ! rtph264pay name=pay0 pt=96" &
./right-rtsp-server "nvarguscamerasrc sensor-id=0 ! video/x-raw(memory:NVMM), format=NV12, width=$width, height=$height, framerate=$framerate/1 ! nvv4l2h264enc insert-sps-pps=true maxperf-enable=1 bitrate=8000000 ! h264parse ! rtph264pay name=pay0 pt=96" &
```
```bash
# stop-rtsp.sh
ps -al | grep "left\|right" | awk '{print $4}' | xargs kill
```
Teleoperation System

The main challenge faced by a VR teleoperation system is balancing the instability of network conditions against the need of hardware such as robotic arms for stable inputs. Striking this balance is one of the key factors in ensuring that the teleoperation system works effectively.
First, network instability can lead to delays, packet loss, or unpredictable bandwidth situations. This directly affects the real-time interaction between remote users and the robotic arm. In VR teleoperation systems, users need to feel real-time changes in the virtual environment and respond instantly to the movements of the robotic arm. Network delays can cause a noticeable lag between the user's commands and the actual actions of the robotic arm, reducing the accuracy and smoothness of the operation.
On the other hand, hardware devices like robotic arms require high stability in inputs. Since the teleoperation system needs to transmit user inputs to the robotic arm in real-time, the instability of input signals can lead to abnormal or uncontrolled movements of the robotic arm. This can pose serious problems for tasks requiring high precision and reliability, such as surgical operations and industrial maintenance.
Dilemma of VR Teleoperation

To understand this problem, one must first understand how the robotic arm moves. In traditional robotic arm motion, a point is sent to the arm, and the arm automatically plans acceleration and deceleration to reach the destination smoothly. In teleoperation, however, this automatic planning causes the arm to repeatedly speed up and slow down between every two sampled points, preventing continuous movement; acceleration and deceleration should instead be controlled by the operator's hand. Therefore, to achieve smooth teleoperation, the robotic arm theoretically needs an interface that abandons automatic planning and interpolates directly through the sampled points. We named this interface the "velocity fusion interface".
However, the arm's underlying interpolation interface usually requires extremely high signal stability, on the order of milliseconds, to keep the arm running smoothly. This interpolation interface, which works directly on the sampled points, lets the Mercury robot's VR control move toward a specified target point within a given time. For example, if you send a coordinate together with a time of 50 ms, the robotic arm will move toward that coordinate for those 50 ms, stopping either when it reaches the target or when the time expires.
Ideally, if you continuously and stably feed movement commands to the robotic arm through velocity fusion, and the sending interval exactly matches the movement time slice, then in theory the arm can follow the human hand as long as the speed limit is not exceeded. However, if the next command does not arrive in time (for example, because of network latency fluctuations), the robotic arm stops immediately, and the lack of a deceleration phase causes visible jitter.
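The idealized loop described above, before any buffering is added, can be sketched as follows. `send_pose_with_time()` is a hypothetical stand-in for the robot's velocity fusion interface, and the 50 ms value simply echoes the earlier example.

```python
# Conceptual sketch of the ideal (unbuffered) velocity-fusion control loop.
import time

TIME_SLICE_MS = 50  # time the arm is allowed to move toward each target

def send_pose_with_time(pose, time_slice_ms):
    # Hypothetical stand-in: move toward `pose`, stopping when the target
    # is reached or the time slice expires.
    pass

def teleop_loop(sample_vr_pose):
    while True:
        pose = sample_vr_pose()            # latest controller sample
        send_pose_with_time(pose, TIME_SLICE_MS)
        time.sleep(TIME_SLICE_MS / 1000)   # ideally matches the time slice exactly
```

In practice the sending interval never matches the time slice perfectly, which is exactly where the jitter described above comes from.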
A reasonable improvement is to add a buffer, which smooths out the errors caused by network latency and provides a relatively steady stream of instructions to the robotic arm. However, this reduces immediacy. If a buffer is added, the time slice passed to the velocity fusion interface must be smaller than the actual sending interval: only if instructions are consumed more slowly than they are produced can we guarantee that the buffer always holds an instruction, avoiding the sudden stops caused by running out of commands. But this also means that under stable network conditions the buffer is always full, adding a fixed delay of time slice × buffer size to the whole control system; for example, a 50 ms time slice with a five-slot buffer adds roughly 250 ms of latency.
Robotic arms require stable control signals because their movement is fundamentally driven by motors. In regular point-to-point movement, the arm automatically plans acceleration and deceleration internally, achieving relatively stable movement. However, in the field of teleoperation, real-time performance is crucial, so velocity fusion interfaces are typically designed to directly send points sampled from the VR end to the robotic arm. This design ensures real-time performance but also introduces the problem of difficulty in achieving both low latency and stability.
If we pursue real-time performance and forward each point as soon as it arrives from the network, we cannot fully predict when the next control signal will come, so the time parameter we supply may deviate from reality. If too little time is given, the arm's movement cannot be continuous; if too much time is given, the next command may arrive before the arm has finished moving, causing commands to pile up and real-time performance to degrade. If we pursue stability instead, a buffer queue is needed to smooth out communication-induced latency fluctuations, which in turn increases control delay. This is the dilemma of teleoperation systems: low latency and stability cannot both be fully achieved at once.
VR Headset End Implementation

The VR end was implemented with Unity3D and the XR Interaction Toolkit. Unity3D was chosen mainly for its simplicity, allowing rapid familiarization and supporting various VR platforms; using Unreal Engine might have required extensive time on issues unrelated to this project. The XR Interaction Toolkit, Unity's official VR framework, was used instead of the Oculus plugin so that the project can easily be ported to other VR devices.
In terms of the development environment, besides using Unity3D's built-in Editor, Visual Studio 2022 Community Edition was used for C# editing. The UMP (Universal Media Player) plugin was imported as the RTSP Player, along with Oculus’ official basic VR materials.
Initially, the plan was to use absolute coordinates to directly translate the movements of the controller on a one-to-one basis, but this method proved inconvenient. Firstly, the arm span of the Mercury robot is roughly equivalent to that of a teenager, and using absolute 1:1 control could easily lead to the robotic arm hitting its joint limits. Moreover, the initial position of the controller posed a problem. Hence, I decided to adopt a relative control method, where the robotic arm moves only when a specific button is pressed. The advantage of this approach is that you can adjust your posture freely when the button is not pressed, and when you start moving, the robotic arm always moves relative to your current position, making control much easier.
In terms of video transmission, the initial method was to have the server forward MJPG images to the VR end, which were then rendered on screen as textures. This was simple to implement but had many drawbacks in latency and CPU load: forwarding images through the server required at least one extra copy and made it difficult to use NVIDIA's built-in encoders/decoders, and the VR end also needed time for decoding and copying, so overall latency and CPU load were high. I later switched to the GStreamer + NVIDIA acceleration plugin solution described above; using hardware acceleration significantly improved both latency and load.
Furthermore, during development I gained a deeper understanding of Unity3D + Quest 3 as a teleoperation platform. In Unity3D, almost all computation is frame-aligned. Although it is possible to spawn threads, the resources provided by the game engine are refreshed once per frame; for instance, the highest refresh rate I can get for the controller's coordinates equals the game's frame rate. Therefore, on this platform, without interpolation or similar tricks, the upper limit of the teleoperation sampling frequency is effectively the frame rate, typically 70–90 Hz. Consequently, I did not use multithreading to send data, but instead used the most common approach in Unity3D: coroutines. Coroutines keep operations frame-aligned and avoid many synchronization-related issues.
```c#
// One example of a coroutine that samples the controller pose and posts it to the server
IEnumerator sendUpdate()
{
    if (!flag_origin_initialized) yield break;

    // Position relative to the origin captured when the button was pressed, converted to millimeters
    Vector3 pos = gameObject.transform.position;
    pos -= origin_pos;
    pos *= 1000;

    Vector3 rotation = gameObject.transform.rotation.eulerAngles;
    if (isZero(pos) || isZero(rotation))
        yield break;
    rotation = angleConvert(rotation);

    // Payload: [x, y, z, rx, ry, rz, timestamp_ms]
    string content = $"[{pos.x:F2},{pos.y:F2},{pos.z:F2},{rotation.x:F2},{rotation.y:F2},{rotation.z:F2},{System.DateTime.UtcNow.Ticks / TimeSpan.TicksPerMillisecond}]";
    Debug.Log(content);

    using (UnityWebRequest www = UnityWebRequest.Post(updateURL, content, "application/json"))
    {
        www.timeout = 1;
        yield return www.SendWebRequest();
        Debug.Log(www.result);
        if (www.result != UnityWebRequest.Result.Success)
        {
            Debug.Log(www.error);
        }
    }
}

// Call this from the main thread (e.g. inside Update() to stay in sync with the main loop)
StartCoroutine(sendUpdate());
```
Server-Side Implementation

On the server side, I chose Flask for its simplicity and reliability, since my needs were essentially a lightweight control framework serving as an RPC (Remote Procedure Call) layer. Flask's streamlined design was well suited to this task.
A significant challenge in the control system was aligning Unity3D with the robotic arm platform, specifically translating coordinates from the VR world into coordinates the robotic arm understands. The Mercury robot's upper body is essentially composed of two Mercury single arms. From the perspective of a single arm, its coordinates are based on its own base, i.e., the position of the upper arm joint, not the waist as one might expect. Therefore, a coordinate transformation tool was needed to convert from the VR world to Mercury's base coordinate system and then to each single arm's coordinate system. This process involved extensive use of linear algebra and robotics theory.
For example, the following code snippet accomplishes the transformation of coordinates from the VR world to the base coordinate system:
```python
import numpy as np

# Convert a pose sampled in the VR world into the robot's base coordinate system.
# `posture` is [x, y, z, rx, ry, rz] (position in millimeters followed by Euler angles).
def vr_to_base(posture):
    position = np.array(posture[:3])
    rotation = np.array(posture[3:])

    # Homogeneous transform of the VR pose
    matrix = cvt_euler_angle_to_rotation_matrix(rotation)
    T = np.vstack((np.hstack((matrix, position.reshape(3, 1))), np.array([0, 0, 0, 1])))

    # Fixed transform from the VR frame to the robot base frame (axis remap plus translation offset)
    TRB = np.array([[0, 0, 1, 0], [-1, 0, 0, 0], [0, 1, 0, 310], [0, 0, 0, 1]])
    Tp = np.dot(TRB, T)

    rotation_matrix = Tp[:3, :3]
    position = Tp[:3, 3]
    rotation = cvt_rotation_matrix_to_euler_angle(rotation_matrix)
    return np.hstack((position, rotation))
```
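The helper functions `cvt_euler_angle_to_rotation_matrix` and `cvt_rotation_matrix_to_euler_angle` are not shown above. A minimal version could be built on SciPy as below; the intrinsic XYZ order and degree units are assumptions, and the project's actual convention may differ.

```python
# Possible implementations of the Euler/rotation-matrix helpers used above.
# The angle convention (intrinsic XYZ, degrees) is an assumption.
from scipy.spatial.transform import Rotation

def cvt_euler_angle_to_rotation_matrix(euler_angles):
    return Rotation.from_euler("xyz", euler_angles, degrees=True).as_matrix()

def cvt_rotation_matrix_to_euler_angle(matrix):
    return Rotation.from_matrix(matrix).as_euler("xyz", degrees=True)
```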
An interesting issue we encountered was the mirrored configuration of the left and right arms. The engineer responsible for the robot's algorithm and I had to use different firmware for each arm to make them compatible with the same set of transformation algorithms.
As mentioned earlier, the robotic arm requires stable control signals at fixed intervals, which necessitates a caching system and a high-precision timing system for signal dispatch.
The first issue is buffering the control instructions, with thread safety as a primary concern. Since the server runs in a multi-threaded environment, two network packets may arrive simultaneously, or a packet may arrive just as signals are being dispatched, so designing a thread-safe double-ended queue as the cache was the first task. Next come the buffer size, dispatch interval, and time slice parameters. These parameters are the knobs that trade off delay against stability in the dispatch system and require extensive experimentation to tune. The hardest balance to strike is between the dispatch interval and the time slice: network latency is unpredictable and processing on the machine also takes time, so the real interval between dispatches is longer than the nominal one, and by an amount that varies. Because of this, the parameters need to be set conservatively to keep the system stable.
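As a rough illustration (not the project's exact code), the instruction cache can be a lock-protected deque: the Flask request handlers act as producers and the dispatch thread as the consumer, so `self.bucket` in the dispatch thread shown below would be an instance of something like this.

```python
# Illustrative thread-safe instruction buffer (a sketch, not the actual project code).
from collections import deque
from threading import Lock

class InstructionBuffer:
    def __init__(self, maxlen=5):
        self.lock = Lock()
        self.queue = deque(maxlen=maxlen)  # oldest instructions are dropped when full

    def put(self, instruction):
        # Producer side: called from the Flask request handlers.
        with self.lock:
            self.queue.append(instruction)

    def get(self):
        # Consumer side: called from the dispatch thread; returns None when empty.
        with self.lock:
            return self.queue.popleft() if self.queue else None

    def __len__(self):
        with self.lock:
            return len(self.queue)
```

Whether a full buffer should drop the oldest instruction or block the producer is itself a latency/stability trade-off.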
The precision of the `time` module in the Python standard library is not adequate for this purpose, so relying solely on `sleep` for timing is not feasible. My current approach is to dedicate a separate thread to high-speed polling, trading significant CPU usage for low-latency dispatch. Practical tests show that this results in lower latency than sleeping for the full interval:
```python
# Dedicated dispatch thread: poll at high frequency instead of sleeping a full time slice.
from threading import Thread
import time

class MyThread(Thread):
    ...
    def run(self) -> None:
        while running:  # `running` is a module-level stop flag
            # Dispatch the next buffered instruction once a full time slice has elapsed
            if len(self.bucket) != 0 and time.time_ns() - self.last_time > self.time_slice:
                # do something
                self.last_time = time.time_ns()
            time.sleep(0.00001)  # short sleep keeps the polling loop from saturating a CPU core
```
Optimization

After completing the basic framework, simple filtering can be applied to the dispatched position points to further eliminate jitter. A human hand in VR cannot hold perfectly still at a single point, nor trace a perfectly straight line during linear movement. We can treat each coordinate component separately and suppress changes that fall below a threshold, yielding a more stable motion trajectory:
```python
# Dead-band filter: keep the previous value for any axis whose change is below
# the threshold (coordinates are in millimeters), suppressing small hand jitter.
a = final_base_coords[:3]
if self.last_arm_base_coords is not None:
    b = self.last_arm_base_coords[:3]
    diff = a - b
    final_base_coords[:3] = np.where(abs(diff) > 5, a, b)
```
Another potential optimization is to average the trajectory, which makes it smoother, though the trade-off may be an increase in latency. Below is an implementation based on a double-ended queue (deque) used as a sliding window:
```python
from collections import deque
from threading import Lock

import numpy as np

class SlidingWindow:
    """Thread-safe fixed-length window used to average recent samples."""

    def __init__(self, maxlen=5) -> None:
        self.lock = Lock()
        self.store = deque(maxlen=maxlen)
        # Pre-fill the window so mean() is defined before real samples arrive
        for i in range(maxlen):
            self.store.append(1)
        self.maxsize = maxlen

    def append(self, obj):
        with self.lock:
            self.store.append(obj)

    def mean(self):
        with self.lock:
            return np.mean(np.array(self.store), axis=0)

    def clear(self):
        with self.lock:
            self.store.clear()

    def size(self):
        with self.lock:
            return len(self.store)
```
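As a usage example (the variable names and sample values are illustrative), one window per coordinate component smooths that component before dispatch:

```python
# Illustrative usage: average the most recent 5 samples of one coordinate component.
x_window = SlidingWindow(maxlen=5)
for new_x in [120.0, 121.5, 119.8, 120.7, 120.2]:  # dummy controller samples
    x_window.append(new_x)
    smoothed_x = x_window.mean()  # value that would actually be dispatched
```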
Robotic Arm Motion Control

1. Virtual Reality Interface: In the VR environment, users can see a virtual model of the robotic arm and operate it through VR devices. The user's movements are mapped onto the virtual model and instantly transformed into motion commands for the robotic arm.
2. Data Processing and Transmission: The captured motion data needs to be processed by software to be converted into instructions understandable by the robotic arm. These instructions typically involve calculations of joint angles, speeds, or positional information. Then, these instructions are sent to the robotic arm controller over the network.
3. Robotic Arm Execution: Upon receiving the instructions, the robotic arm executes the corresponding actions through its internal control system. This may include moving to a specific position, moving along a particular path, or performing complex hand movements.
Demo

Summary

The advancement of VR technology is breaking through the constraints of physical space, enabling complex operations and interactions in virtual environments. This holds significant implications for fields like remote control, education, and healthcare. As VR technology continues to evolve, we can anticipate a broader array of innovative applications. For instance, the use of VR in simulating complex surgeries, remote education, and disaster response training is expected to become more extensive.
Considering the potential of the Mercury X1, how would you envision using it?
Elephant Robotics has officially launched the Mercury series of robots today. Currently, the Mercury series includes three products: Mercury-A1-Lightweight 7-DOF Cobot; Mercury-B1 Dual-arm 7-DOF Semi-humanoid Robot; and Mercury-X1, Universal Wheeled Humanoid Robot. All three products feature industrial designs meticulously crafted by a Swedish team and integrate seven core robot algorithms, offering multiple usage and development methods.