Every time I watch Grogu raise his little green hand and move something across the room, the same thought hits me: I want to do that.
Not with CGI. With real hardware.
So I built it. A gesture recognition system that watches your hand through a webcam, figures out what you're doing — swiping, lifting, pushing — and sends that movement to a robot arm in real time. You raise your palm, the arm goes up. You push forward, it pushes forward. You swipe left, it slides left.
It feels exactly like using the Force. And it took one Python file to make it work.
Recommended Hardware: myPalletizer 260 M5
The myPalletizer 260 M5 is a compact 4-axis robotic arm ideal for AI + vision-based interaction projects. In this build, it executes real-time motion based on hand gestures detected via computer vision, making it perfect for rapid prototyping of human–robot interaction.
With native Python support through pymycobot, it integrates seamlessly with tools like OpenCV and MediaPipe—allowing you to turn visual input into precise physical actions with minimal setup.
How It Works
The system has six stages, and they run every single frame:
Camera → MediaPipe → Feature Extraction → Temporal Buffer → Gesture Engine → Robot
1. Camera grabs a frame. Standard webcam, 1280×720, 30 fps. The frame is mirrored so your movements feel natural — move your hand right, the arm goes right.
2. MediaPipe finds your hand. Google's MediaPipe Hands model detects 21 landmarks on your hand in real time — fingertips, knuckles, wrist, everything. No training required. It just works out of the box.
3. Features are extracted. From those 21 points, I compute three things every frame:
import time
import numpy as np

# WRIST, INDEX_MCP, MIDDLE_MCP, PINKY_MCP and PALM_LANDMARKS are MediaPipe
# landmark indices defined elsewhere in the file
def extract_features(landmarks) -> FrameFeatures:
    pts = np.array([[lm.x, lm.y, lm.z] for lm in landmarks])

    # Palm center: mean of the wrist and the four MCP knuckles
    palm_center = pts[PALM_LANDMARKS].mean(axis=0)

    # Palm normal: cross product of two vectors lying in the palm plane
    wrist_pt = pts[WRIST]
    index_vec = pts[INDEX_MCP] - wrist_pt
    pinky_vec = pts[PINKY_MCP] - wrist_pt
    normal = np.cross(index_vec, pinky_vec)
    norm_mag = np.linalg.norm(normal)
    if norm_mag > 1e-6:  # leave a degenerate (near-zero) normal unnormalized
        normal /= norm_mag

    # Hand scale: wrist-to-middle-knuckle distance
    hand_scale = float(np.linalg.norm(pts[MIDDLE_MCP] - pts[WRIST]))

    return FrameFeatures(
        palm_center=palm_center,
        palm_normal=normal,
        hand_scale=hand_scale,
        timestamp=time.time(),
    )

● Palm center: the average position of the wrist and four MCP knuckles. This tracks where your hand is.
● Palm normal: a vector perpendicular to your palm surface. This tells the system which way your palm is facing.
● Hand scale: the distance from your wrist to your middle knuckle. This changes as you push your hand toward or away from the camera.
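To make the three features concrete, here is a small dependency-free sketch that computes them for a synthetic hand. The landmark indices follow MediaPipe's hand model (0 is the wrist; 5, 9, 13, 17 are the MCP knuckles). The helper names and the sample coordinates are made up for illustration and are not taken from the project code.

```python
import math

# MediaPipe hand landmark indices
WRIST, INDEX_MCP, MIDDLE_MCP, RING_MCP, PINKY_MCP = 0, 5, 9, 13, 17
PALM_LANDMARKS = [WRIST, INDEX_MCP, MIDDLE_MCP, RING_MCP, PINKY_MCP]


def sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def palm_features(pts):
    """pts: 21 [x, y, z] points. Returns (center, unit normal, scale)."""
    center = [sum(pts[i][k] for i in PALM_LANDMARKS) / len(PALM_LANDMARKS)
              for k in range(3)]
    normal = cross(sub(pts[INDEX_MCP], pts[WRIST]),
                   sub(pts[PINKY_MCP], pts[WRIST]))
    mag = norm(normal)
    if mag > 1e-6:
        normal = [c / mag for c in normal]
    scale = norm(sub(pts[MIDDLE_MCP], pts[WRIST]))
    return center, normal, scale


# Synthetic open hand lying flat in the image plane (every z is 0)
pts = [[0.0, 0.0, 0.0] for _ in range(21)]
pts[WRIST] = [0.5, 0.8, 0.0]
pts[INDEX_MCP] = [0.4, 0.5, 0.0]
pts[MIDDLE_MCP] = [0.5, 0.5, 0.0]
pts[RING_MCP] = [0.55, 0.5, 0.0]
pts[PINKY_MCP] = [0.6, 0.5, 0.0]

center, normal, scale = palm_features(pts)
print(center, normal, scale)
```

With the palm flat in the image plane, the normal has only a z component, so it points straight along the camera axis. That is the situation the lift/drop check looks for.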
4. A temporal buffer collects 20 frames. One frame isn't enough to know what gesture you're doing. Is the hand moving left, or did it just appear there? The system stores the last 20 frames of features in a circular buffer. Once the buffer is full, the gesture engine looks at the whole window to decide what happened.
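The buffer class itself isn't shown in this post, so here is one assumed way to build it: a `deque` with `maxlen=20` behaves like a circular buffer, evicting the oldest frame on every append once it is full. The class and method names mirror how the buffer is used in the gesture engine, but the body is a sketch, not the project's code.

```python
from collections import deque


class TemporalBuffer:
    """Fixed-size window over the most recent per-frame features (sketch)."""

    def __init__(self, size=20):
        self._frames = deque(maxlen=size)  # oldest frame drops off automatically

    def push(self, center, normal, scale):
        self._frames.append((center, normal, scale))

    def full(self):
        return len(self._frames) == self._frames.maxlen

    def clear(self):
        self._frames.clear()

    def centers(self):
        return [frame[0] for frame in self._frames]


buf = TemporalBuffer(size=20)
for i in range(25):
    # Hand drifting right by 0.01 per frame
    buf.push((i * 0.01, 0.5, 0.0), (0.0, 0.0, 1.0), 0.3)

oldest_x = buf.centers()[0][0]  # frame 5: the first five frames were evicted
newest_x = buf.centers()[-1][0]  # frame 24
```

After 25 pushes the window holds frames 5 through 24, so the engine always compares motion across the latest 20 frames only.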
5. The gesture engine fires. This is a threshold-based state machine — no ML model needed. It compares the first and last frames in the buffer:
def evaluate(self, buf: TemporalBuffer) -> Gesture:
    if not buf.full() or not self._cooled_down():
        return Gesture.NONE

    centers = buf.centers()
    # normals() and scales() are assumed accessors that mirror centers()
    normals = buf.normals()
    scales = buf.scales()
    delta = centers[-1] - centers[0]
    dx, dy = delta[0], delta[1]

    # Wipe: horizontal swipe with low vertical variance
    y_variance = float(np.var(centers[:, 1]))
    if abs(dx) > WIPE_X_THRESHOLD and y_variance < WIPE_Y_VAR_MAX:
        return self._fire(Gesture.WIPE_RIGHT if dx > 0 else Gesture.WIPE_LEFT)

    # Lift / Drop: vertical move with palm facing camera
    # (image y grows downward, so dy < 0 means the hand moved up)
    avg_normal = normals.mean(axis=0)
    avg_normal /= np.linalg.norm(avg_normal)
    dot_cam = float(np.dot(avg_normal, _CAM_VEC))
    if abs(dy) > LIFT_Y_THRESHOLD and abs(dot_cam) > LIFT_NORMAL_DOT:
        return self._fire(Gesture.LIFT_UP if dy < 0 else Gesture.DROP_DOWN)

    # Push / Pull: hand scale change (closer = bigger)
    scale_start = float(scales[:5].mean())
    scale_end = float(scales[-5:].mean())
    scale_ratio = abs(scale_end - scale_start) / scale_start
    if scale_ratio > PUSH_SCALE_RATIO:
        return self._fire(
            Gesture.PUSH_IN if scale_end < scale_start else Gesture.PULL_OUT)

    return Gesture.NONE

Each gesture maps to one axis of the robot arm. Swipe moves Y. Lift/drop moves Z. Push/pull moves X. The mapping is instant: you gesture, the arm moves.
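Here is the wipe rule on its own, run against three synthetic trajectories so you can see why each one passes or fails. The threshold values are placeholders I picked for the example (the post doesn't list the real ones), and `classify_wipe` is a hypothetical standalone helper rather than part of the engine.

```python
from statistics import pvariance

# Illustrative thresholds in normalized image units (assumed values)
WIPE_X_THRESHOLD = 0.15
WIPE_Y_VAR_MAX = 0.002


def classify_wipe(centers):
    """centers: list of (x, y) palm centers, oldest first."""
    dx = centers[-1][0] - centers[0][0]
    # Population variance, the same quantity np.var computes by default
    y_variance = pvariance([c[1] for c in centers])
    if abs(dx) > WIPE_X_THRESHOLD and y_variance < WIPE_Y_VAR_MAX:
        return "WIPE_RIGHT" if dx > 0 else "WIPE_LEFT"
    return "NONE"


# 20 frames each: a level swipe, the same swipe with vertical bounce, a slow drift
level_swipe = [(0.30 + 0.02 * i, 0.50) for i in range(20)]
bouncy_swipe = [(0.30 + 0.02 * i, 0.50 + 0.1 * (-1) ** i) for i in range(20)]
small_drift = [(0.30 + 0.002 * i, 0.50) for i in range(20)]

results = {
    "level": classify_wipe(level_swipe),
    "level_reversed": classify_wipe(level_swipe[::-1]),
    "bouncy": classify_wipe(bouncy_swipe),
    "drift": classify_wipe(small_drift),
}
print(results)
```

The bouncy swipe covers the same horizontal distance as the level one, but its vertical variance is too high, so it is rejected. The drift is level but too short. Only deliberate, level motion fires.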
6. Fist = pause. Close your hand into a fist and everything stops. The buffer clears, the arm holds position. Open your hand to resume. This is your safety switch.
def is_fist(landmarks) -> bool:
    pts = np.array([[lm.x, lm.y, lm.z] for lm in landmarks])
    wrist = pts[0]
    tips = [8, 12, 16, 20]  # index, middle, ring, pinky fingertips
    mcps = [5, 9, 13, 17]   # corresponding knuckles
    curled = 0
    for tip_i, mcp_i in zip(tips, mcps):
        tip_dist = np.linalg.norm(pts[tip_i] - wrist)
        mcp_dist = np.linalg.norm(pts[mcp_i] - wrist)
        # A curled finger brings its tip almost as close to the wrist as its knuckle
        if mcp_dist > 1e-6 and tip_dist < mcp_dist * 1.2:
            curled += 1
    return curled >= 3  # three of four curled fingers counts as a fist

Building It Step by Step
Step 1: Set Up the Environment
python3 -m venv venv
source venv/bin/activate
pip install opencv-python mediapipe numpy pymycobot
MediaPipe will automatically download its hand landmark model on first run.
Step 2: Connect the myPalletizer 260 M5
Plug the arm in via USB. Find your serial port with:
ls /dev/tty.usbserial-*
Update the ROBOT_PORT constant in the code to match your port. The arm needs to be powered on and homed before it can accept coordinate commands. The code handles this automatically — it sends a home command on startup and waits for the arm to be ready.
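If you'd rather not hard-code the port, a lookup along these lines can fill in `ROBOT_PORT` for you. This is a sketch under assumptions: `/dev/tty.usbserial-*` is the pattern from the command above (macOS naming), while `/dev/ttyUSB*` and `/dev/ttyACM*` are names USB serial adapters typically get on Linux. `find_serial_ports` is a hypothetical helper, and taking the first match only makes sense when a single adapter is plugged in.

```python
import glob


def find_serial_ports(patterns=("/dev/tty.usbserial-*",
                                "/dev/ttyUSB*",
                                "/dev/ttyACM*")):
    """Return device paths matching common USB-serial name patterns, sorted."""
    found = []
    for pattern in patterns:
        found.extend(glob.glob(pattern))
    return sorted(set(found))


ports = find_serial_ports()
ROBOT_PORT = ports[0] if ports else None  # None means no adapter was found
print(ROBOT_PORT)
```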
Step 3: Run It
python main.py
Or without the robot (gesture detection only):
python main.py --no-robot
You'll see a live camera feed with an overlay showing your hand landmarks, palm center/normal/scale values, buffer status, the detected gesture displayed as a big label when it fires, and current robot arm coordinates.
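The `--no-robot` switch from the run commands above could be wired up with `argparse` roughly like this. The flag name comes from the post; the parser description, help text, and the `use_robot` variable are assumptions for the sketch.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Gesture-controlled robot arm")
    parser.add_argument(
        "--no-robot",
        action="store_true",
        help="run gesture detection only, without opening the serial port",
    )
    return parser.parse_args(argv)


# Passing a list here simulates `python main.py --no-robot`
args = parse_args(["--no-robot"])
use_robot = not args.no_robot
print(use_robot)
```

Keeping the robot optional is handy for tuning thresholds: you can watch gestures fire on the overlay without the arm moving.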
Step 4: Use the Force
Hold your hand up, palm facing the camera. The system needs to see your open hand to track it.
● Swipe left/right — move your hand horizontally in a smooth motion. Keep it level — if your hand bounces up and down too much, the Y variance check will reject it.
● Lift/drop — with your palm facing the camera, move your hand straight up or down.
● Push/pull — push your hand toward the screen or pull it back. The system detects this by watching your hand get smaller (push) or bigger (pull) in the frame.
● Fist — close your hand to pause everything. This is your emergency stop.
One frame of hand position tells you nothing about intent. Is the hand moving left, or did it just appear there? The 20-frame buffer gives you a trajectory. By comparing the start and end of that window, you get clean, intentional gesture detection instead of noise.
After each gesture fires, the buffer clears completely. This prevents double-triggering — the old motion data can't re-fire after the cooldown expires.
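A minimal sketch of that fire-then-clear logic, assuming the 0.8-second cooldown used in this build. `FireGate` is a hypothetical name, and the injectable clock is only there so the example runs without waiting.

```python
import time

COOLDOWN_S = 0.8  # cooldown length used in this build


class FireGate:
    """Lets a gesture fire at most once per cooldown and wipes the window."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_fire = None

    def cooled_down(self):
        if self._last_fire is None:
            return True
        return self._clock() - self._last_fire >= COOLDOWN_S

    def fire(self, gesture, window):
        self._last_fire = self._clock()
        window.clear()  # stale motion can't re-trigger once the cooldown ends
        return gesture


# Fake clock so the example runs instantly
now = [0.0]
gate = FireGate(clock=lambda: now[0])
window = list(range(20))  # stand-in for 20 buffered frames

fired = gate.fire("WIPE_LEFT", window)
blocked = gate.cooled_down()  # still t = 0.0, inside the cooldown
now[0] = 1.0
ready = gate.cooled_down()  # 1.0 s later, cooldown has passed
```

Once the cooldown passes, the window is empty, so the engine has to see 20 fresh frames before anything can fire again.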
Palm normal for gesture gating
Not every hand movement is a gesture. You might scratch your face or reach for your coffee. The palm normal acts as a gate: lift/drop only fires when your palm faces the camera. This means you have to deliberately present your palm to activate gesture mode. It's like raising your hand to use the Force. You have to mean it.
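The gate itself boils down to one dot product. In this sketch the camera axis `(0, 0, 1)` and the 0.7 cutoff (roughly 45 degrees) are assumed values, and `palm_faces_camera` is a hypothetical helper. The absolute value matches how the engine compares the dot product against its threshold.

```python
import math

CAM_VEC = (0.0, 0.0, 1.0)  # assumed camera viewing axis
LIFT_NORMAL_DOT = 0.7      # assumed cutoff, about 45 degrees off-axis


def palm_faces_camera(normal, min_dot=LIFT_NORMAL_DOT):
    """True when the palm normal is roughly aligned with the camera axis."""
    mag = math.sqrt(sum(c * c for c in normal))
    if mag < 1e-6:
        return False  # degenerate normal, no reliable direction
    dot = sum(n * c for n, c in zip(normal, CAM_VEC)) / mag
    return abs(dot) > min_dot  # abs() tolerates sign flips in the noisy z value


facing = palm_faces_camera((0.1, 0.0, 0.95))    # palm toward the camera
flipped = palm_faces_camera((0.1, 0.0, -0.95))  # same pose, sign flipped
edge_on = palm_faces_camera((0.9, 0.3, 0.1))    # hand turned sideways
print(facing, flipped, edge_on)
```

Because of the `abs()`, the back of the hand pointed at the camera passes the gate too. That is the price of being tolerant to sign flips in the estimated normal.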
Why thresholds instead of a trained model?
Six gestures along three axes with clean separation. A neural network would be overkill. The threshold approach is:
● Fast — no inference time beyond MediaPipe
● Tunable — change a number, change the sensitivity
● Debuggable — you can see exactly why a gesture did or didn't fire
● Zero training data — works out of the box for any hand
Robot runs in a background thread
The robot arm takes ~1.5 seconds to complete each movement. If the gesture loop waited for that, you'd get roughly 0.67 fps. Instead, detected gestures are dropped into a queue and a background worker thread sends them to the arm. The camera and gesture detection stay at full speed.
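The queue hand-off can be sketched with nothing but the standard library. `ArmWorker` and `slow_move` are hypothetical stand-ins: the real worker would call the robot where `slow_move` sleeps, and the 0.05 s delay stands in for the ~1.5 s move so the example finishes quickly.

```python
import queue
import threading
import time


class ArmWorker:
    """Runs slow arm moves on a background thread (sketch)."""

    def __init__(self, move_fn):
        self._move_fn = move_fn  # stand-in for the blocking robot call
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, gesture):
        self._queue.put(gesture)  # returns immediately

    def _run(self):
        while True:
            gesture = self._queue.get()
            if gesture is None:  # sentinel: shut down
                break
            self._move_fn(gesture)

    def close(self):
        self._queue.put(None)
        self._thread.join()


executed = []


def slow_move(gesture):
    time.sleep(0.05)  # stands in for the ~1.5 s arm move
    executed.append(gesture)


worker = ArmWorker(slow_move)
start = time.monotonic()
for g in ("LIFT_UP", "WIPE_LEFT", "PUSH_IN"):
    worker.submit(g)
submit_time = time.monotonic() - start  # the loop never waits on a move
worker.close()  # block until the queued moves have run
print(executed, submit_time)
```

One thing to watch with an unbounded queue: gestures made while the arm is busy pile up and play back later. A small `maxsize`, or dropping gestures while a move is in flight, are two possible ways to keep the arm from lagging behind your hand.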
# Gesture → (axis_index, sign)   axis: 0=X, 1=Y, 2=Z
_GESTURE_AXIS = {
    Gesture.WIPE_RIGHT: (1, -1),
    Gesture.WIPE_LEFT: (1, +1),
    Gesture.LIFT_UP: (2, +1),
    Gesture.DROP_DOWN: (2, -1),
    Gesture.PUSH_IN: (0, -1),
    Gesture.PULL_OUT: (0, +1),
}

def _execute(self, gesture: Gesture):
    axis, sign = _GESTURE_AXIS[gesture]
    lo, hi = _AXIS_BOUNDS[axis]
    with self._lock:
        coords = self._coords[:]
    # Step along one axis, clamped to the workspace bounds
    coords[axis] = float(np.clip(coords[axis] + sign * self._step, lo, hi))
    self._mc.send_coords(coords, ROBOT_SPEED)
    time.sleep(1.5)  # wait for the move to finish
    with self._lock:
        self._coords = coords[:]

Challenges
MediaPipe's Z coordinate is noisy. The palm normal calculation uses the cross product of vectors in the palm plane. When the palm faces the camera, the Z differences between landmarks are small, making the normal's Z component sensitive to noise. Using the absolute value of the dot product with the camera vector made detection more robust.
Double-firing gestures. The temporal buffer holds 20 frames. After a gesture fires, if you don't clear the buffer, the same motion data sits there. Once the 0.8-second cooldown expires, the engine re-evaluates the stale data and fires again. Clearing the buffer on each detection solved this completely.
The arm can't read its own position mid-move. Calling get_coords() during motion returns stale or garbage data on the myPalletizer 260. The solution: read the position once at startup, then track it internally. Every gesture adds or subtracts a fixed step from the last known position. Workspace bounds prevent the arm from going somewhere it shouldn't.
What's Next
This is the foundation. The gesture vocabulary can grow: two-handed gestures, finger pinch for gripper control, circular motions for rotation. The same architecture works: extract features, buffer them, write a threshold rule.
The real dream? Mount the camera on a mobile robot, gesture-control the whole thing. Wave it forward, swipe it sideways, raise your hand to look up. Full Jedi control.
Grogu would be proud.
----------------------------
Interested in the product? Click to find out more details on myPalletizer 260 M5.
Join our User Case Initiative and showcase your innovative projects with Elephant Robotics. Join Now: https://www.elephantrobotics.com/en/call-for-user-cases-en/.