Computer vision tasks sit on a ladder of increasing detail. Image classification gives a single label for a whole frame. Object detection draws a bounding box around each object. Instance segmentation goes one step further and outlines every object pixel by pixel, producing a separate mask for each instance. That extra precision is what lets you lift a single person cleanly out of a scene, trace an irregular part on a conveyor, or measure an object's true shape instead of a rectangle around it.
Edge Impulse ships classification and object detection as built-in learning blocks, but not instance segmentation. This guide adds it by combining two techniques:
- Model cascading chains two models so each does what it is best at. A small, fast detector runs first, and a heavier segmentation model runs only when the detector finds something worth segmenting. Each model stays simple to train and deploy, and you spend compute where it matters.
- BYOM Freeform ("Bring Your Own Model") lets you upload any ONNX model to Edge Impulse and have the runtime hand back its raw output tensors untouched. This is the escape hatch for deploying architectures Edge Impulse does not parse natively, such as YOLO-seg, where you do the post-processing yourself.
To keep it concrete, we build a person-blur privacy application: Stage 1 detects people, Stage 2 segments each one, and the app blurs them using their pixel-accurate masks so the person is hidden while the background stays sharp, which is hard to achieve with bounding boxes alone. The whole pipeline runs through the Edge Impulse Linux runtime on a Qualcomm QCS6490, and because it is built on .eim files, the same code runs on any Edge Impulse Linux target.
What makes this an edge AI application is that it runs live, on the device. With a USB webcam plugged into the board, the cascade processes each frame as it arrives and blurs people in real time, with no cloud round-trip. That is the whole point of running inference at the edge: the application keeps working offline, adds no network latency, and, for a privacy use case like this, raw video never leaves the device. The clip below shows the full cascade running live on the Rubik Pi 3 from a USB webcam.
In this guide you will learn how to:
- chain two models into a detection-then-segmentation cascade
- deploy a YOLO11-seg model on Edge Impulse using the Freeform output type
- turn raw segmentation tensors into instance masks with a small post-processor
- build a privacy person-blur application on top of those masks
- run the whole pipeline live on a Qualcomm QCS6490 board
Note: An Edge Impulse .eim follows the Edge Impulse for Linux protocol, so the code in this repository runs unchanged on any supported target: a Raspberry Pi 5, other Qualcomm Dragonwing boards, or a macOS laptop for development. This guide targets the QCS6490. For another board, rebuild the .eim for that target and keep everything else the same.
Prerequisites
- An Edge Impulse account.
- Python 3.10+ with the runtime and OpenCV:
pip install "edge_impulse_linux>=1.2.2" opencv-python numpy
- Ultralytics for the one-time ONNX export:
pip install ultralytics
Important: Use edge_impulse_linux version 1.2.2 or newer. Recent .eim builds return large Freeform outputs over shared memory, and older SDKs cannot read them back, so Stage 2 returns the string "shm" instead of tensors. More on this under Stage 2.
Source code
The full project source is available at: https://github.com/SamuelAlexander/instance-seg-byom-freeform-person-blur
How the cascade works
A model cascade splits the work across two models so each one stays simple.
Stage 1 is a fast detector that answers where the objects are. Stage 2 is the heavier segmentation model that produces the masks, and it only needs to run when Stage 1 finds something. Splitting the job this way keeps each model easy to deploy, runs the expensive model selectively, and lets you replace either stage without touching the rest of the pipeline.
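A minimal sketch of that control flow, assuming each stage is wrapped in a plain Python callable. The names `run_cascade`, `detect`, and `segment` are hypothetical and are not taken from the repository; the sketch also assumes Stage 2 segments the whole frame once rather than one crop per detection. The only point it illustrates is that the heavy model is skipped whenever the detector reports no target.

```python
from typing import Callable, Dict, List

Detection = Dict[str, object]  # e.g. {"label": "person", "box": (x, y, w, h)}


def run_cascade(
    frame: object,
    detect: Callable[[object], List[Detection]],
    segment: Callable[[object], List[dict]],
    target_label: str = "person",
) -> List[dict]:
    """Run the cheap detector first; call the heavy segmenter only on a hit."""
    hits = [d for d in detect(frame) if d["label"] == target_label]
    if not hits:
        return []          # Stage 2 never runs on frames without a target
    return segment(frame)  # assumption: Stage 2 runs once on the whole frame


# Stand-in stages, for illustration only (not real models).
calls = {"segment": 0}


def fake_detect(frame):
    return [{"label": "person", "box": (10, 10, 50, 80)}] if frame == "busy" else []


def fake_segment(frame):
    calls["segment"] += 1
    return [{"label": "person", "mask": "..."}]


empty_result = run_cascade("empty", fake_detect, fake_segment)
busy_result = run_cascade("busy", fake_detect, fake_segment)
```

Because both stages are passed in as callables, either one can be swapped for a different model without touching the control flow.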
Project structure
.
├── postprocess.py            YOLO-seg Freeform post-processor (the core of Stage 2)
├── test_eim.py               single-image .eim sanity check
├── model_metadata.json       class names and input size
├── cascade/
│   ├── cascade_inference.py  two-stage cascade on a single image
│   ├── cascade_demo.py       split-view demo for video or webcam
│   └── person_blur.py        the person-blur application
├── images/                   result stills, screenshots, and GIFs used in this guide
├── samples/                  sample input frame and videos
└── models/                   .eim files, see models/README.md
The models/ folder ships each .eim for two platforms: *-aarch64.eim for the Rubik Pi 3, and *-macos-arm64.eim for local development on Apple Silicon. See models/README.md for how to rebuild them.
The Rubik Pi 3 is built around the Qualcomm Dragonwing QCS6490, an edge-AI SoC that combines an octa-core Kryo CPU, an Adreno GPU, and a Hexagon NPU (around 12 TOPS). That kind of on-device compute makes running a vision cascade like this at the edge practical. The board runs a standard Ubuntu image, so getting it ready is quick. This section is intentionally brief; follow the linked guides for the full detail.
- Flash and boot the board, then connect it to your network. See the Edge Impulse Rubik Pi 3 page for board setup and supported deployment targets.
- Install the Edge Impulse Linux runtime and the Python dependencies. The runtime is what executes the .eim files; see Edge Impulse for Linux for details.
pip install "edge_impulse_linux>=1.2.2" opencv-python numpy
- Copy this repository to the board (clone it, or scp the folder over) and make the models executable:
chmod +x models/*.eim
That is everything the board needs. From here, the commands are identical whether you run on the Rubik Pi 3 or, for development, on a macOS laptop with the bundled *-macos-arm64.eim.
Stage 1 finds people and their bounding boxes on each frame. There are two ways to get a detector for it.
Option A: train your own in Edge Impulse Studio. Collect and label images, then train an object-detection model. Make sure one of your object classes is person, since the rest of the pipeline keys off that label. This is the standard Edge Impulse flow from data to .eim; see the object detection documentation to learn more.
Option B: reuse a pretrained detector (what this guide does). Running the full training pipeline is unnecessary when a well-tested model already fits. I used YOLOX-Nano because it is already trained on the COCO dataset, which includes a person class, and, as the smallest YOLOX variant, it is light enough to serve as the fast first stage. Rather than collecting data and training from scratch, I uploaded the pretrained YOLOX-Nano to Edge Impulse via BYOM and used Studio only to build the .eim deployment. Because it is uploaded with a known output type (the YOLO parser), Edge Impulse returns parsed bounding boxes directly, in contrast to Stage 2's raw Freeform output.
Either way, any detector that recognizes your target class drops in without changing the rest of the cascade.
The fastest path is to use my detector directly: open my public Edge Impulse project, clone it into your account, and build the .eim from Deployment > Linux (AARCH64) > Build. There is no need to source or upload a model yourself; the underlying detector is YOLOX. Put the downloaded .eim in models/. To check the input size and labels:
from edge_impulse_linux.runner import ImpulseRunner
import json
runner = ImpulseRunner("models/stage1-yolox-aarch64.eim")
print(json.dumps(runner.init()["model_parameters"], indent=2))
runner.stop()
This is where instance segmentation gets onto Edge Impulse. BYOM (Bring Your Own Model) lets you upload any ONNX model, and the Freeform output type tells the runtime to pass every raw output tensor straight through without parsing. You handle the post-processing, which is what makes a non-native architecture like YOLO-seg deployable.
As with Stage 1, the fastest path is to use my model directly: open my Edge Impulse project, clone it into your account, and build the .eim from the Deployment tab. That lets you skip the export and upload steps below. The rest of this section shows how to build it from scratch, which is the path to take if you want to train on your own data.
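Stage 2 uses a pretrained YOLO11n-seg network. With a 640x640 input and the 80 COCO classes, it produces two output tensors:
- Detections, 116 x 8400: for each of 8400 candidate boxes, 4 box coordinates, 80 class scores, and 32 mask coefficients.
- Prototype masks, 32 x 160 x 160: 32 low-resolution masks shared by every detection.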
The mask for a detection is its 32 coefficients multiplied by the 32 prototypes, passed through a sigmoid, then cropped to its box.
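The mask recipe above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the code in postprocess.py: the function name `assemble_mask` is hypothetical, the 0.5 binarization threshold is an assumption, and the box is assumed to be already scaled to the 160x160 prototype grid.

```python
import numpy as np


def assemble_mask(coeffs, protos, box, thresh=0.5):
    """Build one instance mask at prototype resolution.

    coeffs: (32,) mask coefficients for one detection
    protos: (32, H, W) prototype masks
    box:    (x1, y1, x2, y2) in prototype-grid pixels (assumed pre-scaled)
    """
    c, h, w = protos.shape
    # Linear combination of prototypes: (32,) @ (32, H*W) -> (H*W,)
    logits = (coeffs @ protos.reshape(c, -1)).reshape(h, w)
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    # Crop: keep probabilities inside the box, zero everything outside it.
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    cropped = np.zeros_like(probs)
    cropped[y1:y2, x1:x2] = probs[y1:y2, x1:x2]
    return (cropped > thresh).astype(np.uint8) * 255  # threshold is an assumption


# Synthetic input: prototype 0 is strongly positive everywhere, the rest are zero.
protos = np.zeros((32, 160, 160), dtype=np.float32)
protos[0] = 5.0
coeffs = np.zeros(32, dtype=np.float32)
coeffs[0] = 1.0
mask = assemble_mask(coeffs, protos, box=(40, 40, 120, 100))
```

With this synthetic input the mask is 255 inside the 80x60 box and 0 elsewhere, which shows the role of the crop: without it, a prototype that is active across the frame would leak outside the detection.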
Export to ONNX
The export uses Ultralytics:
from ultralytics import YOLO
YOLO("yolo11n-seg.pt").export(format="onnx", opset=17, dynamic=False, simplify=True)
Upload as BYOM Freeform
- Go to Upload your model (BYOM) and upload yolo11n-seg.onnx.
- Set the model output type to Freeform, input to 640x640, 3 channels, scaling 0..1.
- Build from Deployment > Linux (AARCH64) and place the .eim in models/.
Freeform gives you tensors and nothing else. Four things trip people up, and getting any of them wrong shows up as empty masks or a mask that fills the whole frame.
Pack RGB into one float per pixel
The Linux runner expects one float32 per pixel, with the R, G, and B bytes packed into a single 24-bit integer (0xRRGGBB) rather than passed as three separate values:
rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
r, g, b = rgb[:, :, 0].astype(np.uint32), rgb[:, :, 1].astype(np.uint32), rgb[:, :, 2].astype(np.uint32)
packed = ((r << 16) | (g << 8) | b).flatten().astype(np.float32).tolist()
Match output tensors by size, not index
Freeform does not guarantee tensor order, so identify each one by its element count:
expected_det_size = (4 + num_classes + 32) * 8400 # 974400 for 80 classes
det, proto = (t0, t1) if t0.size == expected_det_size else (t1, t0)
Transpose prototype masks from NHWC to NCHW
The prototypes come back flattened in NHWC order, so reshape and transpose before using them:
proto = proto.reshape(1, 160, 160, 32).transpose(0, 3, 1, 2) # (1, 32, 160, 160)
Large Freeform outputs arrive over shared memory
To avoid serializing megabytes of JSON, recent .eim builds write large Freeform outputs into POSIX shared memory and return the marker string "shm". Version 1.2.2 and newer of edge_impulse_linux read those segments and substitute the real tensors for you. An older SDK leaves you with "shm"; upgrading the package fixes this with no code change.
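The first three points can be gathered into small helpers and checked against synthetic tensors. This is a sketch under the shapes stated in this guide (80 classes, 8400 candidates, 160x160 prototypes); the helper names `pack_rgb`, `split_outputs`, and `protos_to_nchw` are hypothetical and do not come from the repository.

```python
import numpy as np

NUM_CLASSES = 80     # COCO, as used in this guide
NUM_COEFFS = 32
NUM_ANCHORS = 8400   # candidate boxes at 640x640
PROTO_HW = 160


def pack_rgb(rgb):
    """Pack an (H, W, 3) uint8 RGB image into one float per pixel (0xRRGGBB)."""
    px = rgb.astype(np.uint32)
    packed = (px[:, :, 0] << 16) | (px[:, :, 1] << 8) | px[:, :, 2]
    return packed.flatten().astype(np.float32).tolist()


def split_outputs(t0, t1):
    """Return (det, proto), identifying the detection tensor by element count."""
    expected = (4 + NUM_CLASSES + NUM_COEFFS) * NUM_ANCHORS
    return (t0, t1) if t0.size == expected else (t1, t0)


def protos_to_nchw(proto):
    """Reshape a flat NHWC prototype tensor to (1, 32, 160, 160)."""
    return proto.reshape(1, PROTO_HW, PROTO_HW, NUM_COEFFS).transpose(0, 3, 1, 2)


# Stand-in data, not real model output.
pixel = np.array([[[255, 128, 1]]], dtype=np.uint8)
features = pack_rgb(pixel)                    # one float holding 0xFF8001

det_in = np.zeros(116 * 8400, dtype=np.float32)
proto_in = np.arange(160 * 160 * 32, dtype=np.float32)
det, proto = split_outputs(proto_in, det_in)  # deliberately passed in swapped order
proto = protos_to_nchw(proto)
```

Packed values stay exact because a 24-bit integer fits within float32's 24-bit significand. The size check works here because the two tensors have different element counts (974400 versus 819200); a model where they happened to match would need another way to tell them apart.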
postprocess.py turns the two tensors into instance masks. It parses the detections, applies a confidence threshold and non-maximum suppression, builds each mask from the coefficients and prototypes, and resizes to the original frame:
from postprocess import YOLOSegPostprocessor
pp = YOLOSegPostprocessor(num_classes=80, conf_thresh=0.25, iou_thresh=0.7, img_size=640)
results = pp.process(det_tensor, proto_tensor, orig_img_shape=(h, w))
To check Stage 2 on its own against a single image:
python test_eim.py --eim models/stage2-yolo11nseg-aarch64.eim --image samples/sample-frame.jpg --metadata model_metadata.json
cascade/cascade_inference.py runs both stages on one image and merges them, matching Stage 1 boxes to Stage 2 masks by IoU so the detections and masks line up:
python cascade/cascade_inference.py \
--stage1 ./models/stage1-yolox-aarch64.eim \
--stage2 ./models/stage2-yolo11nseg-aarch64.eim \
--metadata ./model_metadata.json \
--image samples/sample-frame.jpg --output result.jpg
On the board (set up earlier), run the cascade over a video file with the split-view demo. This path needs no display:
python cascade/cascade_demo.py \
--stage1 ./models/stage1-yolox-aarch64.eim \
--stage2 ./models/stage2-yolo11nseg-aarch64.eim \
--metadata ./model_metadata.json \
--video samples/engineer.mp4 --output cascade_demo.mp4
Note: When developing on macOS, swap the *-aarch64.eim files for the bundled *-macos-arm64.eim; the commands are otherwise identical. If macOS reports an .eim as damaged, clear the quarantine flag: xattr -d com.apple.quarantine <file>.
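The IoU matching between Stage 1 boxes and Stage 2 masks mentioned earlier in this section can be sketched as a greedy pairing. This is one plausible approach, not necessarily what cascade_inference.py does: the function names, the 0.5 IoU floor, and the assumption that every Stage 2 instance carries a `"box"` in the same (x1, y1, x2, y2) pixel coordinates as the Stage 1 boxes are all illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes_to_instances(stage1_boxes, stage2_instances, min_iou=0.5):
    """Greedily pair each Stage 1 box with the best unused Stage 2 instance."""
    pairs, used = [], set()
    for i, box in enumerate(stage1_boxes):
        best_j, best = None, min_iou
        for j, inst in enumerate(stage2_instances):
            if j in used:
                continue
            score = iou(box, inst["box"])
            if score >= best:
                best_j, best = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j, best))
    return pairs


# Stand-in data: two detector boxes, three segmented instances in a different order.
stage1 = [(0, 0, 100, 100), (200, 200, 300, 300)]
stage2 = [
    {"box": (205, 195, 300, 300), "mask": None},
    {"box": (0, 0, 100, 90), "mask": None},
    {"box": (500, 500, 520, 520), "mask": None},
]
pairs = match_boxes_to_instances(stage1, stage2)
```

The third instance overlaps no detector box and is left unmatched, which is how a stray Stage 2 mask with no Stage 1 detection behind it would be dropped.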
Person-blur application
cascade/person_blur.py uses the cascade to anonymize people. A bounding-box blur covers a rectangle and takes the background with it. An instance mask follows the body outline, so the blur lands on the person and nothing else.
The blur uses the union of all person masks as a blend map. A multi-pass Gaussian (passes set by --blur-passes) anonymizes faces and clothing:
import cv2
import numpy as np

# Blur the whole frame; --blur-passes (default 2) sets the number of passes.
blurred = frame.copy()
for _ in range(passes):
    blurred = cv2.GaussianBlur(blurred, (51, 51), 0)
# Union of all person masks (uint8, 0-255) as a single blend map.
combined = np.zeros(frame.shape[:2], np.uint8)
for inst in person_instances:
    combined = np.maximum(combined, inst["mask"])
# Blend: blurred on the person, sharp elsewhere.
m = (combined / 255.0)[:, :, None]
output = (blurred * m + frame * (1 - m)).astype(np.uint8)
Run it on a clip:
python cascade/person_blur.py \
--stage1 ./models/stage1-yolox-aarch64.eim \
--stage2 ./models/stage2-yolo11nseg-aarch64.eim \
--metadata ./model_metadata.json \
--video samples/engineer.mp4 --output blurred_engineer.mp4
The same masks are a starting point for other applications too, such as background replacement, object removal, selective effects, or AR overlays.
Live webcam demo
To run on a live USB webcam instead of a file, drop the --video flag. The preview window opens on the board's display, so launch it from a terminal on the board itself:
QT_QPA_PLATFORM=xcb python cascade/person_blur.py \
--stage1 ./models/stage1-yolox-aarch64.eim \
--stage2 ./models/stage2-yolo11nseg-aarch64.eim \
--metadata ./model_metadata.json \
--skip 5 --blur-passes 2
Press q or close the window to quit. Two flags trade quality for speed in live mode: --skip N runs Stage 2 (the heavy model) only every Nth frame and reuses the mask in between, while --blur-passes sets the blur strength. Raise --skip or lower --blur-passes for a smoother feed.
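The frame-skipping behaviour of --skip can be sketched as a generator that caches the most recent mask. This is an illustration of the idea only: `process_stream` and `segment` are hypothetical names, and the real person_blur.py may schedule its stages differently.

```python
def process_stream(frames, segment, skip=5):
    """Yield (frame, mask), running `segment` only on every `skip`-th frame."""
    last_mask = None
    for index, frame in enumerate(frames):
        if index % skip == 0 or last_mask is None:
            last_mask = segment(frame)  # heavy Stage 2 call
        yield frame, last_mask          # cached mask on the frames in between


# Stand-in segmenter that records which frames it was called on.
calls = []


def fake_segment(frame):
    calls.append(frame)
    return f"mask-for-{frame}"


results = list(process_stream(range(12), fake_segment, skip=5))
```

With skip=5 over twelve frames, the stand-in segmenter runs on frames 0, 5, and 10 only, and frame 7 reuses the mask computed for frame 5. The cost is that a reused mask lags behind a fast-moving subject, which is the quality side of the trade-off.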
Note: On a Wayland desktop (the Rubik Pi 3's default), set QT_QPA_PLATFORM=xcb as shown, or the OpenCV/Qt window may come up as a small black box.
For a recording of this running live on the board, see the live demo above.
A note on hardware acceleration
The QCS6490 has a Hexagon NPU, which Edge Impulse can target with the Linux (AARCH64 with Qualcomm QNN) deployment option. The NPU accelerates int8-quantized models and suits the detection and classification style of model well.
The cascade in this guide runs on the CPU, which keeps it simple and portable across every Edge Impulse Linux target. Quantizing the Freeform segmentation model for the NPU is a worthwhile follow-up on its own, since its multi-tensor output makes int8 quantization model-specific work. Treat it as a next step once the CPU pipeline is running.
This guide brought instance segmentation to Edge Impulse without a native learning block by pairing two ideas: a two-stage model cascade and BYOM Freeform. A fast detector locates people, a YOLO11n-seg model produces pixel-accurate masks, and a small post-processor turns the raw Freeform tensors into instances that drive a privacy person-blur application. Because every stage runs through the Edge Impulse Linux runtime as an .eim, the same pipeline that runs on the Qualcomm QCS6490 runs unchanged on any supported target, from a Raspberry Pi 5 to a development laptop. From here you can train Stage 2 on your own segmentation data, build other mask-driven applications such as background replacement or selective effects, or explore int8 deployment of a detection-style model on the QCS6490's Hexagon NPU.








