Walk onto any factory floor (or imagine a Boston Dynamics Spot doing it for you) and you’ll find legacy LCDs and analog gauges that have no data transmission capabilities, limiting your IoT pipeline. Extracting that data usually means replacing expensive hardware. However, as TinyML pioneer Pete Warden pointed out, there’s a faster and cheaper alternative: point a low-power camera at the existing display and process it locally.
In a previous project, we built a TinyML pipeline on an OpenMV board to read a prepaid electricity meter. It worked, but the implementation was taxing. Separating the screen from the meter chassis was necessary for thresholding to work, forcing us onto a brittle rule-based bounding-box approach with hardcoded pixel coordinates. Then came the painstaking manual labelling of 3,260 digit images across ten classes, the camera calibration sessions, and the ROI debugging.
The system hit 99.8% accuracy in testing and still needed a dedicated rig, a WiFi shield, and a hope that the lighting conditions wouldn't change. Night-time capture was out of scope entirely. That is the ceiling of a classical computer vision approach to screen scraping, and you hit it sooner than you expect: the solution does not adapt easily to night-time conditions, let alone to other use cases.
Enter Vision-Language Models. Instead of wrestling with fragile OCR scripts, you can now use Visual Question Answering to simply ask a local AI agent: "Read the boiler pressure on this display and return a JSON output." VLMs are remarkably resilient to glare, odd angles, and dirty lenses, exactly the conditions a robot inspecting a factory would encounter, making them the obvious upgrade for extracting data from brownfield settings at scale. There's just one catch: VLMs are RAM-devouring beasts, and dragging that kind of generative AI out of the cloud to run completely offline on edge hardware is where the real engineering challenge begins. That's exactly what we're going to tackle here.
The Edge Compute Wall
To understand why VLMs haven't already taken over facility audits, you have to look at the constraints. Popular open-weight multimodal models are very effective, but they are notoriously RAM-hungry. They need massive memory pools just to hold their weights and KV cache during inference. You can't toss a multi-billion parameter model onto a standard edge device and expect it to parse an image. There is the option of defaulting to a cloud API, but streaming gigabytes of sensitive factory floor imagery introduces real security risks and a hard dependency on industrial Wi-Fi that may not exist. More often than not, the intelligence has to be strictly local.
Rubik Pi 3 for the Win
This is exactly where the Thundercomm Rubik Pi 3 resolves the bottleneck. Built around the 6nm Qualcomm Dragonwing™ QCS6490 SoC, it feels tailor-made to skip over that edge compute wall. For this VQA project, the spec that matters most is its 8GB of LPDDR4x RAM. That footprint gives us exactly the breathing room needed to load an aggressively quantised small VLM entirely into memory and run inference locally, with no cloud dependency. The board packs a Hexagon NPU delivering 12 TOPS of dedicated AI compute, alongside an integrated Adreno 643L GPU accessible via OpenCL. The Rubik Pi 3 also comes in at a lower price point than its counterparts, making it an accessible entry point for edge AI projects that would previously have demanded much more expensive hardware.
The Model: Liquid AI LFM2-VL-1.6B
Liquid Foundation Models are architecturally efficient by design. The LFM series was built from the ground up for constrained hardware deployment, not shrunk down from a data centre model after the fact. The fine folks at Liquid AI recommend a Q4_K_M quantisation for a balance of quality and size. This reduces the model weights to well under 1GB.
What makes LFM2-VL particularly well-suited to screen scraping is a combination of capabilities that matter for this exact task. The model has strong OCR-oriented visual grounding. Its instruction-following is precise enough to reliably emit clean JSON on every inference. Critically, it proved resilient to low-resolution inputs during our experiments: scaling images down aggressively had almost no impact on read accuracy, which matters enormously on an edge device where every token in the vision encoder has a memory and latency cost.
For reliable edge infrastructure, the inference engine needs to be lightweight. Python-heavy frameworks carry too much overhead. This project uses llama.cpp, specifically llama-server for its OpenAI-compatible API endpoint, and libmtmd for multimodal processing. Once running, llama-server exposes a standard /v1/chat/completions endpoint that any HTTP client can hit directly. The image is base64-encoded and sent as part of the message content array alongside the text prompt.
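To make that concrete, here is a minimal client sketch in Python. It assumes the server is reachable on port 9876 (matching the launch command further down), that the requests and Pillow packages are installed, and that gauge.jpg and the one-line question are placeholders for your own capture and extraction prompt. The image is downscaled before encoding, in line with the low-resolution resilience noted above.

import base64
from io import BytesIO

import requests
from PIL import Image

SERVER = "http://127.0.0.1:9876/v1/chat/completions"

def encode_image(path: str, max_side: int = 512) -> str:
    # Downscale the capture aggressively, then return it as a base64 JPEG string.
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=85)
    return base64.b64encode(buf.getvalue()).decode("utf-8")

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Read the main value on this display and return JSON."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{encode_image('gauge.jpg')}"}},
        ],
    }],
    "max_tokens": 128,
    "temperature": 0,
}

resp = requests.post(SERVER, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])

If the client runs somewhere other than the board itself, swap 127.0.0.1 for the board's IP address, which is exactly what the --host 0.0.0.0 flag in the launch command allows.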
Setup note: to enable GPU offload on the Rubik Pi 3, llama.cpp must be compiled with the OpenCL backend (-DGGML_OPENCL=ON). The Thundercomm official build guide covers this in full, including the OpenCL headers and ICD loader dependencies. Confirm the OpenCL path is active by checking startup logs for ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'.
Installation of llama.cpp
To install llama.cpp, follow steps 1 through 5 in the Rubik Pi installation guide available here: Link.
Tuning for Reliable Inference
To run the VLM on the Rubik Pi 3 with llama.cpp without hitting an OOM error, you have to consider the following flags:
Context Size (-c 1024): the single most important flag. For a VQA task, you are sending one image and asking one question; you do not need 128K tokens of context. Capping it at 4096 already dropped the KV cache from several gigabytes to roughly 200-400MB, and we went further down to 1024 for this workload.
Model Quantisation (Q8_0 to Q4_K_M): The Q8_0 weights amount to approximately 1.25GB. Dropping to Q4_K_M brings the weights well under 1GB, freeing substantial headroom while preserving better quality than pure Q4_0.
GPU Offloading (-ngl 99): This flag tells llama.cpp to offload as many model layers as possible to a hardware accelerator; we set it to 99. With the OpenCL backend enabled as described earlier, -ngl 99 pushes the heavy matrix multiplications onto the board's integrated Adreno GPU for a 33% speedup, according to Thundercomm's benchmarks.
The recommended final launch command for the Rubik Pi 3:
llama-server \
-m ./LFM2-VL-1.6B-Q4_0.gguf \
--mmproj ./mmproj-LFM2-VL-1.6B-F16.gguf \
-b 4 \
-c 1024 \
--threads 6 \
--host 0.0.0.0 \
--port 9876 \
-ngl 99 \
--no-warmup
For this use case, this is the prompt used to get structured output:
"""
TASK: Extract numeric readings from three digital gauges in this image.
GAUGE IDENTIFICATION (left to right):
- LEFT gauge (black/dark): rain_gauge (units: mm)
- MIDDLE gauge (white with blue header): thermometer (units: °C)
- RIGHT gauge (white/red circular): pressure_gauge (units: bar)
READING INSTRUCTIONS:
1. Focus ONLY on the main numeric display on each gauge's LCD/LED screen
2. Read the complete number including decimal points if present
3. Ignore any secondary displays, unit labels, or interface elements
4. If a gauge shows multiple numbers, use the largest/primary display
OUTPUT FORMAT:
- Return ONLY valid JSON with no additional text, markdown, or formatting
- Use null for unreadable or missing gauges
- Round to maximum 2 decimal places
- Use integers when the value is a whole number
REQUIRED JSON STRUCTURE:
{
"rain_gauge": <number|null>,
"thermometer": <number|null>,
"pressure_gauge": <number|null>
}
Analyze the image now and return the JSON response.
"""Below is a screen recording of the Rubik Pi executing the entire pipeline locally.
The table below summarizes the inference performance observed on the Rubik Pi during testing. While text prompt processing and generation run relatively quickly, the image encoding stage dominates the runtime, as the vision encoder must convert the input image into embeddings before the language model can reason about it.
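One rough way to see that split from the client side is to time a text-only request against an image-bearing one; the gap is dominated by the vision encoder. This is a sketch, not the benchmark methodology used for the table, and the endpoint and image path are the same assumptions as in the earlier client example.

import base64
import time

import requests

SERVER = "http://127.0.0.1:9876/v1/chat/completions"

def timed_request(content) -> float:
    # Wall-clock seconds for one chat-completion round trip.
    payload = {"messages": [{"role": "user", "content": content}], "max_tokens": 32}
    start = time.perf_counter()
    requests.post(SERVER, json=payload, timeout=300).raise_for_status()
    return time.perf_counter() - start

with open("gauge.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

text_only = timed_request("Reply with the word OK.")
with_image = timed_request([
    {"type": "text", "text": "Read the main value on this display."},
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
])
print(f"text-only: {text_only:.1f}s, with image: {with_image:.1f}s")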
This project demonstrates something that should matter to anyone building monitoring infrastructure for industrial environments: a low-cost board running open-source inference software can perform VQA on legacy displays, offline, with no cloud dependency. The engineering is reproducible, the repository is open, and the tuning steps are documented above so others can build on this work.
GitHub Repo: Link
References
Pete Warden — TinyML as a Quick, Cheap Alternative to Smart Meter Deployment
TinyML Digital Counter for Electric Metering System




