Researchers Find DORY Dramatically Improves Deep Neural Network Performance on Sub-1MB SRAM MCUs
Designed to eke out maximum efficiency from edge IoT chips with 1MB or less of static RAM (SRAM), DORY is now available under the Apache 2.0 license.
A team of researchers has released a framework, dubbed Deployment Oriented to memoRY (DORY), which they say can deliver between 2.5x and 18.1x more multiply-accumulate (MAC) operations per cycle, depending on the baseline, when running deep neural network (DNN) layers on low-power microcontroller devices.
"The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet of Things is a critical enabler to support pervasive Deep learning-enhanced applications," the researchers explain of their work in the paper's abstract. "Low-cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads, to reduce area overheads and increase energy efficiency - requiring explicit DMA-based memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive topology-dependent tiling and double-buffering."
"In this work, we propose DORY (Deployment Oriented to memoRY) — an automatic tool to deploy DNNs on low-cost MCUs with typically less than 1MB of on-chip SRAM [static RAM] memory. DORY abstracts tiling as a Constraint Programming (CP) problem: it maximizes L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, to maximize speed, DORY augments the CP formulation with heuristics promoting performance-effective tile sizes."
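The tiling step described in the abstract can be illustrated with a minimal sketch. This is not DORY's code and does not use a constraint-programming solver: it is a brute-force search under assumed conditions (one convolutional layer, 8-bit data, stride 1, no padding, all input channels kept in every tile, and two copies of each tile to allow double-buffering). The function names, the memory model, and the `simd_bonus` term, a stand-in for the paper's performance heuristics, are all hypothetical.

```python
from itertools import product


def tile_bytes(c_in, k, c_out_t, h_t, w_t):
    """Bytes needed for one tile's input, weights, and output.

    Assumes 8-bit data, stride 1, no padding, and that every tile
    keeps all c_in input channels (illustrative simplifications).
    """
    inp = c_in * (h_t + k - 1) * (w_t + k - 1)  # input patch including halo
    wgt = c_out_t * c_in * k * k                # weights for this channel slice
    out = c_out_t * h_t * w_t                   # output tile
    return inp + wgt + out


def best_tile(c_in, c_out, h_out, w_out, k, l1_bytes, simd_bonus=0):
    """Exhaustively pick the tile (c_out_t, h_t, w_t) that fills L1 best.

    The factor of 2 reserves space for double-buffering. simd_bonus is a
    hypothetical heuristic that nudges the search towards output-channel
    tiles that are a multiple of 4. Returns None if no tile fits.
    """
    best, best_score = None, -1
    for c_t, h_t, w_t in product(range(1, c_out + 1),
                                 range(1, h_out + 1),
                                 range(1, w_out + 1)):
        used = 2 * tile_bytes(c_in, k, c_t, h_t, w_t)
        if used > l1_bytes:
            continue  # violates the L1 capacity constraint
        score = used + (simd_bonus if c_t % 4 == 0 else 0)
        if score > best_score:
            best, best_score = (c_t, h_t, w_t), score
    return best


if __name__ == "__main__":
    # Illustrative layer: 16 -> 32 channels, 16x16 output, 3x3 kernel, 16 kB budget.
    print(best_tile(16, 32, 16, 16, 3, 16 * 1024, simd_bonus=64))
```

A real constraint-programming formulation would state the same capacity constraint and objective declaratively and leave the search to a solver, which scales far better than enumerating every candidate tile.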
To prove DORY's capabilities, the researchers turned to GreenWaves Technologies' GAP8 processor, a highly parallel ultra-low-power RISC-V implementation which is seeing increasing adoption at the edge. "On this device," the researchers found, "DORY achieves up to 2.5x better MAC/cycle than the GreenWaves proprietary software solution and 18.1x better than the state-of-the-art result on an STM32-F746 MCU on single layers."
"Using our tool, GAP-8 can perform end-to-end inference of a 1.0-MobileNet-128 network consuming just 63 pJ/MAC on average @ 4.3 fps — 15.4x better than an STM32-F746."
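To put the quoted energy figure in context, the per-MAC number can be converted into energy per inference and average power. The sketch below is a back-of-the-envelope calculation only: the MAC count of roughly 186 million per inference is an assumption taken from the commonly cited figure for a 1.0-MobileNet-128 network, not a value stated in the DORY abstract, and the function names are hypothetical.

```python
def energy_per_inference_mj(pj_per_mac, macs_per_inference):
    """Energy for one inference in millijoules, from picojoules per MAC."""
    return pj_per_mac * 1e-12 * macs_per_inference * 1e3


def average_power_mw(energy_mj, frames_per_second):
    """Average power in milliwatts when running continuously at the given rate."""
    return energy_mj * frames_per_second


if __name__ == "__main__":
    ASSUMED_MACS = 186e6  # assumption: approximate MACs for 1.0-MobileNet-128
    energy = energy_per_inference_mj(63, ASSUMED_MACS)  # 63 pJ/MAC from the quote
    power = average_power_mw(energy, 4.3)               # 4.3 fps from the quote
    print(f"{energy:.1f} mJ per inference, {power:.1f} mW average")
```

Under that assumed MAC count, the quoted figures work out to roughly 11.7 mJ per inference and about 50 mW of average power at 4.3 fps.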
To enable wide adoption of the DORY framework, the team has published its source code, along with optimized back-end kernels and heuristics, on GitHub under the Apache 2.0 license; the paper, meanwhile, is available under open-access terms on arXiv.org.