AI/ML tutorials are often complex and rely on large models that require powerful development machines with high-end GPUs and extensive software stacks.
To demonstrate the complete development flow, a lightweight custom CNN is much more practical. It allows us to cover every step — from model design and training to deployment on a DEEPX-powered edge device.
What You’ll Learn
In this project, we walk through the full workflow:
- Setting up the development environment using a Docker container (including a Jupyter Notebook and all required Python tools, as well as the DEEPX compiler)
- Designing and training a custom CNN using the MNIST dataset (a well-known dataset of handwritten digits from 0 to 9)
- Testing and validating the trained model
- Exporting the model to ONNX format
- Compiling the ONNX model using the DEEPX compiler
- Developing a C++ application that reads an image from the file system and performs inference
- Running and testing the application on a DEEPX-accelerated edge device
The development environment can be set up either on the AMD Ryzen target system or on a standard PC.
Since the targets are edge devices — often powered by less capable CPUs (e.g., ARM) — it is recommended to perform development on a PC to better understand the edge deployment workflow.
Make sure Docker is installed and working correctly:
docker run hello-world

1.2 Build the Docker Image
Download the provided Dockerfile(s) and build the container image:
docker build -t dxcom:2026-03-31 -f ./Dockerfile .

This step creates a Docker image with all required tools, including:
- Python environment
- Jupyter Lab
- DEEPX compiler
1.3 Start the Container
Start the container:
mkdir -p $PWD/workspace
docker run --rm --publish 7000:7000 --volume $PWD/workspace:/workspace --user $(id -u):$(id -g) -it dxcom:2026-03-31 /bin/bash

1.4 Start Jupyter Lab
Inside the container, launch Jupyter Lab:
cd /workspace/
jupyter lab --ServerApp.port=7000 --ServerApp.ip=0.0.0.0 --IdentityProvider.token="silica" --no-browser

We spawned the notebook on port 7000. Access the notebook from a browser using this URL: http://IP_OF_DOCKER_HOST:7000/lab?token=silica
Create a new notebook and name it mnist.ipynb. The full mnist.ipynb is also attached, but it is recommended to paste the cells one by one to get familiar with the flow.
First cell:
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split

This cell just imports all required libraries for the tutorial:
- torch / torch.nn / torch.optim → build and train the CNN
- torchvision.datasets & transforms → load and preprocess MNIST
- DataLoader utilities → handle batching and dataset splitting
- matplotlib → visualize images/results
- os → basic file handling
👉 In short: it prepares everything needed for data loading, model creation, training, and visualization.
Next Cell:
# ----------------------------
# Config
# ----------------------------
BATCH_SIZE = 64
EPOCHS = 10
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

This snippet defines the basic training configuration:
- BATCH_SIZE = 64 → number of samples processed at once
- EPOCHS = 10 → how many times the model sees the full dataset
- DEVICE = ... → automatically uses GPU (CUDA) if available, otherwise CPU
Next Cell:
# ----------------------------
# Dataset / transforms
# ----------------------------
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST(
root="./data",
train=True,
download=True,
transform=transform
)
full_valid_dataset = datasets.MNIST(
root="./data",
train=False,
download=True,
transform=transform
)
# Split original test set into validation + test
test_size = 1000
valid_size = len(full_valid_dataset) - test_size
generator = torch.Generator().manual_seed(42)
valid_dataset, test_dataset = random_split(
full_valid_dataset,
[valid_size, test_size],
generator=generator
)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
print("Training samples: ", len(train_dataset))
print("Validation samples:", len(valid_dataset))
print("Test samples:      ", len(test_dataset))

This snippet prepares the dataset for training and evaluation:
- Defines preprocessing (convert to tensor + normalize MNIST images)
- Downloads and loads the MNIST dataset
- Splits the test set into validation and test subsets
- Creates DataLoaders for batching and efficient training
- Prints dataset sizes
👉 In short: it loads, preprocesses, splits, and organizes the data for training, validation, and testing.
Next Cell:
Display an image from the training dataset:
# Get one sample from the training dataset
image, label = train_dataset[1]
# Convert tensor to numpy for plotting
image_np = image.squeeze().numpy()
# Display the image
import matplotlib.pyplot as plt
plt.imshow(image_np, cmap="gray")
plt.title(f"Label: {label}")
plt.axis("off")
plt.show()

Next Cell:
# ----------------------------
# Model
# ----------------------------
class MNISTModel(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(
in_channels=1,
out_channels=16,
kernel_size=3,
bias=True
)
self.bn1 = nn.BatchNorm2d(16)
self.dwconv = nn.Conv2d(
in_channels=16,
out_channels=16,
kernel_size=3,
groups=16,
bias=True
)
self.bn2 = nn.BatchNorm2d(16)
self.dropout = nn.Dropout(0.1)
self.fc = nn.Linear(16 * 24 * 24, 10)
def forward(self, x):
x = self.conv1(x)
x = self.bn1(x)
x = F.relu(x)
x = self.dwconv(x)
x = self.bn2(x)
x = F.relu(x)
x = torch.flatten(x, 1)
x = self.dropout(x)
x = self.fc(x)
return x
model = MNISTModel().to(DEVICE)

This snippet defines and initializes a simple CNN model for MNIST:
- Uses convolution layers (including a depthwise convolution) + batch normalization
- Applies ReLU activation, dropout, and a fully connected layer
- Outputs predictions for 10 digit classes (0–9)
- Moves the model to the selected device (CPU/GPU)
👉 In short: it builds a lightweight CNN and prepares it for training.

Next Cell:
# Quick sanity check
x = torch.randn(1, 1, 28, 28).to(DEVICE)
y = model(x)
print("Sanity check output shape:", y.shape)  # [1, 10]

- Creates a dummy input tensor with MNIST shape (1×1×28×28)
- Runs it through the model
- Prints the output shape
👉 In short: it verifies that the model works and outputs 10 class scores (one per digit).
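As a side note on the model above: the `fc` layer's input size, 16 * 24 * 24, follows from the two 3×3 convolutions being applied without padding. A minimal sketch of that arithmetic (the `conv_out` helper is introduced here purely for illustration):

```python
# Output spatial size of a 3x3 convolution with no padding, stride 1:
# out = in - kernel + 1
def conv_out(size, kernel=3):
    return size - kernel + 1

h = conv_out(28)    # conv1:  28 -> 26
h = conv_out(h)     # dwconv: 26 -> 24
fc_in = 16 * h * h  # 16 feature maps of 24x24 -> 9216 inputs to fc
print(h, fc_in)     # 24 9216
```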
Next Cell:
# ----------------------------
# Loss / optimizer
# ----------------------------
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# ----------------------------
# Training loop
# ----------------------------
for epoch in range(EPOCHS):
# Training
model.train()
train_loss = 0.0
train_correct = 0
train_total = 0
for x, y in train_loader:
x = x.to(DEVICE)
y = y.to(DEVICE)
optimizer.zero_grad()
outputs = model(x)
loss = criterion(outputs, y)
loss.backward()
optimizer.step()
train_loss += loss.item() * x.size(0)
preds = outputs.argmax(dim=1)
train_correct += (preds == y).sum().item()
train_total += y.size(0)
train_loss /= train_total
train_acc = train_correct / train_total
# Validation
model.eval()
valid_loss = 0.0
valid_correct = 0
valid_total = 0
with torch.no_grad():
for x, y in valid_loader:
x = x.to(DEVICE)
y = y.to(DEVICE)
outputs = model(x)
loss = criterion(outputs, y)
valid_loss += loss.item() * x.size(0)
preds = outputs.argmax(dim=1)
valid_correct += (preds == y).sum().item()
valid_total += y.size(0)
valid_loss /= valid_total
valid_acc = valid_correct / valid_total
print(f"Epoch {epoch+1}/{EPOCHS}")
print(f" Train Loss: {train_loss:.4f}, Acc: {train_acc:.4f}")
print(f"  Valid Loss: {valid_loss:.4f}, Acc: {valid_acc:.4f}")

This snippet trains and validates the model:
- Defines loss function (CrossEntropy) and optimizer (Adam)
- Runs a training loop over multiple epochs
- Updates model weights using backpropagation
- Tracks loss and accuracy for training and validation
- Prints performance after each epoch
👉 In short: it trains the CNN and evaluates how well it performs.
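One detail of the loop above worth noting: the per-batch mean loss is multiplied by the batch size (loss.item() * x.size(0)) before accumulation and only divided by the total sample count at the end. This yields the exact per-sample mean even when the last batch is smaller. A sketch with made-up numbers:

```python
# Two batches of unequal size with made-up per-batch mean losses
batch_sizes = [64, 36]
batch_mean_losses = [0.50, 0.20]

# Accumulate loss * batch_size, then divide by the total sample count
total = sum(l * n for l, n in zip(batch_mean_losses, batch_sizes))
per_sample_mean = total / sum(batch_sizes)  # (32.0 + 7.2) / 100 = 0.392

# A naive average of the batch means would overweight the small batch
naive = sum(batch_mean_losses) / len(batch_mean_losses)  # 0.35
print(per_sample_mean, naive)
```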
Next Cell:
# ----------------------------
# Final test evaluation
# ----------------------------
model.eval()
test_correct = 0
test_total = 0
with torch.no_grad():
for x, y in test_loader:
x = x.to(DEVICE)
y = y.to(DEVICE)
outputs = model(x)
preds = outputs.argmax(dim=1)
test_correct += (preds == y).sum().item()
test_total += y.size(0)
test_acc = test_correct / test_total
print(f"\nTest Acc: {test_acc:.4f}")

This snippet evaluates the trained model on the test dataset:
- Disables training (eval + no gradients)
- Runs inference on the test data
- Compares predictions with true labels
- Computes and prints final test accuracy
👉 In short: it measures how well the model performs on unseen data.
Next Cell:
# Take one sample from test set
x, y = next(iter(test_loader)) # get a batch
x = x.to(DEVICE)
y = y.to(DEVICE)
# Pick first image in batch
img = x[0].unsqueeze(0) # keep batch dim → [1, 1, 28, 28]
label = y[0].item()
with torch.no_grad():
outputs = model(img)
probs = F.softmax(outputs, dim=1)
pred_class = probs.argmax(dim=1).item()
confidence = probs.max().item() * 100
print(f"Ground truth: {label}")
print(f"Prediction : {pred_class}")
print(f"Confidence  : {confidence:.2f}%")

This snippet performs inference on a single test image:
- Takes one image from the test set
- Runs it through the model
- Computes prediction and confidence (via softmax)
- Prints ground truth vs. predicted digit
👉 In short: it demonstrates how the trained model makes a prediction on one sample.
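The softmax-based confidence used above can be reproduced with plain NumPy. A small sketch with made-up logits (the values are illustrative, not from the trained model):

```python
import numpy as np

# Made-up logits for the 10 digit classes
logits = np.array([1.2, -0.3, 0.1, 0.4, 2.0, -1.1, 0.0, 0.3, 0.5, 6.0])

# Softmax turns logits into probabilities
# (subtracting the max first improves numerical stability)
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

pred_class = int(probs.argmax())       # index of the most likely digit
confidence = float(probs.max()) * 100  # probability of that digit in percent
print(pred_class)  # 9
```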
Next Cell:
# ----------------------------
# Show image
# ----------------------------
img_np = img.cpu().squeeze().numpy()
plt.imshow(img_np, cmap="gray")
plt.title(f"GT: {label} | Pred: {pred_class} ({confidence:.1f}%)")
plt.axis("off")
plt.show()

This snippet displays the test image and prediction:
- Converts the tensor to a NumPy image
- Shows it using matplotlib
- Displays ground truth, predicted class, and confidence in the title
👉 In short: it visualizes the model’s prediction result.
Next Cell:
# ----------------------------
# Export to onnx
# ----------------------------
os.makedirs("models", exist_ok=True)
onnx_path = "models/mnist_model.onnx"
# Dummy input (must match your model input!)
dummy_input = torch.randn(1, 1, 28, 28).to(DEVICE)
torch.onnx.export(
model, # model to export
dummy_input, # example input
onnx_path, # output file
export_params=True, # store trained weights
opset_version=13, # good default
do_constant_folding=True, # optimization
input_names=["input"], # optional
output_names=["output"], # optional
dynamic_axes={ # allow variable batch size
"input": {0: "batch_size"},
"output": {0: "batch_size"},
}
)
print("✅ Saved ONNX:", onnx_path)

This snippet exports the trained PyTorch model to ONNX format:
- Creates a dummy input matching the model input
- Converts the model (including weights) into an ONNX file
- Applies basic optimizations and naming
- Enables dynamic batch size
👉 In short: it prepares the model for deployment on other platforms (e.g., DEEPX).
Next Cell:
# ----------------------------
# Save test images
# ----------------------------
os.makedirs("images/mnist_test", exist_ok=True)
for i in range(len(test_dataset)):
img, label = test_dataset[i]
img_np = img.squeeze().numpy()
plt.imsave(f"images/mnist_test/{i}_label_{label}.png", img_np, cmap="gray")
print("Saved individual images")

This snippet exports the test dataset as individual image files:
- Creates an output folder
- Iterates over all test samples
- Saves each image as a PNG file with its label in the filename
👉 In short: it prepares test images for use outside Python (e.g., for C++ inference).
Next Cell:
# ----------------------------
# save calibration images
# ----------------------------
# Raw MNIST without normalization, so files contain normal pixel values
train_raw = datasets.MNIST(
root="./data",
train=True,
download=True,
transform=None
)
out_dir = "images/mnist_calibration"
os.makedirs(out_dir, exist_ok=True)
num_calib = 1000
for i in range(num_calib):
img, label = train_raw[i] # PIL image, label int
filename = os.path.join(out_dir, f"{i:05d}_label_{label}.png")
img.save(filename)

This snippet generates calibration images from the MNIST training set:
- Loads MNIST without normalization (raw pixel values)
- Saves a subset (1000 images) as PNG files
- Includes the label in each filename
👉 In short: it prepares raw images for model calibration (e.g., quantization).
3 Model compilation
In this section we generate the compiler config and launch the DEEPX compiler.
Next Cell:
import json
config = {
"inputs": {"input": [1, 1, 28, 28]},
"calibration_num": 10,
"calibration_method": "ema",
"default_loader": {
"dataset_path": "images/mnist_calibration/",
"file_extensions": ["png"],
"preprocessings": [
{"convertColor": {"form": "RGB2GRAY"}},
{"resize": {"width": 28, "height": 28}},
{"div": {"x": 255.0}},
{"normalize": {"mean": [0.1307], "std": [0.3081]}},
{"expandDim": {"axis": 0}},
{"expandDim": {"axis": 0}}
]
}
}
with open("mnist.json", "w") as f:
json.dump(config, f, indent=2)
print("mnist.json created")

The JSON file we just created (mnist.json) defines the input and calibration configuration for the DEEPX compiler:
- Specifies the model input shape ([1, 1, 28, 28])
- Defines how many images to use for calibration and the method (EMA)
- Points to a folder with calibration images
- Describes the preprocessing pipeline (grayscale, resize, normalize, reshape)
👉 In short: it tells the compiler how to preprocess input data and calibrate the model for deployment.
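The div/normalize/expandDim steps in mnist.json mirror the training-time transform. Here is a NumPy sketch of that part of the pipeline on a dummy 28×28 grayscale image (the random array stands in for a decoded PNG, so the convertColor and resize steps are already satisfied and omitted):

```python
import numpy as np

# Dummy 28x28 grayscale image with pixel values 0..255, standing in for a PNG
img = np.random.randint(0, 256, size=(28, 28)).astype(np.float32)

# div: scale pixels to [0, 1]
x = img / 255.0

# normalize: same mean/std as the training transform
x = (x - 0.1307) / 0.3081

# expandDim twice: (28, 28) -> (1, 28, 28) -> (1, 1, 28, 28)
x = np.expand_dims(x, axis=0)
x = np.expand_dims(x, axis=0)

print(x.shape)  # (1, 1, 28, 28)
```

The result matches the model input shape declared under "inputs" in the config, and a black pixel (0) maps to roughly -0.4242, just as in the PyTorch Normalize transform.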
In the Jupyter notebook select File->New Launcher and open a terminal.
Launch the DEEPX Compiler:
dxcom -m models/mnist_model.onnx -c mnist.json -o output/ --gen_log

In the previous section, we walked through the complete workflow: setting up the development environment, designing a custom CNN, training and evaluating it, exporting the model to ONNX, and compiling it for the DEEPX M1 accelerator.
With the model now compiled and ready for deployment, it’s time to move from development to real-world execution.
In the next step, we will build a C++ application that runs on the edge device, loads the model, and performs inference on input images.
4. Develop and deploy a C++ classification app
In this chapter we create the edge target application for our simple CNN, utilizing the DEEPX accelerator.
4.1. Build the C++ app
Prepare the folder structure in the Jupyter terminal (the same one that was used for model compilation):
cd /workspace/
mkdir -p app/src
mkdir app/bld && cd app/bld

Extract and copy the attached C++ sources (mnist_edge_sources.zip) to the freshly created src folder on the host ($CUR_DIR/workspace/app/src).
Next we build the app:
cmake ../src
make

The runtime consists of the kernel driver and the corresponding user-space libraries, as well as some management tools. Follow the instructions on the GitHub site: DEEPX Github runtime
4.2. Test accelerator
Use this command to check if the accelerator is detected on the target device:
sudo lspci -vv -d 1ff4:

Have a look at the negotiated speed and whether the correct number of PCIe lanes is used. On an AMD Ryzen box we'd expect:
LnkSta: Speed 8GT/s, Width x4
To see if the runtime is installed properly:
silica@4X4-BOX-8840U:~$ dxrt-cli --status
DXRT v3.1.0
=======================================================
* Device 0: M1, Accelerator type
--------------------- Version ---------------------
* RT Driver version : v2.1.0
* PCIe Driver version : v2.0.1
-------------------------------------------------------
* FW version : v2.5.0-sr1
--------------------- Device Info ---------------------
* Memory : LPDDR5x 6000 Mbps, 3.92GiB
* Board : M.2, Rev 1.0
* Chip Offset : 0
* PCIe : Gen3 X4 [04:00:00]
NPU 0: voltage 750 mV, clock 1000 MHz, temperature 35'C
NPU 1: voltage 750 mV, clock 1000 MHz, temperature 34'C
NPU 2: voltage 750 mV, clock 1000 MHz, temperature 34'C
=======================================================

4.3. Testing the C++ App
Copy the following files from the Docker workspace to the target, in our case the AMD Ryzen box:
- App: ./workspace/app/bld/classification_async
- Test images: ./workspace/images/mnist_test
- Compiled model: ./workspace/output/mnist_model.dxnn
Finally:
silica@4X4-BOX-8840U:/tmp$ /tmp/classification_async -m mnist_model.dxnn -i mnist_test/106_label_9.png --width 28 --height 28 --class 10 --loop 5
index: 0 class: 9
index: 1 class: 9
index: 2 class: 9
index: 3 class: 9
index: 4 class: 9
[INFO] total time : 12347 us
[INFO] per frame time: 2469 us
[INFO] fps : 404.957

We took an image from the test dataset; these images have never been seen by the training algorithm. In this case the image was classified correctly as a 9.
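The reported throughput follows directly from the timing lines in the log; a quick check of the arithmetic:

```python
total_us = 12347  # total time for 5 inference loops, from the app's log
loops = 5

per_frame_us = total_us / loops  # ~2469 us per frame
fps = 1_000_000 / per_frame_us   # ~405 frames per second
print(round(per_frame_us), round(fps, 3))  # 2469 404.957
```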





