A challenging problem in the field of sound processing is building a high-accuracy, low-latency system capable of recognizing and understanding human speech. My original project idea can be found here.
First, I would like to offer a bit of context to help explain the decisions presented below. For me this project was mostly meant as a fun way to get back into the fields of hardware design and artificial neural networks after quite a long break. As a result, the project ended up more about finding all (or at least quite a lot of) the paths that lead to failure, instead of the one great road to victory. This means that I spent most of my effort on board enablement tasks instead of on developing FPGA accelerators for AI; more about that later. Suffice it to say that if you are looking for a quick and easy voice-controlled assistant, this story is probably not the place to start.
Project outline:
- Software/Hardware Architecture
- Speech to Text Frameworks
- Board Setup
- Hardware Accelerated Graphics
- Text to Speech Framework
- FPGA Accelerators for DNNs
- Final Status
- Future Work
Now that we are set up, let's see what we are dealing with. The board for this project was a mandatory requirement, which means that at least part of our hardware had already been decided beforehand. So let's take a look:
OK, the board looks pretty nice: we can see a bunch of USB ports, a microSD card slot and a miniDP port (plus a GPIO expander and so on). But that is not what I meant, since we can't actually see the silicon SoC here. Show me the block diagram:
Ok, a little bit better: we can now see a fixed piece of SoC (which Xilinx refers to as the PS) and that flexible PL part. But enough teasing, we want to take a look inside that MPSoC. What exactly is this XCZU3EG? One more try:
That is one awesome piece of hardware! But there is no time to stand in awe, let's get a quick architecture down. For sustaining neural networks we need processing power, but at the same time we do not want to sacrifice usability, since we will have to develop on this platform. In all honesty, with a bit of tweaking this is barely an embedded system anymore, and closer to a full-blown computing platform in pocket size. Speaking of which, I wonder how that small board dissipates all the power the SoC draws when it runs heavy workloads? Hmm, no time to think about that now; architecture is where we left off.
Quad-core ARM Cortex-A53, that has a familiar ring to it; oh yeah, ARCH=arm64. So a Linux kernel with a GNU-based rootfs it is for the CPUs. That should give us a nice development environment, and most Linux-based distros support this arch nowadays, so we should be fine (for teaching purposes and quick prototyping I generally prefer Ubuntu). If all else fails there is Yocto Project support from Xilinx, so we could build our own small, targeted distro. OS is set, what else?
Since we are looking to do some heavy computing, before we head off to the PL side our eyes immediately go to the on-chip embedded GPU: the ARM Mali-400. That is another nice piece of hardware; we could accelerate a full-blown desktop environment there to offload the CPUs completely and leave them open for development tasks, and if we are really lucky we might even manage some AI there (there were some rumors about TensorFlow support for some Mali chips).
So far, so good: a platform with a Linux-based kernel, hardware-accelerated graphics, and a large enough FPGA to host some accelerators for running inference under real-time constraints. Some readers might notice that I ignored the R5 cores; that is mostly because, without a fair amount of work, they would not really fit this use case.
Before we move on to preparing the software, let's take a look at existing projects that deal with deep-neural-network-based speech processing.
Speech to Text Frameworks
There are a lot of choices in this area, probably way too many to list here. Because I was interested in trying out TensorFlow, and because they have a pretty good data-gathering effort, I chose Project DeepSpeech, backed by the Mozilla Foundation. Raw speech data can be found at the Common Voice initiative. Out of interest, and because of some specific hardware support on the PC side, I ended up rebuilding my own package of the project, with which I could train models and export them for inference. For anybody interested, DeepSpeech is based on TensorFlow, and I would recommend taking some time to get familiar with both projects.
My original idea was as follows: train a model on the PC using TF with GPU acceleration, then export that model for inference and find a way to execute it on an accelerator implemented in the PL.
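For context, running an exported model on the A53 cores can be done from the command line with the DeepSpeech client. This is just a sketch; the flag names and file layout follow the 0.x releases from that era and may differ in your install, so check `deepspeech --help`.

```shell
# Run inference on a recorded utterance with an exported DeepSpeech model.
# Paths are illustrative; the 0.x pre-trained release ships these file names.
MODEL=models/output_graph.pb
ALPHABET=models/alphabet.txt
AUDIO=test.wav   # 16 kHz mono WAV, as the pre-trained model expects

deepspeech --model "$MODEL" --alphabet "$ALPHABET" --audio "$AUDIO"
```

On the PC side the same client is handy for sanity-checking a model right after export, before any thought of PL acceleration.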
The view above is of a pre-trained model that you can download as part of the official DeepSpeech release, as a starting point for trying things out. For the readers out there wondering: I know that there are other frameworks, some of which might have better performance or features; this is one area where I would definitely like to spend a lot more time working things out (but time for this project was quite limited).
Board Setup
With a model in hand, it's time to get the board ready. For the first try I used the supplied microSD card to boot the board. As the boot process elapsed I immediately noticed a couple of things. First, there was no output on the serial console; a short dig through some forums solved that (either use a patched U-Boot or modify the PS config and regenerate). Speaking of patched U-Boot, the second thing I noticed, from experience, is that the supplied software is the output of a Yocto-based workflow. If you need more information about the boot process, Xilinx has a good (recently migrated to Confluence) wiki here, where you can read all about it.
The last, quite unpleasant, thing I noticed was that the UI seemed a bit laggy.
Since the supplied Yocto build was quite old (a 2017.x release) even by Xilinx standards (their latest PetaLinux release is 2018.2, with 2018.3 in the works), that was definitely not going to cut it. My decision, again from experience, was to grab the kernel from that microSD card together with its modules from the supplied rootfs, and mix them together with Ubuntu Base as a start.
For anyone interested (since I saw some questions about this on the forums): to create such a rootfs you can unpack an Ubuntu Base archive on your PC and, together with QEMU, actually chroot into a rootfs of another arch, in this case arm64. You can then finish the setup like you would on a normal Ubuntu bare-metal install (using the Ethernet connection of the host PC) and then just write that rootfs to the microSD card. There are plenty of such tutorials on the web, from people with nice example code.
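The procedure above boils down to a handful of commands on the host PC. A rough sketch, assuming qemu-user-static with binfmt support is installed and an Ubuntu Base tarball has been downloaded (file names here are illustrative):

```shell
# Prepare an arm64 Ubuntu Base rootfs on an x86 host, then chroot into it.
ROOTFS=rootfs-arm64
UBUNTU_BASE=ubuntu-base-18.04-base-arm64.tar.gz

mkdir -p "$ROOTFS"
sudo tar -xpf "$UBUNTU_BASE" -C "$ROOTFS"
# The static QEMU binary lets the host kernel run arm64 binaries in the chroot
sudo cp /usr/bin/qemu-aarch64-static "$ROOTFS/usr/bin/"
sudo cp /etc/resolv.conf "$ROOTFS/etc/resolv.conf"   # network via the host
sudo mount --bind /dev "$ROOTFS/dev"
sudo mount -t proc proc "$ROOTFS/proc"
sudo chroot "$ROOTFS" /bin/bash   # now apt update, install packages, add users...
```

When done, unmount everything and copy the finished rootfs to the microSD card's root partition, together with the kernel modules and firmware taken from the Yocto build.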
Once I got Ubuntu up, that could have been enough, just working over the network; but since I carried over all the modules and firmware from the Yocto build to my Ubuntu rootfs, WLAN functionality works fine as well.
But again, my target was to take advantage of the miniDP port and run a nice, fast Wayland (with Weston) based desktop environment on the target.
At this point I tried to rebuild Yocto-based images using Xilinx's repositories for their boards. This is not a rant, since keeping up with upstream Yocto is hard even for projects with people on them every day, but those repositories are as well maintained as any other vendor-specific Yocto layers (which is to say, not that well maintained). If you have no Yocto experience, tread carefully here. Eventually, I did manage to wrestle their 2018.2 release branches into building both the petalinux-image-minimal and the petalinux-image-full images.
The difference between the two images is that the minimal image is a console-based environment, while the full image is a GUI-based release; in release 2018.2 it is X11-based with matchbox and Xfce integration (the full image is also much larger and contains the Qt framework as well). Both images achieved my goal of getting a recent Linux kernel compiled for the board, with all the required modules as part of the rootfs (in this case the Mali module is of most interest, for graphics acceleration).
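For readers who want to reproduce those builds, the overall shape is the standard bitbake flow. The repository, branch and machine names below are from memory of the 2018.2-era layers and are assumptions; check Xilinx's Yocto manifests for the exact layer set and machine name.

```shell
# Rough shape of a 2018.2-era Yocto build for the Ultra96 (names may differ).
BRANCH=rel-v2018.2
git clone -b "$BRANCH" https://github.com/Xilinx/meta-xilinx.git
git clone -b "$BRANCH" https://github.com/Xilinx/meta-petalinux.git
# ...plus poky and the other dependency layers, all added to bblayers.conf

. poky/oe-init-build-env build
MACHINE=ultra96-zynqmp bitbake petalinux-image-minimal   # or petalinux-image-full
```

Expect long build times and the occasional fetch failure from vendor servers; mirrors and sstate caches help a lot here.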
Hardware Accelerated Graphics
Since we have the full image anyway, let's test the GUI; it should be accelerated X11, since it was integrated by Yocto. I booted the board and, to my surprise, the UI felt laggy again.
Time to investigate a bit: pull up top and check the CPU load while moving a window around (or maximizing, or any other graphics task), and the CPU load spikes. That basically points to a non-accelerated graphics stack putting a lot of load on the CPU for software rendering. But why?
After digging around and learning about the software architecture, the problem became a bit more transparent. Basically, Xilinx implements their own DRM/DRI kernel driver to handle the DP connection, so to show anything on the screen you need to pass through that interface. In addition, the Yocto builds use proprietary user-space binaries to interface with the Mali kernel driver. At the time I was working on this part (and I think this is still the case), release 2018.2 of PetaLinux from Xilinx only had support for the fbdev and X11 backends for those binaries. Release 2018.3 is supposed to support Wayland as well, via the wayland-egl support coming from those binaries.
As a solution I did try building their 2018.3 release in Yocto, since the branches were already available, but while I was building, the server the binaries were supposed to come from went down and stayed down for a few days (I am still not sure if it is back up, or if the Yocto recipe should be adjusted to fetch from somewhere else). Eventually I gave up and grabbed the proprietary binaries directly from ARM.
But that did not ultimately solve the issue because, as far as I can currently tell, even when the EGL, GLES1/2 and wayland-egl proprietary binaries work properly on the target (which means rendering is accelerated by the Mali GPU), the last part, compositing, is still performed in software. As a result UIs still lag and the CPUs still get thrashed with rendering load. If I made any mistakes here, or somebody knows better or has managed better, please let me know. Below are some screenshots of Weston running on the target, as well as some screen tests.
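A quick way to check which path Weston actually took is its startup log, which states the renderer in use. A sketch of the check, assuming Weston can be started from a VT on the target (the exact log wording varies between Weston versions):

```shell
# Start Weston with an explicit log file, then inspect which renderer it picked.
LOG=/tmp/weston.log
weston --log="$LOG" &
sleep 3
grep -i "renderer" "$LOG"   # look for a GL/EGL renderer vs the software pixman path
```

If the log shows the pixman (software) renderer, the GPU binaries are not being used for compositing, which matches the CPU-load symptoms described above.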
This is what I meant about spending a large effort on getting the board to use the hardware that is available: because of the mismatch between open-source and proprietary vendor software, most of the time it does not matter how good the hardware on the board is, since most people will be turned away by the effort required to reach a state where they can take advantage of such designs.
As far as I can currently tell, the nice solution here is to go fully open source and replace the proprietary GPU driver stack with the Mesa-based Lima implementation. I have already created a Yocto-based SDK and compiled the mainline Linux kernel for the Ultra96 together with the Lima driver, and I am currently in the process of porting the Xilinx DRM layer driver stack to the 4.19 kernel (since that Xilinx driver is not merged in mainline). But this work is still ongoing.
Text to Speech Framework
The second part of my project involved Text to Speech, such that the AI on the device could reply to humans in a natural, dialog fashion. Fortunately, this part was quite easy thanks to the CMU Flite project, which can easily be compiled for the target, either by using a Yocto SDK or just by compiling on the target itself. The CPU cores are strong enough to compile even larger packages like Weston or Mesa directly on the Ubuntu-based rootfs. Once Flite is compiled (it might actually also be available in Ubuntu via apt for arm64, which would be nice), it can be used to produce replies from the board in one of the available voice profiles. Here is another very interesting area to work on in the future (increase performance and diversify with more voices, perhaps by uniting with Common Voice?).
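Producing a spoken reply with Flite is a one-liner. A sketch, assuming the slt voice was compiled into your Flite build (otherwise the default voice is used) and ALSA's aplay drives the USB speakers:

```shell
# Synthesize a reply with Flite and play it back on the board's USB speakers.
TEXT="I heard you. Processing your request."
flite -voice slt -t "$TEXT" /tmp/reply.wav
aplay /tmp/reply.wav   # pick the USB sound card with -D if it is not the default
```

Chaining this after the speech-to-text step is what gives the board its "listener that talks back" behavior.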
FPGA Accelerators for DNNs
My plan here was to use Xilinx's CHaiDNN framework and implement it in the PL, such that I could accelerate the inference for my model. Technically, CHaiDNN builds via the SDSoC GUI on Windows are not supported, but with a few minor patches (covering issues that were mostly present in the Linux build as well) it turned out that the lack of support was not a showstopper, as the SDSoC install on Windows is capable of building it just fine.
What did turn out to be a showstopper for using this framework is that, even when building DietChai (which means pool and deconv off, as in not accelerated in HW), the resource utilization indicated after HLS was still too high to produce a working synthesis and implementation run for the Ultra96. Based on previous work I have done in this direction with Xilinx FPGAs, for this use case I would go for an HDL implementation of an accelerator (and bypass HLS completely, or use it only for interfacing).
Final Status
- Ultra96 board with Ubuntu and Weston 4.0.0 (the latest Wayland protocol and Weston 5 also work)
- TensorFlow can run on the ARM Cortex-A53 with DeepSpeech to perform speech inference on the device
- CMU Flite - compiled for the target and running - used for Text to Speech
- Together with a USB microphone and speakers, the Ultra96 board is a very good "listener" and can "speak" or robotically yell at you
- Unfortunately, no fully accelerated PL speech inference, though I have managed to build and test small acceleration examples for parts of the requirements.
Future Work
- Finish the kernel port to mainline with the Lima driver and figure out if it is even possible to have an entirely GPU-rendered GUI via the miniDP port (such that the CPU cores are free for other uses)
- Implement a new lightweight hardware IP for DNN processing (to replace the slightly bloated CHaiDNN; ideally this should be an HDL implementation, to alleviate the need for HLS and increase performance while keeping the LUT count low), and most probably integrate it as PYNQ overlays, since the interface there is very nice
- Continue working with this very nice board for other topics as well
As some readers may have noticed, I have not provided any binaries; that is mostly because time was a bit short and I had no chance to clean up and package anything. But if anyone is interested, please just get in touch and I can help.