NLU: An Adaptive, Small-Footprint, Low-Power Neural Learning Unit for Edge and IoT Applications
Publication: Contribution to journal › Research article › Contributed › Peer-reviewed
Abstract
Over the last few years, online training of deep neural networks (DNNs) on edge and mobile devices has attracted increasing interest in practical use cases due to its adaptability to new environments, support for personalization, and privacy preservation. Despite these advantages, online learning on resource-restricted devices is challenging. This work demonstrates a 16-bit floating-point, flexible, power- and memory-efficient neural learning unit (NLU) that can be integrated into processors to accelerate the learning process. To achieve this, we implemented three key strategies: a dynamic control unit, a tile allocation engine, and a neural compute pipeline, which together enhance data reuse and improve the flexibility of the NLU. The NLU was integrated into a system-on-chip (SoC) featuring a 32-bit RISC-V core and memory subsystems, fabricated in GlobalFoundries 22 nm FDSOI technology. The design occupies just 0.015 mm² of silicon area and consumes only 0.379 mW of power. The results show that the NLU can accelerate the training process by up to 24.38× and reduce energy consumption by up to 37.37× compared to a RISC-V implementation with a floating-point unit (FPU). Additionally, compared to a state-of-the-art RISC-V with a vector coprocessor, the NLU achieves 4.2× higher energy efficiency (measured in GFLOPS/W). These results demonstrate the feasibility of our design for edge and IoT devices, positioning it favorably among state-of-the-art on-chip learning solutions. Furthermore, we performed mixed-precision on-chip training from scratch for keyword spotting tasks using the Google Speech Commands (GSC) dataset. Training on just 40% of the dataset, the NLU achieved a training accuracy of 89.34% with stochastic rounding.
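The abstract credits stochastic rounding for preserving accuracy during 16-bit mixed-precision training. As a minimal illustration of that technique (not the paper's hardware implementation), the sketch below rounds a float32 value to bfloat16 by truncating the low 16 bits of the bit pattern and rounding up with probability proportional to the truncated fraction, so that the rounding error is zero in expectation; the function name and interface are this example's own.

```python
import random
import struct

def bfloat16_stochastic_round(x: float, rng: random.Random) -> float:
    """Round a float32 value to bfloat16 using stochastic rounding.

    bfloat16 keeps the top 16 bits of the IEEE 754 binary32 pattern
    (sign, 8 exponent bits, 7 mantissa bits); the discarded low 16 bits
    determine the probability of rounding up instead of truncating.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    low = bits & 0xFFFF                          # bits lost by truncation
    if rng.randrange(0x10000) < low:             # round up with prob. low/2^16
        bits = (bits + 0x10000) & 0xFFFFFFFF
    bits &= 0xFFFF0000                           # keep only the bfloat16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Averaged over many applications, the rounded values converge to the original input, which is why stochastic rounding loses less information than round-to-nearest when accumulating many small gradient updates at low precision.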
Details
Original language | English
---|---
Journal | IEEE Open Journal of Circuits and Systems
Publication status | Electronic publication ahead of print - 26 Feb 2025
Peer-review status | Yes
Keywords
- application-specific integrated circuit (ASIC), bfloat16, co-design, deep neural network (DNN), energy efficient, hardware accelerator, on-chip training, on-device learning, online learning, reduced-precision computation