Deep neural networks have become the state-of-the-art technique for a wide range of machine learning tasks. However, running such networks is both computationally and memory intensive, making them power-hungry to deploy on embedded systems or in data centers with a limited power budget. To address this limitation, this talk presents an algorithm and hardware co-design methodology for improving the efficiency of deep learning.
Starting with the algorithm, this talk introduces “Deep Compression,” which can compress deep neural network models by 10x-49x without loss of prediction accuracy for a broad range of CNNs, RNNs, and LSTMs. Moving to hardware that implements deep compression efficiently, the talk then introduces EIE, the “Efficient Inference Engine,” which performs decompression and inference simultaneously and thereby significantly saves memory bandwidth. By operating directly on the compressed model and handling its irregular computation pattern efficiently, EIE achieves a 13x speedup and 3000x better energy efficiency than a GPU. Finally, the talk closes the loop by revisiting the inefficiencies in current training algorithms, proposing DSD (dense-sparse-dense) training, and discussing the challenges and future work of efficient methods and hardware for deep learning, as sketched in the examples below.
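To make the compression step concrete, here is a minimal sketch of magnitude-based weight pruning, the first stage of Deep Compression. The single-shot quantile threshold and the toy layer shape are illustrative assumptions; the full pipeline prunes iteratively and also applies trained quantization and Huffman coding.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights so that roughly
    `sparsity` fraction of entries become zero (one pruning step)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask

# Example: prune a toy 256x256 dense layer to ~90% sparsity.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))
w_pruned = magnitude_prune(w, sparsity=0.9)
print(f"nonzeros remaining: {np.count_nonzero(w_pruned) / w.size:.1%}")
```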
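EIE's core computation is a sparse matrix-vector multiply over the pruned weights. The sketch below uses a plain CSR layout for clarity, which is an assumption for illustration; the accelerator itself uses a more compact encoding with relative indices and shared weights, and additionally skips zero activations.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_matvec(indptr, indices, data, x):
    """y = W @ x with W in CSR form: only nonzero weights are
    touched, mirroring the work savings EIE exploits."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# Example: multiply a pruned (90% sparse) layer by an activation vector.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w[np.abs(w) < np.quantile(np.abs(w), 0.9)] = 0.0  # prune small weights
w_csr = csr_matrix(w)
x = rng.standard_normal(64)
y = sparse_matvec(w_csr.indptr, w_csr.indices, w_csr.data, x)
assert np.allclose(y, w @ x)  # matches the dense result
```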
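Finally, a toy end-to-end illustration of the DSD schedule on a least-squares problem; the model, learning rate, step counts, and 50% sparsity level are hypothetical stand-ins for a real network and optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))   # toy regression data
w_true = rng.standard_normal(32)
y = X @ w_true

def train_step(w, mask=None, lr=0.01):
    """One gradient step on least squares; gradients of pruned
    weights are masked out during the sparse phase."""
    grad = X.T @ (X @ w - y) / len(y)
    if mask is not None:
        grad = grad * mask
    return w - lr * grad

w = np.zeros(32)
for _ in range(200):                 # D: initial dense training
    w = train_step(w)
mask = np.abs(w) > np.quantile(np.abs(w), 0.5)
w = w * mask                         # S: prune half the weights...
for _ in range(200):                 # ...and retrain the survivors
    w = train_step(w, mask=mask)
for _ in range(200):                 # D: re-dense, retrain all weights
    w = train_step(w)
```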