Our team developed the Deep Learning Hardware Accelerator (DLHA), a coprocessor designed to run deep neural networks faster and more power-efficiently than general-purpose processors alone. Convolutional neural networks have revolutionized applications in image processing, robotics, and autonomous driving. However, these methods are computationally intensive, which prevents their deployment in constrained embedded systems such as mobile, IoT, or in-vehicle platforms. The DLHA is based on Ardavan Pedram’s Linear Algebra Processor (LAP), a design for accelerating general matrix-matrix multiplication (GEMM), the operation at the core of convolutional neural network applications. We implemented the LAP in RTL, integrated it into a Zynq-7000 SoC platform, and prototyped it on an FPGA. This first working hardware prototype of the LAP serves as an example for those who wish to integrate our open-source RTL implementation into their own SoC or custom FPGA design. Furthermore, we developed the drivers and software layers needed to integrate the DLHA into a standard deep learning framework, so that arbitrary network designs can use our hardware.
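To illustrate why general matrix-matrix multiplication (GEMM) sits at the core of convolutional networks, the following minimal sketch lowers a small 2-D convolution to a single matrix product using the common im2col transformation. It is an illustration under stated assumptions, not the DLHA or LAP implementation: the function names (`gemm`, `im2col`, `conv2d_gemm`, `conv2d_direct`) and the toy data are hypothetical, and the code is pure Python for readability rather than performance.

```python
# Illustrative sketch only: pure-Python im2col + GEMM, not the DLHA/LAP code.
# All function names and data below are hypothetical.

def gemm(a, b):
    """Return the matrix product of a (m x k) and b (k x n), as lists of lists."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one column."""
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    patches = []
    for i in range(out_h):
        for j in range(out_w):
            patches.append([image[i + di][j + dj]
                            for di in range(kh) for dj in range(kw)])
    # Transpose so each patch is a column: shape (kh*kw, out_h*out_w).
    return [list(col) for col in zip(*patches)]

def conv2d_gemm(image, kernels):
    """Valid cross-correlation of one image with several kernels via one GEMM."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_w = len(image[0]) - kw + 1
    # Each kernel becomes one row of the weight matrix: shape (n_kernels, kh*kw).
    weights = [[v for row in k for v in row] for k in kernels]
    flat = gemm(weights, im2col(image, kh, kw))
    # Reshape every output row back into a 2-D feature map.
    return [[r[i:i + out_w] for i in range(0, len(r), out_w)] for r in flat]

def conv2d_direct(image, kernel):
    """Reference sliding-window implementation, used only for comparison."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2]]
kernels = [[[1, 0], [0, 1]],
           [[0, 1], [1, 0]]]

feature_maps = conv2d_gemm(image, kernels)
# The GEMM-based result matches the direct sliding-window result.
assert feature_maps == [conv2d_direct(image, k) for k in kernels]
print(feature_maps)
```

Because all kernels are applied in one matrix product, the whole layer reduces to a single GEMM call, which is the kind of operation a matrix-multiplication accelerator can take over from the host processor.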