
NPU Introduction

The V853 chip has an NPU with a maximum processing performance of 1 TOPS and a 128 KB internal cache for high-speed data exchange. It supports API calls through OpenCL, OpenVX, Android NN, and ONNX, and can import a large number of commonly used deep learning models.

NPU System Structure

The system architecture of the NPU is shown in the following figure:

[Figure: NPU system architecture]

The upper-layer application can perform calculations either by loading models and data into the NPU directly, or by operating the NPU through the software APIs that it provides.

The NPU consists of three parts: the Programmable Engines (PPU), the Neural Network Engine (NN), and caches at each level.

The programmable engine can be programmed with EVIS hardware-accelerated instructions and a shader language, and can also implement operations such as activation functions.

The neural network engine includes two parts: the NN core and the Tensor Process Fabric (TPF, abbreviated as Fabric in the figure). The NN core mainly computes convolution operations, while the Tensor Process Fabric serves as a high-speed data exchange path within the NN core. Operators are implemented jointly by the programmable engine and the neural network engine.

The NPU supports three data formats: UINT8, INT8, and INT16.

NPU Model Conversion

The NPU uses its own customized model format, so a model trained in a deep learning framework cannot be imported into the NPU for computation directly. The trained model must first be converted into the NPU's model format.

The model conversion steps of NPU are shown in the following figure:

[Figure: NPU model conversion steps]

NPU model conversion consists of three parts: preparation, quantization, and verification.

Preparation

First, we import the prepared model using the conversion tool and create a configuration file.

The tool then converts the model into the network model, weights, and configuration file used by the NPU.

The configuration file is used to describe and configure the input and output parameters of the network. These parameters include the shapes of the input/output tensors, normalization coefficients (mean/zero point), image format, output tensor format, post-processing method, and more.
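
The exact format of this configuration file depends on the conversion tool. Purely as an illustration of the information it carries, a minimal sketch expressed as a Python dictionary might look like the following; the key names and values are hypothetical and are not the tool's actual syntax:

```python
# Hypothetical illustration of what a conversion configuration describes.
# Key names and values are invented for clarity; they are NOT the actual
# syntax of the NPU conversion tool.
conversion_config = {
    "input": {
        "shape": [1, 3, 416, 416],   # N, C, H, W of the input tensor
        "mean": [0.0, 0.0, 0.0],     # per-channel normalization mean
        "scale": 1 / 255.0,          # normalization scale factor
        "format": "RGB",             # image format the model expects
    },
    "output": {
        "tensor_format": "NCHW",     # layout of the output tensor
        "post_process": "yolo_v3",   # post-processing applied to the raw output
    },
}
```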

Quantization

Since a trained neural network is relatively insensitive to data precision and noise, its parameters can be converted from floating-point to fixed-point representation through quantization. This has two advantages:

(1) The amount of data is reduced, and storage devices with smaller capacity can be used, which saves costs;

(2) Converting floating-point numbers to fixed-point numbers also greatly reduces the amount of computation in the system and improves calculation speed.
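
As a minimal, framework-independent sketch of what quantization does (not the NPU tool's actual algorithm), the following converts a floating-point tensor to UINT8 using a scale and zero point derived from the observed value range, then dequantizes it to show the resulting precision loss:

```python
import numpy as np

def quantize_uint8(x: np.ndarray):
    """Asymmetric quantization sketch: map float values to UINT8."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    zero_point = int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float values; the residual is the quantization error."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(64, 3, 3, 3).astype(np.float32)  # toy "weights"
q, scale, zp = quantize_uint8(weights)
print("max abs quantization error:", np.abs(weights - dequantize(q, scale, zp)).max())
```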

However, quantization also has one major drawback: loss of precision.

Since the amount of data is greatly reduced when floating-point numbers are converted to fixed-point numbers, the actual precision of the weight parameters is reduced. In a simple network this is not a big problem, but in a complex multi-layer, multi-model network, small errors in each layer accumulate into errors in the final output.

So, can the original data be used directly, without quantization? Yes, it can.

However, floating-point data cannot be fed into the NN core, which only supports fixed-point operations. The programmable engine must then perform the calculations instead of the NN core, which greatly reduces computing efficiency.

In addition, during the quantization process, not only the parameters but also the input and output data are quantized. If the model has no input data, the tool cannot determine the data range of the inputs and outputs. We therefore need to prepare some representative inputs to participate in quantization. These input data are generally taken from the dataset used to train the model, such as the images in an image dataset.

The selected dataset does not need to include all of the training data. Usually, a few hundred input samples that represent all scenarios are enough. In theory, the more calibration data is used, the better the accuracy after quantization may be, but beyond a certain threshold the improvement becomes very slow or stops altogether.
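
As a simple illustration of preparing such a calibration set (the directory, file pattern, and sample count below are hypothetical), a few hundred representative images can be sampled from the training data like this:

```python
import random
from pathlib import Path

# Hypothetical dataset location; in practice, use the dataset the model was trained on.
dataset_dir = Path("dataset/train/images")
all_images = sorted(dataset_dir.glob("*.jpg"))

# A few hundred samples covering all scenarios are usually enough; adding more
# gives diminishing returns past a certain point.
calibration_set = random.sample(all_images, k=min(300, len(all_images)))

with open("calibration_list.txt", "w") as f:
    f.writelines(f"{path}\n" for path in calibration_set)
```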

Here is a case where accuracy is lost and recognition fails due to wrong quantization:

[Figure: yolo_v3_abnormal]

And normally it should be like this:

[Figure: yolo_v3_output]

Verification

Because quantizing the model in the previous stage causes a loss of accuracy, the model must be verified at each stage to check whether the results are still consistent.

First, we run the model in its non-quantized form to generate the tensor of each layer as the Golden tensor; the input can be any sample from the dataset. Then we run the quantized model with the same input and compare each layer's output tensor with the corresponding Golden tensor.

If the difference is large, try a different quantization model or quantization method. If the difference is small, the model can be simulated in the IDE or deployed directly to the V853 for testing.

This test also outputs tensor data. Compare each layer's output tensor with the Golden tensor again; if the difference is still small, the model can be integrated into the application.
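
A minimal sketch of such a comparison, assuming the per-layer tensors have been dumped as NumPy arrays (the file names and the similarity threshold below are illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened tensors; 1.0 means identical direction."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical dumps of one layer's output from the float model (Golden tensor)
# and from the quantized model run with the same input.
golden = np.load("golden_layer_07.npy")
quantized = np.load("quant_layer_07.npy")

similarity = cosine_similarity(golden, quantized)
print(f"cosine similarity: {similarity:.4f}")
if similarity < 0.99:  # illustrative threshold; tune per model
    print("Large difference: revisit the quantization type or calibration data.")
```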

Model Conversion Practice

See: NPU Conversion YOLO V3 Model

Deployment of NPU Models

The model deployment process of the NPU system generally includes the following four parts:

[Figure: NPU model deployment process]

Data Preprocessing

Data preprocessing is the process of converting raw data into a format suitable for use by the model.

Here is an example from an image object-recognition case: the camera captures image data in YUV format, but the model expects RGB input, so preprocessing is needed to convert the YUV data to RGB.
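
A sketch of that conversion using the standard BT.601 full-range formula is shown below; it assumes the U and V planes have already been upsampled to the same resolution as Y (a real camera frame such as NV12 would need de-interleaving and upsampling first):

```python
import numpy as np

def yuv_to_rgb(y: np.ndarray, u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Convert full-range BT.601 YUV planes (all at Y resolution) to an RGB image."""
    y = y.astype(np.float32)
    u = u.astype(np.float32) - 128.0
    v = v.astype(np.float32) - 128.0
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)
```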

Model Deployment Practice

The next step is to load the model into the NPU, initialize the NPU environment, and allocate memory; the previously preprocessed data is then handed to the NPU for calculation. After the calculation, the NPU outputs tensor data. Post-processing is then required to convert this tensor data into concrete coordinates and classes, which can be fed back to the upper-layer application for further processing.
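
As a toy sketch of the post-processing step only (the tensor layout, threshold, and sample values below are invented for illustration and do not correspond to any specific model or to the NPU SDK), converting a raw output tensor into coordinates and classes might look like this:

```python
import numpy as np

def postprocess(raw: np.ndarray, conf_threshold: float = 0.5):
    """Toy decoder: each row is [x, y, w, h, confidence, class scores...]."""
    detections = []
    for row in raw:
        x, y, w, h, conf = row[:5]
        if conf < conf_threshold:
            continue  # drop low-confidence candidates
        class_id = int(np.argmax(row[5:]))  # most likely class
        detections.append({
            "box": (float(x), float(y), float(w), float(h)),
            "class_id": class_id,
            "confidence": float(conf),
        })
    return detections

# Fake "NPU output" with three candidate detections and two classes.
raw_output = np.array([
    [0.10, 0.20, 0.30, 0.40, 0.90, 0.1, 0.8],
    [0.50, 0.50, 0.20, 0.20, 0.30, 0.6, 0.2],
    [0.70, 0.10, 0.15, 0.25, 0.75, 0.9, 0.05],
], dtype=np.float32)

print(postprocess(raw_output))
```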

For details on deployment, see: Deployment of NPU Models

FAQ

(1) Does the NPU support calling operator-level operations? Which operators are supported?

The NPU uses network-level calls by default, but it also supports operator-level calls. However, when operators are called directly, data must be exchanged through external memory rather than through the NPU's built-in SRAM, which greatly reduces efficiency, so this usage is not recommended.

The NPU uses a dual-operator structure: the neural network engine provides hard operators, which offer high performance and speed, while the programmable engine provides soft operators, which can cover operators not supported by the hardware. Hard operators cover most convolution operations; soft operators can be implemented by programming. The specific operator support table can be found in the document "Operation Mapping and Support".

(2) Does the NPU support FP16, FP32?

No, FP16 and FP32 are not supported.

(3) Does the NPU support multi-model operation?

Yes, multi-model operation is supported.

(4) Is it possible to quantize a model with your own quantization method?

Yes, as long as the output quantization table conforms to the format.

(5) Which deep learning framework models does the NPU support?

Common deep learning frameworks supported by the V853 are:

  • TensorFlow
  • Caffe
  • TFLite
  • Keras
  • PyTorch
  • ONNX NN
  • Darknet
  • and so on...