Introduction

Tulip Vision is a no-code computer vision solution for shop-floor operations. It enables creating Tulip Apps that use computer vision to drive and monitor operations. Tulip Vision can be used to detect activity at the workstation and to track objects as well as persons. These vision signals can be used to monitor safety, manual assembly operations, kitting and picking, and many other applications that increase reliability and reduce mistakes. At the center of Vision for the shop floor lie computer vision algorithms based on deep neural networks. The algorithms are designed to run on the hardware that already exists on shop floors, which is oftentimes a modest PC running Windows. To bring the latest AI to these underpowered computers we use the Intel Neural Compute Stick version 2 (NCS v2), a.k.a. the Movidius Vision Processing Unit (VPU): a USB device with dedicated hardware for running deep neural networks, which offloads the arithmetic work from the PC’s CPU or GPU. The NCS is a low-cost, low-power, plug-and-play solution for AI on the edge.

Why is optimization needed?

We found the Intel NCS to be a very capable processor for our needs at Tulip Vision. We are focused on detecting humans and their operations on the line, so being able to detect hands and persons in an incoming camera stream is important. The latest human detection models are heavy deep neural networks, which challenge the NCS in terms of performance.

With vanilla execution of a hand detection model we were able to achieve an inference speed of 14 frames per second (FPS). Some optimization was required to hit our goal of detecting human operations in real time (colloquially considered to be 30 FPS and above, the usual running speed of most cameras). We were also looking to get the most out of the NCS hardware, the cost of which is paid upfront. The NCS specifications cite a theoretical processing rate of 100 GFLOP/s (giga floating-point operations per second), while initially we were able to eke out a mere 20-25 GFLOP/s (a roughly 1.5 GFLOP network at 14 FPS works out to about 21 GFLOP/s), with considerable latency.

In this article we discuss the optimization techniques we explored, and the one that actually made the difference for us and put us well within the real-time inference range. Intel already provides a useful guide on optimizing networks for the NCS v2, which we used in our work.

Converting a Model to Run on the NCS

To begin, we obtained the graph and weights for a hand detection network. We train our own models for hand detection, tuned for performance in shop-floor scenes, but there are freely available models online that work surprisingly well. Here is an example model on GitHub that we used for testing and baseline comparisons. With a pre-trained model, one may want to fine tune it for their purpose and data. We used a suite of tools from Google’s TensorFlow to test fine tuning and transfer learning techniques.

For any given model to run on the NCS, it must be converted to the right format. Intel provides the OpenVINO toolkit for developing for the NCS. It’s a very comprehensive toolkit that allows executing models in heterogeneous environments (e.g. CPU, GPU, VPU), converting from many source frameworks such as TensorFlow, ONNX, PyTorch, etc., and optimizing models in various ways.

To convert a model, it must first be in a frozen-graph state. That means only the parts of the neural network needed for inference are kept, everything else is discarded, and all the weights of the graph are finalized. A neural network’s execution graph is accompanied by several pieces of data used for training, which are not needed for inference and may actually cause problems during conversion. To get our example hand detection model to a frozen-graph state we use the following script:

python3 /content/models/research/object_detection/export_inference_graph.py \
 --input_type=image_tensor \
 --pipeline_config_path=/path/to/pipeline.config \
 --output_directory=/path/to/output_directory \
 --trained_checkpoint_prefix=/path/to/model.ckpt-xyz # Replace xyz with the checkpoint step value.

Figure 1. Export inference graph script for models trained with the TensorFlow Object Detection API

For conversion we use the DL (Deep Learning) Workbench from OpenVINO. It is easily installed on any major operating system and works as a standalone graphical (GUI) application, or can be driven from the command line. We focus on the GUI here to visualize our operations. OpenVINO provides a great "getting started" guide for the DL Workbench. In the DL Workbench we load the model and specify the input and output tensor shapes.


Figure 2. Intel Deep Learning Workbench model configuration

The Workbench also has very useful tools for benchmarking model execution and performance, as well as checks that all of the model’s layers are supported on the NCS.


Figure 3. Model benchmarking on test data (not annotated) using Intel i7-8700T CPU

Finally the workbench makes it very easy to convert the model to the format required by the NCS:


Figure 4. Downloading the converted OpenVINO model
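
For reference, the same conversion can also be done from the command line with OpenVINO’s Model Optimizer instead of the Workbench GUI. The sketch below is for a frozen TensorFlow Object Detection API SSD graph; the install path and the transformations .json file depend on the OpenVINO release and the TF OD API version used for training, and FP16 is the precision the NCS (MYRIAD plugin) executes:

python3 /opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py \
 --input_model=/path/to/output_directory/frozen_inference_graph.pb \
 --tensorflow_object_detection_api_pipeline_config=/path/to/pipeline.config \
 --transformations_config=/opt/intel/openvino/deployment_tools/model_optimizer/extensions/front/tf/ssd_support_api_v1.15.json \
 --data_type=FP16 \
 --output_dir=/path/to/ir_output_directory # FP16 IR is what the MYRIAD (NCS) plugin runs.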

Optimizing the Model for Fast Inference

For starters, we used the example code from OpenVINO for object detection. It uses a simple synchronous API that submits an inference request (IR) to the NCS and waits (blocks the thread) for it to complete, assuming the execution of the model simply takes some time to finish. This gave us our original underwhelming performance point of about 14 FPS. Running the numbers, we figured out that this is highly under-performant with respect to what the NCS is capable of.
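
To illustrate, the synchronous flow looks roughly like the following sketch, written against the (pre-2022) openvino.inference_engine Python API; the model and image file names are placeholders, not our actual code:

import cv2
import numpy as np
from openvino.inference_engine import IECore

# Load the converted IR model and compile it for the NCS ("MYRIAD" device).
# "hand_detector.xml/.bin" and "frame.jpg" are placeholder names.
ie = IECore()
net = ie.read_network(model="hand_detector.xml", weights="hand_detector.bin")
input_name = next(iter(net.input_info))
_, _, h, w = net.input_info[input_name].input_data.shape
exec_net = ie.load_network(network=net, device_name="MYRIAD")

# Prepare one frame: resize to the network input size and reorder HWC -> NCHW.
frame = cv2.imread("frame.jpg")
blob = cv2.resize(frame, (w, h)).transpose(2, 0, 1)[np.newaxis, ...].astype(np.float32)

# infer() blocks the calling thread until the inference request completes.
result = exec_net.infer({input_name: blob})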

We theorized that the model was too big, and that this was the reason for the poor performance. So we first tried reducing the size of the input image to the model. The running theory was that the big initial convolutional layers, operating on large inputs before any pooling, carry a lot of the computational weight. We changed the input shape to 64x64 and 128x128, down from the original 224x224. But this also reduced accuracy by a large margin, which is something we could not accept.


Figure 5. Bar chart depicting FPS vs. model input size. All models are MobileNetV2-based SSDLite models. For mobilenetv2 xy-abc, xy = depth multiplier, abc = input size.

We also tried different backbones and detector architectures, e.g. MobileNet V3 and EfficientDet, which are available as part of the OpenVINO model zoo, but we concluded that their speed-accuracy trade-off when running inference on the NCS isn’t favorable for our use case.


Figure 6. Model backbone and architecture vs. FPS. All MobileNet backbones use SSDLite as the box detector.

Stumped by these results, we tried pruning the model. In pruning we remove the parts of the model that contribute little to its accuracy, sacrificing some accuracy for speed. We tried smart model pruning from TensorFlow, based on finding "light" layers. Currently TensorFlow supports model pruning only for Keras-based sequential models and doesn’t support models trained with the TensorFlow Object Detection API, which further thwarted our efforts to create a lightweight pruned model.
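
To show what that workflow looks like when it does apply, here is a minimal, hedged sketch of TensorFlow’s magnitude-based pruning on a toy Keras Sequential model (not our detection model), using the tensorflow_model_optimization package:

import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Toy Sequential model standing in for a real network.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Gradually sparsify weights from 30% to 80% zeros during fine-tuning.
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.30, final_sparsity=0.80,
        begin_step=0, end_step=1000),
}
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
pruned.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random data stands in for a real training set in this sketch.
x = np.random.rand(256, 64).astype(np.float32)
y = np.random.randint(0, 10, size=(256,))
pruned.fit(x, y, epochs=2,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before exporting the slimmed-down model.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)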

Another angle for optimization is quantization. In quantization we change the numeric precision of some layers in the network to a lighter, faster type. This exploits the fact that weaker execution hardware, such as mobile phones, is slower at floating-point arithmetic (e.g. 32-bit floating point, FP32) than at integer arithmetic (e.g. 8-bit integer, INT8). It is one of the best tricks for NN optimization. However, we were again out of luck: the NCS does not support INT8 operations, and we were back to square one.
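
For context, this is roughly what post-training INT8 quantization looks like with TensorFlow Lite, on hardware that does support it; the saved-model path is a placeholder and random data stands in for real calibration frames. On the NCS this route is closed, since the MYRIAD plugin executes FP16:

import numpy as np
import tensorflow as tf

# Representative (calibration) data generator; random tensors stand in for
# real camera frames here.
def representative_data():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("/path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()

with open("hand_detector_int8.tflite", "wb") as f:
    f.write(tflite_model)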

Asynchronous execution

Finally, after many trials of optimization techniques, we found that we can simply avoid waiting for an inference request to complete and run another one in parallel, using OpenVINO’s asynchronous API (as described in the guide). We can grab another inference request while the others are still running and pass it another image for detection. Since we grab frames from the camera in real time, this introduces a small lag in the output but a very big gain in speed. The following diagram shows the benefit of an asynchronous IR stream:

Figure: synchronous vs. asynchronous inference request execution on the NCS.

In practice, we run a pool of 4 concurrent inference requests. The right size for the pool depends on the inference device, i.e. the NCS in our case. A CPU can sustain more IRs than the NCS, and a powerful GPU may sustain many more still. At any rate, if a model is big and therefore slow, taking a long time to complete a request, we may exhaust the pool of IRs and still see a reduction in framerate.
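
The following is a minimal sketch of that round-robin pool of inference requests, again against the (pre-2022) openvino.inference_engine Python API; the model paths, camera index, and pool size of 4 are illustrative rather than our production code:

import cv2
import numpy as np
from openvino.inference_engine import IECore

NUM_REQUESTS = 4  # concurrent inference requests on the NCS (illustrative)

# "hand_detector.xml/.bin" are placeholder names for the converted IR model.
ie = IECore()
net = ie.read_network(model="hand_detector.xml", weights="hand_detector.bin")
input_name = next(iter(net.input_info))
out_name = next(iter(net.outputs))
_, _, h, w = net.input_info[input_name].input_data.shape
exec_net = ie.load_network(network=net, device_name="MYRIAD",
                           num_requests=NUM_REQUESTS)

cap = cv2.VideoCapture(0)
req_id = 0
busy = [False] * NUM_REQUESTS
while True:
    ok, frame = cap.read()
    if not ok:
        break
    request = exec_net.requests[req_id]
    if busy[req_id]:
        # Oldest request in the pool; by now it has usually finished, so this
        # wait rarely blocks. Its detections belong to a frame submitted
        # NUM_REQUESTS iterations ago -- the small output lag mentioned above.
        request.wait(-1)
        detections = request.output_blobs[out_name].buffer
        # ... consume/draw detections here ...
    # Resize to the network input, reorder HWC -> NCHW, and submit without blocking.
    blob = cv2.resize(frame, (w, h)).transpose(2, 0, 1)[np.newaxis, ...].astype(np.float32)
    exec_net.start_async(request_id=req_id, inputs={input_name: blob})
    busy[req_id] = True
    req_id = (req_id + 1) % NUM_REQUESTS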

Evaluation and performance results

Our key metrics for performance evaluation are FPS and latency. We want a higher FPS and a lower latency; however, with the asynchronous API approach the two trade off against each other. Running more concurrent IRs means a longer warm-up period and higher latency, but it delivers higher overall throughput.

To measure performance we have internal tools, but we also used the C++ benchmark tool that ships with OpenVINO. It can run models on the NCS and produces very useful statistics. While it is not a true simulation of model inference inside Tulip Vision, it gave us very good data without the need to actually run the model in our framework.
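
As an example, an asynchronous run on the NCS with a pool of 4 inference requests can be benchmarked with a command along these lines (the model name is a placeholder):

./benchmark_app -m hand_detector.xml -d MYRIAD -api async -nireq 4 -t 60 # 60-second run on the NCS (MYRIAD) device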


Figure 7. Model performance for SSDLite MobileNetV2 using the asynchronous and synchronous APIs.

Conclusion

We found the Intel NCS v2 and the OpenVINO toolkit to be very useful for our needs. Beyond being a heterogeneous platform for executing models that works smoothly with the NCS VPU, OpenVINO also has pretty extensive support for conversion and benchmarking, and useful tools for optimization. We tried many things before we found the asynchronous inference option, and learned a lot in the process. This is how we deliver top-notch performance to our clients using Tulip Vision at a very low cost. The asynchronous API in OpenVINO gave the biggest performance increase relative to the more traditional model-optimization tricks we tried, but those may still come in handy as we roll out more deep models into Vision production.