We are trying to use the Faster R-CNN network (also available as an example in MXNet) to automatically extract birds from pictures. However, recognizing a bird in one picture takes about 10 seconds on a CPU, which is too slow for a production environment. To improve performance, I downloaded MKL version 2017u4 from the Intel site and installed it on the server. After recompiling MXNet:

make clean
make USE_BLAS=openblas USE_CUDNN=1 USE_CUDA_PATH=/usr/local/cuda-8.0/ USE_CUDA=1 USE_MKL2017=1 -j

it takes only 3 to 4 seconds to recognize a bird in a picture. MKL really works!
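
To make the before/after comparison repeatable, a small timing helper can be handy. The following is only a sketch, assuming Python 3: `average_latency` and `speedup` are hypothetical helper names, and `predict` stands for whatever function runs the detector on one image (it is not an MXNet API). The numbers at the bottom are the rough figures quoted above.

```python
import time


def average_latency(predict, inputs, warmup=1):
    """Return the mean wall-clock seconds per predict(x) call over inputs."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("need at least one input to time")
    # Warm-up calls are not timed, so one-time setup cost does not skew the mean.
    for x in inputs[:warmup]:
        predict(x)
    start = time.perf_counter()
    for x in inputs:
        predict(x)
    return (time.perf_counter() - start) / len(inputs)


def speedup(before_seconds, after_seconds):
    """Ratio of the old latency to the new latency."""
    return before_seconds / after_seconds


# Figures from the text: about 10 s before MKL, 3 to 4 s after.
low = speedup(10.0, 4.0)
high = speedup(10.0, 3.0)
print(f"speedup between {low:.1f}x and {high:.1f}x")
```

Averaging over several pictures, rather than timing a single one, gives a steadier number to compare between builds.
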
Using a GPU for inference is another option, but an EC2 instance with a GPU is much more expensive than a normal EC2 instance, so we will keep using the CPU for the near future.

Prices of EC2 instances in US West (Oregon)

Instance     vCPU  ECU       Memory (GiB)  Instance Storage (GB)  Linux/UNIX Usage
t2.large     2     Variable  8             EBS Only               $0.0928 per Hour
g2.2xlarge   8     26        15            60 SSD                 $0.65 per Hour
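
The price gap can be turned into a rough break-even figure. The sketch below rests on several assumptions: the instance is kept fully busy, it processes one picture at a time, and the 3 to 4 second CPU latency (taken here as 3.5 s) would also hold on a t2.large, which the measurements above do not confirm. The function names are made up for illustration.

```python
CPU_PRICE = 0.0928  # t2.large, USD per hour (from the table)
GPU_PRICE = 0.65    # g2.2xlarge, USD per hour (from the table)


def cost_per_1000_images(price_per_hour, seconds_per_image):
    """USD to process 1000 images on a fully utilized instance."""
    return price_per_hour / 3600.0 * seconds_per_image * 1000.0


def break_even_seconds(cpu_price, gpu_price, cpu_seconds):
    """GPU latency at which the cost per image equals the CPU cost per image."""
    return cpu_seconds * cpu_price / gpu_price


cpu_seconds = 3.5  # assumption: midpoint of the 3 to 4 s measured with MKL
cpu_cost = cost_per_1000_images(CPU_PRICE, cpu_seconds)
gpu_limit = break_even_seconds(CPU_PRICE, GPU_PRICE, cpu_seconds)
print(f"CPU cost per 1000 images: ${cpu_cost:.4f}")
print(f"GPU breaks even below {gpu_limit:.2f} s per image")
```

Under these assumptions the g2.2xlarge would have to handle a picture in roughly half a second to match the CPU instance on cost per picture. We have not measured GPU inference time, so this is only a threshold to compare against, not a result.
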