To find a fast object-detection tool on GitHub (by fast I mean detecting in less than 1 second on a mainstream CPU), I experimented with several repositories written in PyTorch, because I am familiar with it. Below are my conclusions:
1. detectron2
This is the official tool from Facebook. I downloaded and installed it successfully. The Python test code is:

import detectron2
from detectron2.utils.logger import setup_logger
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2 import model_zoo
# import some common libraries
import numpy as np
import cv2
import sys
import time
cfg = get_cfg()
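# add project-specific config (e.g., TensorMask) here if you're not running a model in detectron2's core library
# load the config that matches the weights below; otherwise the default architecture does not fit the checkpoint
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))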
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set threshold for this model
# Find a model from detectron2's model zoo. You can use the https://dl.fbaipublicfiles... url as well
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.DEVICE = "cpu"
predictor = DefaultPredictor(cfg)
img = cv2.imread(sys.argv[1])
begin = time.time()
outputs = predictor(img)
print("time:", time.time() - begin)

It could not recognize all the birds in my test image, and it took more than 5 seconds on the CPU of my MacBook Pro. The performance was not as good as I expected.
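
A single timed call can include one-time start-up work, so a steadier number comes from timing several calls after a warm-up. Below is a minimal sketch of that idea using only the Python standard library. The names `time_inference` and `fake_predictor` are hypothetical and not part of detectron2; in practice the `predictor` object from the code above could presumably be passed in place of the stand-in.

```python
import statistics
import time


def time_inference(predict, image, warmup=1, runs=5):
    """Return (median, worst) latency in seconds for predict(image)."""
    # Warm-up calls are not timed, so one-time costs do not skew the result.
    for _ in range(warmup):
        predict(image)

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(image)
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies), max(latencies)


if __name__ == "__main__":
    # Hypothetical stand-in for a real predictor, so the sketch runs without a model.
    def fake_predictor(image):
        time.sleep(0.01)
        return []

    median, worst = time_inference(fake_predictor, image=None)
    print(f"median: {median:.3f}s, worst: {worst:.3f}s, under 1s: {median < 1.0}")
```

The median is compared against the 1-second budget because it is less sensitive to a single slow run than the mean.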

2. efficientdet
According to the paper, EfficientDet should be fast and accurate. But the test program I wrote couldn't recognize any object at all, so I gave up on this solution.
3. EfficientDet.Pytorch
I couldn't download the models from its model_zoo.
4. ssd.pytorch
Finally, I came back to my sweet SSD (Single Shot MultiBox Detector). Since I have studied it for more than half a year, I quickly wrote the snippet below:

import cv2
import numpy as np
import torch
from ssd import build_ssd  # provided by the ssd.pytorch repository


def base_transform(image, size, mean):
    # resize to the network input size and subtract the per-channel mean
    x = cv2.resize(image, (size, size)).astype(np.float32)
    x -= mean
    return x


class BaseTransform:
    def __init__(self, size, mean):
        self.size = size
        self.mean = np.array(mean, dtype=np.float32)

    def __call__(self, image, boxes=None, labels=None):
        return base_transform(image, self.size, self.mean), boxes, labels


def detect(img, net, transform):
    COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
    FONT = cv2.FONT_HERSHEY_SIMPLEX
    height, width = img.shape[:2]
    # HWC image -> CHW tensor with a leading batch dimension
    x = torch.from_numpy(transform(img)[0]).permute(2, 0, 1)
    with torch.no_grad():
        y = net(x.unsqueeze(0))  # forward pass
    detections = y.data[0]  # drop the batch dimension: [class, top_k, 5]
    # scale each detection back up to the image
    scale = torch.Tensor([width, height, width, height])
    # class index 3 is "bird" in the VOC label list (index 0 is background)
    for index, loc in enumerate(detections[3]):
        score = loc.numpy()[0]
        if score >= 0.5:
            loc = loc[1:]
            pt = loc * scale
            print(score, pt)
            cv2.rectangle(
                img,
                (int(pt[0]), int(pt[1])),
                (int(pt[2]), int(pt[3])),
                COLORS[index % 3],
                2,  # line thickness (assumed value)
            )
            cv2.putText(
                img,
                "bird",  # label text (assumed)
                (int(pt[0]), int(pt[1])),
                FONT,
                1,  # font scale (assumed value)
                (255, 255, 255),
                2,
            )
    return img


img = cv2.imread("bird_matrix.jpg")
net = build_ssd("test", 300, 21)  # initialize SSD
net.load_state_dict(torch.load("ssd300_mAP_77.43_v2.pth", map_location="cpu"))
net.eval()  # inference mode
transform = BaseTransform(net.size, (104 / 256.0, 117 / 256.0, 123 / 256.0))
img = detect(img, net, transform)
cv2.imwrite("result.jpg", img)

The result is not perfect but good enough for my current situation.
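
The post-processing step in the detect function above (keep rows whose score passes the threshold, then scale the normalized box back to pixel coordinates) can be sketched on its own, without a model. This is only an illustration in plain Python. The helper name `scale_detections` is hypothetical, each row is assumed to be laid out as (score, x1, y1, x2, y2) with coordinates in [0, 1], and the clamping to the image bounds is my addition rather than something the snippet above does.

```python
def scale_detections(rows, width, height, threshold=0.5):
    """Keep rows with score >= threshold and convert boxes to pixel coordinates.

    Each row is assumed to be (score, x1, y1, x2, y2), with the box
    coordinates normalized to the range [0, 1].
    """
    boxes = []
    for score, x1, y1, x2, y2 in rows:
        if score < threshold:
            continue
        # Scale to pixels, then clamp so boxes that spill over the edge stay drawable.
        left = min(max(int(x1 * width), 0), width - 1)
        top = min(max(int(y1 * height), 0), height - 1)
        right = min(max(int(x2 * width), 0), width - 1)
        bottom = min(max(int(y2 * height), 0), height - 1)
        boxes.append((score, left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    # Made-up detections for a 200x100 image, for illustration only.
    rows = [
        (0.9, 0.25, 0.5, 0.75, 1.0),
        (0.3, 0.0, 0.0, 1.0, 1.0),
        (0.5, -0.25, 0.0, 1.5, 0.5),
    ]
    for box in scale_detections(rows, width=200, height=100):
        print(box)
```

Keeping this step separate from drawing makes it easy to check the threshold and the coordinate math before involving OpenCV.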