Face Detection in Natural Scenes: Techniques and Practice

I. Background

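Face detection uses automated image analysis to return the position and size of every face in an image. It is a core component of face analysis applications and has broad value for both academic research and business, for example in face recognition, face attribute analysis (age estimation, gender recognition, attractiveness scoring, and expression recognition), face avatars, intelligent video surveillance, face image filtering, smart image cropping, and face-based AR games. Shooting conditions vary widely: natural scenes are complex and changeable, lighting is uncontrolled, faces appear in many poses, and people in groups occlude one another. All of this makes the task challenging (see Figure 1). For the past 20 years, face detection has remained an active topic in both academia and industry.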

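Natural-scene face detection is also widely needed across Meituan's businesses. To address the technical challenges of natural scenes while meeting business performance requirements, Meituan's Vision Intelligence Center (VIC) improved both the underlying algorithm model and the system architecture and developed VICFace, a high-accuracy face detection model. On the well-known public benchmark WIDER FACE, VICFace reaches a level on par with mainstream industry methods.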

Figure 1. Sample images for face detection in natural scenes

II. State of the Art

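Unlike deep learning, traditional methods approach natural-scene face detection by designing two parts separately: feature representation and classifier learning. The most representative work is the Viola-Jones algorithm [2], which trains its model with hand-crafted Haar-like features and AdaBoost. Traditional methods run fast on CPUs, give interpretable results, and perform well in relatively controlled environments. However, when the amount of training data grows exponentially, their accuracy improves only marginally, and in some complex scenes they cannot meet application requirements at all.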

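As compute power and training data have grown, deep-learning methods have made breakthrough progress in face detection and now hold an overwhelming accuracy advantage over traditional methods. By architecture, deep-learning face detectors fall roughly into three categories: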

1) Cascade-based face detectors.

2) Two-stage face detectors.

3) Single-stage face detectors.

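Cascade-based methods (e.g. Cascade CNN [3], MTCNN [4]) run fast with moderate accuracy and suit scenarios with limited compute, simple backgrounds, and few faces. Two-stage methods are generally built on the Faster R-CNN [6] framework: the first stage generates region proposals, and the second stage classifies and regresses them. They are accurate but slow; representative methods include Face R-CNN [9], ScaleFace [10], and FDNet [11]. Single-stage methods rely on anchor-based classification and regression and are usually optimized on top of classic frameworks such as SSD [12] and RetinaNet [13]. They are faster than two-stage methods and more accurate than cascade methods, so they balance accuracy and speed, and they are currently the mainstream direction for optimizing face detectors.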

III. Optimization Approach and Business Applications

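To meet accuracy requirements and remain practical in natural-scene applications, Meituan's Vision Intelligence Center (VIC) adopted the mainstream anchor-based single-stage approach and optimized it in three areas: data augmentation and sampling strategy, model architecture design, and loss functions. The result is the high-accuracy face detection model VICFace. The technical details follow.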

1. Data Augmentation and Sampling Strategy

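Single-stage general object detectors are sensitive to data augmentation. For example, data augmentation raises the mAP of the classic SSD algorithm on the VOC2007 [50] dataset by 6.7 points. The classic single-stage face detector S3FD [17] also designed a sample augmentation strategy, using random image cropping, scaling at a fixed aspect ratio, color distortion, and horizontal flipping.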

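PyramidBox [18], published by Baidu at ECCV 2018, proposed the data-anchor-sampling method. It rescales a randomly selected face in the image to a size near a smaller anchor, and rescales the training image by the same factor. Turning larger faces into smaller ones increases the diversity of small-scale samples. On the Easy, Medium, and Hard subsets of WIDER FACE [1], this improves results by 0.4 (94.3 to 94.7), 0.4 (93.3 to 93.7), and 0.6 (86.1 to 86.7) respectively. ISRN [19] combines SSD's sample augmentation with data-anchor-sampling and improves detection performance further.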
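The scale selection behind data-anchor-sampling can be sketched as follows. This is a minimal illustration based on a reading of the PyramidBox paper, not the VICFace implementation: the anchor sizes in `ANCHOR_SCALES`, the function name `data_anchor_scale`, and the sampling ranges are assumptions made for the example.

```python
import math
import random

# Assumed anchor sizes (in pixels) for six detection layers; hypothetical values.
ANCHOR_SCALES = [16, 32, 64, 128, 256, 512]


def data_anchor_scale(face_w, face_h, rng=random):
    """Return the factor by which to resize the image (and the selected face)."""
    face_size = math.sqrt(face_w * face_h)
    # Index of the anchor whose size is closest to the selected face.
    nearest = min(range(len(ANCHOR_SCALES)),
                  key=lambda i: abs(ANCHOR_SCALES[i] - face_size))
    # Pick a target anchor at most one level above the nearest one, so large
    # faces are mostly mapped to smaller anchors.
    target_idx = rng.randint(0, min(len(ANCHOR_SCALES) - 1, nearest + 1))
    # Jitter the target size around the chosen anchor.
    target_size = rng.uniform(ANCHOR_SCALES[target_idx] / 2,
                              ANCHOR_SCALES[target_idx] * 2)
    return target_size / face_size
```

The returned factor is applied to the whole training image, after which a training crop containing the rescaled face would be taken.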

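VICFace builds on ISRN's sample augmentation and additionally filters out extremely small faces whose semantics are ambiguous. mixup [22], which has already proven effective in image classification and object detection, is applied here to face detection, where it effectively prevents overfitting. Business data contains faces with varied poses, occlusion, and blur; such samples make up a small share of the training set and are hard to detect. During training, these hard samples are therefore dynamically given higher weights, which can improve their recall.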
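A minimal sketch of mixup applied to detection data is shown below. It assumes two images of the same size stored as nested lists of pixel values, and it keeps the boxes of both images with their mixing weights. The function name `mixup_detection` and the `alpha` value are illustrative assumptions, not the settings used for VICFace.

```python
import random


def mixup_detection(image_a, boxes_a, image_b, boxes_b, alpha=1.5, rng=random):
    """Blend two equally sized images and merge their box annotations."""
    # Mixing coefficient drawn from a symmetric Beta distribution.
    lam = rng.betavariate(alpha, alpha)
    mixed = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    # Keep every box, tagged with the weight of the image it came from.
    targets = [(box, lam) for box in boxes_a]
    targets += [(box, 1.0 - lam) for box in boxes_b]
    return mixed, targets, lam
```

The per-box weights can then scale the loss of each target, so faces from the fainter image contribute less.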

2. Model Architecture Design

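The architecture of a face detection model consists of four parts: the detection framework, the backbone network, the prediction module, and the anchor settings with positive/negative sample assignment. Together they are the core of optimizing single-stage face detectors.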

Detection Framework

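Single-stage face detection frameworks have advanced considerably in recent years. Representative structures include SSD as used in S3FD [17], RetinaNet as used in SFDet [25], the two-step structure used in SRN [23] (hereafter SRN), and the dual-shot structure used in DSFD [24] (hereafter DSFD), as shown in Figure 2. SRN is a single-stage, two-step face detection method. For small-scale faces, it uses the first-step detection results to filter out easily classified negatives, which improves the balance between positive and negative samples. For large-scale faces, it localizes faces by iterative refinement, which improves their localization accuracy. Together these raise detection accuracy. Evaluated on WIDER FACE, SRN achieves the best detection results of the four (measured by AP, average precision, under the standard protocol), as shown in Table 1.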
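The first-step filtering of easy negatives can be sketched as below. This is a simplified illustration: the function name `select_for_second_step` and the 0.99 threshold are assumptions (the threshold follows a recollection of the SRN paper's default and should be checked against it), and each anchor is represented only by its first-step background probability.

```python
def select_for_second_step(first_step_bg_probs, threshold=0.99):
    """Return indices of anchors that are passed on to the second step.

    Anchors that the first step already classifies as background with very
    high confidence are dropped, which reduces the number of easy negatives.
    """
    return [i for i, p_bg in enumerate(first_step_bg_probs) if p_bg <= threshold]
```

Only the surviving anchors contribute to the second-step classification loss, so the positive/negative ratio seen by that step is less skewed.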

Figure 2. Four detection structures (panels: S3FD, SFDet, SRN, DSFD)

Table 1. Evaluation results of the four detection structures on WIDER FACE with a ResNet50 backbone

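VICFace inherits the SRN detection structure, currently the best performing one. To better fuse bottom-up and top-down features, it also assigns different weights to different features and different channels. Taking P4 as an example, it is computed as a channel-wise weighted sum:

$$P_4 = W_{C4} \odot \mathrm{Conv}(C_4) + W_{P4} \odot \mathrm{Upsample}(P_5)$$

Here $\odot$ denotes channel-wise multiplication. The vector $W_{C4}$ has as many elements as Conv(C4) has channels, and $W_{P4}$ has as many elements as Upsample(P5) has channels. Both $W_{C4}$ and $W_{P4}$ are learnable, all of their elements are greater than 0, and corresponding elements of $W_{C4}$ and $W_{P4}$ sum to 1. The structure is shown in Figure 3.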
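One way to satisfy the constraints on the two weight vectors (every element positive, corresponding elements summing to 1) is to learn two unconstrained values per channel and pass each pair through a softmax. The sketch below assumes that parameterization and a weighted-sum fusion of Conv(C4) and Upsample(P5); the article does not say how VICFace enforces the constraints, and the names `fusion_weights` and `fuse` are hypothetical.

```python
import math


def fusion_weights(logits_c, logits_p):
    """Turn per-channel logits into positive weights with w_c + w_p == 1."""
    w_c, w_p = [], []
    for a, b in zip(logits_c, logits_p):
        m = max(a, b)  # subtract the max for numerical stability
        ea, eb = math.exp(a - m), math.exp(b - m)
        w_c.append(ea / (ea + eb))
        w_p.append(eb / (ea + eb))
    return w_c, w_p


def fuse(lateral, top_down, w_c, w_p):
    """Channel-wise weighted sum of two feature maps.

    lateral and top_down are lists of channels; each channel is a flat list
    of activations, and both inputs have the same shape.
    """
    return [
        [wc * x + wp * y for x, y in zip(ch_l, ch_t)]
        for ch_l, ch_t, wc, wp in zip(lateral, top_down, w_c, w_p)
    ]
```

With equal logits both features contribute equally; training can then shift each channel toward the bottom-up or the top-down feature.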

Figure 3. Overall network structure of VICFace from the Vision Intelligence Center

Backbone Network

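The backbone of a single-stage face detector is usually a classic architecture from image classification, such as VGG [26] or ResNet [27]. The better a backbone performs on ImageNet classification, the higher its face detection performance on WIDER FACE, as shown in Table 2. To give the detection network higher recall, VICFace uses ResNet152 as its backbone for performance evaluation, since it performs well on ImageNet (Top-1 classification accuracy of 80.26%). In the implementation, the convolution module with a 7x7 kernel and stride 2 is replaced by three 3x3 convolution modules, where the first has stride 2 and the others have stride 1. The downsampling module with a 1x1 kernel and stride 2 is replaced by an average-pooling module with stride 2.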
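A quick way to compare the original 7x7 stem with the three-layer 3x3 replacement is to compute the receptive field and total stride of each stack. The helper below is a generic calculation for a chain of convolutions, written for illustration; the name `stem_receptive_field` is hypothetical.

```python
def stem_receptive_field(layers):
    """Return (receptive_field, total_stride) for a chain of (kernel, stride) convs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # spacing between neighboring outputs, in input pixels
    return rf, jump


original_stem = [(7, 2)]
deep_stem = [(3, 2), (3, 1), (3, 1)]
print(stem_receptive_field(original_stem))  # (7, 2)
print(stem_receptive_field(deep_stem))      # (11, 2)
```

Both stems downsample by the same factor of 2, while the three-layer version sees a wider input window and adds two extra nonlinearities.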

Table 2. ImageNet performance of different backbones and their detection accuracy under the RetinaNet framework

Prediction Module

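Context information can further improve detection performance. SSH [36] was an early design that brought context information into single-stage face detectors, and PyramidBox, SRN, and DSFD each designed their own context modules. As shown in Figure 4, the SRN context module uses 1xk and kx1 convolution layers to provide a variety of rectangular receptive fields, and receptive fields of different shapes help detect faces in extreme poses. DSFD uses several dilated convolutions, which greatly enlarge the receptive field.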

Figure 4. Context modules in different network structures

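VICFace combines dilated convolutions with 1xk and kx1 convolutions in its context module, which both enlarges the receptive field and helps detect faces in extreme poses. It also uses a Maxout module to raise recall and reduce false positives. In addition, it uses the face locations predicted from the Cn-layer features to calibrate the regions covered by the Pn-layer features, as shown in Figure 5. The offsets of the Cn-predicted face locations relative to the feature locations serve as the offset input of a deformable convolution, and the Pn-layer features serve as its data input. After the deformable convolution, the region covered by each feature aligns better with the face region and the feature becomes more representative, which improves the performance of the detection model.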
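The article does not describe how the Maxout module is wired. A common form in face detectors is the max-in-out label, where the classifier predicts several face scores and several background scores per anchor and keeps the maximum of each group before the softmax. The sketch below assumes that form; the function name `max_in_out_face_prob` is hypothetical.

```python
import math


def max_in_out_face_prob(face_logits, background_logits):
    """Face probability from several face and background logits per anchor."""
    face = max(face_logits)              # strongest face evidence
    background = max(background_logits)  # strongest background evidence
    m = max(face, background)            # subtract the max for numerical stability
    ef, eb = math.exp(face - m), math.exp(background - m)
    return ef / (ef + eb)
```

Taking the maximum over several background scores makes it easier to reject an anchor, which reduces false positives; taking the maximum over several face scores works in the opposite direction and helps recall.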

Figure 5. Prediction module in the in-house detection model

Anchor Settings and Positive/Negative Sample Assignment

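With well-chosen anchors, anchor-based single-stage face detectors can effectively control the ratio of positive to negative samples and alleviate the large differences in localization loss between faces of different scales. Mainstream face detectors use three main schemes for anchor sizes, each expressed in terms of the stride S of the layer.

Depending on the characteristics of the faces in the dataset, the anchor aspect ratios can also be enriched, for example {1}, {0.8}, or {1, 0.67}.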

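In the in-house scheme, anchors at the C3 and P3 layers have sizes 2S and 4S, and anchors at the other layers have size 4S (S is the stride of the corresponding layer). This setting preserves face recall while reducing the number of negative samples, which partly alleviates the imbalance between positive and negative samples. Based on statistics of the aspect ratios of face samples, the anchor aspect ratio is set to 0.8. At the Cn layers, samples with IoU above 0.7 are assigned as positives and those with IoU below 0.3 as negatives. At the Pn layers, samples with IoU above 0.5 are assigned as positives and those with IoU below 0.4 as negatives.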
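The anchor shapes and IoU-based assignment described above can be sketched as follows. Several details are assumptions made for the example: the aspect ratio 0.8 is read as width divided by height, anchor "size" is read as the square root of the anchor area, and anchors whose IoU falls between the two thresholds are ignored during training. The names `anchor_shapes`, `iou`, and `assign` are hypothetical.

```python
import math

# (positive threshold, negative threshold) for the two branches in the text.
THRESHOLDS = {"C": (0.7, 0.3), "P": (0.5, 0.4)}


def anchor_shapes(stride, is_level3, ratio=0.8):
    """Return (width, height) pairs for one layer; ratio is width / height."""
    scales = (2, 4) if is_level3 else (4,)
    shapes = []
    for k in scales:
        size = k * stride  # square root of the anchor area
        shapes.append((size * math.sqrt(ratio), size / math.sqrt(ratio)))
    return shapes


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign(anchor, faces, branch):
    """Return 1 (positive), 0 (negative), or -1 (ignored) for one anchor."""
    pos_thr, neg_thr = THRESHOLDS[branch]
    best = max((iou(anchor, face) for face in faces), default=0.0)
    if best > pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return -1
```

For example, an anchor with IoU 0.6 against its best-matching face is ignored at a Cn layer but counts as a positive at the corresponding Pn layer.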

3. Loss Functions

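The optimization objective of face detection has to separate positive from negative samples (face or not) and also localize the position and size of each face. S3FD uses a cross-entropy loss to separate positives from negatives and Smooth L1 Loss to localize face position and size, together with hard negative mining to handle the imbalance between positive and negative samples. A more direct way to reduce the performance loss caused by this imbalance is Focal Loss [13], proposed by Lin et al. UnitBox [41] proposed IoU Loss, which alleviates the performance loss caused by large differences in localization loss between faces of different scales. AInnoFace [40] uses Focal Loss and IoU Loss together to improve face detection performance. Adding related auxiliary tasks can also help: RetinaFace [42] adds a landmark localization task to improve localization accuracy, and DFS [43] adds a face segmentation task to improve feature representation.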
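For reference, the binary form of Focal Loss for a single anchor can be written in a few lines. The defaults `alpha=0.25` and `gamma=2.0` are the values reported in the Focal Loss paper [13]; the article does not state which values VICFace uses, and the function name `focal_loss` is chosen for the example.

```python
import math


def focal_loss(p, label, alpha=0.25, gamma=2.0):
    """Focal loss for one anchor.

    p is the predicted face probability (strictly between 0 and 1) and
    label is 1 for a face anchor, 0 for a background anchor.
    """
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

An easy negative with p = 0.01 contributes almost nothing, while a missed face with p = 0.1 keeps most of its cross-entropy loss, so the many easy background anchors no longer dominate training.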

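Combining the strengths of the methods above, VICFace exploits the complementary information between face detection and related tasks and trains the detector in a multi-task fashion. For face classification, it uses Focal Loss to alleviate sample imbalance, and uses face landmark localization and face segmentation as auxiliary tasks to help train the classification objective, which improves overall classification accuracy. For face localization, it uses Complete IoU Loss [47], which takes the IoU between the target box and the predicted box as the loss to alleviate the large differences in loss between faces of different scales, and also accounts for the center-point distance and the aspect-ratio difference between the two boxes, leading to better overall detection performance.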
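The Complete IoU loss for one pair of boxes can be sketched as below, following the formulation in [47]: one minus IoU, plus the squared center distance normalized by the squared diagonal of the smallest enclosing box, plus a weighted aspect-ratio term. Boxes are assumed to be (x1, y1, x2, y2) with positive width and height, and the small `eps` constant is an implementation convenience added for this example.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """Complete IoU loss between a predicted box and a target box."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Overlap term.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    iou = inter / (pw * ph + tw * th - inter + eps)

    # Squared distance between the box centers.
    rho2 = ((px1 + px2) - (tx1 + tx2)) ** 2 / 4 + ((py1 + py2) - (ty1 + ty2)) ** 2 / 4
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(px2, tx2) - min(px1, tx1)) ** 2 + (max(py2, ty2) - min(py1, ty1)) ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```

Unlike a plain IoU loss, the distance term still gives a useful gradient when the two boxes do not overlap at all.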

4. Results and Business Applications

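With the support of the cluster platform, the Vision Intelligence Center compared its natural-scene face detection base model, VICFace, against existing mainstream methods. On the three validation subsets (Easy, Medium, Hard) of the public face detection benchmark WIDER FACE, VICFace reaches a leading level on all three (AP is average precision; higher is better), as shown in Figure 6 and Table 3.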

Figure 6. Evaluation results of VICFace and current mainstream face detection methods on WIDER FACE

Table 3. Evaluation results of VICFace and current mainstream face detection methods on WIDER FACE

Note: SRN was proposed by the Chinese Academy of Sciences at AAAI 2019, DSFD by Tencent YouTu at CVPR 2019, PyramidBox++ by Baidu in 2019, and AInnoFace by AInnovation in 2019. RetinaFace was the runner-up in the ICCV 2019 WIDER Challenge.

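In production, the natural-scene face detection service is already used by several Meituan business lines, where it meets the performance requirements of applications such as intelligent filtering of UGC images and display of advertising POI images. The former protects user privacy and helps prevent infringement of users' portrait rights. The latter effectively prevents faces from being partially cropped out of displayed images, which improves the user experience. VICFace also serves as the core base model for other face analysis applications, such as automatically checking whether kitchen staff are dressed according to regulations (wearing hats and masks), which adds another safeguard for food safety.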

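In future work, to give users a better experience while meeting high-concurrency requirements, the team will further explore and optimize model architecture design and inference efficiency. On the algorithm side, anchor-free single-stage object detection methods have shown strong potential in general object detection in recent years, and they are an important direction the Vision Intelligence Center will follow.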

References

1. Yang S, Luo P, Loy C C, et al. Wider face: A face detection benchmark[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 5525-5533.

2. Viola P, Jones M J. Robust real-time face detection[J]. International journal of computer vision, 2004, 57(2): 137-154.

3. Li H, Lin Z, Shen X, et al. A convolutional neural network cascade for face detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 5325-5334.

4. Zhang K, Zhang Z, Li Z, et al. Joint face detection and alignment using multitask cascaded convolutional networks[J]. IEEE Signal Processing Letters, 2016, 23(10): 1499-1503.

5. Hao Z, Liu Y, Qin H, et al. Scale-aware face detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6186-6195.

6. Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time object detection with region proposal networks[C]//Advances in neural information processing systems. 2015: 91-99.

7. Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 2117-2125.

8. Jiang H, Learned-Miller E. Face detection with the faster R-CNN[C]//2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). IEEE, 2017: 650-657.

9. Wang H, Li Z, et al. Face R-CNN[J]. arXiv preprint arXiv:1706.01061, 2017.

10. Yang S, Xiong Y, Loy C C, et al. Face detection through scale-friendly deep convolutional networks[J]. arXiv preprint arXiv:1706.02863, 2017.

11. Zhang C, Xu X, Tu D. Face detection using improved faster rcnn[J]. arXiv preprint arXiv:1802.02142, 2018.

12. Liu W, Anguelov D, Erhan D, et al. Ssd: Single shot multibox detector[C]//European conference on computer vision. Springer, Cham, 2016: 21-37.

13. Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2980-2988.

14. Huang L, Yang Y, Deng Y, et al. Densebox: Unifying landmark localization with end to end object detection[J]. arXiv preprint arXiv:1509.04874, 2015.

15. Liu W, Liao S, Ren W, et al. High-level Semantic Feature Detection: A New Perspective for Pedestrian Detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 5187-5196.

16. Zhang Z, He T, Zhang H, et al. Bag of freebies for training object detection neural networks[J]. arXiv preprint arXiv:1902.04103, 2019.

17. Zhang S, Zhu X, Lei Z, et al. S3fd: Single shot scale-invariant face detector[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 192-201.

18. Tang X, Du D K, He Z, et al. Pyramidbox: A context-assisted single shot face detector[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 797-813.

19. Zhang S, Zhu R, Wang X, et al. Improved selective refinement network for face detection[J]. arXiv preprint arXiv:1901.06651, 2019.

20. Li Z, Tang X, Han J, et al. PyramidBox++: High Performance Detector for Finding Tiny Face[J]. arXiv preprint arXiv:1904.00386, 2019.

21. Zhang S, Zhu X, Lei Z, et al. Faceboxes: A CPU real-time face detector with high accuracy[C]//2017 IEEE International Joint Conference on Biometrics (IJCB). IEEE, 2017: 1-9.

22. Zhang H, Cisse M, Dauphin Y N, et al. mixup: Beyond empirical risk minimization[J]. arXiv preprint arXiv:1710.09412, 2017.

23. Chi C, Zhang S, Xing J, et al. Selective refinement network for high performance face detection[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 8231-8238.

24. Li J, Wang Y, Wang C, et al. Dsfd: dual shot face detector[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 5060-5069.

25. Zhang S, Wen L, Shi H, et al. Single-shot scale-aware network for real-time face detection[J]. International Journal of Computer Vision, 2019, 127(6-7): 537-559.

26. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.

27. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.

28. Xie S, Girshick R, Dollár P, et al. Aggregated residual transformations for deep neural networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 1492-1500.

29. Iandola F, Moskewicz M, Karayev S, et al. Densenet: Implementing efficient convnet descriptor pyramids[J]. arXiv preprint arXiv:1404.1869, 2014.

30. Howard A G, Zhu M, Chen B, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.

31. Sandler M, Howard A, Zhu M, et al. Mobilenetv2: Inverted residuals and linear bottlenecks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4510-4520.

32. Bazarevsky V, Kartynnik Y, Vakunov A, et al. BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs[J]. arXiv preprint arXiv:1907.05047, 2019.

33. He Y, Xu D, Wu L, et al. LFFD: A Light and Fast Face Detector for Edge Devices[J]. arXiv preprint arXiv:1904.10633, 2019.

34. Zhu R, Zhang S, Wang X, et al. Scratchdet: Exploring to train single-shot object detectors from scratch[J]. arXiv preprint arXiv:1810.08425, 2018.

35. Lin T Y, Maire M, Belongie S, et al. Microsoft coco: Common objects in context[C]//European conference on computer vision. Springer, Cham, 2014: 740-755.

36. Najibi M, Samangouei P, Chellappa R, et al. Ssh: Single stage headless face detector[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 4875-4884.

37. Earp S, Noinongyao P, Cairns J, et al. Face detection with feature pyramids and landmarks[J]. arXiv preprint arXiv:1912.00596, 2019.

38. Goodfellow I J, Warde-Farley D, Mirza M, et al. Maxout networks[J]. arXiv preprint arXiv:1302.4389, 2013.

39. Zhu C, Tao R, Luu K, et al. Seeing Small Faces from Robust Anchor’s Perspective[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5127-5136.

40. Zhang F, Fan X, Ai G, et al. Accurate face detection for high performance[J]. arXiv preprint arXiv:1905.01585, 2019.

41. Yu J, Jiang Y, Wang Z, et al. Unitbox: An advanced object detection network[C]//Proceedings of the 24th ACM international conference on Multimedia. ACM, 2016: 516-520.

42. Deng J, Guo J, Zhou Y, et al. RetinaFace: Single-stage Dense Face Localisation in the Wild[J]. arXiv preprint arXiv:1905.00641, 2019.

43. Tian W, Wang Z, Shen H, et al. Learning better features for face detection with feature fusion and segmentation supervision[J]. arXiv preprint arXiv:1811.08557, 2018.

44. Zhang Y, Xu X, Liu X. Robust and high performance face detector[J]. arXiv preprint arXiv:1901.02350, 2019.

45. Zhang S, Chi C, Lei Z, et al. RefineFace: Refinement neural network for high performance face detection[J]. arXiv preprint arXiv:1909.04376, 2019.

46. Wang J, Yuan Y, Li B, et al. Sface: An efficient network for face detection in large scale variations[J]. arXiv preprint arXiv:1804.06559, 2018.

47. Zheng Z, Wang P, Liu W, et al. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression[J]. arXiv preprint arXiv:1911.08287, 2019.

48. Bay H, Tuytelaars T, Van Gool L. Surf: Speeded up robust features[C]//European conference on computer vision. Springer, Berlin, Heidelberg, 2006: 404-417.

49. Yang B, Yan J, Lei Z, et al. Aggregate channel features for multi-view face detection[C]//IEEE international joint conference on biometrics. IEEE, 2014: 1-8.

50. Everingham M, Van Gool L, Williams C K I, et al. The PASCAL visual object classes challenge 2007 (VOC2007) results[J]. 2007.

51. Redmon J, Farhadi A. Yolov3: An incremental improvement[J]. arXiv preprint arXiv:1804.02767, 2018.

About the Authors

Zhenhua (振华), Huanhuan (欢欢), and Xiaolin (晓林) are all engineers at Meituan's Vision Intelligence Center.

We Are Hiring

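The Basic Vision group of Meituan's Vision Intelligence Center is responsible for strengthening the core low-level technologies of visual intelligence and providing platform-level vision solutions for the company's businesses. Its main directions are base model optimization, large-scale distributed training, server efficiency optimization, mobile adaptation and optimization, and incubation of innovative products.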

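Candidates working in computer vision and related fields are welcome to join us. Please send your resume to tech@meituan.com (subject line: 美团视觉智能中心基础视觉组, the Basic Vision group of Meituan's Vision Intelligence Center).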
