Text detection with kernel-sharing dilated convolutions and attention-guided FPN

Yue-bo MENG; Dan JIN; Guang-hui LIU; Sheng-jun XU; Jiu-qiang HAN; De-wang SHI

doi:10.37188/OPE.20212908.1955

Information Sciences | Updated: 2021-09-01

- Optics and Precision Engineering Vol. 29, Issue 8, Pages: 1955-1967 (2021)
- Author affiliation: School of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an, Shaanxi 710055
- DOI: 10.37188/OPE.20212908.1955
  CLC: TP273
- Received: 07 January 2021,
  Revised: 13 April 2021,
  Published: 15 August 2021
MENG Yue-bo, JIN Dan, LIU Guang-hui, et al. Text detection with kernel-sharing dilated convolutions and attention-guided FPN[J]. Optics and Precision Engineering, 2021, 29(08): 1955-1967. DOI: 10.37188/OPE.20212908.1955.

Abstract

High-resolution images exhibit large differences in feature scale, which makes fine-grained features difficult to capture and multi-scale features difficult to fuse. To address these problems, a text detection method for complex scenes based on kernel-sharing dilated convolutions and an attention-guided feature pyramid network (KDA-FPN) is proposed. An intersection over minimum (IOM) post-processing strategy is also proposed to reduce the mask overlap caused by large variations in text aspect ratio and thereby improve detection. First, the model uses ResNet50 as the backbone network and an FPN structure to capture multi-scale features. Dilated convolutions then enlarge the receptive field, improve multi-scale feature capture, and mine the fine-grained features of text more deeply, while kernel sharing reduces the number of model parameters and the computational cost. At the same time, a context attention module (CxAM) captures the semantic relationships among multiple receptive fields, and a content attention module (CnAM) precisely locates target positions, which strengthens multi-scale fusion and improves feature-map quality. Finally, the candidate boxes predicted for the same text region are sorted by size, and the ratio of the intersection area between the largest box and an adjacent text box to the area of the smaller box is used as the screening criterion for candidate boxes, which suppresses mask overlap in the detection results and yields accurate text detection. Comparative experiments on the ICDAR2013, ICDAR2015, and Total-Text datasets show that the precision and recall of the model are 95.3 and 90.4, respectively, for horizontal scene text detection; 87.1 and 84.2 for inclined text detection; and 69.6 and 57.3 for arbitrary-shaped text detection. The proposed algorithm effectively overcomes the influence of factors such as image resolution, text shape, and text length, improves detection precision, and produces more accurate text boundaries.
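
The IOM screening step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the threshold or the exact suppression loop, and the method operates on predicted masks, so the axis-aligned boxes, the function names (`area`, `iom`, `filter_by_iom`), and the `threshold=0.5` default below are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumption: candidates are axis-aligned boxes (x1, y1, x2, y2);
# the paper itself screens predicted text masks.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate (inverted) boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iom(a: Box, b: Box) -> float:
    """Intersection over minimum: intersection area / area of the smaller box."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    smaller = min(area(a), area(b))
    return inter / smaller if smaller > 0 else 0.0


def filter_by_iom(boxes: List[Box], threshold: float = 0.5) -> List[Box]:
    """Keep boxes from largest to smallest, dropping any box whose IOM
    with an already kept box exceeds the (hypothetical) threshold."""
    kept: List[Box] = []
    for box in sorted(boxes, key=area, reverse=True):
        if all(iom(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept


if __name__ == "__main__":
    long_line = (0, 0, 100, 10)   # full text line
    fragment = (10, 0, 40, 10)    # partial detection inside the same line
    other_line = (0, 20, 50, 30)  # separate text line
    print(iom(long_line, fragment))
    print(filter_by_iom([fragment, other_line, long_line]))
```

In this example the fragment covers only 30% of the long line, so its intersection over union with the long line is 0.3 and a typical IoU-based suppression would keep both. Because the fragment lies entirely inside the long line, its IOM is 1.0 and it is removed, which is the overlap case that a long, thin text region tends to produce.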
