Combining residual shrinkage and spatio-temporal context for behavior detection network

HUANG Zhong; TAO Mengyuan; HU Min; LIU Juan; ZHAN Shengbao

doi:10.37188/OPE.20233104.0552

Information Science | Updated: 2023-02-28

- Combining residual shrinkage and spatio-temporal context for behavior detection network
- Optics and Precision Engineering, 2023, Vol. 31, No. 4, pp. 552-564
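- Author affiliations:
  1. School of Electronic Engineering and Intelligent Manufacturing, Anqing Normal University, Anqing, Anhui 246133
  2. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui 230009
- Author biographies:
  HUANG Zhong (1981-), male, born in Anqing, Anhui; Ph.D., associate professor, and master's supervisor. He received his Ph.D. from Hefei University of Technology in 2016 and is currently a postdoctoral researcher at the Information and Communication Engineering postdoctoral research station of Hefei University of Technology. His research focuses on image processing and optical detection. E-mail: huangzhong3315@163.com
  TAO Mengyuan (1996-), female, born in Hefei, Anhui; master's student. She received her bachelor's degree from Anqing Normal University in 2020. Her research focuses on optical image processing and natural human-computer interaction. E-mail: tmy2424851034@163.com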
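- Funding:
  General Program of the National Natural Science Foundation of China (62176084); General Program of the Natural Science Foundation of Anhui Province (1908085MF195); Outstanding Young Talents Fund for Universities of Anhui Province (gxyqZD2021122)
- DOI: 10.37188/OPE.20233104.0552
  Chinese Library Classification (CLC) number: TP394.1; TH691.9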
- Received: 2022-05-16; revised: 2022-07-08; published in print: 2023-02-25
HUANG Zhong, TAO Mengyuan, HU Min, et al. Combining residual shrinkage and spatio-temporal context for behavior detection network[J]. Optics and Precision Engineering, 2023, 31(04): 552-564. DOI: 10.37188/OPE.20233104.0552.

Abstract

To address the highly redundant feature extraction and inaccurate boundary localization of the R-C3D behavior detection network, an improved behavior detection network (RS-STCBD) that combines a residual shrinkage structure with spatio-temporal context is proposed. First, the shrinkage structure and soft thresholding are integrated into the residual module of 3D-ResNet to design a 3D residual shrinkage unit with channel-adaptive soft thresholds (3D-RSST), and multiple 3D-RSST units are cascaded to build a feature extraction network that removes redundant information such as noise and background from behavior features. Second, multi-layer convolutions replace the single convolution in the temporal proposal subnet to enlarge the temporal receptive field of the temporal proposal segments. Finally, a non-local attention mechanism is introduced into the behavior classification subnet to obtain the spatio-temporal context of actions by capturing long-range dependencies among high-quality temporal proposals. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that the mAP@0.5 of the improved network reaches 36.9% and 41.6%, which is 8.0% and 14.8% higher than R-C3D, respectively. The behavior detection method based on the improved network improves both action boundary localization and behavior classification accuracy, which helps improve the quality of human-robot interaction in natural scenes.
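The channel-adaptive soft thresholding named in the abstract can be illustrated with a minimal NumPy sketch. It assumes the formulation commonly used in deep residual shrinkage networks, where each channel's threshold is its mean absolute activation scaled by a learned coefficient in (0, 1); the abstract does not give the exact design of the 3D-RSST unit. The function names, the two-layer branch, and the weights `w1` and `w2` are hypothetical, and the 3D convolutions, batch normalization, and identity shortcut of a full residual unit are omitted.

```python
import numpy as np


def soft_threshold(x, tau):
    """Shrink x toward zero by tau; entries with |x| <= tau become zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def channel_adaptive_thresholds(features, w1, w2):
    """Hypothetical threshold branch for a (C, T, H, W) feature map.

    Assumption: tau_c = alpha_c * mean(|x_c|), where alpha_c in (0, 1) is
    produced by a small two-layer fully connected branch (ReLU, then sigmoid)
    applied to the per-channel mean absolute activations.
    """
    abs_mean = np.abs(features).mean(axis=(1, 2, 3))  # shape (C,)
    hidden = np.maximum(w1 @ abs_mean, 0.0)           # ReLU
    alpha = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))      # sigmoid, in (0, 1)
    return alpha * abs_mean                           # one threshold per channel


def shrink_features(features, w1, w2):
    """Apply one soft threshold per channel, broadcast over (T, H, W)."""
    tau = channel_adaptive_thresholds(features, w1, w2)
    return soft_threshold(features, tau[:, None, None, None]), tau


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((4, 8, 7, 7))  # illustrative sizes only
    w1 = rng.standard_normal((4, 4))           # random stand-ins for learned weights
    w2 = rng.standard_normal((4, 4))
    shrunk, tau = shrink_features(feats, w1, w2)
    print("per-channel thresholds:", tau)
    print("fraction zeroed:", np.mean(shrunk == 0.0))
```

Because each threshold scales with the channel's own mean absolute activation, low-magnitude responses are set to zero while stronger responses are kept with reduced magnitude. In a trained network the weights would be learned rather than random.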
