College of Electronic Information and Automation, Civil Aviation University of China, Tianjin 300300, China
ZHANG Hong-ying (1978-), female, born in Tianjin. Ph.D., professor. She received her B.S., M.S., and Ph.D. degrees from Tianjin University in 2001, 2004, and 2007, respectively. Her research interests include image engineering and computer vision. E-mail: carole_zhang0716@163.com
AN Zheng (1995-), male, born in Tangshan, Hebei. M.S. candidate. He received his B.S. degree from Hebei University of Science and Technology in 2018. His research interests include computer vision and human action recognition. E-mail: 2018021038@cauc.edu.cn
Received: 2020-09-21; Revised: 2020-11-12; Published in print: 2021-02-15
ZHANG Hong-ying, AN Zheng. Human action recognition based on improved two-stream spatiotemporal network [J]. Optics and Precision Engineering, 2021, 29(02): 420-429. DOI: 10.37188/OPE.20212902.0420.
Traditional two-stream networks cannot capture the temporal relationships in video sequences, which degrades recognition of actions that depend strongly on temporal order. To address this problem, a human action recognition algorithm based on an improved two-stream spatiotemporal network is proposed. First, the temporal-shift idea enables a convolutional neural network to model the temporal relationships in a video, efficiently capturing its spatiotemporal information. An attention mechanism is then used to compensate for the loss of spatial feature learning ability caused by moving channel information along the temporal axis. On this basis, a two-stream network comprising a spatiotemporal appearance stream and a spatiotemporal motion stream is constructed. Finally, the two streams are fused by weighted averaging to obtain the recognition result. Experiments on the UCF101 and HMDB51 datasets yielded accuracies of 96.3% and 77.7%, respectively, an improvement over traditional two-stream methods. These results verify that the proposed algorithm effectively captures temporal relationships in video, strengthens the network's feature representation, and improves discrimination of temporally dependent actions and visually similar actions.
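The two core operations the abstract describes can be illustrated with a minimal sketch. The first function shows the temporal-shift idea: a small fraction of feature channels is shifted one step forward in time, another fraction one step backward, and the rest are left in place, so an ordinary 2D convolution that follows can mix information across neighboring frames. The second shows the weighted-average fusion of the two streams' class scores. This is an illustrative reconstruction in numpy, not the paper's implementation; the array layout (T, C, H, W), the `shift_div` fraction, and the fusion weight `w` are assumptions for the sketch.

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels along the temporal axis (zero-padded).

    x: features of shape (T, C, H, W) -- frames, channels, height, width.
    1/shift_div of the channels receive data from the next frame,
    another 1/shift_div from the previous frame; the rest are unchanged.
    """
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # pull from the next frame
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # pull from the previous frame
    out[:, 2 * fold:] = x[:, 2 * fold:]              # remaining channels untouched
    return out

def fuse_streams(p_appearance, p_motion, w=0.5):
    """Weighted average of the two streams' class-score vectors."""
    return w * p_appearance + (1 - w) * p_motion
```

After the shift, each frame's feature map contains a few channels from its temporal neighbors, which is how a purely 2D network gains access to short-range temporal context at negligible extra cost.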