School of Electromechanical Engineering, Dalian Minzu University, Dalian 116600, Liaoning, China
Received: 16 May 2022; Revised: 21 August 2022; Published: 25 December 2022
MAO Lin, GAO Hang, YANG Dawei, et al. Chained semantic generation network for video captioning[J]. Optics and Precision Engineering, 2022, 30(24): 3198-3209. DOI: 10.37188/OPE.20223024.3198.
To address the limited expressive power of semantic features, which leads to inaccurate text descriptions in video captioning, a chained semantic generation network (ChainS-Net) is proposed. A multistage, two-branch crossed chained feature-extraction structure is constructed that uses global-domain and local-domain modules as its basic units, capturing video semantics from global and local visual features, respectively. At each stage of the network, semantic information is transformed and parsed between the global and local domains, allowing visual and semantic information to be cross-referenced and improving the expressiveness of the semantic features. On this basis, the network obtains a more effective semantic representation through multistage iterative processing, thereby improving captioning performance. Experimental results on the MSR-VTT and MSVD datasets show that ChainS-Net outperforms comparable methods; relative to the semantics-assisted video captioning network (SAVC), it improves four video-captioning metrics by an average of 2.5%.
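The multistage two-branch exchange described above can be illustrated with a short sketch. The following minimal PyTorch sketch, written for this page, shows global and local domain units that swap semantic queries at every stage; all module names (DomainModule, ChainedSemanticStage, ChainSNetSketch), the attention-based domain unit, the feature dimensions, and the additive fusion are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn

class DomainModule(nn.Module):
    """One domain unit: refines a semantic query against visual features.
    (Hypothetical realization of a global/local domain module.)"""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, semantics: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Cross-attend the current semantic estimate to the visual features.
        refined, _ = self.attn(semantics, visual, visual)
        return self.norm(semantics + self.ffn(refined))

class ChainedSemanticStage(nn.Module):
    """One chain stage: the global and local branches exchange semantics."""
    def __init__(self, dim: int):
        super().__init__()
        self.global_unit = DomainModule(dim)
        self.local_unit = DomainModule(dim)

    def forward(self, sem_g, sem_l, vis_global, vis_local):
        # Each branch refines the semantics produced by the *other* branch
        # at the previous stage, realizing the two-branch crossing.
        new_g = self.global_unit(sem_l, vis_global)
        new_l = self.local_unit(sem_g, vis_local)
        return new_g, new_l

class ChainSNetSketch(nn.Module):
    """Multistage iterative refinement of the semantic representation."""
    def __init__(self, dim: int = 512, num_stages: int = 3):
        super().__init__()
        self.stages = nn.ModuleList(ChainedSemanticStage(dim) for _ in range(num_stages))

    def forward(self, vis_global, vis_local, sem_init):
        # vis_global: (B, Ng, dim); vis_local: (B, Nl, dim); sem_init: (B, Nq, dim)
        sem_g = sem_l = sem_init
        for stage in self.stages:      # chained, stage-by-stage refinement
            sem_g, sem_l = stage(sem_g, sem_l, vis_global, vis_local)
        return sem_g + sem_l           # fused semantic representation for a caption decoder

In this sketch the cross-reference between visual and semantic information is realized by feeding each branch the semantics produced by the other branch at the previous stage, mirroring the transform-and-parse exchange between the global and local domains that the abstract describes.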
TANG P J, WANG H L. From video to language: survey of video captioning and description[J]. Acta Automatica Sinica, 2022, 48(2): 375-397. (in Chinese)
GU J X, WANG Z H, KUEN J, et al. Recent advances in convolutional neural networks[J]. Pattern Recognition, 2018, 77: 354-377. doi: 10.1016/j.patcog.2017.10.013
ELMAN J L. Finding structure in time[J]. Cognitive Science, 1990, 14(2): 179-211. doi: 10.1207/s15516709cog1402_1
VENUGOPALAN S, ROHRBACH M, DONAHUE J, et al. Sequence to sequence: video to text[C]. 2015 IEEE International Conference on Computer Vision. Santiago, Chile. IEEE, 2015: 4534-4542.
HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780. doi: 10.1162/neco.1997.9.8.1735
CHEN K J, ZHANG Y. Recurrent neural network multi-label aerial images classification[J]. Opt. Precision Eng., 2020, 28(6): 1404-1413. (in Chinese)
YAN C G, TU Y B, WANG X Z, et al. STAT: spatial-temporal attention mechanism for video captioning[J]. IEEE Transactions on Multimedia, 2020, 22(1): 229-241.
CHEN H R, LIN K, MAYE A, et al. A semantics-assisted video captioning model trained with scheduled sampling[J]. Frontiers in Robotics and AI, 2020, 7: 475767.
ZHANG Z Q, SHI Y Y, YUAN C F, et al. Object relational graph with teacher-recommended learning for video captioning[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA. IEEE, 2020: 13275-13285.
PEI W J, ZHANG J Y, WANG X R, et al. Memory-attended recurrent network for video captioning[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA. IEEE, 2019: 8339-8348.
ZHAO H Y, ZHOU W, HOU X G, et al. Multi-label classification of traditional national costume pattern image semantic understanding[J]. Opt. Precision Eng., 2020, 28(3): 695-703. (in Chinese). doi: 10.3788/OPE.20202803.0695
SHETTY R, LAAKSONEN J T. Video captioning with recurrent networks based on frame- and video-level features and visual content classification[J]. arXiv preprint arXiv:1512.02949, 2015.
TU Y B, ZHANG X S, LIU B T, et al. Video description with spatial-temporal attention[C]. MM '17: Proceedings of the 25th ACM International Conference on Multimedia. 2017: 1014-1022.
GAN Z, GAN C, HE X D, et al. Semantic compositional networks for visual captioning[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. IEEE, 2017: 1141-1150.
YANG Q L, ZHOU B H, ZHENG W, et al. Trajectory detection of small targets based on convolutional long short-term memory with attention mechanisms[J]. Opt. Precision Eng., 2020, 28(11): 2535-2548. (in Chinese). doi: 10.37188/OPE.20202811.2535
XU J, MEI T, YAO T, et al. MSR-VTT: a large video description dataset for bridging video and language[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. IEEE, 2016: 5288-5296.
GUADARRAMA S, KRISHNAMOORTHY N, MALKARNENKAR G, et al. YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition[C]. 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia. IEEE, 2013: 2712-2719.
ZOLFAGHARI M, SINGH K, BROX T. ECO: efficient convolutional network for online video understanding[C]. Computer Vision - ECCV 2018, 2018: 713-730.
XIE S N, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. IEEE, 2017: 5987-5995. doi: 10.1109/cvpr.2017.634
PASUNURU R, BANSAL M. Multi-task video captioning with video and entailment generation[C]. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada. Stroudsburg, PA, USA: Association for Computational Linguistics, 2017: 1273-1283.
PASUNURU R, BANSAL M. Reinforced video captioning with entailment rewards[C]. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark. Stroudsburg, PA, USA: Association for Computational Linguistics, 2017: 979-985.
LIU S, REN Z, YUAN J S. SibNet: sibling convolutional encoder for video captioning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(9): 3259-3272.
WANG X, WANG Y F, WANG W Y. Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning[C]. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana. Stroudsburg, PA, USA: Association for Computational Linguistics, 2018: 795-801.
WANG X, WU J W, ZHANG D, et al. Learning to compose topic-aware mixture of experts for zero-shot video captioning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 8965-8972.
WANG B R, MA L, ZHANG W, et al. Controllable video captioning with POS sequence guidance based on gated fusion network[C]. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South). IEEE, 2019: 2641-2650.
HOU J Y, WU X X, ZHAO W T, et al. Joint syntax representation learning and visual cue translation for video captioning[C]. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South). IEEE, 2019: 8917-8926.
AAFAQ N, AKHTAR N, LIU W, et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning[C]. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA. IEEE, 2019: 12479-12488.
PAN B X, CAI H Y, HUANG D A, et al. Spatio-temporal graph for video captioning with knowledge distillation[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA. IEEE, 2020: 10867-10876.
ZHENG Q, WANG C Y, TAO D C. Syntax-aware action targeting for video captioning[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA. IEEE, 2020: 13093-13102.
PAN Y W, MEI T, YAO T, et al. Jointly modeling embedding and translation to bridge video and language[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. IEEE, 2016: 4594-4602.
YU H N, WANG J, HUANG Z H, et al. Video paragraph captioning using hierarchical recurrent neural networks[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. IEEE, 2016: 4584-4593.
GAO L L, GUO Z, ZHANG H W, et al. Video captioning with attention-based LSTM and semantic consistency[J]. IEEE Transactions on Multimedia, 2017, 19(9): 2045-2055.