ZHOU Zi-wei, WANG Chao-yang, XU Liang. Design and application of image captioning algorithm based on fusion gate neural network [J]. Optics and Precision Engineering, 2021, 29(4): 906-915. DOI: 10.37188/OPE.20212904.0906.
Design and application of image captioning algorithm based on fusion gate neural network
A novel neural network model based on a "fusion gate" is proposed herein to improve the prediction performance of image captioning. The "fusion gate" network, built on the Encoder–Decoder structure, combines a convolutional neural network with a recurrent neural network. During training, the VGGNet-16 network convolves the input image to generate 4096-dimensional output vectors. First, these output vectors are combined with the labeled caption sentence, and the result serves as the input of the "fusion gate" network. The prediction result is then obtained through the "fusion gate" computation, and this process is iterated over successive time steps until network training is complete. Overall, the experimental results demonstrate that the CIDEr score is 10.56% higher than that of "Neural Talk", and the other related evaluation indexes are also significantly improved. The "fusion gate" neural network model not only achieves high prediction indexes but also uses 21.1% fewer parameters than the "Attention model" network. Consequently, the algorithm requires fewer computing resources and has a lower operating cost. This structure also makes it possible to deploy the neural network on edge-computing processors, which plays a key role in promoting a wide range of image captioning applications.
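The abstract does not give the exact equations of the "fusion gate", so the following is only a minimal sketch of the general idea it describes: a learned sigmoid gate that blends the CNN image feature with the current word embedding at each decoding time step. All names (`fusion_gate_step`, `w_img`, `w_word`, `b`) are hypothetical, and the 4096-dimensional VGGNet-16 feature is assumed to have already been projected to the word-embedding dimension.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fusion_gate_step(image_feat, word_emb, w_img, w_word, b):
    """One hypothetical 'fusion gate' step.

    A per-dimension gate g in (0, 1) is computed from both inputs,
    then the fused feature is the convex combination
    g * image_feat + (1 - g) * word_emb, element-wise.
    """
    fused = []
    for i, (v, e) in enumerate(zip(image_feat, word_emb)):
        g = sigmoid(w_img[i] * v + w_word[i] * e + b[i])  # gate value in (0, 1)
        fused.append(g * v + (1.0 - g) * e)
    return fused

# Toy example: a 4-dimensional "image feature" fused with a word embedding.
image_feat = [0.5, -1.0, 2.0, 0.0]
word_emb = [1.0, 1.0, -2.0, 0.0]
w_img, w_word, b = [0.1] * 4, [0.2] * 4, [0.0] * 4
fused = fusion_gate_step(image_feat, word_emb, w_img, w_word, b)
```

Because the gate is a convex weight, each fused component always lies between the corresponding image-feature and word-embedding values; in the full model this step would be repeated at every time step of the recurrent decoder.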
References
KULKARNI G, et al. BabyTalk: understanding and generating simple image descriptions [C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011: 1601-1608.
FARHADI A, et al. Every picture tells a story: generating sentences from images [C]. Computer Vision - ECCV 2010. Springer Berlin Heidelberg, 2010: 15-29.
KARPATHY A, FEI-FEI L. Deep visual-semantic alignments for generating image descriptions [C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 1123-1132.
VINYALS O, TOSHEV A, et al. Show and tell: a neural image caption generator [C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 3156-3164.
CHO K, VAN MERRIENBOER B, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation [J]. arXiv preprint arXiv:1406.1078, 2014.
MNIH V, HEESS N, et al. Recurrent models of visual attention [C]. Advances in Neural Information Processing Systems (NIPS), 2014, 3: 2204-2212.
LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 5532-5540.
CHEN X, FANG H, VEDANTAM R, et al. Microsoft COCO captions: data collection and evaluation server [J]. arXiv preprint arXiv:1504.00325, 2015.
YAO SH Y, WANG ZH W, YANG ZH. Precise positioning of cloud track by bi-direction long short memory [J]. Opt. Precision Eng., 2020, 28(1): 166-173. (in Chinese)
ZHANG J, YANG ZH Y. Literature review of image description generation methods [J]. Intelligent Computer and Application, 2019, 9(5): 45-49. (in Chinese)
BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate [J]. arXiv preprint arXiv:1409.0473, 2014.
CHENG J P, DONG L, LAPATA M. Long short-term memory-networks for machine reading [J]. arXiv preprint arXiv:1601.06733, 2016.
LIN Z, FENG M, SANTOS C N, et al. A structured self-attentive sentence embedding [J]. arXiv preprint arXiv:1703.03130, 2017.
SHEN T, ZHOU T Y, LONG G D, et al. DiSAN: directional self-attention network for RNN/CNN-free language understanding [J]. arXiv preprint arXiv:1709.04696, 2017.
XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [C]. International Conference on Machine Learning (ICML), 2015.
DONG H SH, XU J F, SUN H, et al. Image description generation guided by attention mechanism [J]. Modern Computer, 2019(3): 30-33. (in Chinese)
LI J X, DU J P, ZHOU N. Image caption algorithm based on attention image feature extraction network [J]. Journal of Nanjing University of Information Science & Technology (Natural Science Edition), 2019, 11(3): 295-301. (in Chinese)
YANG W, ZHOU M Q, GENG G H, et al. Hierarchical optimization of skull point cloud registration [J]. Opt. Precision Eng., 2019, 27(12): 817-825. (in Chinese)
XU Y W, LIU M M, LIU Y, et al. Intelligent fault diagnosis of thin wall bearing based on information fusion [J]. Opt. Precision Eng., 2019, 27(7): 1243-1251. (in Chinese)
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]. Advances in Neural Information Processing Systems (NIPS), 2017: 6000-6010.
DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [J]. arXiv preprint arXiv:1810.04805, 2018.
HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780.
LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [J]. arXiv preprint arXiv:1405.0312, 2014.
SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition [C]. International Conference on Learning Representations (ICLR), 2015.
KINGMA D P, BA J. Adam: a method for stochastic optimization [C]. International Conference on Learning Representations (ICLR), 2015.
PAPINENI K, ROUKOS S, et al. BLEU: a method for automatic evaluation of machine translation [C]. Association for Computational Linguistics, 2002: 311-318.
LIN C Y. ROUGE: a package for automatic evaluation of summaries [C]. Text Summarization Branches Out, 2004.
DENKOWSKI M, LAVIE A. Meteor universal: language specific translation evaluation for any target language [C]. Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.
ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation [C]. European Conference on Computer Vision. Springer, Cham, 2016: 382-398.
VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation [C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 4566-4575.