School of Electronic and Information Engineering, University of Science and Technology Liaoning, Anshan 114000, Liaoning, China
ZHOU Zi-wei (b. 1974), male, from Anshan, Liaoning; associate professor and master's supervisor. He received his bachelor's and master's degrees from University of Science and Technology Liaoning in 1997 and 2007, respectively, and his Ph.D. from Harbin Institute of Technology in 2013. His research interests include artificial intelligence, 3D vision, deep learning, and robotic systems. E-mail: 381431970@qq.com
WANG Chao-yang (b. 1996), female, from Shenyang, Liaoning; master's student. Her research interests include deep learning and computer vision. E-mail: 1669106430@qq.com
Received: 2020-08-28; Revised: 2020-10-12; Published in print: 2021-04-15
周自维,王朝阳,徐亮.基于融合门网络的图像理解算法设计与应用[J].光学精密工程,2021,29(04):906-915. DOI: 10.37188/OPE.20212904.0906.
ZHOU Zi-wei,WANG Chao-yang,XU Liang.Design and application of image captioning algorithm based on fusion gate neural network[J].Optics and Precision Engineering,2021,29(04):906-915. DOI: 10.37188/OPE.20212904.0906.
To improve the prediction performance of image captioning, a deep neural network model based on a "fusion gate" is designed. The fusion-gate network follows the encoder–decoder structure and combines a convolutional neural network with a recurrent neural network. The algorithm first passes the input image through the VGGNet-16 network to obtain a corresponding 4096-dimensional feature vector. This vector is then concatenated with the annotated caption vectors, and the combined vector is fed into the improved fusion-gate network to produce the new network output. This process is iterated step by step over time until network training is complete. The prediction results of the fusion-gate network are evaluated with the authoritative CIDEr metric. The experimental results show that the network's CIDEr score is 10.56% higher than that of the "Neural Talk" network, and the other related evaluation metrics also improve substantially. Besides its higher prediction scores, the network uses 21.1% fewer parameters than the attention-mechanism network and requires fewer computing resources. This makes it feasible to deploy the network in edge computing, which plays a key role in bringing image-captioning results into wider practical use.
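The decoding step described in the abstract can be illustrated with a minimal sketch: at each time step, the CNN image feature is concatenated with the current word embedding and the previous hidden state, and a single gate blends a candidate state with the previous one. All dimensions, weight names, and the exact gating equations below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class FusionGateCell:
    """Hypothetical one-gate recurrent cell fusing image and word inputs."""

    def __init__(self, img_dim=4096, embed_dim=256, hidden_dim=512, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = img_dim + embed_dim + hidden_dim
        scale = 1.0 / np.sqrt(in_dim)
        # one gate and one candidate transform over the fused input
        self.W_g = rng.normal(0.0, scale, (hidden_dim, in_dim))
        self.W_c = rng.normal(0.0, scale, (hidden_dim, in_dim))
        self.b_g = np.zeros(hidden_dim)
        self.b_c = np.zeros(hidden_dim)

    def step(self, img_feat, word_embed, h_prev):
        # fuse CNN image feature, current word embedding, previous hidden state
        z = np.concatenate([img_feat, word_embed, h_prev])
        g = sigmoid(self.W_g @ z + self.b_g)   # fusion gate
        c = np.tanh(self.W_c @ z + self.b_c)   # candidate state
        return g * c + (1.0 - g) * h_prev      # gated update

# Unroll over a few time steps of a caption.
cell = FusionGateCell()
img = np.random.default_rng(1).normal(size=4096)  # stands in for the VGGNet-16 fc output
h = np.zeros(512)
for t in range(3):
    word = np.zeros(256)  # stands in for an embedded caption token
    h = cell.step(img, word, h)
print(h.shape)  # (512,)
```

In a real model the hidden state would additionally be projected to vocabulary logits at each step; the sketch only shows how the 4096-dimensional image vector and the caption vector are merged into one recurrent input, which is the fusion the abstract describes.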