ZHOU Zi-wei, WANG Chao-yang, XU Liang. Design and application of image captioning algorithm based on fusion gate neural network [J]. Optics and Precision Engineering, 2021, 29(4): 906-915. DOI: 10.37188/OPE.20212904.0906.
Design and application of image captioning algorithm based on fusion gate neural network
A novel neural network model based on a "fusion gate" is proposed herein to improve the prediction performance of image captioning. The "fusion gate" network, built on the Encoder–Decoder structure, combines a convolutional neural network with a recurrent neural network. During training, the VGGNet-16 network convolves the input image to generate 4096-dimensional output vectors. First, these output vectors are combined with the labeled caption sentence, and the result serves as the input of the "fusion gate" network. The prediction result is then obtained through the "fusion gate" computation, and this process is iterated over successive time steps until network training is complete. Overall, the experimental results demonstrate that the CIDEr score is 10.56% higher than that of "Neural Talk", and the other related evaluation indexes are also significantly improved. The "fusion gate" neural network model not only achieves high prediction indexes but also uses 21.1% fewer parameters than the "Attention model" network. Consequently, the algorithm requires fewer computing resources and has a lower operating cost. This structure also makes it possible to deploy the neural network on edge-computing processors, which plays a key role in promoting a wide range of image captioning applications.
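The abstract does not give the exact equations of the "fusion gate", so the following is only a minimal sketch of the general idea it describes: a learned sigmoid gate that blends the CNN image feature with the current word embedding at each decoding time step. All names (`fusion_gate_step`, `w_img`, `w_word`, `b`) are hypothetical, and the 4096-dimensional VGGNet-16 feature is assumed to have already been projected to the word-embedding dimension.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fusion_gate_step(image_feat, word_emb, w_img, w_word, b):
    """One hypothetical 'fusion gate' step.

    A per-dimension gate g in (0, 1) is computed from both inputs,
    then the fused feature is the convex combination
    g * image_feat + (1 - g) * word_emb, element-wise.
    """
    fused = []
    for i, (v, e) in enumerate(zip(image_feat, word_emb)):
        g = sigmoid(w_img[i] * v + w_word[i] * e + b[i])  # gate value in (0, 1)
        fused.append(g * v + (1.0 - g) * e)
    return fused

# Toy example: a 4-dimensional "image feature" fused with a word embedding.
image_feat = [0.5, -1.0, 2.0, 0.0]
word_emb = [1.0, 1.0, -2.0, 0.0]
w_img, w_word, b = [0.1] * 4, [0.2] * 4, [0.0] * 4
fused = fusion_gate_step(image_feat, word_emb, w_img, w_word, b)
```

Because the gate is a convex weight, each fused component always lies between the corresponding image-feature and word-embedding values; in the full model this step would be repeated at every time step of the recurrent decoder.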
References
KULKARNI G, et al. BabyTalk: understanding and generating simple image descriptions [C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011: 1601-1608.
FARHADI A, et al. Every picture tells a story: generating sentences from images [C]. Computer Vision - ECCV 2010. Springer Berlin Heidelberg, 2010: 15-29.
KARPATHY A, FEI-FEI L. Deep visual-semantic alignments for generating image descriptions [C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 1123-1132.
VINYALS O, TOSHEV A, et al. Show and tell: a neural image caption generator [C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 3156-3164.
CHO K, VAN MERRIENBOER B, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation [J]. arXiv preprint arXiv:1406.1078, 2014.
MNIH V, HEESS N, et al. Recurrent models of visual attention [C]. Advances in Neural Information Processing Systems (NIPS), 2014, 3: 2204-2212.
LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 5532-5540.
CHEN X, FANG H, VEDANTAM R, et al. Microsoft COCO captions: data collection and evaluation server [J]. arXiv preprint arXiv:1504.00325, 2015.
YAO SH Y, WANG ZH W, YANG ZH. Precise positioning of cloud track by bi-direction long short memory [J]. Opt. Precision Eng., 2020, 28(1): 166-173. (in Chinese)
ZHANG J, YANG ZH Y. Literature review of image description generation methods [J]. Intelligent Computer and Application, 2019, 9(5): 45-49. (in Chinese)
BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate [J]. arXiv preprint arXiv:1409.0473, 2014.
CHENG J P, DONG L, LAPATA M. Long short-term memory-networks for machine reading [J]. arXiv preprint arXiv:1601.06733, 2016.
LIN Z, FENG M, SANTOS C N, et al. A structured self-attentive sentence embedding [J]. arXiv preprint arXiv:1703.03130, 2017.
SHEN T, ZHOU T Y, LONG G D, et al. DiSAN: directional self-attention network for RNN/CNN-free language understanding [J]. arXiv preprint arXiv:1709.04696, 2017.
XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [C]. International Conference on Machine Learning (ICML), 2015.
DONG H SH, XU J F, SUN H, et al. Image description generation guided by attention mechanism [J]. Modern Computer, 2019(3): 30-33. (in Chinese)
LI J X, DU J P, ZHOU N. Image caption algorithm based on attention image feature extraction network [J]. Journal of Nanjing University of Information Science & Technology (Natural Science Edition), 2019, 11(3): 295-301. (in Chinese)
YANG W, ZHOU M Q, GENG G H, et al. Hierarchical optimization of skull point cloud registration [J]. Opt. Precision Eng., 2019, 27(12): 817-825. (in Chinese)
XU Y W, LIU M M, LIU Y, et al. Intelligent fault diagnosis of thin wall bearing based on information fusion [J]. Opt. Precision Eng., 2019, 27(7): 1243-1251. (in Chinese)
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]. Advances in Neural Information Processing Systems (NIPS), 2017: 6000-6010.
DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [J]. arXiv preprint arXiv:1810.04805, 2018.
HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780.
LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [J]. arXiv preprint arXiv:1405.0312, 2014.
SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition [C]. International Conference on Learning Representations (ICLR), 2015.
KINGMA D P, BA J. Adam: a method for stochastic optimization [C]. International Conference on Learning Representations (ICLR), 2015.
PAPINENI K, ROUKOS S, et al. BLEU: a method for automatic evaluation of machine translation [C]. Association for Computational Linguistics, 2002: 311-318.
LIN C Y. ROUGE: a package for automatic evaluation of summaries [C]. Text Summarization Branches Out, 2004.
DENKOWSKI M, LAVIE A. Meteor universal: language specific translation evaluation for any target language [C]. Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.
ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation [C]. European Conference on Computer Vision. Springer, Cham, 2016: 382-398.
VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation [C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 4566-4575.