University of Chinese Academy of Sciences; Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences; CAS Key Laboratory of Space Utilization, Beijing 100094, China
LI Pei-zhuo (1996-), female, from Nanyang, Henan, received her B.S. from Tongji University in 2018 and is now a master's student at the Technology and Engineering Center for Space Utilization, University of Chinese Academy of Sciences. Her research focuses on machine vision for space applications. E-mail: brightlishi@gmail.com; lipeizhuo18@mails.ucas.ac.cn
WAN Xue (1988-), female, from Wuhan, Hubei, received her Ph.D. from Imperial College London in 2015 and is now a professor and master's supervisor at the Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences. Her research focuses on computer vision algorithms. E-mail: wanxue@csu.ac.cn
Received: 2021-04-29; Revised: 2021-05-28; Published in print: 2021-12-15
李沛卓,万雪,李盛阳.基于多模态学习的空间科学实验图像描述[J].光学精密工程,2021,29(12):2944-2955. DOI: 10.37188/OPE.2021.0244.
LI Pei-zhuo, WAN Xue, LI Sheng-yang. Image caption of space science experiment based on multi-modal learning [J]. Optics and Precision Engineering, 2021, 29(12): 2944-2955. DOI: 10.37188/OPE.2021.0244.
为了让科学家快速定位实验关键过程,获取更为详细的实验过程信息,需要对空间科学实验自动添加描述性文字内容。针对空间科学实验目标较小且数据样本较少的问题,本文提出了基于多模态学习的空间科学实验图像描述算法模型,主要分为四部分:基于改进U-Net的语义分割模型,基于语义分割的空间科学实验词汇候选,自下而上的通用场景图像特征向量提取和基于多模态学习的描述语句生成。此外,本文构建了空间科学实验目标数据集,包括语义掩码标注和图像描述标注,来对空间科学实验进行图像描述。实验结果表明:相对于经典的图像描述模型Neuraltalk2,本文提出的算法在精度评定方面,METEOR结果平均提升了0.089,SPICE结果平均提升了0.174;解决了空间科学实验目标较小、样本较少的难点,构建基于多模态学习的空间科学实验图像描述模型,满足对空间科学实验场景进行专业性、精准性的描述要求,实现从低层次感知到深层场景理解的能力。
To enable scientists to quickly locate the key stages of an experiment and obtain detailed information about the experimental process, descriptive text must be generated automatically for space science experiments. To address the small targets and limited data samples of space science experiments, this paper proposes an image-captioning model for space science experiments based on multi-modal learning. The model consists of four parts: a semantic segmentation model based on an improved U-Net, candidate space-science-experiment vocabulary derived from the segmentation results, bottom-up extraction of general scene image feature vectors, and caption generation based on multi-modal learning. In addition, a dataset of space science experiment targets was constructed, including semantic mask annotations and image caption annotations. Experimental results demonstrate that, compared with the classic image caption model Neuraltalk2, the proposed algorithm improves METEOR by 0.089 and SPICE by 0.174 on average. The proposed method overcomes the difficulties of small targets and limited samples in space science experiments, meets the requirements for professional and accurate description of space science experiment scenes, and advances from low-level perception to deep scene understanding.
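The four-stage pipeline summarized in the abstract can be sketched in code. The sketch below is purely illustrative and is not the authors' implementation: every name (`CLASS_VOCAB`, `segment_image`, `candidate_vocabulary`, `generate_caption`) is hypothetical, the improved U-Net segmentation is stubbed with a fixed label map, and the multi-modal caption generator is collapsed into a template.

```python
# Illustrative sketch of the four-stage captioning pipeline; all names are
# hypothetical and the learned components are stubbed out.

# Hypothetical mapping from segmentation class IDs to experiment vocabulary.
CLASS_VOCAB = {
    1: "combustion chamber",
    2: "flame",
    3: "sample holder",
}

def segment_image(image):
    """Stub for the improved U-Net: returns a per-pixel class-ID map.
    Here a fixed 4x4 label map stands in for a real prediction."""
    return [[0, 1, 1, 0],
            [0, 1, 2, 0],
            [0, 3, 2, 0],
            [0, 0, 0, 0]]

def candidate_vocabulary(label_map, min_pixels=2):
    """Stage 2: collect experiment-specific words for classes whose
    segmented area exceeds a small pixel threshold."""
    counts = {}
    for row in label_map:
        for cls in row:
            if cls != 0:
                counts[cls] = counts.get(cls, 0) + 1
    return sorted(CLASS_VOCAB[c] for c, n in counts.items() if n >= min_pixels)

def generate_caption(image):
    """Stages 3-4 collapsed into a template: a real system would fuse
    bottom-up visual features with the candidate words in a language model."""
    words = candidate_vocabulary(segment_image(image))
    return "the " + " and the ".join(words) + " are visible in the experiment"

print(generate_caption(None))
```

The point of the sketch is the data flow: segmentation constrains the caption vocabulary to experiment-specific terms, which is how the small-target, small-sample setting is handled.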
刘媛媛, 张硕, 于海业, 等. 基于语义分割的复杂场景下的秸秆检测 [J]. 光学 精密工程, 2020, 28(1): 200-211. doi: 10.3788/ope.20202801.0200
LIU Y Y, ZHANG SH, YU H Y, et al. Straw detection algorithm based on semantic segmentation in complex farm scenarios [J]. Opt. Precision Eng., 2020, 28(1): 200-211. (in Chinese). doi: 10.3788/ope.20202801.0200
陈彦彤 , 李雨阳 , 吕石立 , 等 . 基于深度语义分割的多源遥感图像海面溢油监测 [J]. 光学 精密工程 , 2020 , 28 ( 5 ): 1165 - 1176 .
CHEN Y T , LI Y Y , LÜ SH L , et al . Research on oil spill monitoring of multi-source remote sensing image based on deep semantic segmentation [J]. Opt. Precision Eng. , 2020 , 28 ( 5 ): 1165 - 1176 . (in Chinese)
王中宇, 倪显扬, 尚振东. 利用卷积神经网络的自动驾驶场景语义分割 [J]. 光学 精密工程, 2019, 27(11): 2429-2438. doi: 10.3788/ope.20192711.2429
WANG ZH Y, NI X Y, SHANG ZH D. Autonomous driving semantic segmentation with convolution neural networks [J]. Opt. Precision Eng., 2019, 27(11): 2429-2438. (in Chinese). doi: 10.3788/ope.20192711.2429
HE K M, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN [C]. 2017 IEEE International Conference on Computer Vision (ICCV), October 22-29, 2017, Venice, Italy. IEEE, 2017: 2980-2988. doi: 10.1109/iccv.2017.322
RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation [C]. Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, 2015: 234-241. doi: 10.1007/978-3-319-24574-4_28
OTSU N. A threshold selection method from gray-level histograms [J]. IEEE Transactions on Systems, Man, and Cybernetics, 1979, 9(1): 62-66. doi: 10.1109/tsmc.1979.4310076
李彦, 赵其峰, 闫河, 等. Canny算子在PCBA目标边缘提取中的优化应用 [J]. 光学 精密工程, 2020, 28(9): 2096-2102. doi: 10.37188/OPE.20202809.2096
LI Y, ZHAO Q F, YAN H, et al. Optimized application of Canny operator in PCBA target edge extraction [J]. Opt. Precision Eng., 2020, 28(9): 2096-2102. (in Chinese). doi: 10.37188/OPE.20202809.2096
HOU Q B, CHENG M M, HU X W, et al. Deeply supervised salient object detection with short connections [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(4): 815-828.
AGRAWAL H, DESAI K R, WANG Y F, et al. Nocaps: novel object captioning at scale [C]. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), October 27 - November 2, 2019, Seoul, Korea (South). IEEE, 2019: 8947-8956. doi: 10.1109/iccv.2019.00904
KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 664-676.
VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 7-12, 2015, Boston, MA, USA. IEEE, 2015: 3156-3164. doi: 10.1109/cvpr.2015.7298935
JOHNSON J, KARPATHY A, LI F F. DenseCap: fully convolutional localization networks for dense captioning [C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 27-30, 2016, Las Vegas, NV, USA. IEEE, 2016: 4565-4574. doi: 10.1109/cvpr.2016.494
HENDRICKS L A, VENUGOPALAN S, ROHRBACH M, et al. Deep compositional captioning: describing novel object categories without paired training data [C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 27-30, 2016, Las Vegas, NV, USA. IEEE, 2016: 1-10. doi: 10.1109/cvpr.2016.8
VENUGOPALAN S, HENDRICKS L A, ROHRBACH M, et al. Captioning images with diverse objects [C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 21-26, 2017, Honolulu, HI, USA. IEEE, 2017: 1170-1178. doi: 10.1109/cvpr.2017.130
ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 18-23, 2018, Salt Lake City, UT, USA. IEEE, 2018: 6077-6086. doi: 10.1109/cvpr.2018.00636
KULKARNI G, PREMRAJ V, ORDONEZ V, et al. BabyTalk: understanding and generating simple image descriptions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2891-2903. doi: 10.1109/tpami.2012.162
ANDERSON P, FERNANDO B, JOHNSON M, et al. Guided open vocabulary image captioning with constrained beam search [C]. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark. Stroudsburg, PA, USA: Association for Computational Linguistics, 2017: 936-945. doi: 10.18653/v1/d17-1098
REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/tpami.2016.2577031
PERAZZI F, PONT-TUSET J, MCWILLIAMS B, et al. A benchmark dataset and evaluation methodology for video object segmentation [C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 27-30, 2016, Las Vegas, NV, USA. IEEE, 2016: 724-732. doi: 10.1109/cvpr.2016.85
PAPINENI K. BLEU: a method for automatic evaluation of MT [R]. IBM Research Report RC22176 (W0109-022), 2001.
BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments [EB/OL]. 2005.
ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation [C]. Computer Vision - ECCV 2016, 2016: 382-398. doi: 10.1007/978-3-319-46454-1_24
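For readers unfamiliar with the metrics reported in the abstract, the core of METEOR is a weighted harmonic mean of unigram precision and recall between a candidate caption and a reference. The snippet below is a simplified, illustrative reduction of that core only (no stemming, synonym matching, or fragmentation penalty from the full metric); `unigram_f_score` is a hypothetical name, not part of any library.

```python
from collections import Counter

def unigram_f_score(candidate, reference, alpha=0.9):
    """Simplified, illustrative core of METEOR: a recall-weighted harmonic
    mean of unigram precision and recall. The full metric additionally uses
    stemming, synonym matching, and a fragmentation penalty."""
    cand, ref = candidate.split(), reference.split()
    # Count unigrams present in both sentences (clipped overlap).
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return precision * recall / (alpha * precision + (1 - alpha) * recall)

# Three of four unigrams match, so precision = recall = 0.75.
print(unigram_f_score("the flame is stable", "the flame remains stable"))
```

With `alpha` close to 1, the score weights recall heavily, which matches METEOR's preference for captions that cover the reference content.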