Mask generation dynamically regulates weakly supervised video instance segmentation

HE Zifen (何自芬); XU Lin (徐林); ZHANG Yinhui (张印辉); HUANG Ying (黄滢)

doi:10.37188/OPE.20233119.2884


Information Science | Updated: 2023-10-12

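- Mask generation dynamically regulates weakly supervised video instance segmentation
- Optics and Precision Engineering, 2023, Vol. 31, Issue 19, pp. 2884-2897
- Affiliation:
  
  Faculty of Mechanical and Electrical Engineering, Kunming University of Science and Technology, Kunming 650000, Yunnan, China
- Author biography:
  
  HE Zifen (1976-), female, born in Yangquan, Shanxi; Ph.D., associate professor, and master's supervisor. She received her bachelor's and master's degrees from Xi'an University of Technology in 2000 and 2005, respectively, and her Ph.D. from Kunming University of Science and Technology in 2013. Her research focuses on image processing and machine vision. E-mail: zyhhzf1998@163.com
- Funding:
  
  National Natural Science Foundation of China (62171206; 62061022)
- DOI: 10.37188/OPE.20233119.2884
  Chinese Library Classification (CLC) number: TP394.1
- Received: 2023-02-06; revised: 2023-03-13; print publication: 2023-10-10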
HE Zifen, XU Lin, ZHANG Yinhui, et al. Mask generation dynamically regulates weakly supervised video instance segmentation[J]. Optics and Precision Engineering, 2023, 31(19): 2884-2897. DOI: 10.37188/OPE.20233119.2884.

Abstract

Fully supervised video instance segmentation networks depend heavily on fine-grained mask annotations for training, and the resulting time and labor costs prevent intelligent machines from adapting quickly to new scenes. To address this problem, an end-to-end weakly supervised video instance segmentation (WSVIS) network with dynamically regulated mask generation is proposed. First, to overcome the loss of instance activation features caused by the abrupt drop in channel dimension at the initial mask prediction layer, a multi-level feature fusion module is constructed. It predicts the initial instance features through a step-by-step feature reuse strategy and fuses relative position information to generate the initial predicted mask. Second, a dynamic regulation mechanism establishes mask feature dependencies along the channel and spatial dimensions, strengthening the dynamic interaction between the initial predicted mask and instance-aware information. Finally, the network uses binary color similarity to generate pseudo-affinity labels that replace fine mask annotations, and combines them with a bounding box and mask consistency loss, so that weakly supervised video instance segmentation is achieved with bounding box annotations only. Experimental results on the BoxSet and YT-VIS datasets show that the WSVIS network achieves segmentation accuracy and segmentation quality close to those of fully supervised networks while meeting real-time inference requirements. The work provides theoretical support and an algorithmic basis for intelligent machines to adapt quickly to new scenes and to perceive and understand their environment in real time.
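The abstract names two box-only supervision signals, pseudo-affinity labels derived from color similarity and a consistency term between the bounding box and the predicted mask, without giving their formulas. The following is a minimal sketch of how such signals are commonly built in box-supervised segmentation, not the authors' implementation. Everything specific in it is an assumption made for illustration: the exponential similarity with temperature `theta`, the threshold `tau`, the right/down neighbourhood, the Dice-based projection term, and all function names.

```python
import math


def color_similarity(c1, c2, theta=2.0):
    """Similarity in (0, 1]: exp(-||c1 - c2|| / theta); theta is an assumed temperature."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))
    return math.exp(-dist / theta)


def pseudo_affinity_labels(image, tau=0.3, theta=2.0):
    """Label each (pixel, right/down neighbour) pair 1 if their colors are
    similar enough to assume they share a mask label, else 0.
    `image` is a nested list image[y][x] = (c0, c1, c2); tau is an assumed threshold."""
    h, w = len(image), len(image[0])
    labels = {}
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):  # right and down neighbours
                y2, x2 = y + dy, x + dx
                if y2 < h and x2 < w:
                    sim = color_similarity(image[y][x], image[y2][x2], theta)
                    labels[((y, x), (y2, x2))] = 1 if sim >= tau else 0
    return labels


def affinity_loss(mask_prob, labels, eps=1e-6):
    """Mean negative log-probability that pairs with pseudo-label 1
    receive the same (foreground or background) prediction."""
    total, count = 0.0, 0
    for ((y, x), (y2, x2)), label in labels.items():
        if label == 1:
            p, q = mask_prob[y][x], mask_prob[y2][x2]
            p_same = p * q + (1.0 - p) * (1.0 - q)
            total -= math.log(max(p_same, eps))
            count += 1
    return total / max(count, 1)


def dice_loss(pred, target, eps=1e-6):
    """Dice loss between two 1-D profiles with values in [0, 1]."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def box_consistency_loss(mask_prob, box):
    """Compare max-projections of the mask onto the x and y axes with the
    projections of the box; box = (x0, y0, x1, y1), inclusive pixel coordinates."""
    h, w = len(mask_prob), len(mask_prob[0])
    x0, y0, x1, y1 = box
    box_cols = [1.0 if x0 <= x <= x1 else 0.0 for x in range(w)]
    box_rows = [1.0 if y0 <= y <= y1 else 0.0 for y in range(h)]
    mask_cols = [max(mask_prob[y][x] for y in range(h)) for x in range(w)]
    mask_rows = [max(row) for row in mask_prob]
    return dice_loss(mask_cols, box_cols) + dice_loss(mask_rows, box_rows)


if __name__ == "__main__":
    # Toy 2x4 image: dark left half, distinctly colored right half.
    image = [[(0, 0, 0), (0, 0, 0), (50, 0, 0), (50, 0, 0)] for _ in range(2)]
    labels = pseudo_affinity_labels(image)
    mask = [[0.0, 0.0, 1.0, 1.0] for _ in range(2)]
    print(affinity_loss(mask, labels), box_consistency_loss(mask, (2, 0, 3, 1)))
```

In this sketch, only pairs labeled 1 contribute to the affinity term, because a color difference does not reliably imply a label difference, and the projection term only constrains the mask to touch all four sides of the box without spilling outside it. A real training pipeline would compute both terms on tensors inside the box region and weight them against the detection losses.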
