Super-resolution reconstruction of text image with multimodal semantic interaction

HAN Yulan (韩玉兰); LUO Yihong (罗轶宏); CUI Yujie (崔玉杰); LAN Chaofeng (兰朝凤)

DOI: 10.37188/OPE.20253301.0135; CSTR: 32169.14.OPE.20253301.0135

Information Science | Updated: 2025-03-03

- Super-resolution reconstruction of text image with multimodal semantic interaction
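- Optics and Precision Engineering, 2025, Vol. 33, No. 1, pp. 135-147
- Author affiliation:
  
  School of Measurement and Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin, Heilongjiang 150080, China
- Author biographies:
  
  HAN Yulan (b. 1984), female, from Daqing, Heilongjiang; Ph.D., lecturer, master's supervisor; works mainly on image reconstruction, computer vision, and machine learning. E-mail: hanyulan@hrbust.edu.cn
  LUO Yihong (b. 1998), female, from Hailin, Heilongjiang; master's student; works mainly on image super-resolution reconstruction. E-mail: luoy1h@163.com
- Funding:
  
  National Natural Science Foundation of China (11804068); Natural Science Foundation of Heilongjiang Province (LH2020F033); Fundamental Research Funds for Provincial Universities of Heilongjiang Province (2020-KYYWF-0342)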
- DOI: 10.37188/OPE.20253301.0135; CSTR: 32169.14.OPE.20253301.0135
  CLC number: TP391
- Received: 2024-07-31; revised: 2024-09-13; print publication: 2025-01-10
HAN Yulan, LUO Yihong, CUI Yujie, et al. Super-resolution reconstruction of text image with multimodal semantic interaction[J]. Optics and Precision Engineering, 2025, 33(01): 135-147. DOI: 10.37188/OPE.20253301.0135; CSTR: 32169.14.OPE.20253301.0135.

Abstract

Existing methods lack scale transformation in their feature representation of text images, and insufficient resolution makes it difficult for the recognizer to extract correct text content to guide the reconstruction network. To address these problems, this paper proposes a text image super-resolution reconstruction method based on multimodal semantic interaction. An attention mask in the semantic inference module corrects the text content information to obtain semantic prior information, which constrains and guides the network to reconstruct semantically correct super-resolution text images. To enhance the network's representational capacity and accommodate text images of varying shapes and lengths, a multimodal semantic interaction block is designed. Its basic unit consists of three components: a visual dual-flow integration module, a cross-modal adaptive fusion module, and an orthogonal bidirectional gated recurrent unit. The visual dual-flow integration module exploits the complementary strengths of global statistical characteristics and local fitting ability to obtain multi-granularity visual information that includes contextual understanding. The cross-modal adaptive fusion module dynamically carries out interaction and collaboration between semantic information and multi-granularity visual features, narrowing the feature gap between modalities. Finally, the orthogonal bidirectional gated recurrent unit establishes text dependencies of the multimodal features in the vertical and horizontal directions. Experimental results on the TextZoom test set show that the proposed method improves on other mainstream methods in both PSNR and SSIM. Compared with the TPGSR model, it raises the average recognition accuracy of the ASTER, MORAN, and CRNN recognizers by 2.9%, 3.6%, and 3.7%, respectively. These results indicate that multimodal semantic interaction can effectively improve text recognition accuracy in text image super-resolution reconstruction.
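For readers unfamiliar with the PSNR metric cited in the abstract, the following is a minimal sketch of the standard definition, PSNR = 10 log10(peak^2 / MSE). It is not the authors' evaluation code: the function name `psnr`, the nested-list grayscale image format, and the default peak value of 255 are assumptions made for illustration, and the abstract does not state the exact evaluation protocol used on TextZoom.

```python
import math


def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two grayscale images.

    Images are nested lists of pixel values with identical shape
    (an illustrative format, not the one used in the paper).
    """
    ref_flat = [p for row in reference for p in row]
    rec_flat = [p for row in reconstructed for p in row]
    if not ref_flat or len(ref_flat) != len(rec_flat):
        raise ValueError("images must be non-empty and have the same size")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref_flat, rec_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded

    return 10.0 * math.log10(peak ** 2 / mse)


# Toy 2x2 example: a high-resolution reference and a slightly different reconstruction.
hr = [[0, 255], [128, 64]]
sr = [[0, 250], [130, 64]]
print(f"PSNR: {psnr(hr, sr):.2f} dB")
```

A higher PSNR means the reconstruction is closer to the reference pixel by pixel. Pixel fidelity alone does not guarantee legible text, which is presumably why the abstract also reports recognition accuracy.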


