Xiao-yu WU, Chao-nan GU, Sheng-jin WANG. Special video classification based on multitask learning and multimodal feature fusion[J]. Optics and Precision Engineering, 2020, 28(5): 1177-1186. DOI: 10.3788/OPE.20202805.1177.
Special video classification based on multitask learning and multimodal feature fusion
Classification of special videos is significant for intelligent surveillance of internet content. Existing algorithms that fuse multimodal features for the classification of special videos cannot measure the semantic correspondence between the audio and visual modalities. An algorithm for recognizing special videos based on multimodal audio-visual feature fusion was therefore proposed within a multitask learning framework. First, audio semantic features and spatial-temporal visual semantic cues, including appearance and motion, were extracted. A latent subspace that fuses the audio and visual features while preserving their semantic information was then learned by jointly learning audio-visual semantic correspondence and special video classification. Subsequently, a multitask learning loss function was formulated by combining the correspondence loss, obtained from the measured audio-visual semantic information, with the cross-entropy loss of special video classification. Finally, an end-to-end intelligent system for special video recognition was implemented. Experimental results demonstrate that the proposed algorithm achieves an accuracy of 97.97% on the Violent Flows dataset and an average accuracy of 39.76% on the MediaEval VSD 2015 dataset, whereby it outperforms existing methods. These results show that the proposed algorithm is effective for improving the intelligence of network content surveillance.
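To make the multitask objective concrete, the following is a minimal PyTorch sketch of the loss combination described above. It is not the paper's implementation: the module names, feature dimensions, the cosine-similarity/BCE correspondence surrogate, and the weighting factor alpha are all illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioVisualFusion(nn.Module):
        # Projects audio and visual features into a shared latent subspace,
        # then classifies the fused (concatenated) representation.
        def __init__(self, audio_dim=128, visual_dim=2048, latent_dim=512, num_classes=2):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, latent_dim)    # audio branch
            self.visual_proj = nn.Linear(visual_dim, latent_dim)  # visual (appearance + motion) branch
            self.classifier = nn.Linear(2 * latent_dim, num_classes)

        def forward(self, audio_feat, visual_feat):
            a = F.normalize(self.audio_proj(audio_feat), dim=-1)
            v = F.normalize(self.visual_proj(visual_feat), dim=-1)
            logits = self.classifier(torch.cat([a, v], dim=-1))
            return a, v, logits

    def multitask_loss(a, v, logits, labels, match, alpha=0.5):
        # Classification term: cross-entropy over the special-video classes.
        cls_loss = F.cross_entropy(logits, labels)
        # Correspondence term: cosine similarity of the latent embeddings,
        # scored against whether the audio/visual pair comes from the same
        # clip (match = 1) or is a mismatched pair (match = 0). The BCE
        # surrogate and the weight alpha are assumptions, not the paper's
        # exact formulation.
        corr_loss = F.binary_cross_entropy_with_logits((a * v).sum(dim=-1), match.float())
        return cls_loss + alpha * corr_loss

During training, mismatched audio-visual pairs can be produced by shuffling the audio features within a batch, so that the correspondence task shapes the shared latent subspace while the cross-entropy task supervises special video classification.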
MA X CH, WEI SH K, JIANG X, et al. Early warning of illegal video chats based on camera source identification[J]. Opt. Precision Eng., 2018, 26(11): 2785-2794.
DEMARTY C H, PENET C, SOLEYMANI M, et al. VSD, a public dataset for the detection of violent scenes in movies: design, annotation, analysis and evaluation[J]. Multimedia Tools and Applications, 2014, 74(17): 7379-7404.
MOREIRA D, AVILA S, PEREZ M, et al. Multimodal data fusion for sensitive scene localization[J]. Information Fusion, 2019, 45: 307-323.
WANG H M, YANG L, WU X Y, et al. A review of bloody violence in video classification[C]. International Conference on the Frontiers & Advances in Data Science, 2017: 86-91.
YI Y, WANG H, ZHANG B, et al. MIC-TJU at affective impact of movies task[C]. MediaEval Workshop, 2015: 7.
LAM, LE S P, DO T, et al. Computational optimization for violent scenes detection[C]. International Conference on Computer, Control, Informatics and its Applications, 2016: 141-146.
DAI Q, ZHAO R, WU Z, et al. Fudan-Huawei at MediaEval 2015: detecting violent scenes and affective impact in movies with deep learning[C]. MediaEval Workshop, 2015: 5.
SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]. NeurIPS, 2014: 568-576.
SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]. NeurIPS, 2014: 3104-3112.
SUDHAKARAN S, LANZ O. Learning to detect violent videos using convolutional long short-term memory[C]. IEEE International Conference on Advanced Video and Signal Based Surveillance, 2017: 1-6.
BALTRUSAITIS T, AHUJA C, MORENCY L P. Multimodal machine learning: a survey and taxonomy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 423-443. DOI: 10.1109/TPAMI.2018.2798607.
CUI X, PENG Z J, CHEN F. Joint multi-feature fast coding for future video coding[J]. Opt. Precision Eng., 2019, 27(4): 990-999.
WU Z, JIANG Y G, WANG X, et al. Multi-stream multi-class fusion of deep networks for video classification[C]. ACM International Conference on Multimedia, 2016: 791-800.
PAN X ZH, ZHANG SH Q, GUO W P. Video-based facial expression recognition using multimodal deep convolutional neural networks[J]. Opt. Precision Eng., 2019, 27(4): 963-970.
ATREY P K, HOSSAIN M A, SADDIK A E, et al. Multimodal fusion for multimedia analysis: a survey[J]. Multimedia Systems, 2010, 16(6): 345-379.
QIU Z, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks[C]. IEEE International Conference on Computer Vision, 2017: 5534-5542.
CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]. IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6299-6308.
HERSHEY S, CHAUDHURI S, ELLIS D P W, et al. CNN architectures for large-scale audio classification[C]. International Conference on Acoustics, Speech and Signal Processing, 2017: 131-135.
WU Z, JIANG Y G, WANG J, et al. Exploring inter-feature and inter-class relationships with deep neural networks for video classification[C]. ACM International Conference on Multimedia, 2014: 167-176.
HASSNER T, ITCHER Y, KLIPER-GROSS O. Violent flows: real-time detection of violent crowd behavior[C]. IEEE Conference on Computer Vision and Pattern Recognition, 2012: 1-6.
BILINSKI P, BREMOND F. Human violence recognition and detection in surveillance videos[C]. IEEE International Conference on Advanced Video and Signal Based Surveillance, 2016: 30-36.
ZHANG T, JIA W, HE X, et al. Discriminative dictionary learning with motion Weber local descriptor for violence detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 27(3): 696-709. DOI: 10.1109/TCSVT.2016.2589858.
SJOBERG M, BAVEYE Y, WANG H, et al. The MediaEval 2015 affective impact of movies task[C]. MediaEval Workshop, 2015: 1.
ACAR E, HOPFGARTNER F, ALBAYRAK S. Breaking down violence detection: combining divide-et-impera and coarse-to-fine strategies[J]. Neurocomputing, 2016, 208: 225-237. DOI: 10.1016/j.neucom.2016.05.050.