Funding: supported by the National Natural Science Foundation of China (Grant Nos. 62233005 and 62293502), the Programme of Introducing Talents of Discipline to Universities (the 111 Project, Grant No. B17017), the Fundamental Research Funds for the Central Universities (Grant No. 222202317006), and Shanghai AI Lab.
Abstract: Deep learning has revolutionized the field of artificial intelligence. Based on the statistical correlations uncovered by deep learning-based methods, computer vision applications such as autonomous driving and robotics are growing rapidly. Although such correlations form the basis of deep learning, they strongly depend on the distribution of the original data and are susceptible to uncontrolled factors. Without the guidance of prior knowledge, statistical correlations alone cannot correctly reflect the essential causal relations and may even introduce spurious correlations. As a result, researchers are now trying to enhance deep learning-based methods with causal theory. Causal theory can model the intrinsic causal structure unaffected by data bias and effectively avoid spurious correlations. This paper comprehensively reviews the existing causal methods in typical vision and vision-language tasks such as semantic segmentation, object detection, and image captioning. The advantages of causality and the approaches for building causal paradigms are summarized. Future roadmaps are also proposed, including facilitating the development of causal theory and its application in other complex scenarios and systems.
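To make the distinction between correlation and causation concrete, the sketch below contrasts the observational conditional P(Y|X) with the interventional quantity obtained by backdoor adjustment, P(Y|do(X)) = Σ_z P(Y|X,z)P(z); the toy joint distribution and variable names are hypothetical and are not taken from the survey.

```python
import numpy as np

# Toy joint distribution P(X, Y, Z) over binary variables, where the
# confounder Z influences both the input X and the label Y.
# p[x, y, z] = P(X=x, Y=y, Z=z); the values are hypothetical.
p = np.array([[[0.28, 0.03], [0.12, 0.07]],
              [[0.07, 0.12], [0.03, 0.28]]])

def p_y_given_x(y, x):
    """Observational P(Y=y | X=x): absorbs the confounding path via Z."""
    return p[x, y, :].sum() / p[x, :, :].sum()

def p_y_do_x(y, x):
    """Interventional P(Y=y | do(X=x)) via backdoor adjustment:
    sum over z of P(Y=y | X=x, Z=z) * P(Z=z)."""
    total = 0.0
    for z in (0, 1):
        p_z = p[:, :, z].sum()                        # P(Z=z)
        p_y_given_xz = p[x, y, z] / p[x, :, z].sum()  # P(Y=y | X=x, Z=z)
        total += p_y_given_xz * p_z
    return total

print(p_y_given_x(1, 1))  # ~0.62: spurious association through Z
print(p_y_do_x(1, 1))     # 0.50: the intervention removes it
```

In this toy distribution Y is driven only by the confounder Z, so conditioning reports an apparent effect of X (0.62) while the backdoor-adjusted intervention recovers the true null effect (0.50), which is exactly the kind of spurious correlation the surveyed causal methods aim to eliminate.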
Funding: supported by the German Research Foundation (No. GRK 1247/1).
Abstract: Cross-modal interactions between visual understanding and linguistic processing substantially contribute to the remarkable robustness of human language processing. We argue that the formation of cross-modal referential links is a prerequisite for the occurrence of cross-modal interactions between vision and language. In this paper, we examine a computational model of cross-modal reference formation with respect to its robustness against conceptual underspecification in the visual modality. This investigation is motivated by the fact that natural systems are well capable of establishing cross-modal references between modalities with different degrees of conceptual specification. In the investigated model, conceptually underspecified context information continues to drive the syntactic disambiguation of verb-centered syntactic ambiguities as long as the visual context contains the situation arity information of the visual scene.
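As a hypothetical illustration of the arity argument (not the paper's actual model), the toy function below resolves a verb-centered ambiguity purely from the number of scene participants, even when their conceptual labels are underspecified:

```python
def disambiguate(parses, scene_participants):
    """Pick the parse whose argument count matches the scene arity.

    parses: list of (parse_name, arity) candidates for an ambiguous verb.
    scene_participants: detected entities; their labels may be fully
    underspecified -- only their count (the situation arity) is used.
    """
    scene_arity = len(scene_participants)
    matching = [name for name, arity in parses if arity == scene_arity]
    return matching[0] if matching else None

# Ambiguous verb: intransitive reading (1 argument) vs. transitive (2).
candidates = [("intransitive", 1), ("transitive", 2)]

# Conceptually underspecified scene: two entities of unknown category.
print(disambiguate(candidates, ["entity", "entity"]))  # -> "transitive"
```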
Funding: supported by the National Natural Science Foundation of China (61702528, 61806212).
Abstract: In the field of satellite imagery, remote sensing image captioning (RSIC) is an active topic that faces the challenges of overfitting and of aligning images with text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC that jointly represents vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English captions. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and rich bilingual descriptions for remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a Chinese pre-trained language model. Experiments are carried out against various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces more descriptive and informative captions than existing algorithms.
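For readers unfamiliar with the mechanism, the following is a minimal PyTorch sketch of cross-modal attention in which caption tokens attend over image region features; the module name, dimensions, and residual wiring are assumptions for illustration and do not reproduce VLCA's actual design.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal attention: caption tokens (queries) attend
    to image region features (keys/values). Hyperparameters are assumed."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens, dim)  -- language stream
        # image_feats: (batch, num_regions, dim) -- vision stream
        attended, _ = self.attn(query=text_feats,
                                key=image_feats,
                                value=image_feats)
        # Residual connection preserves the original token representations.
        return self.norm(text_feats + attended)

# Toy usage: 4 caption tokens attending over 36 image regions.
layer = CrossModalAttention()
text = torch.randn(2, 4, 512)
image = torch.randn(2, 36, 512)
print(layer(text, image).shape)  # torch.Size([2, 4, 512])
```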
Abstract: We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model with the vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
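To illustrate the MIR idea, the sketch below masks a random subset of flattened image patches and regresses their pixel values, computing the loss only at masked positions; the encoder depth, masking ratio, and prediction head are placeholder assumptions rather than MVLT's actual configuration.

```python
import torch
import torch.nn as nn

class MaskedImageReconstruction(nn.Module):
    """Illustrative MIR-style objective: hide a fraction of patch tokens
    and regress their pixel values. Sizes and the tiny encoder are
    placeholders, not MVLT's actual architecture."""

    def __init__(self, patch_dim=768, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(patch_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=patch_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(patch_dim, patch_dim)  # predict patch pixels

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) flattened patch pixels
        batch, num_patches, _ = patches.shape
        mask = torch.rand(batch, num_patches,
                          device=patches.device) < self.mask_ratio
        corrupted = patches.clone()
        corrupted[mask] = self.mask_token  # replace masked patches
        pred = self.head(self.encoder(corrupted))
        # Reconstruction loss is computed only on the masked positions.
        return nn.functional.mse_loss(pred[mask], patches[mask])

# Toy usage: 2 images of 16 flattened patches each.
loss = MaskedImageReconstruction()(torch.randn(2, 16, 768))
loss.backward()
```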