Funding: National Natural Science Foundation of China under Grant Nos. 61672273 and 61832008; Science Foundation for Distinguished Young Scholars of Jiangsu under Grant No. BK20160021; Postdoctoral Innovative Talent Support Program of China under Grant Nos. BX20200168 and 2020M681608; General Research Fund of Hong Kong under Grant No. 27208720.
Abstract: Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) by adding three designs: (i) a linear complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
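To make the designs concrete, below is a minimal PyTorch sketch of two of the three components as the abstract describes them: an attention layer whose keys and values are average-pooled to a fixed size, so its cost grows linearly with the token count, and a feed-forward network with a depth-wise convolution between its linear layers. Class names (LinearSRA, ConvFFN), the pooling size, and the layer shapes are illustrative assumptions, not the authors' exact implementation; consult the linked repository for the official code.

```python
# Minimal sketch (assumed PyTorch implementation) of two designs named in
# the abstract. Names such as LinearSRA and ConvFFN are hypothetical; see
# https://github.com/whai362/PVT for the real code.
import torch
import torch.nn as nn


class LinearSRA(nn.Module):
    """Spatial-reduction attention with average pooling: keys and values
    are pooled to a fixed pool_size x pool_size grid, so attention cost
    grows linearly with the number of input tokens."""

    def __init__(self, dim: int, num_heads: int, pool_size: int = 7):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.pool = nn.AdaptiveAvgPool2d(pool_size)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, h * w, dim) token sequence from one pyramid stage.
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.num_heads, -1).transpose(1, 2)
        # Pool the token map so the k/v length is fixed regardless of n.
        pooled = self.pool(x.transpose(1, 2).reshape(b, c, h, w))
        pooled = pooled.flatten(2).transpose(1, 2)      # (b, p*p, c)
        kv = self.kv(pooled).reshape(b, -1, 2, self.num_heads,
                                     c // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)                # (b, heads, p*p, c/heads)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


class ConvFFN(nn.Module):
    """Feed-forward network with a depth-wise 3x3 convolution between the
    two linear layers, adding local spatial mixing to the MLP."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1,
                                groups=hidden_dim)      # depth-wise
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        x = self.fc1(x)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)       # tokens -> map
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)                # map -> tokens
        return self.fc2(self.act(x))


tokens = torch.randn(2, 14 * 14, 64)                    # toy stage input
print(LinearSRA(64, 8)(tokens, 14, 14).shape)           # (2, 196, 64)
print(ConvFFN(64, 256)(tokens, 14, 14).shape)           # (2, 196, 64)
```

Average pooling fixes the key/value length, so attention memory no longer scales quadratically with image resolution; this is what the abstract means by reducing PVT v1's complexity to linear.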