Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

This paper presents a transformer backbone that remains runnable at high input resolution and whose output can feed dense prediction heads.

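In stage $i$, the input feature map of size $H_{i-1} \times W_{i-1}$ is divided into $\frac{H_{i-1}W_{i-1}}{P_i^2}$ patches, where $P_i$ is the patch size of that stage.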
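The patch-count formula can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function name `pyramid_shapes` is hypothetical, and the default patch sizes `(4, 2, 2, 2)` are an assumption about a typical four-stage configuration rather than something stated in these notes.

```python
def pyramid_shapes(height, width, patch_sizes=(4, 2, 2, 2)):
    """Return (H_i, W_i, num_patches) for each stage.

    patch_sizes is an assumed configuration (P_1=4, then P_i=2).
    """
    shapes = []
    h, w = height, width
    for p in patch_sizes:
        if h % p or w % p:
            raise ValueError(f"feature map {h}x{w} is not divisible by patch size {p}")
        # Number of patches in this stage: H_{i-1} * W_{i-1} / P_i^2
        n_patches = (h * w) // (p * p)
        # Each patch becomes one position of the next feature map.
        h, w = h // p, w // p
        shapes.append((h, w, n_patches))
    return shapes


for stage, (h, w, n) in enumerate(pyramid_shapes(224, 224), start=1):
    print(f"stage {stage}: {h}x{w} feature map, {n} patches")
```

Each stage shrinks the feature map by a factor of $P_i$ per side, which is what gives the backbone its pyramid of resolutions.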

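For high-resolution inputs, the approach shown in the figure can be used: the key and value are spatially reduced before attention, so that their sequence length stays manageable at high resolution.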
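To see why reducing the key and value helps, it is enough to count the entries of the attention score matrix. This is a rough sketch under stated assumptions: `attention_score_entries` is a hypothetical helper, and the reduction ratio of 8 per side in the example is an assumed value, not one given in these notes.

```python
def attention_score_entries(h, w, reduction=1):
    """Count entries of the query-key score matrix for an h x w feature map.

    reduction is the assumed per-side spatial reduction applied to key and
    value only; queries keep the full resolution.
    """
    n_query = h * w
    n_key_value = (h // reduction) * (w // reduction)
    return n_query * n_key_value


full = attention_score_entries(56, 56)                  # no reduction
reduced = attention_score_entries(56, 56, reduction=8)  # assumed ratio of 8
print(full, reduced, full // reduced)
```

With a per-side reduction of $R$, the score matrix shrinks by a factor of $R^2$, while the output still has one token per query position.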

PVTv2: Improved Baselines with Pyramid Vision Transformer

This paper proposes three improvements to PVT:

Overlapping patch embedding
Zero-padding and depth-wise convolution aided feed forward network
Linear SRA
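Two of these items reduce to simple size arithmetic, sketched below. The details are assumptions not stated in these notes: the overlapping patch embedding is modeled as a strided convolution with kernel $2S-1$, stride $S$ and zero padding $S-1$, and linear SRA is modeled as pooling the key and value to a fixed $7 \times 7$ grid. All function names are hypothetical.

```python
def conv_out_size(size, kernel, stride, padding):
    """Standard output-size formula for a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def overlapping_patch_grid(size, stride):
    """Grid size after an overlapping patch embedding.

    Assumes kernel = 2*stride - 1 and zero padding = stride - 1, so that
    neighbouring patches overlap while the output size matches size / stride.
    """
    return conv_out_size(size, kernel=2 * stride - 1, stride=stride, padding=stride - 1)


def linear_sra_score_entries(h, w, pool_size=7):
    """Score-matrix entries when key/value are pooled to pool_size x pool_size.

    pool_size=7 is an assumed value; the count grows linearly in h * w.
    """
    return (h * w) * (pool_size * pool_size)


print(overlapping_patch_grid(224, stride=4))  # same grid as a non-overlapping 4x4 split
print(linear_sra_score_entries(56, 56))
```

Because the pooled key and value have a fixed length, doubling the number of query tokens only doubles the attention cost instead of quadrupling it.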
