In earlier work, removing locality from traditional ConvNets led to performance degradation. Most modern vision architectures, including those that aim to simplify inductive biases, still build locality into their design; even pre-deep-learning visual features such as SIFT and HOG relied on local descriptors. Attempts to strip locality from ConvNets, for instance by replacing spatial convolutional filters with 1×1 filters, resulted in a drop in accuracy. However, recent research from FAIR (Meta AI) and the University of Amsterdam challenges this conventional belief with the Pixel Transformer (PiT), which removes the locality inductive bias by treating each individual pixel as a token and still outperforms conventional approaches such as ViT.
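To make the distinction concrete, here is a minimal, hypothetical sketch in PyTorch (not the authors' code; module names and dimensions are illustrative assumptions) contrasting a ViT-style patch embedding, which bakes locality into 16×16 patches, with a PiT-style tokenizer that projects each pixel independently:

```python
# Sketch only: ViT-style patch tokens vs. PiT-style per-pixel tokens.
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """ViT-style: each 16x16 patch becomes one token (locality is baked in)."""
    def __init__(self, in_ch=3, dim=192, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, 3, H, W)
        tokens = self.proj(x)                      # (B, dim, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)   # (B, num_patches, dim)

class PixelTokenizer(nn.Module):
    """PiT-style: every pixel becomes one token (no locality assumption)."""
    def __init__(self, in_ch=3, dim=192):
        super().__init__()
        self.proj = nn.Linear(in_ch, dim)          # pointwise projection, no spatial extent

    def forward(self, x):                          # x: (B, 3, H, W)
        pixels = x.flatten(2).transpose(1, 2)      # (B, H*W, 3)
        return self.proj(pixels)                   # (B, H*W, dim)

if __name__ == "__main__":
    img = torch.randn(2, 3, 32, 32)                # small image keeps the sequence short
    print(PatchTokenizer()(img).shape)             # torch.Size([2, 4, 192])
    print(PixelTokenizer()(img).shape)             # torch.Size([2, 1024, 192])
```

The point the sketch highlights is that the per-pixel projection has no spatial extent, so any notion of neighborhood must be learned by the transformer's attention rather than hard-wired into the tokenizer.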
As the article describes, the primary goal of the deep learning revolution in computer vision is to shift from manually crafted features to data-driven approaches, producing more versatile systems that perform well across a wide range of vision tasks while carrying fewer built-in feature biases. By eliminating inductive biases such as locality, researchers hope to further improve model performance and versatility.
Self-supervised pre-training has a significant impact on PiT's performance. The research finds that self-supervised pre-training with MAE (Masked Autoencoder) improves PiT's accuracy compared to training from scratch. With pre-training, the gap between PiT and ViT (Vision Transformer) widens in PiT's favor when moving from Tiny to Small models, suggesting that PiT can potentially scale better than ViT.
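For intuition, here is a minimal, assumption-laden sketch in PyTorch (not the paper's implementation; layer sizes, masking ratio, and module names are all chosen for illustration) of MAE-style pre-training applied to per-pixel tokens: most pixel tokens are masked, only the visible ones are encoded, and a lightweight decoder reconstructs the masked pixel values.

```python
# Toy sketch of MAE-style self-supervised pre-training over pixel tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPixelMAE(nn.Module):
    def __init__(self, in_ch=3, dim=128, depth=4, heads=4, mask_ratio=0.75, num_pixels=1024):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(in_ch, dim)                        # per-pixel token embedding
        self.pos = nn.Parameter(torch.zeros(1, num_pixels, dim))  # learned position embeddings
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        dec_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, 1)        # lightweight decoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, in_ch)                         # predict raw RGB values

    def forward(self, x):                                         # x: (B, 3, H, W)
        B = x.size(0)
        pixels = x.flatten(2).transpose(1, 2)                     # (B, N, 3), N = H*W
        tokens = self.embed(pixels) + self.pos                    # (B, N, dim)
        N, D = tokens.shape[1:]

        # Random masking: keep only a small subset of pixel tokens visible.
        keep = int(N * (1 - self.mask_ratio))
        shuffle = torch.rand(B, N, device=x.device).argsort(dim=1)
        keep_idx, mask_idx = shuffle[:, :keep], shuffle[:, keep:]
        visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

        # Encode visible tokens only, then decode with mask tokens re-inserted.
        encoded = self.encoder(visible)
        masked = self.mask_token.expand(B, N - keep, D) + torch.gather(
            self.pos.expand(B, -1, -1), 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
        decoded = self.decoder(torch.cat([encoded, masked], dim=1))

        # Reconstruction loss on the masked pixels only, as in MAE.
        pred = self.head(decoded[:, keep:])                       # (B, N - keep, 3)
        target = torch.gather(pixels, 1, mask_idx.unsqueeze(-1).expand(-1, -1, 3))
        return F.mse_loss(pred, target)

if __name__ == "__main__":
    loss = TinyPixelMAE()(torch.randn(2, 3, 32, 32))              # 32*32 = 1024 pixel tokens
    print(loss.item())
```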
In summary, self-supervised pre-training clearly benefits PiT, yielding higher accuracy than training from scratch.