
This Artificial Intelligence (AI) research examines the differences between Transformers and ConvNets using counterfactual simulation tests


For the past decade, convolutional neural networks (CNNs) have been the backbone of computer vision applications. CNNs are designed to process data with a grid-like structure, such as an image: they apply a series of learned filters to the input, extracting features such as edges, corners, and textures. Subsequent layers combine these features into progressively more complex ones and eventually produce a prediction.
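To make the filtering step concrete, here is a minimal sketch of applying a single hand-crafted edge filter to a toy image with plain NumPy. In a real CNN the filter weights are learned and there are many filters per layer; the `conv2d` helper and the toy image below are illustrative assumptions, not code from the paper.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter (Sobel-style): responds where intensity changes left-to-right.
edge_kernel = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=float)

# Toy image: dark left half, bright right half -> one vertical edge.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

response = conv2d(image, edge_kernel)
# The response is zero in flat regions and large near the edge.
```

Stacking many such filter responses, followed by nonlinearities and pooling, is what lets deeper layers build the "more complex features" mentioned above.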

The CNN success story began around 2012 with the release of AlexNet and its extremely impressive performance in image classification. Since then, a great deal of effort has gone into improving CNNs and applying them across many domains.

The dominance of CNNs has recently been challenged by the introduction of the Vision Transformer (ViT). ViT achieved impressive image classification results, surpassing even state-of-the-art CNNs. The competition between CNNs and ViTs is still ongoing, however: depending on the task, the dataset, and the test environment, one outperforms the other.


ViT brings the power of transformers to the field of computer vision by treating an image as a sequence of patches rather than a grid of pixels. These patches are processed with the same self-attention mechanism used in NLP transformers, allowing the model to weigh the importance of each patch based on its relationship to the other patches in the image.
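The patch "tokenization" step can be sketched in a few lines of NumPy. This is a simplified illustration of the standard ViT input pipeline (non-overlapping patches, flattened into vectors); the function name `image_to_patches` is my own, and a real ViT would additionally project each patch with a learned linear layer and add position embeddings.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    producing a (num_patches, patch_size * patch_size * C) sequence --
    the token sequence a ViT-style model feeds to self-attention."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image dims must be divisible by patch size"
    # Carve the image into a (rows, p, cols, p, C) block grid, then
    # reorder so each patch's pixels are contiguous before flattening.
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

# A 224x224 RGB image with 16x16 patches gives 14 * 14 = 196 tokens of length 768.
img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patches(img, 16)
```

Once the image is a sequence of 196 tokens, self-attention operates on it exactly as it would on a sentence of 196 words.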

One of the main advantages of ViT is efficiency: because it does not compute convolutional filters, training is easier and larger models become practical, which can improve performance. Another advantage is flexibility. Since ViT treats the input as a sequence rather than a grid, it can handle images of any size and aspect ratio without additional preprocessing. This contrasts with CNNs, which typically require the input to be scaled and padded to fit a fixed-size grid.

Naturally, people wanted to understand the real benefits of ViT over CNNs, and there have been many recent studies comparing the two. However, these comparisons share a common problem, to a greater or lesser degree: they compare ViTs and CNNs using ImageNet accuracy as the metric, without accounting for the fact that the compared ConvNets use slightly outdated design and training techniques.

So how can we ensure a fair comparison between ViT and CNN? We need to make sure that we compare only the structural differences. The researchers behind this paper describe the comparison that is needed as follows: “We believe it is important to study the differences that arise in the learned representations between Transformers and ConvNets with respect to natural variations such as lighting, occlusions, object scale, object pose, and others.”

This is the main idea behind the paper. But how do you create an environment in which to make this comparison? Two main obstacles stood in the way. First, existing Transformer and ConvNet architectures were not directly comparable: they differed in overall design and training techniques, not only in their use of convolutional versus attention layers. Second, there was a paucity of datasets that include fine naturalistic variations in object scale, object pose, scene lighting, 3D occlusions, and so on.

The first problem was solved by comparing the ConvNeXt CNN with the Swin Transformer architecture; the only substantive difference between these networks is the use of convolutions versus self-attention.

The main contribution of the paper addresses the second problem. The authors propose testing the architectures counterfactually using simulated images. They constructed a synthetic dataset, named the Naturalistic Variation Object Dataset (NVD), which includes controlled modifications of the scene.

Counterfactual simulation is a method of reasoning about what could have happened in the past, or what could happen in the future, under different conditions. It involves considering how the outcome of an event might have differed if one or more of the contributing factors had been different. In our context, it asks what the network would output if we changed the object's pose, the scene lighting, the 3D occlusions, and so on: would the network still predict the correct label for the object?
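The evaluation loop this implies can be sketched as a sweep over controlled scene parameters. Everything here is a hypothetical stand-in, not the paper's code: `render_scene` stands in for a simulator producing NVD-style images, and `model` for a trained classifier (the trivial bodies below just make the structure runnable).

```python
import itertools

def render_scene(pose_deg, light_scale):
    # Stand-in renderer: returns a dummy "image" descriptor instead of pixels.
    return {"pose": pose_deg, "light": light_scale}

def model(image):
    # Stand-in classifier: pretend it fails on extreme object poses.
    return "car" if abs(image["pose"]) <= 60 else "unknown"

poses = range(-90, 91, 30)   # rotate the object in 30-degree steps
lights = [0.5, 1.0, 2.0]     # scale the scene lighting

# Vary one factor at a time across the grid and record correctness.
results = {}
for pose, light in itertools.product(poses, lights):
    pred = model(render_scene(pose, light))
    results[(pose, light)] = (pred == "car")

accuracy = sum(results.values()) / len(results)
```

Grouping `results` by each factor separately is what reveals, for example, that a network is robust to lighting changes but brittle under pose changes.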

The results showed that ConvNeXt was consistently more robust than Swin to variations in object pose and camera rotation. ConvNeXt also tended to perform better than Swin at recognizing small-scale objects. When it came to occlusion, however, the two architectures were roughly equivalent, with Swin slightly outperforming ConvNeXt under severe occlusion. Overall, both architectures struggled with naturalistic variations in the test data, and increasing the size of the network, or the diversity and quantity of the training data, was observed to improve robustness.

Check out the Paper and Project. All credit for this research goes to the researchers of this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.

Ekrem Çetinkaya obtained his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.


