Enhancing Vision Transformer Performance through Data Augmentation and Sharpness-Aware Techniques: A Comparative Study using RMT on CIFAR-10 and ImageNet
Description
The rapid development of transformer architectures has revolutionized computer vision by substituting attention-based mechanisms for convolutional operations. Nevertheless, despite their strong performance, Vision Transformers (ViTs) have weaker inductive biases and are more prone to overfitting, particularly in limited-data scenarios. This thesis explores whether regularization and data-augmentation algorithms originally developed in the pre-transformer era, such as Sharpness-Aware Minimization (SAM), CutMix, and Sharpness-Aware Distilled Teachers (SADT), can be applied to transformer-based image classifiers. The study uses the Vision Retention Network (VisRetNet) and the tiny Retentive Networks Meet Vision Transformers model (RMT-T) on the CIFAR-10 benchmark to systematically evaluate the effect of optimizer-level, data-level, and hybrid regularization strategies on model generalization, convergence stability, and computational cost. Experimental findings indicate that all of the analyzed methods enhanced generalization in limited-data settings, with SAM + CutMix achieving the highest validation accuracy (89.39 percent) and SADT exhibiting the best training stability. On larger-scale data (ImageNet), SAM and SADT provided smaller accuracy gains but smoother loss curves, suggesting that large-scale data already supplies much of the regularization these methods would otherwise add. The results support the hypothesis that classic regularization methods are context-specific: they are valuable when working with small datasets but less effective at larger scale. The thesis contributes a verified PyTorch implementation of SAM and SADT, a hybrid training paradigm that combines data-level and optimizer-level regularization, and empirical evidence connecting classical CNN-era optimization principles with present-day transformer models. These insights confirm that legacy methods remain applicable
in the transformer era, and they provide practical guidance for building efficient, robust, and generalizable vision transformer training pipelines.
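
To illustrate the hybrid paradigm described above, the sketch below combines a CutMix-augmented batch with SAM's two-step (perturb-then-update) optimization in PyTorch. This is a minimal sketch under stated assumptions, not the thesis implementation: the function names (cutmix, sam_cutmix_step), the hyperparameter values (rho, alpha), and the generic classifier and base optimizer are illustrative assumptions rather than the actual experimental configuration.

    # Minimal sketch (not the thesis code) of a SAM step on a CutMix batch.
    # Assumes a generic PyTorch classifier `model`, a base optimizer `opt`,
    # and cross-entropy loss; rho and alpha are illustrative values.
    import torch
    import torch.nn.functional as F

    def cutmix(x, y, alpha=1.0):
        """Paste a random box from a shuffled copy of the batch into each image."""
        x = x.clone()  # avoid mutating the caller's batch
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        idx = torch.randperm(x.size(0), device=x.device)
        H, W = x.shape[-2:]
        rh, rw = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
        cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
        y1, y2 = max(cy - rh // 2, 0), min(cy + rh // 2, H)
        x1, x2 = max(cx - rw // 2, 0), min(cx + rw // 2, W)
        x[:, :, y1:y2, x1:x2] = x[idx, :, y1:y2, x1:x2]
        lam = 1 - (y2 - y1) * (x2 - x1) / (H * W)  # correct lam for the actual box
        return x, y, y[idx], lam

    def sam_cutmix_step(model, opt, x, y, rho=0.05):
        """One SAM update on a CutMix-augmented batch (sketch)."""
        x, y_a, y_b, lam = cutmix(x, y)
        loss_fn = lambda out: (lam * F.cross_entropy(out, y_a)
                               + (1 - lam) * F.cross_entropy(out, y_b))

        # 1) gradients at the current weights
        opt.zero_grad()
        loss_fn(model(x)).backward()

        # 2) ascend to the worst-case point within an L2 ball of radius rho
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2) + 1e-12
        eps = []
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is None:
                    eps.append(None)
                    continue
                e = rho * p.grad / grad_norm
                p.add_(e)
                eps.append(e)

        # 3) gradients at the perturbed weights define the actual update
        opt.zero_grad()
        loss = loss_fn(model(x))
        loss.backward()

        # 4) restore the original weights and take the base-optimizer step
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                if e is not None:
                    p.sub_(e)
        opt.step()
        return loss.item()

In a training loop, a call such as sam_cutmix_step(model, optimizer, images, labels) would stand in for the usual forward/backward/step sequence, pairing the data-level regularizer (CutMix) with the optimizer-level one (SAM) in a single update.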
