Most of the Recent Advancements in Transformers are Useless😱
Google Research
A Google Research study shows that transformer modifications fail to transfer across implementations and applications.
The researchers began by reimplementing and evaluating a variety of transformer variants on the tasks where they are most commonly applied. As a baseline, they used the original transformer model with two modifications: applying layer normalization before the self-attention and feed-forward blocks instead of after, and using relative attention with shared biases instead of sinusoidal positional embeddings.
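To make the baseline concrete, here is a minimal PyTorch sketch of a pre-layer-norm encoder block (my own illustration, not the authors' code; the relative-attention biases with shared parameters are omitted for brevity, and the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class PreLNTransformerBlock(nn.Module):
    """Pre-LN block: LayerNorm is applied *before* the self-attention and
    feed-forward sublayers, with a residual connection around each."""
    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff_norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize, apply the sublayer, then add the residual.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.dropout(attn_out)
        h = self.ff_norm(x)
        return x + self.dropout(self.ff(h))


x = torch.randn(2, 16, 512)               # (batch, sequence, d_model)
print(PreLNTransformerBlock()(x).shape)   # torch.Size([2, 16, 512])
```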
👀
Surprise!
Most of the architecture modifications they examined do not meaningfully improve performance on downstream NLP tasks: they fail to transfer across implementations and applications. See the table below👇 with results for transfer learning based on T5 and for supervised machine translation on the WMT'14 English-German benchmark.
😅 Simple ideas are always the best, and more compute never hurts!
Modifications that were shown to improve performance either
(1) are relatively simple (e.g. a change in activation function; see the sketch after this list), or
(2) rely on an increase in parameter count or FLOPs (e.g. the Switch Transformer or Universal Transformer).
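For illustration, one "simple" change of the kind the paper reports as helpful is swapping the feed-forward block's ReLU for a GLU-style activation such as GEGLU. A hedged PyTorch sketch (mine, with illustrative layer sizes, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLUFeedForward(nn.Module):
    """Feed-forward block with a GEGLU activation: the hidden projection is
    split into a value half and a GELU-activated gate half."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.proj_in = nn.Linear(d_model, 2 * d_ff)  # value and gate in one matmul
        self.proj_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(value * F.gelu(gate))


x = torch.randn(2, 16, 512)
print(GEGLUFeedForward()(x).shape)  # torch.Size([2, 16, 512])
```

The change touches only the feed-forward sublayer, which is what makes it cheap to adopt across codebases.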
And this makes total sense to me.
My take on why this happens: researchers are often under pressure to publish new papers every year. This spurs cherry-picking of results, overstated claims, and spurious architectural modifications. The performance gains reported in many papers are simply the result of overfitting to a specific benchmark, or of more careful hyperparameter tuning than in prior work. And this phenomenon is not unique to transformer and NLP papers; it shows up in other subfields of deep learning research as well.