The Transformer is a deep learning architecture that has made an outstanding contribution to the advancement of AI. It's a significant milestone for AI and for technology as a whole, but it's also a bit complicated. There are already quite a few good resources on Transformers, so why make another one? Two reasons:
- I'm well versed in self-learning, and in my experience, reading how different people describe the same ideas greatly enhances understanding.
- I very rarely read an article and think it's explained simply enough. Tech content creators tend to overcomplicate concepts or under-explain them. It should be well understood that nothing is rocket science, not even rocket science. You can understand anything; you just need a good enough explanation. In this series, I try to provide good enough explanations.
Moreover, as someone who owes his career to articles and open-source code, I see myself as obliged to return the favor.
This series will try to provide a reasonable guide both to people who know almost nothing about AI and to those who already know how machines learn. How am I planning to do that? First and foremost, by explaining. I have probably read close to 1,000 technical papers (such as this) in my career, and the main problem I faced is that authors (probably subconsciously) assume you already know so many things. In this series, I plan to assume less prior knowledge than the Transformer articles I read in preparation for this one.
Furthermore, I'll be combining intuition, math, code, and visualizations, so the series is designed like a candy store: something for everyone. Given that this is an advanced concept in quite a complicated field, I'll risk you thinking, "Wow, this is slow, stop explaining obvious stuff," because that's much better than you thinking, "What the hell is he talking about?"
Transformers, is it worth your time?
What's the fuss about? Is it really so important? Well, as it is the basis of some of the world's most advanced AI-driven technological tools (e.g., GPT et al.), it probably is.
Although, as with many scientific advancements, some of the ideas were described previously, the first in-depth, complete description of the architecture came from the "Attention Is All You Need" paper, which claims the following to be a "simple network architecture."
If you are like most people, you don't think this is a simple network architecture. Therefore, my job is to make a good effort so that by the time you finish reading this series, you think to yourself: this is still not simple, but I do get it.
So, this crazy diagram, what the heck?
What we are seeing is a deep learning architecture, which means that each of those boxes translates into some piece of code, and all of that code together does something that, as of now, people don't really know how to do otherwise.
Transformers can be applied to many different use cases, but probably the most famous one is automated chat: software that can converse about many subjects as if it knew a lot. It resembles the Matrix, in a way.
I want to make it easy for people to read only what they actually need, so the series is broken down according to the way I think the Transformer story should be told. The first part is here, and it covers the first piece of the architecture: inputs.