The brain’s versatility is also mirrored by our most advanced machine-learning models. While Transformers were initially developed for natural language processing, reflecting language’s inductive bias, they have since been applied to a wide array of data types, from images and audio to time series (whether this is primarily a result of the current hype around them can be debated). Just as the visual cortex in blind people can repurpose itself for auditory or tactile tasks, Transformers can be adapted to a variety of data types, suggesting the existence of universal patterns that these models can capture, irrespective of the specific input modality (in all fairness, this might also be partly because Transformers are well suited to our current computational architectures, since they can be heavily parallelized).
Quoting the original NFL paper by Wolpert and Macready: ‘any two optimization algorithms are equal when their performance is averaged across all possible problems’. But, to paraphrase Orwell, when we look at most of the problems that really concern us, some optimization algorithms are more equal than others.
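The averaging claim is easy to verify by brute force on a toy problem. The sketch below enumerates every function from a three-point domain to {0, 1} and compares two deterministic search strategies that evaluate the points in opposite orders; averaged over all functions, their best-found values after any number of steps come out identical. The domain size, the two orderings, and the ‘best value so far’ performance measure are all illustrative choices, not anything from the original paper.

```python
from itertools import product

# Every function f: {0, 1, 2} -> {0, 1}; there are 2^3 = 8 of them.
domain = [0, 1, 2]
all_functions = [dict(zip(domain, values)) for values in product([0, 1], repeat=len(domain))]

def avg_best(order, steps):
    # Best value seen after `steps` evaluations, averaged over all functions.
    return sum(max(f[x] for x in order[:steps]) for f in all_functions) / len(all_functions)

# Two search "algorithms": one scans left to right, the other right to left.
forward, backward = [0, 1, 2], [2, 1, 0]
for steps in (1, 2, 3):
    assert avg_best(forward, steps) == avg_best(backward, steps)
print("equal average performance at every step")
```

Of course, no real set of problems is uniform over all possible functions, which is exactly why some algorithms end up more equal than others in practice.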
For instance, the application of Transformers to image data has resulted in models such as Vision Transformers, which treat an image as a sequence of patches and apply the same self-attention mechanism.
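The patch trick is simple enough to sketch in a few lines: the image is cut into fixed-size patches, each patch is flattened, and a linear projection maps it into the model’s embedding space, producing a ‘sentence’ of patch tokens. The sizes below echo common ViT choices (224×224 images, 16×16 patches), but the projection here is random, a stand-in for the learned one.

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.standard_normal((224, 224, 3))   # H x W x C input image
patch_size = 16                              # non-overlapping 16x16 patches
embed_dim = 768                              # token embedding dimension

# Cut the image into patches and flatten each one into a vector.
h_patches = image.shape[0] // patch_size     # 14
w_patches = image.shape[1] // patch_size     # 14
patches = image.reshape(h_patches, patch_size, w_patches, patch_size, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * 3)

# A linear projection (learned in a real ViT, random here) turns each
# flattened patch into a token the Transformer can attend over.
projection = rng.standard_normal((patch_size * patch_size * 3, embed_dim))
tokens = patches @ projection

print(tokens.shape)  # (196, 768): 196 patch tokens of dimension 768
```

From this point on, the self-attention layers are oblivious to whether the 196 tokens came from words or from image patches.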
This idea is further exemplified by multi-modal Transformers, which can process and understand multiple types of data simultaneously. Text-and-image models such as CLIP are trained on images and their associated text descriptions concurrently, learning the relationship between the two by mapping both modalities into a joint embedding space.
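The training scheme behind that joint space can be sketched as a contrastive objective: encode a batch of images and their captions, then push each matching image–text pair together while pushing mismatched pairs apart. In the sketch below the ‘encoders’ are just stand-in arrays (paired texts are made to resemble their images), and the temperature of 0.07 is the value CLIP initializes its learnable temperature to; none of this is the real model, only the shape of the objective.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 32

# Stand-in features from hypothetical image and text encoders.
image_features = rng.standard_normal((batch, dim))
text_features = image_features + 0.01 * rng.standard_normal((batch, dim))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img = normalize(image_features)
txt = normalize(text_features)

# Cosine-similarity logits between every image and every text in the batch.
logits = img @ txt.T / 0.07  # fixed temperature; CLIP learns this parameter

def cross_entropy(logits, targets):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Image i matches text i: the correct "class" for row i is column i,
# and symmetrically for the text-to-image direction.
targets = np.arange(batch)
loss = (cross_entropy(logits, targets) + cross_entropy(logits.T, targets)) / 2
print(round(loss, 4))
```

Minimizing this symmetric loss is what aligns the two modalities: after training, nearest neighbours in the shared space cross freely between pictures and captions.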
Such models capitalize on the common structures between different data types and effectively create a ‘translation’ system between them. A striking example of this capability is familiar to most of us by now: by providing the foundation for diffusion-based models such as DALL-E, CLIP allows us to generate impressive pictures from textual descriptions alone.
That this works so well might also not be altogether a coincidence: fundamental units of meaning in language might have counterparts in visual processing, and visual perception might also be understood in terms of a finite set of basic visual patterns or ‘visual words’.
In language, the units most meaningful to humans, like ‘cat’, ‘dog’, or ‘love’, are short, and language is structured around the same concepts that have salience to us in visual scenes: the internet very likely features more pictures of cats than of noisy blobs of color. More generally, it’s important to note that language is already a derived modality, one that large groups of people jointly developed to condense the things most important to them into a concise symbolic representation.