Motivating Self-Attention | Ryan Xu

We don’t want to completely replace the value of v_Riley with v_dog, so let’s say that we take a linear combination of v_Riley and v_dog as the new value for v_Riley:

v_Riley = get_value('Riley')
v_dog = get_value('dog')

ratio = .75
v_Riley = (ratio * v_Riley) + ((1-ratio) * v_dog)

This seems to work alright: we’ve embedded a bit of the meaning of the word “dog” into the word “Riley”.
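For anyone who wants to run the blend, here is a minimal sketch using NumPy. The three-dimensional vectors are made up for illustration and stand in for whatever get_value would return.

```python
import numpy as np

# Hypothetical 3-dimensional value vectors (stand-ins for get_value output).
v_riley = np.array([1.0, 0.0, 0.0])
v_dog = np.array([0.0, 1.0, 0.0])

ratio = 0.75  # fraction of the original "Riley" vector to keep

# Keep 75% of "Riley" and mix in 25% of "dog".
v_riley_new = ratio * v_riley + (1 - ratio) * v_dog
print(v_riley_new)
```

The result still points mostly in the original "Riley" direction, with a smaller component borrowed from "dog".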

Now we would like to apply this form of attention to the whole sentence, updating the vector representation of every single word using the vector representations of every other word.

What goes wrong here?

The core problem is that we don’t know which words should take on the meanings of other words. We would also like some measure of how much the value of each word should contribute to each other word.

Part 2

Alright. So we need to know how much two words should be related.

Time for attempt number 2.

I’ve redesigned our vector database so that each word actually has two associated vectors. The first is the same value vector that we had before, still denoted by v. In addition, we now have unit vectors denoted by k that store some notion of word relations. Specifically, if two k vectors are close together, it means that the values associated with these words are likely to influence each other’s meanings.

With our new k and v vectors, how can we modify our previous scheme to update v_Riley’s value with v_dog in a way that respects how much two words are related?

Let’s continue with the same linear combination business as before, but only if the k vectors of both are close in embedding space. Even better, we can use the dot product of the two k vectors to tell us how much we should update v_Riley with v_dog. (For unit vectors the dot product lies between -1 and 1; let’s assume for now that our keys are arranged so that it stays between 0 and 1.)

v_Riley, v_dog = get_value('Riley'), get_value('dog')
k_Riley, k_dog = get_key('Riley'), get_key('dog')

relevance = k_Riley · k_dog # dot product

v_Riley = (1 - relevance) * v_Riley + (relevance) * v_dog

This is a little bit strange since if relevance is 1, v_Riley gets completely replaced by v_dog, but let’s ignore that for a minute.
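Here is a runnable sketch of the relevance-weighted blend, assuming NumPy. The keys and values are invented for illustration, and the helper unit is hypothetical. The update weights v_dog by relevance, so that a relevance of 1 replaces v_Riley entirely, as described above.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1 (hypothetical helper)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Made-up keys with non-negative components, which keeps
# their dot product between 0 and 1.
k_riley = unit([1.0, 1.0])
k_dog = unit([1.0, 0.5])

# Made-up value vectors.
v_riley = np.array([1.0, 0.0, 0.0])
v_dog = np.array([0.0, 1.0, 0.0])

# For unit vectors the dot product is the cosine of the angle between them.
relevance = float(k_riley @ k_dog)

# Higher relevance means more of v_dog is mixed into v_riley.
v_riley_new = (1 - relevance) * v_riley + relevance * v_dog
print(relevance, v_riley_new)
```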

I want to instead think about what happens when we apply this kind of idea to the whole sequence. The word “Riley” will have a relevance value with each other word via the dot product of their ks. So, maybe we can instead update the value of each word proportionally to the value of the dot product. For simplicity, let’s also include its dot product with itself as a way to preserve its own value.

sentence = "Evan's dog Riley is so hyper, she never stops moving"
words = sentence.split()

# obtain a list of values
values = get_values(words)

# oh yeah, that's what k stands for by the way
keys = get_keys(words)

# get riley's relevance key
riley_index = words.index('Riley')
riley_key = keys[riley_index]

# generate relevance of "Riley" to each other word
relevances = [riley_key · key for key in keys] #still pretending python has ·

# normalize relevances to sum to 1
relevances /= sum(relevances)

# takes a linear combination of values, weighted by relevances
v_Riley = relevances · values

Ok that’s good enough for now.
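If you’d rather not pretend that Python has ·, the same computation can be sketched with NumPy. The random vectors below are placeholders for get_values and get_keys, and the embedding size n = 4 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

sentence = "Evan's dog Riley is so hyper, she never stops moving"
words = sentence.split()
n = 4  # hypothetical embedding size

# Stand-ins for get_values / get_keys: random non-negative vectors.
values = rng.random((len(words), n))
keys = rng.random((len(words), n))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # make each key a unit vector

riley_index = words.index("Riley")
riley_key = keys[riley_index]

# One dot product per word, including "Riley" with itself.
relevances = keys @ riley_key

# Normalize so the weights sum to 1.
relevances = relevances / relevances.sum()

# Weighted average of all value vectors, shape (n,).
v_riley = relevances @ values
```

Because the keys are unit vectors, the word’s relevance to itself is the largest entry before normalization, so its own value gets the largest weight.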

But once again, I claim that there’s something wrong with this approach. It’s not that any of our ideas have been implemented incorrectly, but rather there’s something fundamentally different between this approach and how we actually think about relationships between words.

If there’s any point in this article where I really really think that you should stop and think, it’s here. Even those of you who think you fully understand attention. What’s wrong with our approach?

A hint

Relationships between words are inherently asymmetric! The way that “Riley” attends to “dog” is different from the way that “dog” attends to “Riley”. It’s a much bigger deal that “Riley” refers to a dog, not a human, than that the dog happens to be named Riley.

In contrast, the dot product is a symmetric operation, which means that in our current setup, if a attends to b, then b attends equally strongly to a! Actually, this is somewhat false because we’re normalizing the relevance scores, but the point is that the words should have the option of attending in an asymmetric way, even if the other tokens are held constant.
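A quick check of the symmetry claim, as a sketch with random unit keys (the sizes are arbitrary assumptions): the raw key-against-key scores form a symmetric matrix, while row normalization rescales each row by a different amount.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five hypothetical unit-length keys of dimension 4.
keys = rng.random((5, 4))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

# raw[i, j] is the dot product of key i with key j.
raw = keys @ keys.T
print(np.allclose(raw, raw.T))  # the raw scores are symmetric

# Divide each row by its own sum so that every row sums to 1.
normalized = raw / raw.sum(axis=1, keepdims=True)
print(np.allclose(normalized, normalized.T))  # typically no longer symmetric
```

The asymmetry after normalization is only a side effect of differing row sums; a word still has no way to choose how strongly it attends to another.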

Part 3

We’re almost there! Finally, the question becomes:

How can we most naturally extend our current setup to allow for asymmetric relationships?

Well, what can we do with one more vector type? We still have our value vectors v and our relation vectors k. Now we have yet another vector q for each token.

How can we modify our setup and use q to achieve the asymmetric relationship that we want?

Those of you who are familiar with how self-attention works will hopefully be smirking at this point.

Instead of computing relevance k_dog · k_Riley when “Riley” attends to “dog”, we can instead query q_Riley against the key k_dog by taking their dot product. When computing the other way around, we will have q_dog · k_Riley instead: asymmetric relevance!
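To make the asymmetry concrete, here is a tiny sketch with invented two-dimensional queries and keys for just these two words. The numbers are arbitrary and chosen only so the two directions differ.

```python
import numpy as np

# Hypothetical 2-d queries and keys for two words.
q_riley, k_riley = np.array([1.0, 0.0]), np.array([0.0, 1.0])
q_dog, k_dog = np.array([0.2, 0.1]), np.array([0.9, 0.3])

# "Riley" attending to "dog": Riley's query against dog's key.
riley_to_dog = q_riley @ k_dog

# "dog" attending to "Riley": dog's query against Riley's key.
dog_to_riley = q_dog @ k_riley

print(riley_to_dog, dog_to_riley)
```

With these numbers, “Riley” attends strongly to “dog” while “dog” barely attends to “Riley”, which a single shared k vector per word could not express.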

Here’s the whole thing together, computing the update for every value at once!

sentence = "Evan's dog Riley is so hyper, she never stops moving"
words = sentence.split()
seq_len = len(words)

# obtain arrays of queries, keys, and values, each of shape (seq_len, n)
Q = array(get_queries(words))
K = array(get_keys(words))
V = array(get_values(words))

relevances = Q @ K.T
# keepdims=True keeps the sums as a column, so each row is divided by its own sum
normalized_relevances = relevances / relevances.sum(axis=1, keepdims=True)

new_V = normalized_relevances @ V
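A self-contained version of the above that actually runs, as a sketch. The function name simple_attention is hypothetical, and random non-negative arrays stand in for get_queries, get_keys, and get_values. Keeping the entries non-negative is an assumption that guarantees positive row sums; with negative scores, dividing by the sum could misbehave.

```python
import numpy as np

def simple_attention(Q, K, V):
    """Sum-normalized attention as built up in this walkthrough."""
    # relevances[i, j] is query i against key j, shape (seq_len, seq_len).
    relevances = Q @ K.T
    # Divide each row by its own sum so the weights in a row sum to 1.
    weights = relevances / relevances.sum(axis=1, keepdims=True)
    # Each new value is a weighted average of all value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, n = 10, 4  # ten words, hypothetical embedding size of 4

# Random non-negative placeholders for the query, key, and value lookups.
Q = rng.random((seq_len, n))
K = rng.random((seq_len, n))
V = rng.random((seq_len, n))

new_V, weights = simple_attention(Q, K, V)
print(new_V.shape, weights.sum(axis=1))
```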
