Representing data in a non-Euclidean space
It is hard to understand how we could represent data in any way other than with vectors in Rn. And how could we move away from the Euclidean distance we know so well when comparing two vector representations?
A solution comes from Riemannian geometry: manifolds. Manifolds are objects that look like Rn, but only locally. That means we can still use vectors to represent our data points locally. But only locally!
The notion of similarity, or distance, is key in Machine Learning. If we are building, say, an NLP model, we want to preserve semantic similarity within the embedding space that represents the textual input. In other words, we want two words that are similar in meaning to also be close in the embedding space, i.e. to have a low Euclidean distance. Likewise, two words that are dissimilar in meaning should be far apart in the Euclidean space, i.e. have a high Euclidean distance.
Hence, we need an equivalent tool when we leave Euclidean geometry. That tool is the Riemannian metric. The Riemannian metric allows us to compare two entities in a non-Euclidean space while preserving this intuitive notion of distance.
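To make this concrete, here is a minimal sketch comparing the familiar Euclidean distance with the distance on the Poincaré ball, a hyperbolic space we will meet again later. The formula is the standard Poincaré-ball distance; the sample points are arbitrary illustrative values.

```python
import math

def euclidean_distance(u, v):
    # Standard Euclidean distance in R^n.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def poincare_distance(u, v):
    # Distance on the Poincare ball (points must have Euclidean norm < 1):
    # d(u, v) = arcosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v)))

u, v = (0.0, 0.0), (0.5, 0.0)
print(euclidean_distance(u, v))  # 0.5
print(poincare_distance(u, v))   # ~1.0986 (= ln 3)
```

Same two points, two different notions of distance: the hyperbolic metric "stretches" the gap between them. This is exactly what a Riemannian metric buys us, a consistent way to measure distances once the space is no longer flat.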
👀 I remember.
Let us recap: in this non-Euclidean framework, we can perform operations locally on our data representations, and we have a metric to measure distances. Thus, we are equipped to do ML in non-Euclidean spaces.
🙌🏻 Why should I learn more about ML in a non-Euclidean space?
So far, we know that ML without the genius Euclid is actually a thing: real projects approach our traditional machine learning problems within a different geometric framework.
Now, a question naturally emerges: is it worth our time learning more than the existence of this field?
It is a rather scary space that involves non-trivial mathematics. But my friend, Aniss Medbouhi, ML Doctoral Researcher at KTH, will help us get past the inherent complexity of this space.
The other reason I wasn’t convinced by this space is that I had read it was mostly suited to hierarchical data that can be described by trees. At first glance, that doesn’t involve the data I work with on a daily basis.
However, the abstracts below give us an idea of relevant datasets of interest:
“However, recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but negatively curved, hyperbolic space. We present a new concept that exploits these recent insights and propose learning neural embeddings of graphs in hyperbolic space. We provide experimental evidence that embedding graphs in their natural geometry significantly improves performance on downstream tasks for several real-world public datasets.” Chamberlain et al.
“However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. For this purpose, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space — or more precisely into an n-dimensional Poincaré ball.” Nickel and Kiela
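To give a feel for why the Poincaré ball mentioned by Nickel and Kiela suits hierarchies: hyperbolic distances grow very fast as points approach the boundary of the ball, leaving ever more "room" for the leaves of a tree near the edge. A small sketch, using the known closed form for the distance from the origin (the radii are arbitrary illustrative values):

```python
import math

def poincare_distance_to_origin(radius):
    # On the Poincare ball, the distance from the origin to a point
    # at Euclidean radius r (with r < 1) is d(0, x) = 2 * artanh(r).
    return 2 * math.atanh(radius)

for r in (0.5, 0.9, 0.99, 0.999):
    print(r, poincare_distance_to_origin(r))
```

A tiny Euclidean step toward the boundary corresponds to a large hyperbolic distance, so deep leaves of a hierarchy can all be placed far apart from each other while staying inside the unit ball.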
The datasets mentioned above are listed as follows by Chamberlain et al.:
(1) Karate: Zachary’s karate club contains 34 vertices divided into two factions. 
(2) Polbooks: A network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. Edges between books represent frequent co-purchasing of books by the same buyers.
(3) Football: A network of American football games between Division IA colleges during regular season Fall 2000. 
(4) Adjnoun: Adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens. 
(5) Polblogs: A network of hyperlinks between weblogs on US politics, recorded in 2005. 
Additionally, in biology, we find hierarchical data as well:
- Biology: evolutionary data, such as proteins.
Finally, NLP data, i.e. textual data, is another type of hierarchical data. As a result, a lot of domains may benefit from understanding the advancements in non-Euclidean Machine Learning.
Now that we know how to better represent certain datasets, it is key to tie this back to Machine Learning. Any downstream ML task requires ingesting data first. A lot of time is spent cleaning the underlying data and representing it accurately. The quality of the data representation is essential, as it directly impacts the performance of our models. For example, in NLP, I advise my students to focus on architectures that provide good embeddings, e.g. contextual embeddings. There has been extensive research on improving embeddings, moving from shallow neural networks (fastText, word2vec) to deep neural networks and transformers (Sentence-Transformers, BERT, RoBERTa, XLM). However, it is also worth noting that data representation is very much linked to the task at hand: research shows that, for certain tasks, some shallow neural networks give better results than deep ones.
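As an illustration of how an embedding space encodes semantic similarity, here is a toy comparison using cosine similarity, a standard way to compare embedding vectors. The 3-dimensional "word vectors" below are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: "cat" and "dog" point in similar directions,
# "car" points elsewhere.
cat = (0.9, 0.8, 0.1)
dog = (0.8, 0.9, 0.2)
car = (0.1, 0.2, 0.9)

print(cosine_similarity(cat, dog))  # high: close in meaning
print(cosine_similarity(cat, car))  # low: far in meaning
```

A good embedding model is precisely one that produces vectors for which this kind of geometric comparison mirrors semantic similarity.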
In this article, we saw that we can leverage non-Euclidean geometry to tackle existing problems specific to spherical data and hierarchical datasets like graphs. When embedding such datasets into a Euclidean space, the price to pay is a distortion that prevents distances in the original space from being preserved in the embedding space. This distortion is intuitive in the case of the Earth: there are many ways to represent our globe, some of which fail to preserve expected core properties such as areas. Similarly, for graphs, core properties need to be preserved, and distorting the underlying space may result in poorer performance on downstream Machine Learning tasks.
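The distortion can be seen even on a tiny graph. A sketch, assuming a star tree (one root, three leaves) with unit-length edges, laid out symmetrically in the plane: the graph distance between any two leaves is 2 (leaf to root to leaf), yet in this natural symmetric layout the leaves end up only √3 apart.

```python
import math

# Star tree: root at the origin, three leaves placed symmetrically
# on the unit circle (a natural symmetric planar layout).
root = (0.0, 0.0)
leaves = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
          for k in range(3)]

graph_dist = 2.0                          # leaf -> root -> leaf along unit edges
embedded_dist = math.dist(leaves[0], leaves[1])  # sqrt(3) ~ 1.732

print(graph_dist / embedded_dist)  # ~1.1547: the embedding contracts leaf-to-leaf distances
```

With more leaves the contraction worsens, which is one intuition for why trees embed poorly into flat space but comfortably into hyperbolic space.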
In the next chapter, we will learn more about spherical and hyperbolic geometries. We will focus more on the latter and give intuition on how models in such space can better embed hierarchical data.