Graph Convolutional Networks: Introduction to GNNs


A step-by-step guide using PyTorch Geometric

Image by author

Graph Neural Networks (GNNs) represent one of the most captivating and rapidly evolving architectures within the deep learning landscape. As deep learning models designed to process data structured as graphs, GNNs bring remarkable versatility and powerful learning capabilities.

Among the various types of GNNs, the Graph Convolutional Networks (GCNs) have emerged as the most prevalent and broadly applied model. GCNs are innovative due to their ability to leverage both the features of a node and its locality to make predictions, providing an effective way to handle graph-structured data.

In this article, we will delve into the mechanics of the GCN layer and explain its inner workings. Furthermore, we will explore its practical application for node classification tasks, using PyTorch Geometric as our tool of choice.

PyTorch Geometric is a specialized extension of PyTorch that has been created specifically for the development and implementation of GNNs. It is an advanced, yet user-friendly library that provides a comprehensive suite of tools to facilitate graph-based machine learning. To commence our journey, the PyTorch Geometric installation will be required. If you are using Google Colab, PyTorch should already be in place, so all we need to do is execute a few additional commands.

All the code is available on Google Colab and GitHub.

!pip install torch_geometric
import torch
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

Now that PyTorch Geometric is installed, let’s explore the dataset we will use in this tutorial.

Graphs are an essential structure for representing relationships between objects. You can encounter graph data in a multitude of real-world scenarios, such as social and computer networks, chemical structures of molecules, natural language processing, and image recognition, to name a few.

In this article, we will study the infamous and much-used Zachary’s karate club dataset.

Image by author

The Zachary’s karate club dataset embodies the relationships formed within a karate club as observed by Wayne W. Zachary during the 1970s. It is a kind of social network, where each node represents a club member, and edges between nodes represent interactions that occurred outside the club environment.

In this particular scenario, the members of the club are split into four distinct groups. Our task is to assign the correct group to each member (node classification), based on the pattern of their interactions.

Let’s import the dataset with PyG’s built-in function and try to understand the Datasets object it uses.

from torch_geometric.datasets import KarateClub
# Import dataset from PyTorch Geometric
dataset = KarateClub()
# Print information
print(dataset)
print('------------')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')
KarateClub()
------------
Number of graphs: 1
Number of features: 34
Number of classes: 4

This dataset only has 1 graph, where each node has a feature vector of 34 dimensions and is part of one out of four classes (our four groups). Actually, the Datasets object can be seen as a collection of Data (graph) objects.

We can further inspect our unique graph to know more about it.

# Print first element
print(f'Graph: {dataset[0]}')
Graph: Data(x=[34, 34], edge_index=[2, 156], y=[34], train_mask=[34])

The Data object is particularly interesting. Printing it offers a good summary of the graph we’re studying:

  • x=[34, 34] is the node feature matrix with shape (number of nodes, number of features). In our case, it means that we have 34 nodes (our 34 members), each node being associated to a 34-dim feature vector.
  • edge_index=[2, 156] represents the graph connectivity (how the nodes are connected) with shape (2, number of directed edges).
  • y=[34] is the node ground-truth labels. In this problem, every node is assigned to one class (group), so we have one value for each node.
  • train_mask=[34] is an optional attribute that tells which nodes should be used for training with a list of True or False statements.

Let’s print each of these tensors to understand what they store. Let’s start with the node features.

data = dataset[0]
print(f'x = {data.x.shape}')
print(data.x)
x = torch.Size([34, 34])
tensor([[1., 0., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 0., ..., 0., 1., 0.],
[0., 0., 0., ..., 0., 0., 1.]])

Here, the node feature matrix x is an identity matrix: it doesn’t contain any relevant information about the nodes. It could contain information like age, skill level, etc. but this is not the case in this dataset. It means we’ll have to classify our nodes just by looking at their connections.

Now, let’s print the edge index.

print(f'edge_index = {data.edge_index.shape}')
print(data.edge_index)
edge_index = torch.Size([2, 156])
tensor([[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7,
7, 7, 8, 8, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 13, 13, 13,
13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 20, 20, 21,
21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26, 27, 27,
27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31,
31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33],
[ 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 17, 19, 21, 31, 0, 2,
3, 7, 13, 17, 19, 21, 30, 0, 1, 3, 7, 8, 9, 13, 27, 28, 32, 0,
1, 2, 7, 12, 13, 0, 6, 10, 0, 6, 10, 16, 0, 4, 5, 16, 0, 1,
2, 3, 0, 2, 30, 32, 33, 2, 33, 0, 4, 5, 0, 0, 3, 0, 1, 2,
3, 33, 32, 33, 32, 33, 5, 6, 0, 1, 32, 33, 0, 1, 33, 32, 33, 0,
1, 32, 33, 25, 27, 29, 32, 33, 25, 27, 31, 23, 24, 31, 29, 33, 2, 23,
24, 33, 2, 31, 33, 23, 26, 32, 33, 1, 8, 32, 33, 0, 24, 25, 28, 32,
33, 2, 8, 14, 15, 18, 20, 22, 23, 29, 30, 31, 33, 8, 9, 13, 14, 15,
18, 19, 20, 22, 23, 26, 27, 28, 29, 30, 31, 32]])

In graph theory and network analysis, connectivity between nodes is stored using a variety of data structures. The edge_index is one such data structure, where the graph’s connections are stored in two lists (156 directed edges, which equate to 78 bidirectional edges). The reason for these two lists is that one list stores the source nodes, while the second one identifies the destination nodes.

This method is known as a coordinate list (COO) format, which is essentially a means to efficiently store a sparse matrix. Sparse matrices are data structures that efficiently store matrices with a majority of zero elements. In the COO format, only non-zero elements are stored, saving memory and computational resources.

Contrarily, a more intuitive and straightforward way to represent graph connectivity is through an adjacency matrix A. This is a square matrix where each element Aᵢⱼ specifies the presence or absence of an edge from node i to node j in the graph. In other words, a non-zero element Aᵢⱼ implies a connection from node i to node j, and a zero indicates no direct connection.

Image by author

An adjacency matrix, however, is not as space-efficient as the COO format for sparse matrices or graphs with fewer edges. However, for clarity and easy interpretation, the adjacency matrix remains a popular choice for representing graph connectivity.

The adjacency matrix can be inferred from the edge_index with a utility function to_dense_adj().

from torch_geometric.utils import to_dense_adj
A = to_dense_adj(data.edge_index)[0].numpy().astype(int)
print(f'A = {A.shape}')
print(A)
A = (34, 34)
[[0 1 1 ... 1 0 0]
[1 0 1 ... 0 0 0]
[1 1 0 ... 0 1 0]
...
[1 0 0 ... 0 1 1]
[0 0 1 ... 1 0 1]
[0 0 0 ... 1 1 0]]

With graph data, it is relatively uncommon for nodes to be densely interconnected. As you can see, our adjacency matrix A is sparse (filled with zeros).

In many real-world graphs, most nodes are connected to only a few other nodes, resulting in a large number of zeros in the adjacency matrix. Storing so many zeros is not efficient at all, which is why the COO format is adopted by PyG.

On the contrary, ground-truth labels are easy to understand.

print(f'y = {data.y.shape}')
print(data.y)
y = torch.Size([34])
tensor([1, 1, 1, 1, 3, 3, 3, 1, 0, 1, 3, 1, 1, 1, 0, 0, 3, 1, 0, 1, 0, 1, 0, 0,
2, 2, 0, 0, 2, 0, 0, 2, 0, 0])

Our node ground-truth labels stored in y simply encode the group number (0, 1, 2, 3) for each node, which is why we have 34 values.

Finally, let’s print the train mask.

print(f'train_mask = {data.train_mask.shape}')
print(data.train_mask)
train_mask = torch.Size([34])
tensor([ True, False, False, False, True, False, False, False, True, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, True, False, False, False, False, False,
False, False, False, False])

The train mask shows which nodes are supposed to be used for training with True statements. These nodes represent the training set, while the others can be considered as the test set. This division helps in model evaluation by providing unseen data for testing.

But we’re not done yet! The Data object has a lot more to offer. It provides various utility functions that enable the investigation of several properties of the graph. For instance:

  • is_directed() tells you if the graph is directed. A directed graph signifies that the adjacency matrix is not symmetric, i.e., the direction of edges matters in the connections between nodes.
  • isolated_nodes() checks if some nodes are not connected to the rest of the graph. These nodes are likely to pose challenges in tasks like classification due to their lack of connections.
  • has_self_loops() indicates if at least one node is connected to itself. This is distinct from the concept of loops: a loop implies a path that starts and ends at the same node, traversing other nodes in between.

In the context of the Zachary’s karate club dataset, all these properties return False. This implies that the graph is not directed, does not have any isolated nodes, and none of its nodes are connected to themselves.

print(f'Edges are directed: {data.is_directed()}')
print(f'Graph has isolated nodes: {data.has_isolated_nodes()}')
print(f'Graph has loops: {data.has_self_loops()}')
Edges are directed: False
Graph has isolated nodes: False
Graph has loops: False

Finally, we can convert a graph from PyTorch Geometric to the popular graph library NetworkX using to_networkx. This is particularly useful to visualize a small graph with networkx and matplotlib.

Let’s plot our dataset with a different color for each group.

from torch_geometric.utils import to_networkx
G = to_networkx(data, to_undirected=True)
plt.figure(figsize=(12,12))
plt.axis('off')
nx.draw_networkx(G,
pos=nx.spring_layout(G, seed=0),
with_labels=True,
node_size=800,
node_color=data.y,
cmap="hsv",
vmin=-2,
vmax=3,
width=0.8,
edge_color="grey",
font_size=14
)
plt.show()

This plot of Zachary’s karate club displays our 34 nodes, 78 (bidirectional) edges, and 4 labels with 4 different colors. Now that we’ve seen the essentials of loading and handling a dataset with PyTorch Geometric, we can introduce the Graph Convolutional Network architecture.

This section aims to introduce and build the graph convolutional layer from the ground up.

In traditional neural networks, linear layers apply a linear transformation to the incoming data. This transformation converts input features x into hidden vectors h through the use of a weight matrix 𝐖. Ignoring biases for the time being, this can be expressed as:

With graph data, an additional layer of complexity is added through the connections between nodes. These connections matter because, typically, in networks, it’s assumed that similar nodes are more likely to be linked to each other than dissimilar ones, a phenomenon known as network homophily.

We can enrich our node representation by merging its features with those of its neighbors. This operation is called convolution, or neighborhood aggregation. Let’s represent the neighborhood of node i including itself as Ñ.

Unlike filters in Convolutional Neural Networks (CNNs), our weight matrix 𝐖 is unique and shared among every node. But there is another issue: nodes do not have a fixed number of neighbors like pixels do.

How do we address cases where one node has only one neighbor, and another has 500? If we simply sum the feature vectors, the resulting embedding h would be much larger for the node with 500 neighbors. To ensure a similar range of values for all nodes and comparability between them, we can normalize the result based on the degree of nodes, where degree refers to the number of connections a node has.

We’re almost there! Introduced by Kipf et al. (2016), the graph convolutional layer has one final improvement.

The authors observed that features from nodes with numerous neighbors propagate much more easily than those from more isolated nodes. To offset this effect, they suggested assigning bigger weights to features from nodes with fewer neighbors, thus balancing the influence across all nodes. This operation is written as:

Note that when i and j have the same number of neighbors, it is equivalent to our own layer. Now, let’s see how to implement it in Python with PyTorch Geometric.

PyTorch Geometric provides the GCNConv function, which directly implements the graph convolutional layer.

In this example, we’ll create a basic Graph Convolutional Network with a single GCN layer, a ReLU activation function, and a linear output layer. This output layer will yield four values corresponding to our four categories, with the highest value determining the class of each node.

In the following code block, we define the GCN layer with a 3-dimensional hidden layer.

from torch.nn import Linear
from torch_geometric.nn import GCNConv
class GCN(torch.nn.Module):
def __init__(self):
super().__init__()
self.gcn = GCNConv(dataset.num_features, 3)
self.out = Linear(3, dataset.num_classes)
def forward(self, x, edge_index):
h = self.gcn(x, edge_index).relu()
z = self.out(h)
return h, z
model = GCN()
print(model)
GCN(
(gcn): GCNConv(34, 3)
(out): Linear(in_features=3, out_features=4, bias=True)
)

If we added a second GCN layer, our model would not only aggregate feature vectors from the neighbors of each node, but also from the neighbors of these neighbors.

We can stack several graph layers to aggregate more and more distant values, but there’s a catch: if we add too many layers, the aggregation becomes so intense that all the embeddings end up looking the same. This phenomenon is called over-smoothing and can be a real problem when you have too many layers.

Now that we’ve defined our GNN, let’s write a simple training loop with PyTorch. I chose a regular cross-entropy loss since it’s a multi-class classification task, with Adam as optimizer. In this article, we won’t implement a train/test split to keep things simple and focus on how GNNs learn instead.

The training loop is standard: we try to predict the correct labels, and we compare the GCN’s results to the values stored in data.y. The error is calculated by the cross-entropy loss and backpropagated with Adam to fine-tune our GNN’s weights and biases. Finally, we print metrics every 10 epochs.

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.02)
# Calculate accuracy
def accuracy(pred_y, y):
return (pred_y == y).sum() / len(y)
# Data for animations
embeddings = []
losses = []
accuracies = []
outputs = []
# Training loop
for epoch in range(201):
# Clear gradients
optimizer.zero_grad()
# Forward pass
h, z = model(data.x, data.edge_index)
# Calculate loss function
loss = criterion(z, data.y)
# Calculate accuracy
acc = accuracy(z.argmax(dim=1), data.y)
# Compute gradients
loss.backward()
# Tune parameters
optimizer.step()
# Store data for animations
embeddings.append(h)
losses.append(loss)
accuracies.append(acc)
outputs.append(z.argmax(dim=1))
# Print metrics every 10 epochs
if epoch % 10 == 0:
print(f'Epoch {epoch:>3} | Loss: {loss:.2f} | Acc: {acc*100:.2f}%')
Epoch   0 | Loss: 1.40 | Acc: 41.18%
Epoch 10 | Loss: 1.21 | Acc: 47.06%
Epoch 20 | Loss: 1.02 | Acc: 67.65%
Epoch 30 | Loss: 0.80 | Acc: 73.53%
Epoch 40 | Loss: 0.59 | Acc: 73.53%
Epoch 50 | Loss: 0.39 | Acc: 94.12%
Epoch 60 | Loss: 0.23 | Acc: 97.06%
Epoch 70 | Loss: 0.13 | Acc: 100.00%
Epoch 80 | Loss: 0.07 | Acc: 100.00%
Epoch 90 | Loss: 0.05 | Acc: 100.00%
Epoch 100 | Loss: 0.03 | Acc: 100.00%
Epoch 110 | Loss: 0.02 | Acc: 100.00%
Epoch 120 | Loss: 0.02 | Acc: 100.00%
Epoch 130 | Loss: 0.02 | Acc: 100.00%
Epoch 140 | Loss: 0.01 | Acc: 100.00%
Epoch 150 | Loss: 0.01 | Acc: 100.00%
Epoch 160 | Loss: 0.01 | Acc: 100.00%
Epoch 170 | Loss: 0.01 | Acc: 100.00%
Epoch 180 | Loss: 0.01 | Acc: 100.00%
Epoch 190 | Loss: 0.01 | Acc: 100.00%
Epoch 200 | Loss: 0.01 | Acc: 100.00%

Great! Without much surprise, we reach 100% accuracy on the training set (full dataset). It means that our model learned to correctly assign every member of the karate club to its correct group.

We can produce a neat visualization by animating the graph and see the evolution of the GNN’s predictions during the training process.

%%capture
from IPython.display import HTML
from matplotlib import animation
plt.rcParams["animation.bitrate"] = 3000
def animate(i):
G = to_networkx(data, to_undirected=True)
nx.draw_networkx(G,
pos=nx.spring_layout(G, seed=0),
with_labels=True,
node_size=800,
node_color=outputs[i],
cmap="hsv",
vmin=-2,
vmax=3,
width=0.8,
edge_color="grey",
font_size=14
)
plt.title(f'Epoch {i} | Loss: {losses[i]:.2f} | Acc: {accuracies[i]*100:.2f}%',
fontsize=18, pad=20)
fig = plt.figure(figsize=(12, 12))
plt.axis('off')
anim = animation.FuncAnimation(fig, animate,
np.arange(0, 200, 10), interval=500, repeat=True)
html = HTML(anim.to_html5_video())
display(html)

The first predictions are random, but the GCN perfectly labels every node after a while. Indeed, the final graph is the same as the one we plotted at the end of the first section. But what does the GCN really learn?

By aggregating features from neighboring nodes, the GNN learns a vector representation (or embedding) of every node in the network. In our model, the final layer just learns how to use these representations to produce the best classifications. However, embeddings are the real products of GNNs.

Let’s print the embeddings learned by our model.

# Print embeddings
print(f'Final embeddings = {h.shape}')
print(h)
Final embeddings = torch.Size([34, 3])
tensor([[1.9099e+00, 2.3584e+00, 7.4027e-01],
[2.6203e+00, 2.7997e+00, 0.0000e+00],
[2.2567e+00, 2.2962e+00, 6.4663e-01],
[2.0802e+00, 2.8785e+00, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 2.9694e+00],
[0.0000e+00, 0.0000e+00, 3.3817e+00],
[0.0000e+00, 1.5008e-04, 3.4246e+00],
[1.7593e+00, 2.4292e+00, 2.4551e-01],
[1.9757e+00, 6.1032e-01, 1.8986e+00],
[1.7770e+00, 1.9950e+00, 6.7018e-01],
[0.0000e+00, 1.1683e-04, 2.9738e+00],
[1.8988e+00, 2.0512e+00, 2.6225e-01],
[1.7081e+00, 2.3618e+00, 1.9609e-01],
[1.8303e+00, 2.1591e+00, 3.5906e-01],
[2.0755e+00, 2.7468e-01, 1.9804e+00],
[1.9676e+00, 3.7185e-01, 2.0011e+00],
[0.0000e+00, 0.0000e+00, 3.4787e+00],
[1.6945e+00, 2.0350e+00, 1.9789e-01],
[1.9808e+00, 3.2633e-01, 2.1349e+00],
[1.7846e+00, 1.9585e+00, 4.8021e-01],
[2.0420e+00, 2.7512e-01, 1.9810e+00],
[1.7665e+00, 2.1357e+00, 4.0325e-01],
[1.9870e+00, 3.3886e-01, 2.0421e+00],
[2.0614e+00, 5.1042e-01, 2.4872e+00],
...
[2.1778e+00, 4.4730e-01, 2.0077e+00],
[3.8906e-02, 2.3443e+00, 1.9195e+00],
[3.0748e+00, 0.0000e+00, 3.0789e+00],
[3.4316e+00, 1.9716e-01, 2.5231e+00]], grad_fn=<ReluBackward0>)

As you can see, embeddings do not need to have the same dimensions as feature vectors. Here, I chose to reduce the number of dimensions from 34 (dataset.num_features) to three to get a nice visualization in 3D.

Let’s plot these embeddings before any training happens, at epoch 0.

# Get first embedding at epoch = 0
embed = h.detach().cpu().numpy()
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(projection='3d')
ax.patch.set_alpha(0)
plt.tick_params(left=False,
bottom=False,
labelleft=False,
labelbottom=False)
ax.scatter(embed[:, 0], embed[:, 1], embed[:, 2],
s=200, c=data.y, cmap="hsv", vmin=-2, vmax=3)
plt.show()

We see every node from Zachary’s karate club with their true labels (and not the model’s predictions). For now, they’re all over the place since the GNN is not trained yet. But if we plot these embeddings at each step of the training loop, we’d be able to visualize what the GNN truly learns.

Let’s see how they evolve over time, as the GCN gets better and better at classifying nodes.

%%capture
def animate(i):
embed = embeddings[i].detach().cpu().numpy()
ax.clear()
ax.scatter(embed[:, 0], embed[:, 1], embed[:, 2],
s=200, c=data.y, cmap="hsv", vmin=-2, vmax=3)
plt.title(f'Epoch {i} | Loss: {losses[i]:.2f} | Acc: {accuracies[i]*100:.2f}%',
fontsize=18, pad=40)
fig = plt.figure(figsize=(12, 12))
plt.axis('off')
ax = fig.add_subplot(projection='3d')
plt.tick_params(left=False,
bottom=False,
labelleft=False,
labelbottom=False)
anim = animation.FuncAnimation(fig, animate,
np.arange(0, 200, 10), interval=800, repeat=True)
html = HTML(anim.to_html5_video())
display(html)

Our Graph Convolutional Network (GCN) has effectively learned embeddings that group similar nodes into distinct clusters. This enables the final linear layer to distinguish them into separate classes with ease.

Embeddings are not unique to GNNs: they can be found everywhere in deep learning. They don’t have to be 3D either: actually, they rarely are. For instance, language models like BERT produce embeddings with 768 or even 1024 dimensions.

Additional dimensions store more information about nodes, text, images, etc. but they also create bigger models that are more difficult to train. This is why keeping low-dimensional embeddings as long as possible is advantageous.

Graph Convolutional Networks are an incredibly versatile architecture that can be applied in many contexts. In this article, we familiarized ourselves with the PyTorch Geometric library and objects like Datasets and Data. Then, we successfully reconstructed a graph convolutional layer from the ground up. Next, we put theory into practice by implementing a GCN, which gave us an understanding of practical aspects and how individual components interact. Finally, we visualized the training process and obtained a clear perspective of what it involves for such a network.

Zachary’s karate club is a simplistic dataset, but it is good enough to understand the most important concepts in graph data and GNNs. Although we only talked about node classification in this article, there are other tasks GNNs can accomplish: link prediction (e.g., to recommend a friend), graph classification (e.g., to label molecules), graph generation (e.g., to create new molecules), and so on.

Beyond GCN, numerous GNN layers and architectures have been proposed by researchers. In the next article, we’ll introduce the Graph Attention Network (GAT) architecture, which dynamically computes the GCN’s normalization factor and the importance of each connection with an attention mechanism.

If you want to know more about graph neural networks, dive deeper into the world of GNNs with my book, Hands-On Graph Neural Networks.



Source link

Leave a Comment