
GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs, UAI’18

by ChanwooKim 2023. 3. 26.

Paper link: http://www.auai.org/uai2018/proceedings/papers/139.pdf

1. Introduction

Treating each attention head equally loses the opportunity to benefit from some attention heads which are inherently more important than others. To this end, we propose the Gated Attention Networks (GaAN) for learning on graphs. GaAN uses a small convolutional subnetwork to compute a soft gate at each attention head to control its importance.

Unlike the traditional multi-head attention that admits all attended contents, the gated attention can modulate the amount of attended content via the introduced gates.

Moreover, since only a simple and light-weighted subnetwork is introduced in constructing the gates, the computational overhead is negligible and the model is easy to train.

Furthermore, since our proposed aggregator is very general, we extend it to construct a Graph Gated Recurrent Unit (GGRU), which is directly applicable for spatiotemporal forecasting problem.

In summary, our main contributions include:

(a) a new multi-head attention-based aggregator with additional gates on the attention heads;

(b) a unified framework for transforming graph aggregators to graph recurrent neural networks;

(c) the state-of-the-art prediction performance on three real-world datasets.

2. Notations

We denote a single fully-connected layer with a non-linear activation $\alpha(\cdot)$ as $\mathrm{FC}^\alpha_\theta(x) = \alpha(Wx + b)$, where $\theta = \{W, b\}$ are the parameters. Also, $\theta$ with different subscripts denotes different transformation parameters.

For activation functions, we denote $h(\cdot)$ to be the LeakyReLU activation (Xu et al., 2015a) with negative slope equal to 0.1 and $\sigma(\cdot)$ to be the sigmoid activation.

$\mathrm{FC}_\theta(x)$ means applying no activation function after the linear transform.

We denote ⊕ as the concatenation operation and $\Vert^K_{k=1} x_k$ as sequentially concatenating $x_1$ through $x_K$.

We denote the Hadamard product as '◦' and the dot product between two vectors as $\langle\cdot,\cdot\rangle$.
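For concreteness, the FC notation above maps directly onto a single linear layer plus activation. A minimal PyTorch sketch (the class and argument names are ours, not the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCLayer(nn.Module):
    """FC^alpha_theta(x) = alpha(W x + b); activation=None reproduces FC_theta."""
    def __init__(self, dim_in, dim_out, activation="leaky_relu"):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)   # holds theta = {W, b}
        self.activation = activation

    def forward(self, x):
        out = self.linear(x)
        if self.activation == "leaky_relu":        # h(.) in the paper, slope 0.1
            out = F.leaky_relu(out, negative_slope=0.1)
        elif self.activation == "sigmoid":         # sigma(.) in the paper
            out = torch.sigmoid(out)
        return out
```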

4. Gated Attention Networks

Generic formulation of graph aggregators

Given a node $i$ and its neighboring nodes $\mathcal{N}_i$, a graph aggregator is a function $\gamma$ of the form $y_i = \gamma_\Theta(x_i, \{z_{\mathcal{N}_i}\})$, where $x_i$ and $y_i$ are the input and output vectors of the center node $i$.

$z_{\mathcal{N}_i} = \{z_j \mid j \in \mathcal{N}_i\}$ is the set of the reference vectors of the neighboring nodes and $\Theta$ denotes the learnable parameters of the aggregator. In this paper, we do not consider aggregators that use edge features. However, it is straightforward to incorporate edges in our definition by defining $z_j$ to contain the edge feature vector $e_{i,j}$.
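In code, this generic formulation is just a function signature: an aggregator consumes the center node feature and the set of neighbor reference vectors and returns the output vector. A minimal interface sketch (shapes and names are ours); the concrete aggregators below fill in the body:

```python
import torch

def aggregate(x_i: torch.Tensor, z_neighbors: torch.Tensor) -> torch.Tensor:
    """Generic graph aggregator y_i = gamma_Theta(x_i, {z_j : j in N_i}).

    x_i:         (d_x,)        feature vector of the center node i
    z_neighbors: (|N_i|, d_z)  reference vectors of the neighboring nodes
    returns:     (d_o,)        output vector y_i
    """
    raise NotImplementedError  # see the concrete aggregator sketches below
```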

4.1 Multi-Head Attention Aggregator

We linearly project the center node feature $x_i$ to get the query vector and project the neighboring node features to get the key and value vectors.

Here, $K$ is the number of attention heads. $w^{(k)}_{i,j}$ is the $k$-th attention weight between the center node $i$ and the neighboring node $j$, which is generated by applying a softmax to the dot-product values.

$\theta^{(k)}_{xa}$, $\theta^{(k)}_{za}$ and $\theta^{(k)}_{v}$ are the parameters of the $k$-th head for computing the query, key and value vectors, which have dimensions $d_a$, $d_a$ and $d_v$, respectively.

The $K$ attention outputs are concatenated with the input vector and passed to an output fully-connected layer parameterized by $\theta_o$ to get the final output $y_i$, which has dimension $d_o$.

The difference between our aggregator and that in GAT (Veličković et al., 2018) is that we have adopted the key-value attention mechanism and the dot-product attention, while GAT does not compute additional value vectors and uses a fully-connected layer to compute $\phi^{(k)}_w$.
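A minimal single-node sketch of this multi-head dot-product attention aggregator, assuming per-head query/key dimension $d_a$ and value dimension $d_v$ as above (the module and argument names are ours, and details such as the activation on the value projection follow our reading of the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttnAggregator(nn.Module):
    """Dot-product multi-head attention aggregator, a sketch of Sec. 4.1."""
    def __init__(self, d_x, d_z, d_a, d_v, d_o, num_heads):
        super().__init__()
        self.K, self.d_a, self.d_v = num_heads, d_a, d_v
        self.query = nn.Linear(d_x, num_heads * d_a)         # theta_xa^{(k)}, all heads stacked
        self.key   = nn.Linear(d_z, num_heads * d_a)         # theta_za^{(k)}
        self.value = nn.Linear(d_z, num_heads * d_v)         # theta_v^{(k)}
        self.out   = nn.Linear(d_x + num_heads * d_v, d_o)   # theta_o

    def forward(self, x_i, z_neighbors):
        # x_i: (d_x,), z_neighbors: (n, d_z) with n = |N_i|
        n = z_neighbors.shape[0]
        q = self.query(x_i).view(self.K, self.d_a)               # (K, d_a)
        k = self.key(z_neighbors).view(n, self.K, self.d_a)      # (n, K, d_a)
        v = F.leaky_relu(self.value(z_neighbors), 0.1).view(n, self.K, self.d_v)
        # dot-product attention weights w_{i,j}^{(k)}, softmax over the neighbors
        scores = torch.einsum('ka,nka->nk', q, k)                # (n, K)
        w = torch.softmax(scores, dim=0)                         # (n, K)
        # per-head weighted sums of the value vectors
        heads = torch.einsum('nk,nkv->kv', w, v).reshape(-1)     # (K * d_v,)
        # concatenate with the center feature and project to d_o
        return self.out(torch.cat([x_i, heads], dim=-1))
```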

4.2 Gated Attention Aggregator

We compute an additional soft gate between 0 (low importance) and 1 (high importance) to assign different importance to each head.

In combination with the multi-head attention aggregator, we get the formulation of the gated attention aggregator:

  • The multi-head attention aggregator explores multiple representation subspaces between the center node and its neighborhood, but not all of these subspaces are equally important; hence an additional soft gate is used to assign a different importance to each head.

where $g^{(k)}_i$ is a scalar, the gate value of the $k$-th head at node $i$.

To make sure that adding gates will not introduce too many additional parameters, we use a convolutional network $\psi_g$ that takes the center node and neighboring node features to generate the gate values.

There are multiple possible designs of the $\psi_g$ network. In this paper, we combine average pooling and max pooling to construct the network. The detailed formula is given below:

Here, $\theta_m$ maps the neighbor features to a $d_m$-dimensional vector before taking the element-wise max, and $\theta_g$ maps the concatenated features to the final $K$ gates. By setting a small $d_m$, the subnetwork for computing the gates has negligible computational overhead.
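A sketch of the gate subnetwork under this description, assuming the element-wise max is taken over the $\theta_m$-projected neighbor features and the average over the raw neighbor features (a plausible reading of the text; names are ours):

```python
import torch
import torch.nn as nn

class GateNet(nn.Module):
    """Sketch of the gate subnetwork psi_g (Sec. 4.2): avg-pool + max-pool.

    Returns K values in (0, 1); the k-th gate scales the k-th attention head.
    d_m is kept small so the gate adds little overhead.
    """
    def __init__(self, d_x, d_z, d_m, num_heads):
        super().__init__()
        self.proj_max = nn.Linear(d_z, d_m)                      # theta_m
        self.proj_gate = nn.Linear(d_x + d_m + d_z, num_heads)   # theta_g

    def forward(self, x_i, z_neighbors):
        # x_i: (d_x,), z_neighbors: (n, d_z)
        max_pool = self.proj_max(z_neighbors).max(dim=0).values  # element-wise max, (d_m,)
        avg_pool = z_neighbors.mean(dim=0)                       # average pooling, (d_z,)
        gates = torch.sigmoid(self.proj_gate(
            torch.cat([x_i, max_pool, avg_pool], dim=-1)))       # (K,), g_i^{(k)} in (0, 1)
        return gates
```

Each per-head weighted sum in the attention aggregator above would then be multiplied by its gate $g^{(k)}_i$ before the final concatenation and output projection.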

- The dashed oval around the neighbors means that the interaction among neighbors is utilized when determining the weights.

4.3 Other Graph Aggregators

Graph pooling aggregators

The main characteristic of graph pooling aggregators is that they do not consider the correlation between neighboring nodes and the center node.

Instead, neighboring nodes' features are directly aggregated, and the center node's feature is simply concatenated or added to the aggregated vector and then passed through an output function $\phi_o$:
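A minimal sketch of such a pooling aggregator (the projection before pooling and the choice of max/mean pooling are illustrative; names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingAggregator(nn.Module):
    """Sketch of a graph pooling aggregator (Sec. 4.3): pool the neighbors,
    ignore center-neighbor correlations, concatenate the center feature, project."""
    def __init__(self, d_x, d_z, d_h, d_o, pool="max"):
        super().__init__()
        self.proj_neighbor = nn.Linear(d_z, d_h)
        self.proj_out = nn.Linear(d_x + d_h, d_o)   # phi_o
        self.pool = pool

    def forward(self, x_i, z_neighbors):
        h = F.leaky_relu(self.proj_neighbor(z_neighbors), 0.1)   # (n, d_h)
        pooled = h.max(dim=0).values if self.pool == "max" else h.mean(dim=0)
        return self.proj_out(torch.cat([x_i, pooled], dim=-1))
```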

Graph pairwise sum aggregators

Like attention-based aggregators, graph pairwise sum aggregators also aggregate the neighborhood features by taking K weighted sums.

The difference is that the weight between node $i$ and its neighbor $j$ is not related to the other neighbors in $\mathcal{N}_i$. The formula of the graph pairwise sum aggregator is given as follows:

Here, $w^{(k)}_{i,j}$ is only related to the pair $x_i$ and $z_j$, while in attention-based models $w^{(k)}_{i,j}$ is related to the features of all neighbors $z_{\mathcal{N}_i}$.
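As a rough illustration, one way to realize such a pairwise weight is to replace the softmax over $\mathcal{N}_i$ by a per-pair sigmoid, so that $w^{(k)}_{i,j}$ depends only on $x_i$ and $z_j$ (an illustrative choice, not necessarily the paper's exact formula):

```python
import torch

def pairwise_sum_head(q_i, k_neighbors, v_neighbors):
    """One head of a pairwise-sum aggregator: the weight for neighbor j depends
    only on the pair (i, j), so a per-pair sigmoid replaces the softmax that an
    attention head would take over all of N_i.

    q_i: (d_a,), k_neighbors: (n, d_a), v_neighbors: (n, d_v)
    """
    scores = k_neighbors @ q_i     # (n,) pairwise scores <q_i, k_j>
    w = torch.sigmoid(scores)      # no normalization across neighbors
    return w @ v_neighbors         # (d_v,) weighted sum of the values
```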

Baseline aggregators

To fairly evaluate the effectiveness of GaAN against previous work, we choose two representative aggregators in each category as baselines:

5. Inductive Node Classification

5.1 Model

During training, the validation and testing nodes are not observable and the goal is to predict the labels of the unseen testing nodes.

With a stack of $M$ layers of graph aggregators, we first sample a mini-batch of nodes $B_0$ and then recursively expand $B_l$ to $B_{l+1}$ by sampling the neighboring nodes of $B_l$.

The node representations, which are initialized to the node features, are aggregated in reverse order from $B_M$ to $B_0$.

The representations of the last layer, i.e., the final representations of the nodes in $B_0$, are projected to get the output.

We only sample a subset of the neighborhoods for each node.

In our implementation, at the $l$-th sampling step, we sample $\min(|\mathcal{N}_i|, S_l)$ neighbors without replacement for node $i$, where $S_l$ is a hyperparameter that controls the maximum number of sampled neighbors at the $l$-th step.

Moreover, to improve over GraphSAGE and further reduce memory cost, we merge repeated nodes that are sampled from different seeds’ neighborhoods within each mini-batch.
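A minimal sketch of this sampling and batch-expansion scheme (pure-Python helpers; `adjacency` maps each node id to its neighbor list, and the merging of repeated nodes is done with a simple set union):

```python
import random

def sample_neighbors(neighbor_ids, max_sample):
    """Sample min(|N_i|, S_l) neighbors of a node without replacement."""
    if len(neighbor_ids) <= max_sample:
        return list(neighbor_ids)
    return random.sample(neighbor_ids, max_sample)

def expand_batch(batch_nodes, adjacency, max_sample):
    """Expand B_l into B_{l+1}: the batch plus its sampled neighbors.
    Nodes sampled from several seeds' neighborhoods are merged (stored once)."""
    expanded = set(batch_nodes)
    for i in batch_nodes:
        expanded.update(sample_neighbors(adjacency[i], max_sample))
    return sorted(expanded)
```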

5.2 Experimental Setup

5.5 Ablation Analysis

Effect of the number of attention heads and the sample size

$K$: number of attention heads; $S_1$ and $S_2$: maximum number of sampled neighbors in the first and second sampling steps; "all": sample all neighbors.

Effect of output dimensions in the PPI dataset

We find that the performance becomes better for larger output dimensions and the proposed GaAN consistently outperforms the other models.

Visualization of gate values

It illustrates the diversity of the learned gate combinations for different nodes.

6. Traffic Speed Forecasting

We formulate traffic speed forecasting as a spatiotemporal sequence forecasting problem where the input and the target are sequences defined on a fixed spatiotemporal graph, e.g., the road network.

To simplify notation, we denote $\mathsf{Y} = \Gamma(\mathsf{X}, \mathsf{Z}, \mathcal{G})$ as applying the $\gamma$ aggregator for all nodes in $\mathcal{G}$, i.e., $\mathsf{y}_i = \gamma_\Theta(\mathsf{x}_i, \mathsf{z}_{\mathcal{N}_i})$.

Based on a given graph aggregator $\Gamma$, we can construct a GRU-like RNN structure, the Graph GRU (GGRU), whose gates use graph aggregators in place of the usual fully-connected layers (the detailed equations are given in the paper).

Predict the future $K$ steps of traffic speeds in the sensor network, $\hat{X}_{J+1}, \hat{X}_{J+2}, \ldots, \hat{X}_{J+K}$, based on the previous $J$ steps of observed traffic speeds $X_1, X_2, \ldots, X_J$.
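As a rough sketch of the GGRU idea, a GRU-like cell in which every dense transform is replaced by a graph aggregator over the road graph might look as follows (the exact gating equations in the paper may differ in detail; `make_aggregator` and all names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphGRUCell(nn.Module):
    """GRU-like cell whose dense transforms are replaced by graph aggregators.

    `make_aggregator()` is assumed to build an aggregator module mapping
    (num_nodes, d_in) node states to (num_nodes, d_h) node states over the graph.
    """
    def __init__(self, make_aggregator):
        super().__init__()
        self.gamma_u = make_aggregator()   # update-gate aggregator
        self.gamma_r = make_aggregator()   # reset-gate aggregator
        self.gamma_h = make_aggregator()   # candidate-state aggregator

    def forward(self, x_t, h_prev, graph):
        # x_t, h_prev: (num_nodes, d) node features / hidden states at step t
        xh = torch.cat([x_t, h_prev], dim=-1)
        u = torch.sigmoid(self.gamma_u(xh, graph))     # update gate
        r = torch.sigmoid(self.gamma_r(xh, graph))     # reset gate
        cand = F.leaky_relu(
            self.gamma_h(torch.cat([x_t, r * h_prev], dim=-1), graph), 0.1)
        return u * h_prev + (1.0 - u) * cand           # new hidden state H_t
```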

7. Conclusion and Future Work

We introduced the GaAN model and applied it to two challenging tasks: (1) inductive node classification and (2) traffic speed forecasting. GaAN beats previous state-of-the-art algorithms in both cases. In the future, we plan to extend GaAN by integrating edge features and processing massive graphs with millions or even billions of nodes.

Moreover, our model is not restricted to graph learning. A particularly exciting direction for future work is to apply GaAN to natural language processing tasks like machine translation.

  • This summary was written when I was an undergraduate, so it may contain errors or insufficient explanations.
