I'm new to geometry, and while reading a research paper about geometric deep learning I came across the term "pseudo-coordinates". I searched for its meaning, but found few references. Can someone please explain what it is and how it relates to manifolds? Thank you in advance.
“pseudo-coordinates”
coordinate systems, geometry, machine learning, manifolds
Related Solutions
Any differentiable manifold is locally homeomorphic to Euclidean space. In other words, if we select a point on the manifold, then over very small distances the manifold can be approximated by Euclidean space. It is then possible to parameterise the manifold with local polar coordinates $(\rho,\theta)$ which behave like polar coordinates in an infinitesimal region around the selected point.
The models GCNN, ACNN and MoNet each use a differentiable manifold parameterised by local polar coordinates. Each has a weighting function, called the patch operator weighting function $w_i(\rho,\theta)$. Table $1$ in the paper gives $w_i(\rho,\theta)$ for ACNN and GCNN.
The red curves are $0.5$ level sets. That is to say, $w_i(\rho,\theta)=0.5$ along the red curves.
Edit: The OP asked about the definition of MoNet
In section 4, the paper mentions using a weighting function of the form $w_j({\bf u})=\exp\left(-\frac{1}{2}({\bf u}-\boldsymbol{\mu}_j)^T\boldsymbol{\Sigma}_j^{-1}({\bf u}-\boldsymbol{\mu}_j)\right)$ with the covariance matrix $\boldsymbol{\Sigma}_j$ and the mean vector $\boldsymbol{\mu}_j$ learnable (formula 11 in the paper). $\boldsymbol{\Sigma}_j$ is restricted to being a diagonal matrix.
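A minimal NumPy sketch of evaluating such a Gaussian weighting function may help. It assumes that $\Sigma_j$ is a diagonal covariance matrix whose inverse appears in the quadratic form, as is usual for a Gaussian kernel; the function name `gaussian_weight` and the example values are illustrative and not taken from the paper.

```python
import numpy as np

def gaussian_weight(u, mu, sigma_diag):
    """Evaluate exp(-0.5 * (u - mu)^T Sigma^{-1} (u - mu)) for a diagonal Sigma.

    u: array of shape (..., d) holding pseudo-coordinates,
    mu: array of shape (d,) holding the kernel mean,
    sigma_diag: array of shape (d,) holding the diagonal entries of Sigma.
    """
    diff = np.asarray(u, dtype=float) - np.asarray(mu, dtype=float)
    # With a diagonal Sigma the quadratic form is a sum of scaled squared differences.
    quad = np.sum(diff ** 2 / np.asarray(sigma_diag, dtype=float), axis=-1)
    return np.exp(-0.5 * quad)

# Illustrative values (assumptions, not learned parameters).
mu = np.array([1.0, 0.0])
sigma_diag = np.array([0.5, 2.0])

w_centre = gaussian_weight(mu, mu, sigma_diag)            # the weight peaks at the mean
w_offset = gaussian_weight([[1.0, 2.0]], mu, sigma_diag)  # exp(-0.5 * 2^2 / 2) = exp(-1)
print(w_centre, w_offset)
```

In a trained network `mu` and `sigma_diag` would be learned by gradient descent rather than fixed by hand.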
The paper then describes the neural network used to learn $\bf{\Sigma}_j$ and $\bf{\mu}_j$ and the procedure used to train it. The Adam method is explained by the following paper: https://arxiv.org/abs/1412.6980
LeNet used 2×2 max-pooling; in ChebNet and MoNet we used three convolutional layers, interleaved with pooling layers based on the Graclus method [16] to coarsen the graph by a factor of four.
For MoNet, we used polar coordinates u = (ρ,θ) of pixels (respectively, of superpixel barycenters) to produce the patch operator; as the weighting functions of the patch operator, 25 Gaussian kernels (initialized with random means and variances) were used. Training was done with 350K iterations of Adam method [25], initial learning rate $10^{-4}$, regularization factor $10^{-4}$, dropout probability 0.5, and batch size of 10.
That is the definition of the Dirichlet (semi)norm, taken as an analogue of the corresponding concept from complex/harmonic analysis. Essentially, if you recall that the graph Laplacian is the analogue of the Laplacian in $\mathbb{R}^n$, then it is reasonable to take the graph analogue of $$\require{cancel} \lVert f\rVert_{\mathcal{D}}^2= \int \lvert f'\rvert^2\,\mathrm{d}A=\int f\Delta f\,\mathrm{d}A+\cancel{\text{boundary term}} $$ as a possible measure of how much $f$ deviates from being constant, which gives the quantity $\operatorname{trace}(X^T\Delta X)$.
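As a small numerical illustration, the sketch below computes $\operatorname{trace}(X^T\Delta X)$ on a toy path graph. It assumes the combinatorial graph Laplacian $D-A$ (the text above does not say which normalisation is meant), and the graph and signals are made up for the example.

```python
import numpy as np

# Toy undirected path graph 0 - 1 - 2 - 3 (an assumption for illustration).
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A  # combinatorial graph Laplacian D - A

def dirichlet_energy(X):
    """trace(X^T L X) for node features X of shape (n, channels)."""
    return float(np.trace(X.T @ L @ X))

X_const = np.ones((n, 2))                         # constant signal, two channels
X_ramp = np.array([[0.0], [1.0], [2.0], [3.0]])   # signal that changes along the path

# For this Laplacian the energy equals the sum of squared differences over edges.
edge_sum = sum(float(np.sum((X_ramp[i] - X_ramp[j]) ** 2)) for i, j in edges)

print(dirichlet_energy(X_const))            # 0.0: a constant signal has no variation
print(dirichlet_energy(X_ramp), edge_sum)   # 3.0 3.0
```

The constant signal has zero energy, which is why this quantity is only a seminorm.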
Best Answer
Pseudo-coordinates in geometric learning architectures serve two purposes:
They provide local pairwise features among neighbours, i.e. they associate some latent vector to the edges of the graph, rather than just the nodes. They are thus like an adjacency matrix but describing something richer than just connectivity.
They act like a local coordinate system describing a local "patch" on the manifold surface or graph. This tells the network something about directionality on the graph.
Essentially, they give the network easy access to the local geometry or structure of the patches, rather than forcing it to figure it out from e.g. just binary connectivity.
Mathematically, consider a graph-like construct $\mathcal{M}=(\mathcal{V},\mathcal{E},\mathcal{U})$, where $\mathcal{V}$ is the set of nodes (with features $f(v)\in\mathbb{R}^n\;\forall\;v\in\mathcal{V}$), $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the set of directed edges, and $\mathcal{U}$ is the pseudo-coordinate function. Let $\mathcal{N}(v)=\{\xi\in\mathcal{V}\mid (\xi,v)\in\mathcal{E}\}$ be the set of neighbours of a node $v$. We can think of $\mathcal{U}$ in two equivalent ways: (1) as a function $u(\xi,v)\in\mathbb{R}^d$, defined for every node $v\in\mathcal{V}$ and every neighbour $\xi\in\mathcal{N}(v)$, that maps a node and any of its neighbours to a vector, and (2) as a set associating a vector to every directed edge, $\mathcal{U}=\{ u(e)\in\mathbb{R}^d \mid e\in\mathcal{E}\}$.
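To make the two views concrete, here is a small Python sketch. It assumes that each node carries a 2D position and that the pseudo-coordinate of an edge is the polar offset of the neighbour relative to the node; this is only one possible choice, and the positions, edges and helper names are hypothetical.

```python
import math

# Hypothetical toy data: 2D positions attached to three nodes.
pos = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 2.0)}
# Directed edges written as (neighbour, node).
edges = [(1, 0), (2, 0), (0, 1)]

def polar_offset(src, dst):
    """Polar coordinates (rho, theta) of node src as seen from node dst."""
    dx = pos[src][0] - pos[dst][0]
    dy = pos[src][1] - pos[dst][1]
    return (math.hypot(dx, dy), math.atan2(dy, dx))

# View (2): a collection holding one vector per directed edge.
U = {e: polar_offset(*e) for e in edges}

# View (1): a function of a node and one of its neighbours.
def neighbours(v):
    return [s for (s, t) in edges if t == v]

def u(xi, v):
    return U[(xi, v)]

for v in pos:
    for xi in neighbours(v):
        print(v, xi, u(xi, v))
```

Both views hold the same information; the dictionary form is closer to how edge attributes are usually stored alongside an edge list.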
If you prefer to think of a smooth Riemannian manifold $M=(\mathcal{X},g)$, then one example is to consider a local chart $C(p)$ around some $p\in M$ with local coordinates $\alpha_p,\beta_p$ (in the 2D case). One simple choice of pseudo-coordinates would be just $u(p,q)=(\alpha_p(q),\beta_p(q))$. This is the basis of the Geodesic CNN, referenced in the paper you linked. But pseudo-coordinates can be more general than this (e.g. a transform of the local coordinates). See SplineCNN, for instance, or the graph example in the paper you linked.
How they are used depends on the paper. For example, in most graph (convolutional) neural networks, one wants to compute a weighted average of the features of a point and those of its connected neighbours. But how should the weights be computed? If all you know is that the nodes are connected, there is little for the weights to depend on. With pseudo-coordinates available, the weights in the average can depend on them, for instance: $$ F(v)_j = \sum_{\xi\in\mathcal{N}(v)} W(u(\xi,v)|\Theta_j) f_j(\xi) $$ where we are computing the $j$th output (indexing over the channels of the weighting kernel and those of the input feature map) of node $v\in\mathcal{V}$, which depends on the learned parameters $\Theta_j$ of the weight function $W:\mathbb{R}^d\rightarrow \mathbb{R}$. This pseudo-coordinate-dependent weighted sum is called a patch operator, since it extracts a representation $F(v)$ of a patch about a point $v$.
The analogue in classical convolutional neural networks is simply the Euclidean image patch around a given point, which is convolved with a kernel to give the new feature map at that point. Thus, given the (pseudo-)patch $F(v)$, the natural thing to do is to "convolve" it with a learned graph signal $g$ (analogous to the learned kernel weights of a classical CNN's filters): $$ (f\ast g_\ell)(v) = \sum_j g_{\ell j} F(v)_j $$ so that the output features are $ f_\text{out}(v)=((f\ast g_1)(v),\ldots,(f\ast g_K)(v)) $. Again, though, it depends on the paper.
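The two formulas above can be sketched in NumPy as follows. This is a toy illustration under stated assumptions, not the implementation from any of the papers: the weight function $W$ is taken to be a Gaussian with diagonal covariance (one possible choice), and the graph, pseudo-coordinates, features and "learned" parameters are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy graph: directed edges (neighbour xi, node v), each with a 2D pseudo-coordinate.
edges = [(1, 0), (2, 0), (0, 1), (2, 1), (0, 2)]
pseudo = {e: rng.normal(size=2) for e in edges}

num_nodes, J, K = 3, 4, 5            # J patch channels, K output features
f = rng.normal(size=(num_nodes, J))  # f[xi, j] plays the role of f_j(xi)

# Stand-ins for learned parameters: Theta_j = (mu[j], var[j]) and the signal g.
mu = rng.normal(size=(J, 2))
var = np.ones((J, 2))
g = rng.normal(size=(K, J))

def W(u, j):
    """Gaussian weight for pseudo-coordinate u under the j-th kernel."""
    return np.exp(-0.5 * np.sum((u - mu[j]) ** 2 / var[j]))

def patch_operator(v):
    """F(v)_j = sum over neighbours xi of v of W(u(xi, v) | Theta_j) * f_j(xi)."""
    F = np.zeros(J)
    for (xi, target) in edges:
        if target == v:
            for j in range(J):
                F[j] += W(pseudo[(xi, v)], j) * f[xi, j]
    return F

def conv(v):
    """(f * g_l)(v) = sum_j g[l, j] * F(v)_j, stacked over l = 1..K."""
    return g @ patch_operator(v)

f_out = np.stack([conv(v) for v in range(num_nodes)])
print(f_out.shape)  # (3, 5): one K-dimensional output per node
```

The explicit loops are kept for readability; a practical implementation would vectorise over edges and train `mu`, `var` and `g` by gradient descent.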
Basically, relating this back to classical CNNs, in the case of Euclidean images, we extract little windows as patches $P$, treating each element of this window equally. The learned kernel $\kappa$ convolved with it easily associates each weight to an input value: $P_{ij}$ gets multiplied by $\kappa_{ij}$, before performing the summation part of the convolution. But on manifolds or graphs, this association is no longer so obvious. For instance, imagine rotating an image: the CNN weights would then not properly apply to the input, because the positions of the template filter would have gone astray. On manifolds, we instead create pseudo-coordinates, which help the network learn a solution to this directional ambiguity problem, though they do not solve it in general.
References
Monti et al., Geometric deep learning on graphs and manifolds using mixture model CNNs
Fey et al., SplineCNN: Fast Geometric Deep Learning with Continuous B-Spline Kernels