I was reading some research papers on reinforcement learning theory and constantly encountered a term called the suboptimality gap. When I searched the internet, I couldn't find any clear definition of it. Does anyone here know what this term means?
Suboptimality gap in reinforcement learning
machine-learning, optimization
Related Solutions
The book Learning from Data by Yaser S. Abu-Mostafa et al. gives a nice introductory path to the VC dimension and then, in chapter 4, tackles regularization. One section on page 137 discusses the connection between the two concepts. Basically, it says that with regularization (the augmented error) the VC dimension does not change, so it proposes using the effective number of parameters as a good surrogate for the VC dimension.
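As a concrete illustration (my sketch, not necessarily the book's exact definition): for a linear model with weight-decay (ridge-style) regularization, one standard way to quantify the effective number of parameters is the trace of the regularized hat matrix, which equals the raw parameter count at $\lambda=0$ and shrinks toward zero as $\lambda$ grows.

```python
import numpy as np

def effective_num_parameters(X, lam):
    """One common notion of effective parameters under weight decay:
    d_eff(lambda) = trace( X (X^T X + lambda I)^{-1} X^T ).
    With lam = 0 (and full-rank X) this is just the number of columns of X;
    larger lam gives a smaller effective model size."""
    d = X.shape[1]
    hat = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    return np.trace(hat)
```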
Pseudo-coordinates in geometric learning architectures serve two purposes:
They provide local pairwise features among neighbours, i.e. they associate some latent vector to the edges of the graph, rather than just the nodes. They are thus like an adjacency matrix but describing something richer than just connectivity.
They act like a local coordinate system describing a local "patch" on the manifold surface or graph. This tells the network something about directionality on the graph.
Essentially, they give the network easy access to the local geometry or structure of the patches, rather than forcing it to figure it out from e.g. just binary connectivity.
Mathematically, consider a graph-like construct $\mathcal{M}=(\mathcal{V},\mathcal{E},\mathcal{U})$, where $\mathcal{V}$ is the set of nodes (with features $f(v)\in\mathbb{R}^n\;\forall\;v\in\mathcal{V}$), $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the set of directed edges, and $\mathcal{U}$ is the pseudo-coordinate function. Let $\mathcal{N}(v)=\{u\in\mathcal{V}\mid (u,v)\in\mathcal{E}\}$ be the set of neighbours of a node $v$. We can think of $\mathcal{U}$ in two equivalent ways: (1) as a function $u(x,y) : \mathcal{V}\times \mathcal{N}(x)\rightarrow\mathbb{R}^d$ that maps a vertex and any of its neighbours to a vector and (2) as a set associating a vector to every directed edge $\mathcal{U}=\{ u(e)\in\mathbb{R}^d \mid e\in\mathcal{E}\}$.
If you prefer to think of a smooth Riemannian manifold $M=(\mathcal{X},g)$, then one example is to consider a local chart $C(p)$ around some $p\in M$ with local coordinates $\alpha_p,\beta_p$ (in the 2D case). One simple choice of pseudo-coordinates would then be $u(p,q)=(\alpha_p(q),\beta_p(q))$. This is the basis of the Geodesic CNN, referenced in the paper you linked. But they can be more general than this (e.g. a transform thereof). (See SplineCNN, for instance, or the graph example in the paper you linked.)
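For concreteness, here is a small sketch (my own, not from any particular paper) of one common choice of pseudo-coordinates when the nodes have 2D positions: the polar coordinates of each neighbour relative to the node. The `pos` mapping is an assumption; degree-based or other coordinates are equally valid choices.

```python
import numpy as np

def polar_pseudo_coords(pos, edges):
    """For each directed edge (x, y), return u(x, y) = (rho, theta):
    the polar coordinates of neighbour x relative to node y.
    `pos` maps each node to a 2D position (an assumed input)."""
    coords = {}
    for (x, y) in edges:
        d = np.asarray(pos[x]) - np.asarray(pos[y])
        coords[(x, y)] = (np.linalg.norm(d), np.arctan2(d[1], d[0]))
    return coords
```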
How they are used depends on the paper. For example, in most graph (convolutional) neural networks, one wants to compute a weighted average of the features of a point and those of its connected neighbours. But how should the weights be computed? If all you know is that the nodes are connected, there is little information from which to compute them. With pseudo-coordinates, however, the weights in the average can depend on the local geometry, for instance: $$ F(v)_j = \sum_{\xi\in\mathcal{N}(v)} W(u(\xi,v)\mid\Theta_j)\, f_j(\xi) $$ where we are computing the $j$th output (indexing over the channels of the weighting kernel and those of the input feature map) of node $v\in\mathcal{V}$, dependent on the learned parameters $\Theta_j$ of the weight function $W:\mathbb{R}^d\rightarrow \mathbb{R}$. This pseudo-coordinate-dependent weighted sum is called a patch operator, since it extracts a representation $F(v)$ of a patch about a point $v$. The analogue in classical convolutional neural networks is simply the Euclidean image patch around a given point, which is convolved with a kernel to produce the new feature map at that point. Thus, given the (pseudo-)patch $F(v)$, the natural thing to do is "convolve" it with a learned graph signal $g$ (analogous to the learned kernel weights of a classical CNN's filters): $$ (f\ast g_\ell)(v) = \sum_j g_{\ell j} F(v)_j $$ so that the output features are $ f_\text{out}(v)=((f\ast g_1)(v),\ldots,(f\ast g_K)(v)) $. Again, though, the details depend on the paper. A minimal sketch of such a patch operator is given below.
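Here is a minimal NumPy sketch of the patch operator and the "convolution" above, assuming (as in MoNet-style models) that each $W(\cdot\mid\Theta_j)$ is an isotropic Gaussian kernel on the pseudo-coordinates. The feature function `f`, pseudo-coordinate function `u`, and the parameters are placeholders I introduce for illustration, not anything from a specific paper.

```python
import numpy as np

def gaussian_weight(u, mu, sigma):
    """W(u | Theta_j) with Theta_j = (mu, sigma): an isotropic Gaussian
    kernel evaluated on the pseudo-coordinate vector u."""
    diff = u - mu
    return np.exp(-0.5 * np.dot(diff, diff) / sigma**2)

def patch_operator(v, neighbours, f, u, thetas):
    """F(v)_j = sum over xi in N(v) of W(u(xi, v) | Theta_j) * f_j(xi).
    `f` maps a node to its feature vector (one entry per channel j, so here
    the number of kernels equals the number of input channels), and
    `u` maps a (neighbour, node) pair to its pseudo-coordinate vector."""
    J = len(thetas)
    F = np.zeros(J)
    for xi in neighbours:
        coords = u(xi, v)
        for j, (mu, sigma) in enumerate(thetas):
            F[j] += gaussian_weight(coords, mu, sigma) * f(xi)[j]
    return F

def conv_at_node(v, neighbours, f, u, thetas, g):
    """(f * g_l)(v) = sum_j g_{l j} F(v)_j for every output channel l,
    with g a learned matrix of shape (K, J)."""
    F = patch_operator(v, neighbours, f, u, thetas)
    return g @ F
```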
Basically, relating this back to classical CNNs: in the case of Euclidean images, we extract little windows as patches $P$, treating each element of the window equally. The learned kernel $\kappa$ convolved with it straightforwardly associates each weight with an input value: $P_{ij}$ is multiplied by $\kappa_{ij}$ before the summation part of the convolution. But on manifolds or graphs, this association is no longer obvious. For instance, imagine rotating an image: the CNN weights would then not apply properly to the input, because the positions of the template filter would no longer line up. On manifolds, we instead create pseudo-coordinates, which help the network learn a solution to this directional-ambiguity problem, though they do not solve it in general.
Best Answer
In Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs, the suboptimality gap associated with action $a$ at state $x$ is defined as
$$\operatorname{gap}_\infty(x,a)=V^{\pi^*}(x)-Q^{\pi^*}(x,a).$$
That is, it is the difference between the value of the optimal policy at state $x$ and the value of taking action $a$ at $x$ and then following the optimal policy afterwards; it is zero when $a$ is an optimal action and strictly positive otherwise.
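As a quick illustration (my sketch, not from the paper): given a table of optimal action values $Q^{\pi^*}$, the gaps follow directly, since $V^{\pi^*}(x)=\max_a Q^{\pi^*}(x,a)$.

```python
import numpy as np

def suboptimality_gaps(Q_star):
    """Q_star: array of shape (num_states, num_actions) holding Q^{pi*}(x, a).
    Returns gap(x, a) = V*(x) - Q*(x, a), which is >= 0 everywhere and
    exactly 0 for the optimal action(s) in each state."""
    V_star = Q_star.max(axis=1, keepdims=True)   # V*(x) = max_a Q*(x, a)
    return V_star - Q_star
```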
A similar quantity is used in the bandit literature as well, where the gap of an arm is the difference between the best arm's mean reward and that arm's mean reward.
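For comparison, a sketch of the bandit analogue, assuming the arm means are known: $\Delta_a=\mu^*-\mu_a$.

```python
import numpy as np

def bandit_gaps(means):
    """means: mean reward of each arm. Returns Delta_a = mu* - mu_a."""
    means = np.asarray(means)
    return means.max() - means
```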