Solved – Python implementation of indicator function in Softmax gradient

gradient descent, machine learning, python, softmax

I hope this is the right place for this question. I am following the Stanford Deep Learning tutorial http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/ and trying to implement gradient descent with softmax. For the indicator function in the equation below,

\begin{align}
\nabla_{\theta^{(k)}} J(\theta) = - \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = k\} - P(y^{(i)} = k \mid x^{(i)}; \theta) \right) \right] }
\end{align}

I am thinking of creating a numpy array that holds the indicator values for all the elements of the input X, which I can then plug into the gradient computation.

First of all, I'm not sure that creating an array to hold the indicators is the right way to go, but here is my implementation so far:

indicator = [[1 if X[i,j]==y[i] else 0 for j in range(X.shape[1])] for i in range(X.shape[0])]

where X is the input and y holds the labels.

This implementation is erroneous, in addition to being quite slow. I wonder if someone could point me in the right direction. Thanks!

Best Answer

The array of indicators for a single sample is just the one-hot representation of its label.

For instance, if there are 3 categories in total (labelled 1 to 3, as in the tutorial) and $x^{(i)}$ has the label $y^{(i)}=2$, then its one-hot representation is $[0,1,0]$.

In terms of code it should be something like

[[1 if y[i] == k else 0 for k in range(1, category_num + 1)] for i in range(sample_num)]
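
If you want to avoid the Python loops, here is a vectorized numpy sketch of the same idea. It assumes y is a 1-D integer array of labels 1..K (as in the UFLDL tutorial) and X has one sample per row; the names and shapes below (one_hot, softmax_gradient, theta with one column per class) are illustrative assumptions rather than the tutorial's own code, so orient them to match your setup.

import numpy as np

def one_hot(y, num_classes):
    # (m, K) matrix whose row i is the one-hot encoding of y[i];
    # assumes integer labels 1..K, hence the "y - 1" column shift.
    y = np.asarray(y)
    indicator = np.zeros((y.shape[0], num_classes))
    indicator[np.arange(y.shape[0]), y - 1] = 1.0
    return indicator

def softmax_gradient(theta, X, y, num_classes):
    # theta: (n, K) weights, X: (m, n) samples as rows, y: (m,) labels 1..K.
    scores = X @ theta                            # (m, K) class scores
    scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # (m, K) P(y = k | x; theta)
    indicator = one_hot(y, num_classes)           # (m, K) 1{y == k}
    return -X.T @ (indicator - probs)             # (n, K), matches the formula above

Each column k of the returned matrix is the gradient with respect to $\theta^{(k)}$; the one_hot call replaces the nested list comprehension and handles all samples at once without Python-level loops.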