Why does the accuracy suddenly spike down randomly into the mid 90 percents, then shoot back up?
It depends on your parameter update strategy, your parameter update strategy's parameters (e.g., learning rate), your network, and your data set.
For example, see the spike in the RMSprop curve even though the network had almost converged:
(Image credit: Alec Radford.)
As long as you have the same spike on your validation set, it shouldn't be an issue. If you wish to further investigate, you may want to look at the gradients, and whether the spikes happen on some particular samples.
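To make "look at the gradients" concrete, here is a hedged numpy sketch (a toy logistic regression, not the asker's actual model) that logs the gradient norm per mini-batch during SGD, so spikes can be traced back to the specific batches that produced them:

```python
import numpy as np

# Toy data: 256 samples, 10 features, linearly separable-ish labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 10))
y = (X @ rng.standard_normal(10) > 0).astype(float)

w = np.zeros(10)
lr, batch = 0.1, 32
grad_norms = []
for start in range(0, len(X), batch):
    Xb, yb = X[start:start + batch], y[start:start + batch]
    p = 1.0 / (1.0 + np.exp(-Xb @ w))   # sigmoid predictions
    g = Xb.T @ (p - yb) / len(Xb)       # logistic-loss gradient
    grad_norms.append(np.linalg.norm(g))
    w -= lr * g

# Batches whose gradient norm is far above the median are the ones to inspect.
suspects = [i for i, n in enumerate(grad_norms)
            if n > 3 * np.median(grad_norms)]
```

The same idea carries over to any framework: record a per-batch gradient norm alongside the loss, and check which samples are in the offending batches.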
How can I tell if I am "overfitting" my data set?
See https://datascience.stackexchange.com/a/627/843.
Since it looks like the accuracy on the test set is pretty much 1, I wouldn't worry about overfitting.
Is my breakdown of training vs. testing data optimal? (i.e. I chose to use 5% of my images to test and 95% of my images to train. Is there a better combination?)
The test set should reflect as best as possible the actual samples on which your system will be used in production.
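As a minimal sketch of a 95/5 split (the numbers and the use of a plain shuffled index split are assumptions, not a recommendation over the asker's setup), shuffling before splitting avoids ordering bias:

```python
import numpy as np

# Hypothetical dataset of 1000 samples, split 95% train / 5% test.
rng = np.random.default_rng(42)
n = 1000
idx = rng.permutation(n)          # shuffle so the split is not order-dependent
n_test = int(0.05 * n)
test_idx, train_idx = idx[:n_test], idx[n_test:]
```

Whatever the ratio, the key property is the one stated above: the test set should look like production data.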
My understanding is that numpy.linalg.lstsq relies on the LAPACK routine dgelsd.
The problem is to solve:
$$ \text{minimize} (\text{over} \; \mathbf{x}) \quad \| A\mathbf{x} - \mathbf{b} \|_2$$
Of course, this does not have a unique solution when the rank of $A$ is less than the number of unknowns (the length of $\mathbf{x}$). For such an underdetermined system, dgelsd provides a solution $\mathbf{z}$ such that:
- $A\mathbf{z} = \mathbf{b}$
- $\| \mathbf{z} \|_2 \leq \|\mathbf{x} \|_2$ for all $\mathbf{x}$ that satisfy $A\mathbf{x} = \mathbf{b}$ (i.e., $\mathbf{z}$ is the minimum-norm solution to the underdetermined system).
For example, if the system is $x + y = 1$, numpy.linalg.lstsq returns $x = 0.5, y = 0.5$.
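This is easy to verify directly; the single equation $x + y = 1$ becomes a $1 \times 2$ system:

```python
import numpy as np

# The underdetermined system x + y = 1 as a 1x2 matrix equation.
A = np.array([[1., 1.]])
b = np.array([1.])

x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x)  # [0.5 0.5] -- the minimum-norm solution
```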
How does dgelsd work?
The routine dgelsd computes the singular value decomposition (SVD) of $A$.
I'll just sketch the idea behind using an SVD to solve a linear system. The singular value decomposition is a factorization $U \Sigma V' = A$ where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix where the diagonal entries are known as singular values.
The effective rank of matrix $A$ will be the number of singular values that are effectively non-zero (i.e. sufficiently different from zero relative to machine precision etc...). Let $S$ be a diagonal matrix of the non-zero singular values. The SVD is thus:
$$ A = U \begin{bmatrix} S & 0 \\ 0 & 0 \end{bmatrix} V' $$
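The "effectively non-zero" test can be sketched in numpy on a hypothetical rank-deficient matrix (the tolerance formula below mirrors the machine-precision-based cutoff that dgelsd applies via its rcond parameter):

```python
import numpy as np

# Hypothetical rank-deficient matrix: the third row is the sum of the first two.
A = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [1., 1., 3.]])

U, s, Vt = np.linalg.svd(A)
# Singular values below a machine-precision-based tolerance count as zero.
tol = max(A.shape) * np.finfo(A.dtype).eps * s.max()
effective_rank = int(np.sum(s > tol))
print(effective_rank)  # 2: one singular value is numerically zero
```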
The pseudo-inverse of $A$ is given by:
$$ A^\dagger = V \begin{bmatrix} S^{-1} & 0 \\ 0 & 0 \end{bmatrix} U' $$
Consider the solution $\mathbf{x} = A^\dagger \mathbf{b}$. Then:
\begin{align*}
A\mathbf{x} - \mathbf{b} &= U \begin{bmatrix} S & 0 \\ 0 & 0 \end{bmatrix} V' V \begin{bmatrix} S^{-1} & 0 \\ 0 & 0 \end{bmatrix} U' \mathbf{b} - \mathbf{b} \\ &= U \begin{bmatrix} I & 0 \\ 0 & 0 \end{bmatrix} U' \mathbf{b} - \mathbf{b}\\
\end{align*}
There are basically two cases here:
- The number of non-zero singular values (i.e. the size of matrix $I$) is less than the length of $\mathbf{b}$. The solution here won't be exact; we'll solve the linear system in the least squares sense.
- The number of non-zero singular values equals the length of $\mathbf{b}$. Then the block matrix reduces to the full identity, $U I U' = U U' = I$, and $A\mathbf{x} - \mathbf{b} = \mathbf{0}$: the solution is exact.
This last part is a bit tricky... you need to keep track of the matrix dimensions and use the fact that $U$ is an orthogonal matrix.
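The claim that $\mathbf{x} = A^\dagger \mathbf{b}$ solves the least-squares problem can be checked numerically. This sketch (random matrices, an assumption for illustration) builds the pseudo-inverse explicitly from the SVD and compares against numpy.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # skinny matrix: overdetermined system
b = rng.standard_normal(5)

# Build the pseudo-inverse from the SVD, inverting only the singular values
# (all are non-zero here since a random Gaussian matrix has full rank).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

x = A_pinv @ b
# Same least-squares solution as lstsq:
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x, x_ref)
```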
Equivalence of pseudo-inverse
When $A$ has linearly independent rows (e.g., a fat matrix), then:
$$A^\dagger = A'\left(AA' \right)^{-1}$$
For an underdetermined system, you can show that the pseudo-inverse gives you the minimum-norm solution.
When $A$ has linearly independent columns (e.g., a skinny matrix), then:
$$A^\dagger = \left(A'A \right)^{-1}A'$$
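Both closed-form equivalences can be verified against numpy.linalg.pinv (which itself uses the SVD); the random test matrices are an assumption for illustration, and are full rank with probability 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fat matrix with linearly independent rows: A+ = A' (A A')^{-1}
A_fat = rng.standard_normal((3, 5))
assert np.allclose(np.linalg.pinv(A_fat),
                   A_fat.T @ np.linalg.inv(A_fat @ A_fat.T))

# Skinny matrix with linearly independent columns: A+ = (A' A)^{-1} A'
A_skinny = rng.standard_normal((5, 3))
assert np.allclose(np.linalg.pinv(A_skinny),
                   np.linalg.inv(A_skinny.T @ A_skinny) @ A_skinny.T)
```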
Best Answer
You're trying to compute the accuracy of a regression problem, i.e. the portion of examples on which the prediction is exactly (floating-point) equal to the target. Unsurprisingly, that essentially never happens. Pass
metrics=['mean_squared_error']
instead.
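A minimal, framework-free numpy illustration (the toy numbers are made up) of why exact-match accuracy is meaningless for continuous targets while mean squared error is informative:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.001, 1.998, 3.002])  # very close, never exactly equal

accuracy = np.mean(y_pred == y_true)      # fraction of exact float matches
mse = np.mean((y_pred - y_true) ** 2)

print(accuracy)  # 0.0 -- "accuracy" says the model is useless
print(mse)       # ~3e-06 -- MSE shows the fit is actually very good
```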