Notice that since $y_{i} = \pm 1$ we have $1/y_{i} = y_{i}$, so you can rewrite,
$$
\alpha_i = \frac{1}{y_i} \left[ \frac{1}{y_i} - \frac{1}{N} \sum_{n=1}^N \frac{1}{y_n}\right] = y_i \left[y_i - \frac{1}{N} \sum_{n=1}^N y_n\right] = 1 - y_{i}\frac{N^{+}-N^{-}}{N}
$$
where $N^{+}$ and $N^{-}$ are the numbers of samples in the two classes. You can check that $\sum_{n}\alpha_{n}y_{n} = 0$. Moreover, whenever both classes are present, $\alpha_{n} > 0$ (explicitly, $\alpha_{n} = 2N^{-}/N$ for positive samples and $\alpha_{n} = 2N^{+}/N$ for negative ones), that is, all vectors are support vectors.
As for the margin, which is $1/||\omega||$,
$$
||\omega||^{2} = \sum_{n}\alpha_{n}^{2} = N\left[1-\left(\frac{N^{+}-N^{-}}{N}\right)^{2}\right]
$$
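As a sanity check, here is a minimal NumPy sketch (with arbitrary class counts of my own choosing) that verifies the constraint $\sum_{n}\alpha_{n}y_{n} = 0$, the positivity of the $\alpha_{n}$, and the closed form for $||\omega||^{2}$:

```python
import numpy as np

# Arbitrary class counts; y_n = +/-1 as above.
N_pos, N_neg = 7, 3
N = N_pos + N_neg
y = np.concatenate([np.ones(N_pos), -np.ones(N_neg)])

alpha = 1 - y * (N_pos - N_neg) / N               # closed form from the text

print(np.isclose(np.sum(alpha * y), 0.0))         # True: sum_n alpha_n y_n = 0
print(np.all(alpha > 0))                          # True: every point is an SV

# ||w||^2 in both forms from the text.
lhs = np.sum(alpha ** 2)
rhs = N * (1 - ((N_pos - N_neg) / N) ** 2)
print(np.isclose(lhs, rhs))                       # True
```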
Yes: whether feature scaling matters depends on the kernel, and in general it's a good idea. A kernel is effectively a similarity built on a distance, and if different features vary on different scales, that distance can be distorted. For the RBF kernel, for instance, we have
$$
K(x, x') = \exp\left(-\gamma ||x-x'||^2\right)
$$
so if one dimension takes much larger values than the others, it will dominate the kernel values and you'll lose the signal in the other dimensions. This applies to the linear kernel too.
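To see this concretely, here is an illustrative sketch (synthetic data, arbitrary $\gamma$ values of my own choosing): with one feature on a scale of thousands, the full RBF kernel is nearly indistinguishable from the kernel built on that feature alone, whereas after standardization both features matter.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler

# Feature 0 lives on a scale of thousands, feature 1 on a scale of ~1.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1000, 100), rng.normal(0, 1, 100)])

# On raw data the full kernel is almost identical to the kernel built
# from feature 0 alone: feature 1 is drowned out.
K_raw = rbf_kernel(X, gamma=1e-6)
print(np.abs(K_raw - rbf_kernel(X[:, :1], gamma=1e-6)).max())  # tiny

# After standardizing, dropping feature 1 changes the kernel a lot.
Z = StandardScaler().fit_transform(X)
K_scaled = rbf_kernel(Z, gamma=0.5)
print(np.abs(K_scaled - rbf_kernel(Z[:, :1], gamma=0.5)).max())  # O(1)
```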
But this doesn't apply to all kernels, since some have scaling built in. For example, you could use something like an ARD kernel or a Mahalanobis kernel with
$$
K(x, x') = \exp\left(-\gamma (x-x')^T\hat \Sigma^{-1}(x-x')\right)
$$
where $\hat \Sigma$ is the sample covariance matrix, or perhaps just the diagonal matrix of per-feature variances. As a function of $x$ and $x'$ this is still positive definite, so it's a valid kernel.
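For illustration, a minimal sketch of the diagonal variant, with $\hat \Sigma$ estimated from the per-feature variances of the data (the helper name and defaults are my own):

```python
import numpy as np

def mahalanobis_rbf(X, Y, gamma=1.0):
    # Hypothetical helper: use the diagonal of the sample covariance
    # (the per-feature variances of X) as Sigma-hat.
    var = X.var(axis=0)
    diff = X[:, None, :] - Y[None, :, :]          # pairwise differences
    sq = np.sum(diff ** 2 / var, axis=-1)         # (x-x')^T Sigma^{-1} (x-x')
    return np.exp(-gamma * sq)                    # shape (len(X), len(Y))
```

Note that with a diagonal $\hat \Sigma$ this is the same as standardizing the features first and then applying an ordinary RBF kernel.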
As a general strategy for deciding whether this is an issue for any particular kernel, just do what they did in the linked question: try it with data like $x=(1000,1,2,3)$, $x'=(500, 0.5, 3, 2)$ and see whether the first dimension dominates.
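Here is one way to wrap that strategy in a small reusable probe (the helper and the example kernels are mine, not from the linked question): compute the kernel on those two points, then zero out dimension 0 and see how much the value moves.

```python
import numpy as np

def dominance_probe(kernel):
    x  = np.array([1000.0, 1.0, 2.0, 3.0])
    xp = np.array([500.0, 0.5, 3.0, 2.0])
    full = kernel(x, xp)
    x[0] = xp[0] = 0.0                            # remove dimension 0
    rest = kernel(x, xp)
    print(f"with dim 0: {full:.4g}   without: {rest:.4g}")

dominance_probe(lambda a, b: a @ b)                               # linear
dominance_probe(lambda a, b: np.exp(-1e-6 * np.sum((a - b)**2)))  # RBF
```

For both kernels the value changes drastically once dimension 0 is removed, which is exactly the dominance we were probing for.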
Another way to assess a given kernel is to check whether it inherits scale issues from its subfunctions. For example, consider the polynomial kernel $K_{\text{poly}}(x,x') = (a+cx^Tx')^d$. We can write this as a function of the linear kernel $x^Tx'$, which we already know to be sensitive to scale, and the map $z \mapsto (a+cz)^d$ won't undo scale issues, so the polynomial kernel inherits them. A similar analysis works for the RBF kernel, by writing it as a function of the scale-sensitive $||x-x'||^2$.
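A quick numeric illustration of that inheritance, with hypothetical values for $a$, $c$, and $d$: a unit perturbation to a small-scale dimension barely moves the kernel value, while the same perturbation to the large-scale dimension moves it roughly a thousand times more, because it enters through the dominated linear kernel.

```python
import numpy as np

x  = np.array([1000.0, 1.0, 2.0, 3.0])
xp = np.array([500.0, 0.5, 3.0, 2.0])
a, c, d = 1.0, 1e-4, 3

lin = x @ xp                           # linear kernel, dominated by dim 0
print((a + c * lin) ** d)              # polynomial kernel via the linear one

xp2 = xp.copy(); xp2[1] += 1.0         # perturb a small-scale dimension
print((a + c * (x @ xp2)) ** d)        # barely changes

xp3 = xp.copy(); xp3[0] += 1.0         # same-size perturbation to dim 0
print((a + c * (x @ xp3)) ** d)        # change is ~1000x larger
```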
Best Answer
That depends both on the particular solver you are using and on the data you work with. It depends on the data because the number of support vectors you end up with depends on the problem at hand; as a consequence, adding or removing certain features might simplify your problem, which means less time to train and fewer support vectors.
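As a rough illustration of that data dependence (a toy experiment of my own, not taken from the paper below), you could compare fit time and support-vector count on the same labels with and without irrelevant features:

```python
import numpy as np
from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Same labels, with and without 45 irrelevant noise features.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
noise = np.random.default_rng(0).normal(size=(2000, 45))
X_noisy = np.hstack([X, noise])

for name, data in [("informative only", X), ("with 45 noise dims", X_noisy)]:
    clf = SVC(kernel="rbf", gamma="scale")
    t0 = perf_counter()
    clf.fit(data, y)
    print(f"{name}: {perf_counter() - t0:.2f}s to fit, "
          f"{len(clf.support_)} support vectors")
```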
In any case, this question is answered in detail in the paper Support Vector Machine Solvers, where the complexity of training an SVM is analyzed and different implementations are described and compared.
Hope that helps.
IN REPLY TO YOUR FIRST COMMENT: Do you use libsvm? If so, you could try alpha seeding; libsvm supports it. More details here, and see my reply to another question here for a quick intuitive description. Another approach might be to use stochastic gradient descent on the SVM objective.
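For the SGD route, a minimal sketch using scikit-learn's SGDClassifier, whose hinge loss corresponds to a linear SVM objective (the dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hinge loss makes SGDClassifier optimize a linear-SVM-style objective,
# one sample at a time, so training time grows roughly linearly with
# the number of samples instead of superlinearly.
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
clf = make_pipeline(StandardScaler(),
                    SGDClassifier(loss="hinge", alpha=1e-4, random_state=0))
clf.fit(X, y)
print(clf.score(X, y))
```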