Applied Predictive Modeling, by Max Kuhn (who wrote the caret package) and Kjell Johnson, is very practical, with tons of code and real-world datasets.
Practical Data Science with R by Nina Zumel and John Mount is an excellent new book that broadly covers data science, including ML, at an introductory-to-intermediate level.
An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani is a lucid introductory book written by some of the same authors as the more advanced Elements of Statistical Learning. Hastie and Tibshirani should be familiar names, as they have also made major contributions to machine learning and R (packages including glmnet, gam, hierNet, etc.). Both ISL and ESL are available as PDFs online, but I personally think they're also worth getting in hardcopy.
For improving understanding of R as a programming language, I like The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff. I also strongly recommend Hadley Wickham's in-progress book Advanced R.
For people who are beginning to learn statistics, I often recommend Introductory Statistics with R by Peter Dalgaard. For more advanced statistics, I have benefited from Cosma Shalizi's online draft textbook Advanced Data Analysis from an Elementary Point of View.
(I have yet to read and can't comment one way or another on Data Mining with R by Luis Torgo or Machine Learning with R by Brett Lantz.)
Update: Thanks to this discussion, scikit-learn was updated and works correctly now. Its LDA source code can be found here. The original issue was due to a minor bug (see this GitHub discussion), and my answer was actually not pointing at it correctly (apologies for any confusion caused). As all of that does not matter anymore (the bug is fixed), I edited my answer to focus on how LDA can be solved via SVD, which is the default algorithm in scikit-learn.
After defining within- and between-class scatter matrices $\boldsymbol \Sigma_W$ and $\boldsymbol \Sigma_B$, the standard LDA calculation, as pointed out in your question, is to take eigenvectors of $\boldsymbol \Sigma_W^{-1} \boldsymbol \Sigma_B$ as discriminant axes (see e.g. here). The same axes, however, can be computed in a slightly different way, exploiting a whitening matrix:
1. Compute $\boldsymbol \Sigma_W^{-1/2}$. This is a whitening transformation with respect to the pooled within-class covariance (see my linked answer for details).
Note that if you have the eigen-decomposition $\boldsymbol \Sigma_W = \mathbf{U}\mathbf{S}\mathbf{U}^\top$, then $\boldsymbol \Sigma_W^{-1/2}=\mathbf{U}\mathbf{S}^{-1/2}\mathbf{U}^\top$. Note also that one can compute the same thing by doing SVD of the pooled within-class data: $\mathbf{X}_W = \mathbf{U} \mathbf{L} \mathbf{V}^\top \Rightarrow \boldsymbol\Sigma_W^{-1/2}=\mathbf{U}\mathbf{L}^{-1}\mathbf{U}^\top$.
2. Find the eigenvectors of $\boldsymbol \Sigma_W^{-1/2} \boldsymbol \Sigma_B \boldsymbol \Sigma_W^{-1/2}$; let us call them $\mathbf{A}^*$.
Again, note that one can compute them by doing SVD of the between-class data $\mathbf{X}_B$ transformed with $\boldsymbol \Sigma_W^{-1/2}$, i.e. the between-class data whitened with respect to the within-class covariance.
3. The discriminant axes $\mathbf A$ will be given by $\boldsymbol \Sigma_W^{-1/2} \mathbf{A}^*$, i.e. by the principal axes of the transformed data, transformed again.
Indeed, if $\mathbf a^*$ is an eigenvector of the above matrix, then $$\boldsymbol \Sigma_W^{-1/2} \boldsymbol \Sigma_B \boldsymbol \Sigma_W^{-1/2}\mathbf a^* = \lambda \mathbf a^*,$$ and multiplying from the left by $\boldsymbol \Sigma_W^{-1/2}$ and defining $\mathbf a = \boldsymbol \Sigma_W^{-1/2}\mathbf a^*$, we immediately obtain: $$\boldsymbol \Sigma_W^{-1} \boldsymbol \Sigma_B \mathbf a = \lambda \mathbf a.$$
In summary, LDA is equivalent to whitening the matrix of class means with respect to within-class covariance, doing PCA on the class means, and back-transforming the resulting principal axes into the original (unwhitened) space.
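To make this recipe concrete, here is a minimal NumPy sketch of the three steps above, checked against the standard eigenvectors of $\boldsymbol \Sigma_W^{-1} \boldsymbol \Sigma_B$. This is not scikit-learn's actual implementation; the synthetic data, variable names, and scaling conventions (constant factors in the scatter matrices are ignored) are only illustrative.

```python
# Minimal sketch of LDA via whitening + PCA on class means (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))                 # 150 samples, 4 features
y = np.repeat([0, 1, 2], 50)                  # 3 classes
X[y == 1] += [2.0, 0.0, 0.0, 0.0]
X[y == 2] += [0.0, 3.0, 0.0, 1.0]

classes = np.unique(y)
grand_mean = X.mean(axis=0)

# Within- and between-class scatter matrices (constant factors omitted)
Sw = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in classes)
Sb = sum((y == c).sum()
         * np.outer(X[y == c].mean(axis=0) - grand_mean,
                    X[y == c].mean(axis=0) - grand_mean) for c in classes)

# Step 1: whitening matrix Sigma_W^{-1/2} from the eigen-decomposition Sw = U S U^T
S, U = np.linalg.eigh(Sw)
Sw_inv_sqrt = U @ np.diag(S ** -0.5) @ U.T

# Step 2: eigenvectors A* of the whitened between-class scatter
evals, A_star = np.linalg.eigh(Sw_inv_sqrt @ Sb @ Sw_inv_sqrt)
order = np.argsort(evals)[::-1]               # sort by decreasing eigenvalue
evals, A_star = evals[order], A_star[:, order]

# Step 3: back-transform to obtain the discriminant axes A = Sigma_W^{-1/2} A*
A = Sw_inv_sqrt @ A_star

# Check: the columns of A are eigenvectors of Sw^{-1} Sb with the same eigenvalues
print(np.allclose(np.linalg.inv(Sw) @ Sb @ A, A * evals))
```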
This is pointed out e.g. in The Elements of Statistical Learning, section 4.3.3. In scikit-learn this is the default way to compute LDA, because SVD of a data matrix is numerically more stable than eigen-decomposition of its covariance matrix.
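For instance, one way to see the two routes agree in practice (a small sketch; the dataset is just an example) is to compare the projections produced by scikit-learn's `solver='svd'` and `solver='eigen'` options, which should coincide up to sign and scaling:

```python
# Compare scikit-learn's SVD-based and eigen-decomposition-based LDA solvers.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

Z_svd = LinearDiscriminantAnalysis(solver='svd').fit_transform(X, y)
Z_eig = LinearDiscriminantAnalysis(solver='eigen').fit_transform(X, y)

# Each pair of projected components should be perfectly correlated (up to sign),
# even though the two solvers may scale and orient the axes differently.
for j in range(Z_svd.shape[1]):
    r = np.corrcoef(Z_svd[:, j], Z_eig[:, j])[0, 1]
    print(f"component {j}: |correlation| = {abs(r):.6f}")
```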
Note that one can use any whitening transformation instead of $\boldsymbol \Sigma_W^{-1/2}$ and everything will still work exactly the same. In scikit-learn, $\mathbf{L}^{-1}\mathbf{U}^\top$ is used (instead of $\mathbf{U}\mathbf{L}^{-1}\mathbf{U}^\top$), and it works just fine (contrary to what was originally written in my answer).
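As a quick numerical illustration of this point, one can check that a non-symmetric whitening (here a Cholesky-based one, with the back-transform taken as $\mathbf W^\top \mathbf A^*$ for a general whitening $\mathbf W$ satisfying $\mathbf W \boldsymbol\Sigma_W \mathbf W^\top = \mathbf I$; that generalization is my assumption, not something spelled out above) recovers the same leading discriminant directions as the symmetric $\boldsymbol\Sigma_W^{-1/2}$:

```python
# Illustration: any valid whitening of Sigma_W yields the same discriminant axes.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = np.repeat([0, 1, 2], 40)
X[y == 1, 0] += 3.0
X[y == 2, 1] += 3.0

m = X.mean(axis=0)
Sw = sum(np.cov(X[y == c].T, bias=True) * 40 for c in (0, 1, 2))
Sb = sum(40 * np.outer(X[y == c].mean(axis=0) - m,
                       X[y == c].mean(axis=0) - m) for c in (0, 1, 2))

def lda_axes(W):
    """Discriminant axes from a whitening matrix W (columns normalized)."""
    evals, A_star = np.linalg.eigh(W @ Sb @ W.T)
    A = W.T @ A_star[:, np.argsort(evals)[::-1]]   # back-transform, assumed W^T A*
    return A / np.linalg.norm(A, axis=0)

# Symmetric whitening Sigma_W^{-1/2}
S, U = np.linalg.eigh(Sw)
W_sym = U @ np.diag(S ** -0.5) @ U.T

# Non-symmetric whitening from the Cholesky factor: Sw = C C^T  =>  W = C^{-1}
W_chol = np.linalg.inv(np.linalg.cholesky(Sw))

A_sym, A_chol = lda_axes(W_sym), lda_axes(W_chol)
# The two leading discriminant axes agree up to sign
print(np.allclose(np.abs(A_sym[:, :2]), np.abs(A_chol[:, :2]), atol=1e-8))
```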
In addition to the scikit-learn user guide, the following sources were of great help to me:
PyCon conferences include a fairly large number of tutorial sessions in their itinerary, most of which end up as videos of three or more hours on their YouTube channels. I would strongly recommend viewing the following sessions in the given order:
a. Machine Learning with Scikit-Learn (I) by Jake VanderPlas, held during PyCon 2015.
b. Olivier Grisel's Machine Learning with scikit-learn (II), the sequel to (a), also held at PyCon 2015.
c. Machine Learning with Scikit Learn | SciPy 2015 Tutorial | Andreas Mueller & Kyle Kastner Part I and its sequel, both of which are part of the SciPy 2015 conference and are now available on Enthought's channel.
d. Olivier Grisel's Advanced Machine Learning with scikit-learn, held at PyCon 2013.
They also offer more scikit-learn-, SciPy-, and pandas-related tutorial sessions, so make sure you visit their channels as well.
EDIT: May I also direct attention to @inversion's answer; Kaggle is the playground for learning machine-learning techniques based on a wide variety of libraries such as scikit-learn, Lasagne (Python), Theano (Python), h2o (R and Python), and caret (R), and it gives you real-life, hands-on challenges to tackle.