The way to think of cross-validation is as estimating the performance obtained using a method for building a model, rather than for estimating the performance of a model.
If you use cross-validation to estimate the hyper-parameters of a model (the $\alpha$s) and then use those hyper-parameters to fit a model to the whole dataset, then that is fine, provided you recognise that the cross-validation estimate of performance is likely to be (possibly substantially) optimistically biased. This is because part of the model (the hyper-parameters) has been selected to minimise the cross-validation error, so if the cross-validation statistic has a non-zero variance (and it will) there is the possibility of over-fitting the model selection criterion.
If you want to choose the hyper-parameters and estimate the performance of the resulting model, then you need to perform a nested cross-validation, where the outer cross-validation is used to assess the performance of the model, and an inner cross-validation is used to determine the hyper-parameters separately within each outer fold. You then build the final model by using cross-validation on the whole dataset to choose the hyper-parameters and fitting the classifier on the whole dataset using the optimised hyper-parameters.
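As a rough sketch, nested cross-validation can be set up in scikit-learn by wrapping a hyper-parameter search inside an outer cross-validation loop; the estimator, parameter grid, dataset and fold counts below are placeholder assumptions, not recommendations:

```python
# Minimal sketch of nested cross-validation with scikit-learn.
# The SVC estimator, parameter grid, dataset and fold counts are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}

# Inner cross-validation: chooses the hyper-parameters.
inner_search = GridSearchCV(SVC(), param_grid, cv=5)

# Outer cross-validation: estimates the performance of the whole procedure
# (model fitting + hyper-parameter selection) on unseen data.
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print("Nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))

# Final model: tune the hyper-parameters on the full dataset and refit.
final_model = inner_search.fit(X, y).best_estimator_
```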
This is of course computationally expensive, but worth it as the bias introduced by improper performance estimation can be large. See my paper
G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
However, it is still possible to have over-fitting in model selection (nested cross-validation just allows you to test for it). A method I have found useful is to add a regularisation term to the cross-validation error that penalises hyper-parameter values likely to result in overly-complex models, see
G. C. Cawley and N. L. C. Talbot, "Preventing over-fitting in model selection via Bayesian regularisation of the hyper-parameters", Journal of Machine Learning Research, vol. 8, pp. 841-861, April 2007.
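Purely as a toy illustration of the general idea described above, and emphatically not the Bayesian regularisation scheme from that paper, one could add a penalty to the cross-validation error that discourages extreme hyper-parameter values; the quadratic penalty on the log hyper-parameters and its weight below are arbitrary assumptions:

```python
# Toy sketch: a penalised model-selection criterion (NOT the method from the paper).
# The quadratic penalty on the log hyper-parameters and its weight are arbitrary.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
penalty_weight = 0.01  # assumed value; it would itself need justification in practice

def penalised_cv_error(log_C, log_gamma):
    model = SVC(C=np.exp(log_C), gamma=np.exp(log_gamma))
    cv_error = 1.0 - cross_val_score(model, X, y, cv=5).mean()
    # Penalise hyper-parameter values far from a default, which here stands in
    # for "values likely to give overly complex models".
    penalty = penalty_weight * (log_C ** 2 + log_gamma ** 2)
    return cv_error + penalty

print(penalised_cv_error(np.log(10.0), np.log(0.01)))
```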
So the answers to your question are (i) yes, you should use the full dataset to produce your final model as the more data you use the more likely it is to generalise well but (ii) make sure you obtain an unbiased performance estimate via nested cross-validation and potentially consider penalising the cross-validation statistic to further avoid over-fitting in model selection.
There is a recently proposed method to speed up grid search:
"Fast Cross validation via sequential analysis"
http://www.scribd.com/doc/76134034/Fast-Cross-Validation-Via-Sequential-Analysis-Talk
Basically, they're doing a normal grid search, but they try to eliminate bad parameter settings early in the process so as not to waste too much computation on them. It's fairly new and I don't know of independent evaluations of their method, but I'm currently implementing it and want to give it a try.
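A loosely related idea that is easy to experiment with is successive halving, which also discards poorly performing configurations after evaluating them on only part of the data; this is not the sequential-analysis procedure from the talk, just a sketch of the "eliminate bad parameters early" idea, and the estimator and grid below are placeholders:

```python
# Sketch of the "discard bad configurations early" idea via successive halving
# in scikit-learn; this is NOT the sequential-analysis method from the talk.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"C": [0.01, 0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}

# Candidates are first evaluated on a small subsample; only the better-performing
# ones survive to be evaluated on progressively more data.
search = HalvingGridSearchCV(SVC(), param_grid, factor=3, cv=5).fit(X, y)
print(search.best_params_)
```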
Best Answer
Yes, the approach you are using is correct. As @halilpazarlama suggests, the reason your cross-validation error is lower than the test error is indeed likely to be because you are over-fitting the cross-validation error. Essentially the cross-validation error is a performance estimate based on a finite sample of data, and thus will have a (usually) non-negligible variance (i.e. if you ran the same experiment again with a different sample of data, you would get a slightly different minimum error and the minimum may well occur at a different grid point). Thus we can minimise the cross-validation error in two ways: in ways that genuinely improve generalisation performance, and in ways that merely exploit the random sampling of the data used to form the training set. I wrote a paper about this as it can be problematic if you have a large number of hyper-parameters to tune:
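A small simulation illustrates the point: if you evaluate many hyper-parameter settings on a finite validation sample and report the minimum error, that minimum is optimistically biased even when all the settings are in fact equally good. The numbers below are arbitrary assumptions:

```python
# Toy simulation: taking the minimum of several noisy error estimates gives an
# optimistically biased estimate, even if every setting has the same true error.
import numpy as np

rng = np.random.default_rng(0)
true_error = 0.20   # assumed true error of every hyper-parameter setting
n_settings = 50     # number of grid points evaluated
n_val = 200         # size of the finite validation sample
n_repeats = 1000

min_errors = []
for _ in range(n_repeats):
    # Each setting's estimated error is the mean of n_val Bernoulli outcomes.
    est_errors = rng.binomial(n_val, true_error, size=n_settings) / n_val
    min_errors.append(est_errors.min())

print("true error: %.3f, mean selected (minimum) error: %.3f"
      % (true_error, np.mean(min_errors)))
# The selected minimum sits noticeably below the true error: optimistic bias.
```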
G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
If you only have a few hyper-parameters to tune, this isn't usually too much of a problem, as long as you are aware that the minimum cross-validation error will be optimistically biased. If you need an unbiased (or at least less biased or pessimistically biased) estimate, then the thing to do is nested cross-validation, where the outer cross-validation estimates performance and the inner cross-validation is used to tune the hyper-parameters separately in each fold (see the paper for details). Basically tuning the hyper-parameters is part of the fitting of the model and needs to be cross-validated as well. Of course this is computationally even more expensive.
To reduce the computational expense, you could try using the Nelder-Mead simplex method to tune the hyper-parameters instead of a grid search; it is usually more efficient and doesn't need gradient information. Pattern search is another alternative. Another thing you can do to improve efficiency is to start the model for each grid point from the model found at the previous grid point, instead of starting from scratch each time (alpha seeding). Alternatively, you could use a "regularisation path" type algorithm to learn each row of the grid (where the regularisation parameter is varied with the kernel parameter held fixed) in one go.
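As a rough sketch of the Nelder-Mead approach (the dataset, estimator and starting point are placeholder assumptions), one can minimise the cross-validation error over the log hyper-parameters with scipy:

```python
# Minimal sketch: tuning SVC hyper-parameters with Nelder-Mead instead of grid search.
# The dataset, estimator and starting point are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def cv_error(log_params):
    # Optimise in log-space so that C and gamma stay positive.
    C, gamma = np.exp(log_params)
    return 1.0 - cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()

result = minimize(cv_error, x0=np.log([1.0, 0.01]), method="Nelder-Mead")
print("best C, gamma:", np.exp(result.x), "CV error:", result.fun)
```

Note that Nelder-Mead works on a continuous, unconstrained search space, which is why the hyper-parameters are transformed to log-space here; it also evaluates far fewer points than an exhaustive grid, at the risk of stopping in a local minimum of the (noisy) cross-validation surface.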