While this sounds somewhat like overfitting, I think it's actually more likely that you've got some kind of "bug" in your code or your process. I would start by verifying that your test set isn't somehow systematically different from the training/validation set. Suppose your data is sorted by date (or whatever). If you used the first 50% for training, the next 25% for validation, and the rest for testing, you may have accidentally stratified your data in a way that makes the training data somewhat representative of the validation data, but less so for the testing data. This is fairly easy to do by accident.
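As a quick illustration of what I mean, here is a rough sketch (assuming scikit-learn and a pandas DataFrame `df` with a target column `"y"`, both of which are placeholders): draw the splits at random rather than by position, then compare summary statistics across them as a sanity check.

```python
from sklearn.model_selection import train_test_split

# 50% train, 25% validation, 25% test, sampled at random rather than by row order
train, rest = train_test_split(df, test_size=0.5, random_state=42)
valid, test = train_test_split(rest, test_size=0.5, random_state=42)

# If the splits are comparable, these summaries should look similar
for name, part in [("train", train), ("valid", valid), ("test", test)]:
    print(name, part["y"].mean(), part["y"].std())
```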
You should also ensure you're not "double-dipping" in the validation data somehow, which can happen by accident (for example, by repeatedly tuning hyperparameters against the same validation set, or by fitting preprocessing steps like scaling or feature selection on the data before splitting it).
Alternatively, CV's own @Frank Harrell has reported that a single train/test split is often too variable to provide useful information on a system's performance (maybe he can weigh in with a citation or some data). You might consider doing something like cross-validation or bootstrapping instead, which would let you measure both the mean and the variance of your accuracy measure.
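A rough sketch of what that looks like (assuming scikit-learn, a feature matrix `X`, and a continuous target `y`; the ridge model is just a placeholder estimator): repeated k-fold cross-validation gives a whole distribution of scores, so you can report a spread instead of a single number.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# 5-fold CV repeated 20 times = 100 scores per configuration
cv = RepeatedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(Ridge(alpha=1.0), X, y, scoring="r2", cv=cv)

# Report both the mean and the variability, not just one number
print(f"R^2: mean={scores.mean():.3f}, sd={scores.std():.3f}")
```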
Unlike Mikera, I don't think the problem is your scoring mechanism. That said, I can't think of a legitimate situation where $R^2_{\text{training}} < R^2_{\text{validation}}$, so I'd suggest scoring on the validation data alone.
More generally, I think $R^2$ or something like it is a reasonable choice for measuring the performance of a continuous-output model, assuming you're aware of its caveats. Depending on exactly what you're doing, you may also want to look at the maximum or worst-case error. If you're discretizing your output somehow (logistic regression, some external threshold), then precision/recall/AUC might be a better choice.
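For example (a sketch only: `y_true`/`y_pred` are held-out targets and predictions from a continuous model, `y_bin`/`y_score` from a thresholded or probabilistic one, all placeholders; scikit-learn assumed):

```python
from sklearn.metrics import max_error, r2_score, roc_auc_score

# Continuous output: overall fit plus the single worst prediction
print("R^2:", r2_score(y_true, y_pred))
print("worst-case error:", max_error(y_true, y_pred))

# Discretized / probabilistic output: threshold-free ranking quality
print("AUC:", roc_auc_score(y_bin, y_score))
```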
The two approaches are the same from a training perspective, as both use cross-validation. If you use the same k and the dataset is sufficiently large, there should be little difference.
The only difference is that in approach 2 you evaluate on an unseen 20% of the data.
In the second approach, we use 80% of the data for training, split 60-20 between fitting and validation. Each validation fold is 20% of the full data, i.e. a quarter of the training portion, so

k = 80/20 = 4
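A rough sketch of approach 2 under those numbers (assuming scikit-learn; `X`, `y`, and the ridge model are placeholders): hold out an unseen 20%, run 4-fold CV on the remaining 80% (so each fold is 20% of the full data), then score once on the held-out test set.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# 80% for training/CV, 20% kept completely unseen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 4-fold CV on the 80%: each validation fold is 20% of the original data
cv_scores = cross_val_score(Ridge(alpha=1.0), X_train, y_train,
                            cv=KFold(n_splits=4, shuffle=True, random_state=0))
print("CV R^2:", cv_scores.mean())

# Refit on all of the 80% and evaluate once on the unseen 20%
final = Ridge(alpha=1.0).fit(X_train, y_train)
print("test R^2:", final.score(X_test, y_test))
```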
Best Answer
With lasso or ridge regression, you do not need to divide your data into 3 parts. Once you have determined how best to split your data into two, you can use the training set with cross-validation to determine the shrinkage parameter, and then fit the model on the same training data without introducing bias (see the lasso paper by Tibshirani, in the Journal of Statistical Software I believe). Consequently, your question should be how much data to use to fit the model and how much to test. Since your sample size is small, I would recommend either a 70-30 or an 80-20 split. There are really no rules about the split, but I pay more attention to ensuring I have enough data to estimate the parameters of my model than to having "sufficient" test data.
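A sketch of that workflow (assuming scikit-learn; `X` and `y` are placeholders, and `LassoCV` stands in for whatever lasso implementation you use): a single 80-20 split, with the shrinkage parameter chosen by cross-validation on the training portion only, then one evaluation on the held-out 20%.

```python
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# 80% for fitting + choosing the shrinkage parameter, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Cross-validation on the training data alone selects alpha and fits the model
model = LassoCV(cv=10, random_state=0).fit(X_train, y_train)
print("chosen alpha:", model.alpha_)
print("test R^2:", model.score(X_test, y_test))
```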