No, it doesn't establish similarity, and indeed, it doesn't come close to answering the right question.
i) Failure to reject the null doesn't imply it's true; it may simply mean your sample was too small to detect the difference.
ii) You're not actually interested in the truth of the null you're testing. I presume you don't believe the distributions of errors are literally identical -- that would be an astronomically unlikely situation, since the new problems will have at least subtle differences from the originals. You yourself use the word "similar", and the hypothesis you're testing isn't about similarity. With a large enough sample size, any difference in means, no matter how trivial, will lead to rejection.
You may want to consider whether a confidence interval for the difference is a better tool. You can specify the largest difference in error rate that you would still regard as consistent with "similar", and check whether the confidence interval lies entirely within those bounds.
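For instance, here's a minimal sketch in R, with made-up paired error counts per student and a margin you would have to choose yourself:

```r
# Hypothetical paired error counts: same students, original vs. new problems
orig_errors <- c(3, 5, 2, 8, 4, 6, 1, 7, 5, 4)
new_errors  <- c(4, 4, 3, 7, 5, 6, 2, 8, 4, 5)
margin <- 1.5  # largest mean difference still consistent with "similar" (your call)

# 95% CI for the mean paired difference
ci <- t.test(new_errors - orig_errors)$conf.int
ci
# "Similar" in this sense if the whole interval lies inside (-margin, margin)
all(ci > -margin & ci < margin)
```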
Alternatively, you might consider an equivalence test.
(I'm not 100% sure I'd use a t-test for this, since the number of errors will be fairly discrete, and there's also a potential issue with heteroskedasticity -- the variance of a difference will tend to be larger for students whose mean number of errors is larger -- but it may do well enough.)
Of course this only detects situations where the mean difference is away from zero.
It's possible that the variation is such that some people score higher and some score lower (for example, weaker students may typically score lower than on the original test while stronger students typically score higher), and a paired test of means won't detect that. If you want to be able to identify that kind of change, you will want a different test; one possibility is sketched below.
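One option (my suggestion, not something named above) is the Pitman-Morgan test for equal variances in paired samples, which comes down to testing the correlation between the pairwise sums and differences:

```r
# Pitman-Morgan test for a change in spread between paired measurements:
# Cov(X + Y, X - Y) = Var(X) - Var(Y), so correlated sums and differences
# indicate unequal variances.
orig_errors <- c(3, 5, 2, 8, 4, 6, 1, 7, 5, 4)   # hypothetical data, as above
new_errors  <- c(5, 4, 1, 9, 5, 7, 0, 9, 4, 5)

cor.test(orig_errors + new_errors, orig_errors - new_errors)
# A small p-value suggests the spread of scores changed between the two tests.
```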
Aside: It sounds like your underlying problem (though not your direct question) is related to calibration, on which a fair bit has been written. If your device doesn't read as close to the commercial one as you'd like, that may not matter so much, as long as it's fairly consistent in the way it responds. A calibration curve (in most cases, just a line) is often used to adjust readings on a device to match some standard, so that any such consistent bias is corrected on the adjusted scale. The methodology of calibration may therefore be of use to you if your device has some bias compared to the commercial one.
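As a rough sketch of the calibration idea in R (the data and variable names are invented; a real calibration study would use replicated readings across the device's range):

```r
# Hypothetical paired readings: your device vs. the commercial standard
standard <- c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
device   <- c(1.2, 2.1, 3.3, 4.2, 5.4, 6.3, 7.5, 8.4)

# Fit a calibration line: predict the standard reading from the device reading
cal <- lm(standard ~ device)
coef(cal)  # intercept and slope of the calibration line

# Adjusted (calibrated) values for new readings from your device
predict(cal, newdata = data.frame(device = c(2.5, 6.8)))
```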
Your direct question sounds like you probably want equivalence testing; in particular, the two one-sided tests (TOST) procedure.
The more usual way of setting this up amounts to placing a pair of equivalence bounds around your gold-standard measurement (values that are "close enough" to call equivalent) and then showing that you would reject both the hypothesis that the population mean of your measurement lies above the upper bound and the hypothesis that it lies below the lower bound (and so concluding it lies between the bounds).
[This can also be recast as seeing if a two sided confidence interval for the parameter lies entirely within the pair of equivalence bounds.]
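A minimal TOST sketch in base R, with invented paired readings and an equivalence margin that you would need to choose for your application:

```r
# Hypothetical paired readings: new device vs. gold standard
device   <- c(10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 10.4)
standard <- c(10.0, 9.9, 10.4, 10.2, 10.0, 10.1, 10.1, 10.3)
margin   <- 0.5  # differences within +/- 0.5 count as "equivalent" (assumed)

d <- device - standard
# Two one-sided tests: reject "mean diff <= -margin" and "mean diff >= +margin"
p_lower <- t.test(d, mu = -margin, alternative = "greater")$p.value
p_upper <- t.test(d, mu =  margin, alternative = "less")$p.value
max(p_lower, p_upper)  # conclude equivalence if this is below your alpha

# Equivalent formulation: a 90% two-sided CI lying entirely inside the bounds
t.test(d, conf.level = 0.90)$conf.int
```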
See for example Walker & Nowacki (2011) [1]; there's a discussion of TOST in industrial applications in Richter & Richter (2002) [2].
However, a caveat: presumably you're testing your device not at one value but across its range. Given that there may be more bias at one value than another (indeed, it's possible to be biased low in one place and high in another), you probably want to look at equivalence at each value of the standard device rather than in a single TOST setting. In that case you would establish equivalence bands, which need not be equally wide at every value (e.g. if equivalence is defined in percentage terms); a sketch follows below. This brings us back nearer to the calibration problem I mentioned at the start.
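For example, a sketch (again with invented data) that checks a percentage-based equivalence band separately at each level of the standard:

```r
# Hypothetical replicated device readings at several levels of the standard
std_level <- c(2, 2, 2, 5, 5, 5, 10, 10, 10)
device    <- c(2.1, 1.9, 2.2, 5.3, 5.1, 5.2, 10.6, 10.2, 10.5)
pct <- 0.05  # equivalence band of +/- 5% of the standard value (assumed)

for (lv in unique(std_level)) {
  d  <- device[std_level == lv] - lv
  ci <- t.test(d, conf.level = 0.90)$conf.int
  ok <- ci[1] > -pct * lv && ci[2] < pct * lv
  cat(sprintf("level %4.1f: 90%% CI (%+.3f, %+.3f), within band: %s\n",
              lv, ci[1], ci[2], ok))
}
```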
[1]: Walker, E., & Nowacki, A. S. (2011), "Understanding Equivalence and Noninferiority Testing," Journal of General Internal Medicine, 26(2), 192–196. http://doi.org/10.1007/s11606-010-1513-8
(Ignore the 'noninferiority' material there; you're just after the equivalence part.)
[2]: Richter, S. J., & Richter, C. (2002), "A Method for Determining Equivalence in Industrial Applications," Quality Engineering, 14(3), 375–380.
We would need either an example or more details on the datasets, but in general:
The t-test will answer the question: is the mean the same between the two classes?
To test whether the two datasets come from the same distribution, you could, for example, apply a Kolmogorov-Smirnov test (ks.test in R); there are also multivariate Kolmogorov-Smirnov tests if you have two or more variables [Lopes et al., 2007]. A sketch of both tests follows below.
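A minimal sketch of both tests in R, on made-up data standing in for the two classes:

```r
set.seed(1)
# Hypothetical scores for the two classes
class_a <- rnorm(30, mean = 50, sd = 10)
class_b <- rnorm(30, mean = 52, sd = 10)

t.test(class_a, class_b)   # compares the means only
ks.test(class_a, class_b)  # compares the full distributions
```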
With the example dataset: given the plot and the results of the tests, you might want to increase the number of individuals!