The "uniform" in uniform prior doesn't just mean that hits and misses are equally likely. It means that you assume that you have a probability measure on the rates $[0,1]$ and this measure is the uniform measure. For example, it means the chance the true rate is between $0.9$ and $1.0$ is $0.1$. There are other measures on $[0,1]$ which have mean value $1/2$.
For example, if you start with the uniform prior and then observe $1$ hit and $1$ miss, the updated distribution is more concentrated around a rate of $1/2$ than the uniform prior. Instead of a probability of $\Delta r $ for the rate to be between $r$ and $r+\Delta r$, you would estimate the probability to be about $6 r (1-r) \Delta r$. (The factor of $6$ makes the total measure $1$.) This new distribution would also predict that the probability of getting a hit next time is $1/2$. If you observe an additional $h$ hits and $m$ misses, the expected probability of a hit is
$$\frac{h+2}{h+m+4}.$$
This is closer to $1/2$ than $\frac{h+1}{h+m+2}$. This agrees with the fact that you started with a distribution which was more concentrated near $1/2$. Of course, if you include that first hit and miss then there have been $h+1$ hits and $m+1$ misses in total.
The distribution you get from the uniform prior by observing some number of hits and misses is called a beta distribution. The uniform distribution on $[0,1]$ is $\text{Beta}(1,1)$. Beta distributions have the nice property that if you update a beta distribution with an observation, the result is still in that family. From one observation, the $\text{Beta}(a,b)$ distribution is updated either to $\text{Beta}(a+1,b)$ or $\text{Beta}(a,b+1)$. The mean of a $\text{Beta}(a,b)$ distribution is $\frac{a}{a+b}$. It is perfectly reasonable to have a prior distribution which is not a beta distribution, but the answers might not come out as nicely.
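To make the bookkeeping concrete, here is a minimal Python sketch of that conjugate updating (the observation sequence is made up purely for illustration):

```python
from scipy.stats import beta

# Start from the uniform prior Beta(1, 1) and update one observation at a time.
a, b = 1.0, 1.0                   # Beta(a, b); (1, 1) is the uniform distribution
observations = [1, 0, 1, 1, 0]    # 1 = hit, 0 = miss (made-up data)

for x in observations:
    if x == 1:
        a += 1                    # a hit updates Beta(a, b) to Beta(a + 1, b)
    else:
        b += 1                    # a miss updates Beta(a, b) to Beta(a, b + 1)

# The posterior mean a / (a + b) is the predicted probability of a hit next time.
print("posterior mean:", a / (a + b))
# The posterior probability that the true rate lies between 0.9 and 1.0.
print("P(0.9 < rate < 1.0):", beta.cdf(1.0, a, b) - beta.cdf(0.9, a, b))
```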
If I'm very confident that the hit rate is 50/50, I could add 2 instead of 1 to the hits and misses. Or if I'm less confident, I could add 1/2.
Adding larger or smaller pseudo-counts might or might not correspond to confidence. What it really encodes is whether you think the rate is likely to be close to $1/2$. In some situations, you may believe the rate should be close to $0$ or close to $1$, and the first trial should be very informative. You might believe that half of your students know how to solve a type of problem, and that if you give a random student $5$ of these problems, the student is very likely to solve all $5$ or fail to solve any. This might be approximated by a $\text{Beta}(\epsilon,\epsilon)$ distribution so that after $h$ correct and $m$ incorrect, the average value is
$$\frac{h+\epsilon}{h+m+2\epsilon},$$
which is close to $0$ or $1$ after one observation.
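To see how fast that happens, take (as a made-up illustration) $\epsilon = 0.01$: after a single correct answer ($h = 1$, $m = 0$) the average value is
$$\frac{1 + 0.01}{1 + 0 + 0.02} \approx 0.99,$$
so one observation already moves the estimate almost all the way to $1$.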
Overfitting comes from allowing too large a class of models. This gets a bit tricky with models with continuous parameters (like splines and polynomials), but if you discretize the parameters into some number of distinct values, you'll see that increasing the number of knots/coefficients will increase the number of available models exponentially. For every dataset there is a spline and a polynomial that fits precisely, so long as you allow enough coefficients/knots. It may be that a spline with three knots overfits more than a polynomial with three coefficients, but that's hardly a fair comparison.
If you have a low number of parameters and a large dataset, you can be reasonably sure you're not overfitting. If you want to try higher numbers of parameters, you can use cross-validation (within your training data) to find the best number, or you can use a criterion like Minimum Description Length (MDL).
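As a rough sketch of the cross-validation route (the scikit-learn pipeline and the made-up data below are my own illustration, not part of the original answer):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Made-up 1D regression data for illustration.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 50)

# Pipeline: polynomial features of a given degree, then ordinary least squares.
model = make_pipeline(PolynomialFeatures(), LinearRegression())

# Cross-validate over the polynomial degree to pick the number of coefficients.
search = GridSearchCV(model, {"polynomialfeatures__degree": range(1, 10)}, cv=5)
search.fit(x, y)
print("best degree:", search.best_params_["polynomialfeatures__degree"])
```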
EDIT: As requested in the comments, an example of how one would apply MDL. First you have to deal with the fact that your data is continuous, so it can't be represented in a finite code. For the sake of simplicity we'll segment the data space into boxes of side $\epsilon$ and instead of describing the data points, we'll describe the boxes that the data falls into. This means we lose some accuracy, but we can make $\epsilon$ arbitrarily small, so it doesn't matter much.
Now the task is to describe the dataset as succinctly as possible with the help of some polynomial. First we describe the polynomial. If it's an $n$-th order polynomial, we just need to store its $n+1$ coefficients, again discretized. We store the value $n$ in a prefix-free code (so we know when to stop reading) and then the $n+1$ coefficient values. With this information a receiver of our code can reconstruct the polynomial. Then we add the rest of the information required to store the dataset. For each data point we give the $x$-value, and then how many boxes up or down the point lies from the polynomial. Both values are stored in prefix-free codes, so that small values need few bits and we don't need delimiters between points. (You can shorten the code for the $x$-values by storing only the increments between consecutive values.)
The fundamental point here is the tradeoff. If I choose a zeroth-order polynomial (like $f(x) = 3.4$), the model is very cheap to store, but for the $y$-values I'm essentially storing each point's distance from the mean. More coefficients give me a better-fitting polynomial (and thus shorter codes for the $y$-values), but I have to spend more bits describing the model. The model that gives you the shortest total code for your data is the best fit by the MDL criterion.
(Note that this is known as 'crude MDL', and there are some refinements you can make to solve various technical issues).
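A rough sketch of how one could score polynomial degrees this way is below; the made-up data, the grid sizes, and the simple $2\log_2(k)$-style integer code are my own simplifications of the scheme described above, not a carefully designed prefix-free code.

```python
import numpy as np

def description_length(x, y, degree, eps=0.01, coef_eps=0.01):
    """Crude MDL score: bits to describe the polynomial plus bits for the residuals."""
    def int_bits(k):
        # Rough stand-in for a prefix-free code length of an integer k.
        return 2 * np.log2(abs(k) + 1) + 1

    coeffs = np.polyfit(x, y, degree)
    # Model cost: the degree plus each coefficient discretized to a grid of size coef_eps.
    model_bits = int_bits(degree) + sum(int_bits(round(c / coef_eps)) for c in coeffs)
    # Data cost: how many eps-boxes each point lies above/below the fitted polynomial.
    # (The x-values cost the same for every degree, so they are omitted from the comparison.)
    residuals = y - np.polyval(coeffs, x)
    data_bits = sum(int_bits(round(r / eps)) for r in residuals)
    return model_bits + data_bits

# Made-up data: pick the degree that gives the shortest total code.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.05, 40)
scores = {d: description_length(x, y, d) for d in range(0, 8)}
print("best degree by crude MDL:", min(scores, key=scores.get))
```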
Best Answer
Thanks to the hints in the comments, I managed to find a suitable solution to my problem. More precisely, I found two that overcome the issue with the splines. But neither of my solutions uses splines anymore.
I think it's worth noting that @Gavin's suggestion would probably also work, but I didn't try it since I'm quite happy with what I have so far.
Gaussian Filter
I applied a 3D Gaussian filter to the data and the results look reasonable, as shown in the figure below.
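For reference, a minimal sketch of that kind of smoothing (the gridding of the points and sigma = 1.5 are made-up choices for illustration, not the exact settings behind the figure):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Made-up 3D point cloud, binned onto a regular grid to get a 3D array of counts.
rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))
counts, edges = np.histogramdd(points, bins=(30, 30, 30))

# Smooth the 3D grid; sigma is measured in grid cells and is the hand-picked parameter.
smoothed = gaussian_filter(counts, sigma=1.5)
```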
From my point of view, the drawback of this method is that the choice of sigma (the standard deviation of the Gaussian kernel) is somewhat arbitrary, so I settled on the next method instead.
Kernel Density Estimation
Thanks to the hint from @Aksakal I looked into kernel density estimation (KDE), and it seems well suited to the problem at hand. The result is shown in the figure below.
What I like about this method is that the kernel bandwidth can be estimated from the data by cross-validation (in my opinion an advantage over simply picking a value that looks fine).
If anybody is interested in KDE in Python, an excellent comparison of the different implementations can be found here.
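For completeness, here is a minimal sketch of that cross-validated bandwidth selection with scikit-learn (the 3D points and the bandwidth grid are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Made-up 3D point cloud standing in for the real data.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 3))

# Pick the bandwidth by cross-validation (GridSearchCV maximizes held-out log-likelihood).
search = GridSearchCV(KernelDensity(kernel="gaussian"),
                      {"bandwidth": np.logspace(-1, 1, 20)}, cv=5)
search.fit(points)
kde = search.best_estimator_
print("best bandwidth:", kde.bandwidth)

# score_samples returns log-densities; exponentiate to get the estimated density.
density = np.exp(kde.score_samples(points[:10]))
```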