I found a formula for $S_n$ in terms of the previous $S_i$'s and the polynomials $R_i$, which I'm quite happy about. In fact, I need the multivariate version of the $R_i$, which I will construct all at the same time. Let us consider the ring of formal power series in countably many variables $X_1,X_2,\dots$ with integer coefficients, that is $A=\mathbb{Z}[[X_1,X_2,\dots]]$. Note that there are several notions of power series in infinitely many variables; mine is that of Bourbaki, where the underlying module of $A$ is just the product of copies of $\mathbb{Z}$ indexed by (finitely supported!) multiindices. (Some people require that the homogeneous components of a power series be polynomials; this is not the case here.) Then one can see that there exists a unique sequence $(R_n)$ of elements of $A$ such that for all $n\ge 0$ we have
$X_1^{p^n}+X_2^{p^n}+\dots=R_0^{p^n}+pR_1^{p^{n-1}}+\dots+p^nR_n$
where on the left is the sum of all $p^n$-th powers of the variables. This is a straightforward application of Bourbaki, Algèbre Commutative, Chapitre IX, $\S~1$, no 2, prop. 2, c) with the endomorphism $\sigma:A\to A$ defined by $\sigma(X_i)=X_i^p$ for all $i$. Now if we have finitely many (say $s$) variables, then we set $R_n(X_1,\dots,X_s)=R_n(X_1,\dots,X_s,0,0,\dots)$. Examples:
$R_0(X_1,\dots,X_s)=X_1+\dots+X_s$ and $R_1(X_1,\dots,X_s)=\frac{X_1^p+\dots+X_s^p-(X_1+\dots+X_s)^p}{p}$.
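These $R_n$ are easy to compute in practice. A minimal sympy sketch (the function name `witt_R` is mine, not from the text) that computes the truncated $R_n$ in $s$ variables by solving the defining identity for one $R_m$ at a time; the division by $p^m$ is exact, which is exactly the integrality phenomenon in question:

```python
# Solve  X_1^{p^m} + ... + X_s^{p^m} = R_0^{p^m} + p R_1^{p^{m-1}} + ... + p^m R_m
# for R_m, one m at a time.
from sympy import symbols, expand, Rational

def witt_R(n, s, p):
    """Return [R_0, ..., R_n] as expanded polynomials in X1, ..., Xs."""
    X = symbols(f"X1:{s + 1}")
    R = []
    for m in range(n + 1):
        ghost = sum(x**(p**m) for x in X)   # sum of p^m-th powers
        for j in range(m):                  # peel off the known lower terms
            ghost -= p**j * R[j]**(p**(m - j))
        R.append(expand(ghost * Rational(1, p**m)))
    return R

R = witt_R(2, 2, 2)
print(R[0])  # X1 + X2
print(R[1])  # -X1*X2
print(R[2])  # integer coefficients, as the general theory guarantees
```

For $p = 2$ this gives $R_1 = -X_1X_2$ and $R_2 = -X_1^3X_2 - 2X_1^2X_2^2 - X_1X_2^3$, both with integer coefficients as expected.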
Assuming that the $R_n$ are computable, I have an inductive recipe for $S_n$ which is interesting because it shows that all the $p$-adic congruences implying integrality of the $S_n$ are contained in the $R_n$. The recipe goes like this. For each $i$, the polynomial $S_i$ is a sum of $2i$ terms (it will be obvious below what these terms are) and assuming $S_1,\dots,S_{n-1}$ are known then
$S_n=R_0Z_n+R_1Z_{n-1}+\dots+R_nZ_0+R_1S_{n-1}+R_2S_{n-2}+\dots+R_{n-1}S_1$
where: $Z_j$ is short for the pair of variables $(X_j,Y_j)$, $R_iZ_j$ is short for $R_i(X_j,Y_j)$ (the bivariate $R_i$) and $R_iS_j$ is the ($2j$-variate) polynomial $R_i$ evaluated at the $2j$ terms of $S_j$. I hope the following examples make it clear what this means, and how efficient it is:
$S_1 = R_0 Z_1 + R_1 Z_0$
$S_2=R_0 Z_2 + R_1 Z_1 + R_2 Z_0 + R_1 (R_0 Z_1 , R_1 Z_0 )$
$S_3 = R_0Z_3+R_1Z_2+R_2Z_1+R_3Z_0 \\ \quad
+ R_1(R_0Z_2,R_1Z_1,R_2Z_0,R_1(R_0Z_1,R_1Z_0))+ R_2(R_0Z_1,R_1Z_0)$
$S_4 = R_0Z_4+R_1Z_3+R_2Z_2+R_3Z_1+R_4Z_0 \\ \quad
+ R_1(R_0Z_3,R_1Z_2,R_2Z_1,R_3Z_0,R_1(R_0Z_2,R_1Z_1,R_2Z_0,R_1(R_0Z_1,R_1Z_0)),R_2(R_0Z_1,R_1Z_0)) \\ \quad
+ R_2(R_0Z_2,R_1Z_1,R_2Z_0,R_1(R_0Z_1,R_1Z_0)) \\ \quad
+ R_3(R_0Z_1,R_1Z_0)$
The proof that the recipe is correct is an exercise.
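Assuming the $S_n$ here are the Witt-vector addition polynomials (characterized by $\sum_i p^i S_i^{p^{n-i}} = \sum_i p^i X_i^{p^{n-i}} + \sum_i p^i Y_i^{p^{n-i}}$, which the text takes as known), the recipe can be turned into a short sympy sketch. The helper names `R_eval` and `S_terms` are mine; the point is that $S_j$ is kept as the *list* of its $2j$ terms, so that the $2j$-variate $R_i$ can be evaluated at them:

```python
# Sketch of the inductive recipe; S_n is built as a list of its 2n terms.
from sympy import symbols, expand, Rational

def R_eval(i, args, p):
    """Evaluate the len(args)-variate R_i at the given expressions, by
    unwinding sum_j a_j^{p^m} = R_0^{p^m} + p R_1^{p^{m-1}} + ... + p^m R_m."""
    Rs = []
    for m in range(i + 1):
        g = sum(a**(p**m) for a in args)
        for l in range(m):
            g -= p**l * Rs[l]**(p**(m - l))
        Rs.append(expand(g * Rational(1, p**m)))
    return Rs[i]

def S_terms(n, p, X, Y):
    """The 2n terms of S_n = R_0 Z_n + ... + R_n Z_0 + R_1 S_{n-1} + ... + R_{n-1} S_1."""
    terms = [R_eval(i, [X[n - i], Y[n - i]], p) for i in range(n + 1)]
    for i in range(1, n):
        terms.append(R_eval(i, S_terms(n - i, p, X, Y), p))
    return terms

p = 2
X, Y = symbols("X0:3"), symbols("Y0:3")
S0 = X[0] + Y[0]
S1 = sum(S_terms(1, p, X, Y))   # X1 + Y1 - X0*Y0 when p = 2
S2 = sum(S_terms(2, p, X, Y))
# sanity check: the second ghost component of (S0, S1, S2) equals the
# sum of the second ghost components of X and Y
assert expand(S0**4 + 2*S1**2 + 4*S2
              - (X[0]**4 + 2*X[1]**2 + 4*X[2])
              - (Y[0]**4 + 2*Y[1]**2 + 4*Y[2])) == 0
```

The final assertion checks the recipe's output against Witt addition for $p = 2$, $n = 2$; the term counts ($2n$ terms for $S_n$) come out as stated above.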
Edited: the proof below assumes $k$ is algebraically closed. The proof for the multiplicity inequality has been added.
Given $x \in X := V(f_1, \ldots, f_n)$, let $k$ be the local dimension of $X$ at $x$ (i.e. $k$ is the maximum of the dimensions of all irreducible components of $X$ containing $x$).
Claim: $mult_x(X) \leq $ the product of the largest $n-k$ elements of the sequence $\deg(f_1), \dots, \deg(f_n)$.
If $k = 0$, then the claim holds by Bézout's theorem. This estimate is also sharp: given degrees $d_1, \ldots, d_n$, take generic homogeneous polynomials of the specified degrees in $n$ variables; they intersect only at the origin, and the multiplicity at the origin is precisely $\prod_i d_i$, again by Bézout's theorem.
In the case that $n > k \geq 1$, we will prove the following
Sub-claim 1: For each $j = 1, \ldots, n-k$, there are $g_1, \ldots, g_j$ such that each $g_i$ is a linear combination of $f_1, \ldots, f_n$, and $V(g_1, \ldots, g_j)$ is a complete intersection on a neighborhood of $x$.
Proof: Proceed by induction on $j$. For $j = 1$, set $g_1 := f_1$. If $k = n - 1$, then stop. Otherwise, for $2 \leq j \leq n - k$, we may assume by induction that we have found $g_1, \ldots, g_{j-1}$ such that $V(g_1, \ldots, g_{j-1})$ is a complete intersection on a neighborhood of $x$. We claim that some linear combination of $f_1, \ldots, f_n$ does not vanish identically on any irreducible component of $V(g_1, \ldots, g_{j-1})$ containing $x$. Indeed, the linear combinations vanishing identically on a fixed component form a linear subspace, and there are only finitely many components; since the ground field is infinite, a vector space is not a finite union of proper subspaces, so otherwise we could find $n$ linearly independent linear combinations of $f_1, \ldots, f_n$ (hence all the $f_i$) vanishing identically on a single irreducible component of $V(g_1, \ldots, g_{j-1})$ containing $x$. That would make the local dimension of $X$ at $x$ at least $n - j + 1 > k$, a contradiction. Now let $g_j$ be such a linear combination, and repeat. $\square$
Once you find $g_1, \ldots, g_{n-k}$ as above, let $Y := V(g_1, \ldots, g_{n-k})$. The second observation is that $mult_x(X) \leq mult_x(Y)$, which follows from the following general fact:
Sub-claim 2: Let $X \subseteq Y \subseteq k^n$ be a chain of closed subschemes and $x$ be a closed point of $X$ such that $X$ and $Y$ have the same local dimension at $x$. Then $mult_x(X) \leq mult_x(Y)$.
Proof: For each $q \geq 0$, there is a surjection
$$O_{x,Y}/m_x^qO_{x,Y} \to O_{x,X}/m_x^qO_{x,X}$$
where $m_x$ is the ideal of $x$. Treating this surjection as an $O_{x,Y}$-module homomorphism, it follows that
$$l(O_{x,Y}/m_x^qO_{x,Y}) \geq l(O_{x,X}/m_x^qO_{x,X})$$
where $l$ is the "length" (see e.g. Atiyah-Macdonald, Proposition 6.9). By the assumption on dimension, for $q \gg 1$ both of these lengths are given by polynomials in $q$ of degree $d$, where $d := \dim_x(X) = \dim_x(Y)$. It follows that the coefficient of $q^d$ in the polynomial giving the values of $l(O_{x,Y}/m_x^qO_{x,Y})$ cannot be smaller than the corresponding coefficient of the polynomial for $l(O_{x,X}/m_x^qO_{x,X})$; since $mult_x$ is $d!$ times this leading coefficient, the sub-claim follows. $\square$
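As a sanity check on these Hilbert-Samuel polynomials, here is a small example of mine, not from the original: for the cuspidal cubic $Y = V(y^2 - x^3) \subset k^2$ at the origin, the completion of $O_{0,Y}$ is $k[[t^2, t^3]]$ (via $x = t^2$, $y = t^3$), with $m_0 = (t^2, t^3)$, and the lengths can be computed there:

```latex
l\bigl(O_{0,Y}/m_0^q O_{0,Y}\bigr) \;=\; \#\{\,j \in \{0,2,3,\dots\} : j < 2q\,\} \;=\; 2q - 1
\qquad (q \ge 1),
```

a polynomial of degree $d = 1$ whose leading coefficient $2$ recovers $mult_0(Y) = 2$, i.e. $d!$ times the coefficient of $q^d$.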
The third observation is that if one of the $f_i$ appears in linear combinations defining more than one $g_j$, say $g_{j_1}, g_{j_2}$, then replacing $g_{j_2}$ by an element of the form $g_{j_2} + ag_{j_1}$ for an appropriate $a \in k$ we may ensure that $f_i$ does not appear in the linear combination defining $g_{j_2}$. In this way we can find appropriate "invertible" linear combinations $g'_1, \ldots, g'_{n-k}$ of $g_1, \ldots, g_{n-k}$ such that
- $V(g_1, \ldots, g_{n-k}) = V(g'_1, \ldots, g'_{n-k})$, and
- there is a reordering of $f_1, \ldots, f_n$ such that $\deg(g'_j) \leq \deg(f_j)$, $j = 1, \ldots, n-k$.
The final observation is that $mult_x(Y) = mult_x(Y \cap H)$ for a generic affine subspace $H$ of dimension $k$ containing $x$. On the other hand, if $h_1, \ldots, h_k$ are linear polynomials cutting out such an $H$, so that $x$ is an isolated point of $V(g'_1, \ldots, g'_{n-k}, h_1, \ldots, h_k)$, then we are done by the $k = 0$ case.
There are a lot of subtle reasons such exponential sums can fail to exhibit square-root cancellation. First let me comment on two reasons suggested in your answer:
This should never be a problem. Katz proved simple, explicit Betti bounds a while ago that suffice for problems of this type, so in any kind of sophisticated cohomological argument the explicit constant should be no worse than the one arising from Katz's bounds (which depend only on the degree and the number of variables, and are not incredibly far from the optimum).
This can indeed pose a problem, and removing it can simplify things: basically, it should almost always let us use only the classical, characteristic-zero topological versions of sheaves and monodromy and such, instead of their exotic étale-cohomological incarnations, but it does not remove them entirely.
Why is there so much difficulty in the statements of these results, and a seeming need to invoke difficult geometric concepts?
One reason, or at least one way of shedding light on the difficulty, is that the same geometric tools (sheaves, monodromy, and more elementary geometric concepts) that let us prove upper bounds for exponential sums in some cases let us prove lower bounds in other cases. So we can rig up exponential sums that fail to exhibit square-root cancellation for more or less subtle geometric reasons. Thus any bound requires, one way or another, avoiding these examples.
Here's a simple one. The sum
$$ \sum_{a \in F} \left(\sum_{x_1,x_2\in F} e( x_1 +x_2 +a /(x_1x_2))\right)^n,$$ which can be expanded into a $2n+1$-variable sum, admits full square-root cancellation if and only if $n$ is not divisible by $3$.
Why? The monodromy group of the hyper-Kloosterman sum, computed by Katz, is $SL_3$. Katz's computation shows that the inner sum over $x_1,x_2$ always has full square-root cancellation, but the sum over $a$ cancels if and only if the $n$th tensor power of the standard representation of the monodromy group admits no invariant vector; an invariant vector exists (and contributes a main term) if and only if $n$ is divisible by $3$.
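The invariant-vector criterion can be checked directly; a standard argument (my addition, assuming the monodromy group is exactly $SL_3$ as stated): the center $\mu_3 \subset SL_3$ acts on $\mathrm{Std}^{\otimes n}$ by the scalar $\zeta^n$, so there are no invariants unless $3 \mid n$; conversely, when $n = 3m$ the determinant supplies one:

```latex
\omega \;=\; \sum_{\sigma\in S_3} \operatorname{sgn}(\sigma)\,
e_{\sigma(1)}\otimes e_{\sigma(2)}\otimes e_{\sigma(3)} \;\in\; \mathrm{Std}^{\otimes 3},
\qquad g\cdot\omega = (\det g)\,\omega = \omega \quad\text{for } g \in SL_3,
```

since $\omega$ is the image of $e_1\wedge e_2\wedge e_3$ under $\Lambda^3\,\mathrm{Std} \hookrightarrow \mathrm{Std}^{\otimes 3}$; then $\omega^{\otimes m}$ is an invariant vector in $\mathrm{Std}^{\otimes 3m}$.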
One more elementary example arose in a conversation with a couple of mathematicians who were specifically interested in character sums of products of linear factors. I therefore considered:
$$ \sum_{x,y \in F} \chi \left( xy \prod_{\zeta_1,\zeta_2\in \mu_3} (1 + \zeta_1 x+\zeta_2 y) \right) $$
where $F$ is a field of size congruent to $1$ mod $6$, $\chi$ is a character of the multiplicative group of $F$ of order $2$ (i.e. a Legendre symbol - certainly something that appears often in analytic number theory - if $|F|$ is prime) and $\mu_3$ are the third roots of unity in $F$.
This product of linear factors can actually be expressed, up to a square factor, as a rational function of $t = \frac{xy}{ x^3+y^3+1}$ by an algebraic manipulation related to the Dwork family, and expressed in that variable one can see that the sum does not admit square-root cancellation.
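One way to carry out the manipulation (my computation, not spelled out in the original, using the norm identity $\prod_{\zeta\in\mu_3}(a+b\zeta+c\zeta^2)=a^3+b^3+c^3-3abc$ with $a = 1+x^3+y^3$, $b = 3y$, $c = 3y^2$):

```latex
\begin{aligned}
\prod_{\zeta_1,\zeta_2\in\mu_3} (1+\zeta_1 x+\zeta_2 y)
  &= \prod_{\zeta_2\in\mu_3}\bigl((1+\zeta_2 y)^3 + x^3\bigr)
   = (1+x^3+y^3)^3 - 27x^3y^3, \\
xy\bigl((1+x^3+y^3)^3 - 27x^3y^3\bigr)
  &= (x^3+y^3+1)^4\, t\,(1-27t^3),
  \qquad t = \frac{xy}{x^3+y^3+1},
\end{aligned}
```

and since $\chi$ has order $2$, the fourth-power factor is invisible to it, so away from the vanishing of $x^3+y^3+1$ the summand is $\chi\bigl(t(1-27t^3)\bigr)$.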
So for any bound on multiplicative character sums one somehow needs to rule out examples like this one.
A third example concerns sums of the special form $$\sum_{x_1,\dots,x_n\in F, y\in F^*} e( y f(x_1,\dots, x_n))$$ where one can trivially eliminate the variable $y$, but this just reduces one to the not-obviously-simpler problem of counting zeroes of $f$. Counting zeroes of $f$ can be as hard as counting points on basically any algebraic variety, and some, like abelian varieties, do not exhibit square-root cancellation for reasons that are geometrically meaningful (in this case, related to the topology of a complex torus) but not at all apparent from staring at the equations.