OK, thanks to Steffen for the useful comments. I guess the answer is quite simple in the end. As he says, all we need to do is store the current denominator (call it $z$):
$z(t) = |\{d:t\in d\}|$
Now given a new document $d^*$, we update the denominator simply by:
$z^*(t) = z(t) + \left\{ \begin{array}{ll} 1 & \mbox{if}\; {t\in d^*} \\ 0 & \mbox{otherwise} \end{array} \right.$
We would then have to recalculate the $tf-idf$ scores based on the new $idf$ vector.
Similarly, to remove an old document we decrement the denominator for each term it contains.
This does mean that we either have to store the entire $tf$ matrix as well as the $tf-idf$ matrix (doubling the memory requirements), or we have to compute the $tf-idf$ scores when needed (increasing computational costs). I can't see any way round that.
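To make the bookkeeping concrete, here is a minimal Python sketch of the incremental update (the function and variable names are hypothetical, not from any particular library); `df` plays the role of $z$, and $idf$ is recomputed on demand rather than stored:

```python
import math
from collections import Counter

df = Counter()     # z(t): number of documents containing term t
doc_terms = {}     # doc id -> set of distinct terms (kept so documents can be removed)
n_docs = 0

def add_document(doc_id, terms):
    """Increment z(t) for every distinct term of a newly added document."""
    global n_docs
    unique = set(terms)
    doc_terms[doc_id] = unique
    for t in unique:
        df[t] += 1
    n_docs += 1

def remove_document(doc_id):
    """Decrement z(t) for every distinct term of a removed document."""
    global n_docs
    for t in doc_terms.pop(doc_id):
        df[t] -= 1
        if df[t] == 0:
            del df[t]
    n_docs -= 1

def idf(term):
    """Recompute idf on demand from the stored denominator, here as log(N / z(t))."""
    return math.log(n_docs / df[term]) if df.get(term) else 0.0
```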
For the second part of the question, about the evolution of $idf$ vectors over time, it seems that we can use the above method, and store a set of "landmark" $z$ vectors (denominators) for different date ranges (or perhaps content subsets). Of course $z$ is a dense vector of the length of the dictionary so storing a lot of these will be memory intensive; however this is probably preferable to recomputing $idf$ vectors when needed (which would again require storing the $tf$ matrix as well or instead).
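Continuing the same (hypothetical) sketch, a landmark is just a labelled copy of $z$ together with the document count at that point in time:

```python
landmarks = {}   # label (e.g. a date range) -> (document count, copy of df)

def save_landmark(label):
    """Snapshot the current denominator so idf can later be reconstructed for this period."""
    landmarks[label] = (n_docs, dict(df))

def idf_at(label, term):
    """idf of a term as it stood when the labelled landmark was taken."""
    n, snapshot = landmarks[label]
    return math.log(n / snapshot[term]) if snapshot.get(term) else 0.0
```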
The aspect emphasised is that the relevance of a term or a document does not increase proportionally with term (or document) frequency, so using a sub-linear function helps damp this effect. To that extent, the influence of very large or very small values (e.g. very rare words) is also dampened. Finally, as most people intuitively perceive scoring functions to be somewhat additive, using logarithms turns the product of probabilities of independent terms, $P(A, B) = P(A) \, P(B)$, into a sum: $\log(P(A,B)) = \log(P(A)) + \log(P(B))$.
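As a quick numerical illustration of the damping, using the common sub-linear weighting $1 + \log(\mathrm{tf})$ (one choice among several):

```python
import math

# Under a sub-linear 1 + log(tf) scheme, even a thousand-fold increase in the
# raw count does not increase the weight proportionally.
for tf in (1, 2, 10, 100, 1000):
    print(tf, round(1 + math.log(tf), 2))
# 1 -> 1.0, 2 -> 1.69, 10 -> 3.3, 100 -> 5.61, 1000 -> 7.91
```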
As the Wikipedia article you link notes, the justification of TF-IDF is still not well established; it is (or was) a heuristic that we want to make rigorous, not a rigorous concept we want to transfer to the real world.
As mentioned by @Anony-Mousse, a very good read on the matter is Robertson's Understanding Inverse Document Frequency: On Theoretical Arguments for IDF. It gives a broad overview of the whole framework and attempts to ground the TF-IDF methodology in the relevance weighting of search terms.
As you will see pointed out wherever tf-idf is discussed, there is no universally agreed single formula for computing tf-idf or even (as in your question) idf. The purpose of the $+1$ is to accomplish one of two objectives: (a) to avoid division by zero when a term appears in no documents (even though this would not happen in a strictly "bag of words" approach), or (b) to set a lower bound so that a term is not given a zero weight just because it appears in all documents.
I've actually never seen the formulation $\log(1+\frac{N}{n_t})$, although you mention a textbook. But the purpose would be to set a lower bound of $\log(2)$ rather than zero, as you correctly interpret. I have seen $1 + \log(\frac{N}{n_t})$, which sets a lower bound of 1. The most commonly used computation seems to be $\log(\frac{N}{n_t})$, as in Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008), Introduction to Information Retrieval, Cambridge University Press, p. 118, or Wikipedia (based on similar sources).
Not directly relevant to your query, but the upper bound is not $\infty$; it is $k + \log(N/s)$ where $k, s \in \{0, 1\}$, depending on your smoothing formulation. This maximum occurs for terms that appear in 0 or 1 documents (again, it depends on whether you smooth with $s$ to make idf defined for terms with zero document frequency; if not, the maximum occurs for terms that appear in just one document). IDF $\rightarrow \infty$ only when $1 + n_t = 1$ (i.e. $n_t = 0$) and $N \rightarrow \infty$.
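For concreteness, here is a small sketch (not a canonical implementation) comparing the three formulations above and the lower bounds they imply, reached when a term appears in every document ($n_t = N$):

```python
import math

# Three idf variants discussed above; N = number of documents, n_t = document frequency.
def idf_plain(N, n_t):    # log(N / n_t): lower bound 0 at n_t = N
    return math.log(N / n_t)

def idf_log1p(N, n_t):    # log(1 + N / n_t): lower bound log(2) at n_t = N
    return math.log(1 + N / n_t)

def idf_plus1(N, n_t):    # 1 + log(N / n_t): lower bound 1 at n_t = N
    return 1 + math.log(N / n_t)

N = 1000
for n_t in (1, 10, N):    # rarest possible term, a moderately common term, a term in every document
    print(n_t, idf_plain(N, n_t), idf_log1p(N, n_t), idf_plus1(N, n_t))
```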