One big problem I see with ordinary least squares is the variance is clearly not constant but increases as $x$ does; I'd be inclined to consider a model where spread is related to mean (possibly proportional to it), suggesting either a generalized nonlinear model or maybe the use of a transformation (such as a log-transform) to stabilize the variance.
Without dealing with that issue, your variance-covariance and standard error estimates are biased.
Since you want to have estimates that are functions of such quantities, addressing this problem will be central to obtaining a usable answer.
In response to comments:
The standard errors of the parameter estimates and the residual standard deviation are built on the assumption of constant variance. This assumption is clearly violated. Consider if (by eye) I draw vertical bars intended to cover about 99.something percent of the local spread at each of several different x-values:
If calculations are based off that equality assumption (as the usual standard errors are), then they will likely be of little value.
The authors in the paper that your second plot there comes from (a page reference would have been nice btw - I am not an astronomer, so I have little idea what's going on in the rest of it) appear to have used a fit where spread is not constant for the y-scale of their plot, and indeed seems to increase in a fashion consistent with their error bars, so it's not clear they necessarily have a problem.
The blog post you link to doesn't look like the plot from the paper; it appears to be using different assumptions. Things that look superficially similar may not be.
If the spread looks roughly constant on the log scale (and there are no actual 0's), then you could fit a nonlinear least squares model of the form $E(y) = \log(a e^{bx}+c)$, and then convert that to a fit on the unstransformed scale - but beware that expectations on the two scales don't directly correspond. If you have near normality on the log-scale you can still get an estimated expectation on the original scale. Failing that you might be left with either making a particular distributional assumption, or a Taylor approximation to the expectation.
Several kinds of intervals
There are several main kinds of intervals people may wish to place on a plot. Here I discuss three:
$\;$ 1. a pointwise (confidence) interval (based on conditional standard error of the mean). This is an interval for where the curve (the conditional mean) may be, for each x-value. That is, specify a particular $x$, and take a vertical slice there. We can produce an estimate of the standard error for the mean at that $x$; a repeated set of such two standard-error intervals* (from different data) at a given $x$ would be expected to include the population mean at that $x$ about 95% of the time if certain assumptions were satisfied.
This seems to be something like what is being attempted in the blog post you mention in your final link, but crucially, looking at the Python code they seem to be ignoring the effect of the parameter covariances, which would make the result wrong.
$\;$ *(We would not say "$2\sigma$" though, since $\sigma$ would conventionally be a population quantity, and this is a sample estimate of something)
$\;$ 2. An overall interval for the mean. This would still be an interval for the mean but one based on global coverage probabilities. (I don't think you're after this.)
$\;$ 3. A (pointwise) prediction interval. This would be an interval with a specified probability of including a new observation at each given $x$. It includes both parameter uncertainty (due to both variances and covariances of parameter estimates) and the observation "noise" (based on a sample estimate of what I'd call $\sigma$). This is what at first glance seems to be going on in the second plot, but I am not 100% certain.
You should clarify what kind of interval you need, and what exactly you mean by "$\sigma$".
Since you aren't really familiar with the usual standard errors and intervals in linear models, let's cover those. (The interval calculations below are all pointwise, not global.)
Standard errors and intervals in linear models
A. Standard error of mean, confidence interval for mean
For a linear model, $y=X\beta+\varepsilon$, $\text{Var}(\varepsilon)=\sigma^2I$, we have for a matrix $C$.
$\text{Var}(C\hat{\beta})=\sigma^2C(X'X)^{-1}C'$.
Hence, if $c$ is a row-vector of constants, $\text{se}(c\hat\beta)=\sigma\sqrt{c(X'X)^{-1}c'}$.
If you estimate $\hat{\sigma^2}$ by $s^2=\frac{1}{n-p}\sum_i e_i^2$ (where $p$ is the number of predictors including the intercept term), then $\frac{c\hat\beta-c\beta}{\text{se}(c\hat\beta)}$ is a pivotal quantity* with a t-distribution with $n-p$ d.f.
A $1-\alpha$ confidence interval for a mean (at $c=x^*$) is therefore $x* \hat\beta \pm t_{1-\alpha/2} s \sqrt{x^* (X'X)^{-1} {x^*}'}$, or more succinctly, $\hat{y} \pm t_{1-\alpha/2} s \sqrt{x^* (X'X)^{-1} {x^*}'}$.
* basically a pivotal quantity is a function of a parameter or parameters and data whose distribution doesn't depend on the parameter(s)
B. Process variation
This is just the estimated standard deviation of the process, $\sigma$, usually estimated by $s$. If we knew the true (conditional) mean at $x^*$, you'd expect about 68% of the observations within $\sigma$ of it, and about 95% within $2\sigma$ of it. The fact that $s$ is estimated alters that somewhat (but I won't labor the point by giving all the details here unless you need them, since this is usually not what people are after).
C. Standard error of prediction, prediction interval for an observation
This incorporates both of the previous two components. Let $y^*$ be an observation not in our sample.
Then the variance of $y^{*}-\hat{y}^{*}$ is $\sigma^2+\sigma^2x^*(X'X)^{-1} {x^*}' = \sigma^2(1+x^* (X'X)^{-1} {x^*}')$, and so the standard error is $\sigma$ $\sqrt{1+x^* (X'X)^{-1} {x^*}'}$
Extending (slightly) the notion of a pivotal quantity, in similar fashion to the confidence interval we can construct a $1-\alpha$ interval for $y^{*}$: $\hat{y}^*$ $\pm$ $t_{1-\alpha/2}$ $s$
$\sqrt{1+x^* (X'X)^{-1} {x^*}'}$
Similar (if approximate) intervals can be fairly easily constructed in nonlinear models; for example, via a Taylor expansion. If you're able to identify which (if any) of the above calculations you need, we may be able to explore how to approach that.
Best Answer
This is called the Delta Method.
Suppose that you have some function $y = G(\beta,x) + \epsilon$; note that $G(\cdot)$ is a function of the parameters that you estimate, $\beta$, and the values of your predictors, $x$. First, find the derivative of this function with respect to your vector of parameters, $\beta$: $G^\prime(\beta, x)$. This says, if you change a parameter by a little bit, how much does your function change? Note that this derivative may be a function of your parameters themselves as well as the predictors. For example, if $G(\beta,x) = \exp (\beta x)$, then the derivative is $x \exp (\beta x)$, which depends upon the value of $\beta$ and the value of $x$. To evaluate this, you plug in the estimate of $\beta$ that your procedure gives, $\hat{\beta}$, and the value of the predictor $x$ where you want the prediction.
The Delta Method, derived from maximum likelihood procedures, states that the variance of $G\left(\hat{\beta}, x\right)$ is going to be $$G^\prime\left(\hat{\beta},x\right)^T \text{Var}\left(\hat{\beta}\right) G^\prime\left(\hat{\beta},x\right),$$ where $\text{Var}\left(\hat{\beta}\right)$ is the variance-covariance matrix of your estimates (this is equal to the inverse of the Hessian---the second derivatives of the likelihood function at your estimates). The function that your statistics packages employs calculates this value for each different value of the predictor $x$. This is just a number, not a vector, for each value of $x$.
This gives the variance of the value of the function at each point and this is used just like any other variance in calculating confidence intervals: take the square root of this value, multiply by the critical value for the normal or applicable t distribution relevant for a particular confidence level, and add and subtract this value to the estimate of $G(\cdot)$ at the point.
For prediction intervals, we need to take the variance of the outcome given the predictors $x$, $\text{Var}(y \mid x) \equiv \sigma^2$, into account. Hence, we must boost our variance from the Delta Method by our estimate of the variance of $\epsilon$, $\hat{\sigma}^2$, to get the variance of $y$, rather than the variance of the expected value of $y$ that is used for confidence intervals. Note that $\hat{\sigma}^2$ is the sum of squared errors (
SS
in help file notation) divided by the degrees of freedom (DF
).In the notation used in the help file above, it looks like their value of
c
does not take $\sigma^2$ into account; that is, the inverse of their Hessian is $\sigma^{-2}$ times the one that I give. I'm not sure why they do that. It could be a way of writing the confidence and prediction intervals in a more familiar way (of $\sigma$ times some number times some critical value). The variance that I give is actuallyc*SS/DF
in their notation.For example, in the familiar case of linear regression, their
c
would be $\left(x^\prime x\right)^{-1}$, while the $\text{Var}\left(\hat{\beta}\right) = \sigma^2 \left(x^\prime x\right)^{-1}$.