The approach you suggest is to fit a model to sampled data from each condition, then use some measure of the models' variability to test whether they're different at a particular set of time points. This seems similar to some established regression techniques, which involve assuming you know the basic functional form, then using that form to do bootstrapping. These methods are the parametric bootstrap and bootstrapping residuals (more about this in a second). So, I think your general idea makes sense; the question is how to implement it.
After fitting models to the time series, the question is how to estimate variability at the time points of interest. A bootstrapping approach could work. But, it won't be possible to use the simple bootstrap (i.e. to resample the data points) because the data are correlated in time.
That's why Rob Hyndman suggested the parametric bootstrap. In that approach, you'd fit a model to each time series, repeatedly simulate new data from the model, then run statistics on the simulated data. The model in this case wouldn't be a simple curve, but a generative model (i.e. it would have to give a probability distribution from which you could sample new points).
Here's a paper using that approach. They use Gaussian process regression to model the time series and do parametric bootstrapping. Their method might work well for your data. You'd use the same model fitting and bootstrap procedure, but the thing you'd test would be the equality of the mean at particular time points.
Kirk and Stumpf (2009). Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data
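To make the parametric-bootstrap idea concrete, here's a minimal sketch. It uses a toy generative model (cubic trend plus i.i.d. Gaussian noise) instead of the Gaussian process regression from the paper, and the data are simulated placeholders, but the loop structure — fit, simulate from the fitted model, refit, record — is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observed series (placeholder data, for illustration only).
t = np.linspace(0, 1, 50)
x = np.sin(2 * np.pi * t) + rng.normal(scale=0.2, size=t.size)

# Fit a simple generative model: cubic trend + Gaussian noise.
coefs = np.polyfit(t, x, deg=3)
fitted = np.polyval(coefs, t)
sigma = np.std(x - fitted, ddof=4)  # residual SD (4 params estimated)

# Parametric bootstrap: simulate new series from the fitted model,
# refit, and record the refit curve's value at a time point of interest.
t_star = 0.25
boot_vals = []
for _ in range(2000):
    x_sim = fitted + rng.normal(scale=sigma, size=t.size)
    c_sim = np.polyfit(t, x_sim, deg=3)
    boot_vals.append(np.polyval(c_sim, t_star))

# Percentile CI for the curve's value at t_star.
lo, hi = np.percentile(boot_vals, [2.5, 97.5])
print(f"95% CI at t={t_star}: ({lo:.3f}, {hi:.3f})")
```

For the two-condition comparison, you'd run this for each condition and bootstrap the difference of the fitted curves rather than a single curve's value.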
Another possibility along similar lines is to resample the residuals. The procedure would look like this:
Let's say the sampled time series is $\{x_1, ..., x_n\}$ for the first condition and $\{y_1, ..., y_n\}$ for the second condition. Say we're interested in the differences at time points $t_1, t_2, t_3$.
- Fit a model to each sampled time series (e.g. weighted sum of Gaussian basis functions, as you're currently using). The model should fit the data well and represent your beliefs about its temporal structure.
- Evaluate the models at each sampled time point. Call these fitted values $\hat{x}_i$ and $\hat{y}_i$ for each time point $i$.
- Compute the residuals at each time point: $a_i = x_i - \hat{x}_i$, $b_i = y_i - \hat{y}_i$.
- Generate a synthetic version of each time series by resampling the residuals:
- For each condition at each time point, randomly draw a residual (from all time points, with replacement), and add it to the fitted value (at that time point).
- For condition 1 at time point $i$, the synthetic time series is $x^*_i = \hat{x}_i + a_j$, where $j$ is a randomly chosen integer from $1$ to $n$.
- For condition 2 at time point $i$, the synthetic time series is $y^*_i = \hat{y}_i + b_k$, where $k$ is a separate, randomly chosen integer from $1$ to $n$.
- Fit a new model to each of the synthetic time series.
- The models should have the same functional form as used to fit the original data.
- The model fit to $x^*$ is $f_{(i)}(t)$. The model fit to $y^*$ is $g_{(i)}(t)$.
- The subscript $i$ denotes the current bootstrap sample (i.e. the number of times we've run through the loop).
- The models are continuous functions of time, so they can be evaluated at any time point $t$.
- Evaluate the new models (fit to the synthetic data) at the time points of interest, where we want to calculate the differences. That is, calculate $f_{(i)}(t)$ and $g_{(i)}(t)$ for $t \in \{t_1, t_2, t_3\}$. Record these values.
- Repeat the resampling, refitting, and evaluation steps many times (e.g. 10,000). Each iteration produces a single bootstrap sample.
- We now have a set of bootstrapped function values at each of the time points of interest. That is: $f_{(i)}(t)$ and $g_{(i)}(t)$ for $t \in \{t_1, t_2, t_3\}$ and for $i \in \{1, ..., 10000\}$
- For each time point of interest $t$, run a statistical test comparing the bootstrapped values $f_{(i)}(t)$ vs. $g_{(i)}(t)$. I.e. test the null hypothesis that there's no difference. Or, even better, calculate a confidence interval on the difference, since it's more informative.
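The procedure above can be sketched as follows. This is a minimal illustration, assuming a Gaussian-basis-function fit by least squares (as mentioned in the first step) and simulated placeholder data for the two conditions; the basis centers, width, and time points of interest are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_gaussian_basis(t, values, centers, width):
    """Least-squares fit of a weighted sum of Gaussian basis functions."""
    Phi = np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))
    w, *_ = np.linalg.lstsq(Phi, values, rcond=None)
    return w

def evaluate(w, t, centers, width):
    """Evaluate the fitted curve at arbitrary time points."""
    Phi = np.exp(-((np.atleast_1d(t)[:, None] - centers[None, :]) ** 2)
                 / (2 * width ** 2))
    return Phi @ w

# Hypothetical sampled series for the two conditions (placeholders).
t = np.linspace(0, 1, 60)
x = np.sin(2 * np.pi * t) + rng.normal(scale=0.15, size=t.size)
y = np.sin(2 * np.pi * t + 0.3) + rng.normal(scale=0.15, size=t.size)

centers = np.linspace(0, 1, 8)
width = 0.15
t_interest = np.array([0.2, 0.5, 0.8])  # t_1, t_2, t_3

# Fit each series, evaluate at the sampled points, compute residuals.
wx = fit_gaussian_basis(t, x, centers, width)
wy = fit_gaussian_basis(t, y, centers, width)
a = x - evaluate(wx, t, centers, width)
b = y - evaluate(wy, t, centers, width)

# Repeatedly: resample residuals with replacement, add them to the
# fitted values, refit, and evaluate at the time points of interest.
n_boot = 2000
f_boot = np.empty((n_boot, t_interest.size))
g_boot = np.empty((n_boot, t_interest.size))
for i in range(n_boot):
    x_star = evaluate(wx, t, centers, width) + rng.choice(a, size=t.size)
    y_star = evaluate(wy, t, centers, width) + rng.choice(b, size=t.size)
    f_boot[i] = evaluate(fit_gaussian_basis(t, x_star, centers, width),
                         t_interest, centers, width)
    g_boot[i] = evaluate(fit_gaussian_basis(t, y_star, centers, width),
                         t_interest, centers, width)

# Percentile CI on the difference at each time point of interest.
diff = f_boot - g_boot
for j, tj in enumerate(t_interest):
    lo, hi = np.percentile(diff[:, j], [2.5, 97.5])
    print(f"t={tj}: CI on f - g is ({lo:.3f}, {hi:.3f})")
```

A CI on the difference that excludes zero at a given time point corresponds to rejecting the null of no difference there.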
Resampling the residuals relies on the assumption that the residuals are identically distributed, so it would be good to check that this is true. This condition could be violated, for example, if the variance changes over time.
Possibly of interest:
This chapter describes bootstrapping residuals.
These notes briefly compare the simple bootstrap, parametric bootstrap, and bootstrapping the residuals.
I have 6 more independent variables, and I used a one-way ANOVA, t-test, Kruskal-Wallis test, or Mann-Whitney test separately, depending on whether the assumptions were violated.
Running separate regressions or tests with each of the independent variables is not the best way to proceed. If you omit any independent variable that is associated with the outcome and is correlated with variables in the model, then the regression coefficients you get risk being biased. With 100 cases you have more than 14 cases per independent variable, so you should be able to fit them all together in a multiple regression without overfitting your data.
So far, what you show doesn't suggest there is an association between time sitting and your Boston score, although there are problems with each of the tests you show (as noted in comments). That might improve if you take other independent variables into account in a multiple regression. I wouldn't worry much yet about residual plots and so forth for this single-predictor regression; those might also improve when you take all of your predictors into account together.
If your outcome is the sum of 19 items, each with a 1-5 scale, then you should be able to treat that as continuous. If your full multiple regression model still shows problems you could consider ordinal logistic regression, which models the ordering of outcomes without requiring assumptions related to residuals.
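As a rough sketch of fitting everything together, here's an ordinary-least-squares multiple regression with all seven predictors at once. The data are simulated stand-ins (the predictor names and coefficients are hypothetical, not from your study); the point is the single joint fit rather than seven separate tests.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100  # number of cases, as in the question

# Hypothetical predictors: sitting time plus six other covariates.
X = rng.normal(size=(n, 7))
beta_true = np.array([0.0, 0.5, -0.3, 0.2, 0.0, 0.4, -0.1])
score = 40 + X @ beta_true + rng.normal(scale=2.0, size=n)

# Fit all predictors together in one OLS model.
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, score, rcond=None)

# Standard errors from the usual OLS formula.
resid = score - design @ beta_hat
dof = n - design.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))

names = ["intercept", "sitting"] + [f"x{i}" for i in range(2, 8)]
for name, b, s in zip(names, beta_hat, se):
    print(f"{name:9s} coef={b: .3f} se={s:.3f}")
```

In practice a regression package (e.g. `statsmodels` in Python or `lm` in R) would give you this plus diagnostics with far less code; the manual version is just to show what's being estimated.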
Spearman correlation is fine as far as it goes, but don't stop there. What if there is a nonlinear relationship? E.g., perhaps the cost difference between those who rate the toy bad vs. medium is not comparable to the cost difference between those who rate it medium vs. good. An ANOVA would help you detect this. There's also a reason to use ANOVA instead of two t-tests. Others might explain it better, but in a nutshell, the combination of an omnibus test and (if significant) post hoc tests preserves Type I and Type II error rates better than the two t-tests would.
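The omnibus-then-post-hoc workflow might look like this. The three rating groups and their costs are simulated placeholders; a Bonferroni correction stands in for more refined post hoc procedures such as Tukey's HSD.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical cost data for the three rating groups (illustrative only).
bad = rng.normal(10.0, 2.0, size=30)
medium = rng.normal(10.5, 2.0, size=30)
good = rng.normal(13.0, 2.0, size=30)

# Omnibus one-way ANOVA across all three groups.
F, p_omnibus = stats.f_oneway(bad, medium, good)
print(f"ANOVA: F={F:.2f}, p={p_omnibus:.4f}")

# Post hoc pairwise t-tests only if the omnibus test is significant,
# with a Bonferroni correction for the three comparisons.
if p_omnibus < 0.05:
    pairs = [("bad vs medium", bad, medium),
             ("bad vs good", bad, good),
             ("medium vs good", medium, good)]
    for label, g1, g2 in pairs:
        t, p = stats.ttest_ind(g1, g2)
        print(f"{label}: p={min(p * len(pairs), 1.0):.4f} (Bonferroni)")
```

Gating the pairwise tests on the omnibus result, and correcting them for multiplicity, is what keeps the overall error rates under control.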