It seems to me that a first step would be to try to create some models of how tcp header data might relate to your categories. That is, do you have any theories?
If you do, it might turn out that you need to preprocess your packet info: for example, using the window size of the previous packet rather than the current one, or using the day of the week instead of the day of the month.
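As a minimal sketch of that kind of preprocessing in R (the `packets` data frame and its column names are invented for illustration):

```r
# Hypothetical packet data; the column names are assumptions, not your schema
packets <- data.frame(
  timestamp   = as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + (0:4) * 3600,
  window_size = c(65535, 32768, 65535, 16384, 65535)
)

# Day of week instead of day of month
packets$weekday <- weekdays(packets$timestamp)

# Window size of the previous packet rather than the current one (lag by 1)
packets$prev_window <- c(NA, head(packets$window_size, -1))
```

The first row's lagged value is `NA`, so you would drop or impute it before modelling.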
Then you need to look carefully at your inputs and outputs. Are they categorical ("car", "truck"), ordered categorical ("small", "medium", "large"), etc? Your linear regression is probably treating your categories like they're continuous (1..N) and your plot shows there's no such linear relationship -- and there's probably no reason to expect there should be.
Once you have an idea of models that might make sense, have meaningful variables, and know the types of these variables, methods will naturally fall into place. (For example, continuous variables in and binary category out naturally suggests logistic regression.)
EDIT: In terms of logistic regression, it can be used with multiple outcomes. Look for multinomial logistic regression.
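A short sketch of multinomial logistic regression using `nnet::multinom` (the built-in `iris` data stands in for your packet features and categories):

```r
library(nnet)  # ships with base R as a recommended package

# Multinomial logistic regression: continuous predictors in,
# a multi-level category out
fit <- multinom(Species ~ Sepal.Length + Sepal.Width,
                data = iris, trace = FALSE)

preds <- predict(fit, newdata = iris)
in_sample_acc <- mean(preds == iris$Species)
```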
In terms of validation, you train your model with your training set then predict on the validation data and see how accurate you are. Obviously, if you look at your accuracy on your training data, it'll tend to overestimate your accuracy since it's what you tuned your model to. A better test of how you'll do in the real world is to use data that your tuning (training) process never used.
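A sketch of that train/validate split in R, again using `iris` as a stand-in for your data:

```r
set.seed(42)  # for a reproducible split
n <- nrow(iris)
train_idx <- sample(n, size = floor(0.7 * n))
train <- iris[train_idx, ]
valid <- iris[-train_idx, ]

# Fit on the training set only
fit <- nnet::multinom(Species ~ ., data = train, trace = FALSE)

# Training accuracy will tend to be optimistic;
# validation accuracy is the more honest estimate
train_acc <- mean(predict(fit, train) == train$Species)
valid_acc <- mean(predict(fit, valid) == valid$Species)
```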
I believe this is one source of confusion:

"... and there is no point at which the CI excludes 0, suggesting the means are the same at every point"
On the face of it, the data and smooths suggest that the difference of means is not exactly zero. Put another way, whatever difference in means there is between the two groups is small relative to the uncertainty in the estimates of those means. You observed a difference, but you cannot say with any level of confidence that this observed difference reflects a real difference in the population of subjects you might have sampled.
Why is each of the model terms significant while the difference is not? This comes from propagating the uncertainty in all of the things you estimated into the computation of the difference of means as a function of age.
Controlling for the smooth effect of Age by sex, you detect a small difference between the two groups in the mean response. The two smooth terms were assessed to be unlikely to have arisen if the true functions were flat, constant functions.
Computing a difference of means as a function of age requires you to bring together all these estimated effects and propagate their uncertainties into your estimates of the difference in the response between the two groups as a function of age.
You could try a more direct estimation of the thing you are interested in by using the ordered-factor method.
data <- transform(data, oSex = ordered(Sex))
m <- gam(Weight ~ oSex + s(Age) + s(Age, by = oSex) + s(Subject, bs = "re"),
         data = data, method = "REML")

where s(Age) is the smooth effect of Age for the reference level of Sex, while s(Age, by = oSex) is the smooth difference between the smooth effect of Age in the reference level and the other level of Sex.
This shouldn't change the outcome much though; I suspect you'll see a small effect of Sex
overall once you condition on the smooth effects of Age in both groups, but you won't find large ("significant") differences when you ask the more specific question of "At what ages are the groups different?".
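A self-contained sketch of this ordered-factor setup on simulated data (the variable names and the data-generating process are invented, and the Subject random effect is omitted since there are no repeated measures in the simulation):

```r
library(mgcv)
set.seed(1)

# Simulated stand-in for the real data: a small, constant Sex effect
# plus a shared smooth trend in Age
n <- 200
Age <- runif(n, 5, 18)
Sex <- factor(rep(c("F", "M"), each = n / 2))
Weight <- 20 + 2 * Age + 0.5 * (Sex == "M") + rnorm(n)
d <- data.frame(Weight, Age, oSex = ordered(Sex))

# Ordered-factor parameterisation: s(Age) is the reference-level smooth,
# s(Age, by = oSex) is the smooth *difference* for the other level
m <- gam(Weight ~ oSex + s(Age) + s(Age, by = oSex),
         data = d, method = "REML")

summary(m)                              # parametric oSex term plus the two smooths
plot(m, pages = 1, seWithMean = TRUE)   # second panel shows the difference smooth
```

If the difference smooth's credible interval includes zero everywhere, that is consistent with a small, roughly constant group difference being absorbed by the parametric oSex term.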
Best Answer
Here is some code for how you could do it in R:

[code block not preserved in this excerpt]

Which then generates the following plot:

[plot not preserved in this excerpt]

You could obviously add your own aesthetic touches, but this should give you the general idea.