R Regression – Verification of a Regression Model

Tags: predictive-models, r, regression

I need some guidance on verifying a regression model using validation data.
I am new to R and to statistics and am trying my best to learn. I searched the internet but couldn't find a definitive answer to my questions.
I have quite a few questions, so I will try my best to explain the problem:
I am experimenting with network packets and R.
I captured some packets from a network using a custom-made packet sniffer written in Java. The sniffer captures packets and saves packet header information (TCP window size, TCP sequence numbers, date-time, IP header length, IP time to live, etc.) in a CSV file.

The sniffer also adds a category number to each CSV file so that we know which packet belongs to which category. I created 9 different categories, saved in 9 different CSV files.
Next I extracted 1000 observations from each of the CSV files and created a data set named "alldata".

Then I created a training data set and a validation data set from the "alldata" data set.

Now I want to perform linear regression, logistic regression, decision tree analysis, cluster analysis, etc. on this "alldata" data set.

So my plan is to use the training data set to create models and then use the validation data set to verify them.

Category will be my target variable in any case. I want to predict the category from other independent variables.

  1. My first confusion: I created scatter plots of category against the other independent variables, and I don't see any linear relationship between them. In fact, I don't know what relationship, if any, exists between category and the independent variables. From the scatter plots it seems to me that there is no specific relationship between category and the independent variables (except date_time, which is somewhat linear with category). Is my interpretation correct?
    Here are some of the plots:
    plot 1
    plot 2

  2. After looking at the scatter plots, I think linear regression won't make sense here. Is this a correct assumption?

  3. I did try to build some regression models with the training data set, but the R-squared values for all of them are quite low (for example 0.00019, 0.0035, 0.018).
    Can I assume that these models are not good because of the very low R-squared values?

  4. As I understand it, logistic regression is used when the target variable has only two values, 1 or 0, or is a probability between 0.0 and 1.0.
    That would mean logistic regression is not possible for this type of data set.
    Is my assumption true?

  5. My main question is: how do I verify a model created with the training data set by using the validation data set?
    Please let me know the commands and the procedure.
    Please also let me know if I am doing this the wrong way, or if you can suggest a better way to do this whole task. Once these doubts are cleared up, I may ask further questions.

If my problem is unclear, we can discuss it in more detail.
I look forward to your replies.
Thank you!


@Wayne

Hello, thanks for the reply. The thing is, every category has almost the same range of values for the independent variables (tcpheader, ipttl, iplen, etc.). For example, iptype has only two values, 6 and 17, and most of the categories contain both.
The same is true for tcpheader, the TCP sequence number, the TCP acknowledgement number, etc. I don't think there is any way to distinguish a particular packet based on these independent variables. The only independent variable that can be helpful is time.
But when I created a model with time, it had a good R-squared value, yet the regression equation doesn't predict the category for any value of date_time.
I don't understand this behaviour.

Thanks.

Best Answer

It seems to me that a first step would be to try to create some models of how tcp header data might relate to your categories. That is, do you have any theories?

If you do, it might turn out that you need to preprocess your packet info: for example, using the window size of the previous packet rather than the current one, or using the day of the week instead of the day of the month.

Then you need to look carefully at your inputs and outputs. Are they categorical ("car", "truck"), ordered categorical ("small", "medium", "large"), etc? Your linear regression is probably treating your categories like they're continuous (1..N) and your plot shows there's no such linear relationship -- and there's probably no reason to expect there should be.
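To illustrate the point about variable types, here is a minimal sketch in R. The data frame and the `ipttl` column are made-up stand-ins for the packet data, not the asker's actual file. It shows the category stored as an integer (which `lm()` would treat as a continuous quantity) and then converted to a factor, which modelling functions treat as unordered labels.

```r
# Hypothetical toy data standing in for the real "alldata" data set
set.seed(1)
alldata <- data.frame(
  category = rep(1:3, each = 5),                      # category codes stored as integers
  ipttl    = sample(c(64, 128), 15, replace = TRUE)   # assumed header field
)

# As an integer, category 3 is "three times" category 1, which is meaningless here
str(alldata$category)

# Convert to a factor so that models treat the codes as unordered class labels
alldata$category <- factor(alldata$category)
str(alldata$category)
levels(alldata$category)
```

If the categories had a natural order, `factor(..., ordered = TRUE)` would be the analogous conversion.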

Once you have an idea of models that might make sense, have meaningful variables, and know the types of these variables, methods will naturally fall into place. (For example, continuous variables in and binary category out naturally suggests logistic regression.)

EDIT: In terms of logistic regression, it can be used with multiple outcomes. Look for multinomial logistic regression.
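As a rough sketch of what that could look like in R, one option is `multinom()` from the `nnet` package, which ships with standard R distributions. The data below are simulated, and the column name `tcpwindow` and the three-category setup are assumptions made only for illustration.

```r
library(nnet)  # provides multinom() for multinomial logistic regression

# Simulated stand-in data: three categories with different typical window sizes
set.seed(42)
n_per <- 100
toy <- data.frame(
  category  = factor(rep(1:3, each = n_per)),
  tcpwindow = c(rnorm(n_per, 1000, 500),
                rnorm(n_per, 2000, 500),
                rnorm(n_per, 3000, 500))
)

# Rescale the predictor so the optimiser works with numbers of moderate size
toy$tcpwindow_k <- toy$tcpwindow / 1000

# Fit the model; the response must be a factor
fit <- multinom(category ~ tcpwindow_k, data = toy, trace = FALSE)

# One row per observation, one column of estimated probability per category
probs <- predict(fit, newdata = toy, type = "probs")

# The most probable category for each observation
pred <- predict(fit, newdata = toy, type = "class")
head(probs)
```

With real data the predictors would be whichever header fields you believe carry information about the category.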

In terms of validation, you train your model with your training set then predict on the validation data and see how accurate you are. Obviously, if you look at your accuracy on your training data, it'll tend to overestimate your accuracy since it's what you tuned your model to. A better test of how you'll do in the real world is to use data that your tuning (training) process never used.
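The procedure can be sketched in R as follows. This is only an illustration on simulated data: the column names, the 70/30 split, and the choice of a classification tree from the `rpart` package (which ships with standard R distributions) are all assumptions, and any classifier with a `predict()` method could be substituted.

```r
library(rpart)  # classification trees

# Simulated stand-in for "alldata"; column names are hypothetical
set.seed(123)
n_per <- 200
alldata <- data.frame(
  category  = factor(rep(1:3, each = n_per)),
  tcpwindow = c(rnorm(n_per, 1000, 400),
                rnorm(n_per, 2000, 400),
                rnorm(n_per, 3000, 400)),
  ipttl     = sample(c(64, 128), 3 * n_per, replace = TRUE)
)

# Randomly assign 70% of the rows to training, the rest to validation
train_idx <- sample(nrow(alldata), size = round(0.7 * nrow(alldata)))
train <- alldata[train_idx, ]
valid <- alldata[-train_idx, ]

# Fit on the training data only
fit <- rpart(category ~ tcpwindow + ipttl, data = train, method = "class")

# Predict on both sets
train_pred <- predict(fit, newdata = train, type = "class")
valid_pred <- predict(fit, newdata = valid, type = "class")

# Confusion matrix on the validation set: rows are true categories
conf_mat <- table(actual = valid$category, predicted = valid_pred)
print(conf_mat)

# Accuracy = share of correctly predicted rows
train_acc <- mean(train_pred == train$category)
valid_acc <- mean(valid_pred == valid$category)
cat("training accuracy:  ", train_acc, "\n")
cat("validation accuracy:", valid_acc, "\n")
```

The validation accuracy is the figure to report. The confusion matrix additionally shows which categories get mistaken for which.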
