If there is only one such column, and the data type is string with values "Yes" and "No", you can change the datatype with as.numeric(as.factor()); this converts it to a numeric variable, which works with cor().
library(MASS) # contains many sample datasets
data(Pima.te) # diabetes dataset, has 1 Yes/No column
str(Pima.te)
Result
'data.frame': 332 obs. of 8 variables:
$ npreg: int 6 1 1 3 2 5 0 1 3 9 ...
$ glu : int 148 85 89 78 197 166 118 103 126 119 ...
$ bp : int 72 66 66 50 70 72 84 30 88 80 ...
$ skin : int 35 29 23 32 45 19 47 38 41 35 ...
$ bmi : num 33.6 26.6 28.1 31 30.5 25.8 45.8 43.3 39.3 29 ...
$ ped : num 0.627 0.351 0.167 0.248 0.158 0.587 0.551 0.183 0.704 0.263 ...
$ age : int 50 31 21 26 53 51 31 33 27 29 ...
$ type : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 1 1 2
-- change the datatype:
Pima.te$type <- as.numeric(as.factor(Pima.te$type)) # "No" -> 1, "Yes" -> 2
str(Pima.te)
Result
'data.frame': 332 obs. of 8 variables:
$ npreg: int 6 1 1 3 2 5 0 1 3 9 ...
$ glu : int 148 85 89 78 197 166 118 103 126 119 ...
$ bp : int 72 66 66 50 70 72 84 30 88 80 ...
$ skin : int 35 29 23 32 45 19 47 38 41 35 ...
$ bmi : num 33.6 26.6 28.1 31 30.5 25.8 45.8 43.3 39.3 29 ...
$ ped : num 0.627 0.351 0.167 0.248 0.158 0.587 0.551 0.183 0.704 0.263 ...
$ age : int 50 31 21 26 53 51 31 33 27 29 ...
$ type : num 2 1 1 2 2 2 2 1 1 2 ...
--
cor(Pima.te)
Result:
npreg glu bp skin bmi ped age type
npreg 1.00000 0.09548 0.17948 0.08521 -0.01591 0.07550 0.66738 0.2409
glu 0.09548 1.00000 0.19468 0.23517 0.27415 0.23521 0.23456 0.5199
bp 0.17948 0.19468 1.00000 0.20480 0.33819 0.03123 0.32488 0.1705
skin 0.08521 0.23517 0.20480 1.00000 0.65854 0.13691 0.09458 0.2677
bmi -0.01591 0.27415 0.33819 0.65854 1.00000 0.12672 0.04733 0.3147
ped 0.07550 0.23521 0.03123 0.13691 0.12672 1.00000 0.15301 0.2517
age 0.66738 0.23456 0.32488 0.09458 0.04733 0.15301 1.00000 0.2830
type 0.24090 0.51994 0.17052 0.26772 0.31468 0.25167 0.28297 1.0000
This particular result may not make much sense, and maybe you prefer dummy variables 1 and 0 instead of 2 and 1, but you get the idea.
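If you do want 0/1 dummies, ifelse() gives them directly. A small sketch with a toy vector (so you can see both codings side by side):

```r
type <- factor(c("Yes", "No", "No", "Yes"))

# as.numeric() on a factor returns the level codes; levels sort
# alphabetically, so "No" = 1 and "Yes" = 2:
as.numeric(type)             # 2 1 1 2

# ifelse() gives the usual 0/1 dummy coding instead:
ifelse(type == "Yes", 1, 0)  # 1 0 0 1
```

Note that for cor() the two codings are equivalent up to sign and scale, so the correlations themselves do not change.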
If there are many such variables in your dataframe, it is a different story, and I would use
Pima.te <- dplyr::mutate_if(Pima.te, is.factor, as.numeric)
but that's maybe too complicated for now.
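If you'd rather not pull in dplyr, here is a base-R sketch that does the same thing on a small made-up dataframe:

```r
df <- data.frame(
  a = factor(c("Yes", "No", "Yes")),
  b = factor(c("low", "high", "low")),
  x = c(1.5, 2.5, 3.5)
)

# Replace each factor column with its numeric level codes;
# leave non-factor columns untouched. df[] keeps the data.frame shape.
df[] <- lapply(df, function(col) if (is.factor(col)) as.numeric(col) else col)

str(df)  # a and b are now num; x is unchanged
```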
Model Formulation
One way forward may be to aggregate tweets by week and count the number of times "we" is used, adjusting for wins/losses and using an offset to account for the number of tweets made.
If your hypothesis is that "wins and losses affect how frequently an athlete refers to the team collectively as 'we'", then it might be sensible to structure your data as follows:
| Week | # We | Tweets | Wins | Losses | ID  |
|------|------|--------|------|--------|-----|
| 0    | ...  | ...    | ...  | ...    | ... |
| 1    | ...  | ...    | ...  | ...    | ... |
Here, "# We" is the outcome (which I will refer to as $y$). Even under the null hypothesis (wins and losses do not affect the frequency of $y$), the count can nonetheless increase/decrease simply because the athlete tweets more. Thus, we will need to account for that somehow.
A typical model for count data is Poisson regression. We can model the expected frequency of $y$ as follows:
$$ \log(E(y_{i, j})) = \beta_{0, i} + \beta_1 \mbox{week}_{i,j} + \beta_2\mbox{Wins}_{i,j} + \beta_3 \mbox{Losses}_{i,j} + \log(\mbox{Tweets}_{i,j}) $$
There are a few important things to note here:
- Each athlete has their own intercept in this model, $\beta_{0,i}$. This sort of model is known as a mixed effects model and can account for the longitudinal nature of the data.
- $\log(\mbox{Tweets})$ does not have a coefficient. This is known as an offset, and it accounts for increases in the frequency of $y$ that come simply from tweeting more.
You could run this regression and examine the coefficients of $\beta_2, \beta_3$ to evaluate your hypotheses. However, there are additional considerations before moving forward.
- A random intercept (one for each athlete) is about the minimum way to account for the longitudinal nature of the data. It may be the case that different athletes are affected by wins and losses differently, in which case a random slope for these covariates may be more appropriate.
- The effect of time is, in my opinion, something which can't be ignored, but a linear effect may be too limiting. Depending on the size of your data, a spline or a generalized additive model may be more appropriate.
These are just some criticisms the model may suffer from. You would be able to come up with more since you have more domain expertise than any of us.
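To make the random-slope point concrete, here is a sketch using lme4 on simulated (purely made-up) data; the variable names mirror the example that follows, and the outcome y is an invented column holding the weekly "we" counts:

```r
library(lme4)

# Simulate a small longitudinal dataset: 10 athletes observed over 8 weeks.
# All values are arbitrary and only serve to make the code runnable.
set.seed(1)
d <- data.frame(
  ids     = rep(1:10, each = 8),
  week    = rep(1:8, times = 10),
  wins    = rpois(80, 2),
  losses  = rpois(80, 2),
  Ntweets = rpois(80, 30) + 1
)
d$y <- rpois(80, lambda = 0.1 * d$Ntweets)

# Random intercept only, as in the model above:
m1 <- glmer(y ~ wins + losses + week + (1 | ids),
            offset = log(Ntweets), data = d, family = poisson())

# Random slopes for wins and losses as well, letting each athlete's
# response to results vary:
m2 <- glmer(y ~ wins + losses + week + (1 + wins + losses | ids),
            offset = log(Ntweets), data = d, family = poisson())

# Likelihood-ratio test comparing the two specifications;
# exp(fixef(m2)) gives the fixed effects as rate ratios.
anova(m1, m2)
```

On real data the random-slope model may need more athletes and weeks than this toy example to converge cleanly.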
Example
Here is an example of how you might structure your data. Let's assume this exists in a dataframe called d.
# A tibble: 6 x 6
week ids Ngames wins losses Ntweets
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 4 1 40
2 1 2 2 1 1 23
3 1 3 3 2 1 46
4 1 4 5 1 4 45
5 1 5 4 2 2 30
6 1 6 2 1 1 40
To fit the mixed effects model, we need the lme4 library:
library(lme4)
# assumes d also contains the outcome column y (the weekly "# We" count)
model <- glmer(y ~ wins + losses + week + (1 | ids),
               offset = log(Ntweets), data = d, family = poisson())
Here, the (1|ids) term ensures each athlete gets their own intercept.
I would strongly encourage you to make more formal assumptions about how variables like time affect the frequency of $y$, and to decide whether you are willing to posit that some athletes may be more strongly affected by winning/losing.
Best Answer
Sometimes a formal statistical test is overkill. Row by row, the entries in the first column are the largest. Draw a picture to make this apparent: side-by-side boxplots or dotplots would work nicely.
Although this is a post-hoc comparison, if the initial intent had been to compare the first column against the rest for a shift in distribution, the most extreme outcomes would be that either all maxima or all minima occur in the first column (a two-sided test). The probability of this occurring, if all columns contained values drawn at random from a common distribution, would be $2 (\frac{1}{6})^7$ = about 0.0007%.
In fact, the first column contains the largest 7 of the 42 values. Again, ex post facto, the chance of such an extreme ordering occurring equals $\frac{2}{\binom{42}{7}}$ = about 0.000007%.
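Both probabilities are quick to check in base R:

```r
# All 7 row-maxima (or all 7 row-minima) landing in one pre-specified
# column of a 7 x 6 table, two-sided:
p1 <- 2 * (1/6)^7
p1  # about 7.1e-06, i.e. roughly 0.0007%

# The 7 largest of all 42 values landing in one column, two-sided:
p2 <- 2 / choose(42, 7)
p2  # about 7.4e-08, i.e. roughly 0.000007%
```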
These results indicate that any reasonably powerful test you choose to conduct will conclude there's a highly significant difference.
In any event, you don't need a p-value: you need to characterize how large the difference is (the right way to do this depends on what the data mean), and you need to seek an explanation for the difference.