If there is only one such column, and the data type is string with values "Yes" and "No", you can change the datatype with as.numeric(as.factor()); this converts it to a numeric variable, which works with cor().
library(MASS) # contains many sample datasets
data(Pima.te) # diabetes dataset, has 1 Yes/No column
str(Pima.te)
Result
'data.frame': 332 obs. of 8 variables:
$ npreg: int 6 1 1 3 2 5 0 1 3 9 ...
$ glu : int 148 85 89 78 197 166 118 103 126 119 ...
$ bp : int 72 66 66 50 70 72 84 30 88 80 ...
$ skin : int 35 29 23 32 45 19 47 38 41 35 ...
$ bmi : num 33.6 26.6 28.1 31 30.5 25.8 45.8 43.3 39.3 29 ...
$ ped : num 0.627 0.351 0.167 0.248 0.158 0.587 0.551 0.183 0.704 0.263 ...
$ age : int 50 31 21 26 53 51 31 33 27 29 ...
$ type : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 1 1 2
-- change the datatype:
Pima.te$type <- as.numeric(as.factor(Pima.te$type)) # "No" -> 1, "Yes" -> 2
str(Pima.te)
Result
'data.frame': 332 obs. of 8 variables:
$ npreg: int 6 1 1 3 2 5 0 1 3 9 ...
$ glu : int 148 85 89 78 197 166 118 103 126 119 ...
$ bp : int 72 66 66 50 70 72 84 30 88 80 ...
$ skin : int 35 29 23 32 45 19 47 38 41 35 ...
$ bmi : num 33.6 26.6 28.1 31 30.5 25.8 45.8 43.3 39.3 29 ...
$ ped : num 0.627 0.351 0.167 0.248 0.158 0.587 0.551 0.183 0.704 0.263 ...
$ age : int 50 31 21 26 53 51 31 33 27 29 ...
$ type : num 2 1 1 2 2 2 2 1 1 2 ...
--
cor(Pima.te)
Result:
npreg glu bp skin bmi ped age type
npreg 1.00000 0.09548 0.17948 0.08521 -0.01591 0.07550 0.66738 0.2409
glu 0.09548 1.00000 0.19468 0.23517 0.27415 0.23521 0.23456 0.5199
bp 0.17948 0.19468 1.00000 0.20480 0.33819 0.03123 0.32488 0.1705
skin 0.08521 0.23517 0.20480 1.00000 0.65854 0.13691 0.09458 0.2677
bmi -0.01591 0.27415 0.33819 0.65854 1.00000 0.12672 0.04733 0.3147
ped 0.07550 0.23521 0.03123 0.13691 0.12672 1.00000 0.15301 0.2517
age 0.66738 0.23456 0.32488 0.09458 0.04733 0.15301 1.00000 0.2830
type 0.24090 0.51994 0.17052 0.26772 0.31468 0.25167 0.28297 1.0000
This particular result may not make much sense, and maybe you prefer dummy variables 1 and 0 instead of 2 and 1, but you get the idea.
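If you do want 0/1 dummies, ifelse() gives them directly. A small sketch with a toy vector (so you can see both codings side by side):

```r
type <- factor(c("Yes", "No", "No", "Yes"))

# as.numeric() on a factor returns the level codes; levels sort
# alphabetically, so "No" = 1 and "Yes" = 2:
as.numeric(type)             # 2 1 1 2

# ifelse() gives the usual 0/1 dummy coding instead:
ifelse(type == "Yes", 1, 0)  # 1 0 0 1
```

Note that for cor() the two codings are equivalent up to sign and scale, so the correlations themselves do not change.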
If there are many such variables in your dataframe, it is a different story, and I would use
Pima.te <- dplyr::mutate_if(Pima.te, is.factor, as.numeric)
but that's maybe too complicated for now.
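If you'd rather not pull in dplyr, here is a base-R sketch that does the same thing on a small made-up dataframe:

```r
df <- data.frame(
  a = factor(c("Yes", "No", "Yes")),
  b = factor(c("low", "high", "low")),
  x = c(1.5, 2.5, 3.5)
)

# Replace each factor column with its numeric level codes;
# leave non-factor columns untouched. df[] keeps the data.frame shape.
df[] <- lapply(df, function(col) if (is.factor(col)) as.numeric(col) else col)

str(df)  # a and b are now num; x is unchanged
```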
Model Formulation
One way forward may be to aggregate tweets by week and count the number of times "we" is used, adjusting for wins/losses and using an offset to account for the number of tweets made.
If your hypothesis is that "wins and losses affect how frequently an athlete refers to the team collectively as 'we'", then it might be sensible to structure your data as follows:
| Week | # We | Tweets | Wins | Losses | ID  |
|------|------|--------|------|--------|-----|
| 0    | ...  | ...    | ...  | ...    | ... |
| 1    | ...  | ...    | ...  | ...    | ... |
Here, "# We" is the outcome (which I will refer to as $y$). Even under the null hypothesis (wins and losses do not affect the frequency of $y$), the count can nonetheless increase/decrease simply because the athlete tweets more. Thus, we will need to account for that somehow.
A typical model for count data is Poisson regression. We can model the expected frequency of $y$ as follows:
$$ \log(E(y_{i, j})) = \beta_{0, i} + \beta_1 \mbox{week}_{i,j} + \beta_2\mbox{Wins}_{i,j} + \beta_3 \mbox{Losses}_{i,j} + \log(\mbox{Tweets}_{i,j}) $$
There are a few important things to note here:
- Each athlete has their own intercept in this model, $\beta_{0,i}$. This sort of model is known as a mixed effects model and can account for the longitudinal nature of the data.
- $\log(\mbox{Tweets})$ does not have a coefficient. This is known as an offset, and it accounts for increases in the frequency of $y$ that come simply from tweeting more.
You could run this regression and examine the coefficients of $\beta_2, \beta_3$ to evaluate your hypotheses. However, there are additional considerations before moving forward.
- A random intercept (one for each athlete) is about the minimum way to account for the longitudinal nature of the data. It may be the case that different athletes are affected by wins and losses differently, in which case a random slope for these covariates may be more appropriate.
- The effect of time is, in my opinion, something which can't be ignored, but a linear effect may be too limiting. Depending on the size of your data, a spline or a generalized additive model may be more appropriate.
These are just some criticisms the model may suffer from. You would be able to come up with more since you have more domain expertise than any of us.
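To make the random-slope point concrete, here is a sketch using lme4 on simulated (purely made-up) data; the variable names mirror the example that follows, and the outcome y is an invented column holding the weekly "we" counts:

```r
library(lme4)

# Simulate a small longitudinal dataset: 10 athletes observed over 8 weeks.
# All values are arbitrary and only serve to make the code runnable.
set.seed(1)
d <- data.frame(
  ids     = rep(1:10, each = 8),
  week    = rep(1:8, times = 10),
  wins    = rpois(80, 2),
  losses  = rpois(80, 2),
  Ntweets = rpois(80, 30) + 1
)
d$y <- rpois(80, lambda = 0.1 * d$Ntweets)

# Random intercept only, as in the model above:
m1 <- glmer(y ~ wins + losses + week + (1 | ids),
            offset = log(Ntweets), data = d, family = poisson())

# Random slopes for wins and losses as well, letting each athlete's
# response to results vary:
m2 <- glmer(y ~ wins + losses + week + (1 + wins + losses | ids),
            offset = log(Ntweets), data = d, family = poisson())

# Likelihood-ratio test comparing the two specifications;
# exp(fixef(m2)) gives the fixed effects as rate ratios.
anova(m1, m2)
```

On real data the random-slope model may need more athletes and weeks than this toy example to converge cleanly.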
Example
Here is an example of how you might structure your data. Let's assume this exists in a dataframe called d.
# A tibble: 6 x 6
week ids Ngames wins losses Ntweets
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 4 1 40
2 1 2 2 1 1 23
3 1 3 3 2 1 46
4 1 4 5 1 4 45
5 1 5 4 2 2 30
6 1 6 2 1 1 40
To fit the mixed effects model, we need the lme4 library:
library(lme4)
# assumes d also contains the outcome column y (the weekly "# We" count)
model <- glmer(y ~ wins + losses + week + (1 | ids),
               offset = log(Ntweets), data = d, family = poisson())
Here, the (1|ids) term ensures each athlete gets their own intercept.
I would strongly encourage you to make more formal assumptions about how variables like time affect the frequency of $y$, and to decide whether you are willing to posit that some athletes may be more strongly affected by winning/losing.
Best Answer
Sometimes a formal statistical test is overkill. Row by row, the entries in the first column are the largest. Draw a picture to make this apparent: side-by-side boxplots or dotplots would work nicely.
Although this is a post-hoc comparison, if the initial intent had been to compare the first column against the rest for a shift in distribution, the most extreme outcomes would be that either all maxima or all minima occur in the first column (a two-sided test). The probability of this occurring, if all columns contained values drawn at random from a common distribution, would be $2 (\frac{1}{6})^7$ = about 0.0007%.
In fact, the first column contains the largest 7 of the 42 values. Again, ex post facto, the chance of such an extreme ordering occurring equals $\frac{2}{\binom{42}{7}}$ = about 0.000007%.
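Both probabilities are quick to check in base R:

```r
# All 7 row-maxima (or all 7 row-minima) landing in one pre-specified
# column of a 7 x 6 table, two-sided:
p1 <- 2 * (1/6)^7
p1  # about 7.1e-06, i.e. roughly 0.0007%

# The 7 largest of all 42 values landing in one column, two-sided:
p2 <- 2 / choose(42, 7)
p2  # about 7.4e-08, i.e. roughly 0.000007%
```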
These results indicate that any reasonably powerful test you choose to conduct will conclude there's a highly significant difference.
In any event, you don't need a p-value: you need to characterize how large the difference is (the right way to do this depends on what the data mean), and you need to seek an explanation for the difference.