Hypothesis Testing with Big Data

hypothesis-testing, large-data

How do you perform hypothesis tests with big data? I wrote the following MATLAB script to illustrate my confusion. All it does is generate two random series and run a simple linear regression of one variable on the other. It repeats this regression several times with different random draws and reports the averages. What tends to happen is that as I increase the sample size, the p-values on average get very small.

I know that because the power of a test increases with sample size, given a large enough sample, p-values will become small enough, even with random data, to reject the null of any hypothesis test. I asked around and some people said that with 'Big Data' it's more important to look at the effect size, i.e. whether the test is significant AND has an effect large enough for us to care about. This is because with large sample sizes p-values will pick up on very small differences, as explained here.

However, the apparent effect size can be manipulated by rescaling the data. Below I scale the explanatory variable down to a small enough magnitude that, given a large enough sample size, it appears to have a large, significant effect on the dependent variable.

So I'm wondering, how do we gain any insight from Big Data if these problems exist?

%decide over how many regressions to average
obs_inside_average = 100;

for average_i = 1:obs_inside_average

%regression loop
%number of observations
n = 1000;

%first independent variable (constant term)
x(1:n,1) = 1;

%create dependent variable and the one regressor
for i = 1:n

    y(i,1) = 100 + 100*rand();

    x(i,2) = 0.1*rand();

end





%calculate coefficients
beta = (x'*x)\x'*y;

%calculate residuals
u = y - x*beta;

%calculate the residual variance estimate
s_2 = u'*u/(n-2);

%calculate the estimated covariance matrix of the coefficients
design = s_2*inv(x'*x);

%calculate standard errors
stn_err = [sqrt(design(1,1));sqrt(design(2,2))];

%calculate t-statistics (null hypothesis: coefficient equals 0)
t_stat(1,1) = (beta(1,1) - 0)/stn_err(1,1);
t_stat(2,1) = (beta(2,1) - 0)/stn_err(2,1);

%calculate p-statistics
p_val(1,1) = 2*(1 - tcdf(abs(t_stat(1,1)), n-2));
p_val(2,1) = 2*(1 - tcdf(abs(t_stat(2,1)), n-2));






%save first beta to data column 1
data(average_i,1) = beta(1,1);

%save second beta to data column 2
data(average_i,2) = beta(2,1);

%save first s.e. to data column 3
data(average_i,3) = stn_err(1,1);

%save second s.e. to data column 4
data(average_i,4) = stn_err(2,1);

%save first t-stat to data column 5
data(average_i,5) = t_stat(1,1);

%save second t-stat to data column 6
data(average_i,6) = t_stat(2,1);

%save first p-val to data column 7
data(average_i,7) = p_val(1,1);

%save second p-val to data column 8
data(average_i,8) = p_val(2,1);

end

%calculate first and second beta average
b1_average = mean(data(:,1));
b2_average = mean(data(:,2));

beta = [b1_average;b2_average];

%calculate first and second s.e. average
se1_average = mean(data(:,3));
se2_average = mean(data(:,4));

stn_err = [se1_average;se2_average];

%calculate first and second t-stat average
t1_average = mean(data(:,5));
t2_average = mean(data(:,6));

t_stat = [t1_average;t2_average];

%calculate first and second p-val average
p1_average = mean(data(:,7));
p2_average = mean(data(:,8));

p_val = [p1_average;p2_average];

beta
stn_err
t_stat
p_val
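For reference, the scaling concern above can be checked directly: rescaling the regressor inflates the raw coefficient by the same factor, but a scale-free quantity such as the t-statistic (and hence the p-value) is left untouched. The script above is MATLAB, but here is a minimal Python sketch with made-up data (the helper name `slope_and_t` is mine, not from the script):

```python
import random

def slope_and_t(x, y):
    """OLS slope and its t-statistic for a simple regression with intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                        # slope estimate
    a = my - b * mx                      # intercept estimate
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5    # standard error of the slope
    return b, b / se

random.seed(0)
x = [random.random() for _ in range(100)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

b1, t1 = slope_and_t(x, y)
b2, t2 = slope_and_t([xi / 100 for xi in x], y)  # rescale the regressor

# b2/b1 equals the scale factor, while t1 and t2 agree to rounding error:
print(b2 / b1)
print(t1, t2)
```

So the raw coefficient is indeed scale-dependent, which is the worry in the question, but standardized effect measures are not.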

Best Answer

As Peter suggested, I think one of the important things in the era of "Big Data" is to put even less emphasis on p-values, and more on estimating the magnitude of the effect.

Some of my own work struggles with this in ways I think are even more insidious than with Big Data: for stochastic computational models, your power is entirely a function of patience and computing resources. It's an artificial construct.

So turn back to the effect estimate. Even if it's significant, does a 0.0001% increase in something matter in the real world?
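To put a rough number on that: under a one-sample z-test, a true effect of 0.001 standard deviations, surely irrelevant in practice, still becomes "significant" once n is large enough. A sketch in Python using the normal approximation (the effect size and sample sizes here are made up for illustration):

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

# A true effect of 0.001 standard deviations -- practically negligible.
effect_sd = 0.001

for n in (10_000, 1_000_000, 100_000_000):
    z = effect_sd * sqrt(n)            # z-statistic for a one-sample test
    p = 2 * (1 - norm.cdf(abs(z)))     # two-sided p-value
    # p shrinks toward 0 as n grows, while the effect itself never changes
    print(n, z, p)
```

The effect is identical in every row; only the sample size, and therefore the verdict of the test, changes.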

I've also been playing around with reversing some of the ideas behind reporting study power. Instead of reporting the power your study had to detect the observed effect, report the minimum effect size the study was powered to detect. That way the reader can tell whether significance was essentially guaranteed.
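That idea has a simple closed form for a one-sample z-test: the minimum detectable standardized effect at significance level alpha and a target power is roughly (z_{1-alpha/2} + z_{power}) / sqrt(n). A hypothetical Python helper (the function name and defaults are mine, not from the answer):

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def minimum_detectable_effect(n, alpha=0.05, power=0.8):
    """Smallest standardized mean difference that a two-sided one-sample
    z-test of size alpha detects with the given power at sample size n,
    via the usual approximation (z_{1-alpha/2} + z_{power}) / sqrt(n)."""
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return (z_alpha + z_power) / sqrt(n)

# The detectable effect shrinks like 1/sqrt(n): with a huge n, even a
# trivially small effect is all but guaranteed to come out significant.
for n in (100, 10_000, 1_000_000):
    print(n, minimum_detectable_effect(n))
```

Reporting that number alongside the p-value tells the reader how trivial an effect the study was bound to flag.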
