# Hypothesis Testing with Big Data

Tags: hypothesis-testing, large-data

How do you perform hypothesis tests with big data? I wrote the following MATLAB script to illustrate my confusion. All it does is generate two random series and run a simple linear regression of one variable on the other. It repeats this regression many times with different random draws and reports the averages. What tends to happen is that as I increase the sample size, the p-values on average get very small.

I know that because the power of a test increases with sample size, given a large enough sample, p-values will become small enough to reject almost any null hypothesis, even when the underlying effect is negligible. I asked around, and some people said that with 'Big Data' it is more important to look at the effect size, i.e. whether the test is significant AND the effect is large enough for us to care about. This is because with large sample sizes p-values will pick up on very small differences, as explained here.
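To make this concrete, here is a small Python sketch (assuming NumPy and SciPy are available; the helper name `avg_slope_pvalue` is made up for this example) of the behaviour described above: with a tiny but nonzero true slope, the average p-value on the slope shrinks as the sample grows, while under a truly null slope the p-values stay roughly uniform.

```python
# Illustrative sketch, not the MATLAB script below: average p-value of an
# OLS slope as the sample size grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def avg_slope_pvalue(n, true_slope, reps=200):
    """Average two-sided p-value of the OLS slope over many replications."""
    pvals = []
    for _ in range(reps):
        x = rng.uniform(size=n)
        y = 100 + true_slope * x + rng.normal(size=n)
        pvals.append(stats.linregress(x, y).pvalue)
    return float(np.mean(pvals))

# Tiny but nonzero effect: average p-values fall steadily with n.
for n in (100, 1000, 10000):
    print(n, avg_slope_pvalue(n, true_slope=0.1))

# Truly null slope: p-values stay roughly uniform (mean near 0.5).
print(avg_slope_pvalue(10000, true_slope=0.0))
```

Note the contrast between the two cases: a growing sample only drives p-values down when there is *some* nonzero effect, however small.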

However, the effect size can be manipulated by rescaling the data. Below I scale the explanatory variable to a small enough magnitude that, given a large enough sample size, it appears to have a large, significant effect on the dependent variable.
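The rescaling point can be checked directly. In this Python sketch (again assuming NumPy/SciPy; the variable names are invented for illustration), shrinking the regressor by a factor of 1000 inflates the raw coefficient by the same factor, but scale-free quantities such as the correlation (and hence R² and the t-statistic) are untouched, so standardized effect sizes are not fooled by the trick.

```python
# Sketch: rescaling x changes the raw slope but not the correlation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(size=n)
y = 100 + 2.0 * x + rng.normal(size=n)

res_raw = stats.linregress(x, y)
res_scaled = stats.linregress(0.001 * x, y)  # shrink the regressor 1000x

print(res_raw.slope, res_scaled.slope)    # raw slope inflates by 1000x
print(res_raw.rvalue, res_scaled.rvalue)  # correlation is unchanged
print(res_raw.pvalue, res_scaled.pvalue)  # p-value is unchanged too
```

This suggests reporting a standardized measure (correlation, R², or a coefficient on standardized variables) rather than the raw slope when judging whether an effect is worth caring about.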

So I'm wondering: how do we gain any insight from Big Data if these problems exist?

%number of regressions to average over
obs_inside_average = 100;

for average_i = 1:obs_inside_average

%regression loop
%number of observations
n = 1000;

%first regressor (constant term)
x(1:n,1) = 1;

%create dependent variable and the second regressor
for i = 1:n

y(i,1) = 100 + 100*rand();

x(i,2) = 0.1*rand();

end

%calculate OLS coefficients
beta = (x'*x)\(x'*y);

%calculate residuals
u = y - x*beta;

%calculate the residual variance estimate
s_2 = (u'*u)/(n-2);

%estimated covariance matrix of the coefficients
design = s_2*inv(x'*x);

%calculate standard errors
stn_err = [sqrt(design(1,1));sqrt(design(2,2))];

%calculate t-statistics (testing each coefficient against zero)
t_stat(1,1) = (beta(1,1) - 0)/stn_err(1,1);
t_stat(2,1) = (beta(2,1) - 0)/stn_err(2,1);

%calculate two-sided p-values
p_val(1,1) = 2*(1 - tcdf(abs(t_stat(1,1)), n-2));
p_val(2,1) = 2*(1 - tcdf(abs(t_stat(2,1)), n-2));

%save betas, standard errors, t-stats and p-values to columns 1-8
data(average_i,1:8) = [beta', stn_err', t_stat', p_val'];

end

%calculate first and second beta average
b1_average = mean(data(:,1));
b2_average = mean(data(:,2));

beta = [b1_average;b2_average];

%calculate first and second s.e. average
se1_average = mean(data(:,3));
se2_average = mean(data(:,4));

stn_err = [se1_average;se2_average];

%calculate first and second t-stat average
t1_average = mean(data(:,5));
t2_average = mean(data(:,6));

t_stat = [t1_average;t2_average];

%calculate first and second p-val average
p1_average = mean(data(:,7));
p2_average = mean(data(:,8));

p_val = [p1_average;p2_average];

%display the averaged results
beta
stn_err
t_stat
p_val