For 100 companies, I have collected (i) tweets and (ii) corporate website pageviews for 148 days. The tweet volume and pageviews per day are two independent variables compared against the stock trading volume for each company, resulting in 100 x 148 = 14,800 observations. My data is structured like this:
company date tweetVol pageviewVol tradingVol
------------------------------------------------
1 1 200 150 2423325
1 2 194 152 2455343
1 3 214 199 3100429
. . . . .
. . . . .
1 148 205 233 2563463
2 1 752 932 7434124
2 2 932 2423 7464354
2 3 600 1435 5324323
. . . . .
. . . . .
. . . . .
100 148 3 155 32324
Because company size varies widely (some companies receive only 2 tweets per day, while others like Apple get over 10,000 per day), all variables are log-transformed to smooth the distribution. (This is in line with previous research; this is for my thesis.)
I just performed a linear regression on this data, including both independent variables. R-squared is .411 but Durbin-Watson is only .141 (!). Without looking up the exact bounds, I know this directly means my residuals are non-linear, e.g. autocorrelated, right?
My question is: how can I solve this? When I think about it, this data should not be autocorrelated, so I don't really understand. Is it because this is actually a time series analysis? I wouldn't think so either, since, for instance, today's trading volume is independent of yesterday's trading volume. Can somebody explain this to me?
P.S. At my university, we use SPSS/PASW without additional modules, so I am unable to perform a time series analysis on this like you could in Stata or R.
Best Answer
The Durbin-Watson test may suggest the need for an ARIMA model to render the error term free of structure IFF there are no outliers/inliers/pulses AND no unspecified level/step shifts AND no unspecified Seasonal Pulses AND no unspecified Local Time Trends AND the model's parameters are constant/homogeneous over time AND the error variance is constant/homogeneous over time AND the error variance is not related to the level/expected value AND the error variance can't be modelled as a random variable via GARCH.
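To make the statistic itself concrete: Durbin-Watson is d = sum((e_t - e_{t-1})^2) / sum(e_t^2), computed over the residuals in row order, with values near 2 indicating no first-order autocorrelation and values near 0 indicating strong positive autocorrelation. The sketch below is only an illustration in Python with NumPy, not a diagnosis of the data in the question. It uses purely synthetic residuals (the function name, the random seed, and the scale of the company-specific level are all assumptions) to show how one kind of unspecified structure, a different level per company left in the residuals of a stacked 100 x 148 data set, can push d toward 0 even when the day-to-day noise is independent.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic for residuals taken in the order given."""
    resid = np.asarray(resid, dtype=float)
    diff = np.diff(resid)  # e_t - e_{t-1} for consecutive rows
    return float(np.sum(diff ** 2) / np.sum(resid ** 2))

# Synthetic data only (assumption): 100 companies x 148 days, stacked by company.
rng = np.random.default_rng(0)
n_companies, n_days = 100, 148

# Day-to-day noise that is independent by construction.
noise = rng.normal(size=n_companies * n_days)

# A company-specific level that the regression failed to explain,
# repeated for every day of that company (scale 3.0 is arbitrary).
level = np.repeat(rng.normal(scale=3.0, size=n_companies), n_days)

d_noise = durbin_watson(noise)           # independent residuals
d_pooled = durbin_watson(noise + level)  # same noise plus unexplained levels

print(f"independent noise only: d = {d_noise:.3f}")
print(f"with company levels:    d = {d_pooled:.3f}")
```

In this synthetic setup the first value comes out close to 2 and the second far below it, although nothing about the day-to-day behaviour changed. Whether anything similar is happening in real data has to be checked on the actual residuals.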