Why Use Logged Variables?

Tags: correlation, data transformation, logarithm

This is probably a very basic question, but I can't seem to find a solid answer to it. I hope I can find one here.

I'm reading papers in preparation for my own master's thesis. One of them studies the relationship between tweets and stock market features.

In one of their hypotheses, they propose that "increased tweet volume is associated with an increase in trading volume".

I would expect them, in the pairwise correlations, to correlate tweetVolume with tradingVolume, but instead they report using the logged versions: LN(tweetVolume) and LN(tradingVolume).

For my thesis, I have replicated this part of their paper. I collected tweets about 100 companies over 6 months (tweetVolume) and the stock trading volume for the same timeframe. If I correlate the raw (unlogged) variables, I find r=.282, p=.000, but when I use the logged versions, I find r=.488, p=.000.

I don't understand why researchers sometimes use logged versions of their variables and why correlation seems so much higher if you do so. What is the reasoning here, and why is it OK to use logged variables?

Your help is greatly appreciated 🙂

Best Answer

Reasons to use logged variables fall into two categories: statistical and substantive.

Statistically, if your variables are right-skewed (that is, they have a long tail at the high end), then a measure such as a correlation or a regression coefficient can be strongly influenced by one or a few cases at the high end of one or both variables (outliers, leverage points, influential points). Taking the log can help by reducing or eliminating the skew.
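To make the statistical point concrete, here is a minimal sketch using only the Python standard library. The data are simulated, not the asker's: `simulate_volumes` is a hypothetical helper that produces two right-skewed series by exponentiating correlated normal draws, and the sample size, `rho`, and seed are arbitrary assumptions. It compares skewness and Pearson's r before and after taking logs.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def skewness(xs):
    """Sample skewness (third standardized moment)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def simulate_volumes(n=2000, rho=0.6, seed=0):
    """Hypothetical data: exp() of two normals whose correlation is rho."""
    rng = random.Random(seed)
    tweets, trades = [], []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        tweets.append(math.exp(z1))   # right-skewed "tweet volume"
        trades.append(math.exp(z2))   # right-skewed "trading volume"
    return tweets, trades


tweets, trades = simulate_volumes()
log_tweets = [math.log(v) for v in tweets]
log_trades = [math.log(v) for v in trades]

print("skewness, raw vs logged:", skewness(tweets), skewness(log_tweets))
print("Pearson r, raw:   ", pearson(tweets, trades))
print("Pearson r, logged:", pearson(log_tweets, log_trades))
```

In this setup the logged series are normal by construction, so their skewness is near zero and their correlation is close to `rho`. The raw correlation depends heavily on the few largest values and will often come out lower, which is the same pattern as in the question, though real data need not behave this neatly.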

Substantively, some concepts are better thought of in terms of ratios than differences. Take the two volume measures you discuss. Now compare two companies: company A, a small company trading on NASDAQ that few people have heard of, and company B, a mega-corporation. The former will get very few tweets per day; the latter will get many. The same goes for trading volume. Suppose (just to pick numbers) that company A typically gets 100 tweets a day and company B gets 100,000.

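If company A's tweets go up from 100 to 500 (a difference of 400, a ratio of 5), that's huge news: something must be going on. But if company B's go up from 100,000 to 100,400 (a difference of 400, a ratio very close to 1), no one cares. The rough equivalent for company B would be a jump from 100,000 to 500,000, which is the same ratio of 5. On a log scale, equal ratios become equal differences, so logged volumes treat these two jumps as the same size.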
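The arithmetic behind this can be checked in a few lines. This is only an illustration using the made-up numbers from the example above; `log_change` is a hypothetical helper name, and it relies on the identity ln(after) - ln(before) = ln(after / before).

```python
import math


def log_change(before, after):
    """Difference of natural logs, which equals the log of the ratio."""
    return math.log(after) - math.log(before)


# (before, after) tweet counts taken from the example in the answer
cases = {
    "A: 100 -> 500": (100, 500),
    "B: 100,000 -> 100,400": (100_000, 100_400),
    "B: 100,000 -> 500,000": (100_000, 500_000),
}

for label, (before, after) in cases.items():
    print(
        f"{label}: difference={after - before}, "
        f"ratio={after / before:.3f}, "
        f"log change={log_change(before, after):.4f}"
    )
```

The first and third cases have the same log change (ln 5, about 1.609) even though their raw differences are 400 and 400,000. The second case has the same raw difference as the first but a log change of only about 0.004.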