Solved – Visualising 2 Independent Variables and 1 Dependent Variable

data visualizationpython

I'm trying to come up with the best way to graphically display misfits from multiple geophysical inversion runs, where I am fixing two independent variables and analysing the misfit of each test. For example, the test structure may look like

        Var1  Var2  Misfit (%)
Test 1    8    1.5   13.5
          8    2.0   12.4
          8    2.5   10.1

        Var1  Var2  Misfit (%)
Test 2    10   1.5   20.4
          10   2.0   23.1
          10   2.5   15.0
        Var1  Var2  Misfit (%)
Test 3    12   1.5    5.5
          12   2.0    8.9
          12   2.5   17.5

What would be the best way to graphically display these tests? The objective is to better understand trends in misfit given the two independent variables, Var1 and Var2.
I'm thinking I could set different colours for Var1, set the x-axis to Var2, and then set the y-axis to Misfit.
If anyone has any experience with this sort of testing and visualization, any recommendations is appreciated! I will most likely be working with Python or Python Pandas.

Best Answer

Since your DV is Misfit, using Var1 and Var2 as your x-y axes, and coloring Misfit, i.e. your z and with a suitable colormap would be the best to obtain an intuition how your DVs combined effect look like.

Specific to Python, you can use matplotlib scatter plots with c argument as your color variable. For example, in your case, assume we have a dataframe with columns Var1,Var2 and MisFit:

`plt.scatter(df.Var1,df.Var2,c=df.MisFit)`

Related Solutions

Solved – OLS results dependent on scaling of independent variable

@Alecos gave a very thorough mathematical answer.

More intuitively, suppose you regress weight on height. You measure weight in pounds and height in inches. Then you are told you should have measured weight in kilos and height in centimeters. Many of the numbers in the regression results will change, but the meaning will stay exactly the same.

So, in answer to your question: No. The fact you stated doesn't mean anything; it is an automatic consequence of what you did. Whether you should standardize your variables is a good question, but the changes you noted give no help in answering it.

R Data Visualization – Visualizing Many Variables in One Plot

Fortuitously or otherwise, your example is of optimal size (up to 7 values for each of 15 groups) first, to show that there is a problem graphically; and second, to allow other and fairly simple solutions. The graph is of a kind often called spaghetti by people in different fields, although it's not always clear whether that term is meant as affectionate or abusive. (The term spaghetti was used by Gene Zelazny in 1985, but may well be much older yet.) The graph does show the collective or family behaviour of all the groups, but it is fairly hopeless at showing the detail to be explored.

One standard alternative is just to show the separate groups in separate panels, but that in turn can make precise group-to-group comparisons difficult; each group is separated from its context of the other groups.

So why not combine both ideas: a separate panel for each group, but also show the other groups as backdrop? This hinges crucially on highlighting the group which is in focus and on downplaying the others, which is easy enough in this example given some use of line colour, thickness etc. In other examples, marker or point symbol choices might be natural instead.

In this case, details of possible practical or scientific importance or interest are highlighted:

We only have one value for A and M.
We don't have all values for all given years in all other cases.
Some groups plot high, some low, and so forth.

I won't attempt an interpretation here: the data are anonymous, but that is the researcher's concern in any case.

Depending on what is easy or possible in your software, there is scope for changing small details here, such as whether axis labels and titles are repeated (there are simple arguments both for and against).

The larger issue is how far this strategy will work more generally. The number of groups is the major driver, more so than the number of points in each group. Roughly speaking, the approach might work up to about 25 groups (a 5 x 5 display, say): with more groups, not only do the graphs become smaller and more difficult to read, but even the researcher loses the inclination to scan all the panels. If there were hundreds (thousands, ...) of groups, it would usually be essential to select a small number of groups to show. Some mix of criteria such as selecting some "typical" and some "extreme" panels would be needed; that should be driven by project goals and some idea of what makes sense for each dataset. Another approach that can be efficient is to emphasize a small number of series in each panel. So, if there were 25 broad groups, each broad group could be shown with all others as backdrop. Alternatively, there could be some averaging or other summarization. Using (e.g.) principal or independent components might also be a good idea.

Although the example calls for line plots, the principle is naturally much general. Examples could be multiplied, scatter plots, model diagnostic plots, etc.

Similar but not identical ideas

Cleveland (1985, pp.74, 203, 205, 268) shows graphs in which summary curves for groups are repeated with data shown separately for each group. (Note: these graphs do not appear in Cleveland 1994.)

Wallgren et al. (1996, pp.47, 69) use the same idea.

Cleveland, W.S. 1985. Elements of Graphing Data. Monterey, CA: Wadsworth.

Cleveland, W.S. 1994. Elements of Graphing Data. Summit, NJ: Hobart Press.

Wallgren, A., B. Wallgren, R. Persson, U. Jorner, and J.-A. Haaland. 1996. Graphing Statistics and Data: Creating Better Charts. Newbury Park, CA: SAGE.

Zelazny (1985 and later editions) has a different twist:

Zelazny, G. 1985. Say It With Charts: The Executive's Guide to Successful Presentations. Homewood, IL: Dow Jones-Irwin. See p.39 for a graph with four panels: series A compared in turn with series B, C, D, E. See also p.111.

Same pages in 4th edition: Zelazny, G. 2001. Say It With Charts: The Executive's Guide to Visual Communication. New York: McGraw-Hill. See p.39 for a graph with four panels: series A compared in turn with series B, C, D, E. See also p.111.

Direct examples [others are most welcome]

Koenker, R. 2005. Quantile Regression. Cambridge: Cambridge University Press. See pp.12-13.

Carr, D.B. and Pickle, L.W. 2010. Visualizing Data Patterns with Micromaps. Boca Raton, FL: CRC Press. See p.85.

Cox, N.J. 2010. Graphing subsets. Stata Journal 10: 670-681.

Yau, N. 2013. Data Points: Visualization That Means Something. Indianapolis, IN: John Wiley. See p.224.

Rougier, N.P., Droettboom, M. and Bourne, P.E. 2014. Ten simple rules for better figures. PLOS Computational Biology 10(9): e1003833. doi:10.1371/journal.pcbi.1003833 link here

Schwabish, J.A. 2014. An economist's guide to visualizing data. Journal of Economic Perspectives 28: 209-234.

Knaflic, C.N. 2015. Storytelling with Data: A Data Visualization Guide for Business Professionals. Hoboken, NJ: Wiley. See p.233.

Unwin, A. 2015. Graphical Data Analysis with R. Boca Raton, FL: CRC Press. See pp.121, 217.

Berinato, S. 2016. Good Charts: The HBR Guide to Making Smarter, More Persuasive Data Visualizations. Boston, MA: Harvard Business Review Press. See p.74.

Cairo, A. 2016. The Truthful Art: Data, Charts, and Maps for Communication. San Francisco, CA: New Riders. See p.211

Camões, J. 2016. Data at Work: Best Practices for Creating Effective Charts and Information Graphics in Microsoft Excel. San Francisco, CA: New Riders. See p.354.

Standage, T. 2016. Go Figure: The Economist Explains: Things You Didn't Know You Didn't Know. London: Profile Books. See p.177.

Wickham, H. 2016. ggplot2: Elegant Graphics for Data Analysis. Cham: Springer. See p.157.

Schwabish, J. 2017. Better Presentations: A Guide for Scholars, Researchers, and Wonks. New York: Columbia University Press. See p.98.

Kriebel, A. and Murray, E. 2018. #MakeoverMonday: Improving How We Visualize and Analyze Data, One Chart at a Time. Hoboken, NJ: John Wiley. See p.303.

Grant, R. 2019. Data Visualization: Charts, Maps, and Interactive Graphics. Boca Raton, FL: CRC Press. See p.52.

Koponen, J. and Hildén, J. 2019. The Data Visualization Handbook. Espoo: Aalto ARTS Books. See p.101.

Tufte, E.R. 2020. Seeing with Fresh Eyes: Meaning, Space, Data, Truth. Cheshire, CT: Graphics Press. See p.26 [original work by John Burn-Murdoch, but showing 0 on logarithmic scale (!) and alphabetical order by countries, which could be tuned]

Note: The graph was created in Stata. subsetplot must be installed first with ssc inst subsetplot. Data were copied and pasted from R and value labels were defined to show years as 90 95 00 05 10 15. The main command is

subsetplot connected Val Year, by(Var) c(L) lcolor(gs12) backdrop(line) xtitle("") combine(imargin(small)) subset(lcolor(blue) mcolor(blue))

Best Answer

Related Solutions

Solved – OLS results dependent on scaling of independent variable

R Data Visualization – Visualizing Many Variables in One Plot

Related Question