How does the coarsened exact matching method in the R package MatchIt determine the cutpoints for matching?

matching, r

It is unclear to me how the cutpoints are determined once we have selected the number of cutpoints for each covariate. Also, what does the default "sturges" option do?

mNN <- MatchIt::matchit(A ~ X1 + X2, data = d, 
      method="cem", 
      cutpoints = list(X1=6, X2=6))

Best Answer

When a single number is supplied for a variable, that variable is split into bins by evenly spaced cutpoints (i.e., evenly spaced on the scale of the variable) from its minimum to its maximum. Despite the argument's name, the number identifies how many bins the variable is split into, not how many interior cutpoints are used. For example, for a variable with values ranging from 0 to 6, setting cutpoints to 3 for that variable splits it into 3 bins: 0 to 2, 2 to 4, and 4 to 6. A value on a border is placed into the higher bin (i.e., a value of 2 would be placed into the second bin in the example). Although the cutpoints defining the bins are equally spaced, the bins can contain different numbers of units.
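As a rough illustration of the equal-width rule described above, the following base R sketch reproduces the 0-to-6 example. It is not MatchIt's internal code: the vector `x` is made up, and `findInterval()` is simply one convenient base R function that places border values into the higher bin.

```r
# Illustrative sketch only; this is NOT MatchIt's internal implementation.
x <- c(0, 1, 2, 3.5, 4, 6)  # hypothetical covariate ranging from 0 to 6
n_bins <- 3                 # what cutpoints = list(x = 3) would request

# n_bins equal-width bins require n_bins + 1 evenly spaced break points
breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# findInterval() treats bins as left-closed, so a border value such as 2
# lands in the higher bin; rightmost.closed = TRUE keeps the maximum
# inside the last bin instead of opening a new one.
bin <- findInterval(x, breaks, rightmost.closed = TRUE)

print(breaks)
print(data.frame(x = x, bin = bin))
print(table(bin))  # bin sizes need not be equal in general
```

Here the value 2 falls into bin 2 and the maximum, 6, stays in bin 3.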

If, instead of a number, a string like "q5" is supplied (i.e., q followed by a number), the variable will be split at its quantiles. For example, setting cutpoints to "q3" will put the lowest third of units into one bin, the next third into another bin, and the highest third into another bin. Depending on the distribution of the variable, the bins will cover ranges of different widths; for example, one bin might correspond to the values 0 to 1, while another bin might correspond to the values 3 to 6, but these bins will contain (approximately) the same number of units.
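The contrast with equal-width bins can be seen in a small base R sketch. Again, this only mimics the idea behind "q3" and is not taken from MatchIt's source; the simulated variable `x` and the use of `quantile()` with its default settings are assumptions made for illustration.

```r
# Illustrative sketch only; this is NOT MatchIt's internal implementation.
set.seed(1)
x <- rexp(300)  # hypothetical right-skewed covariate

# Break points at the 0th, 33.3rd, 66.7th, and 100th percentiles,
# mimicking what "q3" asks for (three quantile-based bins)
breaks <- quantile(x, probs = seq(0, 1, length.out = 4))

# Assign each unit to a bin; the maximum is kept in the last bin
bin <- findInterval(x, breaks, rightmost.closed = TRUE)

print(table(bin))    # roughly equal numbers of units per bin
print(diff(breaks))  # but the bins span ranges of different widths
```

With a skewed variable like this one, the top bin covers a much wider range of values than the bottom bin even though each holds about a third of the units.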

The default "sturges" option chooses the number of bins using the rule implemented in nclass.Sturges(), which computes ceiling(log2(length(x)) + 1). For example, for 100 units, this will produce 8 bins; for 1000 units, 11 bins; and for 10000 units, 15 bins. These bins will be evenly spaced on the scale of the variable (like supplying a single number to cutpoints). If the variable has fewer distinct values than the requested number of bins, no binning (i.e., coarsening) will take place, which is equivalent to exact matching on that variable.
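The bin counts quoted above can be checked directly in base R. In this sketch, `seq_len(k)` is just a stand-in for a covariate with k units, since nclass.Sturges() depends only on the length of its input.

```r
# Number of bins chosen by Sturges' rule for several sample sizes
n <- c(100, 1000, 10000)

# nclass.Sturges() lives in grDevices, which ships with base R;
# only length(x) matters, so any vector of the right length works
bins_fun <- sapply(n, function(k) grDevices::nclass.Sturges(seq_len(k)))

# The same rule written out by hand
bins_formula <- ceiling(log2(n) + 1)

print(rbind(n = n, nclass.Sturges = bins_fun, formula = bins_formula))
```

Both rows give 8, 11, and 15 bins for 100, 1000, and 10000 units, respectively.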
