1) Using the defaults for MatchIt, nearest neighbor matching matches on the propensity score, estimated by a logistic regression of treatment on the covariates included in your formula. For each treated unit, it finds the unmatched control unit with the closest propensity score; control units that never get matched are discarded. There is no issue with continuous vs. categorical covariates here. See King & Nielsen (2016), who describe why propensity score matching can actually make balance worse, as in your example.
2) MatchIt creates matches for the ATT (also called the ATET), but the Matching package, which also implements genetic matching, allows you to specify that you want the ATE instead. After matching, you can simply perform the regression analysis you would have performed had you randomly assigned your units (assuming balance has been achieved).
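The nearest-neighbor procedure described in point 1) can be sketched roughly as follows. This is an illustrative toy in Python with made-up propensity scores, not MatchIt's actual implementation:

```python
# Greedy 1:1 nearest-neighbor matching without replacement on the
# propensity score.  Unit IDs and scores are hypothetical.
treated = [("T1", 0.62), ("T2", 0.40)]
controls = [("C1", 0.61), ("C2", 0.45), ("C3", 0.10)]

pool = dict(controls)   # controls still available for matching
matches = {}
for tid, ps in treated:
    # closest still-unmatched control on the propensity score
    cid = min(pool, key=lambda c: abs(pool[c] - ps))
    matches[tid] = cid
    del pool[cid]       # without replacement: remove from the pool

print(matches)      # {'T1': 'C1', 'T2': 'C2'}
print(list(pool))   # ['C3'] -- controls never matched are discarded
```

The leftover pool is what gets dropped from the matched dataset.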
It seems like you are describing something like a difference-in-differences panel model with matching on pre-treatment trends. A common framework for this is an event study model. Sometimes it can involve matching.
In general, we use matching to build a synthetic counterfactual population. You would try to match on the pre-treatment or time-invariant factors that are most likely to lead to selection or self-selection into treatment. If you can build a counterfactual that is sufficiently similar, you can proceed as if your matched sample were a copy of the treatment group that never received treatment. The researcher can build the case that the matched sample is sufficiently similar by comparing the expected/mean values of important attributes, often through t-tests or similar. If there is no statistically significant difference on the important range of attributes, you have some evidence that your matched sample is a good counterfactual. You can never be totally certain you have identified all the key attributes, which is why this is not as reliable as a true experiment with randomly assigned treatment.
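As a minimal sketch of such a balance check (Python with made-up covariate values; in practice you would use your statistical package's balance tables or t-tests), the comparison is just a two-sample t statistic on each important attribute:

```python
import math
import statistics

# Hypothetical pre-treatment covariate (e.g., log store size) for
# matched treatment and control units -- illustrative numbers only.
treated = [12.1, 9.8, 11.4, 10.2, 10.9, 11.7]
control = [11.8, 10.1, 11.0, 10.5, 10.7, 11.9]

# Welch two-sample t statistic: a value near zero suggests the matched
# groups are balanced on this attribute.
m1, m2 = statistics.mean(treated), statistics.mean(control)
v1, v2 = statistics.variance(treated), statistics.variance(control)
t = (m1 - m2) / math.sqrt(v1 / len(treated) + v2 / len(control))
print(round(t, 3))  # close to zero here: no evidence of imbalance
```

You would repeat this (or inspect standardized mean differences) for every attribute you believe drives selection into treatment.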
If you are matching with panel data, stable attributes can be important, but you also need to examine whether your two groups show evidence of "parallel trends" during the pre-treatment periods. This is probably what was meant when you heard about "sales history" in building a matched control sample. Two stores might be very similar in size, location, average sales, or a range of other attributes, but if they were not exhibiting similar trends in sales before treatment, there is no reason to think they would continue to do so post-treatment, which means the control cases would not show you how the treatment cases would have behaved in the absence of treatment.
Event study models test for "non-parallel pre-treatment trends" by picking one time point as a reference and testing whether the difference-in-differences at each other pre-treatment time point is significant. The groups don't have to be at the same "level" or nominal value for this to be valid; they just have to move in a similar way at each time point up to the point when treatment is applied. It is extremely helpful to know whether treated cases anticipated the treatment, too. If treated cases have anticipation, the researcher needs to think about how this would alter pre-treatment levels; it might cause the researcher to reject a decent matched sample or accept one that is inappropriate. This isn't perfect and can't eliminate selection effects as reliably as random treatment assignment, but it helps build the case that the counterfactual group is appropriate.
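As a toy numeric sketch of the pre-trend check (made-up sales means, not a full event-study regression), the test amounts to computing a DiD for each pre-period against the reference period and checking it is close to zero:

```python
# Hypothetical mean sales for treated and control stores at
# pre-treatment periods t = -3, -2, -1, with t = -1 as the reference.
treat = {-3: 100.0, -2: 102.0, -1: 104.0}
ctrl  = {-3:  90.0, -2:  92.0, -1:  94.0}
ref = -1

# DiD of each pre-period relative to the reference period.  Values near
# zero (and statistically insignificant, in a real model) support
# parallel trends even though the groups differ in level.
pretrend_did = {
    t: (treat[t] - treat[ref]) - (ctrl[t] - ctrl[ref])
    for t in treat if t != ref
}
print(pretrend_did)  # both DiDs are 0.0 despite the 10-unit level gap
```

Note the treated stores sit 10 units above the controls throughout, yet every pre-trend DiD is zero: level differences do not fail this test, only trend differences do.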
If you have a large pool of potential counterfactual cases, you might use matching on pre-treatment time trends (differences in values from one point to the next) to find better control cases.
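A minimal sketch of that idea (illustrative Python with hypothetical sales series, not any particular package's API): match on the sequence of period-to-period changes rather than on levels.

```python
# Match each treated unit to the control whose pre-treatment first
# differences (trend) are closest, ignoring the level of sales.
def diffs(series):
    """Period-to-period changes of a sales series."""
    return [b - a for a, b in zip(series, series[1:])]

def trend_distance(s1, s2):
    """Squared distance between two units' trends."""
    return sum((d1 - d2) ** 2 for d1, d2 in zip(diffs(s1), diffs(s2)))

treated = {"T1": [100, 105, 111]}   # rising trend
controls = {
    "C1": [50, 55, 61],    # same trend, very different level
    "C2": [100, 99, 97],   # same level, different trend
}

best = min(controls, key=lambda c: trend_distance(treated["T1"], controls[c]))
print(best)  # C1: trend similarity beats level similarity
```

Matching on levels would have picked C2; matching on trends picks C1, which is the unit you would expect to evolve like T1 absent treatment.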
Finding a matched sample that is similar in both trend and nominal value ("level heterogeneity") is often difficult. If we have evidence that the pre-treatment trends are parallel, we can handle level heterogeneity by calculating the ATT as
$$ATT = (\bar{Y}_{\text{treat,post}} - \bar{Y}_{\text{treat,pre}}) - (\bar{Y}_{\text{control,post}} - \bar{Y}_{\text{control,pre}})$$
This way the ATT ignores the differences in expected nominal value that might exist between treatment and control. In event study models, it is more common to use unit (store) fixed effects, which subtract each store's grand/panel mean from each time point, forcing all units/stores onto the same expected nominal value.
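A toy numeric illustration of the ATT formula above, with made-up group means:

```python
# Treated stores start 10 units higher than controls (level
# heterogeneity), but differencing removes the level gap.
pre_treat, post_treat = 110.0, 125.0   # treated group means
pre_ctrl,  post_ctrl  = 100.0, 108.0   # control group means

att = (post_treat - pre_treat) - (post_ctrl - pre_ctrl)
print(att)  # 7.0: the treated gain beyond the control group's gain
```

The treated stores gained 15 and the controls gained 8, so the ATT of 7 is the extra gain attributable to treatment, regardless of the 10-unit baseline difference.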
The important thing about these models is that they are imperfect compared with true random assignment in the research design (often not possible), but they are often the best we can do, and they can be quite convincing if the researcher is careful and transparent about the data. Here's a nice resource.
Edit adding answer to OP's useful follow-up question:
In my opinion, you never have certainty that you are removing confounding effects in observational (non-experimental) data. You can do certain things that make confounding effects less likely and results more believable. It would be important to know whether you are matching on "levels" of pre-treatment sales (i.e. avg. values) or on trends of pre-treatment sales. The former adds something but is not enough to trust results (IMO). To me, the latter can be convincing, especially when you test for non-parallel trends in sales and eliminate "level" heterogeneity with store FEs (event study framework).
Best Answer
When doing 1:1 matching without replacement, all matched units receive a weight of 1 and all unmatched units receive a weight of 0.
match.data() removes the unmatched units from the dataset, so all that remains are the units with weights of 1. How matching weights are computed is explained in the documentation for matchit(). When matching for the ATT, treated units receive a weight of 1, and matched control units receive a weight of $p_i/(1-p_i)$, where $p_i$ is the proportion of treated units in the pair that unit $i$ belongs to. For 1:1 matching, every pair has 2 units, 1 of which is treated, so $p_i = .5$. Applying that formula (which is the same formula used for computing ATT weights with propensity score weighting), all matched control units receive a weight of 1.
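Following the weight formula described above (a sketch of the arithmetic, not MatchIt's internal code):

```python
# ATT weight for a control unit, where p is the proportion of treated
# units in the matched pair/stratum the unit belongs to.
def control_weight(n_treated_in_pair, n_total_in_pair):
    p = n_treated_in_pair / n_total_in_pair
    return p / (1 - p)

# 1:1 matching: each pair has 1 treated unit out of 2, so p = .5
print(control_weight(1, 2))  # 1.0

# 2:1 matching (two controls per treated unit): p = 1/3, weight = .5
print(control_weight(1, 3))  # 0.5
```

This makes it easy to see why 1:1 matching is the special case where every weight is 1, while ratio matching spreads a total control weight of 1 across the controls in each pair.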