Not really, short of manually building an RF clone that does bagging of rpart models yourself.
Some options come from the fact that the output of RF is actually a continuous score rather than a crisp decision, namely the fraction of trees that voted for a given class. It can be extracted with predict(rf_model, type="prob") and used, for instance, to build a ROC curve, which will reveal a better threshold than .5 (and that threshold can later be incorporated into RF training via the cutoff parameter).
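A minimal sketch of that workflow with the randomForest package (the data frame train, its binary factor outcome y with levels "0" and "1", and the rule of picking the threshold that maximizes TPR minus FPR are my illustrative assumptions):

    library(randomForest)

    rf_model <- randomForest(y ~ ., data = train)

    # Out-of-bag score per observation: fraction of trees voting for class "1"
    prob_pos <- predict(rf_model, type = "prob")[, "1"]

    # Crude ROC sweep: pick the threshold maximizing TPR - FPR (Youden's J)
    ths <- sort(unique(prob_pos))
    ths <- ths[ths > 0 & ths < 1]
    tpr <- sapply(ths, function(t) mean(prob_pos[train$y == "1"] >= t))
    fpr <- sapply(ths, function(t) mean(prob_pos[train$y == "0"] >= t))
    thr <- ths[which.max(tpr - fpr)]

    # Feed the chosen threshold back into training via cutoff
    # (one entry per class, in the order of the factor levels: "0", "1")
    rf_tuned <- randomForest(y ~ ., data = train, cutoff = c(1 - thr, thr))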
The classwt approach also seems valid, but it does not work very well in practice -- the transition between balanced prediction and trivially casting the same class regardless of attributes tends to be too sharp to be usable.
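For reference, using it is just a matter of passing per-class weights at training time; the 1:10 weighting below is an arbitrary illustration, reusing the train data frame from the sketch above:

    # Up-weight the second class relative to the first (order = factor levels of y)
    rf_weighted <- randomForest(y ~ ., data = train, classwt = c(1, 10))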
You don't seem to want logistic regression at all. What you say is "I would like to maximize the difference between true positives and false positives." That is a fine objective function, but it is not logistic regression. Let's see what it is.
First, some notation. The dependent variable is going to be $Y_i$:
\begin{align}
Y_i &= \left\{ \begin{array}{l}
1 \qquad \textrm{Purchase $i$ was profitable}\\
0 \qquad \textrm{Purchase $i$ was unprofitable}
\end{array}
\right.
\end{align}
The independent variables (the stuff you use to try to predict whether you should buy) are going to be $X_i$ (a vector). The parameter you are trying to estimate is going to be $\beta$ (a vector). For observation $i$, you predict buy when $X_i\beta>0$, that is, when the indicator function $\mathbf{1}_{X_i\beta>0}$ equals 1.
A true positive happens on observation $i$ when both $Y_i=1$ and $\mathbf{1}_{X_i\beta>0}=1$. A false positive on observation $i$ happens when $Y_i=0$ and $\mathbf{1}_{X_i\beta>0}=1$. You wish to find the $\beta$ which maximizes true positives minus false positives, or:
\begin{equation}
\max_\beta \; \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N (1-Y_i)\cdot\mathbf{1}_{X_i\beta>0}
\end{equation}
This is not an especially familiar objective function for estimating a discrete response model, but bear with me while I do a little algebra on it:
\begin{align}
&\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N (1-Y_i)\cdot\mathbf{1}_{X_i\beta>0}\\
={}&\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N \mathbf{1}_{X_i\beta>0}
+ \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0}\\
={}&\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N \mathbf{1}_{X_i\beta>0}
+ \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0}
+ \sum_{i=1}^N 1 - \sum_{i=1}^N 1 + \sum_{i=1}^N Y_i - \sum_{i=1}^N Y_i\\
={}&\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} + \sum_{i=1}^N (1-Y_i)(1-\mathbf{1}_{X_i\beta>0}) - \sum_{i=1}^N 1 + \sum_{i=1}^N Y_i
\end{align}
OK, now notice that the last two terms in that sum are not functions of $\beta$, so we can ignore them in the maximization. Finally, we have just shown that the problem you want to solve, "maximize the difference between true positives and false positives," is the same as this problem:
\begin{equation}
\max_\beta \; \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} + \sum_{i=1}^N (1-Y_i)(1-\mathbf{1}_{X_i\beta>0})
\end{equation}
Now, that estimator has a name! It is named the maximum score estimator. It is a very intuitive way to estimate the parameter of a discrete response model. The parameter is chosen so as to maximize the number of correct predictions. The first term is the number of true positives, and the second term is the number of true negatives.
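To make this concrete, here is a toy sketch of the maximum score idea on simulated data; the simulation design, the normalization of $\beta$ to unit length (its scale is not identified), and the brute-force search over directions are my illustrative choices, not Manski's implementation:

    set.seed(1)
    n <- 500
    X <- cbind(rnorm(n), rnorm(n))              # two predictors
    beta_true <- c(1, -0.5) / sqrt(1.25)        # true beta, normalized to unit length
    Y <- as.integer(X %*% beta_true + rlogis(n) > 0)

    # Objective: number of correct predictions = true positives + true negatives
    score <- function(beta) sum(Y == as.integer(X %*% beta > 0))

    # beta is only identified up to scale, so search directions on the unit circle
    angles <- seq(0, 2 * pi, length.out = 4000)
    betas <- cbind(cos(angles), sin(angles))
    beta_hat <- betas[which.max(apply(betas, 1, score)), ]

Note that the objective is a step function of $\beta$, so gradient-based optimizers are of no use here; that is part of what the smoothed version discussed below addresses.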
This is a pretty good way to estimate a (binary) discrete response model. The estimator is consistent, for example (Manski, 1985, J of Econometrics). There are some oddities to it, though. First, it is not unique in small samples: once you have found one $\beta$ which solves the maximization, any other $\beta$ which makes the exact same predictions in your dataset also solves it---so there are infinitely many $\beta$s close to the one you found. Also, the estimator is not asymptotically normal, and it converges more slowly than typical maximum likelihood estimators---cube-root $N$ instead of root-$N$ convergence (Kim and Pollard, 1990, Ann of Stat). Finally, you can't use bootstrapping to do inference on it (Abrevaya and Huang, 2005, Econometrica). There are some papers using this estimator, though---there is a fun one about predicting results in the NCAA basketball tournament (Caudill, 2003, International Journal of Forecasting 19(2), 313-317).
An estimator that overcomes most of these problems is Horowitz's smoothed maximum score estimator (Horowitz, 1992, Econometrica; Horowitz, 2002, J of Econometrics). It replaces the indicator function with a smooth kernel, giving a unique, asymptotically normal estimator that converges faster than cube-root $N$ (approaching root $N$ under stronger smoothness assumptions) and is amenable to bootstrapping. Horowitz provides example code to implement his estimator on his webpage.
Best Answer
I think you're making the common mistake of treating logistic regression as a classifier. We don't have false negatives or positives, because those require an assignment of labels. We are instead modeling the probability of a success, so all we have are the modeled probabilities for the observations where the event happened and for those where it didn't. Depending on our threshold, a probability of $\hat y_i = 0.7$ may lead to either a positive or a negative label.
In this light, $c_i$ doesn't actually represent a penalty for false negatives.
Our loss (ignoring multiplicative constants) can be rewritten as $$ \sum \limits_{i \, :\, t_i=1} \log \hat y_i + \sum \limits_{i \, :\, t_i=0} \log (1 - \hat y_i). $$ This means that for each observation where the event happened (i.e. $t_i=1$) we get a contribution of $\log \hat y_i$, and analogously we get $\log (1-\hat y_i)$ for each observation where the event did not happen (i.e. $t_i=0$).
If we add a $c_i$ term as you did, then we get $$ \sum \limits_{i \, :\, t_i=1} c_i \log \hat y_i + \sum \limits_{i \, :\, t_i=0} \log (1 - \hat y_i). $$ The effect is not that we're forcing the model to minimize false negatives; rather, we are changing the contribution to the loss of the observations where the event happened. If we're maximizing, then a large $c_i$ means the model is encouraged to assign larger probabilities to the observations with $t_i=1$, even if the probabilities for the $t_i=0$ observations suffer (although no finite $c_i$ will ever allow $\hat y_i = 1$ when $t_i=0$, and any fixed $c_i$ can still be overpowered by a badly misaligned $t_i$ and $\hat y_i$). That does mean that for an a priori fixed threshold we'll likely see the false negative rate go down, but not because we're directly penalizing it; we're just encouraging the probabilities to be bigger. For the exact same reason, the same modification will also tend to raise the false positive rate.
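To see this effect in practice, here is a small sketch with R's glm, where the weights argument plays the role of $c_i$; the simulated data and the choice of $c_i = 5$ for the event observations are arbitrary illustrations:

    set.seed(1)
    dat <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
    dat$t <- rbinom(500, 1, plogis(dat$x1 - dat$x2))

    # c_i > 1 only for the observations where the event happened
    dat$c_i <- ifelse(dat$t == 1, 5, 1)

    fit_plain    <- glm(t ~ x1 + x2, family = binomial, data = dat)
    fit_weighted <- glm(t ~ x1 + x2, family = binomial, data = dat, weights = c_i)

    # The weighted fit shifts predicted probabilities upward, so at any fixed
    # threshold fewer t = 1 cases fall below it (fewer false negatives) and
    # more t = 0 cases land above it (more false positives)
    summary(fitted(fit_plain))
    summary(fitted(fit_weighted))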