Your post seems to be mostly correct.
The way that multiclass linear classifiers are set up is that an example, $x$, is classified by the hyperplane that gives the highest score: $\underset{k}{\mathrm{argmax}\,} w_k \cdot x$.
It doesn't matter if these scores are positive or negative.
If the hinge loss for a particular example is zero, then this means that the example is correctly classified.
To see this, note that the hinge loss is zero when $1+w_{k}\cdot x_i<w_{y_i}\cdot x_i \;\forall k\neq y_i$. This is a stronger condition than $w_{k}\cdot x_i<w_{y_i}\cdot x_i \;\forall k\neq y_i$, which alone would already indicate that example $i$ is correctly classified as $y_i$.
The 1 in the hinge loss is related to the "margin" of the classifier.
The hinge loss encourages the score of the correct class, $w_{y_i}\cdot x_i$, not only to be higher than the scores of all the other classes, $w_k\cdot x_i$, but to be higher than those scores by an additive amount.
We can fix the margin at 1 because the score $w\cdot x$ is just the distance from the hyperplane scaled by the magnitude of the weights: $\frac{w}{\|w\|}\cdot x$ is the distance of $x$ from the hyperplane with normal vector $w$.
Since the same weights apply to every point in the dataset, the particular margin value is arbitrary; what matters is that the same value, 1, is used for all data points (the weights can rescale to accommodate any fixed choice).
Also, it may make things easier to understand if you parameterize the loss function as $L(x,y;w)$. You currently write the loss as a function of the linear margin alone, which is not how it is defined in general.
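To make that parameterization concrete, here is a minimal NumPy sketch (my own illustration, not code from your post) of one standard summed-over-classes multiclass hinge loss, written as $L(x,y;W)$ with a weight matrix $W$ whose rows are the $w_k$:

```python
import numpy as np

def multiclass_hinge_loss(x, y, W, margin=1.0):
    """Multiclass hinge loss L(x, y; W) for a single example.

    x: feature vector, shape (d,)
    y: integer index of the correct class
    W: weight matrix, shape (K, d); row k is the class-k weight vector w_k
    """
    scores = W @ x                          # w_k . x for every class k
    # Each wrong class is penalized when its score comes within `margin`
    # of (or exceeds) the correct class's score.
    losses = np.maximum(0.0, margin + scores - scores[y])
    losses[y] = 0.0                         # the correct class itself is excluded
    return float(losses.sum())

def predict(x, W):
    """Classify by the highest-scoring hyperplane: argmax_k w_k . x."""
    return int(np.argmax(W @ x))
```

The loss is zero exactly when every wrong class's score sits at least `margin` below the correct class's score, matching the condition above.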
Does anyone have a clue why I’m getting way more false positives than false negatives (positive is the minority class)? Thanks in advance for your help!
Because positive is the minority class. There are a lot of negative examples that could become false positives. Conversely, there are fewer positive examples that could become false negatives.
Recall that Recall $=$ Sensitivity $=\dfrac{TP}{TP+FN}$.
Sensitivity (the true positive rate, TPR) is related to the false positive rate (FPR $= 1-$ specificity), as visualized by an ROC curve. At one extreme, you call every example positive and get 100% sensitivity with a 100% FPR. At the other, you call no example positive and get 0% sensitivity with a 0% FPR. When the positive class is the minority, even a relatively small FPR (which you may have, because you also have a high recall $=$ sensitivity $=$ TPR) will produce a large number of false positives, simply because there are so many negative examples.
Since
Precision $=\dfrac{TP}{TP+FP},$
even at a relatively low FPR the false positives will overwhelm the true positives when the number of negative examples is much larger.
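To put hedged, hypothetical numbers on this: suppose a test set has 10,000 examples of which 1% (100) are positive. A recall of 90% gives $TP=90$ and $FN=10$; an FPR of only 5% over the 9,900 negatives gives $FP=495$. You then see roughly 50 times more false positives than false negatives, and the precision is $\dfrac{90}{90+495}\approx 0.15$, even though both the recall and the specificity look good.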
Alternatively,
The classifier predicts positive: $C^+$
The example is actually positive: $O^+$
Precision $= P(O^+\mid C^+)=\dfrac{P(C^+\mid O^+)\,P(O^+)}{P(C^+)}$ by Bayes' rule.
$P(O^+)$ is low when the positive class is small, and this drags the precision down with it.
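Plugging in the same hypothetical numbers as above: $P(O^+)=0.01$, $P(C^+\mid O^+)=0.9$ (the recall), and, by total probability, $P(C^+)=0.9\cdot 0.01+0.05\cdot 0.99=0.0585$ (using the 5% FPR for $P(C^+\mid O^-)$). Bayes' rule then gives Precision $=\dfrac{0.9\cdot 0.01}{0.0585}\approx 0.15$, matching the counting argument.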
Does any of you have advice on what I could do to improve my precision without hurting my recall?
As mentioned by @rinspy, GBC works well in my experience. It will, however, be slower than an SVC with a linear kernel, but you can use very shallow trees to speed it up. Also, more features or more observations might help (for example, there might be some currently unanalyzed feature that is always set to some particular value in all of your current false positives).
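If it helps, here is a minimal sketch of the shallow-tree idea, assuming GBC refers to scikit-learn's `GradientBoostingClassifier`; the synthetic dataset is just a stand-in for your data:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from sklearn.datasets import make_classification

# Hypothetical imbalanced dataset (~1% positive) standing in for yours.
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Very shallow trees (max_depth=2) keep training fast, as suggested above.
clf = GradientBoostingClassifier(max_depth=2, n_estimators=200,
                                 random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
```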
It might also be worth plotting ROC curves and calibration curves. Even though the classifier has low precision, it could still yield a very useful probability estimate. For example, just knowing that a hard drive has a 500-fold increased probability of failing, even though the absolute probability is fairly small, might be important information.
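A rough sketch of both plots with scikit-learn and matplotlib, continuing from the hypothetical `clf`, `X_test`, `y_test` above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
from sklearn.calibration import calibration_curve

probs = clf.predict_proba(X_test)[:, 1]   # P(positive) for each test example

fpr, tpr, _ = roc_curve(y_test, probs)
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(fpr, tpr)
ax1.set_xlabel("FPR"); ax1.set_ylabel("TPR (recall)")
ax1.set_title("ROC curve")
ax2.plot(mean_pred, frac_pos, marker="o")
ax2.plot([0, 1], [0, 1], linestyle="--")  # perfectly calibrated reference
ax2.set_xlabel("mean predicted probability")
ax2.set_ylabel("observed fraction positive")
ax2.set_title("Calibration curve")
plt.tight_layout()
plt.show()
```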
Also, a low precision essentially means that the classifier returns a lot of false positives. This, however, might not be so bad if a false positive is cheap.
Best Answer
Artificially constructing a balanced training set is debatable, quite controversial actually. If you do it, you should empirically verify that it really works better than leaving the training set unbalanced. Artificially balancing the test-set is almost never a good idea. The test-set should represent new data points as they come in without labels. You expect them to be unbalanced, so you need to know if your model can handle an unbalanced test-set. (If you don't expect new records to be unbalanced, why are all your existing records unbalanced?)
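As a small illustration of keeping the test set at its natural prevalence (the names and numbers here are hypothetical): a stratified split preserves the imbalance in both halves, and any rebalancing you experiment with should then touch only the training half:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           random_state=0)
# Stratifying preserves the natural ~1% positive rate in BOTH halves;
# over-/undersampling, if you try it, should apply to X_train/y_train only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)
print(Counter(y_train), Counter(y_test))
```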
Regarding your performance metric: you will always get what you ask for. If accuracy is not what you need foremost in an unbalanced set, because not only the classes but also the misclassification costs are unbalanced, then don't use it. If you use accuracy as your metric and do all your model selection and hyperparameter tuning by always taking the model with the best accuracy, then you are optimizing for accuracy.
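As a hedged sketch of what "asking for" a different metric looks like in scikit-learn, continuing from the split above: the `scoring` argument of `GridSearchCV` is where the choice is made. Average precision is just one plausible option; pick whatever scorer actually matches your misclassification costs.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Tune for average precision (area under the precision-recall curve)
# instead of the default accuracy.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3], "n_estimators": [100, 200]},
    scoring="average_precision",   # not "accuracy"
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```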
I take the minority class as the positive class; this is the conventional way of naming them. Thus the precision and recall discussed below are the precision and recall of the minority class.