I'm trying to interpret the weights of a linear SVM that I use to classify the patients in my dataset into two classes: Alzheimer and non-Alzheimer. From this post I understand that the value of each weight can be interpreted as "how much a feature contributes to the classification". I'm interested in understanding how to relate the sign of the weights to the classification labels: how do I infer that positive weights are associated with the Alzheimer label and negative weights with the non-Alzheimer label, and not vice versa? I also read this post, but its reasoning seems to be based on having two numerical labels, one positive and one negative…
SVM – Interpreting Linear SVM Weights in Binary Classification: Class Significance
classification, interpretation, svm, weights
Related Solutions
Let $\mathcal{X}$ represent your input space, i.e. the space where your data points reside. Consider a function $\Phi:\mathcal{X} \rightarrow \mathcal{F}$ that takes a point from the input space $\mathcal{X}$ and maps it to a point in $\mathcal{F}$. Now suppose we have mapped all the data points from $\mathcal{X}$ to this new space $\mathcal{F}$. If you solve the usual linear SVM in $\mathcal{F}$ instead of $\mathcal{X}$, you will notice that all the earlier workings look exactly the same, except that every point $x_i$ is represented as $\Phi(x_i)$, and instead of $x^Ty$ (the dot product, which is the natural inner product for Euclidean space) we use $\langle \Phi(x), \Phi(y) \rangle$, the natural inner product in the new space $\mathcal{F}$. So, in the end, your $w^*$ would look like
$$ w^*=\sum_{i \in SV} h_i y_i \Phi(x_i) $$
and hence, $$ \langle w^*, \Phi(x) \rangle = \sum_{i \in SV} h_i y_i \langle \Phi(x_i), \Phi(x) \rangle $$
Similarly, $$ b^*=\frac{1}{|SV|}\sum_{i \in SV}\left(y_i - \sum_{j=1}^N\left(h_j y_j \langle \Phi(x_j), \Phi(x_i)\rangle\right)\right) $$
and your classification rule looks like: $c_x=\text{sign}(\langle w, \Phi(x) \rangle+b)$.
So far so good: there is nothing new here, since we have simply applied the normal linear SVM in a different space. The magic, however, is this:
Let us say that there exists a function $k:\mathcal{X}\times\mathcal{X}\rightarrow \mathbb{R}$ such that $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$. Then, we can replace all the dot products above with $k(x_i, x_j)$. Such a $k$ is called a kernel function.
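To make the identity $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ concrete, here is a minimal sketch (my own illustration, not part of the original answer) using the quadratic kernel $k(x, y) = (x^\top y)^2$ in two dimensions, one of the few kernels whose explicit feature map $\Phi$ is finite-dimensional:

```python
import numpy as np

# Explicit feature map for the quadratic kernel k(x, y) = (x . y)^2 in 2-D:
# Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2). This is an illustration only; for the
# RBF kernel, F is infinite-dimensional and no such finite Phi exists.
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, y):
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

lhs = k(x, y)                 # kernel evaluated in the input space X
rhs = np.dot(phi(x), phi(y))  # inner product in the feature space F
```

Both quantities agree, which is exactly what lets us "replace all the dot products above with $k(x_i, x_j)$" without ever constructing $\Phi$.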
Therefore, your $w^*$ and $b^*$ look like, $$ \langle w^*, \Phi(x) \rangle = \sum_{i \in SV} h_i y_i k(x_i, x) $$ $$ b^*=\frac{1}{|SV|}\sum_{i \in SV}\left(y_i - \sum_{j=1}^N\left(h_j y_j k(x_j, x_i)\right)\right) $$
For which kernel functions is the above substitution valid? Well, that's a slightly involved question and you might want to take up proper reading material to understand those implications. However, I will just add that the above holds true for RBF Kernel.
To answer your question, "Is the situation so that all the support vectors are needed for the classification?" Yes. As you may notice above, we compute the inner product of $w^*$ with $\Phi(x)$ instead of computing $w^*$ explicitly. This requires us to retain all the support vectors for classification.
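As a sketch of this in practice, assuming scikit-learn's `SVC`: the fitted model stores the products $h_i y_i$ in `dual_coef_` along with the support vectors themselves, so the decision value $\sum_{i \in SV} h_i y_i\, k(x_i, x) + b^*$ can be reassembled from the support vectors alone and compared against the library's `decision_function`:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

# dual_coef_ holds h_i * y_i for each support vector; the classifier only
# needs the support vectors, not w* itself, to score a new point.
x_new = X[:5]
K = rbf_kernel(clf.support_vectors_, x_new, gamma=0.5)  # k(x_i, x)
manual = clf.dual_coef_ @ K + clf.intercept_            # <w*, Phi(x)> + b*
```

`manual` matches `clf.decision_function(x_new)` exactly, confirming that only the support vectors are ever used at prediction time.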
Note: The $h_i$'s in the final section here are the solution to the dual of the SVM in the space $\mathcal{F}$, not $\mathcal{X}$. Does that mean we need to know the $\Phi$ function explicitly? Luckily, no. If you look at the dual objective, it consists only of inner products, and since $k$ allows us to compute those inner products directly, we don't need to know $\Phi$ explicitly. The dual objective simply looks like $$ \max_h \sum_i h_i - \frac{1}{2}\sum_{i,j} y_i y_j h_i h_j k(x_i, x_j) \\ \text{subject to: } \sum_i y_i h_i = 0,\; h_i \geq 0 $$
If it is only 70%-30% there is probably no need to balance the dataset. The class imbalance problem is caused by not having enough patterns for the minority class, rather than a high ratio of positive to negative patterns. Generally, if you have enough data, the "class imbalance problem" doesn't arise. Also, note that if you artificially balance the dataset, you are implying an equal prior probability of positive and negative patterns. If that isn't true, your model may give bad predictions by over-predicting the minority class.
More importantly, there may be an overlap between classes such that the Bayes optimal decision is always to assign patterns to the positive class, in which case your model is doing exactly the right thing. Consider the case where there is one explanatory variable, which is distributed according to a standard normal distribution for both classes. In that case, as the positive class has a higher prior probability, the optimal model assigns all patterns to the positive class. Similar examples can be constructed where the class means are not the same, but the difference is small compared with the variation.
If classifying everything as the majority class is a problem, that suggests that the misclassification costs of false positives and false negatives are not the same. This can be built into the classifier by changing the decision threshold rather than the model, since you are using a logistic loss.
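A minimal sketch of this threshold adjustment, using scikit-learn's `LogisticRegression` on synthetic 70/30 data; the cost values `c_fp` and `c_fn` are hypothetical, chosen only to illustrate the formula:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Imbalanced data: roughly 70% positive (class 1), 30% negative (class 0).
X, y = make_classification(n_samples=1000, weights=[0.3, 0.7], random_state=0)
clf = LogisticRegression().fit(X, y)

p = clf.predict_proba(X)[:, 1]

# Default rule: predict the positive class when p > 0.5 (equal costs).
default_pred = (p > 0.5).astype(int)

# With a false negative costing c_fn and a false positive c_fp, the
# cost-minimizing threshold on p is c_fp / (c_fp + c_fn);
# e.g. c_fn = 4, c_fp = 1 gives a threshold of 0.2.
c_fp, c_fn = 1.0, 4.0
threshold = c_fp / (c_fp + c_fn)
cost_pred = (p > threshold).astype(int)
```

Lowering the threshold makes the classifier more willing to predict the expensive-to-miss class, without refitting or resampling anything.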
Best Answer
Both positive and negative weights carry information about both the AD and non-AD labels. If you code AD as 1 and non-AD as 0, then positive weights are positively associated, and negative weights negatively associated, with the AD label, and vice versa for the non-AD label. That is, a positive weight means that the larger that variable is, the higher the chance that a subject will be classified as AD, and a negative weight means that the lower that variable is, the higher the chance of an AD classification.
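A small sketch of this sign interpretation, using scikit-learn's `SVC` with a linear kernel on synthetic data; the two features and their effect directions are invented purely for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy "patient" data with two hypothetical features: feature 0 is higher for
# the AD group, feature 1 is lower. Labels: 1 = AD, 0 = non-AD.
n = 200
X_ad    = np.column_stack([rng.normal( 2, 1, n), rng.normal(-2, 1, n)])
X_nonad = np.column_stack([rng.normal(-2, 1, n), rng.normal( 2, 1, n)])
X = np.vstack([X_ad, X_nonad])
y = np.array([1] * n + [0] * n)

clf = SVC(kernel="linear").fit(X, y)

# clf.classes_ is [0, 1]; the decision function w.x + b is positive for the
# class listed second (AD = 1). So w[0] comes out positive ("larger value
# pushes toward AD") and w[1] negative ("smaller value pushes toward AD").
w = clf.coef_[0]
```

The key point is that the sign convention follows the label coding: scikit-learn orients the decision function so that positive values map to the class it lists second in `classes_`.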
That being said, I am not a huge fan of interpreting weights from ML models; see for example Haufe et al., "On the interpretation of weight vectors of linear models in multivariate neuroimaging", https://www.sciencedirect.com/science/article/pii/S1053811913010914