Solved – Machine Learning: Feature Comparison

Tags: feature selection, machine learning, normal distribution, pca

I am working on a machine learning algorithm that performs binary classification. I have different features, and I would like to know which of them are better for the classification; I mean, which of them really make a difference between the classes and which of them are not important.

I have thought of estimating the distribution of every feature in class A and in class B and computing the overlap between them. If they have a big area in common, the feature is not good.

I have also heard about Principal Component Analysis. I do not know the best way of determining which features are best, so I ask for your help.

Thanks a lot!!!!

Best Answer

As I continued to investigate this subject, I found out that there are different methods to compare features. I am going to post some of them, just in case someone has the same problem as me:

1.- The two-sample Kolmogorov-Smirnov test computes the maximum distance between the two empirical cumulative distribution functions and returns a statistic measuring how different the distributions are. Comparing a feature's values from class A against class B gives you an idea: if the two distributions are close (small statistic), it is not a good feature.
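A minimal sketch of this with scipy's `ks_2samp`; the class samples here are synthetic, just to show the call:

```python
# Two-sample Kolmogorov-Smirnov test as a feature score: the statistic is
# the maximum distance between the two empirical CDFs. Near 0 means the
# class distributions are similar (weak feature); near 1 means they are
# well separated (strong feature).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical feature values for the two classes (made-up data)
feature_class_a = rng.normal(loc=0.0, scale=1.0, size=500)
feature_class_b = rng.normal(loc=2.0, scale=1.0, size=500)  # shifted -> separable

statistic, p_value = stats.ks_2samp(feature_class_a, feature_class_b)
print(f"KS statistic: {statistic:.3f}  p-value: {p_value:.3g}")
```

With the 2-sigma shift above, the statistic comes out large and the p-value tiny; for two samples drawn from the same distribution it would be close to 0.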

2.- Compute the overlap between the two density functions: the bigger the overlap, the worse the feature is at differentiating the classes. Code:

# compute overlap between the two class distributions via kernel density estimates
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats, integrate

# sick / healthy: 1-D arrays with this feature's values for each class
ker_sick = stats.gaussian_kde(sick)
ker_healthy = stats.gaussian_kde(healthy)

# range covering both samples (inlined, replacing the aux.get_min_max helper)
min_point = min(sick.min(), healthy.min())
max_point = max(sick.max(), healthy.max())
points_range = np.linspace(min_point, max_point, 100)
sick_points = ker_sick(points_range)
healthy_points = ker_healthy(points_range)
# pointwise minimum of the two densities (replaces aux.min_between_two_list)
min_points = np.minimum(sick_points, healthy_points)


def y_pts(pt):
    # lower envelope of the two densities at pt
    return min(ker_sick(pt), ker_healthy(pt))


# integrate the lower envelope: 1.0 means full overlap, 0.0 means none
overlap = integrate.quad(y_pts, a=-np.inf, b=np.inf)
print("overlap: ", overlap)
overlap_list.append(overlap[0])  # overlap_list collects the score per feature

# plot both distributions (healthy, sick) and shade the overlap between them
fig = plt.figure()
ax = fig.add_subplot(121)
sns.kdeplot(sick, fill=True, cut=0, label="sick", color='r', ax=ax)
sns.kdeplot(healthy, fill=True, cut=0, label="healthy", color='g', ax=ax)
ax.plot(points_range, min_points, color="y", alpha=1)
ax.fill_between(points_range, 0, min_points)
ax.set(xlabel='value', ylabel='probability', title=name)  # name: current feature's name

3.- The previous methods are a bit more homemade; scikit-learn also includes a feature-selection module (sklearn.feature_selection) with several methods to compare features and tell you which ones are better: Scikit feature selection
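A minimal sketch of that module using `SelectKBest` with the ANOVA F-score; the feature matrix and labels here are synthetic, purely to show the API:

```python
# Univariate feature selection with scikit-learn: score each feature
# against the binary labels and keep the best k. Only feature 0 below
# actually depends on the class; features 1 and 2 are pure noise.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)          # binary class labels
X = rng.normal(size=(n, 3))             # 3 candidate features
X[:, 0] += 2.0 * y                      # feature 0 shifts with the class

selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
print("ANOVA F-scores per feature:", selector.scores_)
print("selected feature index:", selector.get_support(indices=True))
```

Swapping `f_classif` for `mutual_info_classif` gives a score that also picks up non-linear dependence.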

I tried these 3 methods, including several variations of the scikit-learn feature selection, and they always agreed on which feature was best, so I assume all of them are working properly.

Thanks a lot and I hope this ends up being helpful!
