Solved – multiclass classification having class imbalance with Gradient Boosting Classifier

classification, machine learning, multi-class

I am using the Abalone data set from UCI (https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data) for classification. I have scaled the data and used t-SNE for visualization.

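import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

# The file has no header row, so name the columns explicitly
cols=['Sex','Length','Diameter','Height','Whole','Shucked','Viscera','Shell','Rings']
data=pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',header=None,names=cols)
x=data.drop('Rings',axis=1)
y=data['Rings']
# Encode the categorical Sex column as integers
x['Sex']=x['Sex'].map({'M':0,'I':1,'F':2})
sc=StandardScaler()
x_scaled=sc.fit_transform(x)
# Reduce to 2 dimensions for plotting, colored by the target (ring count)
sne=TSNE(n_components=2)
x_red_sne=sne.fit_transform(x_scaled)
plt.scatter(x=x_red_sne[:,0],y=x_red_sne[:,1],c=y,cmap='nipy_spectral')
plt.show()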

Visualization of data in 2D

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score,train_test_split
from sklearn.metrics import classification_report,f1_score

x_train,x_test,y_train,y_test=train_test_split(x_scaled,y,train_size=.7)
gb=GradientBoostingClassifier(n_estimators=200,learning_rate=.1)
gb.fit(x_train,y_train)
# Note: cross_val_score refits clones of gb on folds of the data it is given (here, the test split)
cross_val_score(estimator=gb,X=x_test,y=y_test,scoring='f1_weighted',cv=5)
print(classification_report(y_true=y_test,y_pred=gb.predict(x_test)))

The model performs poorly: the classification report shows recall, F1 and precision of about .23, .22 and .24 respectively.

I understand that this is a multiclass classification problem with high class imbalance. What can I do to improve the model?

Best Answer

Gradient boosting is a reasonable approach for a multiclass problem that suffers from class imbalance. However, you are not tuning any of its hyperparameters: cross_val_score only evaluates the model with the settings you fixed (n_estimators=200, learning_rate=.1). I would recommend following the guide linked below and tuning a few parameters.

https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
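
As a rough illustration of what tuning could look like, here is a minimal sketch using scikit-learn's GridSearchCV. It is not taken from the linked guide. The data is synthetic (make_classification with three imbalanced classes) so that the snippet runs on its own, and the grid values are placeholders rather than recommendations. With the abalone data you would pass your own x_train and y_train instead, and the very rare ring counts would probably need to be merged or dropped before stratified folds work cleanly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced 3-class data standing in for the scaled features (assumption)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0)

# Stratify so each class keeps its proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)

# Small illustrative grid; the values are placeholders, not recommendations
param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.05, 0.1],
    'max_depth': [2, 3],
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    scoring='f1_weighted',  # same metric as in the question
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
# Tune on the training split only; the test split stays untouched
search.fit(X_train, y_train)

print(search.best_params_)
# After fitting, search.predict uses the best model refit on all of X_train
test_f1 = f1_score(y_test, search.predict(X_test), average='weighted')
print(test_f1)
```

The search runs on the training split only, and the held-out test split is scored once at the end, so the reported F1 is not influenced by the parameter selection.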
