Calculate drug enrichment curves using AUC

auc, python, r, roc, scikit-learn

I have data on various drugs, with p values describing how strongly each drug is associated with a given disease (e.g., type 2 diabetes; p values calculated with gene set analysis). I want to calculate an enrichment area under the curve using ROC, where the x axis runs from the lowest p values (one p value per drug) on the left to the highest on the right, and the y axis gives the percentage of approved type 2 diabetes drugs found as you move from left to right along the x axis. I know this sort of thing is done for drugs in virtual screening etc.; however, I have not done this before and am a little lost on how to start. Is it possible to use the AUC functions from scikit-learn for something like this?

drug	p-value	disease	approved for disease i
drug1	0.0032	type2d	1
drug2	0.004	type2d	1
…	…	…	…
drug100	0.87	type2d	0

Thanks in advance for any help!

Best Answer

For those interested, this should work:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import roc_curve, auc

# df: one row per drug, with its p value ('P') and approved indication ('indication')

# convert p values to -log10(p) so that smaller p values get larger scores
df['P'] = -np.log10(df['P'])
df = df.sort_values(by=['P'], ascending=False)

y_score = df['P']
# 1 if the drug is approved for the indication of interest, otherwise 0
y_true = (df['indication'] == "INSERT CLINICAL INDICATION HERE").astype(int)

# calculate the ROC curve and its AUC for plotting
fpr, tpr, threshold = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

'indication' here is the clinical indication of a drug. So this will create the enrichment curve for a group of drugs in your data set, where the group is all drugs in your data that are approved for whatever disease you want to look at.

'df' is a data frame that contains, for each drug, a p value for its association with the phenotype and the disease the drug is approved for.
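For anyone who wants to try this without their own data, here is a minimal self-contained sketch. The toy data frame, its values, and the "type2d" label are made up for illustration. It computes the ROC AUC as above and also the cumulative curve described in the question (fraction of approved drugs found as you walk down the list ranked by p value).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc

# Hypothetical toy data: eight drugs, three of them approved for the disease.
df = pd.DataFrame({
    "drug": [f"drug{i}" for i in range(1, 9)],
    "P": [0.0032, 0.004, 0.02, 0.05, 0.2, 0.4, 0.6, 0.87],
    "indication": ["type2d", "type2d", "other", "type2d",
                   "other", "other", "other", "other"],
})

# Smaller p value -> larger score.
y_score = -np.log10(df["P"])
# 1 if the drug is approved for the disease of interest, otherwise 0.
y_true = (df["indication"] == "type2d").astype(int)

# ROC curve and its area.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Cumulative enrichment curve: rank drugs by ascending p value and track
# the fraction of approved drugs recovered at each rank.
order = np.argsort(df["P"].to_numpy())
hits = y_true.to_numpy()[order]
frac_screened = np.arange(1, len(df) + 1) / len(df)
frac_found = np.cumsum(hits) / hits.sum()

print(f"ROC AUC: {roc_auc:.3f}")
```

Plotting fpr against tpr gives the ROC curve; plotting frac_screened against frac_found gives the rank-based enrichment curve from the question.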


Related Solutions

Solved – Calculate LOO-AUC values using glmnet

number of folds - default is 10. Although nfolds can be as large as the sample
size (leave-one-out CV), it is not recommended for large datasets. Smallest
value allowable is nfolds=3

From the package documentation, it appears that you can indeed set nfolds equal to the sample size to perform leave-one-out CV.

However, as the error message indicates, the problem you are facing is that glmnet needs at least 10 observations per fold in order to calculate the AUC (which requires a way to rank your test cases).

Think about it: if there is only one test case, how are you supposed to rank it?

This is only an issue because of the performance measure (AUC) you have chosen. Other measures that do not require ranking, i.e., those that can be calculated from just one test case (e.g., mean squared error), will not give you this error.
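To make the ranking argument concrete, here is a small Python sketch (not glmnet code; the helper names pairwise_auc and squared_error are made up for illustration). It computes the AUC as the fraction of (positive, negative) pairs that are ranked correctly, which shows why a fold with a single test case has nothing to rank, while the squared error is still defined.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Returns None when the fold contains no such pair.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return None  # nothing to rank against
    diff = pos[:, None] - neg[None, :]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def squared_error(y_true, scores):
    """Mean squared error; defined even for a single observation."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return float(np.mean((y_true - scores) ** 2))

# A leave-one-out fold: one test case, so no pair exists.
single_fold_auc = pairwise_auc([1], [0.8])
single_fold_mse = squared_error([1], [0.8])

# A fold with both classes present can be ranked.
larger_fold_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9])
```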

Solved – ROC curves and AUC in simulations to compare models

Since you are using the ROC, I presume that you are running 5 classifiers. Frank is right about the ROC: that is not how people usually compare models. For linear and generalized linear models you can apply the likelihood ratio test.

However, in case you are after the best prediction performance, and particularly in case you are not using a parametric model, but say a random forest classifier, I would do the following:

generate data
split it randomly into a training and testing set
train all your 5 models and test their performance
repeat the entire procedure as many times as the run time permits and store all 5 ROC curves (I would pick 1,000 or 10,000 repetitions as a minimum, depending on the convergence of the mean predictions)
report the means of the 5 ROC curves together with a 90% pointwise confidence interval around them

The idea is of course that you pick a model that seems like the best combination of high AUC and low variance (narrow intervals around the mean) of the estimates.
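The procedure above could be sketched roughly as follows in Python with scikit-learn. Everything specific here is an assumption for illustration: the simulated data from make_classification, the two stand-in models (rather than five), the small number of repetitions, and the FPR grid used to average curves pointwise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

def make_models(seed):
    # Two stand-in models; swap in the five models being compared.
    return {
        "logistic": LogisticRegression(max_iter=1000),
        "forest": RandomForestClassifier(n_estimators=50, random_state=seed),
    }

n_repeats = 20                        # kept small for speed
fpr_grid = np.linspace(0.0, 1.0, 51)  # common grid for pointwise averaging
curves = {name: [] for name in make_models(0)}

for rep in range(n_repeats):
    # 1. generate data
    X, y = make_classification(n_samples=300, n_features=10,
                               n_informative=4, random_state=rep)
    # 2. split randomly into training and testing sets
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=rep)
    # 3. train every model and store its test ROC curve on the grid
    for name, model in make_models(rep).items():
        model.fit(X_tr, y_tr)
        scores = model.predict_proba(X_te)[:, 1]
        fpr, tpr, _ = roc_curve(y_te, scores)
        interp_tpr = np.interp(fpr_grid, fpr, tpr)
        interp_tpr[0] = 0.0  # pin the curve to the origin
        curves[name].append(interp_tpr)

# 4. mean curve with a 90% pointwise interval per model
summary = {}
for name, tprs in curves.items():
    tprs = np.vstack(tprs)
    mean_tpr = tprs.mean(axis=0)
    summary[name] = {
        "mean": mean_tpr,
        "lower": np.percentile(tprs, 5, axis=0),
        "upper": np.percentile(tprs, 95, axis=0),
        "auc_of_mean": auc(fpr_grid, mean_tpr),
    }
```

Comparing summary[name]["auc_of_mean"] together with the width of the band between "lower" and "upper" gives the trade-off between high AUC and low variance described above.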

Best
