Calculate drug enrichment curves using AUC

auc python r roc scikit-learn

I have data on various drugs, with p-values describing how strongly each drug is associated with a given disease (e.g., type 2 diabetes; the p-values were calculated with gene set analysis). I want to calculate an enrichment area under the curve (AUC) using ROC, where the x axis runs from the lowest p-value to the highest (one p-value per drug) and the y axis gives the percentage of approved type 2 diabetes drugs found as you move from left to right along the x axis. I know this sort of thing is done for drugs in virtual screening, etc.; however, I have not done it before and am a little lost on how to start. Is it possible to use the AUC functions from scikit-learn for something like this?

drug      p-value   disease   approved for disease i
drug1     0.0032    type2d    1
drug2     0.004     type2d    1
drug100   0.87      type2d    0

Thanks in advance for any help!

Best Answer

For those interested, this should work:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import roc_curve, auc

# convert p-values to -log10(p) so that stronger associations get higher scores
df['P'] = -np.log10(df['P'])
df = df.sort_values(by=['P'], ascending=False)

y_score = df['P']
# 1 if the drug is approved for the indication of interest, 0 otherwise
y_true = (df['indication'] == "INSERT CLINICAL INDICATION HERE").astype(int)

# calculate the ROC curve and its AUC to plot
fpr, tpr, threshold = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

'indication' here is the clinical indication of a drug. This creates the enrichment curve for a group of drugs in your data set, where the group is all drugs in your data that are approved for whatever disease you want to look at.
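Note that a ROC curve puts the false positive rate on the x axis, which is not quite the curve described in the question (the fraction of approved drugs recovered as you walk down the list ranked by p-value). A minimal sketch of that rank-based curve follows. The data are made up for illustration, and the function name `enrichment_curve` is hypothetical, not something from scikit-learn.

```python
import numpy as np

def enrichment_curve(p_values, is_approved):
    """Return (fraction of drugs screened, fraction of approved drugs found)."""
    p_values = np.asarray(p_values, dtype=float)
    is_approved = np.asarray(is_approved, dtype=int)

    # Rank drugs from the lowest p-value to the highest.
    order = np.argsort(p_values, kind="stable")
    hits = np.cumsum(is_approved[order])

    n = len(p_values)
    frac_screened = np.arange(1, n + 1) / n
    frac_found = hits / is_approved.sum()

    # Prepend the origin so the curve starts at (0, 0).
    x = np.concatenate(([0.0], frac_screened))
    y = np.concatenate(([0.0], frac_found))
    return x, y

# Made-up example data (assumption, not taken from the question).
p = [0.0032, 0.004, 0.03, 0.2, 0.5, 0.87]
approved = [1, 1, 0, 1, 0, 0]

x, y = enrichment_curve(p, approved)

# Area under the enrichment curve by the trapezoidal rule.
area = float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))
print(area)
```

The area computed this way is generally not the same number as the ROC AUC, because the x axis counts all drugs screened rather than only the non-approved ones.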

'df' is a data frame that contains a p-value for each drug's association with a phenotype (column 'P') and the disease each drug is approved for (column 'indication').
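To try this end to end without your own data, here is a small self-contained sketch. The drug names, p-values and the second indication ("asthma") are invented for illustration; only the column names 'P' and 'indication' follow the description above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Toy data frame (assumed values, for illustration only).
df = pd.DataFrame({
    "drug": ["drug1", "drug2", "drug3", "drug4", "drug5", "drug6"],
    "P": [0.0032, 0.004, 0.03, 0.2, 0.5, 0.87],
    "indication": ["type2d", "type2d", "asthma", "type2d", "asthma", "asthma"],
})

# Higher score = stronger association, so use -log10(p).
y_score = -np.log10(df["P"])
# 1 if the drug is approved for the disease of interest, 0 otherwise.
y_true = (df["indication"] == "type2d").astype(int)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# roc_auc_score computes the same area directly from labels and scores.
print(roc_auc, roc_auc_score(y_true, y_score))
```

The `fpr` and `tpr` arrays can then be passed to `plt.plot(fpr, tpr)` if you want to draw the curve with matplotlib.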
