ROC curve 101
An ROC curve visualizes the predictive performance of a classifier at various levels of conservatism. In simple terms, it illustrates the price you pay in terms of false positive rate to increase the true positive rate. Conservatism is controlled via a threshold on the confidence scores: instances scoring above the threshold receive the positive label, the rest the negative label.
The x-axis can be interpreted as a measure of the classifier's liberalism: it depicts the false positive rate ($1 - \text{specificity}$). The y-axis represents how good the classifier is at detecting positives: it depicts the true positive rate (sensitivity). A perfect classifier's ROC curve passes through $(0, 1)$, meaning it classifies all positives correctly without a single false positive. This results in an area under the curve of exactly $1$.
Intuitively, a more conservative classifier (one that labels fewer instances as positive) has higher precision and lower sensitivity than a more liberal one. When the threshold for a positive prediction decreases (i.e. a lower confidence score suffices for the positive label), both the false positive rate and the sensitivity can only rise. This is why an ROC curve is always monotonically non-decreasing.
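To make the threshold mechanics concrete, here is a minimal sketch in base R; the scores, labels and the two thresholds are made up for illustration:

```
# Hypothetical confidence scores and true 0/1 labels (1 = positive)
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1,   1,   0,   1,   1,   0,   0,   0)

rates <- function(threshold) {
  pred <- as.numeric(score >= threshold)                  # positive above threshold
  c(TPR = sum(pred == 1 & label == 1) / sum(label == 1),  # sensitivity
    FPR = sum(pred == 1 & label == 0) / sum(label == 0))  # 1 - specificity
}

rates(0.75)  # conservative: TPR = 0.50, FPR = 0.00
rates(0.35)  # liberal:      TPR = 1.00, FPR = 0.25
```

Lowering the threshold from 0.75 to 0.35 doubles the sensitivity at the cost of a 0.25 false positive rate.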
Plotting an ROC curve
You need not compute predictions separately for each threshold, as you suggest. An ROC curve is computed from the ranking produced by your classifier (e.g. your logistic regression model).
Use the model to predict every test point once. You'll get a vector of confidence scores; let's call it $\mathbf{\hat{Y}}$. From this vector you can produce the full ROC curve (or at least an estimate thereof). The distinct values in $\mathbf{\hat{Y}}$ are your thresholds. Since you use logistic regression, the confidence scores in $\mathbf{\hat{Y}}$ are probabilities, i.e. in $[0, 1]$.
Now, simply iterate over the sorted values, adjusting the TP/TN/FP/FN counts as you go, and you can compute the ROC curve point by point. The number of points in your ROC curve equals the length of $\mathbf{\hat{Y}}$, assuming there are no ties among the predictions.
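A minimal sketch of this sweep in base R, assuming a vector `score` of confidence scores and a 0/1 vector `label` of true classes (as in the toy example further up):

```
# Rank test points by confidence; lowering the threshold past each score
# turns exactly one more point positive (assuming no ties)
ord <- order(score, decreasing = TRUE)
lab <- label[ord]

# Running TP and FP counts, normalized to rates; prepend the (0, 0) point
tpr <- c(0, cumsum(lab == 1) / sum(lab == 1))
fpr <- c(0, cumsum(lab == 0) / sum(lab == 0))
roc <- cbind(FPR = fpr, TPR = tpr)  # one row per ROC point
```

Each row of `roc` corresponds to one threshold, so the curve has as many points as there are test points (plus the origin).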
To plot the final result, use a function that plots with zero-order hold (ZOH) rather than linear interpolation between points, like MATLAB's `stairs` or base R's `plot(..., type = "s")`. Also keep this in mind when computing the area under the curve (AUC): linear interpolation instead of ZOH yields a more optimistic estimate, tending towards the area under the convex hull (AUCH) rather than the empirical AUC.
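For instance, with the `roc` matrix from the sketch above, base R draws the ZOH staircase and computes the corresponding AUC as a sum of rectangles:

```
# type = "s" draws stair steps (zero-order hold), not straight segments
plot(roc[, "FPR"], roc[, "TPR"], type = "s",
     xlab = "False positive rate", ylab = "True positive rate")

# ZOH area: width of each step times the TPR held over it
auc <- sum(diff(roc[, "FPR"]) * head(roc[, "TPR"], -1))
```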
An ROC curve visualizes the performance of a single model under different configurations (= cutoffs), and hence the second option is the right way.
With the first option you would somehow be plotting points from different models (the same learning approach with different hyperparameters), which is not what an ROC curve shows. In fact, which point of each of these different models would you even plot to make them somewhat calibrated and comparable? All the points where $P(\hat{Y}=1) = 0.5$?
Best Answer
The problem arises from the assumption that there cannot be different $y$ values for one $x$ value. You seem to (implicitly?) mix up the ROC curve with the graph of a function $y = f(x)$, where each $x$ value is indeed mapped to a single $y$ value. This is not the case for an ROC curve. In fact, the website you linked says further below: "A classifier with the perfect performance level shows a combination of two straight lines – from the origin (0.0, 0.0) to the top left corner (0.0, 1.0) and further to the top right corner (1.0, 1.0)."
Let's consider an example. Here is some data where the test values perfectly separate the occurrence of some entity in reality, with the perfect threshold marked in red (the code below reproduces this plot):
The ROC curve (red line) for this data looks exactly as the quoted description suggests:
As you can see, for $x = 0$ there is not only $y = 0$ and $y = 1$; in fact, every threshold that yields $x = 0$ produces a point in the interval $[0, 1]$ on the $y$ axis. Hence there can be many more different $y$ values for $x = 0$ than the two you mentioned (the start $(0.0, 0.0)$ and the end $(1.0, 1.0)$). The answer is therefore that an ROC curve always starts at $(0, 0)$, even in the case of a "perfect classifier", and I hope it has become clear that this is not a contradiction.
You can reproduce this with R code along the following lines (a minimal sketch using base graphics; the data and the 0.5 cutoff are made up to match the plots described above):
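```
# Perfectly separable data: every negative scores below every positive
score <- c(0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9)
label <- c(0,   0,   0,   0,   1,   1,   1,   1)

# First plot: the test values by true class, perfect threshold in red
stripchart(score ~ label, pch = 19, xlab = "test value")
abline(v = 0.5, col = "red")

# Empirical ROC curve: sweep the threshold from high to low
ord <- order(score, decreasing = TRUE)
lab <- label[ord]
tpr <- c(0, cumsum(lab == 1) / sum(lab == 1))
fpr <- c(0, cumsum(lab == 0) / sum(lab == 0))

# Second plot: the curve climbs from (0, 0) straight up to (0, 1),
# since every positive is found before the first false positive,
# and then runs right to (1, 1)
plot(fpr, tpr, type = "s", col = "red",
     xlab = "false positive rate", ylab = "true positive rate")
```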