Gradient descent and gradient ascent are the same algorithm. More precisely:
Gradient ascent applied to $f(x)$, starting at $x_0$
is the same as
Gradient descent applied to $-f(x)$, starting at $x_0$.
This is true in the sense that gradient ascent in the first case and gradient descent in the second generate the same sequence of points; the first converges if and only if the second converges; and when they both converge, they converge to the same place.
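A quick sketch makes the equivalence concrete (the toy concave function and step size here are my own choices, not anything from the question):

def grad_f(x):
    # f(x) = -(x - 3)**2 is a toy concave function with its maximum at x = 3.
    return -2.0 * (x - 3.0)

def grad_neg_f(x):
    # Gradient of -f.
    return -grad_f(x)

x_asc = x_desc = 0.0        # same starting point x_0
step = 0.1
for _ in range(100):
    x_asc = x_asc + step * grad_f(x_asc)          # gradient ascent on f
    x_desc = x_desc - step * grad_neg_f(x_desc)   # gradient descent on -f
print(x_asc, x_desc)        # identical at every iteration; both converge to 3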
For logistic regression, the cost function is
$$ \pm \sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] $$
You get to choose one of these two signs; it doesn't matter which, as long as you are consistent.
Since $p_i$ is between zero and one, both $\log(p_i)$ and $\log(1 - p_i)$ are negative, hence
$ \sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] $ is always negative.
Further, by letting $p_i \rightarrow 0$ for a point with $y_i = 1$, we can drive this cost function all the way to $- \infty$ (which can also be accomplished by letting $p_i \rightarrow 1$ for a point with $y_i = 0$). So this cost function has the shape of an upside-down bowl, and it should be maximized using gradient ascent.
If we use the negative of this cost function, then
$ - \sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] $ is always positive.
We get exactly the opposite behavior: the same choices of $p_i$ force it to $+ \infty$. So this is a right-side-up bowl, and we should use gradient descent to minimize it.
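Here is a minimal numeric sketch of the two sign conventions (the y and p values below are made up):

import numpy as np

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])   # hypothetical predicted probabilities

log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # always <= 0; maximize
cost = -log_likelihood                                            # always >= 0; minimize
print(log_likelihood, cost)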
Here's how I go about it in pure sklearn. There are probably ways to improve this workflow.
First I made a transformer that simply selects one or more columns from a DataFrame or matrix:
import numpy as np
import pandas as pd

class ColumnSelector(object):

    def __init__(self, idxs):
        self.idxs = np.asarray(idxs)

    # Fit here doesn't need to do anything. We already know the indices of the
    # columns we want to keep.
    def fit(self, *args, **kwargs):
        return self

    def transform(self, X, **transform_params):
        # Need to treat pandas data frames and numpy arrays slightly differently.
        if isinstance(X, pd.DataFrame):
            return X.iloc[:, self.idxs]
        return X[:, self.idxs]
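For example, on a small made-up frame (the column names are just illustrative) it slices out the requested columns:

df = pd.DataFrame({'arsenic': [0.5, 1.2, 2.3], 'dist': [16.8, 47.3, 21.0]})
print(ColumnSelector([0]).fit(df).transform(df))   # returns just the arsenic column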
Then I made a PolynomialExpansion class:
class PolynomialExpansion(object):

    def __init__(self, degree):
        self.degree = degree

    def fit(self, *args, **kwargs):
        return self

    def transform(self, X, **transform_params):
        # Initialize our return value as a matrix of all zeros.
        # We are going to overwrite all of these zeros in the code below.
        X_poly = np.zeros((X.shape[0], self.degree))
        # The first column in our transformed matrix is just the vector we started with.
        X_poly[:, 0] = X.squeeze()
        # Cleverness Alert:
        # We create the subsequent columns by multiplying the most recently created
        # column by X. This creates the sequence X -> X^2 -> X^3 -> etc...
        for i in range(1, self.degree):
            X_poly[:, i] = X_poly[:, i-1] * X.squeeze()
        return X_poly
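A quick check on a made-up vector shows the X -> X^2 expansion:

x = np.array([1.0, 2.0, 3.0])
print(PolynomialExpansion(2).fit(x).transform(x))
# [[1. 1.]
#  [2. 4.]
#  [3. 9.]]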
Then to use it, I combine it with Pipelines and FeatureUnions. This pipeline uses the arsenic (wells) data from Gelman and Hill:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

wells_pipeline = Pipeline([
    ('polynomial_expansions', FeatureUnion([
        ('arsenic_quadratic', Pipeline([
            ('arsenic_selector', ColumnSelector([0])),
            ('quadratic_expansion', PolynomialExpansion(2))
        ])),
        ('distance_quadratic', Pipeline([
            ('distance_selector', ColumnSelector([1])),
            ('quadratic_expansion', PolynomialExpansion(2))
        ]))
    ])),
    ('regression', LogisticRegression())
])
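Fitting then works like any other sklearn estimator. Here I'm assuming wells is a DataFrame holding the data, with arsenic and dist as the first two columns and switch as the binary response (the names are illustrative and may differ in your copy of the data):

# wells: hypothetical DataFrame of the Gelman & Hill wells data
wells_pipeline.fit(wells[['arsenic', 'dist']], wells['switch'])
predictions = wells_pipeline.predict_proba(wells[['arsenic', 'dist']])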
Another option is to use patsy, which allows you to specify model formulas, as in R.
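For example, something like the following (hypothetical column names again) builds an equivalent quadratic design matrix from a formula:

from patsy import dmatrices

# I(...) lets you embed Python arithmetic in the formula, as in R.
y, X = dmatrices('switch ~ arsenic + I(arsenic**2) + dist + I(dist**2)',
                 data=wells, return_type='dataframe')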
Best Answer
The gradient descent code looks fine to me (although you should not fix the number of steps, but rather exit when the gradient is small).
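A minimal sketch of that stopping rule (the function and parameter names here are my own, not the asker's):

import numpy as np

def gradient_descent(grad, theta0, step=0.01, tol=1e-6, max_steps=100000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_steps):           # keep a cap as a safety net
        g = grad(theta)
        if np.linalg.norm(g) < tol:      # exit when the gradient is small
            break
        theta = theta - step * g
    return theta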
The issue is that gradient descent converges to a local minimum, which is not necessarily the global minimum. The key points are:
(i) the polynomial fit converges VERY slowly to the sine function. For example, the 7th-order approximation does not converge outside the first period (see http://www.wolframalpha.com/input/?i=Taylor+sin+x), and the gradient descent method will not be able to find a much better solution (it would require a higher-order polynomial);
(ii) the coefficients of the Taylor series centered at 0.5 have very different magnitudes. You should find theta = [-197, +2766, -15877, +47085, -75701, +62590, -20863, +0]. Presumably, the same applies to the coefficients at the global minimum of the gradient descent method*. Starting from a guess of all ones is a very poor initial condition; you are too far away from the actual solution.
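To see those magnitudes yourself, a plain least-squares fit via np.polyfit gives the global minimizer directly (here I assume the target is sine over one period on [0, 1]; the asker's exact setup may differ):

import numpy as np

x = np.linspace(0.0, 1.0, 200)
coeffs = np.polyfit(x, np.sin(2 * np.pi * x), deg=7)  # 7th-order least-squares fit
print(coeffs)  # note the very different orders of magnitude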
Hope this helps.
*NB: the Taylor series coefficients are NOT the same coefficients that the gradient descent method should find. A Taylor series best approximates the function close to the expansion point, with the error increasing away from it; the least-squares fit that underlies gradient descent tries to minimize the overall error, so there is no "favorite point" unless you use weighted sums.