The Kaplan-Meier estimator estimates a survival function (or, equivalently, a cumulative incidence function) for a homogeneous population in the absence of competing events. For the distribution to be homogeneous, all the regression coefficients in the Cox model would have to be zero in the population. It is therefore very uncommon for Kaplan-Meier estimates to be the focus. Instead, you can get survival curve estimates in the Cox model context. Some software packages offer several choices of survival estimator to use with the Cox model. One of them is the Kalbfleisch-Prentice estimator, which is exactly Kaplan-Meier if all the regression coefficients are estimated to be exactly zero.
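To make the product-limit idea concrete, here is a minimal sketch of the Kaplan-Meier estimator in plain Python. The function name `kaplan_meier` and the toy data are made up for illustration; real analyses would normally use an established survival package rather than this hand-rolled version.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time.

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if the subject was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # every subject leaves the risk set at their own time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the conditional survival at t.
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        # Deaths and censorings at t both drop out of the risk set afterwards.
        at_risk -= leaving[t]
    return curve

# Hypothetical toy data: times in days, 1 = death, 0 = censored.
toy_times = [5, 8, 8, 12, 15, 20]
toy_events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(toy_times, toy_events)
```

Note that there are no covariates anywhere in this calculation, which is the sense in which the estimate describes a single homogeneous group.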
When obtaining survival estimates from a Cox model fit, you have to specify the values of all the covariates. You can vary one covariate at a time, holding the others fixed, to see its effect. In R this is extremely easy to do.
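As a rough illustration of what "specifying the covariates" means, the sketch below uses the standard Cox relationship S(t | x) = S0(t)^exp(x·β). The baseline survival values, the coefficients, and the covariate names (`male`, `age_decades`) are all hypothetical numbers chosen for the example, not output from any fitted model.

```python
import math

def cox_survival(baseline_surv, coefs, covariates):
    """Survival curve for one covariate pattern under a Cox model.

    baseline_surv : [(t, S0(t))], the baseline survival (all covariates zero)
    coefs         : {name: beta}, regression coefficients
    covariates    : {name: value}, must supply a value for every coefficient
    """
    linear_predictor = sum(coefs[name] * covariates[name] for name in coefs)
    relative_risk = math.exp(linear_predictor)
    return [(t, s0 ** relative_risk) for t, s0 in baseline_surv]

# Hypothetical baseline curve and coefficients (assumed, not estimated here).
baseline = [(100, 0.95), (500, 0.80), (1000, 0.60)]
beta = {"male": -0.5, "age_decades": 0.3}

# Vary one covariate (sex) while holding the other fixed.
female_curve = cox_survival(baseline, beta, {"male": 0, "age_decades": 0})
male_curve = cox_survival(baseline, beta, {"male": 1, "age_decades": 0})
```

With a negative coefficient for `male`, the male curve lies above the female curve at every time point, and with all coefficients equal to zero every covariate pattern would return the baseline curve unchanged.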
This is a very good example of non-proportional hazards OR the effect of 'depletion' in survival analysis. I will try to explain.
First, take a good look at your Kaplan-Meier (KM) curve: in the first part (until around 3000 days), the proportion of males still alive in the population at risk at time t is larger than the proportion of females (i.e. the blue line is 'higher' than the red one). This means that male gender is indeed 'protective' against the event (death) studied. Accordingly, the hazard ratio should be between 0 and 1 (and the coefficient should be negative).
However, after day 3000, the red line is higher! This would indeed suggest the opposite. Based on this KM graph alone, this would further suggest a non-proportional hazard. In this case 'non-proportional' means that the effect of your independent variable (gender) is not constant over time. In other words, the hazard ratio is liable to change as time progresses. As explained above, this seems to be the case. The regular proportional hazards Cox model does not accommodate such effects. In fact, one of its main assumptions is that the hazards are proportional! You can model non-proportional hazards as well, but that is beyond the scope of this answer.
There is one additional comment to make: this difference could be due to the true hazards being non-proportional, or to the large variance in the tail estimates of the KM curves. Note that by this point in time the total group of 348 patients will have declined to a very small population still at risk. As you can see, both gender groups have patients experiencing the event and patients being censored (the vertical lines). As the population at risk declines, the survival estimates become less certain. If you had plotted 95% confidence intervals around the KM lines, you would see the width of the confidence intervals increasing. This is important for the estimation of hazards as well. Put simply, because the population at risk and the number of events in the final period of your study are low, this period contributes less to the estimates in your initial Cox model.
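The widening of the intervals can be seen with Greenwood's variance formula, Var[S(t)] ≈ S(t)² · Σ dᵢ / (nᵢ(nᵢ − dᵢ)). The sketch below is a plain-Python illustration on made-up data (ten hypothetical subjects, not the 348 patients discussed here), using simple symmetric intervals; the function name `km_with_greenwood` is an assumption for the example.

```python
import math
from collections import Counter

def km_with_greenwood(times, events, z=1.96):
    """Return [(t, n_at_risk, S(t), ci_half_width)] at each event time."""
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)
    at_risk = len(times)
    surv, var_sum = 1.0, 0.0
    rows = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            if d == at_risk:
                # S(t) drops to 0 and Greenwood's term is undefined; stop here.
                break
            surv *= 1.0 - d / at_risk
            var_sum += d / (at_risk * (at_risk - d))  # Greenwood increment
            std_err = surv * math.sqrt(var_sum)
            rows.append((t, at_risk, surv, z * std_err))
        at_risk -= leaving[t]
    return rows

# Hypothetical data: alternating deaths (1) and censorings (0).
toy_times = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
toy_events = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
rows = km_with_greenwood(toy_times, toy_events)
for t, n, s, half_width in rows:
    print(f"t={t:>2}  at risk={n:>2}  S={s:.3f}  +/- {half_width:.3f}")
```

In this toy example the interval half-width grows at every event time while the number at risk shrinks, which is the same pattern that makes the tail of a KM curve hard to interpret.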
Finally, this would explain why the estimated hazard ratio (assumed constant over time) is more in line with the first part of your KM curve than with the final endpoint.
EDIT: see @Scrotchi's spot-on comment on the original question. As stated, the effect of low numbers in the final period of the study is that the estimates of the hazards at those points in time are uncertain. Consequently, you are also less certain whether the apparent violation of the proportional hazards assumption is merely due to chance. As @Scrotchi states, the PH assumption may not be that bad.
Best Answer
For the case of neural networks, this is a promising approach: WTTE-RNN - Less hacky churn prediction.
The essence of this method is to use a Recurrent Neural Network to predict parameters of a Weibull distribution at each time-step and optimize the network using a loss function that takes censoring into account.
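To illustrate what a loss function "that takes censoring into account" can look like, here is a minimal sketch of the continuous-time Weibull negative log-likelihood for a single observation. It is not the author's code: the function name and arguments are assumptions, and in the actual method the scale `alpha` and shape `beta` would be network outputs at each time-step rather than fixed numbers.

```python
import math

def weibull_censored_nll(t, observed, alpha, beta):
    """Negative log-likelihood of one time-to-event observation.

    Weibull survival: S(t) = exp(-(t / alpha) ** beta)
    Weibull hazard:   h(t) = (beta / alpha) * (t / alpha) ** (beta - 1)

    observed = 1: event seen at t   -> contributes log h(t) + log S(t)
    observed = 0: censored at t     -> contributes log S(t) only
    """
    log_surv = -((t / alpha) ** beta)
    log_hazard = math.log(beta / alpha) + (beta - 1.0) * math.log(t / alpha)
    return -(observed * log_hazard + log_surv)

# Hypothetical example: same time, once as an observed event, once censored.
nll_event = weibull_censored_nll(t=2.0, observed=1, alpha=2.0, beta=1.0)
nll_censored = weibull_censored_nll(t=2.0, observed=0, alpha=2.0, beta=1.0)
```

A censored observation only tells the model that the event had not happened by time t, so it contributes the survival term alone; averaging this quantity over all observations gives a loss that can be minimized by gradient descent.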
The author has also released his implementation on GitHub.