I would get more accurate risk estimates for my data if I categorized the variables that have skewed distributions and outliers.
That suggestion is not universally true. It may help in some cases and be detrimental in others. Categorizing a predictor means dividing it into groups, either at quartiles or at user-defined thresholds. Categorized predictors have the advantage of fitting more flexible trend shapes to the data. The disadvantage is that they predict the same risk for everyone within a given category and borrow no information across adjacent groups. Categorized predictors have the additional disadvantage of increasing the number of predictors, and hence the risk of overfitting.
Categorizing a risk predictor also introduces sensitivity to the definition of the thresholds. Clinically relevant thresholds can be difficult to prespecify, and thresholds defined by the quartiles of one sample tend not to generalize well to other validation samples.
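To make the point concrete, here is a minimal R sketch (simulated data, so all names and numbers are illustrative) of quartile categorization; note that the fitted risk is constant within each quartile:

```r
set.seed(1)
x <- rlnorm(500)                        # a skewed predictor
y <- rbinom(500, 1, plogis(-2 + 0.5 * log(x)))

# Cut at the sample quartiles; include.lowest keeps the minimum in bin 1
x_cat <- cut(x, breaks = quantile(x, probs = 0:4 / 4), include.lowest = TRUE)
fit_cat <- glm(y ~ x_cat, family = binomial)

# Exactly four distinct fitted risks: constant within each quartile
tapply(fitted(fit_cat), x_cat, unique)
```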
Biologically, however, we would be concerned that extreme values of predictors signal biologic trends or interactions that are not captured in the risk model. For instance, blood pressure or BMI several standard deviations above the mean is no longer consistent with the additive effect on risk seen in intermediate ranges, but instead reflects exponentially growing risks of diabetes, hypertension, chronic kidney disease, and MI or stroke.
For this reason, we can use rigorous testing or graphical inspection to assess linearity and add supplemental terms when they dramatically improve the model fit. Rather than categorizing predictors, a hybrid alternative is to add terms such as both the linear and log-transformed values of a skewed predictor, or to use piecewise linear, quadratic, or cubic splines, which fit curvilinear trends while still borrowing information across groups and predicting non-constant risk for participants with different values.
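As a concrete illustration, here is a minimal R sketch (again on simulated data, so the variable names and the true log-linear relationship are assumptions of the example) comparing a linear term, a linear-plus-log specification, and a natural cubic spline:

```r
library(splines)  # ships with base R

set.seed(1)
x <- rlnorm(500)  # simulated skewed predictor
y <- rbinom(500, 1, plogis(-2 + 0.5 * log(x)))

fit_lin    <- glm(y ~ x,             family = binomial)  # linear only
fit_hybrid <- glm(y ~ x + log(x),    family = binomial)  # linear + log term
fit_spline <- glm(y ~ ns(x, df = 4), family = binomial)  # natural cubic spline

# If the flexible fits improve dramatically, linearity was a poor assumption
AIC(fit_lin, fit_hybrid, fit_spline)
```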
There are several issues here.
Typically, we want to determine a minimum sample size that achieves a minimally acceptable level of statistical power. The required sample size is a function of several factors, primarily the magnitude of the effect you want to be able to distinguish from 0 (or whatever null value you are using, but 0 is most common) and the minimum probability of detecting that effect that you are willing to accept. From this perspective, sample size is determined by a power analysis.
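If you have no closed-form calculator handy, a simulation-based power analysis is straightforward. Below is a minimal R sketch; the assumed log-odds ratio of 0.5, baseline intercept of -1, and candidate sample sizes are illustrative assumptions, not recommendations:

```r
# Simulated power: how often does a glm detect an assumed log-odds ratio
# of 0.5 at alpha = 0.05, for a given sample size n?
power_sim <- function(n, beta = 0.5, alpha = 0.05, reps = 500) {
  hits <- replicate(reps, {
    x <- rnorm(n)
    y <- rbinom(n, 1, plogis(-1 + beta * x))
    fit <- glm(y ~ x, family = binomial)
    summary(fit)$coefficients["x", "Pr(>|z|)"] < alpha
  })
  mean(hits)  # estimated power; increase reps for a smoother estimate
}

set.seed(1)
sapply(c(50, 100, 200, 400), power_sim)
```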
Another consideration is the stability of your model (as @cbeleites notes). Basically, as the ratio of estimated parameters to the number of data points approaches 1, your model becomes saturated and will necessarily be overfit (unless there is, in fact, no randomness in the system). The 1 to 10 rule of thumb comes from this perspective. Note that having adequate power will generally cover this concern for you, but not vice versa.
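A small simulation makes the saturation point vivid. The outcome below is pure noise, yet the in-sample fit approaches perfection as the parameter count approaches the sample size (simulated data; the particular n and predictor counts are arbitrary):

```r
# Pure noise: the outcome is unrelated to every predictor
set.seed(1)
n <- 30
y <- rbinom(n, 1, 0.5)

for (p in c(2, 10, 25)) {
  X <- matrix(rnorm(n * p), n, p)
  # Near saturation, glm() warns that fitted probabilities of 0 or 1
  # occurred; that warning is itself a symptom of overfitting
  fit <- suppressWarnings(glm(y ~ X, family = binomial))
  cat(p, "predictors: in-sample accuracy =",
      mean((fitted(fit) > 0.5) == y), "\n")
}
```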
The 1 to 10 rule comes from the linear regression world, however, and it's important to recognize that logistic regression has additional complexities. One issue is that logistic regression works best when the percentages of 1's and 0's are approximately 50% / 50% (as @andrea and @psj discuss in the comments above). Another issue to be concerned with is separation. That is, you don't want all of your 1's gathered at one extreme of an independent variable (or some combination of them) and all of the 0's at the other extreme. Although this may seem like a good situation, because it would make perfect prediction easy, it actually makes the parameter estimation process blow up. (@Scortchi has an excellent discussion of how to deal with separation in logistic regression here: How to deal with perfect separation in logistic regression?) With more IV's this becomes more likely, even if the true magnitudes of the effects are held constant, and especially if your responses are unbalanced. Thus, you can easily need more than 10 data points per IV.
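Here is a minimal toy example of perfect separation (the data are made up): every 0 sits below every 1 on x, and glm() duly blows up. Penalized approaches such as Firth's method (implemented, for example, in the logistf package) are one remedy, as discussed in the linked thread:

```r
# All 0's lie below all 1's on x: perfect separation
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)

fit <- glm(y ~ x, family = binomial)
# R warns that the algorithm did not converge and that fitted
# probabilities of 0 or 1 occurred; the slope estimate and its
# standard error are effectively infinite
summary(fit)$coefficients
```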
One last issue with that rule of thumb is that it assumes your IV's are orthogonal. This is reasonable for designed experiments, but with observational studies such as yours, your IV's will almost never be even roughly orthogonal. There are strategies for dealing with this situation (e.g., combining or dropping IV's, conducting a principal components analysis first, etc.), but if it isn't addressed (which is common), you will need more data.
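As a quick sketch of how you might diagnose this (simulated, deliberately correlated IV's; the shared latent variable is an assumption of the example):

```r
set.seed(1)
z  <- rnorm(200)
x1 <- z + rnorm(200, sd = 0.3)   # x1 and x2 share the latent z,
x2 <- z + rnorm(200, sd = 0.3)   # so they are strongly correlated

cor(x1, x2)                      # pairwise correlation, here around 0.9

# PCA on the standardized IV's: if the first component carries almost
# all the variance, the IV's are far from orthogonal
prcomp(cbind(x1, x2), scale. = TRUE)$sdev^2
```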
A reasonable question, then, is what your minimum N should be and/or whether your sample size is sufficient. To address this, I suggest you use the methods @cbeleites discusses; relying on the 1 to 10 rule alone will be insufficient.
Best Answer
You should be wary of deciding whether or not to transform variables on statistical grounds alone. You must also consider interpretation. Is it reasonable that your response is linear in $x$? Or is it more probably linear in $\log(x)$? To discuss that, we need to know your variables... Just as an example: independent of model fit, I wouldn't believe mortality to be a linear function of age!
Since you say you have "large data", you could look into splines to let the data speak about transformations ... for instance, the mgcv package in R. But even using such technology (or other methods to search for transformations automatically), the ultimate test is to ask yourself what makes scientific sense. What do other people in your field do with similar data?
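For example, a minimal mgcv sketch along those lines (simulated skewed data; the true log-linear relationship is an assumption of the example):

```r
library(mgcv)

set.seed(1)
x <- rlnorm(500)                       # a skewed predictor
y <- rbinom(500, 1, plogis(-2 + 0.5 * log(x)))

# A penalized smooth term lets the data suggest the functional form
fit <- gam(y ~ s(x), family = binomial)
summary(fit)  # effective degrees of freedom well above 1 hint at nonlinearity
plot(fit)     # inspect the estimated smooth, then ask if it makes sense
```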