If I have computed correctly, logistic regression asymptotically has the same power as the t-test. To see this, write down its log likelihood and compute the expectation of its Hessian at its global maximum (its negative estimates the variance-covariance matrix of the ML solution). Don't bother with the usual logistic parameterization: it's simpler just to parameterize it with the two probabilities in question. The details will depend on exactly how you test the significance of a logistic regression coefficient (there are several methods).
That these tests have similar powers should not be too surprising, because the chi-square theory for ML estimates is based on a normal approximation to the log likelihood, and the t-test is based on a normal approximation to the distributions of proportions. The crux of the matter is that both methods make the same estimates of the two proportions and both estimates have the same standard errors.
An actual analysis might be more convincing. Let's adopt some general terminology for the values in a given group (A or B):
- $p$ is the probability of a 1.
- $n$ is the size of each set of draws.
- $m$ is the number of sets of draws.
- $N = m n$ is the amount of data.
- $k_{ij}$ (equal to $0$ or $1$) is the value of the $j^\text{th}$ result in the $i^\text{th}$ set of draws.
- $k_i$ is the total number of ones in the $i^\text{th}$ set of draws.
- $k$ is the total number of ones.
Logistic regression is essentially the ML estimator of $p$. The logarithm of the likelihood is given by
$$\log(\mathbb{L}) = k \log(p) + (N-k) \log(1-p).$$
Its derivatives with respect to the parameter $p$ are
$$\frac{\partial \log(\mathbb{L})}{ \partial p} = \frac{k}{p} - \frac{N-k}{1-p} \text{ and}$$
$$-\frac{\partial^2 \log(\mathbb{L})}{\partial p^2} = \frac{k}{p^2} + \frac{N-k}{(1-p)^2}.$$
Setting the first to zero yields the ML estimate ${\hat{p} = k/N}$ and plugging that into the reciprocal of the second expression yields the variance $\hat{p}(1 - \hat{p})/N$, which is the square of the standard error.
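As a numeric sanity check, here is a minimal sketch that maximizes this log likelihood directly and recovers the same estimate and standard error; the counts ($N = 1800$, $k = 1260$, so $\hat{p} = 0.70$) are assumed for illustration.

```python
# A quick numeric check of the closed-form results above; the counts N and k
# are hypothetical (chosen so that k/N = 0.70).
import numpy as np
from scipy.optimize import minimize_scalar

N, k = 1800, 1260

def neg_log_lik(p):
    return -(k * np.log(p) + (N - k) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
p_hat = res.x                          # numerically matches k/N
se = np.sqrt(p_hat * (1 - p_hat) / N)  # SE from the inverse second derivative
print(p_hat, k / N, se)
```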
The t statistic will be obtained from estimators based on the data grouped by sets of draws; namely, as the difference of the means (one from group A and the other from group B) divided by the standard error of that difference, which is obtained from the standard deviations of the means. Let's look at the mean and standard deviation for a given group, then. The mean equals $k/N$, which is identical to the ML estimator $\hat{p}$. The standard deviation in question is the standard deviation of the draw means; that is, it is the standard deviation of the set of $k_i/n$. Here is the crux of the matter, so let's explore some possibilities.
Suppose the data aren't grouped into draws at all: that is, $n = 1$ and $m = N$. The $k_i$ are the draw means. Their sample variance equals $N/(N-1)$ times $\hat{p}(1 - \hat{p})$. From this it follows that the standard error is identical to the ML standard error apart from a factor of $\sqrt{N/(N-1)}$, which is essentially $1$ when $N = 1800$. Therefore, apart from this tiny difference, any test based on logistic regression will be the same as a t-test, and we will achieve essentially the same power.
When the data are grouped, the (true) variance of the $k_i/n$ equals $p(1-p)/n$ because the statistics $k_i$ are sums of $n$ Bernoulli($p$) variables, each with variance $p(1-p)$. Therefore the expected standard error of the mean of $m$ of these values is the square root of $p(1-p)/(nm) = p(1-p)/N$, just as before.
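A quick simulation can check this identity; the particular values of $p$, $m$, $n$, and the seed below are my assumptions for illustration.

```python
# Empirically verify that the SE of the grand mean of the k_i/n equals
# sqrt(p(1-p)/N) regardless of grouping; p, m, n, and the seed are assumed.
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 0.7, 60, 30                              # one hypothetical grouping, N = 1800
means = rng.binomial(n, p, size=(100_000, m)) / n  # 100,000 replicates of the m draw means
print(means.mean(axis=1).std(), np.sqrt(p * (1 - p) / (m * n)))  # both ≈ 0.0108
```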
This calculation indicates the power of the test should not vary appreciably with how the draws are apportioned (that is, with how $m$ and $n$ are varied subject to $mn = N$), apart perhaps from a fairly small effect from the adjustment in the sample variance (unless you were so foolish as to use extremely few sets of draws within each group).
Limited simulations to compare $p = 0.70$ to $p = 0.74$ (with 10,000 iterations apiece) involving $m = 900, n = 1$ (essentially logistic regression); $m = n = 30$; and $m = 2, n = 450$ (maximizing the sample variance adjustment) bear this out: the power (at $\alpha = 0.05$, one-sided) in the first two cases is 0.59 whereas in the third, where the adjustment factor makes a material change (there are now just two degrees of freedom instead of 1798 or 58), it drops to 0.36. Another test comparing $p = 0.50$ to $p = 0.52$ gives powers of 0.22, 0.21, and 0.15, respectively: again, we observe only a slight drop from no grouping into draws (=logistic regression) to grouping into 30 groups and a substantial drop down to just two groups.
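For reference, here is a minimal sketch of this kind of simulation. The seed, the pooled-variance two-sample t-test, and scipy's `alternative="greater"` option are my assumptions, so your power estimates will differ slightly.

```python
# Sketch of the power simulation: two-sample t-tests on draw means,
# one-sided at alpha = 0.05, for three apportionments of N per group.
import numpy as np
from scipy import stats

def power(p_a, p_b, m, n, iters=10_000, alpha=0.05, seed=1):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(iters):
        a = rng.binomial(n, p_a, size=m) / n   # draw means for group A
        b = rng.binomial(n, p_b, size=m) / n   # draw means for group B
        _, pval = stats.ttest_ind(b, a, alternative="greater")  # H1: mean(B) > mean(A)
        hits += pval < alpha
    return hits / iters

for m, n in [(900, 1), (30, 30), (2, 450)]:    # the three apportionments tested
    print(f"m={m:3d}, n={n:3d}: power = {power(0.70, 0.74, m, n):.2f}")
```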
The morals of this analysis are:
- You don't lose much when you partition your $N$ data values into a large number $m$ of relatively small groups of "draws".
- You can lose appreciable power using small numbers of groups (when $m$ is small and $n$, the amount of data per group, is large).
- You're best off not grouping your $N$ data values into "draws" at all. Just analyze them as-is (using any reasonable test, including logistic regression and t-testing).
In general, dealing with missing input values is always problematic. To the best of my knowledge, none of the existing methods can handle them without introducing some bias into the model, so you have to keep this in mind during your research. There are at least a few possible options:
- ignore data with missing values (which I believe you do now); this is the "safest" option, but it can leave insufficient data to train a good model
- fill missing values in with some statistic computed from the data (see the sketch after this list), for example:
  - the mean value of the particular feature/dimension (for real-valued variables)
  - the mode (most frequent) value of the particular feature/dimension (for categorical ones)
- train a separate model to predict each missing value: e.g., imagine data in $X^k$ where each of the dimensions can have missing inputs; then you can create $k$ models $M_i$, each predicting the $i$th dimension from the remaining ones, so $M_i : X^{k-1} \rightarrow X$, and use them to preprocess your data
- use some generative model that can fill in missing values by itself; one possibility is a Restricted Boltzmann Machine
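To make the "fill with a statistic" option concrete, here is a minimal sketch assuming pandas and a hypothetical two-column dataset; scikit-learn's `SimpleImputer` offers the same strategies.

```python
# A minimal imputation sketch: mean fill for real-valued features, mode fill
# for categorical ones. The column names and data are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "age":   [23.0, None, 41.0, 35.0],      # real-valued feature with a missing entry
    "color": ["red", "blue", None, "red"],  # categorical feature with a missing entry
})
df["age"] = df["age"].fillna(df["age"].mean())           # mean imputation
df["color"] = df["color"].fillna(df["color"].mode()[0])  # mode imputation
print(df)
```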
As was previously stated, each method introduces some bias into the analysis (as has been shown in many papers, for many models), but it can also help you build a better model: everything depends on your data.
EDIT (after clarification)
A missing value of some $i$th feature/dimension $f_i \in X$ is a lack of observation/knowledge about which particular value $x \in X$ it takes. Imagine asking people to fill out a multi-page survey and, after collecting all the data, discovering that one person's page is missing. We do not know what his or her responses were, but we are quite sure there were some. On the other hand, a person could leave a question blank or write something like "I will not answer this question"; that is not missing information, and in fact it is as informative as selecting one of the predefined boxes. In such a scenario we simply have a categorical feature $f'_i \in X \cup \{ \emptyset \}$. We can either express it as a multi-valued feature, or encode it in unary form by replacing $f'_i$ with $|X|+1$ new binary features $f''_{ij}$, one for each $j \in X \cup \{ \emptyset \}$, such that $f''_{ij} = 1 \iff f'_i = j$. The choice between these methods is model- and data-dependent.
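As an illustration of the unary encoding, here is a minimal sketch using pandas, where `None` stands in for the explicit "no answer" category $\emptyset$; the feature values are hypothetical.

```python
# Unary (one-hot) encoding of a categorical feature whose values include an
# explicit "no answer" category, encoded here as None; dummy_na adds its column.
import pandas as pd

answers = pd.Series(["yes", "no", None, "yes"])               # f'_i over X ∪ {∅}
encoded = pd.get_dummies(answers, prefix="f", dummy_na=True)  # |X|+1 binary features
print(encoded)
```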
So far, my findings point towards hierarchical modelling and mixture models. A statistician from my department also confirmed this.