Perhaps a simpler case will make things clearer. Let's say we choose a 1x2 sample of pixels instead of 100x100.
Sample Pixels From the Image
+----+----+
| x1 | x2 |
+----+----+
Imagine that when plotting our training set, we notice it can't be separated easily by a linear model, so we choose to add polynomial terms to fit the data better.
Let's say we decide to construct our polynomial features from all of the pixel intensities, plus all possible products that can be formed from them.
Since our matrix is small, let's enumerate them:
$$x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 \times x_2,\ x_2 \times x_1 $$
Interpreting the above sequence of features, we can see a pattern. The first two terms, group 1, are the raw pixel intensities. The next two terms, group 2, are the squares of those intensities. The last two terms, group 3, are the products of all pairwise combinations of pixel intensities.
group 1: $x_1,\ x_2$
group 2: $x_1^2,\ x_2^2$
group 3: $x_1 \times x_2,\ x_2 \times x_1$
But wait, there is a problem. If you look at the group 3 terms ($x_1 \times x_2$ and $x_2 \times x_1$), you'll notice that they are equal. Remember our housing example: imagine having two features, x1 = square footage and x2 = square footage, for the same house... That doesn't make any sense! So we need to get rid of the duplicate feature; let's say, arbitrarily, $x_2 \times x_1$. Now we can rewrite the list of group 3 features as:
group 3: $x_1 \times x_2$
We count the features in all three groups and get 5.
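As a quick sanity check (a sketch assuming scikit-learn is available; the pixel values are made up for illustration), `PolynomialFeatures` with degree 2 enumerates exactly these five terms:

```python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# One training sample with two pixel intensities x1 and x2 (hypothetical values).
X = np.array([[0.5, 0.8]])

# degree=2, include_bias=False yields: x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())  # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(X_poly.shape[1])               # 5 features, matching our count
```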
But this is a toy example. Let's derive a generic formula for calculating the number of features, using our original groups of features as a starting point.
$$\text{size of group 1} + \text{size of group 2} + \text{size of group 3} = m \times n + m \times n + m \times n = 3 \times m \times n$$
Ah! But we had to get rid of the duplicate product in group 3.
So to properly count the features in group 3, we need a way to count all unique pairwise products in the matrix. This is exactly what the binomial coefficient gives us: it counts all possible unique subgroups of size $k$ drawn from a larger group of size $n$. So the correct count for group 3 is $C(m \times n, 2)$.
So our generic formula would be:
$$m \times n + m \times n + C(m \times n, 2) = 2 \times m \times n + C(m \times n, 2)$$
Let's use it to calculate the number of features in our toy example:
$$2 \times 1 \times 2 + C(1 \times 2, 2) = 4 + 1 = 5$$
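Here is a minimal sketch of the formula in Python (the helper name `num_poly_features` is just illustrative), using `math.comb` for the binomial coefficient; plugging in the original 100x100 sample shows why raw degree-2 features blow up:

```python
from math import comb

def num_poly_features(m, n):
    """Count m*n intensities + m*n squares + C(m*n, 2) unique pairwise products."""
    p = m * n  # total number of pixels in the sample
    return 2 * p + comb(p, 2)

print(num_poly_features(1, 2))      # 5 -- our toy example
print(num_poly_features(100, 100))  # 50015000 -- infeasibly many features
```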
That's it!
1) The input should be $(1, 256, 256)$. You should read about convolutional neural networks to understand better how images are processed. Your initial convolutional layer's filters will have dimensions $(1, H, W)$; there is no need to consider the color depth of the image, since you have only one channel. See the sketch below.
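For concreteness, here is a minimal sketch assuming PyTorch (the framework choice and layer sizes are mine, not prescribed):

```python
import torch
import torch.nn as nn

# A batch of one grayscale image: (batch, channels, height, width) = (1, 1, 256, 256)
x = torch.randn(1, 1, 256, 256)

# in_channels=1 because there is a single channel; each filter then has shape (1, H, W)
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1)

print(conv(x).shape)  # torch.Size([1, 16, 256, 256])
```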
2) There is no single right answer for normalization; it depends on context. Images can be normalized per image, per pixel, or not at all (see the sketch below). For example, if you're dealing with medical CT scans, the images are already standardized to Hounsfield units, so normalizing them would make little sense: each pixel is already the ground truth of your scan, and there is no external contrast to correct for (unlike lighting conditions in photos). Sometimes whitening is also applied to further normalize images.
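A rough sketch of per-image versus per-pixel normalization with NumPy (the arrays are random placeholders standing in for your data):

```python
import numpy as np

img = np.random.rand(256, 256).astype(np.float32)           # one image
dataset = np.random.rand(100, 256, 256).astype(np.float32)  # hypothetical training set

# Per-image: zero mean, unit variance computed within each image.
per_image = (img - img.mean()) / (img.std() + 1e-8)

# Per-pixel: statistics for each pixel position across the whole training set.
pixel_mean = dataset.mean(axis=0)
pixel_std = dataset.std(axis=0)
per_pixel = (img - pixel_mean) / (pixel_std + 1e-8)

# For CT scans in Hounsfield units you might skip normalization entirely,
# or just clip to a physically meaningful window.
```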
3) Of course image size makes a difference. You need some minimal amount of resolution to identify the features you are interested in. For example, the images you posted possess a complex structure of white and dark areas, along with detailed boundaries. If the details of those features are important, then you need to maintain enough resolution to detect them. In the medical example, a benign vs. malignant cancerous growth can sometimes be distinguished by the regularity (waviness) of its boundary.
However, image size drastically increases processing time. You'll only find out what a good resolution is by gradually scaling your images up until you achieve the accuracy you want (see the sketch below). It also helps to make educated guesses about the minimal resolution: for example, if the features you are interested in are on the order of 1/100 the scale of the image, then you need a width and height of at least 100 pixels.
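One way to run that scaling experiment, sketched with Pillow (the path and sizes are illustrative):

```python
from PIL import Image

def make_resolution_variants(path, sizes=(32, 64, 128, 256)):
    """Downscale an image to several resolutions to test accuracy vs. size."""
    img = Image.open(path).convert("L")  # grayscale, matching the single channel
    return {s: img.resize((s, s), Image.BILINEAR) for s in sizes}

# variants = make_resolution_variants("scan.png")
# Train and evaluate at each size; keep the smallest that reaches your target accuracy.
```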
Best Answer
I would try LBP (local binary patterns), or any other descriptor suited to what your images and task (i.e. classes) actually are. To deal with different image sizes, you can use a Bag of Words encoding; a sketch follows.
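As a rough sketch with scikit-image (the parameters `P=8, R=1` are common choices, not prescriptions), a uniform LBP histogram already gives a fixed-length descriptor regardless of image size:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(img, P=8, R=1):
    """Fixed-length LBP histogram, comparable across images of any size."""
    lbp = local_binary_pattern(img, P, R, method="uniform")
    n_bins = P + 2  # 'uniform' LBP produces exactly P + 2 distinct codes
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
    return hist

# img = ...  # any H x W grayscale array; the output length stays P + 2
```

For a full Bag of Words encoding you would instead compute descriptors over local patches, cluster them (e.g. with k-means), and represent each image as a histogram of cluster assignments.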