Causal inference is focused on knowing what happens to $Y$ when you change $X$. Prediction is focused on knowing the next $Y$ given $X$ (and whatever else you've got).
Usually, in causal inference, you want an unbiased estimate of the effect of $X$ on $Y$. In prediction, you're often more willing to accept a bit of bias if you can reduce the variance of your prediction.
In theory, yes. In practice, things are more subtle.
First of all, let's clear up a doubt raised in the comments: neural networks can handle multiple outputs seamlessly, so it doesn't really matter whether we consider multiple regression or not (see The Elements of Statistical Learning, Section 11.4).
Having said that, a neural network of fixed architecture and loss function would indeed just be a parametric nonlinear regression model. It would therefore be even less flexible than nonparametric models such as Gaussian Processes. To be precise, a single hidden layer neural network with a sigmoid or tanh activation function is less flexible than a Gaussian Process: see http://mlss.tuebingen.mpg.de/2015/slides/ghahramani/gp-neural-nets15.pdf. For deep networks this is not true, but it becomes true again when you consider Deep Gaussian Processes.
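To make the "parametric nonlinear regression model" point concrete, here is a minimal toy sketch (my own illustration, not taken from the slides above): a one-hidden-layer tanh network written explicitly as $y=f(\mathbf{x}\vert\boldsymbol{\theta})$, small enough to fit with classic Levenberg-Marquardt nonlinear least squares.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy 1-D data: y = f(x) + noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

H = 5  # number of hidden units: the architecture is fixed in advance

def f(theta, x):
    """One-hidden-layer tanh network: just a parametric nonlinear function of x."""
    W1 = theta[:H]          # input-to-hidden weights
    b1 = theta[H:2 * H]     # hidden biases
    w2 = theta[2 * H:3 * H] # hidden-to-output weights
    b2 = theta[3 * H]       # output bias
    return np.tanh(np.outer(x, W1) + b1) @ w2 + b2

def residuals(theta):
    return f(theta, x) - y

theta0 = rng.standard_normal(3 * H + 1)
fit = least_squares(residuals, theta0, method="lm")  # Levenberg-Marquardt
print(fit.x.shape)  # (16,): a small, fixed parameter vector, like any nonlinear regression
```

With the architecture and loss fixed like this, the network really is just another nonlinear regression model; the interesting part is what happens when you scale the same idea up, which brings us to the next point.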
So, why are Deep Neural Networks such a big deal? For very good reasons:
They allow fitting models of a complexity you wouldn't even begin to dream of when fitting nonlinear least squares models with the Levenberg-Marquardt algorithm. See for example https://arxiv.org/pdf/1611.05431.pdf, https://arxiv.org/pdf/1706.02677.pdf and https://arxiv.org/pdf/1805.00932.pdf, where the number of parameters $p$ goes from 25 to 829 million. Of course DNNs are overparametrized, non-identifiable, etc., so the number of parameters is very different from the "degrees of freedom" of the model (see https://arxiv.org/abs/1804.08838 for some intuition). Still, it's undeniably amazing that models with $N \ll p$ ($N=$ sample size) are able to generalize so well.
They scale to huge data sets. A vanilla Gaussian Process is a very flexible model, but exact inference has an $O(N^3)$ cost, which is completely unacceptable for data sets as big as ImageNet or bigger ones such as Open Images V4 (a bare-bones sketch of where that cubic cost comes from follows this list). There are approximations to GP inference which scale as well as NNs, but I don't know why they don't enjoy the same fame (well, I have my ideas about that, but let's not digress).
For some tasks, they're impressively accurate, much better than many other statistical learning models. Try matching ResNeXt's accuracy on ImageNet with a kernel SVM on 65536 raw inputs, or with a random forest classifier. Good luck with that.
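To see where the $O(N^3)$ cost mentioned above comes from, here is a bare-bones exact GP regression sketch (illustrative only; an RBF kernel with fixed hyperparameters): the Cholesky factorization of the $N \times N$ kernel matrix is the bottleneck.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix between two sets of points
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, X_star, noise=1e-2):
    K = rbf_kernel(X, X) + noise * np.eye(len(X))  # N x N kernel matrix
    L = cho_factor(K)                              # O(N^3): this is the bottleneck
    alpha = cho_solve(L, y)                        # O(N^2)
    return rbf_kernel(X_star, X) @ alpha           # predictive mean

# Fine for N in the thousands; hopeless for N ~ 10^6 images without approximations.
X, y = np.random.randn(2000, 5), np.random.randn(2000)
mu = gp_predict(X, y, np.random.randn(10, 5))
```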
However, the real difference between theory:
all neural networks are parametric nonlinear regression or classification models
and practice, in my opinion, is that in practice nothing about a deep neural network is really fixed in advance, so you end up fitting a model from a much bigger class than you would expect. In real-world applications, none of these aspects are really fixed:
- architecture (suppose I'm doing sequence modeling: should I use an RNN? A dilated CNN? An attention-based model?)
- details of the architecture (how many layers? how many units in layer 1, how many in layer 2, which activation function(s), etc.)
- how do I preprocess the data? Standardization? Minmax normalization? RobustScaler?
- kind of regularization ($l_1$? $l_2$? batch-norm? Before or after ReLU? Dropout? Between which layers?)
- optimizer (SGD? Path-SGD? Entropy-SGD? Adam? etc.)
- other hyperparameters such as the learning rate, early stopping, etc. etc.
- even the loss function is often not fixed in advance! We mostly use NNs for two tasks (regression and classification), but people use a wide swath of different loss functions.
Look at how many choices have to be made even in a relatively simple case, where there is a strong seasonal signal and the number of features is small as far as DNNs go:
https://stackoverflow.com/questions/48929272/non-linear-multivariate-time-series-response-prediction-using-rnn
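As a toy illustration (not taken from that question), here are two equally defensible specifications of "the same" sequence regression model: architecture, optimizer and loss all change, and nothing in the problem statement forces either choice.

```python
import torch
import torch.nn as nn

# Option A: recurrent architecture, Adam optimizer, squared-error loss
class LSTMRegressor(nn.Module):
    def __init__(self, n_features=8, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):               # x: (batch, time, features)
        out, _ = self.rnn(x)
        return self.head(out[:, -1])    # predict from the last time step

model_a = LSTMRegressor()
opt_a, loss_a = torch.optim.Adam(model_a.parameters(), lr=1e-3), nn.MSELoss()

# Option B: dilated 1-D convolutional architecture, plain SGD, absolute-error loss
# (input here is (batch, features, time))
model_b = nn.Sequential(
    nn.Conv1d(8, 16, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1),
)
opt_b, loss_b = torch.optim.SGD(model_b.parameters(), lr=1e-2), nn.L1Loss()
```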
Thus in practice, even though ideally fitting a DNN would just mean fitting a model of the form
$$y=f(\mathbf{x}\vert\boldsymbol{\theta})+\epsilon$$
where $f$ has a certain hierarchical structure, in practice very little (if anything at all) about the function and the fitting method is defined in advance, and thus the model is much more flexible than a "classic" parametric nonlinear model.
Best Answer
I don't think that anything about a neural network is easier to understand by using the "independent variable" and "dependent variable" terminology. Recurrent networks produce outputs for time $t$ which are then taken as inputs for time $t+1$. Auto-encoder neural networks take an input, and produce (1) an abstraction of that input and (2) a reconstruction of the input based on the abstraction. In either case, I don't think "independent" and "dependent" variables are really descriptive.
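As a minimal illustration (toy code, hypothetical sizes), in an autoencoder the "independent" and the "dependent" variable are literally the same tensor:

```python
import torch
import torch.nn as nn

# The input x is also the reconstruction target, so the usual terminology breaks down.
autoencoder = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(),  # encoder: an abstraction of the input
    nn.Linear(32, 784),             # decoder: a reconstruction of the input
)
x = torch.randn(16, 784)                  # a batch of flattened images
loss = nn.MSELoss()(autoencoder(x), x)    # the target is x itself
loss.backward()
```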
It's easier to think about neural networks in terms of inputs and outputs. A neural network takes something (a sequence of integer indices, a vector, an image, several vectors concatenated into a matrix, a graph) and returns something else (a probability vector, an arbitrary vector, another image).
And of course there are some neural networks that can take multiple, heterogeneous inputs and return one or more outputs (which may, likewise, be heterogeneous).
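For instance, a sketch of such a network (all shapes are hypothetical):

```python
import torch
import torch.nn as nn

class TwoInTwoOut(nn.Module):
    """Takes an image and a numeric vector; returns class logits and a scalar."""
    def __init__(self):
        super().__init__()
        self.img_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())  # image -> 8-dim code
        self.vec_net = nn.Sequential(nn.Linear(5, 8), nn.ReLU())             # vector -> 8-dim code
        self.cls_head = nn.Linear(16, 10)  # output 1: logits over 10 classes
        self.reg_head = nn.Linear(16, 1)   # output 2: a scalar
    def forward(self, image, vector):
        h = torch.cat([self.img_net(image), self.vec_net(vector)], dim=1)
        return self.cls_head(h), self.reg_head(h)

class_logits, scalar = TwoInTwoOut()(torch.randn(4, 3, 32, 32), torch.randn(4, 5))
```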
This description is incredibly abstract and general. That's kind of the point: neural network researchers have generalized beyond what is possible with linear regression.
What a "feature" is depends on context. Some recent successes in neural networks are entirely featureless, in the sense that they take some raw input, such as an entire image, as the input. This is in contrast to other computer vision or imaging tasks which take an image, extract features, and then pass the features to some downstream task. Indeed, until CNNs started putting up major successes, the image to feature extraction to conventional machine learning (SVM, RF, etc.). pipeline was the standard practice, and much attention was devoted to developing better feature extraction methods.
Other applications of neural networks are exactly like linear regression with extra bits added on: matrix input, scalar output. The only subtlety is the hidden layer nonlinearity.
Hyperparameters are not independent variables. Let's return to the regression context. One hyperparameter of a ridge regression is the penalty on the $L^2$ norm of the coefficients. This is not an independent variable because it is not an attribute of one of the samples in your data collection; instead, it's a researcher-chosen value which controls the length of the norm of the coefficients.
Likewise, neural network hyperparameters are not independent variables. Hyperparameters, including the $L^2$ penalty and learning rate, don't describe anything about your data set. The learning rate is a direct consequence of using an iterative optimization procedure. The magnitude of the $L^2$ penalty reflects a particular choice about how to constrain the model.
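A two-line illustration with ridge regression: the columns of `X` are independent variables (attributes of each sample), while `alpha` is a knob the researcher turns and never appears in the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.random.randn(100, 3)  # independent variables: attributes of each sample
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)

# alpha is a hyperparameter chosen by the researcher: it constrains the
# coefficient norm, it is not a column of X
model = Ridge(alpha=10.0).fit(X, y)
```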
This thread may also be useful: What *is* an Artificial Neural Network?