Which modifiable components of a learning system are responsible for its success or failure? What changes to them improve performance? This has been called the fundamental credit assignment problem (Minsky, 1963). There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses (Sec. 6.8). The present survey, however, will focus on the narrower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks (NNs).
A standard neural network (NN) consists of many simple, connected processors called neurons, each producing a sequence of real-valued activations. Input neurons get activated through sensors perceiving the environment, other neurons get activated through weighted connections from previously active neurons (details in Sec. 2). Some neurons may influence the environment by triggering actions. Learning or credit assignment is about finding weights that make the NN exhibit desired behavior, such as driving a car. Depending on the problem and how the neurons are connected, such behavior may require long causal chains of computational stages (Sec. 3), where each stage transforms (often in a non-linear way) the aggregate activation of the network. Deep Learning is about accurately assigning credit across many such stages.
Shallow NN-like models with few such stages have been around for many decades if not centuries (Sec. 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s (Sec. 5.3) and 1970s (Sec. 5.5). An efficient gradient descent method for teacher-based Supervised Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP) was developed in the 1960s and 1970s, and applied to NNs in 1981 (Sec. 5.5). BP-based training of deep NNs with many layers, however, had been found to be difficult in practice by the late 1980s (Sec. 5.6), and had become an explicit research subject by the early 1990s (Sec. 5.9). DL became practically feasible to some extent through the help of Unsupervised Learning (UL), e.g., Sec. 5.10 (1991), Sec. 5.15 (2006). The 1990s and 2000s also saw many improvements of purely supervised DL (Sec. 5). In the new millennium, deep NNs have finally attracted wide-spread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Schölkopf et al., 1998) in numerous important applications. In fact, since 2009, supervised deep NNs have won many official international pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first superhuman visual pattern recognition results in limited domains (Sec. 5.19, 2011). Deep NNs also have become relevant for the more general field of Reinforcement Learning (RL) where there is no supervising teacher (Sec. 6).
On the other hand, I'm not sure that it's necessarily profitable to try to construct a taxonomy of mutually exclusive buckets for machine learning strategies. I think we can say that there are perspectives from which models can be viewed as neural networks. I don't think that perspective is necessarily the best or most useful one in all contexts. For example, I'm still planning to refer to random forests and gradient boosted trees as "tree ensembles" instead of abstracting away their distinctions and calling them "neural network trees". Moreover, Schmidhuber distinguishes NNs from kernel machines -- even though kernel machines have some connections to NNs -- when he writes "In the new millennium, deep NNs have finally attracted wide-spread attention, mainly by outperforming alternative machine learning methods such as kernel machines ... in numerous important applications."
Best Answer
Imagine running linear regression on a target that you expect to always be non-negative. If a prediction comes out negative, you set it to 0 to get a valid output, so effectively, $y = \text{relu}(w^Tx)$. Now if you simply "stack" these clipped linear regression units in the same way you "stack" logistic regression units to get a neural network, you end up with a neural network using relu units.
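To make the "stacking" picture concrete, here is a minimal sketch in plain Python. The function names (`clipped_linear_unit`, `layer`, `two_layer_network`) and the weights are made up for illustration, and bias terms are left out to match $y = \text{relu}(w^Tx)$; this is not how any particular library implements it.

```python
def relu(z):
    """Return z if it is positive, otherwise 0."""
    return max(0.0, z)


def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def clipped_linear_unit(w, x):
    """Linear regression prediction w^T x, with negative predictions set to 0."""
    return relu(dot(w, x))


def layer(W, x):
    """One clipped linear unit per row of W, all reading the same input x."""
    return [clipped_linear_unit(w, x) for w in W]


def two_layer_network(W1, w2, x):
    """Stack the units: the hidden layer's outputs feed one more clipped unit."""
    hidden = layer(W1, x)
    return clipped_linear_unit(w2, hidden)


# Illustrative (hypothetical) weights and input.
x = [1.0, 2.0]
W1 = [[1.0, -1.0],   # w^T x = -1.0, clipped to 0.0
      [0.5, 0.5]]    # w^T x =  1.5, kept as is
w2 = [2.0, 1.0]

print(layer(W1, x))                   # [0.0, 1.5]
print(two_layer_network(W1, w2, x))   # 1.5
```

The first hidden unit would have predicted a negative value, so it outputs 0 and contributes nothing downstream; that clipping is the only thing separating this from a plain composition of linear regressions.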
Another way to see why relu works is to drop the idea of sigmoid units doing logistic regression -- because they're not really doing logistic regression in any traditional sense. Instead, the neural network as a whole is acting as a powerful function approximator. It has been shown that a neural network of sufficient size can approximate any continuous function on a bounded domain arbitrarily well. We want to train our neural network to approximate the function which maps the inputs to the correct outputs.
When you think about a neural network as a function approximator, it makes sense that relu works just as well as sigmoid -- they both play the role of introducing non-linearities into the network (which is required for the universal approximation theorem to hold).
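A tiny worked example of why the non-linearity matters, again as a sketch with hand-picked (hypothetical) weights rather than anything trained: two hidden units with weights +1 and -1, summed at the output. Without an activation the stack collapses to a linear map (here the zero function); with relu the same weights represent $|x| = \text{relu}(x) + \text{relu}(-x)$ exactly, which no single linear unit can do.

```python
def relu(z):
    """Return z if it is positive, otherwise 0."""
    return max(0.0, z)


def linear_stack(x):
    """Two hidden units with no activation, summed at the output.

    A composition of linear maps is still linear; with these weights
    the two hidden units cancel and the output is always 0.
    """
    hidden = [1.0 * x, -1.0 * x]
    return 1.0 * hidden[0] + 1.0 * hidden[1]


def relu_stack(x):
    """Same weights, but each hidden unit is passed through relu.

    Exactly one of relu(x), relu(-x) is non-zero (both are 0 at x = 0),
    so the sum equals the absolute value of x.
    """
    hidden = [relu(1.0 * x), relu(-1.0 * x)]
    return 1.0 * hidden[0] + 1.0 * hidden[1]


for x in [-2.0, -0.5, 0.0, 3.0]:
    print(x, linear_stack(x), relu_stack(x))
```

Swapping relu for a sigmoid would not reproduce $|x|$ exactly with two units, but the role is the same: the activation is what lets stacked units build something a single linear map cannot.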
To sum it up, you can replace logistic regression with a modified form of linear regression to satisfy your intuition. However, viewing the network as a function approximator may be a better way to see how neural networks work.