Which modifiable components of a learning system are responsible for its success or failure? What changes to them improve performance? This has been called the fundamental credit assignment problem (Minsky, 1963). There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses (Sec. 6.8). The present survey, however, will focus on the narrower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks (NNs).
A standard neural network (NN) consists of many simple, connected processors called neurons, each producing a sequence of real-valued activations. Input neurons get activated through sensors perceiving the environment, other neurons get activated through weighted connections from previously active neurons (details in Sec. 2). Some neurons may influence the environment by triggering actions. Learning or credit assignment is about finding weights that make the NN exhibit desired behavior, such as driving a car. Depending on the problem and how the neurons are connected, such behavior may require long causal chains of computational stages (Sec. 3), where each stage transforms (often in a non-linear way) the aggregate activation of the network. Deep Learning is about accurately assigning credit across many such stages.
Shallow NN-like models with few such stages have been around for many decades if not centuries (Sec. 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s (Sec. 5.3) and 1970s (Sec. 5.5). An efficient gradient descent method for teacher-based Supervised Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP) was developed in the 1960s and 1970s, and applied to NNs in 1981 (Sec. 5.5). BP-based training of deep NNs with many layers, however, had been found to be difficult in practice by the late 1980s (Sec. 5.6), and had become an explicit research subject by the early 1990s (Sec. 5.9). DL became practically feasible to some extent through the help of Unsupervised Learning (UL), e.g., Sec. 5.10 (1991), Sec. 5.15 (2006). The 1990s and 2000s also saw many improvements of purely supervised DL (Sec. 5). In the new millennium, deep NNs have finally attracted wide-spread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Scholkopf et al., 1998) in numerous important applications. In fact, since 2009, supervised deep NNs have won many official international pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first superhuman visual pattern recognition results in limited domains (Sec. 5.19, 2011). Deep NNs also have become relevant for the more general field of Reinforcement Learning (RL) where there is no supervising teacher (Sec. 6).
On the other hand, I'm not sure that it's necessarily profitable to try and construct a taxonomy of mutually-exclusive buckets for machine learning strategies. I think we can say that there are perspectives from which models can be viewed as neural networks. I don't think that perspective is necessarily the best or useful in all contexts. For example, I'm still planning to refer to random forests and gradient boosted trees as "tree ensembles" instead of abstracting away their distinctions and calling them "neural network trees". Moreover, Schmidhuber distinguishes NNs from kernel machines -- even though kernel machines have some connections to NNs -- when he writes "In the new millennium, deep NNs have finally attracted wide-spread attention, mainly by outperforming alternative machine learning methods such as kernel machines ... in numerous important applications. "
Best Answer
Let $\theta$ denote a vector of network parameters to learn (weights/biases) and $g$ denote the function the network implements, so $y_i = g(x_i)$ (implicitly depends on $\theta$). It looks like the gradient of your loss function w.r.t. $\theta$ is:
$$-\sum_{i=1}^n [ w_i \nabla_\theta g(x_i) + \text{sgn}(g(x_i) - g(x_{i-1})) [\nabla_\theta g(x_i) - \nabla_\theta g(x_{i-1})] ]$$
On one hand, this seems pretty clean, because you can easily find the gradient of $g$, evaluate it for the different inputs, and plug in those values. There are 2 tricky parts:
Because of the absolute value in the loss function, its gradient is undefined when two subsequent outputs are identical. Depending on your data, that might be a very unlikely circumstance. Otherwise, I suppose you could call the expression above a subgradient and use it to do subgradient descent, which looks pretty much like regular gradient descent. The sgn function will return 0 where subsequent outputs are identical. Consider that something similar occurs with ReLUs; the activation function is non-differentiable at 0, but the subgradient is typically declared to be 0, 0.5, or 1 at this point, and standard gradient-based learning methods work fine (e.g. see here).
As you mention, standard online training methods (e.g. stochastic gradient descent) might not work so well because of the dependence on sequential outputs. You could try doing batch training, where you sweep through all data points to compute the gradient and update parameters once per sweep.