Looks like you answered your own question. However, you should check how they implemented log-softmax. See my answer here for a numerically stable softmax function. Therefore, log softmax should be:

```
import numpy as np

def log_softmax(q):
    # Shift by the largest logit (at least 0) so the exponentials cannot overflow.
    max_q = max(0.0, np.max(q))
    rebased_q = q - max_q
    # logaddexp(-max_q, ...) folds the shift back into the normalizer; the
    # exp(-max_q) term also includes an implicit logit fixed at zero.
    return rebased_q - np.logaddexp(-max_q, np.logaddexp.reduce(rebased_q))
```

As long as your inputs are finite, I don't think this can ever be infinite.
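As a quick sanity check (a minimal sketch; the function is repeated so the snippet stands alone, and the large test logits are my own choice), the naive formula overflows for big inputs while the rebased version stays finite:

```
import numpy as np

def log_softmax(q):
    # Shift by the largest logit (at least 0) so the exponentials cannot overflow.
    max_q = max(0.0, np.max(q))
    rebased_q = q - max_q
    return rebased_q - np.logaddexp(-max_q, np.logaddexp.reduce(rebased_q))

q = np.array([1000.0, 1000.0])
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.log(np.exp(q) / np.exp(q).sum())  # exp(1000) overflows -> nan
stable = log_softmax(q)                          # finite, about -0.693 each
```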

Recurrent neural networks recur over time. For example, if you have the sequence

x = ['h', 'e', 'l', 'l']

This sequence is fed to a **single** neuron which has a single connection to itself.

At time step 0, the letter 'h' is given as input. At time step 1, 'e' is given as input. The network, when unfolded over time, will look like this.

A recursive network is just a generalization of a recurrent network. In a recurrent network, the weights are shared (and dimensionality remains constant) along the length of the sequence; this is necessary because there would be no way to assign position-dependent weights to a test-time sequence of a length never seen at train time. In a recursive network, the weights are shared (and dimensionality remains constant) at every node, for the same reason.

This means that all the W_xh weights will be equal (shared), and so will the W_hh weights. This is simply because it is a single neuron which has been unfolded in time.
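A minimal sketch of this unfolding in NumPy (the one-hot encoding, sizes, and tanh nonlinearity are my assumptions, not from the original post); note that the same `W_xh` and `W_hh` are reused at every time step:

```
import numpy as np

# Hypothetical sizes: 4-letter vocabulary, 3 hidden units.
vocab = {'h': 0, 'e': 1, 'l': 2, 'o': 3}
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((3, 4)) * 0.1   # input-to-hidden, shared across time
W_hh = rng.standard_normal((3, 3)) * 0.1   # hidden-to-hidden, shared across time

def one_hot(ch):
    v = np.zeros(4)
    v[vocab[ch]] = 1.0
    return v

h = np.zeros(3)                            # initial hidden state
for ch in ['h', 'e', 'l', 'l']:
    # The same weight matrices are applied at every step:
    # this is the "unfolding in time".
    h = np.tanh(W_xh @ one_hot(ch) + W_hh @ h)
```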

This is what a **Recursive Neural Network** looks like.

It is quite simple to see why it is called a Recursive Neural Network: each parent node's representation is computed from its children by the same unit, applied recursively.
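A minimal sketch of that recursion (the binary-tree shape, tanh combiner, dimensions, and names are all my assumptions): the same shared weight matrix `W` combines two child vectors into a parent vector at every internal node.

```
import numpy as np

rng = np.random.default_rng(0)
DIM = 3
W = rng.standard_normal((DIM, 2 * DIM)) * 0.1  # one weight matrix shared by every node

def compose(node):
    """Return the vector for a tree node: a leaf is already a vector;
    an internal node is (left, right), combined by the shared weights."""
    if isinstance(node, np.ndarray):
        return node
    left, right = node
    return np.tanh(W @ np.concatenate([compose(left), compose(right)]))

# A tiny parse tree, ((a b) c), with random leaf vectors.
a, b, c = (rng.standard_normal(DIM) for _ in range(3))
root = compose(((a, b), c))
```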

Which neural network to use depends on your application. In Karpathy's blog post, he generates characters one at a time, so a recurrent neural network works well.

But if you want to generate a parse tree, then using a Recursive Neural Network is better because it helps to create better hierarchical representations.

If you want to do deep learning in C++, then use CUDA. It has a large user base and is fast. I do not know much more about it, so I cannot comment further.

In Python, Theano is the best option because it provides automatic differentiation: when you are building big, awkward networks, you don't have to derive gradients by hand; Theano does it automatically for you. Torch7 lacks this feature.
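Automatic differentiation is worth a concrete illustration. This is not Theano's implementation (Theano differentiates symbolically on a computation graph); it is a toy reverse-mode sketch of the idea, with all names my own:

```
import math

class Var:
    """Toy reverse-mode autodiff value: each operation records its inputs
    and local derivatives, so gradients flow back without hand derivation."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents   # pairs of (parent Var, local derivative)

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def tanh(self):
        t = math.tanh(self.value)
        return Var(t, [(self, 1.0 - t * t)])

    def backward(self, seed=1.0):
        # Accumulate this node's gradient and push it to the parents.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

# d/dx of tanh(x*x + x) at x = 0.5, without deriving the gradient by hand.
x = Var(0.5)
y = (x * x + x).tanh()
y.backward()
```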

Theano is also very fast, since it compiles your Python-defined expressions down to C and can run them on GPUs. It also has an awesome user base, which is very important when learning something new.

## Best Answer

Here you go https://github.com/ofirnachum/tree_rnn.

This Theano implementation should be roughly what is described in the Kai Sheng Tai paper.