The short answer is that converting the original untimed MATLAB code to sequential HDL produces a design that needs multiple clock cycles to compute its result. This is inherent in the algorithm, so the number of cycles will vary from design to design. I'm not familiar with this particular example, so I can't say what specifically causes the code to require conversion to sequential form.
For your question #1: HDL Coder is indicating that the generated HDL has a latency of n cycles. Even with no explicit pipelining, converting the algorithm from untimed MATLAB code to sequential HDL code means the HDL needs n cycles to complete. You do not need to add explicit pipelining unless you want to. The delays are not pipeline registers per se; they are required by the algorithmic conversion.

Question #2 is a more general FPGA design question. The 'correct' answer may depend on your algorithm, your target device, your synthesis tool, whether HDL Coder is using Distributed Pipelining, and many other factors. I don't think there is a hard and fast answer to this other than "whatever works well for your design environment".
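To make the idea of "a latency of n cycles" concrete, here is a minimal sketch in Python. It is not HDL Coder output and not how the tool places its delays. It only models the observable effect: the clocked design produces the same values as the untimed code, shifted by n cycles. The names `untimed_reference` and `clocked_model`, the arithmetic inside them, and the value n = 2 are all made up for illustration.

```python
from collections import deque


def untimed_reference(x):
    """Hypothetical stand-in for the untimed algorithm (no notion of a clock)."""
    return 2 * x + 1


def clocked_model(inputs, latency, initial=0):
    """Cycle-by-cycle model: the result for input k appears at cycle k + latency.

    This only imitates the input/output timing. It says nothing about where
    a real tool would put the delays inside the design.
    """
    regs = deque([initial] * latency)  # delay line, holding reset values
    outputs = []
    for x in inputs:                   # one loop iteration = one clock cycle
        regs.append(untimed_reference(x))
        outputs.append(regs.popleft())
    return outputs


inputs = [1, 2, 3, 4, 5]
n = 2  # assumed latency, purely for illustration

ref = [untimed_reference(x) for x in inputs]
hw = clocked_model(inputs, n)

# The first n outputs are reset values; after that the two streams agree
# once the reference is shifted by n cycles.
assert hw[n:] == ref[:len(ref) - n]
```

In practice this is why a comparison against the original MATLAB results has to account for the reported latency: the values match, but n cycles later.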