[TeX/LaTeX] Escape & Sanitize User Input

input symbols

Background

I am generating a book from user-submitted content.

Problem

Names like Marie Curie-Sk\l{}odowska cause LaTeX to fail when they are not escaped (e.g., when entered as MARIE CURIE-SKŁODOWSKA).

Questions

  • What macros are available to ensure that characters are translated into their LaTeX-friendly equivalents?
  • How do you prevent items like \input{/etc/passwd}?

Thank you!

Best Answer

Depending on what input you need, simply declaring your document's encoding as UTF-8 (\usepackage[utf8]{inputenc}) will allow unescaped Unicode characters. If you need more variety than the major Latin-based languages, you should use XeLaTeX (which assumes Unicode source) and a font that contains as many of the scripts as you might need; otherwise you'll need to adjust your input CGI script to choose the appropriate language and pass it to your document.
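As a rough illustration of choosing the engine from the input, here is a minimal Python sketch. The function name needs_xelatex and the "Unicode name starts with LATIN" test are my own assumptions, not an established rule; it is a crude heuristic and will misjudge some characters (for example the micro sign), so treat it as a starting point only.

```python
import unicodedata


def needs_xelatex(text):
    """Return True if any alphabetic character in text has a Unicode
    name that does not start with 'LATIN' (hypothetical heuristic)."""
    for ch in text:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False


def pick_engine(text):
    # Assumption: pdflatex with inputenc is enough for Latin-script text,
    # anything else is sent to xelatex with a suitable font.
    return "xelatex" if needs_xelatex(text) else "pdflatex"
```

With this, a name such as Marie Curie-Skłodowska stays on pdflatex, while Cyrillic or Greek input is routed to xelatex.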

You also need to decide how to handle characters that are reserved by LaTeX but might be part of your allowable input (#, %, $, _, ^, &, {, }). Most of these should be turned into \#, \%, etc.; the exception is ^, because \^ is an accent command, so use \textasciicircum{} instead. (The same applies to ~, which needs \textasciitilde{}.) This can easily be done with a regular expression substitution in your CGI script, although if you need to allow math input, this is more complicated.
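The substitution described above might look something like this in Python. This is only a sketch: the names LATEX_SPECIALS and escape_latex are made up for illustration, and the backslash and tilde entries are my additions to the list in the answer. It assumes text-mode input only (no math).

```python
import re

# Replacement for each character LaTeX treats specially in text mode.
# Backslash, tilde and caret cannot simply be prefixed with a backslash,
# because \\, \~ and \^ are themselves LaTeX commands.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
    "#": r"\#",
    "%": r"\%",
    "$": r"\$",
    "_": r"\_",
    "&": r"\&",
    "{": r"\{",
    "}": r"\}",
}

_SPECIALS_RE = re.compile(r"[\\~^#%$_&{}]")


def escape_latex(text):
    """Escape LaTeX-reserved characters in a single pass."""
    return _SPECIALS_RE.sub(lambda m: LATEX_SPECIALS[m.group(0)], text)
```

Doing everything in one pass matters: if the backslash were replaced in a separate step, the braces in \textbackslash{} would be escaped again by the later brace substitution.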

As for sanitizing dangerous stuff from the input, the safest approach is to not allow any LaTeX markup at all, in which case you can simply strip out all instances of \ from your input text. (And obviously don't run LaTeX with the -shell-escape option.) If you need limited markup, this is doable but trickier, depending on what you want to allow.
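A minimal Python sketch of the "no markup at all" approach, assuming a TeX Live style pdflatex that accepts -no-shell-escape, -interaction=nonstopmode and -halt-on-error. The function names and the 60-second timeout are hypothetical choices for illustration.

```python
import subprocess


def strip_markup(text):
    """Remove every backslash so no control sequence survives."""
    return text.replace("\\", "")


def build_command(tex_file, engine="pdflatex"):
    """Build an engine invocation with shell escape explicitly disabled."""
    return [
        engine,
        "-no-shell-escape",            # never let the document run commands
        "-interaction=nonstopmode",    # do not wait for input on errors
        "-halt-on-error",
        tex_file,
    ]


def compile_document(tex_file, engine="pdflatex", timeout=60):
    """Run the engine; the timeout guards against input that loops forever."""
    return subprocess.run(
        build_command(tex_file, engine),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
```

After stripping, \input{/etc/passwd} becomes the harmless text input{/etc/passwd}. The braces and other reserved characters are still present, so they need to be escaped separately before the text goes into the document.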