My setup is happy with UTF-8 or escaped LaTeX typographic characters; should I use one or the other?

Tags: markdown, pandoc, typography, unicode, utf8

I’m using Pandoc for an initial conversion of Markdown files into LaTeX. I have two questions:

  1. I use UTF-8 in all my documents without any issues, and can output PDFs from LaTeX with either or both UTF-8 and escaped characters (see example input and output below). Is there any reason to use escaped characters in my LaTeX source text when the equivalent UTF-8 characters appear fine in the output?

  2. When generating LaTeX, Pandoc converts all the UTF-8 characters in my Markdown to LaTeX escapes. I've tried the standalone -s flag to no avail. If anyone else using Pandoc Markdown to LaTeX conversions knows a way to prevent this, I'd be grateful (but see update below).

I'd like this test input to remain unconverted in the .tex output file:

Some UTF-8 ‘characters’ in “this document” are ellipsis… plus—emdash—and en–dash.

but the Pandoc output gives me:

Some UTF-8 `characters' in ``this document'' are ellipsis\ldots{} plus---emdash---and en--dash.

UPDATE: disabling the Pandoc smart extension (-smart) handles everything except the ellipsis (…), which still generates \ldots{}:

Some UTF-8 ‘characters’ in “this document” are ellipsis\ldots{} plus—emdash—and en–dash.

Best Answer

It doesn't make a lot of difference to LaTeX.

Essentially, if you are using pdflatex the Unicode character expands to the classic ASCII markup, and if you are using lualatex or xelatex the classic ASCII markup expands to the Unicode character, so either way they end up being the same.
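As a rough way to check this on your own setup, the minimal test file below puts both forms of the example sentence side by side. It is only a sketch and assumes a LaTeX kernel from 2018 or later, where UTF-8 is the default input encoding for pdflatex; on older installations, \usepackage[utf8]{inputenc} would be needed in the preamble. The file must be saved as UTF-8.

```latex
\documentclass{article}
\begin{document}
% UTF-8 input: under pdflatex each character expands to a text command
% (for example \textquoteleft, \textemdash).
Some UTF-8 ‘characters’ in “this document” are ellipsis… plus—emdash—and en–dash.

% Classic ASCII markup: ligatures and \ldots select the same glyphs.
Some UTF-8 `characters' in ``this document'' are ellipsis\ldots{} plus---emdash---and en--dash.
\end{document}
```

Compiling it with pdflatex and again with lualatex should give two visually matching lines in each PDF.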

Of course, not all LaTeX commands expand to Unicode characters, but you can look at the list in tex/latex/base/tuenc.def, which shows around 500 commands for accents and composites that are defined by default. For example, \c{c} ends up as ç with lualatex because of:

 \DeclareUnicodeComposite{\c}             {c}{"00E7}

Conversely, not all of Unicode is mapped to commands in pdflatex, but most of the common European accented characters are, and you can add more as needed. ç expands to \c{c} in pdflatex because of this definition in utf8enc.dfu:

\DeclareUnicodeCharacter{00E7}{\c c}
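Adding a mapping of your own follows the same pattern. The sketch below is for pdflatex only (lualatex and xelatex read Unicode natively and do not use this mechanism). The choice of U+2192 and of \rightarrow as its replacement is purely illustrative, on the assumption that this character has no default mapping in your installation.

```latex
\documentclass{article}
% Only needed on LaTeX kernels older than 2018; harmless on newer ones.
\usepackage[utf8]{inputenc}
% Illustrative mapping: make U+2192 (RIGHTWARDS ARROW) print as the math arrow.
% \ensuremath lets it work in both text and math mode.
\DeclareUnicodeCharacter{2192}{\ensuremath{\rightarrow}}
\begin{document}
Markdown → LaTeX → PDF
\end{document}
```

Without the \DeclareUnicodeCharacter line, pdflatex would normally stop with an error saying the Unicode character is not set up for use with LaTeX.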