[TeX/LaTeX] How to get Unicode characters into HTML output

Tags: characters, htlatex, tex4ht, unicode

I am trying to make a book in both print and ebook. To make the print book I generate a PDF with xelatex. I have reconstructed all the source files to be encoded in UTF-8 and the Cyrillic and Japanese characters are appearing correctly.

To get an ebook, you generate HTML. For this I am using htlatex which appears to use TeX4ht to do the work.

htlatex "mybook.tex" "xhtml, charset=utf-8" " -cunihtf -utf8"

But when generating the HTML, I get errors in the log saying:

Missing character: There is no ½ in font cmr10!

It appears to complain that all the characters are not in the font, so it simply doesn't output the character. However, we all know that browsers support these characters.

I don't really want to select a font, because I would like the HTML file to not specify the font — I want to use the default font of the browser or ebook reader. Yet, somehow I need to tell LaTeX to quit being so paranoid, and just go ahead and output the character anyway. Is there any way to do that?

I have read elsewhere that I need to select a "unicode" font. With xelatex I use the fontspec package, but that is apparently not allowed by TeX4ht (I get an error saying so).

I set the following in the preamble:

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

I suspect this is wrong, since T1 means Latin1 characters? I need Cyrillic and Japanese. Anyway, when I do this, I get a new error:

Package inputenc Error: Unicode char \u8:¥ not set up for use with LaTeX.

Yes, it is not set up right. So, my question is: how do I select a font, or how do I otherwise convince htlatex to simply ALLOW all the characters to be written to the HTML file?

Here is a MWE that demonstrates the problem:

\documentclass{book}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
\frontmatter
\mainmatter

The flight was 4½ hours with a 3 hour change.  
We landed in Honolulu.  It was 85° and after 
rearranging the luggage which took a little time, 
ナンシースエンソン quickly got heated up.

\end{document}

My compile script is:

htlatex "MWE.tex" "xhtml, charset=utf-8" " -cunihtf -utf8"

Best Answer

Your example doesn't work even with pdflatex, so it isn't really a surprise that it doesn't work with tex4ht either. The easiest way to get non-European scripts working with tex4ht is to use the helpers4ht bundle, in particular its emulation of the fontspec package. helpers4ht isn't on CTAN yet, so you need to install it yourself.
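For context on why the original preamble fails under pdflatex: with the utf8 option, inputenc only accepts characters that have been declared for the loaded font encodings, and anything else raises the "not set up for use with LaTeX" error. The following is only a minimal sketch of that mechanism, assuming pdflatex and only the two Latin-range symbols from the example; it is not part of the helpers4ht approach, and it does nothing for the Japanese text. The explicit \DeclareUnicodeCharacter lines may be redundant on recent LaTeX releases, where these characters are already declared once textcomp is available.

```latex
\documentclass{book}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{textcomp} % provides \textonehalf and \textdegree (TS1 encoding)
% Explicit mappings from Unicode code points (hex) to LaTeX commands.
% Assumption: these two are the only non-ASCII characters in the text.
\DeclareUnicodeCharacter{00BD}{\textonehalf} % ½
\DeclareUnicodeCharacter{00B0}{\textdegree}  % °
\begin{document}
The flight was 4½ hours with a 3 hour change.
We landed in Honolulu. It was 85°.
\end{document}
```

Any character without such a declaration, including every kana in the example, still triggers the inputenc error, which is why a Unicode engine is used for the full document.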

Now back to your example: it is a little more difficult because of the Japanese, which needs to be handled by a package such as xeCJK for XeTeX or luatexja for LuaTeX. Neither of these packages is supported by tex4ht, but we can load them using an alternative package loader, which will suppress them when tex4ht is loaded:

\documentclass{book}
\usepackage{alternative4ht}
\altusepackage{luatexja-fontspec}
% \altusepackage{xeCJK}
\altusepackage{fontspec}
\altusepackage{polyglossia}
\setmainlanguage{english}
% \setotherlanguage{japanese}
% \usepackage[T1]{fontenc}
% \usepackage[utf8]{inputenc}
\begin{document}
\frontmatter
\mainmatter

The flight was 4½ hours with a 3 hour change.  
We landed in Honolulu.  It was 85° and after 
rearranging the luggage which took a little time, 
ナンシースエンソン quickly got heated up.

\end{document}

You need to compile this example with LuaTeX as the engine for tex4ht, for example with this command:

make4ht -ul filename.tex

LuaTeX is needed even if your document is normally compiled with XeTeX, because Lua callbacks are used to convert Unicode to a form suitable for tex4ht.
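The idea of suppressing unsupported packages when tex4ht is loaded can also be illustrated by hand. The sketch below is not how alternative4ht is implemented; it only assumes the commonly used test that \HCode is defined during a tex4ht run, and it simply skips the font packages in that case. Unlike helpers4ht, it provides no fontspec emulation, so on its own it does not make the non-ASCII characters appear in the HTML.

```latex
\documentclass{book}
\ifdefined\HCode
  % tex4ht run: do not load packages that tex4ht does not support
\else
  % ordinary LuaLaTeX run (for the PDF)
  \usepackage{luatexja-fontspec}
  \usepackage{polyglossia}
  \setmainlanguage{english}
\fi
\begin{document}
\frontmatter
\mainmatter

We landed in Honolulu.

\end{document}
```

Commands defined by the skipped packages, such as \setmainlanguage, have to stay inside the \else branch so that they are not executed when the package is absent.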
