[Tex/LaTex] Seqsplit problems with UTF8

unicode

I'm using seqsplit to split long words within cells of a longtable. I'm using utf8x and ucs packages too. I'm generating PDFs out of those .tex files.

When words have UTF-8 characters, the first one in the sequence raises an error.

\seqsplit{Música} using utf8x appears as `M[U+FFFD]sica`

This is the error it raises:

! Package utf8x Error: MalformedUTF-8sequence.

See the utf8x package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              

l.398 .. com} & \seqsplit{Música}

Ifthecharacterisanargument,putitin{}


Package ucs Warning: Unknown character 65533 = 0xFFFD appeared again. on input 
line 398.

If I remove seqsplit, the word appears correctly, but I need the splitting that this package provides. Maybe someone knows an alternative package or macro I can use.

The funniest part is that if the word contains two or more UTF-8 characters, what I get is:

\seqsplit{Múúsica} `M[U+FFFD]úsica`

It only fails on the first such character, so I'm sure the file itself is correctly encoded in UTF-8.

Best Answer

Unicode code points beyond the 7-bit ASCII range are encoded as several bytes in UTF-8. Package seqsplit does not know this, because it was written for long DNA/RNA/protein/… sequences, so it processes the bytes of one character separately. It is the wrong package for natural text: languages have rules about where words may be broken (usually not after every letter), and they require a hyphenation character to be inserted at the break.

Thus for narrow columns I recommend package ragged2e with the command \RaggedRight, which is similar to \raggedright but allows hyphenation. It therefore fills the available space better.
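A minimal sketch of how this could look in a longtable, assuming the array package for the `>{...}` column prefix and plain [utf8] input; the column widths and cell contents are made up for illustration. Note the capitalization \RaggedRight, which is how ragged2e spells the command.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}  % lets words with accented letters be hyphenated
\usepackage{array}        % provides >{...} and \arraybackslash
\usepackage{longtable}
\usepackage{ragged2e}     % provides \RaggedRight
\begin{document}
% Illustrative widths; \arraybackslash restores \\ as the row terminator
\begin{longtable}{>{\RaggedRight\arraybackslash}p{2.5cm}
                  >{\RaggedRight\arraybackslash}p{2cm}}
  Contact & Genre \\
  \hline
  someone@example.com & Música electrónica experimental \\
\end{longtable}
\end{document}
```

Loading babel with the document's language should give better break points, since hyphenation patterns are language specific.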

Nevertheless, if a sequence for \seqsplit contains UTF-8 characters and a Unicode TeX engine (XeTeX, LuaTeX) is not used, then each multi-byte UTF-8 character can be wrapped in braces, so that \seqsplit sees it as a single protected unit:

\documentclass{article}
\usepackage[utf8x]{inputenc}
\usepackage{seqsplit}
\begin{document}
\seqsplit{M{ú}sica}
\end{document}
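For comparison, here is a sketch of the other case mentioned above, a Unicode TeX engine. It assumes the file is compiled with xelatex or lualatex, where each character should reach \seqsplit as a single token, so the extra braces should not be needed. The use of fontspec is an assumption for illustration and is not part of the original answer.

```latex
% Compile with xelatex or lualatex (not pdflatex)
\documentclass{article}
\usepackage{fontspec}  % assumption: default Latin Modern fonts contain ú
\usepackage{seqsplit}
\begin{document}
\seqsplit{Música}
\end{document}
```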

Related Solutions

[Tex/LaTex] Using ligatures as Unicode

There are two problems with the code:

Load inputenc with the [utf8] option instead of [utf8x].
The second parameter of \DeclareUnicodeCharacter is what the character will be replaced with, so it is meaningless to put the character itself there again. Replace it with what you really need (f and i will be joined into a ligature by TeX, as usual): \DeclareUnicodeCharacter{FB01}{fi}

After these changes the code will work as desired.
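Since the question's code is not reproduced here, the following is only a minimal sketch of a document with both changes applied; the sample word is made up and is typed with the precomposed ligature character U+FB01.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
% Map the ligature code point U+FB01 to the two letters f and i;
% TeX then forms the ligature from the font as usual.
\DeclareUnicodeCharacter{FB01}{fi}
\begin{document}
ﬁnal % the first character is U+FB01, not the letters f and i
\end{document}
```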

[Tex/LaTex] Problem after copying text: inputenc Error: Unicode char \u8:‭ not set up for use with LaTeX

Unfortunately, utf8.def does not show the numerical representation of the missing Unicode character. The missing character <char> appears only literally, in the macro name \u8:<char>, which is of little help when the character is invisible. The following example adds the code point to the error message:

\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{stringenc}
\usepackage{pdfescape}

\makeatletter
\renewcommand*{\UTFviii@defined}[1]{%
  \ifx#1\relax
    \begingroup
      % Remove prefix "\u8:"
      \def\x##1:{}%
      % Extract Unicode char from command name
      % (utf8.def does not support surrogates)
      \edef\x{\expandafter\x\string#1}%
      \StringEncodingConvert\x\x{utf8}{utf16be}% convert to UTF-16BE
      % Hexadecimal representation
      \EdefEscapeHex\x\x
      % Enhanced error message
      \PackageError{inputenc}{Unicode\space char\space \string#1\space
                              (U+\x)\MessageBreak
                              not\space set\space up\space
                              for\space use\space with\space LaTeX}\@eha
    \endgroup
  \else\expandafter
    #1%
  \fi
}
\makeatother

\begin{document}
^^c2^^a0 % 7-bit input for U+00A0
\end{document}

Result:

! Package inputenc Error: Unicode char \u8:  (U+00A0)
(inputenc)                not set up for use with LaTeX.
