[Tex/LaTex] Verbatim, wrappable and Unicode text included from a file

cjk listings unicode verbatim

I am trying to render several megabytes of UTF-8 text in a printable format. Because enscript and friends don't handle Unicode, what began as a Unix SE question soon turned to LaTeX.

Summary of requirements: process several MB of Chinese, Latin and special characters (i.e. characters that would be invalid as LaTeX code), in a mix of long and short lines, into a tightly packed, multi-column format. I know LaTeX can control font size, spacing, margins and so on, as well as headings and page numbers, so I am not too worried about that part.

The problem I have so far is getting verbatim input of the file that is both wrappable and UTF-8 capable.

Is there a way to get both?

With the fancyvrb package, for example, I get a good rendition of the Chinese text, but no line breaking:

\documentclass{article}
\usepackage{CJKutf8}
\usepackage{ucs}
\usepackage{multicol}
\usepackage{fancyvrb} % provides \VerbatimInput
\usepackage[encapsulated]{CJK}
\newcommand{\myfont}{bsmi} % or {stheiti}, etc

\begin{document}

\begin{CJK}{UTF8}{\myfont}

    \begin{multicols}{2}
        \VerbatimInput[fontfamily=cmr]{file.txt}
    \end{multicols}
\end{CJK}
\end{document} 

I have so far been unable to get a listings-type environment to deal with the Unicode in the file.

Here is an example of the kind of input I'm feeding it. The goal is to format thousands of lines of multilingual chat logs, basically 100,000+ lines of this kind of thing:

###### 2013-10-26.223938+0000GMT.html ######
**** user@example.com/ (jabber) ****
(00:00:00) Lorem ipsum
(00:00:01) 車檢作畫病得星定局而是作的所由次園又此對這一馬的生故他試……外由懷黃客建時常嚴在位以說其配場戲回有部結一自法就生機,定的被各世皮全空!也地傳現他重城,書照展商直起眾家不思政國林年八計出地口早體故失離際們層氣,簡他廣集義,四便入的只了極。
(00:00:02) odesset ullamcorper quo. Cu adipisci assentior eam, debet definiebas eos ad. Te eos nihil populo vivendum, vix iusto noster peri
(00:00:02) odesset ullamcorper quo. Cu adip
(02:00:02) odesset ullamcorper quo. Cu adipisci assentior eam, debet d
(02:00:01) 車檢作畫病得星定局而是作的所由次園又此對這一馬的生故他試……外由懷黃客建時常嚴在位以說其配場戲回有部結一自法就生機,定的被各
(02:00:02) ok
(00:01:02) 病
(00:00:02) ok
(00:01:02) 

###### 2013-10-26.223938+0000GMT.html ######
**** user@example.com/ (jabber) ****
(00:00:01) 車檢作畫病得星定局而是作展商直起眾家不思政國林年八計出地口早體故失離際們層氣,簡他廣集義,四便入的只了極。
(00:00:02) odesset ullamcorper quo. Cu adipisci assentior eam, debet definiebas eos ad. Te eos nihil populo vivendum, vix iusto noster peri
(00:00:02) odesset ullamcorper quo. Cu adip
(02:00:02) odesset ullamcorper quo. 

Ideally I'm looking for the most general solution possible, as the input text is not necessarily in a well-defined machine-readable format.

Edit: I also tried the following, using the plain verbatim package:

\usepackage{verbatim}

\makeatletter
\def\@xobeysp{ }
\makeatother

with this in the document body:

\verbatiminput{file.txt}

This works for normal text and Chinese, but fails to break lines containing very long words, URLs or long strings of letters, all of which occur in the file.

Best Answer

As you have already found, the verbatim package is easy to configure for proper line breaking, and it escapes all the special characters.

For CJK text, xeCJK is your friend. It has several options that control how CJK text behaves in verbatim material.

I hacked into the xeCJK package to tune the line breaking. The result looks better, but it is not perfect.

% -*- coding: utf-8 -*-
% Compile with XeLaTeX
\documentclass{article}

\usepackage{etoolbox}

\usepackage{verbatim}
\makeatletter
\def\@xobeysp{\ }% Or just a space, with a different result
\let\verbatim@nolig@list\empty
\appto\verbatim@font{\raggedright}
\makeatother

\usepackage{xeCJK}% You'd better use the latest version
\setCJKmonofont{SimSun}
\xeCJKsetup{Verb=false}
\normalspacedchars{}
\ExplSyntaxOn
% Hack into xeCJK package if you want to allow linebreaks after (almost) any character.
% Delete this if you don't want that.
\tex_chardef:D \c_fifty = 50 ~
\xeCJK_inter_class_toks:nnn { Default } { Default } { \tex_penalty:D \c_one_hundred }
\xeCJK_inter_class_toks:nnn { Default } { HalfLeft } { \tex_penalty:D \c_fifty }
\xeCJK_inter_class_toks:nnn { Default } { HalfRight } { \tex_penalty:D \c_one_thousand }
\xeCJK_inter_class_toks:nnn { HalfLeft } { Default } { \tex_penalty:D \c_one_thousand }
\xeCJK_inter_class_toks:nnn { HalfLeft } { HalfLeft } { \tex_penalty:D \c_one_thousand }
\xeCJK_inter_class_toks:nnn { HalfLeft } { HalfRight } { \tex_penalty:D \c_one_thousand }
\xeCJK_inter_class_toks:nnn { HalfRight } { Default } { \tex_penalty:D \c_fifty }
\xeCJK_inter_class_toks:nnn { HalfRight } { HalfLeft } { \tex_penalty:D \c_fifty }
\xeCJK_inter_class_toks:nnn { HalfRight } { HalfRight } { \tex_penalty:D \c_one_thousand }
\ExplSyntaxOff

\begin{document}

\verbatiminput{test.log}

\end{document}

The test log file with long lines:

###### 2013-10-26.223938+0000GMT.html ######
**** user@example.com/ (jabber) ****
(00:00:00) Lorem ipsum
(00:00:01) 車檢作畫病得星定局而是作的所由次園又此對這一馬的生故他試……外由懷黃客建時常嚴在位以說其配場戲回有部結一自法就生機,定的被各世皮全空!也地傳現他重城,書照展商直起眾家不思政國林年八計出地口早體故失離際們層氣,簡他廣集義,四便入的只了極。
(00:00:02) odesset ullamcorper quo. Cu adipisci assentior eam, debet definiebas eos ad. Te eos nihil populo vivendum, vix iusto noster peri
(00:00:02) `~!@#$%^&*()_+-={}[]|\:;"'.?/
(02:00:02) LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLONG LINE
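
The question also asks for a tightly packed, multi-column layout, which the answer's example does not show. The following is a minimal sketch of how the answer's verbatim and xeCJK setup might be combined with multicol and geometry. The paper size, margins, column gap, column count and font size are assumptions chosen for illustration, not values from the question, and the optional xeCJK line-breaking hack is left out, so very long unbroken Latin strings may still overflow a column.

```latex
% Compile with XeLaTeX.
% A sketch only: layout values below are assumptions, not from the question.
\documentclass{article}

\usepackage[a4paper,margin=1cm]{geometry} % narrow margins (assumed values)
\usepackage{multicol}
\setlength{\columnsep}{8pt}               % tight gap between columns (assumed)

\usepackage{etoolbox}                     % provides \appto
\usepackage{verbatim}
\makeatletter
\def\@xobeysp{\ }                         % breakable spaces, as in the answer
\appto\verbatim@font{\raggedright}        % ragged right inside verbatim, as in the answer
\makeatother

\usepackage{xeCJK}
\setCJKmonofont{SimSun}                   % font from the answer; any installed CJK font should do
\xeCJKsetup{Verb=false}                   % same setting as in the answer

\begin{document}

\footnotesize                             % smaller type to fit more lines per page
\begin{multicols}{2}
\verbatiminput{test.log}
\end{multicols}

\end{document}
```

The size command is placed before \verbatiminput because the verbatim font setup resets family, series and shape but not the size, so \footnotesize should carry through. Changing the argument of multicols is the simplest way to trade line length against density.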