[TeX/LaTeX] Capitalizing strings ignoring closed class words

capitalization, macros

I was just reviewing my "capitalization standards" for titles and the like, and was wondering whether there's a macro that does what I currently have to do by hand. My personal rules (feel free to disagree with or comment on them) are as follows:

  1. the first letter (a number is a letter too for the sake of these rules) always gets capitalized,
  2. every word gets capitalized individually,
  3. the exceptions to rule 2 (and only to rule 2) are closed-class words such as prepositions and the like.

In other words: I'd like a capitalization command (like \MakeUppercase) that will capitalize every word not included in a list of words and that will always capitalize the first word of its argument.

Doable?

PS: one such "list" of closed class words (also known as "function words") can be found here.

Best Answer

\documentclass[a4paper]{article}
\usepackage[latin1]{inputenc}
\usepackage{xparse}
\ExplSyntaxOn
\NewDocumentCommand{\capitalize}{>{\SplitList{~}}m}{
  \CapitalizeFirst#1\Capitalize\unskip
}
\ExplSyntaxOff
\def\Sentinel{\Capitalize}
\def\CapitalizeFirst#1{\MakeUppercase#1 \Capitalize}
\def\Capitalize#1{%
  \def\next{#1}%
  \ifx\next\Sentinel
    \expandafter\unskip
  \else
    \CheckInList{#1}\space\expandafter\Capitalize
  \fi}
\def\CheckInList#1{%
  \ifcsname List@\detokenize{#1}\endcsname
    #1%
  \else
    \MakeUppercase#1%
  \fi}
\makeatletter
\def\AppendToList#1{%
  \@for\next:=#1\do
  {\expandafter\let\csname List@\detokenize\expandafter{\next}\endcsname\empty}}
\makeatother
\AppendToList{a,is,of}

\begin{document}
\capitalize{here is a list of words école}
\end{document}

This won't work with UTF-8 input (unless XeLaTeX or LuaLaTeX is used).

It fails under pdflatex with UTF-8 because \MakeUppercase is applied only to the first byte of what may be a two-, three- or four-byte sequence (for Western languages, usually two). For it to work, the whole byte sequence has to be passed to \MakeUppercase.

To be clearer: \MakeUppercase uppercases its argument, and the usual call is \MakeUppercase{word}. Here we write \MakeUppercase#1 instead (without braces), so only the first token of the word (usually a single character) is uppercased. This is where input such as \'ecole fails: the token passed to \MakeUppercase is \', which it can't do anything useful with. With école (and a one-byte encoding such as latin1), \MakeUppercase processes é and gives the correct result.

With UTF-8 this fails too: what we see as é on screen when writing a LaTeX document is actually two bytes (C3 and A9), and again \MakeUppercase would be passed only the first one. So a more complex routine is necessary.
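The difference between the braced and brace-less calls described above can be seen in isolation with plain ASCII input. This is only a minimal sketch; \FirstTokenUp is a made-up helper name that mimics what \CapitalizeFirst and \CheckInList do with their argument.

```latex
\documentclass{article}
\begin{document}
% \FirstTokenUp is a hypothetical helper, only for this demonstration
\newcommand\FirstTokenUp[1]{\MakeUppercase#1}

\MakeUppercase{word}  % braces: the whole word is the argument, giving WORD

\FirstTokenUp{word}   % becomes \MakeUppercase word: only w is the argument, giving Word
\end{document}
```

With a multi-byte UTF-8 character in first position, that single grabbed token is just the first byte, which is why the byte sequence has to be collected before \MakeUppercase sees it.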

To make this work with pdflatex and UTF-8, the definitions of \CheckInList and \CapitalizeFirst above can be replaced by the following; since the macro names contain @, the code must sit between \makeatletter and \makeatother:

\def\CapitalizeFirst#1{\expandafter\UC@next#1 \Capitalize}
\def\CheckInList#1{%
  \ifcsname List@\detokenize{#1}\endcsname
    #1%
  \else
    \expandafter\UC@next#1%
  \fi}
\def\UC@next#1{%
  \ifx#1\UTFviii@two@octets
     \expandafter\@firstoffour
  \else
    \ifx#1\UTFviii@three@octets
      \expandafter\expandafter\expandafter\@secondoffour
    \else
      \ifx#1\UTFviii@four@octets
        \expandafter\expandafter\expandafter\expandafter\expandafter
        \@thirdoffour
      \else
        \expandafter\expandafter\expandafter\expandafter\expandafter
        \expandafter\expandafter\@fourthoffour
      \fi
    \fi
  \fi
  {\UC@two}{\UC@three}{\UC@four}{\MakeUppercase}#1}
\def\UC@two#1#2#3{\MakeUppercase{#1#2#3}}
\def\UC@three#1#2#3#4{\MakeUppercase{#1#2#3#4}}
\def\UC@four#1#2#3#4#5{\MakeUppercase{#1#2#3#4#5}}
\providecommand\@firstoffour[4]{#1}
\providecommand\@secondoffour[4]{#2}
\providecommand\@thirdoffour[4]{#3}
\providecommand\@fourthoffour[4]{#4}

However, accent commands such as \'e are not allowed (nor are they in the first version).


UPDATE

After a few years, here's a better implementation, thanks to new expl3 features; it works for all engines.

\documentclass[a4paper]{article}

\usepackage{ifxetex}

\ifxetex
  \usepackage{fontspec}
\else
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\fi

\usepackage{xparse}

\ExplSyntaxOn
\NewDocumentCommand{\capitalize}{>{\SplitList{~}}m}
 {
  \seq_clear:N \l_capitalize_words_seq
  \ProcessList{#1}{\CapitalizeFirst}
  \seq_use:Nn \l_capitalize_words_seq { ~ }
 }
\NewDocumentCommand{\CapitalizeFirst}{m}
 {
  \capitalize_word:n { #1 }
 }

\sys_if_engine_pdftex:TF
 {
  \cs_set_eq:Nc \capitalize_tl_set:Nn { protected@edef }
 }
 {
  \cs_set_eq:NN \capitalize_tl_set:Nn \tl_set:Nn
 }

\cs_new_protected:Nn \capitalize_word:n
 {
  \capitalize_tl_set:Nn \l_capitalize_word_tl { #1 }
  \seq_if_in:NfTF \g_capitalize_exceptions_seq { \tl_to_str:n { #1 } }
   % exception word: store it as typed
   { \seq_put_right:Nn \l_capitalize_words_seq { #1 } }
   % any other word: store it in mixed case (first letter uppercase)
   { \seq_put_right:Nx \l_capitalize_words_seq { \tl_mixed_case:V \l_capitalize_word_tl } }
 }
\cs_generate_variant:Nn \tl_mixed_case:n { V }
\NewDocumentCommand{\AppendToList}{m}
 {
  \clist_map_inline:nn { #1 }
   {
    \seq_gput_right:Nx \g_capitalize_exceptions_seq { \tl_to_str:n { ##1 } }
   }
 }
\cs_generate_variant:Nn \seq_if_in:NnTF { Nf }
\seq_new:N \l_capitalize_words_seq
\seq_new:N \g_capitalize_exceptions_seq
\ExplSyntaxOff

\AppendToList{a,is,of,óf}

\begin{document}
X\capitalize{here is a list of words óf école}X
\end{document}
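If a current kernel reports \tl_mixed_case:n as undefined (to my knowledge it was deprecated in favor of the \text_titlecase... functions and later removed), the same idea can be sketched with \text_titlecase_first:n. This is not the code above, just a hedged variant: it assumes a LaTeX kernel from late 2020 or newer, and \TitleCap, \AddTitleExceptions and the titlecap prefix are names invented for the sketch. It also forces the first word to be capitalized (rule 1 of the question), which, as far as I can tell, the listing above only achieves when the first word is not in the exception list.

```latex
\documentclass{article}

\ExplSyntaxOn
% All names with the "titlecap" prefix, and the two user commands,
% are hypothetical and exist only in this sketch.
\seq_new:N \l__titlecap_words_seq
\seq_new:N \g__titlecap_exceptions_seq
\tl_new:N  \l__titlecap_word_tl

% Store each comma-separated word as an exception
\NewDocumentCommand{\AddTitleExceptions}{m}
 {
  \clist_map_inline:nn { #1 }
   { \seq_gput_right:Nn \g__titlecap_exceptions_seq { ##1 } }
 }

% Expand the case change first, then typeset the result
\cs_new_protected:Npn \__titlecap_upper:n #1
 {
  \tl_set:Nx \l__titlecap_word_tl { \text_titlecase_first:n { #1 } }
  \tl_use:N \l__titlecap_word_tl
 }

\NewDocumentCommand{\TitleCap}{m}
 {
  \seq_set_split:Nnn \l__titlecap_words_seq { ~ } { #1 }
  \seq_map_indexed_inline:Nn \l__titlecap_words_seq
   {
    % ##1 is the position of the word, ##2 the word itself
    \int_compare:nNnT { ##1 } > { 1 } { ~ }
    \int_compare:nNnTF { ##1 } = { 1 }
     { \__titlecap_upper:n { ##2 } } % first word: always capitalized
     {
      \seq_if_in:NnTF \g__titlecap_exceptions_seq { ##2 }
       { ##2 }                         % exception: keep as typed
       { \__titlecap_upper:n { ##2 } } % any other word
     }
   }
 }
\ExplSyntaxOff

\AddTitleExceptions{a,is,of}

\begin{document}
\TitleCap{a list of words is here}
% intended result: A List of Words is Here
\end{document}
```

Unlike the listing above, this sketch typesets the words directly instead of collecting them in a sequence, so it is not suited to places where the result has to be stored or written to a file.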
