[Tex/LaTex] Converting LaTeX commands in BibTeX title field to UTF-8


I've been Googling and downloading different software for the last two days and I'm not getting very far. I was referred to this site and it looks like a great resource. Hopefully I can get the answer I need.

I'm working on a publication collection for a major research university. The project has oriented around the collection of CVs and extracting each citation to a parser. Our objective is to have a database with all this content by the end of August.

We've run into a problem processing publications for faculty members whose publications are mostly held in external databases. We've successfully extracted the majority of these publications (BibTeX -> JabRef, then export to a particular citation format), but we've hit a brick wall getting the LaTeX-encoded characters to display properly.

For example:

We find this publication:

We grab the BibTeX, which would look like:

   title = {Searches for the baryon- and lepton-number violating decays $B\rightarrow{}\Lambda{}c+l-$, $B-\rightarrow{}\Lambda{}l-$, and $B-\rightarrow{}\Lambda{}\ifmmode\bar\else\textasciimacron\fi{}l-$},
   collaboration = {<emph type="italic">BABAR</emph> Collaboration},
   author = {del Amo Sanchez, P. and others},
   journal = {Phys. Rev. D},
   volume = {83},
   number = {9},
   pages = {091101},
   numpages = {8},
   year = {2011},
   month = {May},
   doi = {10.1103/PhysRevD.83.091101},
   publisher = {American Physical Society}

The BibTeX is imported into our JabRef database and exported through a customized filter. The problem is that the title still contains the LaTeX character encodings and thus does not meet our accuracy standards.

I've read BibTeX documentation stating that BibTeX itself performs no conversion, so for citations that contain LaTeX, I'm looking for a method we can use to ensure the accuracy of each citation. Additionally, we need each character written in LaTeX to be converted to usable UTF-8 (since our database won't recognize anything that is not UTF-8).

Will something like Biber work for this? Will we need to abandon JabRef for citations that contain LaTeX encodings? I've tried just adding the BibTeX as references in a .tex document and then uploading the final PDF, but it hasn't been working well.

Any suggestions?

Best Answer

(Converting my and Ulrike's comments into some form of answer.)

I'm not quite sure what you are expecting to happen here. I suspect that by 'LaTeX encodings' you mean items which are included as control sequences (for example \rightarrow). A lot of these are math symbols, and comprehensive coverage in this area is hard to find at the font level. So even with UTF-8, LaTeX users will tend to stick to the symbolic names here. Moreover, the title doesn't need to contain only characters. It is quite possible to insert a graphic:

title = {Tiger: {\includegraphics[width=1cm]{tiger}}}

or a chessboard or an exotic symbol. So if you want to build a database which uses only UTF-8 characters, a human will have to go over such special titles and decide what should be used as a replacement.

Thus your best approach is going to be some custom scripting, taking items you do know how to convert and changing them. At the same time, anything else can be flagged for human intervention, which will be the only way in many cases.
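
As a rough illustration of that approach, here is a minimal Python sketch. The names KNOWN_SEQUENCES and convert_title are hypothetical, made up for this example; they are not part of JabRef, BibTeX, or Biber. The lookup table covers only two symbols from the title in the question and would need to be extended by hand. Anything the table does not know is left untouched and reported, so the entry can be passed to a human.

```python
import re

# Hypothetical lookup table: add only control sequences you are sure about.
KNOWN_SEQUENCES = {
    "rightarrow": "\u2192",  # RIGHTWARDS ARROW
    "Lambda": "\u039b",      # GREEK CAPITAL LETTER LAMDA
}

# A control word such as \rightarrow, optionally followed by an empty group {}.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(?:\{\})?")


def convert_title(title):
    """Return (converted_title, unknown_sequences).

    Known control words are replaced by their Unicode character.
    Unknown ones are left as they are and listed for manual review.
    """
    unknown = []

    def substitute(match):
        name = match.group(1)
        if name in KNOWN_SEQUENCES:
            return KNOWN_SEQUENCES[name]
        unknown.append(name)
        return match.group(0)  # leave unrecognised input untouched

    converted = CONTROL_WORD.sub(substitute, title)
    if not unknown:
        # Strip math delimiters only when everything inside was understood.
        converted = converted.replace("$", "")
    return converted, unknown


if __name__ == "__main__":
    samples = [
        r"$B\rightarrow{}\Lambda{}c+l-$",
        r"Tiger: {\includegraphics[width=1cm]{tiger}}",
    ]
    for raw in samples:
        text, unknown = convert_title(raw)
        status = "needs review" if unknown else "ok"
        print(status, text, unknown)
```

The first sample converts cleanly, while the second is flagged because \includegraphics has no sensible UTF-8 replacement. The sketch deliberately ignores everything else TeX does (conditionals such as \ifmmode, accents, superscripts, nested groups), so titles using those end up in the review pile rather than being guessed at.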