[TeX/LaTeX] Command-line tools for some BibTeX database manipulations

bibtextools

Inspired by an answer to another question, I am reminded of the various reasons why it's difficult to have only one BIB file. I would like to have a set of command line tools that I could script to take my 'master' BIB and convert it for a specific situation. I am aware of bibtool which is probably powerful enough that it can meet most of my needs, but I haven't figured out how to use it…

Here are some common tasks that could be automated this way. Of course, the underlying problem would be solved 'the right way' by using a smarter BST (or something like biblatex), but often one is constrained to use a publisher's broken BST (or at least it is much more convenient to do so) and make a few tweaks to the database:

  1. Truncate the author list of every paper with more than 5 authors, keeping only the first 3 authors (because you are running low on space)
  2. Remove a field (month, url, ISSN, ISBN) because a journal has a broken BST that makes a mess of these
  3. Remove a field conditionally (eg, remove title only for articles but not books)
  4. Expand/collapse @string references (eg, I have 2 files, one apsjour.bib with
    @STRING{prl = {Phys. Rev. Lett.}}
    and one fulljour.bib with
    @STRING{prl = {Physical Review Letters}}
    so I can write \bibliography{apsjour,articlebib} or \bibliography{fulljour,articlebib} as required, but I certainly don't expect my co-authors to deal with this convention)
  5. Remove archiveprefix, eprint, primaryclass from @articles with a page number
  6. Remove DOIs with weird characters in them because the documentclass doesn't do enough escaping (seriously, Wiley, is
    10.1002/(SICI)1521-3978(200005)48:5/7<531::AID-PROP531>3.0.CO;2-#
    really a good idea for something that needs to end up in a URL?)
  7. A tool that is flexible enough to do more sophisticated things would be nice, so long as the flexibility isn't at the price of usability (cf bibtool)
  8. There are probably other use cases that I've temporarily forgotten. I feel like I'm forever making journal-specific or even paper-specific modifications of bibtex databases. JabRef makes it fairly easy, but still…

There are lots of cute tools living on CTAN in tex-archive/biblio/bibtex/utils, but I feel that there must be some other place where the serious tools hide. I can't be the only one with these problems, can I? (Feel free to tell me that I'm approaching this the wrong way and let me know your personal strategy for dealing with the above issues without using command-line tools! This includes, as Jukka Suomela suggests in his answer, tools for editing the generated BBL file instead of editing the BIB.)

Here are some problems that I already know of solutions for:

  1. Process an AUX file to keep only the cited entries (many solutions, including bibextract)
  2. Merge BIB files and remove duplicates (pybib)
  3. Force a particular format of page ranges (1234-56 or 1234-1256) (bibfile-reformat-pages)

Best Answer

Have a look at biber, which in the current 1.5 dev version on SourceForge has a new "tool" mode that lets you use biber's reencoding and source mapping features independently of biblatex. Judging from your description, the source mapping features are what you mainly need, and they are all documented in the PDF manual. I can provide specific examples if you have specific questions. biber will do everything you mention above apart from the @string expansion, which would be possible to add but, as you say, is fairly idiosyncratic.

Of course, you can do this dynamically with biber too: the changes are applied as the .bib is read, and the .bib itself is not touched. The new tool mode allows you to write the changed .bib to another file without writing a .bbl.

For example, here is how to tackle points 2, 3, 5 and 6 from your list in tool mode. Point 1 is better handled semantically with biblatex and its maxnames/minnames options. Create a biber.conf with:

<config>
  <sourcemap>
    <maps datatype="bibtex" map_overwrite="1">
      <map>
        <map_step map_field_set="issn" map_null="1"/>
      </map>
      <map>
        <per_type>ARTICLE</per_type>
        <map_step map_field_set="title" map_null="1"/>
      </map>
      <map>
        <per_type>ARTICLE</per_type>
        <map_step map_field_source="pages" map_final="1"/>
        <map_step map_field_set="archiveprefix" map_null="1"/>
        <map_step map_field_set="eprint" map_null="1"/>
        <map_step map_field_set="primaryclass" map_null="1"/>
      </map>
      <map>
        <map_step map_field_source="doi" map_match="[\\;]" map_final="1"/>
        <map_step map_field_set="doi" map_null="1"/>
      </map>
    </maps>
  </sourcemap>
</config>

Then run biber with

biber --tool file.bib

This looks in the default locations for your biber.conf and writes its output to a file called file_bibertool.bib.
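
For point 1, which is left to biblatex rather than tool mode, a minimal preamble sketch might look like the following. It assumes the document already uses biblatex with the biber backend, and it borrows the file name articlebib.bib from the question purely for illustration. With maxnames=5, any name list holding more than 5 names is truncated, and minnames=3 sets how many names are kept in that case.

```latex
% Sketch only: assumes biblatex with the biber backend is in use.
% articlebib.bib is the illustrative file name from the question.
\usepackage[
  backend=biber,
  maxnames=5,  % truncate name lists with more than 5 names ...
  minnames=3   % ... down to the first 3 names
]{biblatex}
\addbibresource{articlebib.bib}
```

If the truncation should apply only in the bibliography and not in citations, the maxbibnames/minbibnames options can be used in the same way.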

As I said, all of this is also possible dynamically: biber applies the same biber.conf when you process the file normally into a .bbl. The whole mapping functionality is also available in biblatex through macros (see \DeclareSourcemap in the biblatex documentation) if you want to do this dynamically on a per-document basis.
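
As a rough per-document counterpart to the biber.conf maps for points 2, 3 and 5, a \DeclareSourcemap block in the preamble could be sketched as below. This is a translation under the assumption that the installed biblatex supports the \maps, \map, \pertype and \step macros with these options; the DOI map is left out because its regular expression would need care with \regexp escaping.

```latex
% Sketch only: per-document equivalent of some of the biber.conf maps.
\DeclareSourcemap{
  \maps[datatype=bibtex, overwrite]{
    % Point 2: drop the issn field from every entry
    \map{
      \step[fieldset=issn, null]
    }
    % Point 3: drop the title, but only for @article entries
    \map{
      \pertype{article}
      \step[fieldset=title, null]
    }
    % Point 5: for @article entries that have a pages field,
    % drop the eprint-related fields; "final" stops this map
    % for entries without a pages field
    \map{
      \pertype{article}
      \step[fieldsource=pages, final]
      \step[fieldset=archiveprefix, null]
      \step[fieldset=eprint, null]
      \step[fieldset=primaryclass, null]
    }
  }
}
```

Because these maps are applied as biber reads the data, the .bib file on disk stays unchanged, so co-authors who share the same database are unaffected.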