[TeX/LaTeX] Automatically adding DOI fields to a hand-made bibliography


Is there a tool that will read a .bib file and add the correct DOI fields for all the entries that don't have them?

My current workflow is to add citations as and when I need them, using AUCTeX in Emacs, so a command-line tool would be fine. I'd rather not load the file into a bibliography manager like JabRef, because it adds superfluous fields such as "owner" and "timestamp". Since all the bibliographic information is already in the file, it should be easy to identify the right DOI with some judicious database searching…

Best Answer

I followed user13348's suggestion and, using his request function, wrote a Python 3 script that takes a .bib file and writes a new .bib file containing the DOIs it finds. It does not use bibtool or read any .aux files.

The requirements are bibtexparser and unidecode.

#!/usr/bin/env python3
import sys, re
from unidecode import unidecode
import bibtexparser
from bibtexparser.bwriter import BibTexWriter
import http.client as httplib
import urllib.parse  # urlencode lives in the urllib.parse submodule

# Search for the DOI given a title; e.g.  "computation in Noisy Radio Networks"
# Credit to user13348, slight modifications
# http://tex.stackexchange.com/questions/6810/automatically-adding-doi-fields-to-a-hand-made-bibliography
def searchdoi(title, author):
  params = urllib.parse.urlencode({"titlesearch":"titlesearch", "auth2" : author, "atitle2" : title, "multi_hit" : "on", "article_title_search" : "Search", "queryType" : "author-title"})
  headers = {"User-Agent": "Mozilla/5.0" , "Accept": "text/html", "Content-Type" : "application/x-www-form-urlencoded", "Host" : "www.crossref.org"}
  # conn = httplib.HTTPConnection("www.crossref.org:80") # Not working any more, HTTPS required
  conn = httplib.HTTPSConnection("www.crossref.org")       
  conn.request("POST", "/guestquery/", params, headers)
  response = conn.getresponse()
  #print(response.status, response.reason)
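  data = response.read().decode("utf-8", errors="replace")
  # Returns a match object whose first group is the DOI, or None if nothing was found
  return re.search(r'doi\.org/([^"<>]+)', data)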

def normalize(string):
    """Normalize strings to ascii, without latex."""
    string = re.sub(r'[{}\\\'"^]',"", string)
    string = re.sub(r"\$.*?\$","",string) # better remove all math expressions
    return unidecode(string)

def get_authors(entry):
    """Get a list of authors' or editors' last names."""
    def get_last_name(authors):
        for author in authors:
            author = author.strip(" ")
            if "," in author:
                yield author.split(",")[0]
            elif " " in author:
                yield author.split(" ")[-1]
            else:
                yield author

    try:
        authors = entry["author"]
    except KeyError:
        authors = entry["editor"]

    # Split on the separator word "and" only, not on "and" inside a name such as "Sandler"
    authors = re.split(r"\s+and\s+", normalize(authors))
    return list(get_last_name(authors))

print("Reading Bibliography...")
with open(sys.argv[1]) as bibtex_file:
    bibliography = bibtexparser.load(bibtex_file)

print("Looking for Dois...")
before = 0
new = 0
total = len(bibliography.entries)
for i,entry in enumerate(bibliography.entries):
    print("\r{i}/{total} entries processed, please wait...".format(i=i,total=total),flush=True,end="")
    if "doi" in entry and entry["doi"].strip():
        before += 1  # this entry already has a DOI
        continue
    try:
        title = entry["title"]
        authors = get_authors(entry)
    except KeyError:
        continue  # no title, or neither author nor editor: nothing to search with
    for author in authors:
        doi_match = searchdoi(title,author)
        if doi_match:
            entry["doi"] = doi_match.group(1)
            new += 1
            break  # stop at the first hit so each entry is counted only once
print("")

template="We added {new} DOIs !\nBefore: {before}/{total} entries had DOI\nNow: {after}/{total} entries have DOI"
print(template.format(new=new,before=before,after=before+new,total=total))

outfile = sys.argv[1]+"_doi.bib"
print("Writing result to ",outfile)
writer = BibTexWriter()
writer.indent = '    '     # indent entries with 4 spaces instead of one
with open(outfile, 'w') as bibfile:
    bibfile.write(writer.write(bibliography))
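
A note on the author handling: a BibTeX author field separates names with the word "and" surrounded by whitespace, so splitting on the bare substring "and" would also cut names such as "Sandler" in two. Here is a minimal standalone sketch of the last-name extraction, using only the standard library. The function name last_names is made up for this illustration, and the sketch ignores corner cases such as "von" particles or names wrapped in braces.

```python
import re

def last_names(author_field):
    """Split a BibTeX author field and return each author's last name."""
    names = []
    # Split on "and" only when it stands alone between whitespace
    for author in re.split(r"\s+and\s+", author_field.strip()):
        author = author.strip()
        if not author:
            continue  # skip empty pieces, e.g. from an empty field
        if "," in author:
            # "Last, First" form: the last name comes before the comma
            names.append(author.split(",")[0].strip())
        else:
            # "First Last" form: take the final word
            names.append(author.split()[-1])
    return names

print(last_names("Anderson, Alice and Bob Sandler"))
# A plain substring split breaks the second name apart:
print("Anderson, Alice and Bob Sandler".split("and"))
```

The first line printed should contain the two last names, while the second shows "Sandler" cut into "S" and "ler".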

Run the script like this:

python3 searchdoi.py test.bib

The output looks like this:

Reading Bibliography...
Looking for Dois...
161/162 entries processed, please wait...
We added 49 DOIs !
Before: 42/162 entries had DOI
Now: 91/162 entries have DOI
Writing result to  test.bib_doi.bib

You can now just check test.bib_doi.bib.
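
Since the search matches on a title plus a single last name, and the DOI is pulled out of the returned page with a loose regular expression, an occasional wrong or malformed hit seems possible. The following is a minimal sketch, using only the standard library, that flags doi fields which do not have the usual DOI shape. The helper name suspicious_dois and both patterns are assumptions made for this illustration, not part of the script above, and the shape test is only a heuristic: some valid older DOIs may not fit it, and a well-formed DOI can still belong to the wrong paper.

```python
import re

# Heuristic DOI shape (an assumption for illustration):
# "10.", a 4 to 9 digit registrant code, "/", then a non-empty suffix.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")

# Matches a field such as  doi = {10.1000/xyz}  or  doi = "10.1000/xyz"
DOI_FIELD = re.compile(r'^\s*doi\s*=\s*[{"]([^}"]*)[}"]',
                       re.IGNORECASE | re.MULTILINE)

def suspicious_dois(bibtex_text):
    """Return the doi values in bibtex_text that do not look like a DOI."""
    return [doi for doi in DOI_FIELD.findall(bibtex_text)
            if not DOI_SHAPE.match(doi.strip())]

# Made-up sample entries, only to exercise the function
sample = """
@article{a,
    doi = {10.1000/xyz123},
}
@article{b,
    doi = {search?query=foo},
}
"""
print(suspicious_dois(sample))
```

To apply it to the real output, pass the file contents instead of the sample, for example suspicious_dois(open("test.bib_doi.bib").read()).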