[TeX/LaTeX] Find duplicated article titles in a .bib file


I am writing my thesis and copied some entries from earlier .bib files into my current one. I may have used different labels for the same article and cited both labels in the thesis, in which case my reference list would contain the same article twice. I have almost 190 references, so finding repeated articles by eye would be hard.

Is it possible to find in my bib file entries with the same title? I know bibtex looks for repeated labels. Is it possible to find repeated titles in my .bib file?

Best Answer

You could use Perl to go through the bib file, store each title as a hash key with the line numbers where it occurs as the hash value, and then loop through the hash and print every title that occurs on more than one line. To do so, create a file with the following content, e.g. "finddupls.pl", change the bib file name, then execute perl finddupls.pl in your terminal:

use strict;
use warnings;

my %seen = ();

my $line = 0;
open my $B, '<', 'file.bib' or die "Cannot open file.bib: $!";
while (<$B>) {
    # keep track of the current line number
    $line++;
    # remove everything except letters, digits, spaces, _ and -, because bibtex could have " or { to encapsulate strings etc
    s/[^a-zA-Z0-9 _-]//g;
    # pattern matches lines which start with title;
    # lower-case the title to be case-insensitive
    $seen{lc($1)} .= "$line," if /^\s*title\s*(.+)$/i;
}
close $B;

# loop through the titles and count the number of lines found
foreach my $title (keys %seen) {
    # count number of elements separated by comma
    my $num = $seen{$title} =~ tr/,//;
    print "title '$title' found $num times, lines: $seen{$title}\n" if $num > 1;
}

# write sorted list into file
open my $S, '>', 'sorted_titles.txt' or die "Cannot write sorted_titles.txt: $!";
print $S join("\n", sort keys %seen), "\n";
close $S;

It prints something like this directly in the terminal:

title 'observation on soil moisture of irrigation cropland by cosmic-ray probe' found 2 times, lines: 99,1350,
title 'multiscale and multivariate evaluation of water fluxes and states over european river basins' found 2 times, lines: 199,1820,
title 'calibration of a non-invasive cosmic-ray probe for wide area snow water equivalent measurement' found 2 times, lines: 5,32,

It additionally writes a file sorted_titles.txt that lists all titles in alphabetical order, which you can go through to spot near-duplicates manually.
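
Since the question is about the same article sitting under different labels, it may be more convenient to see the citation keys of the duplicates rather than line numbers. Below is a minimal sketch of that variation. It is not part of the script above: the sub name titles_to_keys and the file name dupkeys.pl are made up for illustration, and it assumes that every entry starts with @type{key, on one line and that each title field fits on a single line.

```perl
use strict;
use warnings;

# Map each normalized title to the citation keys of the entries that use it.
# Takes the whole .bib content as one string and returns a hash reference.
# (Hypothetical helper, written for this example.)
sub titles_to_keys {
    my ($bib) = @_;
    my (%keys_for, $key);
    for my $line (split /\n/, $bib) {
        # remember the key of the entry we are in, e.g. "@article{smith2010,"
        if ($line =~ /^\s*\@\w+\s*[{(]\s*([^,\s]+)\s*,/) {
            $key = $1;
            next;
        }
        # assumption: only single-line "title = ..." fields are handled
        next unless defined $key && $line =~ /^\s*title\s*=\s*(.+)$/i;
        my $title = lc $1;
        $title =~ s/[^a-z0-9 -]//g;   # drop braces, quotes, punctuation
        $title =~ s/\s+/ /g;          # collapse runs of spaces
        $title =~ s/^ | $//g;         # trim leading/trailing space
        push @{ $keys_for{$title} }, $key;
    }
    return \%keys_for;
}

# Usage: perl dupkeys.pl file.bib
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $bib = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    my $keys_for = titles_to_keys($bib);
    for my $title (sort keys %$keys_for) {
        my @keys = @{ $keys_for->{$title} };
        print "'$title': @keys\n" if @keys > 1;
    }
}
```

Each printed line would then show one title followed by all labels that carry it, so you can search your thesis for those labels and keep only one of them. As with the script above, booktitle fields are ignored because the pattern requires the line to start with title.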
