[Math] Handling arXiv feeds to avoid duplicates

arxiv reading-list

I subscribe to feeds from the arXiv Front for a number of subject areas, using Google Reader. This is great, but there is one problem: when a new preprint is listed in several subject categories, it gets listed in several feeds, which means I have to spend more time reading through the lists of new items, and due to my slightly dysfunctional memory, I often download the same preprint twice. Is there a way to get around this problem, by somehow merging the feeds, using a different arXiv site, or using some other clever trick?

(Hope this is not too off-topic, I think a good answer could be useful to a number of mathematicians. Also, I would like to tag this "arxiv" but am not allowed to add new tags.)

Best Answer

Unless the arXiv has changed recently, articles are published daily, which means that the feeds and the email announcements are completely in step.

The problem with the duplicates is that each feed is a separate request to the arXiv for information. The arXiv doesn't know that you are going to merge these results, and I've never heard of a feed reader that attempts to merge feeds to remove duplicates.

However, all is not lost. The feeds that the arXiv provides are not the only way to find information. The arXiv has an API which means that you can effectively craft your own feed. For example, if you point your browser at:

http://export.arxiv.org/api/query?search_query=submittedDate:[20091014200000+TO+20091015200000]&start=0&max_results=500

then you get all the papers submitted in that 24-hour window (yesterday, at the time of writing). You can also filter the search by subject:

http://export.arxiv.org/api/query?search_query=%28cat:math.AT+OR+cat:math.CT%29+AND+submittedDate:[20091014200000+TO+20091015200000]&start=0&max_results=500

Because the request is handled all at once, no duplicates are produced (as can be seen from Emily Riehl's paper, which is listed in both math.AT and math.CT but appears only once).
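For anyone who wants to generate such URLs rather than type them, here is a minimal Python sketch. The function name build_query_url and its parameters are my own invention, not part of the arXiv API; it only assembles the same query string as the example above and makes no network request.

```python
API_BASE = "http://export.arxiv.org/api/query"


def build_query_url(categories, start_stamp, end_stamp, max_results=500):
    """Build an arXiv API URL for papers in ANY of the given categories
    submitted between two timestamps (YYYYMMDDHHMMSS strings, in UTC).

    Hypothetical helper: it mirrors the hand-written URL shown above.
    """
    # One "cat:..." term per category, joined with OR, so a cross-listed
    # paper matches a single query instead of several separate feeds.
    cat_clause = "+OR+".join(f"cat:{c}" for c in categories)
    date_clause = f"submittedDate:[{start_stamp}+TO+{end_stamp}]"
    # %28 and %29 are the URL-encoded parentheses around the OR group.
    search = f"%28{cat_clause}%29+AND+{date_clause}"
    return f"{API_BASE}?search_query={search}&start=0&max_results={max_results}"


if __name__ == "__main__":
    print(build_query_url(["math.AT", "math.CT"],
                          "20091014200000", "20091015200000"))
```

Called with math.AT, math.CT and the two timestamps above, this should reproduce the second URL exactly; adding more categories just extends the OR group.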

The only catch is that you need to put the dates in the proper form each time; you can't put in dates such as "today" or "yesterday". In addition, the timezone handling is a little weird: the arXiv publishes updates at a certain time determined by its local timezone, which includes daylight saving changes, but the API uses GMT/UTC. So if you want to exactly replicate the "new preprints" announcement of the arXiv, then you need to do some funky timezone conversions.
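To give an idea of what that conversion involves, here is a rough Python sketch (not the RefBase code). It assumes the cutoff is 16:00 in the America/New_York timezone, which is what the 20:00 UTC stamps in the October example above would correspond to; both values are assumptions to check against the arXiv's actual schedule, and weekends and holidays are ignored. The names utc_stamp and submission_window are made up for illustration.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Assumptions, not taken from arXiv documentation:
LOCAL_TZ = ZoneInfo("America/New_York")  # timezone governing the cutoff
CUTOFF = time(16, 0)                     # local cutoff time of day


def utc_stamp(day, cutoff=CUTOFF, tz=LOCAL_TZ):
    """Return the cutoff on `day` as a YYYYMMDDHHMMSS string in UTC."""
    # Attach the local timezone first, so that daylight saving time is
    # applied for that particular date, and only then convert to UTC.
    local = datetime.combine(day, cutoff, tzinfo=tz)
    return local.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def submission_window(end_day, cutoff=CUTOFF, tz=LOCAL_TZ):
    """Return (start, end) stamps for the 24 hours ending at the cutoff
    on `end_day`. Weekends and holidays are not handled."""
    return (utc_stamp(end_day - timedelta(days=1), cutoff, tz),
            utc_stamp(end_day, cutoff, tz))


if __name__ == "__main__":
    print(submission_window(date(2009, 10, 15)))
```

Under these assumptions the window ending on 15 October 2009 comes out with the same two stamps as in the URLs above, while a window in December shifts by an hour in UTC because daylight saving time is no longer in effect.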

However, this can be done and I've done it. I use a program called RefBase for organising my references and I've modified it so that each morning it presents me with a list of what's new on the arxiv for me to scan through and decide which articles to add to my own bibliographic database. I can also scan back a few days if I've been on holiday. Buried in this extension is the code for figuring out what the date-stamp should be. I could extract it if there's any interest.

Documentation on the arXiv API is at their documentation site. The 'submittedDate' field isn't covered there, though; it's a newer feature.
