Regular Expression to Match a DOI (Digital Object Identifier)

The DOI (Digital Object Identifier) is a unique identifier used in biomedical publications (and in many other places). Because PDFs are often not tagged correctly, whether by lazy scientists or lazy editors, it’s the only piece of information that I could reliably parse out of any research paper. I wrote the regular expression for use in Python. It retrieves the DOI in the format 10.1234.whateverishere. It stops at the first space or capital letter after the DOI, which is usually where the DOI ends. I tested it on 40 papers from several journals and it retrieved the DOI correctly from 39.

import re
regexDOI = re.compile(r'(?:doi|DOI)[\s\.\:]{0,2}(10\.\d{4}[\d\:\.\-\/a-z]+)[A-Z\s]')
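Here is a minimal sketch of how the pattern might be applied to text that has already been extracted from a PDF. It assumes the prefix is meant as the literal word "doi" or "DOI", so it is written as a non-capturing group. The names `DOI_PATTERN`, `find_doi` and the sample string are made up for illustration.

```python
import re

# Assumption: the prefix is the literal word "doi" or "DOI" (non-capturing
# group), followed by up to two separator characters and then the DOI itself.
DOI_PATTERN = re.compile(r'(?:doi|DOI)[\s.:]{0,2}(10\.\d{4}[\d:.\-/a-z]+)[A-Z\s]')


def find_doi(text):
    """Return the first DOI-like string in text, or None if nothing matches."""
    match = DOI_PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical line of extracted PDF text, for illustration only.
sample = "Journal of Examples. doi: 10.1234/abc.2010.56 Published online"
print(find_doi(sample))
```

Because the pattern requires a space or a capital letter after the DOI, a DOI sitting at the very end of the text is not matched unless some trailing whitespace follows it.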

I’m open to suggestions to improve the regex and to test it against more papers. Feel free :)

6 thoughts on “Regular Expression to Match a DOI (Digital Object Identifier)”

  1. dunno if anybody cares at this point, but since I got here, maybe others will.

    A DOI is specifically intended to *be* a link to a document, not just to be an identifier that you have to google around for. Just drop the DOI into this URL pattern:

    http://dx.doi.org/10.XXXX.RANDOM

    Whammo! You now have a permanent link to your article, no matter who’s publishing it, as long as they keep the DOI active.

  2. Nice, thanks for sharing, but it fails when there is no “doi” prefix and, for example, on “10.1016/S0735-1097(98)00347-7”.

    My approach (in PCRE) is (?:doi[\s.:]{0,2})?(10[.][^/]+/[^\s"]+).

  3. There is only one fish, in southern Siberia, that can swim even deeper into this subject… than I can.

    Cheers. “Folks from my neighborhood… who used to come down here to bring the birds”, but they don’t come anymore because nobody knows what the Cebola Mol are these days!

  4. So, first I need the PDF file of a paper. Then, to parse it, I’m using pyPDF. However, I don’t get neat output from the parser. Many PDF files don’t have “Title” tags, for example, so it’s hard to get decent information out of them.

    The best way I found to get from the parsed PDF to the paper’s information (title, publication date, abstract, etc.) was to extract the DOI from it. This line of code is just the regular expression I use to find it. Regular expressions are used to find patterns in text. Since the DOI has a regular pattern (10.XXXX.RANDOM), it’s “easy” to find.

    Next, I’m thinking of coupling this DOI search with the PubMed Entrez Utilities to retrieve information on the paper. Yep, it sucks to have the info already in the PDF and still have to get it from the Web, but it’s the best I can do at the moment :(

    Did you get it? :)

  5. That’s a web search for the DOI, right? You search for one precise DOI; you can’t do a free-text search? So you first have to find the article’s DOI and then use it to search again?
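Two of the suggestions in the comments combine naturally: the looser pattern from comment 2 and the resolver address from comment 1. The sketch below ports that pattern to Python's `re` and turns each match into a link. The names `LOOSE_DOI`, `RESOLVER` and `doi_links` are made up for illustration, and both the case-insensitive flag and the percent-encoding step are my own assumptions rather than anything stated in the comments.

```python
import re
from urllib.parse import quote

# Comment 2's pattern: an optional "doi" prefix, then "10.", a registrant
# code, a slash, and a suffix that runs up to whitespace or a double quote.
# Assumption: re.IGNORECASE is added so that a "DOI" prefix is consumed too.
LOOSE_DOI = re.compile(r'(?:doi[\s.:]{0,2})?(10[.][^/]+/[^\s"]+)', re.IGNORECASE)

RESOLVER = "http://dx.doi.org/"  # resolver address given in comment 1


def doi_links(text):
    """Return a resolver URL for every DOI-like string found in text."""
    # Assumption: characters that are unsafe in a URL path are
    # percent-encoded, while the slash between prefix and suffix is kept.
    return [RESOLVER + quote(doi, safe="/") for doi in LOOSE_DOI.findall(text)]


print(doi_links("See 10.1016/S0735-1097(98)00347-7 for details."))
```

Since `[^/]+` accepts any run of characters other than a slash, this pattern can over-match when "10." shows up in some other context; restricting that part to digits would be one way to tighten it.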
