Regular Expression to Match a DOI (Digital Object Identifier)

The DOI (Digital Object Identifier) is a unique identifier used in biomedical publications (and in many other places). Because lazy scientists, or lazy editors, don’t tag the PDF correctly, it’s the only piece of information I could reliably parse out of any research paper. I wrote the regular expression to be used in Python. It retrieves the DOI in the format 10.1234/whateverishere. It stops matching when it hits a space or a capital letter, which is usually what follows the DOI. I tested it on 40 papers from several journals and it retrieved the DOI correctly from 39.

import re
regexDOI = re.compile(r'(?:doi|DOI)[\s.:]{0,2}(10\.\d{4}[\d:.\-/a-z]+)[A-Z\s]')
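
Here is a minimal sketch of how the regex could be used once you have the text of a paper as a string. The helper name find_doi and the sample text are made up for illustration, and the pattern spells the "doi" prefix as a non-capturing group, (?:doi|DOI), since square brackets would match only a single character.

```python
import re

# The "doi" prefix is a non-capturing group; the DOI itself is group 1.
regexDOI = re.compile(r'(?:doi|DOI)[\s.:]{0,2}(10\.\d{4}[\d:.\-/a-z]+)[A-Z\s]')

def find_doi(text):
    """Return the first DOI found in text, or None if nothing matches."""
    match = regexDOI.search(text)
    return match.group(1) if match else None

# Hypothetical sample, standing in for text extracted from a PDF.
sample = "Journal of Examples 12 (2010) 34-56. doi: 10.1234/example.2010.001 Received 1 May"
print(find_doi(sample))  # prints 10.1234/example.2010.001
```

Since the capture stops at the first capital letter, a DOI whose suffix contains uppercase letters would be cut short, so it is worth checking the result by eye on journals that use them.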

I’m open to suggestions to improve the regex and to test it against more papers. Feel free :)