Regular Expression to Match a DOI (Digital Object Identifier)

The DOI (Digital Object Identifier) is a unique identifier used in biomedical publications (and in many other places). Because scientists or editors often don’t bother tagging the PDF correctly, it’s the only piece of information I could reliably parse from any research paper. I wrote the regular expression to be used in Python. It retrieves the DOI in the format 10.1234/whateverishere. The match ends at the first space or capital letter, which is usually where the DOI ends (so a DOI that itself contains capital letters would be cut short). I tested it on 40 papers from several journals and it retrieved the DOI perfectly from 39.

import re
regexDOI = re.compile(r'(?:doi|DOI)[\s\.\:]{0,2}(10\.\d{4}[\d\:\.\-\/a-z]+)[A-Z\s]')
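To show how the pattern might be used, here is a minimal sketch. It assumes the keyword is meant as an alternation, (?:doi|DOI), rather than the character class [doi|DOI], which would match a single character. The helper name find_doi and the sample text are made up for illustration; they are not from a real paper.

```python
import re

# Assumption: "doi" or "DOI" must appear literally before the identifier.
# A character class like [doi|DOI] would only match one of those characters.
regexDOI = re.compile(r'(?:doi|DOI)[\s.:]{0,2}(10\.\d{4}[\d:.\-/a-z]+)[A-Z\s]')

def find_doi(text):
    """Return the first DOI found in text, or None if there is no match."""
    match = regexDOI.search(text)
    return match.group(1) if match else None

# Made-up sample text standing in for the text extracted from a PDF.
sample = "Journal of Examples 12(3) doi: 10.1234/abc.2010.56 Received 1 May"
print(find_doi(sample))
```

Note that the pattern needs a space or capital letter after the DOI, so a DOI sitting at the very end of the text with nothing after it would not match.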

I’m open to suggestions to improve the regex and to test it against more papers. Feel free :)

Mass renaming of files

I don’t know if it is just my camera, but every time I transfer my pictures to the laptop, I have a boring time whenever I have to upload them somewhere or use them in a program that requires a “jpg” or “jpeg” extension. Why? Because my camera saves the pictures with a “JPG”/“JPEG” extension, which on Linux, where filenames are case-sensitive, is a pain in the rear. Yes, I know, Linux sucks. But luckily it also has its own “de-sucking” tools, which come in handy in this particular situation and on several other occasions, some that even Windows users might complain about :P

Meet rename. It is a handy command-line tool with a simple syntax: rename <regexp> <files>

It comes with 3 options: the ubiquitous -v (verbose), -f (force), and -n (no-act), which only prints what would be changed without renaming anything.

The tough part might be the regular expression. I recommend this link for those who don’t know anything about it.

And, to show some examples of how it works:

joao@wasp:~/Desktop$ ls | grep docx
Events.docx
Keep Management Guide.docx
Neovir Campaign Setting.docx
joao@wasp:~/Desktop$ rename -n 's/docx/doc/' *.docx
Events.docx renamed as Events.doc
Keep Management Guide.docx renamed as Keep Management Guide.doc
Neovir Campaign Setting.docx renamed as Neovir Campaign Setting.doc
joao@wasp:~/Desktop$ 

The regular expression ‘s/docx/doc/’ stands for ‘substitute/this/forthis/’. You can use wildcards (*) in the file list, as in *.docx, to match several variants of the same name, and remember that if you wish to substitute a /, you have to escape it with a backslash (\/).

I was using a for loop before I found this :x Quite a change!
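For anyone without rename at hand, the same idea can be sketched in Python. This is only a rough illustration, not how rename itself is implemented: the function name rename_files, its dry_run flag (standing in for -n), and the camera file names are all made up for the example.

```python
import re
from pathlib import Path

def rename_files(pattern, replacement, paths, dry_run=True):
    """Apply a regex substitution to each file name, like 's/pattern/replacement/'.

    With dry_run=True nothing is renamed, similar in spirit to rename -n.
    Returns the (old_name, new_name) pairs that would change or did change.
    """
    changes = []
    for path in map(Path, paths):
        # count=1 replaces only the first match, like s/// without the g flag.
        new_name = re.sub(pattern, replacement, path.name, count=1)
        if new_name == path.name:
            continue  # pattern did not match this file name
        target = path.with_name(new_name)
        if target.exists():
            continue  # never overwrite an existing file (no -f equivalent here)
        changes.append((path.name, new_name))
        if not dry_run:
            path.rename(target)
    return changes

# Hypothetical camera file names; this is a dry run, so the files need not exist.
for old, new in rename_files(r'\.JPG$', '.jpg', ['IMG_0001.JPG', 'IMG_0002.JPG', 'notes.txt']):
    print(f'{old} renamed as {new}')
```

Anchoring the pattern with \. and $ restricts the change to the extension, so a file with “JPG” elsewhere in its name is left alone.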