Very often I need to get a list of files in a directory based on their extensions, or some part of their names. I’d been looking for a performance benchmark of a few common “techniques” but couldn’t find any so I set out to do my own.
Dataset: Folder with 50k empty files named test_X.txt
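A folder like this can be generated with a few lines of Python. The following is only a sketch: the helper name make_dataset, the 0-based numbering for X, and the temporary directory are assumptions made for illustration, not necessarily how the original folder was built.

```python
import os
import tempfile

def make_dataset(directory, count=50000):
    """Create `count` empty files named test_0.txt, test_1.txt, ... in `directory`."""
    for i in range(count):
        path = os.path.join(directory, 'test_%d.txt' % i)
        # Opening in write mode and closing right away leaves an empty file
        with open(path, 'w'):
            pass

# Small demonstration in a throwaway directory (100 files instead of 50k)
demo_dir = tempfile.mkdtemp()
make_dataset(demo_dir, count=100)
print(len(os.listdir(demo_dir)))
```

Pass count=50000 and a directory of your choice to reproduce the full-size dataset.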
Methods:
1. Slicing the end of the string. You have to know how long your extension is; otherwise you can compute it with a combination of len and str.index.
import os

def slicing():
    # The slice is 4 characters long, so the extension must include the dot
    ext = '.txt'
    file_list = [f for f in os.listdir('.') if f[-4:] == ext]
    return len(file_list)
2. Splitting the extension. If you have complicated filenames (more than one dot, for example) this might pose a problem, but it is often a good approach.
def splitting():
    ext = 'txt'
    # Keep only the part after the last dot and compare it to the extension
    file_list = [f for f in os.listdir('.') if f.split('.')[-1] == ext]
    return len(file_list)
3. Using the string method endswith. Very pythonic and easy to read.
def endswith():
    ext = 'txt'
    file_list = [f for f in os.listdir('.') if f.endswith(ext)]
    return len(file_list)
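The three techniques do not always agree on unusual names, so it is worth checking them side by side. The sketch below uses made-up file names and a hypothetical helper called matches. It also includes os.path.splitext from the standard library as a fourth check for comparison, which is not one of the benchmarked methods.

```python
import os

# Hypothetical file names chosen to exercise edge cases
names = ['test_1.txt', 'footxt', 'archive.tar.gz', 'notes.TXT']

def matches(name, ext='txt'):
    """Return the verdict of each technique for a single file name."""
    return {
        'slicing': name[-len(ext):] == ext,
        'splitting': name.split('.')[-1] == ext,
        'endswith': name.endswith(ext),
        # os.path.splitext keeps the leading dot on the extension
        'splitext': os.path.splitext(name)[1] == '.' + ext,
    }

for name in names:
    print(name, matches(name))
```

Without the leading dot in ext, slicing and endswith both accept a name like footxt, while splitting and splitext reject it. All four comparisons are case-sensitive, so notes.TXT is not matched by any of them.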
I used the timeit module to benchmark the performance of the three functions, executing each 100 times.
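The timing harness itself is not shown, so here is a minimal sketch of what such a setup could look like. The benchmark helper is a hypothetical name, and only one of the three functions is repeated to keep the example self-contained. timeit.timeit accepts a callable and returns the total time in seconds for the given number of calls.

```python
import os
import timeit

def endswith():
    ext = 'txt'
    file_list = [f for f in os.listdir('.') if f.endswith(ext)]
    return len(file_list)

def benchmark(func, runs=100):
    """Return the total time in seconds taken by `runs` calls of `func`."""
    return timeit.timeit(func, number=runs)

print('Starting test..')
print('Endswith:', benchmark(endswith))
```

Each call includes the cost of os.listdir, so the totals measure the directory listing plus the string test, not the string test alone.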
Results (reported in seconds):
Starting test..
Slicing: 2.98469209671
Splitting: 5.28239703178
Endswith: 4.13192510605
Surprisingly, slicing is almost twice as fast as splitting (about 1.8x) and roughly 1.4x as fast as endswith. I’ll stick to slicing from now on, although I doubt it will have much impact on the performance of my small scripts :) 50k files is a lot!
Does anyone have a good explanation for this?