Python: Slicing vs Splitting vs Endswith

Very often I need to list the files in a directory based on their extension, or on some part of their name. I’d been looking for a performance benchmark of a few common “techniques” but couldn’t find one, so I set out to do my own.

Dataset: Folder with 50k empty files named test_X.txt

Methods:

1. Slicing the end of the string. You have to know how long your extension is; if you don’t, you can work it out with a combination of len and str.rindex (str.index would stop at the first dot rather than the last).

import os

def slicing():
    ext = '.txt'  # four characters, dot included, to match the [-4:] slice
    file_list = [f for f in os.listdir('.') if f[-4:] == ext]
    return len(file_list)
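
When the extension length isn't known in advance, the slice can be derived from the extension itself or from the position of the last dot. A minimal sketch follows. The helper names has_ext and ext_of are hypothetical and not part of the benchmark, and it uses str.rfind, a close relative of str.index that returns -1 instead of raising when there is no dot:

```python
def has_ext(name, ext):
    # Slice off exactly as many trailing characters as the extension has.
    return name[-len(ext):] == ext

def ext_of(name):
    # Everything from the last dot onwards, or '' when there is no dot.
    dot = name.rfind('.')
    return name[dot:] if dot != -1 else ''

print(has_ext('test_1.txt', '.txt'))
print(ext_of('archive.tar.gz'))
```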

2. Splitting the extension. This can be a problem with complicated filenames (several dots, or none at all), but it is often a good approach.

def splitting():
    ext = 'txt'
    file_list = [f for f in os.listdir('.') if f.split('.')[-1] == ext]
    return len(file_list)
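
To see where “complicated filenames” can trip up the split approach, here is a small sketch with made-up names. It also shows str.rsplit with a maxsplit of 1, which yields the same last piece while splitting only once; I have not measured whether that changes the timing.

```python
names = ['test_1.txt', 'archive.tar.gz', 'README', '.bashrc']

# split('.') returns every dot-separated piece; [-1] picks the last one.
last_parts = {name: name.split('.')[-1] for name in names}

# rsplit('.', 1) splits at most once, starting from the right.
last_parts_rsplit = {name: name.rsplit('.', 1)[-1] for name in names}

for name in names:
    print(name, '->', last_parts[name])
```

A name without any dot comes back whole, so a file literally called txt would pass the `== 'txt'` test.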

3. Using the string method endswith. Very pythonic and easy to read.

def endswith():
    ext = 'txt'
    file_list = [f for f in os.listdir('.') if f.endswith(ext)]
    return len(file_list)
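
One thing to watch with endswith: without the leading dot, the test also matches names that merely end in the same letters. A short sketch with made-up filenames:

```python
names = ['test_1.txt', 'notes_txt', 'plaintxt']

# Without the dot, any name ending in the letters 'txt' matches.
loose = [n for n in names if n.endswith('txt')]

# With the dot, only names carrying a real '.txt' extension match.
strict = [n for n in names if n.endswith('.txt')]

print(loose)
print(strict)
```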

I used the timeit module to benchmark the performance of the three functions, executing each 100 times.
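
The benchmark script itself isn't shown here, so the following is only a sketch of how such a run might look, not the code that produced the numbers below. It assumes Python 3 (for tempfile.TemporaryDirectory), uses a temporary folder with far fewer files, and passes the folder in as a parameter. The name run_benchmark and its arguments are my own.

```python
import os
import tempfile
import timeit

def slicing(folder, ext='.txt'):
    return len([f for f in os.listdir(folder) if f[-4:] == ext])

def splitting(folder, ext='txt'):
    return len([f for f in os.listdir(folder) if f.split('.')[-1] == ext])

def endswith(folder, ext='.txt'):
    return len([f for f in os.listdir(folder) if f.endswith(ext)])

def run_benchmark(n_files=1000, repeats=10):
    """Time each function `repeats` times on a folder of empty test_X.txt files."""
    results = {}
    with tempfile.TemporaryDirectory() as folder:
        # Create the empty files the functions will list.
        for i in range(n_files):
            open(os.path.join(folder, 'test_%d.txt' % i), 'w').close()
        for func in (slicing, splitting, endswith):
            # Sanity check: every method should find all the files.
            assert func(folder) == n_files
            results[func.__name__] = timeit.timeit(lambda: func(folder), number=repeats)
    return results

if __name__ == '__main__':
    print('Starting test..')
    for name, seconds in run_benchmark().items():
        print(' %s: %s' % (name.capitalize(), seconds))
```

Note that each timed call includes os.listdir, just as in the three functions above, so the string test accounts for only part of each total.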

Results (reported in seconds):

Starting test..
 Slicing: 2.98469209671
 Splitting: 5.28239703178
 Endswith: 4.13192510605

Surprisingly, slicing is about 1.8 times as fast as splitting, and takes roughly 28% less time than endswith. I’ll stick to slicing from now on, although I doubt it will make much difference to the performance of my small scripts :) 50k files is a lot!

Does anyone have a good explanation for this?

6 thoughts on “Python: Slicing vs Splitting vs Endswith”

  1. I think you misunderstood me, endswith() allows for this:

    ext = ('.txt', '.exe', '.rar')
    file_list = [f for f in os.listdir('.') if f.endswith(ext)]
    
  2. > endswith() allows for more than one extension being passed and tested

    That’s one of the nice things about splitext. With it you can do something like this:

    exts = ['.exe', '.jpeg', '.ps']
    if splitext(filename)[1] in exts:
      pass  # do stuff
    

    Where you would have to test each one individually with endswith.

  3. Hey Nick, thanks for the opinion!

    Indeed, split() has to build a list and so ought to be slower, but I can find no good reason for endswith() being slow. In any case, slicing is not the most flexible option unless you have a pretty “stable” set of data, whereas endswith() allows more than one extension to be passed and tested. This “test” was more a product of curiosity than of actual productivity.

    Thanks for the splitext tip, I knew about it but frankly never use it :)

  4. str.split has to allocate a list, which I assume is what’s making it slower (it appears to over-allocate too). I’m guessing, from the docs, that slicing is just doing some pointer manipulations/comparisons, relying on the immutability of strings. I have no idea why endswith doesn’t perform better, possibly it’s doing some extra error checking.

    You might also be interested in os.path.splitext. It probably won’t be quicker, but it may be more robust.

    import os
    from os.path import splitext
    def using_splitext():
      ext = '.txt'
      file_list = [f for f in os.listdir('.') if splitext(f)[1] == ext]
      return len(file_list)
    
