Python: Slicing vs Splitting vs Endswith

Very often I need to get a list of files in a directory based on their extensions, or some part of their names. I’d been looking for a performance benchmark of a few common “techniques” but couldn’t find any so I set out to do my own.

Dataset: Folder with 50k empty files named test_X.txt
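
A folder like that could be generated with something along these lines. This is only a sketch: make_dataset and its arguments are hypothetical names, and it assumes the X in test_X.txt is a running number.

```python
import os

def make_dataset(directory, count=50000):
    """Create `count` empty files named test_0.txt, test_1.txt, ... in `directory`."""
    os.makedirs(directory, exist_ok=True)
    for i in range(count):
        # Opening in write mode and closing straight away leaves an empty file.
        open(os.path.join(directory, 'test_%d.txt' % i), 'w').close()
    return directory
```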

Methods:

1. Slicing the end of the string. You have to know how long your extension is; if you don't, you can work it out with a combination of len and str.index.

import os

def slicing():
    ext = 'txt'
    # 'txt' is three characters long, so compare the last three characters.
    file_list = [f for f in os.listdir('.') if f[-3:] == ext]
    return len(file_list)

2. Splitting off the extension. If you have complicated filenames this might pose a problem, but it is often a good approach.

def splitting():
    ext = 'txt'
    file_list = [f for f in os.listdir('.') if f.split('.')[-1] == ext]
    return len(file_list)

3. Using the string method endswith. Very pythonic and easy to read.

def endswith():
    ext = 'txt'
    file_list = [f for f in os.listdir('.') if f.endswith(ext)]
    return len(file_list)
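
The three methods do not always agree once the filenames get less regular than test_X.txt. Here is a small sketch of the edge cases; the sample names are made up, and os.path.splitext is not one of the benchmarked methods, it is only shown for comparison.

```python
import os

names = ['test_1.txt', 'archive.tar.gz', 'README', '.bashrc', 'mytxt']

# split('.')[-1] returns the whole name when there is no dot in it.
by_split = [n.split('.')[-1] for n in names]
print(by_split)      # ['txt', 'gz', 'README', 'bashrc', 'mytxt']

# os.path.splitext keeps the dot and returns '' when there is no extension.
by_splitext = [os.path.splitext(n)[1] for n in names]
print(by_splitext)   # ['.txt', '.gz', '', '', '']

# Without the dot, slicing and endswith also match names that just end in 'txt'.
loose = [n for n in names if n.endswith('txt')]
print(loose)         # ['test_1.txt', 'mytxt']

# endswith accepts a tuple, so several extensions can be tested in one call.
several = [n for n in names if n.endswith(('.txt', '.gz'))]
print(several)       # ['test_1.txt', 'archive.tar.gz']
```

Including the dot in the extension ('.txt' with f[-4:] or endswith) avoids the loose matches.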

I used the timeit module to benchmark the performance of the three functions, executing each 100 times.
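
A harness along these lines would do the job. It is a sketch rather than the exact script: run_benchmark is a hypothetical name, and the three functions are compact stand-ins for the ones above so the example runs on its own.

```python
import os
import timeit

EXT = 'txt'

def slicing():
    return len([f for f in os.listdir('.') if f[-3:] == EXT])

def splitting():
    return len([f for f in os.listdir('.') if f.split('.')[-1] == EXT])

def endswith():
    return len([f for f in os.listdir('.') if f.endswith(EXT)])

def run_benchmark(repeats=100):
    """Return {label: total seconds for `repeats` calls} for the three functions."""
    results = {}
    for label, func in (('Slicing', slicing),
                        ('Splitting', splitting),
                        ('Endswith', endswith)):
        # timeit.timeit accepts a callable and returns the total time in seconds.
        results[label] = timeit.timeit(func, number=repeats)
    return results

if __name__ == '__main__':
    print('Starting test..')
    for label, seconds in run_benchmark().items():
        print(' %s: %s' % (label, seconds))
```

Each call includes the os.listdir itself, so the directory listing is part of every measurement.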

Results (reported in seconds):

Starting test..
 Slicing: 2.98469209671
 Splitting: 5.28239703178
 Endswith: 4.13192510605

Surprisingly, slicing is almost 1.8 times as fast as splitting, and endswith takes about 38% longer than slicing. I’ll stick to slicing from now on, although I doubt it will make much difference to the performance of my small scripts :) 50k files is a lot!
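
The ratios can be checked directly from the timings reported above; the only inputs here are those three numbers.

```python
# Total seconds for 100 runs, copied from the results above.
timings = {
    'Slicing': 2.98469209671,
    'Splitting': 5.28239703178,
    'Endswith': 4.13192510605,
}

fastest = min(timings.values())

# How many times longer each method took compared with the fastest one.
ratios = {name: seconds / fastest for name, seconds in timings.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print('%-9s %.2fx' % (name, ratio))
```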

Does anyone have a good explanation for this?

Compressing Files with Python: Symlink Trouble!

This is a follow-up to this previous post. I was trying to compress a directory that had symbolic links in it, using Python’s zipfile library. I was misguided into setting the os.walk argument followlinks to True, which in fact gave me a duplicate of my file instead of a link. The following code is based on what A.Murat Eren wrote:

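import os
import zipfile

# Unix mode for a symlink with rwxr-xr-x permissions, stored in the
# upper 16 bits of external_attr (0xA1ED0000).
SYMLINK_ATTR = 2716663808

Z = zipfile.ZipFile('myzip.zip', 'w')
for r, d, f in os.walk('mydir'):
    for dd in d:
        if os.path.islink(os.path.join(r, dd)):
            # Store a directory symlink as a link entry instead of following it.
            a = zipfile.ZipInfo()
            a.filename = os.path.join(r, dd)
            a.create_system = 3  # 3 = Unix
            a.external_attr = SYMLINK_ATTR
            Z.writestr(a, os.path.join(r, dd))
        else:
            # A regular directory: add its entry; os.walk visits its files later.
            Z.write(os.path.join(r, dd), os.path.join(r, dd))

    for ff in f:
        if os.path.islink(os.path.join(r, ff)):
            a = zipfile.ZipInfo()
            # Keep the link under its own directory inside the archive.
            a.filename = os.path.join(r, ff)
            a.create_system = 3
            a.external_attr = SYMLINK_ATTR
            Z.writestr(a, os.path.join(r, ff))
        else:
            Z.write(os.path.join(r, ff), os.path.join(r, ff))

Z.close()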

In my case I had a directory symlink, but this should be straightforward enough to implement for files. Another issue is with the line:

Z.writestr(a, os.path.join(r, dd))

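The second argument is the data stored for the entry, and for a symlink entry that data is what gets read back as the link's target. Therefore, no matter which name you give to a.filename, this is what decides where your symbolic link points! I had a couple of troubles with lost symlinks because of this.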
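
One way to avoid lost links is to store the real target of the link, read with os.readlink, as the entry's data. The following is a sketch under that assumption, not the code from the post above: add_symlink is a hypothetical helper, and it relies on the convention that extraction tools which restore symlinks treat the entry data as the link target.

```python
import os
import stat
import zipfile

def add_symlink(zf, link_path, arcname=None):
    """Store `link_path` in the open ZipFile `zf` as a symlink entry."""
    info = zipfile.ZipInfo(arcname or link_path)
    info.create_system = 3  # 3 = Unix, so the mode bits below are meaningful
    # Upper 16 bits of external_attr hold the Unix mode: symlink type + rwxr-xr-x.
    # This works out to the same 2716663808 constant used above.
    info.external_attr = (stat.S_IFLNK | 0o755) << 16
    # The entry's data is the path the link points to.
    zf.writestr(info, os.readlink(link_path))
```

Here the entry name (arcname) says where the link lives inside the archive, while os.readlink supplies where it points.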

Compress a directory tree in Python with zipfile

Trying to compress a folder with the Python module zipfile results in an IOError exception being thrown. To overcome this, simply combine os.walk with the arcname argument of ZipFile.write:


import os
import zipfile

Z = zipfile.ZipFile('teste.zip', 'w')
for r, d, f in os.walk('teste'):
    for ff in f:
        # Add each file, using its path under 'teste' as the name in the archive.
        Z.write(os.path.join(r, ff), os.path.join(r, ff))
Z.close()
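
The same idea can be wrapped in a function. This is a sketch with a hypothetical name, zip_tree, and two choices that are not in the snippet above: a with block so the archive is closed even if something fails, and zipfile.ZIP_DEFLATED so the files are actually compressed (the default mode only stores them).

```python
import os
import zipfile

def zip_tree(src_dir, zip_path):
    """Write every file under `src_dir` into `zip_path`; return the archive names."""
    parent = os.path.dirname(os.path.abspath(src_dir))
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Names relative to the parent keep src_dir as the top-level folder.
                zf.write(full, os.path.relpath(full, parent))
        return zf.namelist()
```

Calling zip_tree('teste', 'teste.zip') should give the same layout as the loop above. Keep the zip file outside the folder being walked, otherwise the archive may end up trying to include itself.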