Faster file IO in python using cython

Reading large files in Python sometimes feels incredibly slow. Here are some approaches that use Cython to reduce reading times. Simply compiling the existing Python code with Cython reduces the reading time by 23%. By introducing explicit type definitions, I could finally reach C++ reading speeds, which are 4.4x faster than the pure Python code. However, when I used the generator keyword yield to iterate over all lines from an external Python function without exhausting my memory, the runtime of this approach doubled. The code snippets used are listed below.

[Figure: timings]

File: file_io_python.py Simple Python function that reads a file line by line.

def read_file_python(filename):
    f = open(filename, "rb")
    while True:
        line = f.readline()
        if not line: break
 
        #yield line
 
    f.close()
    return []
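
The commented-out yield above hints at the generator variant mentioned in the introduction. A minimal sketch of that variant, assuming the name iter_lines (hypothetical, not part of the benchmarked files), could look like this:

```python
def iter_lines(filename):
    """Yield the lines of a file one at a time, as bytes."""
    # the with-block closes the file even if the caller stops iterating early
    with open(filename, "rb") as f:
        for line in f:
            yield line
```

Because the function yields one line at a time, a caller can loop over a file of any size while holding only the current line in memory.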

File: file_io.pyx Cython file containing a pure Python function and a Cython-optimized function for line-wise file reading.

from libc.stdio cimport *
 
cdef extern from "stdio.h":
    #FILE * fopen ( const char * filename, const char * mode )
    FILE *fopen(const char *, const char *)
    #int fclose ( FILE * stream )
    int fclose(FILE *)
    #ssize_t getline(char **lineptr, size_t *n, FILE *stream);
    ssize_t getline(char **, size_t *, FILE *)
 
def read_file_slow(filename):
    f = open(filename, "rb")
    while True:
        line = f.readline()
        if not line: break
 
        #yield line
 
    f.close()
 
    return []
 
def read_file(filename):
    filename_byte_string = filename.encode("UTF-8")
    cdef char* fname = filename_byte_string
 
    cdef FILE* cfile
    cfile = fopen(fname, "rb")
    if cfile == NULL:
        raise FileNotFoundError(2, "No such file or directory: '%s'" % filename)
 
    cdef char * line = NULL
    cdef size_t l = 0
    cdef ssize_t read
 
    while True:
        read = getline(&line, &l, cfile)
        if read == -1: break
 
        #yield line
 
    fclose(cfile)
 
    return []

File: file_io.cpp Comparison code for C++.

#include "stdio.h"
#include <stdlib.h>
 
int main()
{
    FILE* cfile = fopen("trajectory.pdb", "rb");
 
    if(cfile == NULL) return 1;
 
    char * line = NULL;
    size_t l = 0;
    ssize_t read;
 
    while(true)
    {
        read = getline(&line, &l, cfile);
        if(read == -1) break;
    }
    free(line);
 
    fclose(cfile);
 
    return 0;
}

File: file_io_bench.py Python code to test and benchmark all different functions.

import timeit
 
#################
count = 10
check = False
#################
 
if check:
    from file_io import read_file, read_file_slow
 
    import hashlib
    m = hashlib.new("md5")
    for line in read_file_slow("trajectory.pdb"):
        m.update(line)
 
    h1 = m.hexdigest()
 
    m = hashlib.new("md5")
    for line in read_file("trajectory.pdb"):
        m.update(line)
 
    h2 = m.hexdigest()
 
    assert h1 == h2, "read error"
    print("read functions: ok")
 
t = timeit.Timer("""for line in read_file_python("trajectory.pdb"):
  pass""", """from file_io_python import read_file_python""")
t1 = t.timeit(count)
print("Python", t1, "sec")
 
t = timeit.Timer("""for line in read_file_slow("trajectory.pdb"):
  pass""", """from file_io import read_file_slow""")
t2 = t.timeit(count)
print("Cython", t2, "sec")
 
t = timeit.Timer("""for line in read_file("trajectory.pdb"):
  pass""", """from file_io import read_file""")
t3 = t.timeit(count)
print("cdef Cython", t3, "sec")
 
t = timeit.Timer("""s = subprocess.Popen("./a.out", shell=True)
s.wait()
""", """import subprocess""")
t4 = t.timeit(count)
print("C", t4, "sec")
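
The check branch above repeats the same hashing loop for both readers. A small helper, here called digest_lines (a hypothetical name, not from the original script), is one way to sketch that comparison. It assumes each reader yields its lines as bytes:

```python
import hashlib

def digest_lines(lines):
    """Return the MD5 hex digest of all lines, fed to the hash in order."""
    m = hashlib.md5()
    for line in lines:
        m.update(line)
    return m.hexdigest()

def same_content(lines_a, lines_b):
    """True if both iterables of bytes produce the same digest."""
    return digest_lines(lines_a) == digest_lines(lines_b)
```

Note that such a check only becomes meaningful once the yield lines in the read functions are enabled. As listed, both functions return an empty list, so both digests are the digest of empty input.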


Improved python gzip reading speed

Dealing with large files of protein trajectories, I realized that some of my Python scripts are incredibly slow compared with C++ code. I noticed that unzipping a trajectory before reading it is faster than using the gzip module to read directly from the gzipped file ^^.

I benchmarked the reading speed of five different approaches on the following two files, which hold the same trajectory in uncompressed and gzipped form:

-rw-r--r-- 1 doep doep 2.4G Feb 15 16:05 traj.pdb
-rw-r--r-- 1 doep doep 609M Feb 15 15:59 traj.pdb.gz

Each runtime was measured twice using the real-time of the ‘time’ command. Each approach reads in every single line via:

while True:
    line = f.readline()
    if not line: break

The five methods are:

  1. Reading from uncompressed file via: open()
  2. Reading from uncompressed file using the io module: io.open()
  3. Reading from compressed file using the gzip module: gzip.open()
  4. Reading from compressed file using a small class based on the zlib module: zlib_file()
  5. Reading from compressed file using named pipes: os.mkfifo()

Results:

[Figure: benchmark results for the five methods]

Conclusion:
Because storing and reading uncompressed files is not an option, named pipes via os.mkfifo() are the fastest solution for simply reading in files. However, this approach also uses a second CPU of the system, so the real time is smaller than the user time (90 +- 4.5). If you need seeks etc., you should extend the zlib_file class to your needs and gain a speedup factor of ~2. It is sad to see the performance of the gzip.open() approach, as ‘zcat traj.pdb.gz > /dev/null’ took only 21.165 seconds.
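
The gap between real time and user time can also be observed from inside Python. The following is a rough sketch, assuming Python 3.3+ on a Unix-like system; the helper name measure is made up for illustration. It reports wall-clock time next to the CPU time of the process and of child processes that have already been waited for (such as the gzip process in the named-pipe approach):

```python
import os
import time

def measure(func):
    """Run func() and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    func()
    cpu_end = os.times()
    wall = time.perf_counter() - wall_start
    # fields 0-3: user and system time of this process and of its reaped children
    cpu = sum(cpu_end[i] - cpu_start[i] for i in range(4))
    return wall, cpu
```

If cpu comes out larger than wall, the work was spread over more than one CPU.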

For uncompressed reads, open() is the faster approach here, but on a different machine things were different: there, io.open() was 20x faster than open(). So you should check the open() speed on your machine before using it.
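
One way to run such a check is sketched below. The function names are placeholders, and path is whatever file you want to test with. Note that on Python 3 the built-in open() is the same function as io.open(), so a difference can only show up on Python 2.

```python
import io
import timeit

def read_lines(opener, path):
    """Read a file line by line with the given open function."""
    f = opener(path)
    try:
        while True:
            line = f.readline()
            if not line:
                break
    finally:
        f.close()

def compare_openers(path, number=3):
    """Return the total runtime in seconds of `number` reads per opener."""
    results = {}
    for name, opener in (("open", open), ("io.open", io.open)):
        results[name] = timeit.timeit(lambda: read_lines(opener, path),
                                      number=number)
    return results
```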

Complete code:

"""This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Lesser General Public License for more details.
 
You should have received a copy of the GNU Lesser General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>."""
 
from __future__ import print_function
 
import io
import zlib
import sys
 
class zlib_file():
    def __init__(self, buffer_size=1024*1024*8):
        self.dobj = zlib.decompressobj(16+zlib.MAX_WBITS) #16+zlib.MAX_WBITS -> zlib can decompress gzip
        self.decomp = []
        self.lines = []
        self.buffer_size = buffer_size
 
    def open(self, filename):
        self.fhwnd = io.open(filename, "rb")
        self.eof = False
 
    def close(self):
        self.fhwnd.close()
        self.dobj.flush()
        self.decomp = []
 
    def decompress(self):
        raw = self.fhwnd.read(self.buffer_size)
        if not raw:
            self.eof = True
            self.decomp.insert(0, self.dobj.flush())
 
        else:
            self.decomp.insert(0, self.dobj.decompress(raw))
 
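    def readline(self):
        # Return the next line as bytes (including its trailing newline),
        # or b"" at end of file. The b"..." literals keep this working on
        # Python 3, where zlib returns bytes.
        out_str = []  # pieces of a line that spans several decompressed chunks

        while True:
            if len(self.lines) > 0:
                return self.lines.pop() + b"\n"

            elif len(self.decomp) > 0:
                out = self.decomp.pop()
                arr = out.split(b"\n")

                if len(arr) == 1:
                    # no newline in this chunk: keep collecting
                    out_str.append(arr[0])

                else:
                    # the last piece is an incomplete line: push it back
                    self.decomp.append(arr.pop())
                    # reversed so that pop() yields the lines in file order
                    arr.reverse()
                    out_str.append(arr.pop())
                    self.lines.extend(arr)

                    out_str.append(b"\n")
                    return b"".join(out_str)

            else:
                if self.eof: break
                self.decompress()

        # final line without a trailing newline (b"" if nothing is left)
        return b"".join(out_str)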
 
    def readlines(self):
        lines = []
        while True:
            line = self.readline()
            if not line: break
 
            lines.append(line)
 
        return lines
 
if __name__ == "__main__":
    mode = int(sys.argv[1])
 
    if mode == 1:
        f = open("traj.pdb")
 
        while True:
            line = f.readline()
            if not line: break
 
        f.close()
 
    elif mode == 2:
        f = io.open("traj.pdb")
 
        while True:
            line = f.readline()
            if not line: break
 
        f.close()
 
    elif mode == 3:
        import gzip
        gz = gzip.open(filename="traj.pdb.gz", mode="r")
 
        while True:
            line = gz.readline()
            if not line: break
 
        gz.close()
 
    elif mode == 4:
        f = zlib_file()
        f.open("traj.pdb.gz")
 
        while True:
            line = f.readline()
            if not line: break
 
        f.close()
 
    elif mode == 5:
        import os
        import subprocess
 
        tmp_fifo = "tmp_fifo"
 
        os.mkfifo(tmp_fifo)
 
        p = subprocess.Popen("gzip --stdout -d traj.pdb.gz > %s" % tmp_fifo, shell=True)
        f = io.open(tmp_fifo, "r")
 
        while True:
            line = f.readline()
            if not line: break
 
        f.close()
        p.wait()
 
        os.remove(tmp_fifo)
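
For plain line-by-line reading, the idea behind zlib_file can also be written as a generator. The following is a compact sketch of that idea, not the class benchmarked above; the name iter_gzip_lines is hypothetical, and like zlib_file it assumes a gzip file with a single compressed member:

```python
import zlib

def iter_gzip_lines(filename, buffer_size=8 * 1024 * 1024):
    """Yield the lines of a gzip file as bytes, decompressing chunk by chunk."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    dobj = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""  # incomplete last line of the previous chunk
    with open(filename, "rb") as fh:
        while True:
            raw = fh.read(buffer_size)
            # at end of file, fetch whatever zlib still holds back
            data = dobj.decompress(raw) if raw else dobj.flush()
            lines = (pending + data).split(b"\n")
            pending = lines.pop()  # b"" if the data ended on a newline
            for line in lines:
                yield line + b"\n"
            if not raw:
                break
    if pending:
        # last line of the file without a trailing newline
        yield pending
```

It can be used like `for line in iter_gzip_lines("traj.pdb.gz"): ...`. Keeping only one pending fragment avoids the list bookkeeping of the class, at the cost of dropping readlines() and any room for seek support.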
