Improved Python gzip reading speed

While working with large protein trajectory files, I realized that some of my Python scripts are incredibly slow compared with C++ code. In particular, I noticed that unzipping a trajectory before reading it is faster than reading directly from the gzipped file with the gzip module.

I benchmarked the reading speed of five different approaches on the following two files, which hold the same trajectory, once uncompressed and once gzipped:

-rw-r--r-- 1 doep doep 2.4G Feb 15 16:05 traj.pdb
-rw-r--r-- 1 doep doep 609M Feb 15 15:59 traj.pdb.gz

Each runtime was measured twice, using the real time reported by the 'time' command. Each approach reads every single line via:

while True:
    line = f.readline()
    if not line: break
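
As a rough complement to the 'time' command, the same loop can be timed from inside Python. This is only a sketch assuming Python 3; the helper name time_readline_loop is made up for illustration, and it measures just the loop, not interpreter startup, so its numbers are not directly comparable to the real time reported by 'time'.

```python
import io
import time


def time_readline_loop(fileobj):
    """Read fileobj line by line; return (line_count, elapsed_seconds)."""
    start = time.perf_counter()
    count = 0
    while True:
        line = fileobj.readline()
        if not line:
            break
        count += 1
    return count, time.perf_counter() - start


# Small in-memory demo; a real run would pass an opened trajectory file.
demo = io.BytesIO(b"ATOM 1\nATOM 2\nEND\n")
n_lines, seconds = time_readline_loop(demo)
print(n_lines, "lines in", seconds, "s")
```

Any object with a readline() method works here, so the same helper could be handed the result of open(), io.open(), or gzip.open().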

The five methods are:

  1. Reading from uncompressed file via: open()
  2. Reading from uncompressed file using the io module: io.open()
  3. Reading from compressed file using the gzip module: gzip.open()
  4. Reading from compressed file using a small class based on the zlib module: zlib_file()
  5. Reading from compressed file using named pipes: os.mkfifo()
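
The idea behind method 4 can be condensed into a generator. The following is a minimal sketch assuming Python 3, not the zlib_file class itself (that is in the complete code at the end); the name iter_gzip_lines is hypothetical. It yields bytes and only handles single-member gzip files, so concatenated gzip streams would need extra work.

```python
import zlib


def _decompressed_chunks(fh, dobj, buffer_size):
    """Yield decompressed chunks, ending with whatever flush() returns."""
    while True:
        raw = fh.read(buffer_size)
        if not raw:
            yield dobj.flush()
            return
        yield dobj.decompress(raw)


def iter_gzip_lines(filename, buffer_size=8 * 1024 * 1024):
    """Yield lines (as bytes, newline included) from a gzip file."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    dobj = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""  # incomplete last line carried over between chunks
    with open(filename, "rb") as fh:
        for chunk in _decompressed_chunks(fh, dobj, buffer_size):
            pending += chunk
            lines = pending.split(b"\n")
            pending = lines.pop()  # text after the last newline (may be empty)
            for line in lines:
                yield line + b"\n"
    if pending:
        # final line without a trailing newline
        yield pending
```

Usage would look like `for line in iter_gzip_lines("traj.pdb.gz"): ...`. The buffer size of 8 MiB mirrors the default of the zlib_file class.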

Results:

[Figure: benchmark results for the five methods (image not preserved)]

Conclusion:
Because storing and reading uncompressed files is not an option, named pipes via os.mkfifo() are the fastest solution for simply reading files. Note, however, that this approach also uses a second CPU core for the external gzip process, so the real time is smaller than the user time (90 +- 4.5). If you need seeks etc., you should extend the zlib_file class to your needs and still gain a speedup of about a factor of 2. The performance of the gzip.open() approach is disappointing, given that 'zcat traj.pdb.gz > /dev/null' took only 21.165 seconds.
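
A variation on the named-pipe approach is to read the output of the external decompressor through an anonymous pipe, which avoids creating and removing a FIFO file. This was not part of the benchmark above, so treat it as an untested-for-speed sketch assuming Python 3; the helper name iter_lines_from_command is made up for illustration.

```python
import subprocess


def iter_lines_from_command(cmd):
    """Run cmd and yield its standard output line by line (as bytes)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        for line in proc.stdout:
            yield line
    finally:
        # Always close the pipe and reap the child, even on early exit.
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)
```

With the files from this post, the call would be `iter_lines_from_command(["gzip", "-dc", "traj.pdb.gz"])`, assuming the gzip command line tool is installed. As with the FIFO, decompression runs in a separate process and can use a second CPU core.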

For uncompressed reads, open() was the faster approach here, but on a different machine io.open() was 20 times faster than open(). So you should check the speed of both on your machine before choosing one.
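
Such a check could look like the following sketch, which times both functions on the same file and keeps the best of a few runs. The names time_opener and compare_openers are hypothetical. Note that on Python 3 the built-in open() is an alias of io.open(), so a large gap like the one described above should only be expected on Python 2.

```python
import io
import time


def time_opener(opener, path, repeats=3):
    """Return the best wall-clock time (seconds) to read path line by line."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        f = opener(path)
        while True:
            line = f.readline()
            if not line:
                break
        f.close()
        best = min(best, time.perf_counter() - start)
    return best


def compare_openers(path):
    """Time the built-in open() and io.open() on the same file."""
    return {
        "open": time_opener(open, path),
        "io.open": time_opener(io.open, path),
    }
```

Calling `compare_openers("traj.pdb")` returns a dictionary with one timing per function. Repeated runs mostly read from the operating system's file cache, so the first run on a cold cache may look different.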

Complete code:

"""This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
 
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Lesser General Public License for more details.
 
You should have received a copy of the GNU Lesser General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>."""
 
from __future__ import print_function
 
import io
import zlib
import sys
 
class zlib_file():
    def __init__(self, buffer_size=1024*1024*8):
        self.dobj = zlib.decompressobj(16+zlib.MAX_WBITS) #16+zlib.MAX_WBITS -> zlib can decompress gzip
        self.decomp = []
        self.lines = []
        self.buffer_size = buffer_size
 
    def open(self, filename):
        self.fhwnd = io.open(filename, "rb")
        self.eof = False
 
    def close(self):
        self.fhwnd.close()
        self.dobj.flush()
        self.decomp = []
 
    def decompress(self):
        raw = self.fhwnd.read(self.buffer_size)
        if not raw:
            self.eof = True
            self.decomp.insert(0, self.dobj.flush())
 
        else:
            self.decomp.insert(0, self.dobj.decompress(raw))
 
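    def readline(self):
        # Return the next line including its newline, or an empty string at EOF.
        # Bytes literals keep this working on Python 3, where zlib returns bytes.
        out_str = []

        while True:
            if len(self.lines) > 0:
                return self.lines.pop() + b"\n"

            elif len(self.decomp) > 0:
                out = self.decomp.pop()
                arr = out.split(b"\n")

                if len(arr) == 1:
                    # no newline in this chunk yet: keep collecting
                    out_str.append(arr[0])

                else:
                    # put the trailing partial line back so it is read next
                    self.decomp.append(arr.pop())
                    arr.reverse()
                    out_str.append(arr.pop())
                    self.lines.extend(arr)

                    out_str.append(b"\n")
                    return b"".join(out_str)

            else:
                if self.eof: break
                self.decompress()

        # last line without a trailing newline, or empty at EOF
        return b"".join(out_str)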
 
    def readlines(self):
        lines = []
        while True:
            line = self.readline()
            if not line: break
 
            lines.append(line)
 
        return lines
 
if __name__ == "__main__":
    mode = int(sys.argv[1])
 
    if mode == 1:
        f = open("traj.pdb")
 
        while True:
            line = f.readline()
            if not line: break
 
        f.close()
 
    elif mode == 2:
        f = io.open("traj.pdb")
 
        while True:
            line = f.readline()
            if not line: break
 
        f.close()
 
    elif mode == 3:
        import gzip
        gz = gzip.open(filename="traj.pdb.gz", mode="r")
 
        while True:
            line = gz.readline()
            if not line: break
 
        gz.close()
 
    elif mode == 4:
        f = zlib_file()
        f.open("traj.pdb.gz")
 
        while True:
            line = f.readline()
            if not line: break
 
        f.close()
 
    elif mode == 5:
        import os
        import subprocess
 
        tmp_fifo = "tmp_fifo"
 
        os.mkfifo(tmp_fifo)
 
        p = subprocess.Popen("gzip --stdout -d traj.pdb.gz > %s" % tmp_fifo, shell=True)
        f = io.open(tmp_fifo, "r")
 
        while True:
            line = f.readline()
            if not line: break
 
        f.close()
        p.wait()
 
        os.remove(tmp_fifo)
