Dealing with large protein trajectory files, I realized that some of my Python scripts are incredibly slow compared with C++ code. I noticed that unzipping a trajectory before reading it is faster than using the gzip module to read directly from the gzipped file ^^.
I benchmarked the reading speed of five different approaches on the following two files, which contain the same trajectory:
    -rw-r--r-- 1 doep doep 2.4G Feb 15 16:05 traj.pdb
    -rw-r--r-- 1 doep doep 609M Feb 15 15:59 traj.pdb.gz
Each runtime was measured twice, using the real time reported by the ‘time’ command. Each approach reads every single line via:
    while True:
        line = f.readline()
        if not line: break
The five methods are:
- Reading from uncompressed file via: open()
- Reading from uncompressed file using the io module: io.open()
- Reading from compressed file using the gzip module: gzip.open()
- Reading from compressed file using a small class based on the zlib module: zlib_file()
- Reading from compressed file using named pipes: os.mkfifo()
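The zlib-based approach from the list above can be sketched in a few lines. This is only a minimal illustration, not the zlib_file class from the complete code: the function name iter_gzip_lines and the tiny sample file are made up for this example, and the sketch assumes a gzip file with a single member. It feeds fixed-size chunks of the compressed file to zlib.decompressobj(16 + zlib.MAX_WBITS) and yields complete lines as bytes.

```python
import gzip
import os
import tempfile
import zlib


def iter_gzip_lines(path, buffer_size=8 * 1024 * 1024):
    """Yield lines (bytes, newline included) from a single-member gzip file.

    Hypothetical helper for illustration; not part of the original script.
    """
    # 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer.
    dobj = zlib.decompressobj(16 + zlib.MAX_WBITS)
    tail = b""  # incomplete last line carried over to the next chunk
    with open(path, "rb") as raw:
        eof = False
        while not eof:
            chunk = raw.read(buffer_size)
            if chunk:
                data = dobj.decompress(chunk)
            else:
                # End of file: collect whatever zlib still holds back.
                data = dobj.flush()
                eof = True
            lines = (tail + data).split(b"\n")
            tail = lines.pop()  # text after the last newline (may be empty)
            for line in lines:
                yield line + b"\n"
    if tail:
        # Final line without a trailing newline.
        yield tail


# Self-contained check on a tiny temporary file (a stand-in for traj.pdb.gz).
fd, sample = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(sample, "wb") as gz:
    gz.write(b"ATOM 1\nATOM 2\nEND")
result = list(iter_gzip_lines(sample, buffer_size=4))
os.remove(sample)
print(result)
```

The deliberately small buffer_size in the check forces lines to be split across chunks, which is the case the carried-over tail has to handle. For a real trajectory a buffer of several megabytes, as in the default, is the more sensible choice.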
Results:
Conclusion:
Because storing and reading uncompressed files is not an option, the named pipe created with os.mkfifo() is the fastest solution for simply reading files. Note that it also uses a second CPU, because gzip runs as a separate process, so the real time is smaller than the user time (90 ± 4.5). If you need seeks etc., you should extend the zlib_file class to your needs and gain a speedup of a factor of ~2. The performance of the gzip.open() approach is disappointing, given that ‘zcat traj.pdb.gz > /dev/null’ took only 21.165 seconds.
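The idea of letting an external gzip process do the decompression can also be tried without creating a FIFO on disk, by reading the child's standard output directly. The following is a sketch of that variant, not the os.mkfifo() method that was benchmarked here, and nothing is claimed about its speed. The names iter_lines_from_command and count_lines are hypothetical, and the demo uses the Python interpreter as a stand-in command so that it runs without the trajectory file.

```python
import subprocess
import sys


def iter_lines_from_command(cmd):
    """Yield the lines (as bytes) that the command `cmd` writes to stdout.

    `cmd` is an argument list, e.g. ["gzip", "-dc", "traj.pdb.gz"].
    Hypothetical helper for illustration; not part of the original script.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        for line in proc.stdout:
            yield line
    finally:
        # Always close the pipe and reap the child, even on early exit.
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError("%r exited with status %d" % (cmd, returncode))


def count_lines(cmd):
    """Consume all output lines of `cmd` and return how many there were."""
    return sum(1 for _ in iter_lines_from_command(cmd))


# Stand-in command so the sketch runs anywhere; for the real file one would
# pass ["gzip", "-dc", "traj.pdb.gz"] instead (assumes gzip is installed).
demo_cmd = [sys.executable, "-c", "print('ATOM'); print('END')"]
print(count_lines(demo_cmd))
```

Passing the command as an argument list avoids shell=True and the temporary FIFO file, so there is nothing to clean up if the script is interrupted.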
For uncompressed reads, open() was the faster approach on this machine, but on a different machine io.open() was 20 times faster than open(). So you should check the open() speed on your machine before using it.
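A quick way to run such a check is to time the same readline() loop with both functions on a throwaway file. The sketch below is an assumption about how one might do this in-process rather than with the ‘time’ command; the helper names time_readline_loop and make_sample_file and the sample size are made up for the example. Note that on Python 3 the built-in open() is the same function as io.open(), so a difference is only to be expected on Python 2.

```python
import io
import os
import tempfile
from timeit import default_timer


def time_readline_loop(opener, path):
    """Return (elapsed seconds, number of lines) for a readline() loop."""
    start = default_timer()
    count = 0
    f = opener(path)
    try:
        while True:
            line = f.readline()
            if not line:
                break
            count += 1
    finally:
        f.close()
    return default_timer() - start, count


def make_sample_file(n_lines):
    """Write a throwaway text file with n_lines short lines; return its path."""
    fd, path = tempfile.mkstemp(suffix=".pdb")
    with os.fdopen(fd, "w") as f:
        for i in range(n_lines):
            f.write("ATOM  %5d\n" % i)
    return path


# Hypothetical sample size; use a file close to your real data for
# meaningful numbers.
sample_path = make_sample_file(100000)
try:
    for name, opener in (("open()", open), ("io.open()", io.open)):
        seconds, n_lines = time_readline_loop(opener, sample_path)
        print("%-10s %8d lines in %.3f s" % (name, n_lines, seconds))
finally:
    os.remove(sample_path)
```

Timings from a file this small are dominated by caching effects, so the numbers are only a rough indication; repeating the run a few times gives a better picture.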
Complete code:
"""This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>."""
from __future__ import print_function
import io
import zlib
import sys

class zlib_file():
    def __init__(self, buffer_size=1024*1024*8):
        self.dobj = zlib.decompressobj(16+zlib.MAX_WBITS) #16+zlib.MAX_WBITS -> zlib can decompress gzip
        self.decomp = []   # decompressed chunks that still need splitting
        self.lines = []    # complete lines, stored in reverse order
        self.buffer_size = buffer_size

    def open(self, filename):
        self.fhwnd = io.open(filename, "rb")
        self.eof = False

    def close(self):
        self.fhwnd.close()
        self.dobj.flush()
        self.decomp = []

    def decompress(self):
        # Read the next compressed block and queue its decompressed content
        raw = self.fhwnd.read(self.buffer_size)
        if not raw:
            self.eof = True
            self.decomp.insert(0, self.dobj.flush())
        else:
            self.decomp.insert(0, self.dobj.decompress(raw))

    def readline(self):
        # Byte-string literals (b"...") keep this working on Python 3, where
        # decompress() returns bytes; on Python 2 they are ordinary strings.
        out_str = []
        while True:
            if len(self.lines) > 0:
                return self.lines.pop() + b"\n"
            elif len(self.decomp) > 0:
                out = self.decomp.pop()
                arr = out.split(b"\n")
                if len(arr) == 1:
                    # No newline in this chunk: keep collecting
                    out_str.append(arr[0])
                else:
                    # Last element is an incomplete line: put it back
                    self.decomp.append(arr.pop())
                    arr.reverse()
                    out_str.append(arr.pop())
                    self.lines.extend(arr)
                    out_str.append(b"\n")
                    return b"".join(out_str)
            else:
                if self.eof: break
                self.decompress()
        if len(out_str) > 0:
            return b"".join(out_str)

    def readlines(self):
        lines = []
        while True:
            line = self.readline()
            if not line: break
            lines.append(line)
        return lines

if __name__ == "__main__":
    # The first command-line argument selects the reading method (1-5)
    mode = int(sys.argv[1])
    if mode == 1:
        # Uncompressed file, built-in open()
        f = open("traj.pdb")
        while True:
            line = f.readline()
            if not line: break
        f.close()
    elif mode == 2:
        # Uncompressed file, io.open()
        f = io.open("traj.pdb")
        while True:
            line = f.readline()
            if not line: break
        f.close()
    elif mode == 3:
        # Compressed file, gzip module
        import gzip
        gz = gzip.open(filename="traj.pdb.gz", mode="r")
        while True:
            line = gz.readline()
            if not line: break
        gz.close()
    elif mode == 4:
        # Compressed file, zlib_file class
        f = zlib_file()
        f.open("traj.pdb.gz")
        while True:
            line = f.readline()
            if not line: break
        f.close()
    elif mode == 5:
        # Compressed file, external gzip process writing into a named pipe
        import os
        import subprocess
        tmp_fifo = "tmp_fifo"
        os.mkfifo(tmp_fifo)
        p = subprocess.Popen("gzip --stdout -d traj.pdb.gz > %s" % tmp_fifo, shell=True)
        f = io.open(tmp_fifo, "r")
        while True:
            line = f.readline()
            if not line: break
        f.close()
        p.wait()
        os.remove(tmp_fifo)