Currently input_size is the size of the raw input, which can be either compressed or uncompressed. When scaling memory based on input size you probably only care about the uncompressed size. gzip does store the uncompressed size (modulo 2^32), which we could read into a separate uncompressed_input_size variable. It is stored in the last 4 bytes of the file as a little-endian integer; this seems to work for me:
#!/usr/bin/env python3
import os
import sys

path = sys.argv[1]
with open(path, 'rb') as f:
    # The gzip trailer ends with the 4-byte uncompressed size (little-endian)
    f.seek(-4, os.SEEK_END)
    size = int.from_bytes(f.read(4), 'little')
print(size)
The uncompressed size also isn't always set properly:
nate@pdp-11% gzip -l /home/nate/work/galaxy/test-data/1.bam
         compressed        uncompressed  ratio uncompressed_name
               3592                   0   0.0% /home/nate/work/galaxy/test-data/1.bam
So we should have a fallback default: the actual size, or the actual size * some constant factor.
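A rough sketch of how the two could be combined, not a tested implementation. The function name, the fallback_factor parameter, and the magic-byte check are my own assumptions for illustration, not existing code. BAM files are BGZF, i.e. a series of concatenated gzip members that ends with an empty one, which is presumably why the last 4 bytes read 0 above. Note also that the stored size wraps around for inputs of 4 GiB or more, which this sketch does not try to detect.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip member
MIN_GZIP_SIZE = 18         # 10-byte header + 8-byte trailer


def uncompressed_input_size(path, fallback_factor=1):
    """Best-effort uncompressed size of `path` in bytes.

    fallback_factor is a hypothetical constant applied to the on-disk
    size when the gzip trailer reports 0.
    """
    actual_size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Not gzip (or too short to be): treat the input as uncompressed.
        if actual_size < MIN_GZIP_SIZE or f.read(2) != GZIP_MAGIC:
            return actual_size
        # The last 4 bytes hold the uncompressed size, little-endian.
        f.seek(-4, os.SEEK_END)
        stored_size = int.from_bytes(f.read(4), "little")
    if stored_size == 0:
        # Size not set properly (e.g. BAM): fall back to a scaled actual size.
        return int(actual_size * fallback_factor)
    return stored_size
```

With the default fallback_factor=1 this falls back to the plain actual size; a larger factor would be a guess at the typical compression ratio.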