Proposal: uncompressed input size #141

@natefoo

Currently input_size is the size of the raw input, which can be either compressed or uncompressed. When scaling memory based on input size, you probably only care about the uncompressed size. gzip stores the uncompressed size (modulo 2^32) in the last 4 bytes of the file as a little-endian integer, so we could read it into a separate uncompressed_input_size variable. This seems to work for me:

#!/usr/bin/env python3
import os
import sys

path = sys.argv[1]

with open(path, 'rb') as f:
    # The gzip trailer ends with the uncompressed size: 4 bytes, little-endian
    f.seek(-4, os.SEEK_END)
    size = int.from_bytes(f.read(4), 'little')
    print(size)

The uncompressed size also isn't always set properly:

nate@pdp-11% gzip -l /home/nate/work/galaxy/test-data/1.bam
         compressed        uncompressed  ratio uncompressed_name
               3592                   0   0.0% /home/nate/work/galaxy/test-data/1.bam

So we should have a default for when the stored size is unusable: the actual size, or the actual size * some constant factor.
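
A minimal sketch of how the lookup and the default could fit together, not a proposed final implementation. The function name `uncompressed_input_size`, the `fallback_factor` parameter, and the "treat 0 as unusable" rule are all assumptions made for illustration. The 0 reported for 1.bam is likely because BAM files are made of concatenated gzip members ending in an empty block, so the trailer describes only that last, empty member.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member
MIN_GZIP_SIZE = 18        # 10-byte header + 8-byte trailer, no payload


def uncompressed_input_size(path, fallback_factor=1.0):
    """Best-effort estimate of the uncompressed size of `path`, in bytes.

    Hypothetical helper: `fallback_factor` is an assumed tuning knob,
    not an existing setting.
    """
    actual = os.path.getsize(path)
    with open(path, "rb") as f:
        # Not gzip (or too short to hold a trailer): use the raw size as-is.
        if actual < MIN_GZIP_SIZE or f.read(2) != GZIP_MAGIC:
            return actual
        # Last 4 bytes, little-endian: uncompressed size modulo 2**32.
        f.seek(-4, os.SEEK_END)
        stored = int.from_bytes(f.read(4), "little")
    if stored == 0:
        # Unusable value (as in the 1.bam case): scale the actual size instead.
        return int(actual * fallback_factor)
    return stored
```

With `fallback_factor=1.0` the default is just the actual size; a larger factor gives the "actual size * some constant factor" variant. Because the stored value wraps at 4 GiB, it can also underestimate very large inputs, which this sketch does not try to detect.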

Labels: enhancement (New feature or request)