
  • jalyper 23 hours ago
    I built a compression system specifically for numeric-heavy tables (IoT sensors, ML features, financial data). Uses learned vector quantization per column instead of treating tables as byte streams.

    Key results on 100K-row datasets (Sempress vs Gzip compression ratio):
    - IoT Telemetry: 8.08× vs 3.58× = +125%
    - Sensor Physics: 5.88× vs 2.76× = +113%
    - ML Features: 5.46× vs 3.09× = +77%
    - Financial: 3.80× vs 2.51× = +51%

    How it works:
    - Auto-detects numeric vs categorical columns
    - Learns a K-Means codebook (k=64) per numeric column
    - Encodes each value as its nearest centroid index
    - Optional residuals for precision-critical columns
    - Packages everything with msgpack + zstd (rough sketch below)
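    For anyone who wants the pipeline in code, here's a stripped-down sketch of the idea. It's illustrative, not the actual sempress-core implementation: compress_table and its parameters are names I'm using here, and it assumes numpy, pandas, scikit-learn, msgpack, and zstandard are available.

      # Minimal sketch of the per-column codebook pipeline -- illustrative,
      # not the real sempress-core code. compress_table is a made-up name.
      import msgpack
      import numpy as np
      import pandas as pd
      import zstandard as zstd
      from sklearn.cluster import KMeans

      def compress_table(df: pd.DataFrame, k: int = 64,
                         keep_residuals: bool = False) -> bytes:
          columns = {}
          for name in df.columns:
              col = df[name]
              if pd.api.types.is_numeric_dtype(col):
                  values = col.to_numpy(dtype=np.float64).reshape(-1, 1)
                  # Learn the per-column codebook; with k=64 every
                  # centroid index fits in a single byte.
                  km = KMeans(n_clusters=min(k, len(np.unique(values))),
                              n_init=10).fit(values)
                  entry = {
                      "kind": "numeric",
                      "centroids": km.cluster_centers_.ravel().tolist(),
                      "indices": km.labels_.astype(np.uint8).tobytes(),
                  }
                  if keep_residuals:
                      # Optional residuals recover full precision for
                      # precision-critical columns.
                      recon = km.cluster_centers_[km.labels_].ravel()
                      entry["residuals"] = (
                          (values.ravel() - recon).astype(np.float32).tobytes()
                      )
                  columns[name] = entry
              else:
                  # Categorical/text columns pass through untouched;
                  # zstd still compresses the repetition well.
                  columns[name] = {"kind": "raw",
                                   "values": col.astype(str).tolist()}
          payload = msgpack.packb({"n_rows": len(df), "columns": columns})
          return zstd.ZstdCompressor(level=19).compress(payload)

    Decoding just reverses this: decompress, unpack, and map each column's indices back through its centroids (adding residuals where stored).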

    Paper: https://sempress.net/paper.pdf
    Code: https://github.com/jalyper/sempress-core (MIT license, ~500 LOC)
    Install: pip install -e .

    Best for: tables with 60%+ numeric columns and >10K rows (IoT/ML/finance)
    Still use gzip for: text-heavy tables, small files, real-time streaming
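    If you want a quick filter for whether a given table is worth trying, something like this captures that rule of thumb (the function name is mine, not part of the library):

      # Rough heuristic for the "best for" guidance above.
      import pandas as pd

      def looks_like_a_fit(df: pd.DataFrame,
                           numeric_share: float = 0.60,
                           min_rows: int = 10_000) -> bool:
          numeric = sum(pd.api.types.is_numeric_dtype(df[c]) for c in df.columns)
          return (len(df) > min_rows and
                  numeric / max(len(df.columns), 1) >= numeric_share)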

    Independent research with AI coding assistance. All algorithmic decisions and experimental design are mine. Open to feedback and collaborators!

    What would you use this for? Any datasets you'd like me to benchmark?