Why this works

The modern SAP wrapper handled by this repository is not a separate magic compressor. It is an SAP header, followed by a few padding bits, in front of what becomes a standard raw DEFLATE stream once the bit alignment is corrected.

This means the hard part is not inventing a new decompressor. The hard part is understanding the wrapper, validating the header, and shifting the payload back into alignment before handing it to a standard inflater.

Observed structure

[4 bytes original size][1 byte version/algo][3 bytes flags][2-5 padding bits][raw DEFLATE stream]

Once the wrapper and padding are removed, Go's standard compress/flate reader can handle the actual decompression.

What that means in practice

The implementation stays small because the real work is understanding the wrapper and bit alignment correctly. The decompression itself is standard library work.

The full reverse-engineering story

The starting point was a file called example.blob: high-entropy binary data extracted from an SAP data BLOB, known to contain a document, but unreadable by ordinary archive and compression tools.

The problem

The file was 118,703 bytes of unrecognizable binary. It clearly came from an SAP-backed storage flow, but after extraction as a raw database BLOB, ordinary tools could not open it. The task was simple in theory and tedious in practice: determine what the wrapper was doing and recover the original file bytes without relying on proprietary tooling.

Step 1: First look

$ file example.blob
example.blob: data

$ python3 -c "
from collections import Counter
import math
data = open('example.blob','rb').read()
freq = Counter(data)
H = -sum((c/len(data))*math.log2(c/len(data)) for c in freq.values())
print(f'Entropy: {H:.2f} bits/byte')
print(f'Unique bytes: {len(freq)} / 256')
"
Entropy: 8.00 bits/byte
Unique bytes: 256 / 256

Entropy at the theoretical ceiling and all 256 byte values present meant the file was either heavily compressed or encrypted. It was not ordinary uncompressed document data.

Step 2: Try everything obvious first

Common formats were tested and rejected before any SAP-specific conclusion was drawn. That matters. Reverse engineering gets weaker when people jump straight to the answer they want.

Format or transform                          Result
gzip, bzip2, xz, zstd, lz4                   Not recognized
ZIP, 7z, RAR, TAR                            Not recognized
Brotli, Snappy, LZO, LZMA                    Failed
PostgreSQL pglz, BSON, msgpack               No match
All 256 single-byte XOR keys                 No PDF signature
Bit rotation, nibble swap, byte reversal     Nothing useful

After that sweep, the evidence pointed away from ordinary compression containers and toward a wrapper format or SAP-specific transform.

Step 3: The hex dump clue

Offset  Hex                              ASCII
000000  3F F7 01 00 12 1F 9D 02  25 ...  ?.......%
Offset  Value                    Meaning
0-3     3F F7 01 00 = 128,831    Uncompressed size (little-endian)
4       0x12                     Version 1, algorithm 2
5-6     1F 9D                    Flags or magic-related bytes
7       0x02                     Flags
8       0x25 = '%'               First recognizable byte of %PDF

That byte-level clue changed the picture. The payload clearly began immediately after an 8-byte wrapper, but the stream was not simply "header plus plain file": there was still alignment logic involved.
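The header fields in that table can be read with a few lines of Python. This is a minimal sketch: the split of byte 4 into a high-nibble version and a low-nibble algorithm is an assumption, chosen because it is consistent with 0x12 decoding to version 1, algorithm 2.

```python
import struct

def parse_sap_header(blob: bytes):
    """Parse the 8-byte SAP wrapper header observed above (sketch)."""
    if len(blob) < 8:
        raise ValueError("blob too short for SAP wrapper header")
    # bytes 0-3: uncompressed size, little-endian
    original_size = struct.unpack_from("<I", blob, 0)[0]
    # byte 4: high nibble = version, low nibble = algorithm (assumed split)
    version = blob[4] >> 4
    algorithm = blob[4] & 0x0F
    # bytes 5-7: flags / magic-related bytes, kept opaque here
    flags = blob[5:8]
    return original_size, version, algorithm, flags

# the first nine bytes from the hex dump above
header = bytes.fromhex("3FF70100121F9D0225")
size, version, algo, flags = parse_sap_header(header)
# size == 128831, version == 1, algo == 2
```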

Step 4: The SAP connection

Once the file was treated as an SAP data BLOB rather than a generic archive, the public research trail became much stronger. SAP uses a family of compression routines across products such as NetWeaver, MaxDB, SAP GUI, SAPCAR, and storage-related workflows. The algorithm byte 0x12 decodes to version 1 and algorithm 2, which points to the LZH wrapper variant.

Step 5: The key discovery

Studying the MaxDB source lineage reveals the central fact: SAP's so-called LZH wrapper is not an exotic new compression algorithm in the payload body. Once normalized, the body behaves like standard RFC 1951 DEFLATE.

// SAP extra-length-bits table:
int CsExtraLenBits[] = {0,0,0,0,0,0,0,0,1,1,1,1,2,2,2,2,
                        3,3,3,3,4,4,4,4,5,5,5,5,0};

// RFC 1951 DEFLATE uses the same structure.

// SAP bit-length order:
unsigned char bl_order[] = {16,17,18,0,8,7,9,6,10,5,11,4,12,3,13,2,14,1,15};

// RFC 1951 DEFLATE uses the same ordering.

The dynamic and fixed block handling in the SAP code line up with DEFLATE block types, and the bit reader is standard LSB-first logic. That narrows the SAP-specific behavior down to two additions around the payload, not a full proprietary compressor.
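The table comparison can be checked mechanically. Both arrays below are transcribed: the first pair from the SAP source excerpt above, the second pair from RFC 1951 (extra bits for length codes 257-285 in section 3.2.5, bit-length code order in section 3.2.7).

```python
# SAP extra-length-bits table, transcribed from the MaxDB source excerpt above
sap_extra_len_bits = [0]*8 + [1]*4 + [2]*4 + [3]*4 + [4]*4 + [5]*4 + [0]

# RFC 1951 extra bits for length codes 257-285
rfc1951_extra_len_bits = [0,0,0,0,0,0,0,0,1,1,1,1,2,2,2,2,
                          3,3,3,3,4,4,4,4,5,5,5,5,0]

# SAP bit-length code order vs RFC 1951 section 3.2.7
sap_bl_order     = [16,17,18,0,8,7,9,6,10,5,11,4,12,3,13,2,14,1,15]
rfc1951_bl_order = [16,17,18,0,8,7,9,6,10,5,11,4,12,3,13,2,14,1,15]

assert sap_extra_len_bits == rfc1951_extra_len_bits
assert sap_bl_order == rfc1951_bl_order
```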

The two SAP-specific additions

  1. An 8-byte header carrying the original size, version/algo byte, and flags.
  2. 2-5 padding bits, historically called "nonsense bits" in the source lineage, inserted before the raw DEFLATE stream.

Those padding bits are enough to break off-the-shelf tools until the bitstream is shifted back into alignment.
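The 2-5 bit range can be sketched in code. The exact encoding is an assumption here: the first two payload bits (read LSB-first) are taken to encode how many extra padding bits follow, which yields 2 to 5 skipped bits in total, matching the range stated above.

```python
def padding_skip_bits(first_payload_byte: int) -> int:
    """Return how many leading bits to skip before the raw DEFLATE stream.

    Sketch under an assumption: the two low bits of the first payload
    byte (LSB-first order) encode the number of padding bits that follow.
    """
    pad_count = first_payload_byte & 0b11   # two length bits, LSB-first
    return 2 + pad_count                    # 2 length bits + 0-3 padding bits
```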

Why the output is generic by default

The wrapper stores compressed bytes, not necessarily a trustworthy original filename. The recovered payload may be a PDF, image, Office document, ZIP, XML, text, or arbitrary binary.

That is why the CLI and web page both use best-effort type detection from magic bytes. When the recovered payload cannot be identified confidently, the safe fallback extension is .bin.
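A best-effort sniffer of this kind is small. The signatures below are well-known magic bytes; the function name and the exact signature list are illustrative, not the repository's actual implementation.

```python
# well-known magic bytes -> extension; illustrative, not exhaustive
MAGIC_SIGNATURES = [
    (b"%PDF",              ".pdf"),
    (b"PK\x03\x04",        ".zip"),   # also Office OOXML containers
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff",      ".jpg"),
    (b"GIF8",              ".gif"),
    (b"<?xml",             ".xml"),
]

def guess_extension(payload: bytes) -> str:
    """Best-effort type detection; fall back to .bin when nothing matches."""
    for magic, ext in MAGIC_SIGNATURES:
        if payload.startswith(magic):
            return ext
    return ".bin"
```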

Research lineage

Source                   Contribution
Martin Gallo             Public SAP compression and security research, including pysap and earlier reverse-engineering work.
Daniel Berlin            SAP REPOSRC decompressor notes and practical direction for wrapper handling.
Hans-Christian Esperer   hascar Haskell implementation and format documentation.
SAP AG / MaxDB           Original open-source compression code that exposes the header layout and padding-bit behavior.
RFC 1951                 The DEFLATE specification that explains why a standard inflater works after the wrapper is normalized.

This repository deliberately acknowledges the public research and upstream code it builds on. Reverse engineering work becomes weaker when it is presented as if it emerged from nowhere.

Deep dive into the logic

At the implementation level, the logic is smaller than the explanation suggests.

  1. Read the first four bytes as a little-endian uncompressed size.
  2. Read the version/algo byte and confirm that the wrapper matches the supported SAP LZH format.
  3. Read the first two payload bits to determine how many padding bits follow, then skip them.
  4. Shift the remaining bitstream back into byte alignment.
  5. Pass the normalized body to a standard raw DEFLATE inflater.
  6. Validate that the recovered size matches the expected original size.
  7. Optionally inspect magic bytes to identify the recovered payload type.

Sketched in Python, the core transform is:

import zlib

stream = data[8:]         # skip the 8-byte SAP header
skip = 3                  # 2 length bits + 1 padding bit in one real sample
mask = (1 << skip) - 1    # low bits carried over from the next byte

# realign the LSB-first bitstream to byte boundaries
shifted = bytearray()
for i in range(len(stream) - 1):
    shifted.append((stream[i] >> skip) | ((stream[i + 1] & mask) << (8 - skip)))

# raw DEFLATE: negative window bits tell zlib there is no zlib header
result = zlib.decompress(bytes(shifted), -zlib.MAX_WBITS)
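The shift logic can be sanity-checked with a synthetic round trip: produce a raw DEFLATE stream, prepend three zero bits in the same LSB-first order, then realign and inflate. The wrapper construction here is synthetic for testing, not a real SAP blob.

```python
import zlib

def insert_leading_bits(body: bytes, skip: int) -> bytes:
    """Shift an LSB-first bitstream left by `skip` bits (inserted bits are zero)."""
    out = bytearray()
    carry = 0
    for b in body:
        out.append(((b << skip) & 0xFF) | carry)
        carry = b >> (8 - skip)
    out.append(carry)
    return bytes(out)

def remove_leading_bits(stream: bytes, skip: int) -> bytes:
    """Undo insert_leading_bits: realign the payload to byte boundaries."""
    mask = (1 << skip) - 1
    shifted = bytearray()
    for i in range(len(stream) - 1):
        shifted.append((stream[i] >> skip) | ((stream[i + 1] & mask) << (8 - skip)))
    return bytes(shifted)

payload = b"%PDF-1.4 synthetic test payload " * 50
comp = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)   # raw DEFLATE, no zlib header
deflated = comp.compress(payload) + comp.flush()

misaligned = insert_leading_bits(deflated, skip=3)   # mimic 3 leading wrapper bits
recovered = zlib.decompress(remove_leading_bits(misaligned, 3), -zlib.MAX_WBITS)
assert recovered == payload
```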

That is the core idea. The rest of the repository is engineering discipline: header validation, safer output handling, file-type identification, comments, tests, and a browser-friendly demo built on the same logic compiled to WebAssembly.

The practical conclusion is straightforward: once the wrapper is understood, the format is auditable, portable, and small enough to explain. That is exactly how such tooling should be published.