Two Sigma VO Interview Problem: Document Compressor (Huffman Coding, Priority Queue)

16 Views
No Comments

Implement 2 functions:

  • A'= encode(A) takes in a string (ASCII only) that represents the document A, and returns the binary string representation of the encoded string A'.
  • A = decode(A') takes in the output of encode()A', and returns the original string A.

For simplicity, we expect A' to be a string of 0s and 1s instead of the actual compressed bytes.

You can expect the call sequence to be encode(A), decode(A'), encode(B), decode(B').

You don’t need to worry about someone calling encode(A), encode(B), decode(A'), decode(B').

Compression algorithm

Suppose you have "aaaaabbbbcccdde". You can build a tree like below such that a = 00, b = 01, c = 11, d = 100, e = 101.

After encoding, it becomes "000000000001010101111111100100101", which is only 33 bits compared to the original 15 * 8 = 120 bits.

To construct a tree like above, you start by pairing the least frequent characters first, then build the tree upwards. There can be multiple valid trees. We aren’t requiring the absolute best compression.

Your implementation is considered good as long as it’s:

  • correct: successfully recovers to the original string A
  • effective: the compressed string is conceptually shorter than the original string: len(A') < len(A) * 8

This problem is a simplified Huffman coding task: count character frequencies, build a binary tree by repeatedly merging the two least frequent nodes, assign 0/1 along the edges, then encode the document as a bit string. Decoding walks the same tree bit by bit to recover the original text. Multiple valid trees are acceptable, so the goal is correctness and reasonable compression rather than the absolute optimal code. A priority queue is the natural data structure for building the tree, and the encoder/decoder should share the same tree state across a single encode-decode pair.

END
 0