Huffman coding and Run-length coding

Huffman coding

The basic idea of Huffman coding is to represent high-frequency characters with short codes and low-frequency characters with long codes. This reduces the expected length of the encoded string, which is exactly the goal of compression, so Huffman coding is widely used in lossless compression. Note that Huffman coding produces a variable-length encoding, not a fixed-length encoding.
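To make "average length" precise: if character $i$ appears with probability $p_i$ and is assigned a code of $l_i$ bits, the expected code length is

$$L = \sum_i p_i \, l_i$$

and among all prefix codes, Huffman coding minimizes $L$.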

The Huffman coding process consists of two main parts:

- Construct a Huffman tree from the input characters
- Traverse the Huffman tree and assign a code to each character

As mentioned above, its basic principle is to use short codes for high-frequency characters and long codes for low-frequency characters. So the first thing to do is to count how often each character appears, and then build the Huffman tree (also called the optimal binary tree) from those frequencies.

As shown in the figure below, the Huffman tree is a binary tree. The edge to a node's left child is labeled 0, the edge to its right child is labeled 1, and the value of a node is its weight: the larger the weight, the smaller the depth, and the depth is exactly the length of the code. Usually we use a character's frequency as its weight. When actually encoding, much like in a trie, the nodes themselves do not carry the codes; the path to a node is the code.

If computers used ternary instead of binary, the Huffman tree would be a ternary tree.

Example

For example, suppose we count the character frequencies of a string and get the following:

| character | frequency |
| --- | --- |
| a | 5 |
| b | 9 |
| c | 12 |
| d | 13 |
| e | 16 |
| f | 45 |

- Construct a node for each character, i.e. a tree with only one element, and build a min-heap containing all of the nodes; the algorithm uses the min-heap as its priority queue.

- Take the two nodes with the smallest weights and add a node with weight 5 + 9 = 14 as their parent, then update the min-heap. The heap now contains 5 nodes: 4 of them are still the original single-element trees, and the nodes with weights 5 and 9 have been merged into one.
- Repeat the previous step until only one node is left in the heap; that node is the root of the Huffman tree (see the runnable sketch after the table below).

The result is this:

(figure: huffman-example — the resulting Huffman tree)

| character | frequency | encoding |
| --- | --- | --- |
| a | 5 | 1100 |
| b | 9 | 1101 |
| c | 12 | 100 |
| d | 13 | 101 |
| e | 16 | 111 |
| f | 45 | 0 |
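To make the construction concrete, here is a minimal Python sketch using heapq as the min-heap priority queue; the function name huffman_codes and the tuple-based tree representation are just choices made for this example.

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman tree from {char: weight} and return {char: code}."""
    # Heap entries are (weight, tie_breaker, tree); a tree is either a
    # character (leaf) or a (left, right) pair (internal node). The unique
    # tie_breaker keeps tuple comparison from ever reaching the tree.
    heap = [(w, i, ch) for i, (ch, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)                   # two smallest weights
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))  # merged parent
        count += 1
    codes = {}
    def walk(tree, path):
        if isinstance(tree, tuple):
            walk(tree[0], path + "0")   # left edge is 0
            walk(tree[1], path + "1")   # right edge is 1
        else:
            codes[tree] = path or "0"   # lone-symbol edge case
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}))
```

With the frequencies above and ties broken by insertion order, this prints exactly the codes in the table; a different tie-breaking rule can give different but equally optimal codes.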

Run-length encoding

Run-length encoding is a relatively simple compression algorithm. Its basic idea is to describe a run of repeated consecutive characters as a pair: (number of consecutive occurrences, the character).

For example, a string:

AAAAABBBBCCC

Using run-length coding can be described as:

5A4B3C

5A means that there are 5 consecutive As at this position; likewise, 4B means 4 consecutive Bs, 3C means 3 consecutive Cs, and so on.
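As a minimal sketch, here is run-length encoding and decoding in Python, assuming each run is written as count-then-character and the input itself contains no digit characters (otherwise this textual format would be ambiguous):

```python
# Minimal run-length encode/decode sketch; assumes the input has no digits.
def rle_encode(s):
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:  # extend the current run
            j += 1
        out.append(f"{j - i}{s[i]}")        # (run length, character)
        i = j
    return "".join(out)

def rle_decode(s):
    out = []
    count = ""
    for ch in s:
        if ch.isdigit():
            count += ch                 # accumulate multi-digit run lengths
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)

print(rle_encode("AAAAABBBBCCC"))  # 5A4B3C
print(rle_decode("5A4B3C"))        # AAAAABBBBCCC
```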

In practice, though, things can get much more complicated. We can encode one character at a time or several at once, so how to split the input into subsequences is a real problem, and it is not as simple as it looks. Taking the example above, we could also treat the whole of AAAAABBBBCCC as a single subsequence, which would change the length of the resulting code. Which split to use is a trade-off between compression time and compression ratio. There are many more complicated cases, which we will not expand on here.

Run-length encoding is a good fit for files whose binary content contains long runs of repeated values. A classic example is a BMP image with large single-color blocks: because BMP is uncompressed, what you see is exactly what is stored.

This is also why compression works well when our images tend toward solid colors.

Consider a question: if we store two images on a CDN and the two images are almost identical, can we optimize that? Although this is mainly a problem for CDN vendors, it still affects us a great deal and is worth thinking about.

Summary

Run-length encoding and Huffman coding are both lossless compression algorithms: decompression loses nothing from the original data. In practice the two are often combined, for example by run-length encoding first and then Huffman encoding the result. These techniques, or close relatives of them, show up in many lossless formats, such as PNG, GIF, PDF, and ZIP.

Lossy compression, by contrast, usually removes information humans cannot perceive: colors we cannot distinguish, sound outside our hearing range, and so on. In other words, some of the original data is lost, but because humans cannot perceive that part of the information, the trade-off is often worthwhile. Coding that deletes content humans cannot perceive is called "perceptual coding"; JPEG and MP3 are examples. Lossy compression is beyond the scope of this article; you can search for more material if you are interested.

The principles behind video compression are similar, except that video compression applies some additional techniques, such as exploiting temporal redundancy: only the parts that change between frames are stored, while the unchanged parts are stored once.
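As a toy illustration of that idea (not any real codec), the sketch below stores a frame as just the pixels that changed since the previous frame; real codecs work on blocks and add motion compensation on top. Frames here are plain Python lists of pixel values.

```python
# Toy sketch of temporal redundancy: store only the changed pixels.
def frame_delta(prev, curr):
    """Return [(index, new_value)] for every pixel that differs from prev."""
    return [(i, v) for i, (p, v) in enumerate(zip(prev, curr)) if p != v]

def apply_delta(prev, delta):
    """Reconstruct the current frame from the previous frame plus the delta."""
    frame = list(prev)
    for i, v in delta:
        frame[i] = v
    return frame

prev = [0, 0, 0, 0, 0, 0]
curr = [0, 0, 9, 9, 0, 0]
d = frame_delta(prev, curr)          # [(2, 9), (3, 9)] -- only the change
assert apply_delta(prev, d) == curr  # unchanged pixels come from prev
```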