This section provides practice in the use of list structure and data abstraction to manipulate sets and trees. The application is to methods for representing data as sequences of ones and zeros (bits). For example, the ASCII standard code used to represent text in computers encodes each character as a sequence of seven bits. Using seven bits allows us to distinguish $2^7$, or 128, possible different characters. In general, if we want to distinguish $n$ different symbols, we will need to use $\log_2 n$ bits per symbol (rounded up to a whole number of bits when $n$ is not a power of 2). If all our messages are made up of the eight symbols A, B, C, D, E, F, G, and H, we can choose a code with three bits per character, for example

| A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|
| 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |

With this code, the message

BACADAEAFABBAAAGAH

is encoded as the string of 54 bits

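001000010000011000100000101000001001000000000110000111

Codes such as ASCII and the A-through-H code above are known as *fixed-length* codes, because they represent each symbol in the message with the same number of bits. It is sometimes advantageous to use *variable-length* codes, in which different symbols may be represented by different numbers of bits. For example, Morse code does not use the same number of dots and dashes for each letter of the alphabet. In general, if our messages are such that some symbols appear very frequently and some very rarely, we can encode data more efficiently (i.e., using fewer bits per message) if we assign shorter codes to the frequent symbols. Consider the following alternative code for the letters A through H:

| A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|
| 0 | 100 | 1010 | 1011 | 1100 | 1101 | 1110 | 1111 |

With this code, the same message as above is encoded as the string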

100010100101101100011010100100000111001111

This string contains 42 bits, so it saves more than 20% in space in comparison with the fixed-length code shown above.
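As a quick check of the two bit counts, the following sketch encodes the message with both codes and compares the lengths. It uses plain JavaScript objects and strings rather than the list operations used later in this section. The names `fixed_code`, `variable_code`, and `encode_with` are illustrative, and the two tables are the ones consistent with the bit strings shown above.

```javascript
// Code tables written as plain objects (an illustrative representation).
const fixed_code = { A: "000", B: "001", C: "010", D: "011",
                     E: "100", F: "101", G: "110", H: "111" };
const variable_code = { A: "0", B: "100", C: "1010", D: "1011",
                        E: "1100", F: "1101", G: "1110", H: "1111" };

// Encode a string of one-letter symbols by concatenating their codes.
function encode_with(code, message) {
    return message.split("").map(symbol => code[symbol]).join("");
}

const message = "BACADAEAFABBAAAGAH";
const fixed_bits = encode_with(fixed_code, message);
const variable_bits = encode_with(variable_code, message);

console.log(fixed_bits.length);     // 54
console.log(variable_bits.length);  // 42
// Fraction of space saved by the variable-length code: (54 - 42) / 54.
console.log(1 - variable_bits.length / fixed_bits.length);
```

The saving of 12 bits out of 54 is a little over 22%, which matches the "more than 20%" figure.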

One of the difficulties of using a variable-length code is knowing
when you have reached the end of a symbol in reading a sequence of
zeros and ones. Morse code solves this problem by using a special
*separator code* (in this case, a pause) after the sequence of
dots and dashes for each letter. Another solution is to design the
code in such a way that no complete code for any symbol is the
beginning (or *prefix*) of the code for another symbol. Such a
code is called a
*prefix code*. In the example above, A is
encoded by 0 and B is encoded by 100, so no other symbol can have a
code that begins with 0 or with 100.
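The prefix property can be checked mechanically. The sketch below is one possible test, assuming a code table stored as a plain object that maps symbols to bit strings; `is_prefix_code` and `ambiguous_code` are hypothetical names introduced only for this illustration.

```javascript
// Return true if no code word in the table is a prefix of another one.
function is_prefix_code(code) {
    const words = Object.values(code);
    return words.every((w, i) =>
        words.every((v, j) => i === j || !v.startsWith(w)));
}

// The variable-length code for A through H, with A = 0 and B = 100.
const variable_code = { A: "0", B: "100", C: "1010", D: "1011",
                        E: "1100", F: "1101", G: "1110", H: "1111" };

// A made-up table that violates the property: "10" is a prefix of "100".
const ambiguous_code = { A: "0", B: "10", C: "100" };

console.log(is_prefix_code(variable_code));   // true
console.log(is_prefix_code(ambiguous_code));  // false
```

With the second table, a reader who sees the bits 10 cannot tell whether a B has ended or a C is still in progress, which is exactly the difficulty a prefix code avoids.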

In general, we can attain significant savings if we use variable-length prefix codes that take advantage of the relative frequencies of the symbols in the messages to be encoded. One particular scheme for doing this is called the Huffman encoding method, after its discoverer, David Huffman. A Huffman code can be represented as a binary tree whose leaves are the symbols that are encoded. At each non-leaf node of the tree there is a set containing all the symbols in the leaves that lie below the node. In addition, each symbol at a leaf is assigned a weight (which is its relative frequency), and each non-leaf node contains a weight that is the sum of all the weights of the leaves lying below it. The weights are not used in the encoding or the decoding process. We will see below how they are used to help construct the tree.

Figure 2.18 shows the Huffman tree for the A-through-H code given above. The weights at the leaves indicate that the tree was designed for messages in which A appears with relative frequency 8, B with relative frequency 3, and the other letters each with relative frequency 1.

Given a Huffman tree, we can find the encoding of any symbol by starting at the root and moving down until we reach the leaf that holds the symbol. Each time we move down a left branch we add a 0 to the code, and each time we move down a right branch we add a 1. (We decide which branch to follow by testing to see which branch either is the leaf node for the symbol or contains the symbol in its set.) For example, starting from the root of the tree in Figure 2.18, we arrive at the leaf for D by following a right branch, then a left branch, then a right branch, then a right branch; hence, the code for D is 1011.
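The walk from the root can be sketched as follows. This is only an illustration under stated assumptions: trees are plain objects rather than the list representation developed later in the section, the shape of `figure_tree` is inferred from the codes and weights quoted in the text (the figure itself is not reproduced here), and `leaf`, `node`, and `code_for` are hypothetical names.

```javascript
// Illustrative plain-object trees: every node records its symbols and weight.
function leaf(symbol, weight) {
    return { symbols: [symbol], weight: weight };
}
function node(left, right) {
    return { left: left, right: right,
             symbols: left.symbols.concat(right.symbols),
             weight: left.weight + right.weight };
}

// Assumed shape of the tree for the A-through-H code.
const figure_tree =
    node(leaf("A", 8),
         node(node(leaf("B", 3),
                   node(leaf("C", 1), leaf("D", 1))),
              node(node(leaf("E", 1), leaf("F", 1)),
                   node(leaf("G", 1), leaf("H", 1)))));

// Add "0" for each left branch and "1" for each right branch taken
// on the way down to the leaf that holds the symbol.
function code_for(symbol, tree) {
    if (tree.left === undefined) {          // a leaf has no branches
        return "";
    } else if (tree.left.symbols.includes(symbol)) {
        return "0" + code_for(symbol, tree.left);
    } else if (tree.right.symbols.includes(symbol)) {
        return "1" + code_for(symbol, tree.right);
    } else {
        throw new Error("symbol not in tree: " + symbol);
    }
}

console.log(code_for("D", figure_tree));  // "1011"
console.log(code_for("A", figure_tree));  // "0"
```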

To decode a bit sequence using a Huffman tree, we begin at the root and use the successive zeros and ones of the bit sequence to determine whether to move down the left or the right branch. Each time we come to a leaf, we have generated a new symbol in the message, at which point we start over from the root of the tree to find the next symbol. For example, suppose we are given the tree above and the sequence 10001010. Starting at the root, we move down the right branch (since the first bit of the string is 1), then down the left branch (since the second bit is 0), then down the left branch (since the third bit is also 0). This brings us to the leaf for B, so the first symbol of the decoded message is B. Now we start again at the root, and we make a left move because the next bit in the string is 0. This brings us to the leaf for A. Then we start again at the root with the rest of the string 1010, so we move right, left, right, left and reach C. Thus, the entire message is BAC.
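The decoding walk in this example can be traced with a few lines of code. The sketch below deliberately uses a compact stand-in representation that is not the one developed later in the section: a leaf is a string, and an internal node is a two-element array `[left, right]`. The tree shape is inferred from the codes quoted in the text, and `decode_bits` is a hypothetical name.

```javascript
// Leaves are strings; an internal node is the array [left, right].
const figure_tree = ["A", [["B", ["C", "D"]], [["E", "F"], ["G", "H"]]]];

// Follow one branch per bit; on reaching a leaf, record its symbol
// and start over from the root.
function decode_bits(bits, tree) {
    let message = "";
    let current = tree;
    for (const bit of bits) {
        current = bit === "0" ? current[0] : current[1];
        if (typeof current === "string") {
            message = message + current;
            current = tree;
        }
    }
    return message;
}

console.log(decode_bits("10001010", figure_tree));  // "BAC"
```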

Given an "alphabet" of symbols and their relative frequencies, how
do we construct the "best" code? (In other words, which tree will
encode messages with the fewest bits?) Huffman gave an algorithm for
doing this and showed that the resulting code is indeed the best
variable-length code for messages where the relative frequency of the
symbols matches the frequencies with which the code was constructed.
We will not prove this optimality of Huffman codes here, but we will
show how Huffman trees are constructed.[1]

The algorithm for generating a Huffman tree is very simple. The idea
is to arrange the tree so that the symbols with the lowest frequency
appear farthest away from the root. Begin with the set of leaf nodes,
containing symbols and their frequencies, as determined by the initial data
from which the code is to be constructed. Now find two leaves with
the lowest weights and merge them to produce a node that has these
two nodes as its left and right branches. The weight of the new node
is the sum of the two weights. Remove the two leaves from the
original set and replace them by this new node. Now continue this
process. At each step, merge two nodes with the smallest weights,
removing them from the set and replacing them with a node that has
these two as its left and right branches. The process stops when
there is only one node left, which is the root of the entire tree.
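Here is how the Huffman tree of Figure 2.18 was generated:

| Step | Set of nodes |
|---|---|
| Initial leaves | {(A 8) (B 3) (C 1) (D 1) (E 1) (F 1) (G 1) (H 1)} |
| Merge | {(A 8) (B 3) ({C D} 2) (E 1) (F 1) (G 1) (H 1)} |
| Merge | {(A 8) (B 3) ({C D} 2) ({E F} 2) (G 1) (H 1)} |
| Merge | {(A 8) (B 3) ({C D} 2) ({E F} 2) ({G H} 2)} |
| Merge | {(A 8) (B 3) ({C D} 2) ({E F G H} 4)} |
| Merge | {(A 8) ({B C D} 5) ({E F G H} 4)} |
| Merge | {(A 8) ({B C D E F G H} 9)} |
| Final merge | {({A B C D E F G H} 17)} |

The algorithm does not always specify a unique tree, because there may not be unique smallest-weight nodes at each step. Also, the choice of the order in which the two nodes are merged (i.e., which will be the right branch and which will be the left branch) is arbitrary.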
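The merging process can be sketched on plain JavaScript arrays before turning to the list-based representation used in the rest of the section. This is a minimal illustration, not the implementation developed below: nodes are plain objects, the set is re-sorted after every merge, the input is assumed to be non-empty, and `huffman_merge_trace` is a hypothetical name. Because ties between equal weights can be broken in more than one way, the intermediate sets need not match any particular published trace, although the final weight does.

```javascript
// Repeatedly merge the two lightest nodes, recording the weights in the
// set after each step.
function huffman_merge_trace(frequencies) {
    // Start with one node per symbol, sorted by increasing weight.
    let nodes = frequencies
        .map(([symbol, weight]) => ({ symbols: [symbol], weight: weight }))
        .sort((a, b) => a.weight - b.weight);
    const trace = [nodes.map(n => n.weight)];
    while (nodes.length > 1) {
        const [first, second, ...rest] = nodes;
        const merged = { left: first, right: second,
                         symbols: first.symbols.concat(second.symbols),
                         weight: first.weight + second.weight };
        nodes = rest.concat([merged]).sort((a, b) => a.weight - b.weight);
        trace.push(nodes.map(n => n.weight));
    }
    return { root: nodes[0], trace: trace };
}

const result = huffman_merge_trace(
    [["A", 8], ["B", 3], ["C", 1], ["D", 1],
     ["E", 1], ["F", 1], ["G", 1], ["H", 1]]);

console.log(result.trace.length);  // 8: the initial set plus seven merges
console.log(result.root.weight);   // 17
```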

In the exercises below we will work with a system that uses Huffman trees to encode and decode messages and generates Huffman trees according to the algorithm outlined above. We will begin by discussing how trees are represented.

Leaves of the tree are represented by a list consisting of the
string `"leaf"`, the symbol at the leaf, and the weight:

    function make_leaf(symbol, weight) {
        return list("leaf", symbol, weight);
    }
    function is_leaf(object) {
        return head(object) === "leaf";
    }
    function symbol_leaf(x) {
        return head(tail(x));
    }
    function weight_leaf(x) {
        return head(tail(tail(x)));
    }

A general tree will be a list of a left branch, a right branch, a set
of symbols, and a weight. The set of symbols will be simply a list of
the symbols, rather than some more sophisticated set representation.
When we make a tree by merging two nodes, we obtain the weight of the
tree as the sum of the weights of the nodes, and the set of symbols as
the union of the sets of symbols for the nodes. Since our symbol sets are
represented as lists, we can form the union by using the `append`
function we defined in section 2.2.1:

    function make_code_tree(left, right) {
        return list(left,
                    right,
                    append(symbols(left), symbols(right)),
                    weight(left) + weight(right));
    }

If we make a tree in this way, we have the following selectors:

    function left_branch(tree) {
        return head(tree);
    }
    function right_branch(tree) {
        return head(tail(tree));
    }
    function symbols(tree) {
        return is_leaf(tree)
               ? list(symbol_leaf(tree))
               : head(tail(tail(tree)));
    }
    function weight(tree) {
        return is_leaf(tree)
               ? weight_leaf(tree)
               : head(tail(tail(tail(tree))));
    }
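To see the constructors and selectors at work outside the book's programming environment, the sketch below supplies minimal stand-ins for the list primitives (`pair`, `head`, `tail`, `is_null`, `list`, `append`). Those stand-ins, which model a pair as a two-element array, are assumptions made for this illustration; the tree functions themselves follow the definitions in the text.

```javascript
// Assumed stand-ins for the list primitives: a pair is a two-element array.
const pair = (x, y) => [x, y];
const head = p => p[0];
const tail = p => p[1];
const is_null = x => x === null;
const list = (...xs) => xs.reduceRight((rest, x) => pair(x, rest), null);
const append = (a, b) => is_null(a) ? b : pair(head(a), append(tail(a), b));

// Leaf and tree representation as defined in the text.
function make_leaf(symbol, weight) { return list("leaf", symbol, weight); }
function is_leaf(object) { return head(object) === "leaf"; }
function symbol_leaf(x) { return head(tail(x)); }
function weight_leaf(x) { return head(tail(tail(x))); }
function make_code_tree(left, right) {
    return list(left, right,
                append(symbols(left), symbols(right)),
                weight(left) + weight(right));
}
function symbols(tree) {
    return is_leaf(tree) ? list(symbol_leaf(tree)) : head(tail(tail(tree)));
}
function weight(tree) {
    return is_leaf(tree) ? weight_leaf(tree) : head(tail(tail(tail(tree))));
}

// Merge C and D, then merge B with the result.
const c_d = make_code_tree(make_leaf("C", 1), make_leaf("D", 1));
const b_c_d = make_code_tree(make_leaf("B", 3), c_d);

console.log(weight(b_c_d));                   // 5
console.log(JSON.stringify(symbols(b_c_d)));  // ["B",["C",["D",null]]]
```

Note that `symbols` and `weight` give sensible answers for both `make_leaf("C", 1)` and `b_c_d`, even though the two values are laid out differently.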

The functions `symbols` and `weight` must do something
slightly different depending on whether they are called with a leaf or
a general tree. These are simple examples of *generic functions*
(functions that can handle more than one kind of data),
which we will have much more to say about in sections 2.4 and 2.5.

The following function implements the decoding algorithm. It takes as arguments a list of zeros and ones, together with a Huffman tree.

    function decode(bits, tree) {
        function decode_1(bits, current_branch) {
            if (is_null(bits)) {
                return null;
            } else {
                const next_branch = choose_branch(head(bits), current_branch);
                return is_leaf(next_branch)
                       ? pair(symbol_leaf(next_branch),
                              decode_1(tail(bits), tree))
                       : decode_1(tail(bits), next_branch);
            }
        }
        return decode_1(bits, tree);
    }

    function choose_branch(bit, branch) {
        return bit === 0
               ? left_branch(branch)
               : bit === 1
               ? right_branch(branch)
               : error(bit, "bad bit -- choose_branch");
    }

The function `decode_1` takes two arguments: the list of remaining bits
and the current position in the tree. It keeps moving "down" the tree,
choosing a left or a right branch according to whether the next bit in
the list is a zero or a one. (This is done with the function
`choose_branch`.) When it reaches a leaf, it returns the symbol at that
leaf as the next symbol in the message by `pair`ing it onto the result
of decoding the rest of the message, starting at the root of the tree.
Note the error check in the final clause of `choose_branch`, which
complains if the function finds something other than a zero or a one in
the input data.
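The following self-contained sketch runs the decoder on a small tree and also exercises the error check. The list primitives and the `error` function are stand-ins written for this illustration (assumptions about an environment the text takes for granted, with `error` modeled as throwing a JavaScript exception); `small_tree` is a made-up example whose codes are A = 0, B = 10, and C = 11.

```javascript
// Assumed stand-ins for the list primitives and for `error`.
const pair = (x, y) => [x, y];
const head = p => p[0];
const tail = p => p[1];
const is_null = x => x === null;
const list = (...xs) => xs.reduceRight((rest, x) => pair(x, rest), null);
const append = (a, b) => is_null(a) ? b : pair(head(a), append(tail(a), b));
function error(value, message) { throw new Error(message + " " + value); }

// Tree representation and decoder, following the text.
function make_leaf(symbol, weight) { return list("leaf", symbol, weight); }
function is_leaf(object) { return head(object) === "leaf"; }
function symbol_leaf(x) { return head(tail(x)); }
function weight_leaf(x) { return head(tail(tail(x))); }
function left_branch(tree) { return head(tree); }
function right_branch(tree) { return head(tail(tree)); }
function symbols(tree) {
    return is_leaf(tree) ? list(symbol_leaf(tree)) : head(tail(tail(tree)));
}
function weight(tree) {
    return is_leaf(tree) ? weight_leaf(tree) : head(tail(tail(tail(tree))));
}
function make_code_tree(left, right) {
    return list(left, right, append(symbols(left), symbols(right)),
                weight(left) + weight(right));
}
function choose_branch(bit, branch) {
    return bit === 0 ? left_branch(branch)
           : bit === 1 ? right_branch(branch)
           : error(bit, "bad bit -- choose_branch");
}
function decode(bits, tree) {
    function decode_1(bits, current_branch) {
        if (is_null(bits)) {
            return null;
        } else {
            const next_branch = choose_branch(head(bits), current_branch);
            return is_leaf(next_branch)
                   ? pair(symbol_leaf(next_branch), decode_1(tail(bits), tree))
                   : decode_1(tail(bits), next_branch);
        }
    }
    return decode_1(bits, tree);
}

// A small made-up tree with codes A = 0, B = 10, C = 11.
const small_tree = make_code_tree(make_leaf("A", 2),
                                  make_code_tree(make_leaf("B", 1),
                                                 make_leaf("C", 1)));
const decoded = decode(list(1, 0, 0, 1, 1), small_tree);
console.log(JSON.stringify(decoded));  // ["B",["A",["C",null]]]

// A bit other than 0 or 1 triggers the error check in choose_branch.
let complaint = "";
try {
    decode(list(0, 2), small_tree);
} catch (e) {
    complaint = e.message;
}
console.log(complaint);  // bad bit -- choose_branch 2
```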

In our representation of trees, each non-leaf node contains a set of symbols, which we have represented as a simple list. However, the tree-generating algorithm discussed above requires that we also work with sets of leaves and trees, successively merging the two smallest items. Since we will be required to repeatedly find the smallest item in a set, it is convenient to use an ordered representation for this kind of set.

We will represent a set of leaves and trees as a list of elements,
arranged in increasing order of weight. The following `adjoin_set`
function for constructing sets is similar to the one
described in exercise 2.61; however, items are compared
by their weights, and the element being added to the set is
never already in it.

    function adjoin_set(x, set) {
        return is_null(set)
               ? list(x)
               : weight(x) < weight(head(set))
               ? pair(x, set)
               : pair(head(set), adjoin_set(x, tail(set)));
    }

The following function takes a list of symbol-frequency pairs such as

    list(list("A", 4), list("B", 2), list("C", 1), list("D", 1))

and constructs an initial ordered set of leaves, ready to be merged according to the Huffman algorithm:

    function make_leaf_set(pairs) {
        if (is_null(pairs)) {
            return null;
        } else {
            const first_pair = head(pairs);
            return adjoin_set(
                       make_leaf(head(first_pair),        // symbol
                                 head(tail(first_pair))), // frequency
                       make_leaf_set(tail(pairs)));
        }
    }
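Running `make_leaf_set` on the example pairs shows the ordering by weight. In the sketch below, the list primitives are assumed stand-ins (a pair is modeled as a two-element array) and `to_array` is a hypothetical helper used only to display the result; the remaining functions follow the definitions in the text.

```javascript
// Assumed stand-ins for the list primitives used by the text.
const pair = (x, y) => [x, y];
const head = p => p[0];
const tail = p => p[1];
const is_null = x => x === null;
const list = (...xs) => xs.reduceRight((rest, x) => pair(x, rest), null);

// Leaf representation, `weight`, and the set operations from the text.
function make_leaf(symbol, weight) { return list("leaf", symbol, weight); }
function is_leaf(object) { return head(object) === "leaf"; }
function symbol_leaf(x) { return head(tail(x)); }
function weight_leaf(x) { return head(tail(tail(x))); }
function weight(tree) {
    return is_leaf(tree) ? weight_leaf(tree) : head(tail(tail(tail(tree))));
}
function adjoin_set(x, set) {
    return is_null(set)
           ? list(x)
           : weight(x) < weight(head(set))
           ? pair(x, set)
           : pair(head(set), adjoin_set(x, tail(set)));
}
function make_leaf_set(pairs) {
    if (is_null(pairs)) {
        return null;
    } else {
        const first_pair = head(pairs);
        return adjoin_set(make_leaf(head(first_pair),
                                    head(tail(first_pair))),
                          make_leaf_set(tail(pairs)));
    }
}

// Hypothetical display helper: convert a list to a JavaScript array.
function to_array(xs) {
    return is_null(xs) ? [] : [head(xs)].concat(to_array(tail(xs)));
}

const leaf_set = make_leaf_set(
    list(list("A", 4), list("B", 2), list("C", 1), list("D", 1)));
const ordered_symbols = to_array(leaf_set).map(symbol_leaf);
const ordered_weights = to_array(leaf_set).map(weight);

console.log(ordered_symbols);  // symbols in order: D, C, B, A
console.log(ordered_weights);  // weights in order: 1, 1, 2, 4
```

D comes before C because the last pair is adjoined first, and `adjoin_set` places a new item after existing items of equal weight.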

    const sample_tree = make_code_tree(make_leaf("A", 4),
                                       make_code_tree(make_leaf("B", 2),
                                                      make_code_tree(
                                                          make_leaf("D", 1),
                                                          make_leaf("C", 1))));
    const sample_message = list(0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0);

    decode(sample_message, sample_tree);
    // should be: ["A", ["D", ["A", ["B", ["B", ["C", ["A", null]]]]]]]

    function encode(message, tree) {
        return is_null(message)
               ? null
               : append(encode_symbol(head(message), tree),
                        encode(tail(message), tree));
    }

    function encode_symbol(symbol, tree) {
        function contains_symbol(symbol, current_tree) {
            return member(symbol, symbols(current_tree)) !== null;
        }
        if (is_leaf(tree)) {
            return null;
        } else {
            const left_tree = left_branch(tree);
            const right_tree = right_branch(tree);
            return contains_symbol(symbol, left_tree)
                   ? pair(0, encode_symbol(symbol, left_tree))
                   : contains_symbol(symbol, right_tree)
                   ? pair(1, encode_symbol(symbol, right_tree))
                   : error("symbol not found");
        }
    }

    function generate_huffman_tree(pairs) {
        return successive_merge(make_leaf_set(pairs));
    }

    function successive_merge(leaves) {
        return length(leaves) === 1
               ? head(leaves)
               : successive_merge(
                     adjoin_set(
                         make_code_tree(head(leaves), head(tail(leaves))),
                         tail(tail(leaves))));
    }

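(Note that the "symbols" of an "alphabet" need not be individual letters.) Use `generate_huffman_tree` to generate a corresponding Huffman tree, and `encode` to encode the following message: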

Get a job

Sha na na na na na na na na

Get a job

Sha na na na na na na na na

Wah yip yip yip yip yip yip yip yip yip

Sha boom

    const lyrics_frequencies =
        list(list("A", 2), list("NA", 16), list("BOOM", 1), list("SHA", 3),
             list("GET", 2), list("YIP", 9), list("JOB", 2), list("WAH", 2));
    const lyrics_tree = generate_huffman_tree(lyrics_frequencies);
    const lyrics = list(
        "GET", "A", "JOB",
        "SHA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
        "GET", "A", "JOB",
        "SHA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
        "WAH", "YIP", "YIP", "YIP", "YIP", "YIP", "YIP", "YIP", "YIP", "YIP",
        "SHA", "BOOM");

    length(encode(lyrics, lyrics_tree)); // 84
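For comparison, the size of the same message under a fixed-length code can be worked out directly. The sketch below uses plain JavaScript strings and arrays rather than the list operations of this section, and the variable names are illustrative; it counts the words in the lyrics and multiplies by the number of bits a fixed-length code needs for an alphabet of that size.

```javascript
// The lyrics as plain strings, one per line of the song.
const lyric_lines = [
    "Get a job",
    "Sha na na na na na na na na",
    "Get a job",
    "Sha na na na na na na na na",
    "Wah yip yip yip yip yip yip yip yip yip",
    "Sha boom"
];

// Each word of the song is one symbol of the message.
const words = lyric_lines.join(" ").toUpperCase().split(" ");
const alphabet_size = new Set(words).size;

// A fixed-length code needs log2(n) bits per symbol, rounded up.
const bits_per_symbol = Math.ceil(Math.log2(alphabet_size));
const fixed_length_bits = words.length * bits_per_symbol;

console.log(words.length);       // 36
console.log(alphabet_size);      // 8
console.log(fixed_length_bits);  // 108
```

Against the 84 bits reported for the Huffman encoding, the fixed-length code would use 108 bits, so the Huffman code saves 24 bits on this message.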

[1] See Hamming 1980 for a discussion of the mathematical properties of Huffman codes.

2.3.4 Example: Huffman Encoding Trees