|
RNC
Technical description of the Rob Northern Compression system used by Theme Hospital
Phase-Implementation IntroductionMany of the original Theme Hospital data files (along with Dungeon Keeper data files) are compressed using a scheme called Rob Northern (lossless) Compression, typically referred to by the initials RNC. In most situations, RNC does not compress data as well as modern compression schemes like DEFLATE (aka. zlib / zip), but it is a compression scheme which we need to support in order to load the original game's data files. This document provides a technical description of exactly what RNC is - for actual source code, see the subversion repository, or a Dungeon Keeper tools page. The descriptions given in this page are based on a study of the RNC decompression code found at the aforementioned Dungeon Keeper site. Overall structureAn RNC file begins with a fixed-length 18 byte header, then the remainder of the file consists of an interleaved bit-stream and byte-stream containing the actual compressed data. The fields of the header are:
ChecksumsAn RNC checksum is a 16 bit unsigned integer, used to check archive integrity and correct decoding. Given an array of bytes b0 through bn, the pseudo-code for calculating the checksum is: checksum := 0 for i = 0 to n do checksum := checksum XOR b[i] checksum := (HIGH BYTE OF checksum) XOR f(LOW BYTE OF checksum) Where the function f(x) is calculated by the pseudo-code: result := x;
for i = 0 to 7 do
if LSB(result) IS SET then
result := ((result - 1) / 2) XOR 0xA001
else
result := result / 2The bit-stream / byte-streamImmediately after the header, the bit-stream / byte-stream begins. The first two bits of the bit stream have an unknown meaning. The remainder of the bit-stream / byte-stream consists of zero or more compressed chunks. The bit-stream / byte-stream is primarily a stream of bits consisting of variously sized (unsigned) integer values, each value being between 1 and 16 bits. Each value is stored with the least significant bit first, and with no regard for traditional byte boundaries. Occasionally, a stream of bytes appears in the bit stream, but when it does so, the bytes are delayed such that they occur after a multiple of 16 bits have appeared in the bit stream. For example, if the following fields were written:
This seems slightly complex, but it has two advantages:
Compressed ChunksA compressed chunk consists of 3 Huffman tables, and then between 1 and 65535 tuples, where each tuple consists of zero or more raw bytes and a length-distance pair. The raw bytes are copied directly from the byte-stream to the decompressed output. The length-distance pair causes each of the next length bytes of decompressed output to be equal to the byte distance bytes prior to it. It is these length-distance pairs which lead to the actual compression in RNC (by causing duplicated byte sequences to only be written out in full once). In a length-distance pair, the length must be at least 2 (and the distance at least 1), and hence the final tuple in each compressed chunk is truncated to just the raw bytes, as it cannot be guaranteed that the input data finishes with two bytes which have previously been seen together. Each tuple consists of an unsigned integer in the bit stream giving the number of raw bytes, then that many bytes, then for every tuple excluding the final one, an unsigned integer giving the distance minus 1, and finally an unsigned integer giving the length minus 2. These integers are encoded in a slightly unusual way: the length (in bits) of the integer is encoded using a Huffman table, and then followed by the integer itself, sans the most significant bit (as that bit is implied by the length of the integer). This means that the maximum value for each integer is 218-1, as this becomes a Huffman-encoding of the value 17, followed by a 16 bit value in the bit-stream. The three Huffman tables correspond to the three integers in each tuple. The first table is used for decoding the lengths (in bits) of the first value in each tuple, the second table for the second value, and the third table for the third value. Each table contains between 1 and 18 code / value pairs, as the value part of each pair is itself an integer between 0 and 17. Note that when there is only one pair in the table, the code for it must be at least 1 bit long, and if there are 18 pairs in the table, then no code must exceed 16 bits (technically, no code must ever exceed 16 bits, but it is only a possibility when there are more than 17 pairs). The tables themselves are written as:
| |||||||||||||||||||||||||||||||||||||||
It would be useful if we could permanently extract the graphics images from the original archives (for the purposes of developing replacements). Is there an open source tool that can do this?
I suppose you could use the tools at the site linked from the top of this page? But maybe that would be too easy...
At the moment it is not "too easy", because nobody seems to know how to use those tools. My feeble attempts at using these were noted on the forum:
http://forums.corsix-th.com/index.php/topic,388.0.html