My favorites | Sign in
Project Home Downloads Wiki Issues Source
Search
for
RNC  
Technical description of the Rob Northern Compression system used by Theme Hospital
Phase-Implementation
Updated Mar 29, 2010 by cor...@gmail.com

Introduction

Many of the original Theme Hospital data files (along with Dungeon Keeper data files) are compressed using a scheme called Rob Northern (lossless) Compression, typically referred to by the initials RNC. In most situations, RNC does not compress data as well as modern compression schemes like DEFLATE (aka. zlib / zip), but it is a compression scheme which we need to support in order to load the original game's data files. This document provides a technical description of exactly what RNC is - for actual source code, see the subversion repository, or a Dungeon Keeper tools page.

The descriptions given in this page are based on a study of the RNC decompression code found at the aforementioned Dungeon Keeper site.

Overall structure

An RNC file begins with a fixed-length 18 byte header, then the remainder of the file consists of an interleaved bit-stream and byte-stream containing the actual compressed data.

The fields of the header are:

Offset Size Type Description
0 4 String literal File signature ("RNC\001").
4 4 Big-endian unsigned integer Size (in bytes) of the decompressed data.
8 4 Big-endian unsigned integer Size (in bytes) of the bit-stream / byte-stream which makes up the remainder of the file.
12 2 Big-endian unsigned integer Checksum of the decompressed data.
14 2 Big-endian unsigned integer Checksum of the bit-stream / byte-stream which makes up the remainder of the file.
16 2 Unknown Unknown

Checksums

An RNC checksum is a 16 bit unsigned integer, used to check archive integrity and correct decoding. Given an array of bytes b0 through bn, the pseudo-code for calculating the checksum is:

checksum := 0
for i = 0 to n do
  checksum := checksum XOR b[i]
  checksum := (HIGH BYTE OF checksum) XOR f(LOW BYTE OF checksum)

Where the function f(x) is calculated by the pseudo-code:

result := x;
for i = 0 to 7 do
  if LSB(result) IS SET then
    result := ((result - 1) / 2) XOR 0xA001
  else
    result := result / 2

The bit-stream / byte-stream

Immediately after the header, the bit-stream / byte-stream begins. The first two bits of the bit stream have an unknown meaning. The remainder of the bit-stream / byte-stream consists of zero or more compressed chunks.

The bit-stream / byte-stream is primarily a stream of bits consisting of variously sized (unsigned) integer values, each value being between 1 and 16 bits. Each value is stored with the least significant bit first, and with no regard for traditional byte boundaries. Occasionally, a stream of bytes appears in the bit stream, but when it does so, the bytes are delayed such that they occur after a multiple of 16 bits have appeared in the bit stream. For example, if the following fields were written:

  1. 5 bit unsigned integer
  2. 3 bytes
  3. 1 bit unsigned integer
  4. 16 bit unsigned integer
Then the result would look like:

Byte Number Contents (from LSB to MSB)
0 <5 bit value> <1 bit value> <least significant 2 bits of the 16 bit value>
1 <the next 8 bits of the 16 bit value>
2, 3, 4 The 3 bytes
5 <the most significant 6 bits of the 16 bit value> <2 unused bits>

This seems slightly complex, but it has two advantages:

  • Whole bytes can be copied directly from the byte-stream, as they are aligned to whole-byte boundaries.
  • It is always safe to read 16 bits at a time from the bit-stream, which is useful, as individual fields in the bit-stream can be up to 16 bits wide.

Compressed Chunks

A compressed chunk consists of 3 Huffman tables, and then between 1 and 65535 tuples, where each tuple consists of zero or more raw bytes and a length-distance pair. The raw bytes are copied directly from the byte-stream to the decompressed output. The length-distance pair causes each of the next length bytes of decompressed output to be equal to the byte distance bytes prior to it. It is these length-distance pairs which lead to the actual compression in RNC (by causing duplicated byte sequences to only be written out in full once). In a length-distance pair, the length must be at least 2 (and the distance at least 1), and hence the final tuple in each compressed chunk is truncated to just the raw bytes, as it cannot be guaranteed that the input data finishes with two bytes which have previously been seen together.

Each tuple consists of an unsigned integer in the bit stream giving the number of raw bytes, then that many bytes, then for every tuple excluding the final one, an unsigned integer giving the distance minus 1, and finally an unsigned integer giving the length minus 2. These integers are encoded in a slightly unusual way: the length (in bits) of the integer is encoded using a Huffman table, and then followed by the integer itself, sans the most significant bit (as that bit is implied by the length of the integer). This means that the maximum value for each integer is 218-1, as this becomes a Huffman-encoding of the value 17, followed by a 16 bit value in the bit-stream.

The three Huffman tables correspond to the three integers in each tuple. The first table is used for decoding the lengths (in bits) of the first value in each tuple, the second table for the second value, and the third table for the third value. Each table contains between 1 and 18 code / value pairs, as the value part of each pair is itself an integer between 0 and 17. Note that when there is only one pair in the table, the code for it must be at least 1 bit long, and if there are 18 pairs in the table, then no code must exceed 16 bits (technically, no code must ever exceed 16 bits, but it is only a possibility when there are more than 17 pairs). The tables themselves are written as:

  • <N> - a 5 bit unsigned integer, where N is one greater than the maximum value in the table
  • For <i> from 0 to N-1 (inclusive) - a 4 bit unsigned integer giving the size (in bits) of the code corresponding to the value i. A size of zero indicates that the given value is not present in the table.
Canonical Huffman codes are then created for each value in the table from these code lengths.

Comment by markhob...@yahoo.co.uk, Jan 30, 2011

It would be useful if we could permanently extract the graphics images from the original archives (for the purposes of developing replacements). Is there an open source tool that can do this?

Comment by project member edvin.li...@gmail.com, Jan 30, 2011

I suppose you could use the tools at the site linked from the top of this page? But maybe that would be too easy...

Comment by gillianh...@gmail.com, Feb 17, 2011

At the moment it is not "too easy", because nobody seems to know how to use those tools. My feeble attempts at using these were noted on the forum:

http://forums.corsix-th.com/index.php/topic,388.0.html


Sign in to add a comment
Powered by Google Project Hosting