xarformat  
xarchive format
Phase-Implementation, Featured
Updated Nov 24, 2009 by bbraun

Format of a xar archive

The XAR file format has three main regions: the header, the table of contents, and the heap. The header is a small binary data structure that identifies the file format (file magic). The table of contents is parsed as an XML document. The heap occupies the remainder of the file and stores the files' data.

The Header

The header starts with 32 bits of file magic ('xar!') in network byte order. The next 16 bits are the size of the header (including the 32 bits of file magic), in network byte order. A 16-bit xar file version number follows, also in network byte order; the current version is zero. Next come two 64-bit lengths for the table of contents region, compressed and uncompressed, and last is a 32-bit checksum algorithm identifier, all in network byte order. The header may be represented as the following xar_header C structure.

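#include <stdint.h>

#define XAR_HEADER_MAGIC 0x78617221  /* 'xar!' */
#define XAR_HEADER_VERSION 0
/* Note: sizeof may include trailing padding (32 bytes on common 64-bit ABIs);
 * the fields themselves total 28 bytes. */
#define XAR_HEADER_SIZE sizeof(struct xar_header)

/*
 * xar_header version 0
 */
struct xar_header {
    uint32_t magic;                    /* file magic, 'xar!' */
    uint16_t size;                     /* size of the header in bytes */
    uint16_t version;                  /* xar file version */
    uint64_t toc_length_compressed;    /* length of the table of contents as stored */
    uint64_t toc_length_uncompressed;  /* length of the table of contents once decompressed */
    uint32_t cksum_alg;                /* checksum algorithm identifier */
};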
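
Because every field is stored in network byte order, and because the compiler may pad the structure, a reader probably should not copy the raw bytes straight into the struct. The following is a minimal sketch of decoding the header field by field. The names read_be, xar_header_parse and XAR_HEADER_WIRE_SIZE are hypothetical and not part of xar, and rejecting a size field smaller than 28 is an assumption of this sketch, not a documented rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XAR_HEADER_MAGIC     0x78617221u /* 'xar!' */
#define XAR_HEADER_WIRE_SIZE 28          /* 4 + 2 + 2 + 8 + 8 + 4 bytes, no padding */

struct xar_header {
    uint32_t magic;
    uint16_t size;
    uint16_t version;
    uint64_t toc_length_compressed;
    uint64_t toc_length_uncompressed;
    uint32_t cksum_alg;
};

/* Read an n-byte unsigned integer stored in network (big-endian) order. */
static uint64_t read_be(const unsigned char *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Hypothetical helper: decode a header from the first bytes of an archive.
 * Returns 0 on success, -1 if the buffer is too short, the magic does not
 * match, or the size field is smaller than the fixed fields (assumption).
 */
static int xar_header_parse(const unsigned char *buf, size_t len,
                            struct xar_header *out)
{
    assert(buf != NULL && out != NULL);
    if (len < XAR_HEADER_WIRE_SIZE)
        return -1;

    out->magic                   = (uint32_t)read_be(buf, 4);
    out->size                    = (uint16_t)read_be(buf + 4, 2);
    out->version                 = (uint16_t)read_be(buf + 6, 2);
    out->toc_length_compressed   = read_be(buf + 8, 8);
    out->toc_length_uncompressed = read_be(buf + 16, 8);
    out->cksum_alg               = (uint32_t)read_be(buf + 24, 4);

    if (out->magic != XAR_HEADER_MAGIC)
        return -1;
    if (out->size < XAR_HEADER_WIRE_SIZE)
        return -1;
    return 0;
}
```

Decoding byte by byte keeps the result independent of the host's endianness and of whatever padding the compiler adds to struct xar_header.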

The Table of Contents

The table of contents is an XML document, which should be encoded as UTF-8.

<?xml version="1.0"?>

<xar>
  <toc>
    <checksum style="sha1">
      <size>20</size>
      <offset>0</offset>
    </checksum>
    <file id="1">
      <name>xar</name>
      <type>file</type>
      <mode>0755</mode>
      <uid>0</uid>
      <gid>0</gid>
      <user>root</user>
      <group>wheel</group>
      <size>81180</size>
      <data>
        <offset>0</offset>
        <size>74108</size>
        <length>23083</length>
        <extracted-checksum style="md5">d852c77ac3c8e83f312c12b4c3198e6d</extracted-checksum>
        <archived-checksum style="md5">ceaf793ccb1990ecbadb20112d5f9e5d</archived-checksum>
        <encoding style="application/x-gzip"/>
      </data>
      <ea>
        <name>com.apple.ResourceFork</name>
        <offset>0</offset>
        <size>7072</size>
        <length>3942</length>
        <extracted-checksum style="md5">0f7061dca2d7411352377db0e53792db</extracted-checksum>
        <archived-checksum style="md5">c72de8ac25abe462a930254d82958534</archived-checksum>
        <encoding style="application/x-gzip"/>
      </ea>
    </file>
  </toc>
</xar>

The Heap

As its name suggests, the heap is an unstructured heap of data referenced by the table of contents. It is recommended that implementations use the heap as efficiently as possible, defragment it during archive creation, and order it sensibly. For an archive to be streamable, all of a file's heap entries must be grouped together, with extended attributes coming before the data portion of the file. When streaming, the heap entries are extracted in the order they appear, and the EA data must be extracted before the file data so that the proper security context can be set on the file before its data is extracted.
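
The streaming rule can be checked mechanically once the heap entries are known. The following is a rough sketch, assuming the entries have already been collected from the <data> and <ea> elements of the table of contents and listed in heap order. The struct heap_entry and the function heap_is_streamable are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one heap entry from the table of contents. */
struct heap_entry {
    uint32_t file_id; /* id attribute of the owning <file> element */
    bool     is_ea;   /* true for an <ea> entry, false for <data> */
    uint64_t offset;  /* value of the entry's <offset> element */
};

/*
 * Returns true if the entries, given in heap order, satisfy the streaming
 * rule: each file's entries are grouped together, and within a file every
 * extended attribute comes before the data portion.
 */
static bool heap_is_streamable(const struct heap_entry *e, size_t n)
{
    assert(e != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        /* The caller promised heap order; treat a violation as a failure. */
        if (e[i].offset < e[i - 1].offset)
            return false;

        if (e[i].file_id == e[i - 1].file_id) {
            /* Same file: an EA must not follow the data portion. */
            if (e[i].is_ea && !e[i - 1].is_ea)
                return false;
        } else {
            /* New file: it must not have appeared earlier in the heap. */
            for (size_t j = 0; j < i; j++)
                if (e[j].file_id == e[i].file_id)
                    return false;
        }
    }
    return true;
}
```

The inner scan is quadratic, which keeps the sketch short; a real writer would more likely enforce the order while laying out the heap instead of checking it afterwards.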

Comment by paracel...@gmail.com, Dec 13, 2008

It would really help to have some proper specs for this file format, so people could implement their own readers and writers without having to read source code. The first obvious questions that come to mind are:

  1. What is "cksum_alg"?
  2. Where is the table of contents located in the file?
  3. How is the table of contents stored? Apparently it's compressed, but with what algorithm?
  4. What fields can exist in the table of contents?
  5. What is the format of the data for each one?
  6. Which fields are mandatory?
  7. How are directories stored?
  8. How are links stored?
  9. Assuming the "<encoding>" field specifies a compression algorithm, what values can it have and what algorithms do those correspond to?
  10. What does "application/x-gzip" mean? Is it a gzip file with its own header? Is it a zlib-format deflate stream? Is it a zip-format deflate stream?

Comment by paracel...@gmail.com, Dec 14, 2008

Also, having started trying to figure these things out by trial and error, I've found that the example on this page is apparently not even correct. The <ea> element seems to not work like that at all.

Comment by nicolas....@gmail.com, May 22, 2009

That's not a spec, that's an example of a document. Just because it's XML doesn't mean you can get away with that.

A spec should say what we can rely on and what we can't. Based on just an example, I may, for example, assume that <offset> always comes after <size> in a <checksum> element.

Comment by npac...@gmail.com, Apr 25, 2010

More detailed file layout:

  • Header
  • TOC
    1. Compressed body (deflate+zlib header/footer)
    2. Checksum (cksum_alg: 0=none/1=SHA1/2=MD5)
  • Compressed blocks
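
The layout listed above can be turned into byte offsets. The following is a sketch only, assuming the compressed table of contents starts right after the header (at the header's size field), and that the checksum follows it directly, which matches the <checksum> element at heap offset 0 with size 20 in the example TOC. The cksum_alg values are the ones given in this comment; the names xar_cksum_size, xar_layout and xar_layout_compute are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* cksum_alg values as listed in the comment above. */
enum xar_cksum_alg {
    XAR_CKSUM_NONE = 0,
    XAR_CKSUM_SHA1 = 1,
    XAR_CKSUM_MD5  = 2
};

/* Digest length in bytes for a cksum_alg value, or -1 if it is unknown. */
static int xar_cksum_size(uint32_t alg)
{
    switch (alg) {
    case XAR_CKSUM_NONE: return 0;
    case XAR_CKSUM_SHA1: return 20;
    case XAR_CKSUM_MD5:  return 16;
    default:             return -1;
    }
}

/* Hypothetical summary of where each region starts in the file. */
struct xar_layout {
    uint64_t toc_offset;   /* first byte of the compressed TOC */
    uint64_t cksum_offset; /* first byte after the compressed TOC (assumed heap start) */
    uint64_t cksum_length; /* digest length in bytes, 0 if no checksum */
};

/* Returns 0 on success, -1 on an unknown algorithm or arithmetic overflow. */
static int xar_layout_compute(uint16_t header_size,
                              uint64_t toc_length_compressed,
                              uint32_t cksum_alg,
                              struct xar_layout *out)
{
    assert(out != NULL);
    int digest = xar_cksum_size(cksum_alg);
    if (digest < 0)
        return -1;
    if (toc_length_compressed > UINT64_MAX - header_size)
        return -1;

    out->toc_offset   = header_size;
    out->cksum_offset = (uint64_t)header_size + toc_length_compressed;
    out->cksum_length = (uint64_t)digest;
    return 0;
}
```

With a 28-byte header, a 256-byte compressed TOC and SHA1, this places the TOC at byte 28 and a 20-byte checksum at byte 284.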

Comment by j...@thewolfweb.com, Aug 9, 2011

What garbage. This is not valid XML:

<extracted-checksum style="md5">d852c77ac3c8e83f312c12b4c3198e6d</checksum>

Are we supposed to trust this? I'm having a hell of a time trying to encode something, and the inaccuracies here aren't helping.

