What's new? | Help | Directory | Sign in
Google
debunk2
Read MS Outlook autocomplete (NK2) files and extract email addresses
  
  
  
  
    
Search
for
Updated Jan 26, 2007 by havard.dahle
Labels: Featured, Phase-Implementation
fileformat  
The NK2 file format explained.

Introduction

Microsoft Outlook stores its autocomplete data in a specially-crafted, binary file (NK2). Its format is not documented, presumably because it is regarded as an internal data structure to the program. Still, the contents of that file are often valuable, and are definately worth keeping.

The python library of this project, _nk2parser.py_ successfully pulls out names and email addresses from the file. Here's how.

Details

Contents

At the time of writing, the author is aware of the following resources withing the file:

According to internet sources(1), there's more:

There is no reason to expect this to be untrue, but your author has noe clue whether it is and how to extract that info if it is.

Unraveling the stuff

1. Luckily, the most important data is also most easy to pull out. The email addresses are stored as ascii strings, so by just running strings on the file, you'll get all the addresses.

2. To get the names assosciated with the addresses, you have to dig deeper.

The following is an attempt of describing what works for debunk2.

Record separators

Looking at the NK2 file in an hex viewer or something that can print unprintable bytes so that a human can interpret it is time-consuming, exhausting and not something that someone should have to do when there is plenty of green left in the world. It is also the only way of examining the blasted thing.

Since there is so much green yet to be experienced, your author took some sharp turns and elbow-greased some dirty code into chopping the NK2 file into records.

Thus, debunk2 splits at the following byte strings to make sense out of the NK2 files:

sep1 = '\x04H\xfe\x13' # record separator
sep2 = '\x00\xdd\x01\x0fT\x02\x00\x00\x01'  # record separator

After the file is split at both these byte sequences, the program runs through all records. A lot of the records are duplicates (they aren't really, but since we're only after the name and email addresses, they're redundant).

This isn't perfect, but it gets the job done.

NUL-byte separated strings

Most record fields are separated by triple NUL-bytes (ASCII 0). Some fields are separated by single NUL bytes.

  split1 = [x.replace(NUL, '') for x in y.split(NUL*3)] # SPLIT1: split record into something useful by separating at triple NULs
  if split1[1] != 'SMTP': # SPLIT1 failed
    split2 = [x.replace(NUL, '') for x in y.split(NUL*1)] # SPLIT2: split again, this time using single NULs as delimiter

Most field strings are sprinkled with NUL-bytes (every second byte, actually). This is how nk2parser.py does it:

  def strp(self, s):
	  "Return string stripped of NUL bytes and unprintable characters"
	  return s.replace(NUL, '')#.replace('\x00', '')

SMTP addresses

prefixed with "SMTP:"

Little help, please?

References


Comment by davehowe, Dec 19, 2007

Probably this is a dual-byte character encoding, so that you can store non-ascii characters

http://en.wikipedia.org/wiki/DBCS


Sign in to add a comment