2005.10.18 10:20 "[Tiff] Notes on Microsoft Office Document Imaging file format", by Brad Hards

I've been looking at the Microsoft Office Document Imaging (.mdi) file format. Notes to date are below - hope this helps. Also, any suggestions or updates would be appreciated.

Brad

An MDI file contains images of each page, along with the text on that page. The format is based on TIFF.

Uses different magic number to TIFF: 0x5045

Same version number: 0x002a
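
As a rough illustration, the two header values above could be checked as follows. This is only a sketch: it assumes both fields are stored as little-endian 16-bit integers (as in an "II" TIFF), which the notes do not confirm, and `looks_like_mdi` is a made-up helper name.

```python
import struct

MDI_MAGIC = 0x5045      # magic number from the notes
TIFF_VERSION = 0x002A   # same version number as plain TIFF

def looks_like_mdi(header: bytes) -> bool:
    """Return True if the first four bytes match the MDI magic and version."""
    if len(header) < 4:
        return False
    # Assumption: both fields are little-endian 16-bit integers.
    magic, version = struct.unpack("<HH", header[:4])
    return magic == MDI_MAGIC and version == TIFF_VERSION
```

Under that assumption the file would begin with the bytes 0x45 0x50 0x2a 0x00, and a normal little-endian TIFF header ("II" followed by 0x2a 0x00) would be rejected.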

There are unknown fields:

37679 - appears on every page, always starts with 0x01 0x00, then varies

37680 - only appears to occur on the first page, always appears to be length 4096, always starts with 0xd0 0xcf 0x11 0xe0 0xa1 0xb1 0x1a 0xe1, then a string of zeros, and then varies.

37681 - appears on every page, always starts with 0x02 0x00 (+ 0x00, 0x00?), then varies

These unknown properties appear to occur in both TIFF and MDI files.

37679 - looks like the text version of the document contents. The contents are 0x01 0x00, followed by a length (4 bytes, aka long) which is 6 bytes less than the actual length of this field, followed by the ASCII text version. Each phrase is delimited by a space followed by a newline (0x20 0x0a aka ' \n'). The end is 0x0d 0x00.
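
The layout described for 37679 can be sketched as a small parser. It assumes the 4-byte length is little-endian and that the whole field payload has already been read into memory. `parse_text_field` is a hypothetical name, not part of any library.

```python
import struct

def parse_text_field(payload: bytes) -> list:
    """Split a tag 37679 payload into phrases, following the layout in the notes."""
    if len(payload) < 6 or payload[:2] != b"\x01\x00":
        raise ValueError("payload does not start with 0x01 0x00")
    # Assumption: the 4-byte length is little-endian.
    (declared,) = struct.unpack("<I", payload[2:6])
    if declared != len(payload) - 6:
        raise ValueError("length field is not 6 less than the field length")
    body = payload[6:]
    # Drop the 0x0d 0x00 terminator if it is present.
    if body.endswith(b"\x0d\x00"):
        body = body[:-2]
    text = body.decode("ascii", errors="replace")
    # Phrases are delimited by a space followed by a newline.
    return [phrase for phrase in text.split(" \n") if phrase]
```

The length check is strict on purpose, so that a file which does not follow the "6 bytes less" observation shows up as an error instead of being parsed silently.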

37680 might be some kind of metadata dictionary. It is located at the end of the file, and there are 16-bit wide characters that look like "Root Entry", "CONTENTS" (sometimes more than once, even if only one page), "prop2" (sometimes more than once), "prop3" (sometimes more than once), "DICT", "Summary Information", "Owner" and some names. There might be some random stuff / fill in there too.
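
One way to pull those 16-bit strings out of the 37680 data is sketched below. It assumes the characters are little-endian and only looks for printable ASCII. `wide_strings` is a made-up helper. The eight leading bytes of the field are the same as the signature that starts OLE2 compound files, which would fit with names like "Root Entry" and "Summary Information", though the notes have not confirmed that the rest of the field follows that layout.

```python
import re

# First eight bytes of field 37680, as listed in the notes.
OLE2_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")

def wide_strings(blob: bytes, min_chars: int = 4) -> list:
    """Return runs of printable ASCII stored as 16-bit little-endian characters."""
    # Each character is one printable byte followed by a zero byte.
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_chars)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(blob)]
```

Short runs below `min_chars` are skipped to cut down on accidental matches in the random stuff / fill.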

There also appears to be a consistent string, "AuvsxjatP0udlw1Aaq5eubr5h" (this might not be ASCII though - there is always a 0x05 0x00 on the front of it).

37681 hasn't been looked at yet - possibly the thumbnail image?

There are new kinds of image compression (259, 0x0103):

Plus existing TIFF compression types can be used. MDI appears to be mostly MOD_VECTOR. Don't know how any of this works yet.
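
To see which compression value and which of the unknown fields a given file uses, the first directory can be walked the same way as in TIFF. The sketch below assumes the standard TIFF layout carries over unchanged (directory offset at byte 4, a 2-byte entry count, then 12-byte entries) and that everything is little-endian. The function names are made up for illustration.

```python
import struct

COMPRESSION_TAG = 259                  # 0x0103
UNKNOWN_TAGS = (37679, 37680, 37681)   # the fields discussed in the notes

def first_ifd_tags(data: bytes) -> dict:
    """Map tag id -> (type, count, raw 4-byte value/offset) for the first directory."""
    # Assumption: little-endian, standard TIFF directory layout.
    (ifd_offset,) = struct.unpack_from("<I", data, 4)
    (count,) = struct.unpack_from("<H", data, ifd_offset)
    tags = {}
    for i in range(count):
        entry_pos = ifd_offset + 2 + 12 * i
        tag, ftype, n, raw = struct.unpack_from("<HHI4s", data, entry_pos)
        tags[tag] = (ftype, n, raw)
    return tags

def compression_of(data: bytes):
    """Return the value of tag 259, or None if the tag is absent."""
    entry = first_ifd_tags(data).get(COMPRESSION_TAG)
    if entry is None:
        return None
    # In TIFF, Compression is a single SHORT stored inline in the value field.
    return struct.unpack("<H", entry[2][:2])[0]

def unknown_tags_present(data: bytes) -> list:
    """List which of the three unknown fields appear in the first directory."""
    tags = first_ifd_tags(data)
    return [t for t in UNKNOWN_TAGS if t in tags]
```

This only looks at the first page. Later pages would need the next-directory offset that follows the last entry, again assuming MDI keeps that part of TIFF.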