2006.04.22 03:37 "[Tiff] Microsoft Document Imaging status / snapshot", by Brad Hards

After a very long hiatus (I've been working on some Qt-based crypto), I recently spent a little bit of time working on the .mdi file format extensions to tiff.

I've almost got the "OCR'd text" tag sorted out: tag 37679 looks like the text version of the document contents. The contents are 0x01 0x00, followed by a length (4 bytes, aka long) whose value is 6 bytes less than the actual length of this field (that is, it does not count the two leading bytes or the length itself), followed by the text version, which is mostly ASCII. Each phrase is delimited by a space followed by a newline (0x20 0x0a aka ' \n'). The end is 0x0d 0x00. There are sometimes additional bytes (e.g. 0xe2 0x80 0x9c) which appear to be some kind of character / symbol encoding. Combinations include:

0xef  0x82  0xa7  = U+F0A7 (private use area), some kind of bullet point symbol
0xef  0x82  0xb7  = U+F0B7 (private use area), some kind of bullet point symbol (different to a7)
0xe2  0x80  0x93  = U+2013, en dash
0xe2  0x80  0x9c  = U+201C, `` (smart doublequotes, left side of quoted material)
0xe2  0x80  0x9d  = U+201D, '' (smart doublequotes, right side of quoted material)
0xe2  0x80  0x99  = U+2019, ' (right single quote, used as an apostrophe)
0xe2  0x80  0xa6  = U+2026, ellipsis
0xe2  0x80  0x94  = U+2014, em dash
0xc3  0xa9        = U+00E9, e with acute

All of these are valid UTF-8 sequences for the code points shown, so the pattern appears to be that the text is UTF-8 encoded.
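The layout of tag 37679 and the multi-byte sequences listed above can be checked with a few lines of C. This is only a sketch under stated assumptions: the 4-byte length is assumed to be little-endian (the byte order is not stated above), the length is assumed to count everything after the 6-byte prefix, and `mdi_ocr_text` and `utf8_decode` are hypothetical helper names, not part of libtiff. The decoder does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Locate the text inside a tag 37679 payload (hypothetical helper).
 * Assumed layout: 0x01 0x00 | 4-byte length | text | 0x0d 0x00
 * The length is assumed little-endian and equal to (size - 6).
 * Returns a pointer to the text and stores its length (terminator
 * excluded) in *text_len, or returns NULL if the layout does not match.
 */
const unsigned char *
mdi_ocr_text(const unsigned char *buf, size_t size, size_t *text_len)
{
    uint32_t stored;

    assert(buf != NULL && text_len != NULL);
    if (size < 8 || buf[0] != 0x01 || buf[1] != 0x00)
        return NULL;
    stored = (uint32_t) buf[2] | ((uint32_t) buf[3] << 8) |
             ((uint32_t) buf[4] << 16) | ((uint32_t) buf[5] << 24);
    if (stored != size - 6)
        return NULL;
    if (buf[size - 2] != 0x0d || buf[size - 1] != 0x00)
        return NULL;
    *text_len = size - 8;
    return buf + 6;
}

/*
 * Decode one UTF-8 sequence from p (n bytes available).
 * Returns the number of bytes consumed and stores the code point in *cp,
 * or returns 0 if the sequence is truncated or malformed.
 */
size_t
utf8_decode(const unsigned char *p, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c;

    assert(p != NULL && cp != NULL);
    if (n == 0)
        return 0;
    if (p[0] < 0x80) {                      /* plain ASCII */
        *cp = p[0];
        return 1;
    } else if ((p[0] & 0xe0) == 0xc0) {     /* 2-byte sequence */
        len = 2; c = p[0] & 0x1f;
    } else if ((p[0] & 0xf0) == 0xe0) {     /* 3-byte sequence */
        len = 3; c = p[0] & 0x0f;
    } else if ((p[0] & 0xf8) == 0xf0) {     /* 4-byte sequence */
        len = 4; c = p[0] & 0x07;
    } else {
        return 0;                           /* stray continuation byte */
    }
    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xc0) != 0x80)          /* must be 10xxxxxx */
            return 0;
        c = (c << 6) | (p[i] & 0x3f);
    }
    *cp = c;
    return len;
}
```

Calling `utf8_decode` on the three bytes 0xe2 0x80 0x9c consumes all three and yields U+201C, which matches the left smart quote in the list.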

My current set of notes is attached.

I've been hacking libtiff to try to figure out what is going on. A cvs diff is also attached - you'll need the tif_mdi.c file as well, if you want it to compile (it goes into libtiff/libtiff/).

Work on the actual image content (compression type) has only just started. Right now I have absolutely no idea what the format could be, although it certainly does appear to use some kind of compression. I generated a few trivial files (a large filled blue rectangle, the same rectangle without fill, the same rectangle in a slightly different shade of blue, and the same rectangle in green). Here is what the content looks like:

/home/bradh/mdi/greenrect.MDI:

char count: 336
02 00 00 00 7b 00 00 00 ac 02 00 00 03 00 00 00 00 00 00 00 ff ff ff ff 00 00
00 00 78 01 5d 91 bd 4a c4 50 10 85 cf cd 46 37 82 8a 18 11 41 8b b5 f0 0f 1b
c1 b5 b7 d1 46 5c c4 42 7b d1 42 10 0b 5d b0 dd c2 c2 97 c8 1b 58 28 68 ef 33
f8 04 3e 85 ed fa 9d 35 37 9b cd c0 c9 3d 33 77 66 ce 64 6e 90 f4 0c 8a 20 e5
89 d4 cb e0 6d 02 a5 65 e7 52 da 97 3a 47 a7 c7 52 d0 2b 39 eb dc b5 c0 2c b8
2b f3 de 66 a4 8f 39 e9 1b 7f 87 5e 75 3b e8 b6 b4 ff 92 ea 4c 0f ba d5 bd fa
ea 80 1b 3d f2 b5 6d 00 f7 42 76 10 39 6e c5 e7 e1 ab 04 16 40 34 a4 26 fc 2b
7c f7 b0 f4 75 c9 97 a8 7b 82 73 4e 58 d4 e0 57 2b 8d 29 f8 2e 59 2b e0 eb 67
38 ac 23 e6 37 e7 5b 26 d7 9a ae 59 2c b9 f5 ad e7 b8 67 ac f3 13 fc 43 b0 07
d0 63 0b ff f3 ba a6 fe 6f ef ec f2 d3 c5 a5 45 3f ce e1 b9 b7 b8 b3 ae 6b 47
36 10 2f 33 de a5 e3 17 c0 5a cd b7 b5 76 11 f2 a4 97 19 45 bb 08 46 9e 6c 12
5f 03 a3 46 9c 95 d1 3b 6a fb ee 12 74 41 b3 ef 34 b1 df 74 dc 2f f6 77 2c d6
37 77 b8 4d 8d 77 e5 be 91 7b 76 de bc 7a 37 ef c6 7b fc 03 fe c8 3b 4d
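One detail in the dump above may be worth following up. After seven 4-byte words (28 bytes), the data continues with 78 01, and those two bytes pass the header check for a zlib stream as defined in RFC 1950 (deflate method, 32K window, check value divisible by 31). That is only a hint, not proof that the rest is a zlib stream. A minimal sketch of the check, assuming the leading words are little-endian; `read_le32` and `looks_like_zlib` are hypothetical helper names, not part of libtiff:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit word, assuming little-endian byte order. */
uint32_t
read_le32(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

/*
 * Return 1 if the first two bytes at p form a valid zlib (RFC 1950)
 * header for the deflate method, 0 otherwise.
 */
int
looks_like_zlib(const unsigned char *p, size_t n)
{
    assert(p != NULL);
    if (n < 2)
        return 0;
    if ((p[0] & 0x0f) != 8)     /* CM: compression method 8 = deflate */
        return 0;
    if ((p[0] >> 4) > 7)        /* CINFO: window size, at most 32K */
        return 0;
    /* FCHECK: the 16-bit header value must be a multiple of 31 */
    return (((unsigned) p[0] << 8) | p[1]) % 31 == 0;
}
```

With the first 30 bytes of the dump, `read_le32` gives 2, 123, 684, 3, 0, 0xffffffff and 0 for the seven leading words, and `looks_like_zlib` accepts the bytes at offset 28. If the hint holds, feeding the data from offset 28 onwards to zlib's inflate would be the obvious next experiment.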

/home/bradh/mdi/bluerect.MDI:
char count: 337
02 00 00 00 7b 00 00 00 ac 02 00 00 03 00 00 00 00 00 00 00 ff ff ff ff 00 00
00 00 78 01 5d 91 bd 4a c4 50 10 85 4f b2 d1 8d a0 22 46 44 d0 62 2d fc c3 46
70 ed 6d dc 66 71 11 0b ed 45 0b 41 2c 74 c1 d6 c2 c2 97 c8 1b 58 28 28 58 fa
0c 3e 81 4f 61 1b bf b3 e6 66 b3 19 38 b9 67 e6 ce cc 99 cc 8d 24 3d 81 3c 92
b2 58 1a a4 f0 36 81 d2 d2 53 29 19 4a 9d a3 e3 9e 14 e9 85 9c 75 ee 5a 60 16
dc 94 79 af 33 d2 fb 9c f4 8d bf 43 af ba 1d 74 5b da 7f 4e 74 a2 3b 5d eb 56
43 75 c0 95 ee f9 da 36 80 7b 21 fb 18 38 6e c5 e7 e1 ab 04 16 40 30 a4 26 fc
0b 7c f7 b0 f4 65 c9 97 a8 7b 80 73 4e 58 d0 e0 57 2b 8d 29 f8 2e 59 2b e0 eb
a7 28 ea 08 f9 cd f9 96 c9 b5 a6 6b 16 4b 6e 7d eb 39 ee 19 eb bc 8f 7f 08 f6
00 7a 6c e1 7f 5e d7 d4 ff ed 8d 5d 7e b8 b8 b4 e0 87 39 3c f7 16 77 d6 75 ed
c8 3e 0b 5e 66 bc 4b c7 cf 80 b5 9a 6f 6b ed 3c ca e2 41 6a e4 ed 3c 32 b2 78
93 f8 1a 18 35 e2 ac 8c de 41 db 77 e7 a0 0b 9a 7d a7 89 fd 26 e3

MDI contains images of the page, and the text that it contains. Based on TIFF format.

Office Document Imaging creates MDI files in these formats:

Office Document Imaging supports:

Office Document Imaging does not support:

MDI supports document annotations.

Annotations toolbar.

Each MDI document consists of an ordered collection of pages (images), plus metadata.

The metadata has a standard set of properties, plus a custom set. Standard ("built in") properties include "Title", "Author" and "Creation Date". Title and Author may not be present. "Last print date" and "Last save time" are "available but not used".

File format can vary:

Compression level can vary

Compression type (259, 0x0103) can vary:

Variable font families:

Variable face styles:

Serif styles:

Languages:

Thumbnail sizes:

 -_MEDIUM 2

Each page can have different image properties:

Layout - provides summary information (such as the number of words) about the recognized text on the page and gives access to the recognized text itself and to each individual word in the text.

The Word object exposes additional information about each word's font, its location on the page, and even the OCR engine's RecognitionConfidence factor, which estimates the likelihood of a recognition error:

Layout properties

? autom4te.cache
? rects.txt
? tiffmdi-snapshot-2006-04-22.patch
? libtiff/tif_mdi.c
Index: libtiff/Makefile.am
===================================================================
RCS file: /cvs/maptools/cvsroot/libtiff/libtiff/Makefile.am,v
retrieving revision 1.21
diff -u -4 -p -r1.21 Makefile.am
--- libtiff/Makefile.am 21 Apr 2006 14:18:54 -0000      1.21
+++ libtiff/Makefile.am 22 Apr 2006 04:16:56 -0000
@@ -70,8 +70,9 @@ SRCS = \
        tif_getimage.c \
        tif_jpeg.c \
        tif_luv.c \
        tif_lzw.c \
+       tif_mdi.c \
        tif_next.c \
        tif_ojpeg.c \
        tif_open.c \
        tif_packbits.c \
Index: libtiff/tif_codec.c
===================================================================
RCS file: /cvs/maptools/cvsroot/libtiff/libtiff/tif_codec.c,v
retrieving revision 1.10
diff -u -4 -p -r1.10 tif_codec.c
--- libtiff/tif_codec.c 21 Dec 2005 12:23:13 -0000      1.10
+++ libtiff/tif_codec.c 22 Apr 2006 04:16:56 -0000
@@ -68,8 +68,11 @@ static       int NotConfigured(TIFF*, int);
 #endif
 #ifndef LOGLUV_SUPPORT
 #define TIFFInitSGILog         NotConfigured
 #endif
+#ifndef MDI_SUPPORT
+#define TIFFInitMDI            NotConfigured
+#endif

 /*
  * Compression schemes statically built into the library.
  */
@@ -94,8 +97,12 @@ TIFFCodec _TIFFBuiltinCODECS[] = {
     { "AdobeDeflate",   COMPRESSION_ADOBE_DEFLATE , TIFFInitZIP }, 
     { "PixarLog",      COMPRESSION_PIXARLOG,   TIFFInitPixarLog },
     { "SGILog",                COMPRESSION_SGILOG,     TIFFInitSGILog },
     { "SGILog24",      COMPRESSION_SGILOG24,   TIFFInitSGILog },
+    /* TODO - add proper decompression for these */
+    { "MODI BW",       COMPRESSION_MODI_BLC,   TIFFInitMDI },
+    { "MODI Colour",   COMPRESSION_MODI_PTC,   TIFFInitMDI },
+    { "MODI Vector",   COMPRESSION_MODI_VECTOR, TIFFInitMDI },
     { NULL, 0, NULL }
 };

 static int
Index: libtiff/tif_dirinfo.c
===================================================================
RCS file: /cvs/maptools/cvsroot/libtiff/libtiff/tif_dirinfo.c,v
retrieving revision 1.62
diff -u -4 -p -r1.62 tif_dirinfo.c
--- libtiff/tif_dirinfo.c       7 Feb 2006 10:45:38 -0000       1.62
+++ libtiff/tif_dirinfo.c       22 Apr 2006 04:16:57 -0000
@@ -268,8 +268,16 @@ tiffFieldInfo[] = {
     { TIFFTAG_STONITS,          1, 1,  TIFF_DOUBLE,    FIELD_CUSTOM,
       0,       0,      "StoNits" },
     { TIFFTAG_INTEROPERABILITYIFD, 1, 1, TIFF_LONG,    FIELD_CUSTOM,
       0,       0,      "InteroperabilityIFDOffset" },
+    { TIFFTAG_MDIOCRTEXT,    -1, -1,   TIFF_UNDEFINED,  FIELD_CUSTOM,
+      0,        0,  "TextContentsMDI" },
+/* MDI tag for document level metadata? */
+    { TIFFTAG_MDIMETADATA,    -1, -1, TIFF_UNDEFINED,   FIELD_CUSTOM,
+      0,        0,  "MDIMetaData" },
+/* MDI tag for page thumbnail? */
+    { TIFFTAG_MDITHUMBNAIL,  -1, -1,  TIFF_UNDEFINED,   FIELD_CUSTOM,
+      0,        0,  "MDIThumbnail" },
 /* begin DNG tags */
     { TIFFTAG_DNGVERSION,      4, 4,   TIFF_BYTE,      FIELD_CUSTOM, 
       0,       0,      "DNGVersion" },
     { TIFFTAG_DNGBACKWARDVERSION, 4, 4,        TIFF_BYTE,      FIELD_CUSTOM, 
Index: libtiff/tif_dirread.c
===================================================================
RCS file: /cvs/maptools/cvsroot/libtiff/libtiff/tif_dirread.c,v
retrieving revision 1.84
diff -u -4 -p -r1.84 tif_dirread.c
--- libtiff/tif_dirread.c       4 Apr 2006 02:00:08 -0000       1.84
+++ libtiff/tif_dirread.c       22 Apr 2006 04:16:57 -0000
@@ -29,8 +29,9 @@
  *
  * Directory Read Support Routines.
  */
 #include "tiffiop.h"
+#include "ctype.h"

 #define        IGNORE  0               /* tag placeholder used below */

/*
 * Copyright (c) Brad Hards <bradh@frogmouth.net>
 */

#include "tiffiop.h"
#ifdef MDI_SUPPORT

/*
 * TIFF Library.
 *
 * MDI Image Support
 *
 */
static int
MDISetupDecode(TIFF* tif)
{
    return (1);
}

/*
 * Setup state for decoding a strip.
 */
static int
MDIPreDecode(TIFF* tif, tsample_t s)
{
    return 1;
}

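/*
 * Debugging aid only: no decompression is implemented yet, so this
 * dumps the raw (still compressed) strip bytes and reports failure.
 */
static int
MDIDecode(TIFF* tif, tidata_t op, tsize_t occ, tsample_t s)
{
    int lv;

    /* 2480*3508*3 is the size in bytes of a 2480x3508 page in 24-bit RGB */
    printf("decode size: %i (%i)\n", occ, 2480*3508*3);
    printf("char count: %i\n", tif->tif_rawcc);
    for (lv = 0; lv < tif->tif_rawcc; ++lv) {
        printf("%02x ", tif->tif_rawcp[lv]);
    }
    printf("\n");
    return 0;   /* 0 tells libtiff that decoding failed */
}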

static int
MDISetupEncode(TIFF* tif)
{
    return 0;
}

/*
 * Reset encoding state at the start of a strip.
 */
static int
MDIPreEncode(TIFF* tif, tsample_t s)
{
    return 0;
}

/*
 * Encode a chunk of pixels.
 */
static int
MDIEncode(TIFF* tif, tidata_t bp, tsize_t cc, tsample_t s)
{
    return (1);
}

/*
 * Finish off an encoded strip.
 */
static int
MDIPostEncode(TIFF* tif)
{
    return 1;
}

static void
MDICleanup(TIFF* tif)
{
}

int
TIFFInitMDI(TIFF* tif, int scheme)
{
        assert( (scheme == COMPRESSION_MODI_BLC)
                || (scheme == COMPRESSION_MODI_VECTOR)
                || (scheme == COMPRESSION_MODI_PTC)
                );
        (void) scheme;  /* only used by the assert above */

        /* printf("Init MDI\n"); */

        /*
         * Install codec methods.
         */
        tif->tif_setupdecode = MDISetupDecode;
        tif->tif_predecode = MDIPreDecode;
        tif->tif_decoderow = MDIDecode;
        tif->tif_decodestrip = MDIDecode;
        tif->tif_decodetile = MDIDecode;
        tif->tif_setupencode = MDISetupEncode;
        tif->tif_preencode = MDIPreEncode;
        tif->tif_postencode = MDIPostEncode;
        tif->tif_encoderow = MDIEncode;
        tif->tif_encodestrip = MDIEncode;
        tif->tif_encodetile = MDIEncode;
        tif->tif_cleanup = MDICleanup;

        return 1;       /* codec init functions return 1 on success */
}
#endif
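
The dump loop in MDIDecode writes the whole strip on a single line, which is awkward to read and to paste into mail. A small standalone sketch that wraps the output at a fixed number of bytes per line might look like this; `hex_dump` is a hypothetical helper, not a libtiff function, and it is shown outside the codec so that it can be tried on its own:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write n bytes from data to out as two-digit hex values, per_line
 * bytes on each line. Returns the number of lines written.
 */
size_t
hex_dump(FILE *out, const unsigned char *data, size_t n, size_t per_line)
{
    size_t i, lines = 0;

    assert(out != NULL && per_line > 0);
    for (i = 0; i < n; i++) {
        /* end the line after every per_line bytes and after the last byte */
        int end_of_line = ((i + 1) % per_line == 0) || (i + 1 == n);

        fprintf(out, "%02x%c", data[i], end_of_line ? '\n' : ' ');
        if (end_of_line)
            lines++;
    }
    return lines;
}
```

Inside MDIDecode the call would be along the lines of `hex_dump(stdout, tif->tif_rawcp, tif->tif_rawcc, 26)`, which for a 336-byte strip produces 13 lines.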