2009.02.24 16:07 "[Tiff] Converting JPEG to Multipage TIFF", by Richard Nolde

Zhang,

Hi. I can see the implementation difficulty here. However, I still suggest that one of the developers be brave and take this job on, because it has a practical use case.

Users working in companies that deal with a lot of documents find them easier to manage if the pages of text that were part of the same original document are joined into a single TIFF instead of a zip archive of hundreds of images. The ease of management comes from being able to fax the document as a whole, to apply an operation to all pages at once (e.g. cutting the margins), and to be sure the pages are always in the right order regardless of how the user prefers to sort files in directories.

However, chances are they received the documents as a lot of JPEG files in the first place, and they naturally want to enclose them in TIFF or PDF without losing quality.

Most commercially available scanner software produces PDFs that are either 8-bit grayscale, 24-bit RGB, or 32-bit CMYK, and users often don't understand how much disk storage each format will consume. We have a home-grown document imaging system that contains over 5 million documents, ranging in size from one page to 570 pages each. Our particular pipeline works like this:

Files are scanned as bilevel, Group 4 compressed multipage TIFFs from all over our wide area network. Any files that are not in that format are converted to it on the fly, so we are sure that is our starting point. Then tiff2ps is used with the -map3 options to invoke the PostScript imagemask operator and produce a Level 3 PostScript file. Ghostscript is then used to produce a PDF file, which is little more than a wrapper around the TIFF images. Over millions of pages, we see an average storage requirement of 35 to 50 KB per page, whereas PDF files straight off the scanners (various models of standalone and high-speed copier/scanners) run anywhere from 150 KB to 10 MB per page, depending on the original settings and software package. Very few of the vendor-provided scanning packages offer to produce bilevel images in PDF with compression as efficient as Group 4.
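The tiff2ps and Ghostscript steps above can be sketched as a small Python wrapper. This is only an illustration, not the script our system runs: it assumes "-map3" is the bundled form of the tiff2ps flags -m, -a, -p and -3, that tiff2ps and gs are on the PATH, and that the input is already a bilevel Group 4 TIFF. The function names are made up for the example.

```python
import subprocess
from pathlib import Path


def tiff2ps_command(tiff_path, ps_path):
    """Build the tiff2ps call (assumed expansion of -map3).

    -m  use the imagemask operator where possible
    -a  emit every page (IFD) in the file
    -p  plain (non-encapsulated) PostScript
    -3  PostScript Level 3
    -O  write to the named file instead of stdout
    """
    return ["tiff2ps", "-m", "-a", "-p", "-3",
            "-O", str(ps_path), str(tiff_path)]


def ghostscript_command(ps_path, pdf_path):
    """Build the Ghostscript call that wraps the PostScript in a PDF."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            f"-sOutputFile={pdf_path}", str(ps_path)]


def tiff_to_pdf(tiff_path, pdf_path):
    """Run both steps, removing the intermediate PostScript file afterwards."""
    ps_path = Path(tiff_path).with_suffix(".ps")
    subprocess.run(tiff2ps_command(tiff_path, ps_path), check=True)
    subprocess.run(ghostscript_command(ps_path, pdf_path), check=True)
    ps_path.unlink()
```

Calling tiff_to_pdf("scan.tif", "scan.pdf") would run the two programs in turn. The step that normalizes other input formats to Group 4 TIFF is not shown.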

Since JPEG can be grayscale or color, it is clear that this solution won't necessarily work for the original author of the question, but it might be worth considering what information he needs to store and whether the color information in its original form is all that important. If the JPEG images he receives from clients do not contain important color information, he may wish to convert them to bilevel TIFF instead of grayscale to save space. If he has to preserve some or all of the color information, he may wish to convert them to indexed TIFFs instead of RGB or CMYK PDFs.
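To put rough numbers on the space argument, here is a back-of-the-envelope calculation of the uncompressed size of one page in each pixel format. The page size (US Letter at 300 dpi) is an assumption chosen for illustration, not a figure from our system, and the palette of an indexed image is ignored.

```python
# Bits per pixel for the formats discussed above.
BITS_PER_PIXEL = {
    "bilevel": 1,
    "grayscale": 8,
    "indexed": 8,   # 8-bit palette index; the palette itself is ignored
    "rgb": 24,
    "cmyk": 32,
}

# Assumed page: 8.5 x 11 inches scanned at 300 dpi.
WIDTH_PX = 2550
HEIGHT_PX = 3300


def raw_page_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed size of one page, each row padded to a whole byte."""
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


for name, bits in BITS_PER_PIXEL.items():
    size = raw_page_bytes(WIDTH_PX, HEIGHT_PX, bits)
    print(f"{name:10s} {size:>12,d} bytes")
```

Under these assumptions a bilevel page is about 1 MB before compression and an RGB page about 25 MB, so the choice of pixel format matters before any compression scheme is applied.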

The ability to manipulate multi-page TIFF images as a single file for cropping, rotating, selecting portions for OCR or bar code recognition, etc. is essential to our document processing system. In the Unix and Windows worlds, there are free alternatives to the Acrobat reader that work very well with a much smaller footprint than Acrobat, so I would not be too concerned about depending entirely on one vendor for reading the files. Not all of these are open source, however, and most are only useful for viewing PDF files or converting them to other formats so that we can perform the operations we need on them.

Richard Nolde

Tiffcrop author