[Leptonica Home Page]

Applications


Updated: May 21, 2006

1. Jbig2 image compression
2. Color quantization
3. Color segmentation
4. Document skew
5. Outlines of connected components

1. Jbig2 image compression

What is the origin of the Jbig2 application?

The original jbig facsimile encoder was an effort to make an improvement over the old CCITT fax encoding standard. A little history may help. CCITT used two different encoders, both of which computed run-lengths (of ON pixels) followed by Huffmann encoding on the runs. The Huffmann codes were derived from a set of 8 binary images, scanned at 200 ppi. None of the 8 images had halftones, and one consequence is that the CCITT encodings work very poorly on halftone images. One encoder, G3 (for Group 3), was specifically designed to work in the presence of uncorrected fax errors. It was originally 1-dimensional, encoding each raster line individually. It had a special end-of-line symbol. If an error occurred in a raster line, determined by a check of some error detection bits, it would just copy the previous line. This limited the error propagation to typically a single line. The second encoder, G4 (for Group 4), was designed to work with much better compression on an error-free communication channel. It was two-dimensional, in that it encoded the runs in each line relative to the runs in the line above, and it omitted the end-of-line symbols. It could not be used for fax because any error would be propagated through the rest of the page. Later, the G3 encoder was improved by adopting up to 4 lines of two-dimensional encoding. The first line would use the original one-dimensional G3 method, and the next 1 to 3 lines would use a modified G4 method, still preserving the end-of-line symbols.

This remained the state of the art for nearly 20 years. jbig was designed in the early '90s to give better lossless compression on error-free channels. It could thus not be used on ordinary fax, because fax modems were not designed to correct all errors. However, given the existence of 33.6 KHz data modems, which have very low error rates on typical phone lines, there is no reason why jbig could not be used on modified fax machines that used data modems. Things didn't work out that way, however. New fax machines were not built, and problems arose in the jbig standardization process. I have heard several "sides" of the fiasco, and I believe the main problem was that IBM held several critical patents on arithmetic coding of the image, which prevented others from freely implementing the standard.

In the mid to late 90s, the jbig2 compression standard was proposed. Although there is a lossless mode for jbig2, it would typically be used for lossy (but "visually lossless") compression. It also requires an error-free channel, so that the main applications would be compression of scanned binary images, and perhaps image transmission on internet fax. Because the standard needed to be written for maximum flexibility, jbig2 can be used for grayscale or color images. However, the most practical use of jbig2 with grayscale or color images is for the compression of a binary layer in a mixed raster format (MR format; more on this below). The image and data can be compressed in a variety of ways; the best is probably to do everything with arithmetic coding. The final ISO draft specifications for jbig2 were published in 1999, and can be found at http://www.jpeg.org/jbig/jbigpt2.html.

Adam Langley has just released an open source jbig2-compliant encoder, the first one yet built. You can get this at http://www.imperialviolet.org/jbig2.html. It uses arithmetic encoding throughout. It encodes binary images losslessly with a single arithmetic coding over the full image. It also does both lossy and lossless encoding from connected components, using leptonica to classify similar elements and generate the templates representing each cluster.

What is jbig2?

From here on, when I refer to "jbig2" it is to the generic method of compression, not to the specific file format in the ISO draft. jbig2 builds a dictionary of shapes from the connected components in the image. It does this with an unsupervised classifier: shapes that are sufficiently similar are placed in the same class, and every instance of that shape in the image is rendered using a single representative image, or template. This is particularly effective for compressing text documents, which are composed of connected components representing characters from a limited set of fonts and font styles. Further, the compression efficiency generally improves with longer documents, because the templates need only be stored once. Documents compressed with jbig2 are typically between 10K and 15K bytes/page, which is about 5 times better compression that that achievable with CCITT/G4.

A jbig2 implementation consists of these three steps:

  1. The connected components (either 4 or 8 connected) are identified.
  2. The components are placed in similarity classes, using an unsupervised classifier. Each class has an image template as a representative. The location of each instance is recorded.
  3. The templates and the (index, location) triples are compressed and written to file. If done according to the ISO draft, the templates are compressed using arithmetic coding and the triplets are compressed using arithmetic or Huffmann coding on coordinate differences.

We provide the first two steps, which are the low-level imaging part of the encoder. For the third step, we simply tile the templates in a single image written in png format, and write the uncompressed (index, location) triples for each instance into a second file.

For the classification step, the goal is to do a minimal overclassification. We are willing to split some classes, as long as we never put two truly different characters into the same class. The over-classification results in a loss of efficiency, which is in the best case relatively small, and is compensated by avoiding substitution errors.

What is provided in Leptonlib?

We provide implementations of two different classifiers. One is based on image correlation. The correlation method compares two bitmaps by computing the ratio of the square of the number of pixels in the AND of the two bitmaps to the product of the number of ON pixels in each. This ratio has a maximum of 1 when the two bitmaps are identical and properly aligned. All pixels that differ in the two bitmaps are weighted equally in the correlation method. We provide two parameters: a base threshold on the correlation and a weight factor that increases the threshold for templates that have a larger fraction of ON pixels. Judicious use of these two parameters will result in visually lossless substitutions.

A classifier that uses less classes for the same image quality is based on hausdorff distance. This comparator gives a (yes/no) answer to the question of whether the two bitmaps are within a certain distance of each other. We can speak of a hausdorff "distance" between two images because the hausdorff forms a true metric. The hausdorff method is better for this application because it tolerates differences in the bitmap shapes near the foreground/background boundary. In effect, it gives a small effective weight to boundary pixels and a large effective weight to pixels that are not very near the boundary. This weighting is expected statistically because the difference in two printed and scanned instances of a character occur almost entirely at the boundary. Thus, the Hausdorff weighting corresponds better to the expected probabilities of pixel values in scanned images.

We also give a rank hausdorff comparator, which is generalization of the hausdorff comparator that permits a match if some fraction of the pixels exceeds the given hausdorff distance. In practice, for scanned text at 300 or 400 ppi, the rank number should be set very close to 1.0, and in any event should not exceed 0.99. (For other applications, such as comparison of word images in document image summarization, where a few different words can be put in the same class in order to avoid the worse error of splitting a word equivalent class, it is useful to use a smaller value of the rank order.)

The main problem with the rank hausdorff method is that there is no set of input parameters that will always avoid some observable substitutions (other than the trivial one that generates a new class for each exemplar). The reason is that the implementation uses dilation, and the smallest practical dilation is with a 3x3 structuring element. As a result, the small letter 'o' can be confused with the italic 'o' in the same font. When the centroids are lined up, a small 'o' and a small italic 'o' can have a hausdorff distance of 1 (with rank = 1.0). So if absolute accuracy is required, the more expensive correlation method must be used.

What is the definition of the hausdorff distance, and how is it implemented efficiently?

The hausdorff distance between two images A and B is defined as follows. After the images are aligned, find the distance of the pixel in B that is farthest from any pixel in A, and v.v. Each of those distances is called the directed hausdorff distance from one image to another, and the hausdorff distance is the maximum of the directed distances. Two images whose pixels differ only at the boundaries will have a small hausdorff distance. The comparator is implemented efficiently by dilating first one image by a structuring element of a given size, testing whether the other image fits within it, and then carrying out the same procedure with the two images reversed. For example, if the two images are to be tested to see if they are within a hausdorff distance of 1, a 3x3 structuring element is used; for a distance of 2, a 5x5 is used, etc. A hausdorff distance of 1 is a good parameter to use for the jbig2 character classifier.

How are the images aligned for comparison?

To use either of these comparators, it is necessary to align the two images. The easiest way to align is by using a corner. However, because the corners are delimited by boundary pixels, which are variable from instance to instance, this would introduce considerable variability in the alignment. Two images could easily be misaligned by up to two pixels in either direction. A much better way to do alignment is by aligning the centroids. Centroid alignment averages over many boundary pixels, greatly reducing the statistical error, so that when aligning to the nearest (integer) pixels, the alignment error is usually less than 1/2 pixel in either direction. Usually, but not always. For the final alignment step, it is necessary to literally XOR the template with the instance, count the pixels, and choose the translation with the minimum counts. Starting from the best alignment using centroids, about 25% of the templates need to be moved either up or down by one pixel for best alignment. It's a little expensive to check for nine positions before making final placement of each instance, but it's worth doing.

What's special about the Leptonlib implementation?

The low-level part of the jbig2 encoder provided here thus has the following useful features:

It has also been integrated, as mentioned above, into an open source, fully jbig2-compliant encoder.

What is Mixed Raster format?

For pages with mixed text and graphics/images, jbig2 is the best choice for the compression of the the text layer, which is usually called the foreground layer in a more general MR format. MR format is a general way to represent an image for compression. It is composed of an ordered set of (mask, image) pairs. It works by painting the first image through the first mask, then overlaying the painting of the second image through the second mask, and so forth.

The best implementation of MR format is the open DjVu format, for which there exists both proprietary and open source implementations. The DjVu format uses a restriction of the general MR format to three layers: a foreground mask (e.g., the text), a foreground image (the image painted through the text mask; often black), and a background image. The background image is painted first, and the masked foreground image is overlaid. Typically, the background image is represented at about 100 ppi, the foreground image at 25 ppi, and the foreground mask at 300 - 400 ppi.

The DjVu format uses a jbig2-type classifier for the foreground mask, and compresses the templates in that mask using an arithmetic coder. It compresses both the foreground and background images using wavelets. There are several tricky parts, such as the method by which the foreground and background are segmented, and the method of compressing the background image in the presence of the high resolution foreground mask that obscures parts of it. Many of the details and a reference open source implementation can be found at djvuzone.org.


2. Color quantization

How did the need for color quantization arise?

In the old days when computers were slow and memory was expensive, it was difficult to display color images. One couldn't simply put 24 MB for frame buffers on a video display card. To display color, rather than using 24 bits -- 8 bits for each of Red, Green and Blue -- per pixel, each pixel was typically only 8 bits deep. These 8 bits then used as an index into a palette, or color table. The color table was in fact a set of three tables, one for each color. Then, using the R, G and B tables in the color table, a single 8-bit index could be used to specify up to 256 different 24-bit colors.

What are the basic problems in color quantization?

Some interesting problems arise. Of the possible 16 million colors that can be described by 24-bit RGB, how do you choose a mere 256 to be the representatives in the color table? How do you display these 256 colors to eliminate visual artifacts from such a small number of colors? And once you have the color table, how do you quickly do the inverse process of finding the best index for each RGB pixel in the image?

The first problem, that of choosing a good set of approximate representatives from a distribution of instances in a much larger set, is an example of vector quantization (VQ). Each of the representatives is a "vector" with Red, Green and Blue components. Each representative will be used as an approximation to the color of some number of actual pixels, and the set of representatives is typically chosen to minimize some error criterion for the actual distribution of colors in the image. The solution of the first problem is represented as a color table for the image.

VQ differs from scalar quantization in that for the latter, the quantization would be an outer product of separate scalar quantization in each of the three colors. VQ is better because it allows far more flexibility in the way the full color space can be divided up.

Once the color table is selected, the image must be scanned and all pixels assigned an index in the color table. This requires an inverse color map. One brute force approach, by which the 24 bits of RGB are used as an index into a table with 16 million 8-bit indices, requires too much memory. Likewise, another brute force method where for each pixel, the distance to each of the 256 representatives is computed and the smallest is chosen, is impractical because it requires too much computation. The inverse color map must be performed efficiently, which is one reason that we've implemented color quantization using octrees.

Why is error diffusion dithering important?

However, even with the best set of 256 colors, there are many images that look bad. They have visible contouring in regions where the color changes slowly. Flesh colors can be notably poor. The best solution to this problem is error diffusion dithering (EDD), which has the interesting aspect that it typically increases the mean square error. EDD works by computing the error between the pixel value and its nearest representative, and then propagating that error to nearby pixels that have not yet been assigned to a color in the color map. After using EDD, the average color in a small region is very close to the average color in the original, but the individual colors are constrained to be taken from the set defined by the color map. By propagating the error in order to achieve visual color accuracy, EDD thus trades spatial resolution for resolution in depth. Using EDD, a surprisingly bad color quantization can be made to look remarkably good. The lesson is not that it is OK to do a lousy job of color quantization; rather, it is sufficient to do a tolerable job, and silly to spend much effort to do marginally better.

There is another interaction between EDD and the inverse color map. Without EDD, the inverse color map needs only be defined for those RGB values are found in the initial full color RGB image. However, with EDD, the RGB values can be anywhere in the color space, and thus the inverse color map procedure must find a good indexed color for any RGB value.

Finally, for EDD to accurately represent every color, it is necessary that the actual colors fall within the convex hull of the representative colors in the color table. That way, each color can be accurately interpolated between the limited set of colors in the color table. To take an extreme example, if all of the representative colors have a Blue value of 0, it would be impossible to represent any amount of blue in any pixel. In practice, the convex hull criterion is not strictly possible. However, it is important that the representative colors span the color space so that this condition is approximately satisfied.

What methods have been proposed for color quantization?

Going back to the problem of finding a good set of colors, several methods have been proposed. The simplest is a fixed color partitioning, such as one that uses equal volumes of color space. This has the virtue that it can cover the entire color space and does not require a first pass over the image to generate the partitioning. We have implemented a one-pass quantizer, using equal volumes, where each quantized volume is composed of two octcubes that are 32/256 of the color space in each dimension. Using the center of the volume as the representative color, the maximum errors in the (red,green,blue) values are (16,16,32). Partitioning by octcubes is a way of minimizing the maximum pixel error.

Of course, a partitioning that is adapted to use the statistics of the pixel colors in an image will do a better job. This requires two passes, although the first pass, which gathers statistics, need not sample every pixel in the image. There are three types of such adaptive color quantizations for which practical implementations have been made: popularity, median cut and octree. The division of the color space can be represented by a tree, and this tree can be generated either by starting with the entire color space and splitting it up, or by starting with a large number of small volumes and merging them. The division of colors along a line can be either flat or hierarchical. The popularity method uses a simple selection of colors from the color space, the median cut method uses a splitting algorithm with a flat division of colors, and the octree method is hierarchical and is naturally amenable to an algorithm that merges regions of color space.

Before we discuss these, I should mention that there is another type of clustering, called K-means, along with related vector quantization methods such as the Linde-Buxo-Gray algorithm for clustering for data compression. In K-means, you start with an initial set of centers, make a pass over the image to assign each pixel to its nearest center, compute the centroids of each cluster so obtained, and use these centroids as the centers for the next iteration. The total error is guaranteed not to increase from one iteration to the next, so the method will converge to a locally optimal solution. However, it is not guaranteed to be globally optimal; the final centers will depend on the initial centers that are chosen. The LBG algorithm is similar. These clustering methods are not practical for color quantization because, in addition to sensitivity to initial conditions, they require many iterations and are very slow.

The popularity method selects the color representatives from the set of most popular colors. Make a histogram of the actual colors in an image (say, clipped to 15 bits so that you have a significant number of pixels in the most popular colors), and choose the top 256, for example. This method does poorly for images that have many different colors represented. For example, if the image has a small number of pixels of some particular colors, these colors will not be represented at all. It also does poorly when using dithering (see below) because dithering can only interpolate within the convex hull of the representatives.

The median cut method attempts to put roughly equal numbers of pixels in each color cell. It is implemented by repeatedly dividing the space in planes perpendicular to one of the color axes. The plane is chosen to divide the largest side and to leave about half the pixels in each of the resulting regions. This method does well for pixels in a high density region in color space, where both the cell volume and color error is very small, but the volume of a cell in a low density part of the color space can be very large, resulting in large color errors. In the open source jpeg library implementation, half the representatives are chosen by the median cut, and half are chosen by dividing up the volume in the largest remaining regions. This reduces the largest color errors that may occur, and spreads the representative colors out more evenly through the color space. The latter allows better results with dithering.

Octrees are a good way to divide up the color space, while allowing fast indexing for an inverse color table. One method has been published, by M. Gervautz and W. Purgathofer, a summary of which can be found in Glassner's Graphics Gems I, 1990. This method builds an octree by taking pixels, one at a time, and either making a new color or merging them with an existing color. After the prescribed number of colors in the color table have been established, the new pixel either makes a new color (causing merging of two existing colors), or it is merged with an existing color. Each color is represented by an octcube, or set of octcubes, so the merging operation effectively prunes the (potential) octree. This method has the advantage that only up to 256 colors need be stored at any time. It has the disadvantage that the merging operations are complicated, and it is not easy to understand how to represent the full color space, which is necessary for dithering. In fact, the description was so sketchy that I can only commend D. Clark for providing an implementation in Dr. Dobb's Journal (reference at the end of this section).

What is the octree data structure used in Leptonlib?

The method we use in Leptonlib is simpler, both conceptually and in implementation. Our basic data structure is a set of arrays, describing the leaves of an octree at each of the levels 0 through 4 (or 5 or 6). At each level, there are 8 cubes that correspond to a single cube at the level above. The cubes are indexed so that the cube i at level l contains the eight cubes 8i+j, j = 0, ... 7 at level l+1. The cube index is computed from the RGB values by taking the most significant bits of the three samples in order

        r7 g7 b7 r6 g6 b6 r5 g5 b5 r4 g4 b4 r3 g3 b3 ...
down to the level of interest. For example, at level 5, the index consists of the 15 bits that are shown above. For any RGB value, it is easy to compute the cube indices at any level of the octree, and we provide lookup tables to make this fast. Note that the octree that is represented by this set of arrays is virtual, given by the indexing relation above, because there are no pointers going between different levels! We just have arrays of these cube data structures, with fast indexing into the arrays. Keep this "pyramid" in mind in the following description of the algorithm

How is the octree formed?

There are two passes. The octree with the selected color table entries (CTEs) is built in the first pass. To build this, we only need to label the specific octcubes within these arrays that are to represent the CTEs; we do not need to build any other data structures. How do we determine which cubes to label? In the first pass, we place the pixels in octcube leaves at a designated depth or level, which can be 4, 5 or 6. With level 5, there are 25 = 32K leaves. We store only the number of pixels in each of these leaves. The tree is then pruned in the following way. Starting at the deepest level, and iterating for every cube at that level, check the octcubes in sets of eight, where the eight octcubes are those that compose the octcube at the next level up. For each set of eight octcubes, one of the following will pertain:

  1. One or more of the cubes has enough pixels to become a CTE. Make it a new CTE.
  2. One or more of the cubes has already been selected as a CTE.
  3. None of the cubes is already a CTE or has enough pixels to become a CTE.
In both the first and second cases, the octcube at the next level up automatically becomes a CTE, which contains those pixels that are not in a subcube that already is a CTE. In the third case, the pixels are simply aggregated into the octcube at the next level up. When all cubes are done at that level, the procedure is repeated at the next level up.

The decision for forming a new CTE is that the number of pixels in the cube exceeds a threshold that is proportional to the number of pixels yet to be assigned to a color divided by the number of colors left to be assigned. We actually hold back 64 colors in reserve, because when we get to the second level we require that each of these 64 octcubes be a CTE by default. In this way, we constrain the maximum color error while insuring that the entire color space is covered by the color table. The value 1.0 for the proportionality constant at each level works well. (The proportionality constants for levels 0, 1 and 2 are not used, because the octcubes at level 2 become CTEs automatically.)

What color is associated with each octcube in the color table?

We associate with each octcube in the color table the coordinates of the center of the cube. This is easier than maintaining a running centroid, and has very little effect on the result. In fact, because the octcubes at each level are indexed as they would appear in the octree from left to right, we can quickly compute the center of the octcube in which any pixel would fall, so we don't even need to store it.

How are color indices assigned to the image pixels?

The second pass, where the pixels are assigned their index, can be done very quickly in two different ways, of which we implemented the second:

  1. Make an explicit inverse colormap. Compute and store the index in the leaf array at the deepest level (in the place where we initially stored the histogram). Then for each pixel, convert the RGB value of the pixel to the truncated octcube index at that level, and look up the colortable index from the pre-computed array.
  2. For each pixel, run down the tree from the root to find the octcube that it belongs in. Do this by converting the RGB value to an index into the array of octcubes at each successive level, stopping when you find an octcube that is marked as a CTE and the octcube at the next level down is not a CTE. (If you reach the bottom level, take that octcube.) Then take the colortable index from the CTE octcube. For dithering, you also need to know the center of the octcube, which can either be computed or read from the value stored there.

This octree method has the drawback of not generating exactly the number of color requested for the color table. It typically has a few less. The actual number depends on the distribution of colors in the image. Note that if enough pixels are present to make a CTE from an octcube at level 6, then CTEs for the containing octcubes will be generated automatically at levels 5, 4, and 3, regardless of the number of unassigned pixels in those cubes. (The only exception is if all 8 subcubes of an octcube are already CTEs, the containing octcube will be labelled as a leaf for purposes of traversing the tree in the labelling step, but will not be assigned a color).

How is the error diffusion dithering applied?

To maximize the accuracy of the color appearance, we use EDD in the second pass. The error in each of the three samples is passed down to three adjacent pixels whose indices have not yet been computed. We use just three pixels, with fractions of 3/8, 3/8 and 1/4, for simplicity. The implementation uses six color buffers for the pixels in the current line and the next line, and does all computation in integers with a multiplying factor of 64 to reduce the roundoff error. To the extent that the actual RGB pixel colors are within the convex hull of the CTE values, EDD will accurately reproduce the original pixel color when averaged over a few pixels.

How fast is the implementation?

For efficiency, the first pass, where the octree color table is generated, can be done on a subsampled version of the input image. For an image with a million pixels, subsampling by 4x in each direction is perfectly adequate, as it leaves about 70K pixels to form the octree. The conversion speed for generating a colormapped image from an RGB image depends on the deepest level at which pixels are allowed to form CTE octcubes. On a million pixel RGB image, using a 1 GHz Pentium III, the total conversion time for levels 4, 5 and 6 is 0.40, 0.45, and 0.60 seconds, respectively.

All details are found in the code in colorquant.c and colorquant.h. The executable for both the 1-pass and adaptive octcube methods is made from prog/colorquanttest.c. It should be emphasized that EDD does a fine job of representing color even for the 1-pass, fixed cell, equal volume method. Without EDD, the 1-pass method looks poor on most images. You can experiment with the various options by changing the number of levels in the 2-pass method (in colorquant.h), and changing the number of colors, selecting to dither or not, and selecting the 1- or 2-pass method in prog/colorquanttest.c.

For more information on octree color quantization, I've written a report. There are five comparative images referenced in the report. These are given below. It should be noted that the actual rendering of these images will depend on the browser, the frame buffer depth (8, 16 or 24 bits) of the computer, and the video display card.


One-pass quantization, no dithering.



One-pass quantization, with dithering.



Two-pass octree quantization, no dithering.



Two-pass octree quantization, with dithering.



Full RGB color image.



For further reading on color quantization...




3. Color segmentation

What is color segmentation and why do it?

There is an entire industry devoted to the problem of searching images in a large collection of images. These are typically images of scenes taken with a camera, rather than drawings or other composed images. The query protocol for the search is typically to choose some image and request "all images that are similar." A related problem is to place each image in the collection into a subset of similar images, doing some type of unsupervised classification. These operations require the ability to calculate a meaningful "distance" between two images. Unsupervised classification follows a simple algorithm. Given a distance function and a threshold, each image is sequentially placed either into an existing class or starts a new class, depending on the distance of that image from the representatives of each existing class. The first image to start each class is the representative for the class, at least on the first iteration. There may be subsequent iterations where the parameters of the classes are averaged to find a "center" for each class in the parameter space.

How is the distance between images determined? That's the hard part -- there's no obvious way. In fact, for two-dimensional images, there are very few distance metrics (satisfying the triangle inequality) that have been found. One of them is the Hausdorff metric, which we use for classification of binary shapes; e.g., see Jbig2 image compressing. For color images, no useful metric distance is known.

IBM had a big R&D program called "Query By Image Content," or QBIC for short. None of the published methods work very well, so it continues to be an "interesting" problem to academics. Image search works much better if there is some text associated with the images: you do a text search and don't bother about the pixels in the image. After an initial fanfare, IBM seemed to give up on the QBIC project near the end of the last century.

But if you wanted to write a QBIC program, what parameters would you choose? Intuitively, you want to include both color and shape information. Images of natural scenes usually have many colors and complicated shapes. So the first logical step would be to simplify both the colors and the shapes. Once that is done, you still have the difficult problem of finding a good way to use your simplified colors and shapes to generate a meaningful "distance." The color segmentation in Leptonlib tackles the easy part: finding regions of significant size and nearly uniform color. We offer it as a starting point for any QBIC-like image search application.

There are other reasonable approaches this problem, notably involving textures. Rather than smoothing out regions and looking for a few representative colors, you can identify regular or irregular patterns of colored shapes. Textures are sufficiently varied that taxonomies are non-trivial and somewhat arbitrary. A possible implementation would omit the smoothing step (described below), and would select binary textures formed by assigning fg and bg to two specific colors. For N colors, there are N(N-1)/2 color pairs, and for small N (say, less than 8) a search of all pairs is feasable. We won't pursue this any further here, but we do use textures implicitly in morphological analysis of document images, for segmentation.

How does color segmentation differ from color quantization?

The purpose of color quantization is to generate an approximation to the original image with a smaller palette of colors. If there is a relatively small number of colors, without a high frequency pattern such as halftoning, such an image will compress very well losslessly, often much better than with lossy compression such as jpeg, where a fine-grained set of RGB colors is used. In Leptonlib we use octcube partitioning and octree indexing because it allows fast color quantization with arbitrary color accuracy. For best accuracy, an octree that represents the most important colors is combined with error-diffusion dithering to approximate the original colors accurately in each small region of the image. However, dithered images do not compress well losslessly.

In color segmentation, fidelity to the original image is not a goal. Instead, you want a small set of regions, of significant extent and with smooth boundaries, each of which is of a uniform color, and with a relatively small total number of colors in the image. We want the few resulting colors to be the best representative of each subset of pixels. The result is an image that has been "simplified," both in colors and shapes. As will be shown, octcube indexing can be used to accelerate some steps of the classification process. Although it's not important for the application, color segmented images have excellent lossless compression.

How do we generate the color segmented images?

The best description of the method used here is the code itself (colorseg.c). The top-level function, pixColorSegment(), has four parameters, of which two -- the smoothing parameter and the number of final colors -- are the most important. The other two parameters could be generated programmatically, but an argument can be made to keep them for experimentation. One of these parameters is the maximum number of colors to be quantized in the first phase, and as a rough guide, this should be at least twice the final number of colors. The other parameter is the initial guess for the threshold euclidean distance for determining if a color belongs to an existing class. This distance is related to the radius of the resulting clusters, which is related to the maximum number of clusters that will be found in the RGB space. Guidelines are also given for the relation between the input euclidean distance and the maximum number of clusters.

The process has four phases:

  1. Greedy, unsupervised classification. This is an iterative procedure. We start with the threshold cluster radius and the maximum number of colors. Pixels are taken in raster order. The first pixel becomes the representative for the first cluster. Successive pixels are assigned to this or other existing clusters, or become representatives for new clusters. If the maxcolors limit is exceeded, the threshold radius is increased by a multiplicative constant and the process is repeated, until a cluster assignment is made that obeys the maxcolors constraint. The average cluster color is computed during accumulation.
  2. Reclassification using the cluster averages.. Each pixel is re-assigned to the cluster whose average color is closest to the pixel color. This improves the assignment. The cluster averages are stored in a colormap. We make the time to compute the assignment independent of the number of pixel clusters, by constructing an octcube in RGB space and assigning to each cube the nearest color in the colormap. Then for each pixel, we use table-lookup twice: first to find the index of the containing octcube, and second to find the nearest color in the colormap to that octcube. We also keep track of the number of pixels assigned to each cluster.
  3. Smoothing regions and boundaries.. Starting with the color cluster with the most pixels, we generate a binary mask where those pixels are fg and all other pixels are bg. Then do a closing with a Sel size given by the smoothing parameter. Xor the result with the initial mask to find the new pixels to be assigned to this color, and reassign them. Repeat for all colors.
  4. Reduce the number of colors to the input 'finalcolors'.. Identify all the pixels that are not in the most populated color clusters, by building a binary mask over them. Assign all these pixels temporarily to one of the color clusters that will be saved. Then remove unused colors from the colormap. This does a compression of the colormap, causing a reassignment of the pixel values (which are just the colormap indices). Finally, reassign all the pixels under the mask to their closest color in the colormap. We can use the same function from the second phase to do this, except this time we only reassign the masked pixels.

How does a color segmented image look?

Here is a fairly busy image:



Suppose we want finalcolors = 4, and we use a 5x5 Sel for smoothing. The rule-of-thumb says to choose maxcolors = 8 (approximately), and we get:



This isn't bad for using only four colors in the final result. If we choose maxcolors = 4, the result is more random in the final assignment:



whereas going to the other extreme, starting with maxcolors = 16, the result is very noisy:



This is due to the fact that the smoothing step in phase 3 was performed with many colors, and hence was relatively ineffective. Then in phase 4, many of the pixels were reassigned based on color rather than proximity. This suggests that we should perhaps reduce the number of colors to the final value before smoothing. I will leave it to the users of the library to experiment further. Please let me know when you get some interesting results, and I will post them here.


4. Measuring the skew of document images

As with most applications of practical importance, there have been many published papers, using diverse methods, on ways to determine the skew of document images. It may not come as a surprise that, notwithstanding the size of this bibliography, there is only one method that can be recommended for both speed and accuracy. It maximizes, in essence, the variance of differential line sums, where the line sums are the pixel projections taken at different angles. When the angle chosen is equal to the skew angle, the difference in pixel projections on adjacent raster lines, when squared and summed over all raster lines, has a maximum value. This method was first published by W. Postl in a 1988 U.S. patent, #4,723,297.

I can't resist telling a little story here. Way back in 1991 I gave a paper at the first ICDAR Conference, held in the beautiful medieval town of St. Malo, France. The format had the speakers sitting at a table at the front of the room, and I was placed next to a well-known graybeard who had published a textbook on algorithms for image processing way back in 1982. We started talking about our recent work, and he said his paper was on segmenting a scanned document page that was skewed by a few degrees. I volunteered that it would be much harder to do that on a skewed page, and he agreed. So I asked him the obvious question: "Why do something so difficult when there is a much easier way to do it? Deskew the image first." (I didn't add a second reason: that the result of segmenting a scanned image is also expected to be much poorer if the image hasn't been deskewed.) He replied that it would be preferable to deskew the image, but it was much too slow. I replied that surely it took longer to segment an image than to deskew it. No, he said, it takes at least a minute just to determine the skew angle. I was surprised, and asked him what method he was using for deskew. He said he was using a Hough Transform, but was vague on the details of the search scheme. I then told him that the skew angle could be found to sufficient accuracy in a few seconds using Postl's variance of line sums (or of differential line sums). He was skeptical; if what I said was true, his attempt to solve a difficult problem was meaningless. There are several conclusions we can draw from this tale. (1) Don't believe something just because an eminence grise says it is true -- figure it out for yourself. (2) Always look for elegant solutions. They have lasting value and usually teach you something important, which can be used in other situations as well. (3) Remember two things about computer algorithms. Computers get faster in accordance with Moore's Law, so you can assume that algorithms that are too slow today will eventually become practical. But, that being said, at any time, it is important to search for efficient methods.

How fast is deskew on a 1 GHz P3 with an 8 Mpixel image (standard page at 300 ppi) that is skewed about 1 degree? It depends on how much accuracy you want. But if you do it at 4x reduction, which will give you an accuracy of about 1/20 of a degree, it takes about 0.10 seconds to find the skew angle and about 0.12 seconds to rotate the image. That is less than 1/4 second in total. And on a 2.5 GHz P4, the entire operation will take less than 0.1 second. Just the blink of an eye.

I have written technical notes (in ps pdf, djvu), that summarize my understanding of the different methods and their relative merits and deficiencies, with emphasis on the method that uses the variance of differential line sums. Additionally, in the section on recent publications, you can find a paper where the preferred method is tested on a database of about 1000 pages.

5. Outlines of connected components

The generation of connected component (c.c.) borders is a relatively simple application in leptonlib. There are many rendering machines that can convert outlines to binary images. PostScript and other page description languages use outline fonts for scalable rendering, to get smooth pixellated curves at any output device resolution. There are several different font formats in use, such as the original PostScript outline fonts, Microsoft's TrueType fonts, and the open source FreeType fonts.

Borders of connected components can be generated by following either the border pixels or the "cracks" between border pixels of different colors. The first type are called chain codes; the second are crack codes. It is relatively simple to convert crack codes to chain codes, but we will use chain codes throughout. Borders must be followed consistently so that the inside of the c.c. is on either the left or right. We choose having the inside on the right side, so that outer borders are traversed cw and inner (hole) borders are traversed ccw.

After the chain code for the border(s) of a c.c. are determined, the following constructions are of interest:

Let's look at each of the operations in some detail. The source is in ccbord.c and ccbord.h, which should be consulted for further details.

Generating chain codes for connected components

We first find the bitmaps and bounding boxes for each 8-connected component. The border of an 8-connected component is represented by an 8-connected chain of pixels. Then for each c.c., the outer border is found by first adding a one-pixel border of OFF pixels to the PIX for the c.c., finding a first border pixel in the fg (i.e., an ON pixel next to an OFF pixel), and sequentially finding the next pixel on the border. We use a method described by Rosenfeld and Kak in Digital Picture Processing, Vol 2, pp. 219ff, Academic Press, 1982. The next pixel in an 8-connected border is found by looking in the 8 directions from the current pixel, starting with the pixel next in cw rotation from the previous border pixel. The search sweeps around until an ON pixel is found. The chain is terminated after the first pixel is reached again AND the next pixel would be the second pixel in the chain. This is necessary because pixels can be traversed on the chain multiple times (up to 4 times; consider a 3x3 plus sign).

Remember, the background to 8-connected foreground is 4-connected. The holes are found as follows. We take each minimum-sized image of an 8 c.c., and use 4-filling from the border. This fill stops when you hit the outside of the c.c. Then inverting the result, you get an image of the holes (if any) as 4-connected fg objects. Run the c.c. finder looking for these 4-connected objects. Then for each hole, the hole border in the fg of the original c.c. is found by first finding a pixel within the hole, then searching in the original c.c. for a border pixel, and then following the border to generate the chain code. Doing it this way, we are guaranteed to find all holes within components, and not to run into trouble in very complicated situations where there can be components within holes within components, etc, nested to any arbitrary depth.

Serializing the data for a compressed file format

We show a simple way to do this, that gives a typical compression on text images that is about 25% better on borders of connected components than using png on the original image. If there are halftones, you can expect worse compression than png. These outline representations should generally not be used on halftones, but they work well with text and unstippled line art.

The chain code can be represented as a sequence of directions for each step. There are 8 directions, so this requires 3 bits of information. Pack 2 steps into each byte. You can do better, putting 8 steps into 3 bytes, but gzip, which works on bytes, will be less successful at compressing such data, and we use gzip on the sequential step data. For each c.c. you also need to store the UL corner coordinates, and for each border (both outer and hole) you need to store the coordinates of the starting pixel. We also store the width and height of each c.c., but this is not necessary because it can be determined from the outer border. This data is collected in memory, using the byte buffer utility in bbuffer.c. It is then compressed in memory with gzip and written out to file. It can be read back, gunzip'd in memory, and parsed, with the construction of the CCBORDA data structure.

There are other ways to do this. For example, we can store run-lengths at each direction rather than single steps. We can compress using a Huffman code on run-lengths, rather than a universal code on bytes composed of 2-step direction codes. I adopted the approach here because it's easy to implement using the byte buffer utility and the in-memory gzip functions.

Forming a single border for each connected component

Kris Popat suggested that it is useful to connect the hole borders with the outer border, so that there is a single border for each c.c. (In fact, this whole chain code representation is Kris's suggestion.) To avoid having artifacts generated by the rendering machine, each cut path between inner and outer borders is traversed twice, in opposite directions, along exactly the same pixel path. The renderers are usually smart enough to figure out what is inside and what is outside.

The implementation is most simple if we find a path from each hole border to the outer border, making sure that it is fully contained within the fg of the c.c. We use again a bitmap of the c.c. under study. For each hole, we start at the center of the hole, making sure it is an OFF pixel, and march in one of 4 directions until we find an ON pixel, which must be on the hole border. We then continue in the same direction until we hit either an OFF pixel or an ON pixel at the edge of the bitmap. Check if the last ON pixel is on the outer border of the c.c. If it is, we have our cut path; if not, choose another of the four directions and repeat. This way we have four chances. Because the application is for text, where the most likely joining of characters is horizontal, we pick the first two directions vertically and the last two horizontally. If we fail in all four directions, we just lose a hole.

Then, once a cut path is located for each hole, we start on the outer border. When we come to a cut path, we take it, go around the hole, come back on the cut path to the outer border, and continue there. After the outer border has been completely traversed, we are finished.

For completeness, I note that we offer the following representations of the chain code in the CCBORDA data structure:

Rendering the outline as a filled raster

The standard method for filling outlines is scan line conversion. This sweeps a line across the image, noting when it goes through end points of oriented line segments, and keeping track of how many lines (and, optionally, their orientations) that it cuts. You can understand how this works intuitively by tracking across just one raster line, starting at the left edge. Suppose you start in bg. The first line you cross will be oriented up, and after crossing it you will be in fg. (This follows our convention that inside is on the right as you traverse the path.) For simple shapes, each line you cross will be oriented oppositely to the preceeding line, and you will toggle from fg <--> bg. In the general case there can be any number of lines oriented up or down, and a rule is needed to determine what to do when crossing. There are two common rules:
  1. Nonzero winding number rule. Sum the crossings, using +1 if the path crosses oriented up and -1 if it is going down. If the sum is 0, you are in bg; otherwise, you are in fg.
  2. Even-odd rule. Sum the crossings, independent of path orientation. If the sum is even, you are in bg; otherwise, you are in fg.
For simple shapes, these two rules give the same result, but for complex paths they will differ. Using the nonzero winding number rule, if you pass 2 lines both oriented up, the winding number is 2; you are still in fg and will remain there until the winding number has gone back to 0. Scan line conversion has not yet been implemented in leptonlib.

We provide instead two topological algorithms for filling the outlines, which are represented as separate outer and hole borders for each c.c. These are the functions ccbaDisplayImage1() and ccbaDisplayImage2(). Both algorithms are described in some detail at the top of ccbord.c. The algorithms are very similar, but Method 2 is a little simpler, so I describe it here.

Each 8-connected component is filled separately, and then rasterop'd into the destination image. For each c.c.,

  1. Make a PIX with the border pixels in the c.c. in the fg, and with an added 1 pixel border of bg pixels. If w and h are the width and height of the c.c., the PIX has width w+2 and height h+2. This is the clipping mask.
  2. Make a seed image of the same size, and for each border (both outer and holes) put one seed pixel outside the border. "Outside" is determined by our right-hand convention for borders. The 1 pixel border was added to each image to simplify the procedure for finding a seed outside the outer border of the c.c.; namely, by guaranteeing that those pixels are accessible as seeds.
  3. Fill the seed image, using the border pixels as a clipping mask to stop each fill at the borders. This is done with pixSeedfillBinary(), inverting the clipping mask to make it a filling mask, and using 4-connected filling for the bg. We have now filled both the holes and the region outside the outer border of the c.c.
  4. Invert the seed image, to get the properly filled c.c, still centered in the oversize (by 2 pixels) PIX.
  5. Rasterop using XOR the filled c.c. (but not including the outer 1 pixel boundary) into the full dest image.
This is relatively fast, requiring only seedfill and rasterop as the basic bit-parallel functions.

Putting it together

You can run all these operations with prog/ccbordtest.c. Let R be the ratio of 8-c.c. to 4-c.c. For typical text, R is very close to 1. However, if halftones are present, R can be much smaller than 1. The following three page images were used to test different aspects of the border finding and rendering routines:

[Leptonica Home Page]