SWT voting-based color reduction for text detection in natural scene images
© Ikica and Peer; licensee Springer. 2013
Received: 8 February 2013
Accepted: 15 April 2013
Published: 1 May 2013
In this article, we propose a novel stroke width transform (SWT) voting-based color reduction method for detecting text in natural scene images. Unlike other text detection approaches, which mostly rely on either text structure or color, the proposed method combines both by supervising the text-oriented color reduction process with additional SWT information. SWT pixels mapped to the color space vote in favor of the color they correspond to. Colors receiving a high SWT vote most likely belong to text areas and are blocked from being mean-shifted away. The literature does not explicitly address the issue of SWT search direction; thus, we propose an adaptive sub-block method for determining the correct SWT direction. Both the SWT voting-based color reduction and the SWT direction determination methods are evaluated on binary (text/non-text) images obtained from a challenging Computer Vision Lab optical character recognition database. The SWT voting-based color reduction method outperforms the state-of-the-art text-oriented color reduction approach.
Text detection in natural scene images is a very challenging task, far from being completely solved. Complex backgrounds, uneven illumination, and the presence of an almost unlimited number of text fonts, sizes, and orientations pose great difficulties even to state-of-the-art text detection methods. Unlike document images, where text is usually superimposed on either blank or complex backgrounds and is therefore more distinct [1–3], natural scene images contain scene text, which is already a part of the captured scene and is often much less distinct. Nevertheless, text detection has become a very popular research area due to its enormous potential in many application areas such as sign translation, content-based web image searching, and assisting the visually impaired.
In this article, we propose a stroke width transform (SWT) voting-based color reduction method. It reduces the number of initial colors in the original image to only a few, typically fewer than 10, while preserving all dominant text colors. SWT voting-based color reduction corresponds to the first two stages of the text detection flowchart (see yellow rectangles in Figure 1). Two spatially connected pixels of a color-reduced image that belong to the same color class also belong to the same connected component.
The proposed method improves the state-of-the-art color reduction approach for text detection by Nikolaou and Papamarkos with additional SWT information. Since SWT pixels most likely belong to text regions, they are mapped to the color space, where they supervise the color reduction process, more specifically, the mean-shifting stage. When a particular color receives a high SWT vote, it is blocked from being mean-shifted away.
Popular text detection datasets such as the International Conference on Document Analysis and Recognition (ICDAR) 2003 dataset and the ICDAR 2011 dataset are inappropriate for evaluating our method since they are annotated with word rectangles. To evaluate the performance of our color reduction method, a per-character evaluation is necessary. Thus, binary ground truth images obtained from the Computer Vision Lab Optical Character Recognition DataBase (CVL OCR DB) are used for evaluation. Text and background pixels in ground truth images correspond to non-zero and zero values, respectively.
The text detection literature does not directly address the problem of finding the correct SWT search direction. Typically, SWT-based methods run the SWT in both the gradient and counter-gradient directions and combine the results. This, however, results in detecting inter-character and non-text areas. Thus, besides SWT voting, our contribution is an adaptive SWT direction determination method that uses SWT profiles to partition an image into sub-blocks and analyzes the SWT histograms of both search directions in each sub-block.
The rest of the paper is organized as follows. Section 2 describes the proposed method in general. Sections 3 and 4 give a detailed description of both SWT direction determination and SWT voting-based color reduction methods. Experimental results are presented in Section 5. The article is concluded in Section 6.
Text in natural scene images is distinguished from other image structures and background by its characteristic shape (character strokes are more or less parallel) and color uniformity. Unlike many other text detection methods that analyze either shape or color, the proposed method combines both by integrating the SWT and Nikolaou text-oriented color reduction methods.
The SWT method proposed by Epshtein et al. is a region-based text detection method. It follows the stroke width constancy assumption, which states that stroke widths remain constant throughout individual text characters. After obtaining an edge map of the input image, the SWT method locates pairs of parallel edge pixels in the following fashion: for each edge pixel p, a search ray in the edge gradient direction is generated, and the first edge pixel q along the search ray is located. If p and q have nearly opposite gradient directions, an edge pair is formed and the distance between p and q (called the stroke width) is computed. All pixels lying on the search ray between p and q (including p and q) are assigned the corresponding stroke width. After assigning stroke widths to all image pixels, the SWT method groups pixels with similar stroke widths into connected components and filters out those that violate the geometrical properties of text. When the edge threshold is sufficiently low, SWT typically finds all characters in the image or at least small portions of each of them. However, it often fails to detect whole characters and leaves parts of them undetected (see letter ‘A’ in ‘RHODIA’ in Figure 2b). Another SWT drawback is the detection of non-text structures with nearly parallel edges.
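The ray search described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name, the `direction` switch (+1 for the gradient direction, -1 for the counter-gradient direction), the angular tolerance of π/6, the ray length limit, and the rule of keeping the smaller width when rays overlap are all assumptions made for illustration. The edge map and unit gradients are assumed to be precomputed.

```python
import math

def stroke_width_transform(edges, grad, direction=1, max_len=50, tol=math.pi / 6):
    """Toy SWT. edges: 2D list of bools; grad: 2D list of unit (gx, gy)
    tuples (only read at edge pixels). Returns stroke widths, 0.0 = none."""
    h, w = len(edges), len(edges[0])
    swt = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not edges[y][x]:
                continue
            gx, gy = grad[y][x]
            dx, dy = direction * gx, direction * gy  # search direction
            ray = [(x, y)]
            for step in range(1, max_len + 1):
                cx, cy = round(x + dx * step), round(y + dy * step)
                if not (0 <= cx < w and 0 <= cy < h):
                    break  # ray left the image without hitting an edge
                if (cx, cy) == ray[-1]:
                    continue  # rounding landed on the same pixel again
                ray.append((cx, cy))
                if edges[cy][cx]:
                    qx, qy = grad[cy][cx]
                    # p and q form a pair if their gradients are nearly opposite
                    if gx * qx + gy * qy <= -math.cos(tol):
                        width = math.hypot(cx - x, cy - y)
                        for rx, ry in ray:
                            old = swt[ry][rx]
                            # assumption: overlapping rays keep the smaller width
                            swt[ry][rx] = width if old == 0 else min(old, width)
                    break  # stop at the first edge pixel either way
    return swt
```

For a single dark stroke whose two edges have gradients pointing towards each other, the call with `direction=1` fills the stroke with its width, while `direction=-1` finds nothing, which hints at why the search direction matters.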
The text-oriented color reduction method proposed by Nikolaou and Papamarkos and applied in  successfully deals with the problem of partially detected characters (see Figure 2c). The idea behind the method is to reduce the colors in an image to only a few dominant ones, thus making text detection much easier. The method starts by creating an RGB histogram h_RGB of the image. Next, initial color cubes of fixed size are randomly generated inside h_RGB until they completely cover all non-zero h_RGB cells. To further reduce the number of colors, the initial color cubes undergo a mean-shift stage and are shifted towards dominant gravity centers in h_RGB. Additionally, if particular color cubes end up close enough to each other, they are merged. The centers of the resulting color cubes correspond to the final colors C. Finally, a color-reduced image is generated by replacing image colors with their closest match in C. When particular text colors cover only a small portion of an image and are not sufficiently far from other colors in the RGB color space, they are often mean-shifted away, in the worst cases towards background colors. In such scenarios, the color reduction is unable to detect text properly (see the missing ‘Clairefontaine’ text in Figure 2c). Decreasing the cube size could preserve lost text colors, but would also increase the number of final colors, which is unacceptable.
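The mean-shift stage can be sketched as follows. This is a simplified illustration rather than the Nikolaou and Papamarkos implementation: the histogram is stored as a sparse dictionary from RGB triplets to pixel counts, and the function name, iteration limit, and convergence threshold are assumptions.

```python
def mean_shift_cube(hist, center, edge, max_iter=100, eps=0.5):
    """Shift a color cube towards the local gravity center of the histogram.
    hist: dict {(r, g, b): pixel count}; center: start (r, g, b); edge: cube size."""
    half = edge / 2.0
    c = tuple(float(v) for v in center)
    for _ in range(max_iter):
        total = 0
        acc = [0.0, 0.0, 0.0]
        # accumulate the count-weighted mean of all bins inside the cube
        for color, count in hist.items():
            if all(abs(color[k] - c[k]) <= half for k in range(3)):
                total += count
                for k in range(3):
                    acc[k] += count * color[k]
        if total == 0:
            break  # empty cube, nothing to shift towards
        new_c = tuple(a / total for a in acc)
        shift = max(abs(new_c[k] - c[k]) for k in range(3))
        c = new_c
        if shift < eps:
            break  # converged
    return c
```

With a sparse text color of 5 pixels at (10, 10, 10) next to a dominant color of 100 pixels at (20, 20, 20), a cube of edge 32 started at the text color drifts almost all the way to the dominant color, which is the failure mode described above.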
We implemented the Nikolaou color reduction method with modifications. The initial cubes are not selected randomly as in the original method but in the following manner: RGB histogram bins are sorted in descending order, and initial cube centers are always selected from the top of the list of unvisited bins. This way, we guarantee that the algorithm works in a deterministic fashion. The HSL color model is used for generating color-reduced images from the final color clusters. We use the HSL distance metrics defined in .
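A minimal sketch of this deterministic seeding is given below. The sparse dictionary histogram, the function name, and the tie-breaking rule (equal counts are ordered by color value) are assumptions; the text only states that bins are sorted in descending order and that centers are taken from the top of the unvisited list.

```python
def seed_cubes(hist, edge):
    """Pick initial cube centers deterministically.
    hist: dict {(r, g, b): pixel count}; edge: cube edge length."""
    half = edge / 2.0
    centers = []
    # most populated bins first; ties broken by color value (assumption)
    for color, _count in sorted(hist.items(), key=lambda kv: (-kv[1], kv[0])):
        covered = any(
            all(abs(color[k] - c[k]) <= half for k in range(3)) for c in centers
        )
        if not covered:  # bin is still unvisited, so it seeds a new cube
            centers.append(color)
    return centers
```

Because the order depends only on the histogram, repeated runs on the same image yield the same initial cubes.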
The left column in Figure 4 depicts the original image (a), its gradient SWT image (c), and its counter-gradient SWT image (e). We refer to the latter two as SWT+ and SWT−, respectively. Each color in an SWT image corresponds to a particular stroke width; therefore, pixels sharing the same stroke width are represented with the same color. If we carefully observe both SWT images, we can see that SWT+ is more compact and contains fewer colors than SWT−. This is reasonable since SWT+ corresponds to the actual text with uniform stroke widths, whereas SWT− corresponds to non-text areas with randomly distributed ‘strokes’. However, in the right column of Figure 4, the distinction between SWT+ and SWT− is no longer so clear. Still, the assumption holds for the central region with the text ‘ROžA’.
where i is the bin index, p is an image pixel, and max_SWT is the maximum SWT value over both the SWT+ and SWT− sub-blocks. Simply put, each SWT histogram bin contains the number of pixels with stroke widths in the given bin range. After obtaining the SWT histograms, both are sorted in ascending order. Examples of sorted SWT histograms are depicted in Figure 6.
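The histogram construction can be sketched as follows. The exact bin formula is not reproduced here, so the equal-width binning over (0, max_SWT] is an assumption, as are the function and parameter names; the default of 40 bins follows the value N_B=40 reported in the experiments.

```python
def swt_histogram(swt_values, max_swt, n_bins=40):
    """Count non-zero stroke widths per bin, then sort the bins ascending.
    swt_values: iterable of stroke widths of one sub-block (0 = no stroke);
    max_swt: maximum SWT value over both search directions."""
    hist = [0] * n_bins
    for v in swt_values:
        if v <= 0:
            continue  # pixels without a stroke width do not vote
        i = min(int(v * n_bins / max_swt), n_bins - 1)  # clamp v == max_swt
        hist[i] += 1
    return sorted(hist)  # ascending order, as used for the histogram analysis
```

Using the same max_SWT for both directions keeps the two sorted histograms of a sub-block directly comparable.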
While examining SWT histograms of several SWT images, we came across some interesting observations. SWT histograms corresponding to true text are usually steeper, more compact, and edgier as opposed to the non-text histograms, which are typically wider and ascend in a more continuous fashion (see Figure 6). Our empirical observations seem to be reasonable since the text usually contains equal stroke widths. On the other hand, non-text regions contain more SWT noise; therefore, stroke widths are more evenly distributed over the whole spectrum.
After all sub-blocks SWT_SB are obtained, they are stitched together into the final SWT_RES image, which has the same size as the input image I.
where μ(·) corresponds to the matrix mean and 0<α<1. Since the distance between the parallel edges of a character is usually shorter than the distance between the character and other structures in the image (such as signboards), Equation 3 limits the influence of the longer (usually non-text) distance. Note that the longer distance does not always correspond to a non-text area; therefore, the correct SWT direction typically cannot be determined from stroke length alone. Nevertheless, the upper SWT boundary is a very useful addition to the SWT sub-block histogram analysis.
Filtering SWT sub-blocks with the upper SWT boundary before generating SWT histograms and computing the f measure successfully reduces the number of SWT outliers, as shown in Figure 7c,d. In the sections that follow, we denote the final SWT_RES image simply as the SWT image.
To block probable text colors from being mean-shifted away, a correspondence between text shape and image colors must be established. The key to a plausible integration of both lies in the SWT information, since image pixels corresponding to non-zero values in the SWT image most likely belong to text regions. Therefore, before executing the mean-shift stage, non-zero SWT pixels are mapped to the color space using an SWT lookup table.
The SWT lookup table is a three-dimensional table of the same size as the RGB histogram. Each table cell corresponds to a particular RGB triplet and contains a list of the non-zero SWT values of all occurrences of that triplet in the original image. For instance, the SWT lookup table entry for the RGB triplet (120,50,70) is generated by locating all pixels with R=120, G=50, and B=70 in the original image and storing the SWT values at the corresponding locations in the SWT image. If a particular color is a text color, its SWT lookup cell contains more or less similar SWT values. Non-text colors, on the other hand, mostly contain very few SWT values (non-text regions mostly correspond to zero SWT values), and these are randomly distributed.
The SWT lookup table represents an efficient mapping from stroke width space to color space and allows us to quickly obtain SWT information for a particular color.
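A minimal sketch of building such a lookup table is shown below. For brevity it uses a sparse dictionary keyed by RGB triplet instead of a dense three-dimensional array; this substitution and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def build_swt_lookup(image, swt):
    """Map each RGB triplet to the list of non-zero SWT values found at
    the pixels of that color.
    image: 2D list of (r, g, b) tuples; swt: 2D list of stroke widths."""
    table = defaultdict(list)
    for row_px, row_swt in zip(image, swt):
        for rgb, width in zip(row_px, row_swt):
            if width > 0:  # zero means no stroke was found at this pixel
                table[tuple(rgb)].append(width)
    return table
```

A cell with many similar values, such as several widths close to 3, suggests a text color, while an empty or scattered cell suggests background.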
To block text colors from being mean-shifted away, we propose the following solution: before each mean-shift step, the SWT properties of the source and target cubes are compared (the source and target cubes correspond to the color cube before and after a particular mean-shift step, respectively). When a source cube rich in SWT pixels is about to be mean-shifted towards a target cube that is drastically poorer in SWT pixels, mean-shifting of the current color is blocked, since a text/non-text transition has probably occurred. We call this process SWT voting since SWT pixels cast a vote for the color they are assigned to and determine whether it is a text or non-text color.
Let CB denote a color cube with center C_CB and edge length L_CB, and let LT_SWT denote an SWT lookup table. Before each mean-shift iteration, a smaller concentric SWT cube CB_SWT with edge length L_SWT=β·L_CB (0<β≤1) is generated inside the color cube. The following properties of the SWT cube are computed:
where h_RGB is the RGB histogram of the original image.
Standard deviation of SWT lengths SD_SWTL. SD_SWTL measures the stroke width variation of the SWT pixels covered by the SWT cube. A lower deviation indicates that the SWT cube covers pixels of uniform stroke widths and therefore corresponds to a text color.
Standard deviation of SWT offsets SD_SWTO. SD_SWTO indicates how scattered the SWT pixels are with respect to the SWT cube’s origin.
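The SWT cube properties can be sketched as follows. The defining equations are not reproduced here, so several details are assumptions: the density is taken as the number of SWT values in the cube divided by the number of image pixels whose colors fall in the cube, the offset of an SWT pixel is taken as the Euclidean distance of its color from the cube center, and population standard deviations are used. The histogram and lookup table are sparse dictionaries, and all names are hypothetical.

```python
import math
import statistics

def swt_cube_properties(lookup, hist, center, cube_edge, beta=0.63):
    """Return (density, sd_lengths, sd_offsets) of the concentric SWT cube.
    lookup: dict {(r, g, b): [SWT values]}; hist: dict {(r, g, b): count};
    center: color cube center; cube_edge: color cube edge length."""
    half = beta * cube_edge / 2.0  # SWT cube edge is beta * cube_edge
    lengths, offsets, n_pixels = [], [], 0
    for color, count in hist.items():
        if all(abs(color[k] - center[k]) <= half for k in range(3)):
            n_pixels += count
            widths = lookup.get(color, [])
            lengths.extend(widths)
            # assumption: offset = color distance from the cube center
            offsets.extend([math.dist(color, center)] * len(widths))
    density = len(lengths) / n_pixels if n_pixels else 0.0
    sd_len = statistics.pstdev(lengths) if lengths else 0.0
    sd_off = statistics.pstdev(offsets) if offsets else 0.0
    return density, sd_len, sd_off
```

Under these assumptions, a cube covering a text color shows a high density and a low length deviation, in line with the interpretation given above.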
Let us explain this condition. When a significant drop in SWT density is detected (first row in Equation 5), the cube is probably undergoing a text/non-text transition; however, some additional checks need to be performed. First, the SWT density of the source cube must be relatively high (second row in Equation 5); otherwise, the density drop can be a result of SWT noise present in low SWT density cubes. Second, we found empirically that the SWT length deviation typically rises when text/non-text transitions occur (third row in Equation 5). Third, mean-shifting from a text color to a background color is gradual, and the transition typically consists of more than one mean-shift step. To ensure that mean-shifting is blocked in the first step and not in the intermediate steps, the fourth row in Equation 5 must be true. SWT cubes corresponding to dominant text color tints usually have SWT pixels spread all over the cube. When mean-shifting towards background colors, SWT pixels slowly vanish and appear only at the SWT cube borders, thus lowering the SWT offset deviation. Due to the SWT and color noise, all four conditions in Equation 5 must be true. Otherwise, mean-shifting proceeds normally. The parameters τ_D, τ_L, and τ_O in Equation 5 determine the sensitivity of the SWT voting. Lowering them makes the mean-shift stopping condition stricter and raising them makes it less strict, which affects how many color cubes are blocked from mean-shifting any further.
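The four-part blocking test can be sketched as below. Since Equation 5 is not reproduced here, the concrete inequalities are only one plausible reading of the description: each threshold is applied to a target/source (or source/target) ratio, written in multiplied form to avoid division by zero. The function name and the tuple layout are hypothetical; the default values are the parameters reported in the experiments.

```python
def block_mean_shift(src, tgt, d_min=0.18, tau_d=0.70, tau_l=0.80, tau_o=0.80):
    """Decide whether a mean-shift step should be blocked.
    src, tgt: (density, sd_lengths, sd_offsets) of the source and target
    SWT cubes. All four inequalities are assumed forms, not the paper's."""
    d_src, sl_src, so_src = src
    d_tgt, sl_tgt, so_tgt = tgt
    density_drops = d_tgt < tau_d * d_src       # row 1: significant density drop
    source_is_rich = d_src >= d_min             # row 2: source density high enough
    length_sd_rises = sl_src < tau_l * sl_tgt   # row 3: length deviation rises
    offset_sd_falls = so_tgt < tau_o * so_src   # row 4: offset deviation falls
    return density_drops and source_is_rich and length_sd_rises and offset_sd_falls
```

In this reading, lowering any threshold makes the corresponding inequality harder to satisfy, which matches the statement that lower values give a stricter stopping condition.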
All experiments were carried out using the following empirically obtained parameter values: N_B=40, α=0.33, β=0.63, D_min=0.18, τ_D=0.70, τ_L=0.80, τ_O=0.80. Other color reduction parameters were the same as in .
Table 1. SWT direction determination results. Labels: Number of text; Number of all; Basic SWT direction; SWT direction method with upper SWT boundary.
The relatively high detection rate in Table 1 indicates that the SWT direction determination method (in combination with the upper SWT boundary) works well and fails to determine the correct SWT direction only in a few cases. Since these cases typically correspond to smaller parts of the scene text (such as words, word segments, or even single characters), they do not critically affect the subsequent SWT voting process. Enough color information is still available in the remaining (correctly determined) parts of the text.
Tables 2 and 3. Labels: CC detection rate; Mean detection rate.
The results in Tables 2 and 3 indicate that SWT voting outperforms the state-of-the-art Nikolaou text-oriented color reduction method. SWT regions, most likely belonging to text regions, contain valuable text color information, which is used to correctly supervise the mean-shifting in favor of text colors.
We presented a novel SWT voting-based color reduction method. First, an adaptive sub-block SWT direction determination method was described. By splitting an image into sub-blocks and analyzing the corresponding SWT histograms of the gradient and counter-gradient directions, the method achieves a 91% detection rate. Second, an SWT voting approach for color reduction was proposed. Colors rich in SWT pixels most likely belong to text characters and are therefore blocked from being mean-shifted away. Besides improving the state-of-the-art Nikolaou text-oriented color reduction approach, SWT information can be successfully applied in the connected component filtering stage. Only connected components with an SWT cover ratio higher than a predefined threshold are treated as text. All others are filtered out since they are not covered by enough SWT pixels and most probably do not correspond to text. SWT voting-based color reduction achieves up to an 80% mean detection rate and up to a 71% CC detection rate.
Thresholds in the SWT voting condition determine sensitivity of the text/non-text transition detection in the mean-shift stage. If the thresholds are relaxed, even more text colors can be preserved, but in this case text characters are often split into several text colors. Our future work will therefore focus on merging text colors that belong to the same text character.
This research is supported by the Public Agency for Technology of the Republic of Slovenia (TIA) – operation partly financed by the European Union, European Social Fund.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.