
Automatic Moving Object Segmentation from Video Sequences Using Alternate Flashing System

Abstract

A novel algorithm to extract moving objects from video sequences is proposed in this paper. The proposed algorithm employs a flashing system to obtain an alternate series of lit and unlit frames from a single camera. For each unlit frame, the proposed algorithm synthesizes the corresponding lit frame using a motion-compensated interpolation scheme. Then, by comparing the unlit frame with the lit frame, we construct the sensitivity map, which provides depth cues. In addition to the sensitivity term, color, coherence, and smoothness terms are employed to define an energy function, which is minimized to yield segmentation results. Moreover, we develop a faster version of the proposed algorithm, which reduces the computational complexity significantly at the cost of slight performance degradation. Experiments on various test sequences show that the proposed algorithm provides high-quality segmentation results.

1. Introduction

Due to the advances in computation and communication technologies, the interest in video content has increased significantly, and it has become more and more important to analyze and understand video content automatically using computer vision techniques. To address this growing demand, various video analysis techniques have been introduced. Among them, moving object segmentation is a fundamental tool, which is widely used in a variety of applications. In particular, it plays an important preprocessing role in vision-based human motion capture and analysis, since the shape of a human subject after the segmentation is one of the main features for understanding human behaviors [1]. For example, in human pose estimation, 2D outlines of a human subject, which are extracted from one or more viewpoints using object segmentation techniques, are employed to reconstruct the 3D shape of a generic humanoid model [2–5]. Also, based on moving object segmentation, the outline of a human body can be tracked and used in human gesture analysis and human-machine interfaces [6, 7]. Moreover, the body shape and dynamics of a human subject can be used to recognize his or her identity [8, 9]. Therefore, the development of accurate video object segmentation techniques is essential to understanding human behaviors.

Many approaches have been proposed for video object segmentation. They can be classified roughly into two categories: semiautomatic and automatic methods. Semiautomatic methods [10–13] first identify regions of interest coarsely using initial user interactions. Then, based on the initial information, they construct color, position, or motion models of the objects and the background. The models are then used to separate the objects from the background more accurately. In [10], a background subtraction method was proposed to segment objects in video sequences with static backgrounds. It extracts moving objects by subtracting a given background from each frame in a video sequence. In [11], Criminisi et al. proposed a discriminative model, which is composed of motion, color, and contrast cues with spatial and temporal priors. Their algorithm achieves high-quality video segmentation in real time, but it works only if ground truth data is available for training the model parameters. Also, tracking-based algorithms have been proposed in [12, 13]. They extract objects in the first frame based on users' markings, and then track the objects in subsequent frames using color, position, and temporal cues. These semiautomatic methods [10–13] can achieve relatively accurate segmentation results using initial interactions. However, the interactions prevent them from being used in applications in which full automation is required.

On the other hand, automatic video segmentation methods extract objects without initial interactions [14–17]. They include an object detection stage, which defines the objects of interest. Since objects of interest are usually moving, motion information is typically employed to distinguish the objects from the background. The motion field between consecutive frames is estimated, and then regions are classified as object or background based on the motion information. Chien et al. [14] proposed a background registration technique to estimate a reliable background image. Moving objects are extracted by comparing each frame with the estimated background. Tsaig and Averbuch's algorithm [15] divides each frame into small regions, finds the matching regions between consecutive frames, and declares the regions with large motions as objects. Yin et al.'s algorithm [16] learns segmentation likelihoods from the spatial contexts of motion information, and extracts objects automatically with tree-based classifiers. In [17], Zhang et al. proposed estimating the depth information of sparse points to detect foreground objects. These automatic methods [14–17] are effective, provided that the objects and the background exhibit different motion characteristics. However, they may not provide accurate results for sequences with no or small object motions.

Recently, a new approach to automatic object segmentation, which uses extra information such as depth, flash/no-flash differences, and depth-of-field (DoF), has been introduced [18–21]. Kolmogorov et al.'s algorithm [18] uses a stereo camera to estimate depth information, which is in turn used to extract foreground objects. It does not depend on the motion information between successive frames, but on the disparity information between stereo views. However, the disparity estimation is another challenging task, requiring heavy computational loads. In [19, 20], a flash is used to extract foreground objects using a single camera. After acquiring an ordinary image without the flash, the camera captures an additional image lit by the flash. Then, by comparing the flash image with the no-flash one, color and intensity differences are obtained to extract objects. An alternative method is to use a matting model [21]. In an image with a shallow DoF, objects are focused while the background is not. Thus, the focused objects can be extracted automatically.

In this paper, we propose a novel algorithm to extract moving objects, including humans, from video sequences automatically. We extend the image segmentation techniques in [19, 20], which use a pair of flash and no-flash images, to the video segmentation case. The proposed algorithm is a tracking-based scheme using an alternate flashing system. When acquiring a video sequence, we capture even and odd frames with and without flash lights, respectively. Then, we find matching points between lit and unlit frames to construct a sensitivity map, from which the depth information can be inferred. In addition to the sensitivity map, color and temporal features are used to define an energy function, which is minimized by a graph cut algorithm to yield segmentation results. Simulation results demonstrate that the proposed algorithm provides reliable segmentation results.

The main contributions of this paper can be summarized as follows. First, we design a dedicated flashing system to capture an alternate series of lit and unlit frames. Second, we develop an efficient motion-compensated interpolation scheme, which matches lit and unlit frames to construct a sensitivity map. Third, we use the sensitivity map to accurately extract complex and deformable objects, especially humans, which are hard to segment out using conventional segmentation algorithms. Last, we implement a faster version of the proposed algorithm, which can be employed in real-time segmentation applications.

The rest of this paper is organized as follows. Section 2 describes our flashing system. Section 3 explains the features for segmentation, and Section 4 details the energy minimization scheme. Section 5 discusses implementation issues for real-time segmentation. Section 6 provides simulation results. Finally, Section 7 concludes the paper.

2. Flashing System

2.1. Video Acquisition

The proposed algorithm extracts objects based on the depth information estimated from pairs of lit and unlit frames. To capture an alternate series of lit and unlit frames, we construct the dedicated flashing system in Figure 1, which is composed of an 8 × 8 array of light emitting diodes (LEDs). An LED has a short response time and thus can be turned on or off quickly. When the camera starts to acquire a scene, the LEDs are turned on to light it. Then, they are turned off before the next frame is captured. The flashing system is connected to a camera (Grasshopper [22]), which provides a trigger signal to the LEDs to synchronize them with the image capturing system. We capture lit and unlit frames alternately by switching the LEDs on only when even frames are acquired. We capture lit frames with a short exposure time and unlit frames with a long exposure time, as illustrated in Figure 2. Unlit frames, from which we extract objects, exhibit natural colors with an adequate exposure time. On the other hand, lit frames contain brighter objects and a darker background than unlit frames, due to the flash light and the shorter exposure time. By comparing unlit frames with lit frames, we can perform high-quality segmentation of unlit frames.
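To make the capture convention concrete, the following Python sketch shows how an interleaved 120 frames/s capture could be split into the lit and unlit streams in software. It only illustrates the even/odd convention described above; the function and the dummy frames are hypothetical, and the actual flash synchronization is handled by the camera trigger hardware.

```python
import numpy as np

def demultiplex_flash_sequence(frames):
    """Split an interleaved 120 frames/s capture into lit and unlit streams.
    Even-indexed frames are lit (LEDs on, short exposure) and odd-indexed
    frames are unlit (LEDs off, long exposure), following the convention
    described above."""
    lit = [f for i, f in enumerate(frames) if i % 2 == 0]     # flash on
    unlit = [f for i, f in enumerate(frames) if i % 2 == 1]   # flash off
    return lit, unlit

# Example with dummy CIF-sized frames.
frames = [np.zeros((288, 352, 3), dtype=np.uint8) for _ in range(8)]
lit_frames, unlit_frames = demultiplex_flash_sequence(frames)
print(len(lit_frames), len(unlit_frames))   # 4 4
```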

Figure 1

A dedicated flashing system, which captures lit and unlit frames alternately.

Figure 2

The proposed algorithm acquires an alternate series of lit and unlit frames. For each unlit frame, it estimates the motion vector field and then synthesizes the corresponding lit frame using a motion-compensated interpolation scheme.

When we capture a video of human subjects, alternate flashing may annoy them. To alleviate the annoyance, we set the frame rate to 120 frames/s, which corresponds to the flashing frequency of 60 Hz. At this relatively high frequency, humans can hardly notice flickering and the lights appear to be turned on steadily.

Since the proposed algorithm can achieve accurate segmentation results, it can be employed in various applications in which the flashing system can be installed. For example, it can be used to understand human behaviors in indoor environments [1], to substitute backgrounds in video conferencing applications [23, 24], and to help mobile robots detect obstacles [25]. It is noted that the flashing system is less effective in bright outdoor environments. Also, the current prototype of the flashing system is relatively bulky, but we expect that its size can be reduced with more sophisticated packaging so that it can be integrated into a handheld camera system.

2.2. Matching between Lit and Unlit Frames

Using the alternate flashing system, we capture an input sequence at a frame rate of 120 frames/s. The proposed algorithm extracts the object layer from the unlit sequence, whose frame rate is 60 frames/s. As shown in Figure 2, for each unlit frame, the proposed algorithm synthesizes the corresponding lit frame and then compares the two frames to derive the depth information.

A synthesized lit frame is interpolated from its neighboring frames, as shown in Figure 2. To synthesize it with a motion-compensated interpolation scheme, we develop a two-step motion estimation procedure. First, we estimate the global motion, which represents the motion of the background. Second, we refine the local motions of objects in a bilateral manner, using the information in the subsequent frames as well as in the past frames.

We first estimate the global background motion. However, motion estimation between lit and unlit frames using image intensities is unreliable, since the scene irradiance varies dramatically according to the lighting conditions. Instead, assuming that the background maintains a constant velocity within a short time interval, we estimate the global motion field between the two neighboring unlit frames and use approximately half of that field as the motion field from the intermediate lit frame to the current unlit frame. The global motion is represented by an affine model, given by

x' = a_1 x + a_2 y + a_3,    y' = a_4 x + a_5 y + a_6,    (1)

where (x, y) and (x', y') denote the coordinates of the matching pixels in the two frames, respectively. Then, the motion of the pixel is given by (x' - x, y' - y). The unknown six parameters a_1, ..., a_6 in (1) are estimated based on the optical flow equation [26],

I_x u + I_y v + I_t = 0,    (2)

where I_x and I_y denote the spatial derivatives, I_t denotes the temporal derivative of the image intensity, and (u, v) is the motion of the pixel. By plugging (1) into the optical flow equation in (2), a linear system of equations for the unknown six parameters is derived and then solved using the least squares method [27]. Note that an equation is set up only for the pixels in the already segmented background layer of the previous frame, to avoid the effects of individual object motions on the global background motion estimation. Then, the global motion between the intermediate lit frame and the current unlit frame is approximated as half of that between the two unlit frames.
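As an illustration, the least-squares estimation of the global affine motion can be sketched in Python as follows. This is a minimal sketch under the assumptions stated above: the frames are grayscale float arrays, the derivatives are simple finite differences, and the affine parameters describe the displacement field directly, which is equivalent to the coordinate mapping in (1) up to a reparameterization.

```python
import numpy as np

def estimate_global_affine(I_prev, I_curr, bg_mask):
    """Least-squares estimate of the six affine background-motion parameters
    between two unlit frames, using the optical flow constraint (2) only on
    pixels already labeled as background (bg_mask == True)."""
    Iy, Ix = np.gradient(I_curr)          # spatial derivatives
    It = I_curr - I_prev                  # temporal derivative
    ys, xs = np.nonzero(bg_mask)
    ix, iy, it = Ix[ys, xs], Iy[ys, xs], It[ys, xs]
    # Displacement model: u = a1*x + a2*y + a3, v = a4*x + a5*y + a6.
    # Optical flow: Ix*u + Iy*v + It = 0  =>  A @ a = -It.
    A = np.stack([ix * xs, ix * ys, ix, iy * xs, iy * ys, iy], axis=1)
    a, *_ = np.linalg.lstsq(A, -it, rcond=None)
    return a                              # (a1, ..., a6)

def affine_motion_field(a, shape):
    """Per-pixel motion vectors (u, v) implied by the affine parameters."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    u = a[0] * xs + a[1] * ys + a[2]
    v = a[3] * xs + a[4] * ys + a[5]
    return u, v
```

Halving the resulting field then approximates the background motion from the intermediate lit frame to the current unlit frame, as described above.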

After the global motion estimation, some pixels, especially object pixels, may exhibit motions that are not faithfully represented by the global motion model. Therefore, the local motion of a pixel whose matching error is larger than a threshold after the global motion compensation is refined using a bilateral block matching procedure. We assume that objects and the background have constant velocities within a short time interval. Thus, the motions between the current unlit frame and its neighboring frames are proportional to their temporal distances, so a single motion vector determines the displacements to the past frames as well as to the subsequent frames, as shown in Figure 2. Therefore, the local motion of each pixel is computed, using the block matching algorithm in a bilateral manner, by

(3)

where

(4)

where the cost in (4) compares the intensities of the lit and unlit frames over the 9 × 9 block around each pixel. This bilateral motion estimation attempts to obtain coherent forward and backward motion vectors.

Finally, the lit frame corresponding to the current unlit frame is synthesized as

(5)

where denotes the pixel matching error after the global motion compensation.
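Because (3)-(5) are not reproduced above, the following numpy sketch only illustrates the bilateral refinement idea under the constant-velocity assumption. The SAD cost, the search range, the comparison of only the two neighboring lit frames, and the simple two-frame averaging for the synthesis are illustrative choices, not the paper's exact formulas, which also involve the unlit frames.

```python
import numpy as np

def _block(img, cy, cx, half):
    """Return the (2*half+1)^2 block centered at (cy, cx), or None at borders."""
    h, w = img.shape
    if cy - half < 0 or cx - half < 0 or cy + half >= h or cx + half >= w:
        return None
    return img[cy - half:cy + half + 1, cx - half:cx + half + 1].astype(float)

def bilateral_refine(L_prev, L_next, p, search=4, half=4):
    """Refine the local motion of pixel p = (y, x) by bilateral block matching:
    find the displacement d that best aligns the block around p - d in the
    previous lit frame with the block around p + d in the next lit frame,
    under the constant-velocity assumption.  half = 4 gives a 9 x 9 block."""
    y, x = p
    best_d, best_cost = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            b_prev = _block(L_prev, y - dy, x - dx, half)
            b_next = _block(L_next, y + dy, x + dx, half)
            if b_prev is None or b_next is None:
                continue
            cost = np.abs(b_prev - b_next).sum()       # SAD matching error
            if cost < best_cost:
                best_cost, best_d = cost, (dy, dx)
    return best_d

def synthesize_lit_pixel(L_prev, L_next, p, d):
    """Motion-compensated interpolation of the synthesized lit frame at p,
    averaging the two lit neighbors along the refined trajectory."""
    (y, x), (dy, dx) = p, d
    return 0.5 * (float(L_prev[y - dy, x - dx]) + float(L_next[y + dy, x + dx]))
```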

3. Features for Segmentation

3.1. Sensitivity

In general, the closer an object is to the flashing system, the more flash light it reflects. The irradiance is inversely proportional to the squared distance between the flash and the object. Therefore, the depth information of objects and the background can be inferred from a sensitivity map [28, 29], which represents the ratio of the amounts of light reaching each pixel in the unlit frame and the lit frame. More specifically, the sensitivity map is given by

S(p) = ( E_u(p) t_u ) / ( E_l(p) t_l ),    (6)

where E_u(p) and E_l(p) denote the irradiances of the unlit and lit frames at pixel p, and t_u and t_l are the corresponding exposure times. The amount of light, E(p) t, is obtained using the camera response function f [30]. Since the camera response function is monotonically increasing, the amount of light on pixel p is derived as

E(p) t = f^{-1}( I(p) ),    (7)

where I(p) denotes the recorded intensity at pixel p.

Figure 3 shows the normalized sensitivity map, obtained from the lit and unlit frames in Figure 2. Note that the sensitivities in the background layer are larger than 1 due to the short exposure time of the lit frame. On the other hand, the sensitivities on the objects are less than 1, since the objects reflect the flash light strongly.
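A minimal sketch of the sensitivity computation is given below. It assumes grayscale frames and a known inverse camera response f_inv (for example, recovered with the Debevec-Malik method [30]); the exposure times are absorbed into the recovered amounts of light, so the ratio follows (6) and (7) directly.

```python
import numpy as np

def sensitivity_map(I_unlit, I_lit, f_inv, eps=1e-6):
    """Sensitivity of (6): ratio of the amounts of light recorded at each
    pixel in the unlit frame and the synthesized lit frame.  f_inv implements
    the inverse camera response of (7), mapping intensities to amounts of
    light (irradiance times exposure time)."""
    X_u = f_inv(I_unlit.astype(float))    # E_u(p) t_u
    X_l = f_inv(I_lit.astype(float))      # E_l(p) t_l
    return X_u / (X_l + eps)              # > 1 in the background, < 1 on objects

# Illustration with a linear response; real cameras require calibration [30].
f_inv = lambda z: z + 1.0
S = sensitivity_map(np.full((288, 352), 120.0), np.full((288, 352), 60.0), f_inv)
```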

Figure 3

(a) Unlit frame, (b) lit frame, and (c) the sensitivity map.

After computing the sensitivity map, we smooth the sensitivity distribution using a uniform kernel and cluster the sensitivities into two Gaussian distributions using the expectation-maximization (EM) algorithm [31]. The smoothing is performed to prevent local convergence of the EM algorithm. The sensitivity distribution is then modeled by

p(s) = w_b N(s; μ_b, σ_b²) + w_o N(s; μ_o, σ_o²),    (8)

where w_b + w_o = 1, and w_b and w_o are the mixing weights of the Gaussian mixture model. Also, N(s; μ, σ²) denotes the Gaussian distribution with mean μ and variance σ², and N(s; μ_b, σ_b²) represents the sensitivity distribution of background pixels. Similarly, N(s; μ_o, σ_o²) represents the sensitivity distribution of object pixels. Therefore, w_b N(s_p; μ_b, σ_b²) and w_o N(s_p; μ_o, σ_o²) can be interpreted as the likelihoods that pixel p belongs to the background layer and the object layer, respectively.
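The two-component fit can be implemented with a few lines of EM, as sketched below; an off-the-shelf implementation such as sklearn.mixture.GaussianMixture would serve equally well. The initialization and the fixed number of iterations are illustrative choices.

```python
import numpy as np

def fit_two_gaussians(s, iters=50):
    """EM fit of a two-component 1-D Gaussian mixture to the smoothed
    sensitivity samples s (a flat array).  Returns (weights, means,
    variances); the component with the larger mean corresponds to the
    background, whose sensitivities exceed 1."""
    w = np.array([0.5, 0.5])
    mu = np.array([np.percentile(s, 25), np.percentile(s, 75)])
    var = np.array([s.var(), s.var()]) + 1e-6
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        pdf = np.exp(-0.5 * (s[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = w * pdf
        r /= r.sum(axis=1, keepdims=True) + 1e-12
        # M-step: update mixing weights, means, and variances.
        n = r.sum(axis=0)
        w = n / len(s)
        mu = (r * s[:, None]).sum(axis=0) / n
        var = (r * (s[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
    return w, mu, var
```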

Since the sensitivity is a main feature for the classification, the exposure time for lit frames, which affects the quality of the sensitivity map, should be selected carefully. Note that the exposure time for unlit frames is set to record the natural mood and colors of the scene properly. If the exposure time for lit frames were identical to that for unlit frames, the intensities of object pixels might be saturated due to the limited dynamic range of the camera, making the sensitivity map unreliable. On the other hand, if the exposure time is too short, the pixels in the lit frame may be underexposed. Therefore, in this work, we set the exposure time for lit frames by considering the tradeoff between the saturation and underexposure problems.

Although the sensitivity is a robust feature for segmentation, there are limitations in separating objects from the background using only the sensitivity map. Since the amount of reflected light is determined not only by the distance from the camera to the object but also by the surface albedo and normals, the sensitivity map does not match the depth information perfectly. Therefore, to achieve more reliable segmentation results, we use color and temporal coherence as additional features.

3.2. Color

Colors make it easy to distinguish the layers, since they do not change dramatically between adjacent frames. After segmenting the last unlit frame, we estimate the probability density functions (pdf's) of the colors in the object layer and the background layer, respectively, and use those pdf's to segment the current unlit frame. We regard the sensitivity as an additional color component, and represent colors as four-dimensional vectors whose fourth component is the sensitivity.

Let S_0 and S_1, respectively, denote the sets of pixels in the background layer and the object layer in the last unlit frame. Also, let c_q denote the four-dimensional color vector of pixel q. Then, using the sample sets S_0 and S_1, nonparametric color pdf's for the background layer and the object layer, respectively, are estimated by

p(c | α) = ( 1 / (N_α h^d) ) Σ_{q ∈ S_α} K( (c - c_q) / h ),    (9)

where K is a kernel function, h is the bandwidth of the kernel, d is the dimension of c, N_α denotes the number of pixels in S_α, and the label α equals 0 for the background or 1 for the object. Note that d = 4 in this work. We use the multivariate Epanechnikov kernel [32], given by

K(x) = ( (d + 2) / (2 c_d) ) (1 - xᵀx)   if xᵀx < 1,   and   K(x) = 0   otherwise,    (10)

where c_d is the volume of the unit d-dimensional sphere: c_1 = 2, c_2 = π, c_3 = 4π/3, and so forth. The Epanechnikov kernel is radially symmetric and unimodal. It has the advantage that it can be computed more quickly than the Gaussian kernel.
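For reference, the kernel density estimate of (9) with the Epanechnikov kernel of (10) can be evaluated as follows. The vectorized evaluation below is a straightforward sketch, whereas the real-time variant in Section 5 replaces it with precomputed lookup tables.

```python
import math
import numpy as np

def epanechnikov_kde(samples, query, h):
    """Evaluate the nonparametric color pdf of (9) at `query` (shape (d,))
    from the layer's samples (shape (N, d)), using the multivariate
    Epanechnikov kernel of (10) with bandwidth h."""
    N, d = samples.shape
    c_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)   # volume of the unit d-sphere
    u = (query - samples) / h                          # scaled differences
    r2 = (u * u).sum(axis=1)
    k = np.where(r2 < 1.0, (d + 2) * (1.0 - r2) / (2.0 * c_d), 0.0)
    return k.sum() / (N * h ** d)

# Example: a 4-D color-plus-sensitivity vector against 1000 object samples.
samples = np.random.rand(1000, 4)
print(epanechnikov_kde(samples, np.array([0.5, 0.5, 0.5, 0.5]), h=2.0))
```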

3.3. Temporal Coherence

Unlike image segmentation, video segmentation can exploit the temporal coherence between consecutive frames. Most pixel positions in the current frame that are already classified as the object layer in past frames tend to belong to the object layer again. Thus, we define the temporal coherence map, which represents the likelihood that each pixel belongs to the background or the object layer, by counting the number of times that it has been labeled as object in past frames [12]. The binary label of a pixel in the current frame equals 1 if it is an object pixel, and 0 otherwise. Then, the temporal coherence for each label is defined as

(11)

where the count denotes the number of times that the pixel, tracked by its motion vector, has been classified as the object layer in the last unlit frames; the motion vector of each pixel in the current frame is estimated using the method in Section 2. The temporal coherence is higher if a pixel is assigned the same label as its temporal predecessors.

After the segmentation of the current frame, the accumulated count is updated by

(12)
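Since (11) and (12) are not reproduced above, the following sketch only captures their described behavior: the coherence of the object label is taken as the motion-compensated fraction of the last N unlit frames in which a pixel position was labeled as object, and the count is refreshed after each segmentation. The sliding-window bookkeeping, the window length N = 8 (the value reported in Section 6), and the clipping at the frame borders are assumptions of this sketch.

```python
from collections import deque
import numpy as np

class TemporalCoherence:
    """Tracks how often each pixel position has been labeled as object over
    the last N segmented unlit frames."""

    def __init__(self, N=8):
        self.N = N
        self.history = deque(maxlen=N)     # last N binary label maps

    def update(self, labels):
        """Refresh the accumulated count with the newest segmentation,
        in the spirit of (12)."""
        self.history.append(labels.astype(np.int32))

    def coherence(self, motion):
        """Motion-compensated likelihood that each pixel is object, in the
        spirit of (11).  `motion` holds integer (dy, dx) displacements per
        pixel of the current frame, shaped (H, W, 2)."""
        h, w = motion.shape[:2]
        if not self.history:
            half = np.full((h, w), 0.5)
            return half, half
        count = sum(self.history)
        ys, xs = np.mgrid[0:h, 0:w]
        py = np.clip(ys - motion[..., 0], 0, h - 1).astype(int)
        px = np.clip(xs - motion[..., 1], 0, w - 1).astype(int)
        t_obj = count[py, px] / len(self.history)
        return t_obj, 1.0 - t_obj          # T(object), T(background)
```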

4. Segmentation by Energy Minimization

We assign a binary label to each pixel in the current frame by minimizing an energy function defined over the label image, which is composed of the pixel labels. The energy function consists of sensitivity, color, temporal coherence, and smoothness terms, which impose constraints on the pixel labels.

First, the sensitivity term is defined as

(13)

This sensitivity term indicates that, if a pixel is labeled as a certain class, its sensitivity likelihood for that class should be higher than that for the other class. Note that the overall sensitivity distribution in (8) can be regarded as the reliability of the sensitivity of a pixel. By incorporating this reliability as a weight in the summation in (13), pixels with more reliable sensitivity values play more important roles in the energy minimization.

Second, the color term is similarly defined as

(14)

which constrains that each pixel should be assigned the label with the higher color probability.

Third, the temporal coherence term is defined as

(15)

The temporal coherence term attempts to reduce outliers by giving a penalty to a pixel, which is assigned a different label from its temporal predecessors.

Finally, the smoothness term enforces the constraint that neighboring pixels of similar intensities should be assigned the same label [33]. It is defined as

(16)

where the summation is over the set of pairs of 8-adjacent pixels, each pixel is represented by its color vector, and the color differences are normalized by their standard deviation over all adjacent pairs.

The overall energy function is then defined as a weighted sum of the four terms, given by

(17)

The energy minimization is carried out through the graph cut algorithm in [34], which is an effective energy minimization method. The min-cut of a weighted graph provides the segmentation that best separates objects from the background.
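As an illustration, the binary labeling by min-cut can be carried out with an off-the-shelf max-flow solver. The sketch below assumes the PyMaxflow package and per-pixel likelihood maps for the object and background classes (for example, products of the sensitivity, color, and coherence likelihoods); it replaces the contrast-sensitive smoothness of (16) with a uniform 4-neighborhood Potts weight to stay short, so it is not the paper's exact graph construction.

```python
import numpy as np
import maxflow   # PyMaxflow package, assumed to be available

def segment_by_graph_cut(p_obj, p_bg, lam=0.5, eps=1e-6):
    """Binary labeling by min-cut.  p_obj and p_bg are per-pixel likelihoods
    of the object and background classes; their negative logs act as data
    costs.  A uniform 4-neighborhood Potts weight lam stands in for the
    contrast-sensitive smoothness term."""
    g = maxflow.Graph[float]()
    nodeids = g.add_grid_nodes(p_obj.shape)
    g.add_grid_edges(nodeids, lam)                 # pairwise smoothness edges
    d_obj = -np.log(p_obj + eps)                   # cost of labeling as object
    d_bg = -np.log(p_bg + eps)                     # cost of labeling as background
    # A pixel ending on the sink side pays the source capacity, so attaching
    # the object cost to the source edge makes sink-side pixels the objects.
    g.add_grid_tedges(nodeids, d_obj, d_bg)
    g.maxflow()
    return g.get_grid_segments(nodeids)            # True where object

# Usage: labels = segment_by_graph_cut(p_obj, p_bg) on H x W likelihood maps.
```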

5. Real-Time Segmentation

The proposed algorithm, described in Sections 2–4, achieves high quality segmentation results, but its computational complexity is relatively high. In this section, we develop a faster version of the proposed algorithm, which reduces the computational complexity significantly at the cost of slight performance degradation. The faster version can be used in real-time applications.

The proposed real-time segmentation algorithm also employs sensitivity, color, and coherence features. However, the feature computations are simplified as follows.

5.1. Simplified Motion Estimation

To compute the sensitivity map for an unlit frame, we synthesize the corresponding lit frame based on the motion-compensated interpolation. Since the motion estimation demands high complexity, it is simplified in the following way. While both the global motion estimation and the local motion refinement are performed in Section 2.2, the real-time algorithm carries out the global motion estimation only. Furthermore, it uses a translational model instead of the affine model in (1). The two parameters of the translational motion are also estimated using the optical flow equation in (2) and the least squares method. Pixels near object boundaries tend to have high matching errors after the global motion compensation. Thus, we mark those pixels as void and use only the non-void pixels, whose matching errors are less than a threshold, to synthesize the lit frame and compute the sensitivities. The sensitivity distribution in (8) is also obtained and modeled using the EM algorithm, excluding the void pixels.
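A sketch of the simplified global motion estimation is given below; the threshold value, the finite-difference derivatives, and the wrap-around warping with np.roll are illustrative simplifications.

```python
import numpy as np

def estimate_global_translation(I_prev, I_curr, err_thresh=10.0):
    """Real-time variant: a single translational motion (u, v) is estimated
    from the optical flow constraint (2) by least squares; pixels whose
    compensated matching error exceeds a threshold are marked as void."""
    Iy, Ix = np.gradient(I_curr.astype(float))
    It = I_curr.astype(float) - I_prev.astype(float)
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    (u, v), *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    # Compensate with the rounded translation (wrap-around at the borders is
    # ignored in this sketch) and measure the remaining matching error.
    du, dv = int(round(u)), int(round(v))
    comp = np.roll(np.roll(I_prev.astype(float), dv, axis=0), du, axis=1)
    void = np.abs(I_curr.astype(float) - comp) > err_thresh
    return (u, v), void
```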

5.2. Block-Based Color Model

Instead of the four-dimensional color vector in Section 3.2, the real-time algorithm uses the three-dimensional color vector without the sensitivity component.

Moreover, the color pdf's are estimated at the block level, rather than at the frame level as done in (9). Specifically, we divide each frame into blocks of size 16 × 16. Then, we model the color pdf's for the background and the object layers in each block by

(18)

where the three-dimensional color vector of each pixel is used. This equation is the same as (9), except for the reduction of the sample space from the frame to the block and the reduction of the color dimension. To reduce the complexity, a uniform kernel with a narrow bandwidth is employed in (18), and the color distributions for each block are saved as lookup tables.

The block-based color model is spatially adaptive, since it estimates the color pdf's for each block separately. However, when an object moves quickly from one block to another, the block-based color model may fail to represent the sudden changes in the color distributions correctly, leading to classification errors. To address this problem, we employ the notion of a macroblock. Specifically, suppose that we need to use the color model for a pixel. Instead of using the color model of the block containing the pixel alone, we use the average of the color models of nine blocks, consisting of that block and its eight neighbors. More specifically, the color pdf's of the macroblock, composed of the nine blocks, are estimated by

(19)

where the macroblock is the expanded region covering the nine blocks. Figure 4 shows two frames, where a green square depicts an expanded macroblock and a red square depicts a block for which the probability lookup tables are maintained. By expanding the block size, the color pdf's can be estimated more reliably, even when objects experience fast motions.
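The lookup-table color model can be sketched as follows. Coarse RGB histograms stand in for the narrow-bandwidth uniform-kernel estimate of (18), and the macroblock average realizes the idea of (19); the number of histogram bins and the plain double loop are illustrative, not the paper's implementation.

```python
import numpy as np

def block_color_histograms(frame, labels, block=16, bins=8):
    """Per-block color lookup tables: for every 16 x 16 block, a coarse RGB
    histogram is built separately for the background (label 0) and the object
    (label 1) samples of the previous segmentation."""
    h, w, _ = frame.shape
    by, bx = h // block, w // block
    hist = np.zeros((by, bx, 2, bins, bins, bins))
    idx = (frame.astype(int) * bins) // 256            # quantized RGB indices
    for y in range(by * block):
        for x in range(bx * block):
            r, g, b = idx[y, x]
            hist[y // block, x // block, labels[y, x], r, g, b] += 1
    # Normalize each (block, layer) table into a pdf; empty tables stay zero.
    sums = hist.sum(axis=(3, 4, 5), keepdims=True)
    return np.divide(hist, sums, out=np.zeros_like(hist), where=sums > 0)

def macroblock_pdf(hist, by, bx):
    """Average the pdf's of a block and its (up to) eight neighbors, which
    keeps the color model valid when objects move quickly across blocks."""
    ys = slice(max(by - 1, 0), min(by + 2, hist.shape[0]))
    xs = slice(max(bx - 1, 0), min(bx + 2, hist.shape[1]))
    return hist[ys, xs].mean(axis=(0, 1))              # shape (2, bins, bins, bins)
```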

Figure 4

A red square depicts a block, whereas a green square depicts a macroblock. The pdf's for the background and the object layers are estimated for each block separately, but the pdf's for the nine blocks, which compose a macroblock, are averaged for the color modeling.

5.3. Coherence Strip

The real-time algorithm exploits the property that an object generally does not change its position abruptly between consecutive frames. Specifically, given the object contour in the previous frame, we construct a coherence strip [13], in which the object contour in the current frame is likely to be located. The notion of the coherence strip helps to extract a spatio-temporally coherent video object as well as to reduce the computational complexity.

Figure 5(a) is a segmented object, and Figure 5(b) shows a coherence strip along the object contour. If the object does not move abruptly, the object contour in the next frame tends to be located within the coherence strip. We construct the spatio-temporal coherence map that represents the likelihood that each pixel belongs to the object class.  Specifically, the spatio-temporal coherence of a pixel within the coherence strip can be computed by

(20)

where is the width of the strip, is the minimum distance of from the object contour in frame , and . Also,

(21)

Notice that the coherence becomes lower as a pixel moves away from the contour in the outward direction. Figure 5(c) shows the spatio-temporal coherence map, where brighter pixels belong to the object layer with higher probabilities.

Figure 5

(a) Segmentation result, (b) coherence strip, and (c) spatio-temporal coherence map.

The pixels inside the region enclosed by the coherence strip are classified as object, while the pixels outside the strip are classified as background. Then, the segmentation is performed only on the pixels within the coherence strip, reducing the computational complexity. The width of the strip should be determined carefully. If the strip is too wide, many pixels must be classified, which takes a longer processing time. On the other hand, if the strip is too narrow, the object contour may move outside the coherence strip when the object has fast or abrupt motion. Thus, we determine the width based on the motion vector of the object. Similar to the global motion estimation, the motion vector of the object is estimated using the translational motion model. Finally, we set the width according to the magnitude of the object motion vector via

(22)

where the parameter is set to 10 for all experiments in this work. In addition, we translate the coherence strip in (20) by the object motion vector [13]. The shifted spatio-temporal coherence is more accurate than (20), especially for an object with fast motion.
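Since (20)-(22) are not reproduced above, the sketch below only mirrors the described behavior: the previous object mask is shifted by the object motion vector, a strip of motion-dependent width is built around its contour with a distance transform, and the coherence decays in the outward direction. The linear profile and the additive width rule are assumptions of this sketch.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def coherence_strip(prev_mask, obj_motion, w0=10):
    """Build the coherence strip and a simple spatio-temporal coherence map
    from the previous object mask.  The mask is first shifted by the object
    motion vector, and the strip width grows with the motion magnitude
    (w0 = 10 as in the experiments)."""
    dy, dx = int(round(obj_motion[0])), int(round(obj_motion[1]))
    mask = np.roll(np.roll(prev_mask.astype(bool), dy, axis=0), dx, axis=1)
    width = w0 + int(round(np.hypot(dy, dx)))          # assumed form of (22)
    d_in = distance_transform_edt(mask)                # distance to the contour, inside
    d_out = distance_transform_edt(~mask)              # distance to the contour, outside
    dist = np.where(mask, d_in, d_out)
    strip = dist <= width                              # band to be re-segmented
    # Coherence decays linearly from the inside of the object toward the
    # outer edge of the strip (an assumed profile for (20)-(21)).
    coh = np.clip(np.where(mask, 0.5 + 0.5 * d_in / width,
                                 0.5 - 0.5 * d_out / width), 0.0, 1.0)
    return strip, coh
```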

5.4. Energy Minimization

An energy function is defined over the pixels within the coherence strip and then minimized using the graph cut algorithm, as done in Section 4. By applying the energy minimization to the strip only, instead of the whole frame, the computational complexity is reduced significantly. The sensitivity term is the same as (13), except that the likelihood is set to 0 for both classes, 0 and 1, if a pixel is void and its sensitivity is not estimated. The color term is defined as

(23)

where the expanded macroblock containing each pixel is used for the color pdf's. The spatio-temporal coherence term is defined as

(24)

using the prior probabilities in (20) and (21). The smoothness term is the same as (16), except that the 8-neighborhood is replaced by the 4-neighborhood. Then, the energy is defined as the weighted sum of the four terms

(25)

6. Experimental Results

The proposed video object segmentation algorithm is implemented in the C++ language on a personal computer with a Pentium-IV 3.0 GHz CPU and 2 GB of memory. Two versions of the proposed algorithm are implemented: the proposed algorithm I denotes the algorithm described in Sections 2–4, whereas the proposed algorithm II denotes the faster algorithm in Section 5 for real-time applications.

We use several test sequences of CIF size (352 × 288), captured using the alternate flashing system. As mentioned in Section 2, the sequences are captured at a frame rate of 120 frames/s, and the segmentation is performed only on the unlit frames at 60 frames/s. For the proposed algorithm I, the bandwidth of the kernel in (9) is set to 2, the number of past frames considered in (11) is set to 8, and the weights in (17) are fixed to 0.3, 0.32, and 0.012, respectively. For the proposed algorithm II, the kernel bandwidth in (18) is 1, and the weights in (25) are 0.2, 0.12, and 0.04.

Figures 6, 7, 8, and 9 show the segmentation results on some test sequences, obtained by the proposed algorithm I. In the "Dolls" sequence in Figure 6 and "Throwing Doll" sequence in Figure 7, some parts of the objects have the same colors as the backgrounds. Moreover, there are quick motions in the "Throwing Doll" sequence. However, since the proposed algorithm I estimates the sensitivity maps reliably using the pairs of lit and unlit frames, it segments the objects correctly. The "Moving Object and Background" sequence in Figure 8 contains motions in both background and object layers, but the proposed algorithm I still provides faithful segmentation results. The "Woman in Red" sequence in Figure 9 is the most challenging sequence, since the same color is found across the object contours. The red shirt and the red wall are difficult to separate even with user interactions. However, we see that the proposed algorithm automatically extracts the person from the background and the resulting contours are very accurate.

Figure 6

Segmentation of the "Dolls" sequence by the proposed algorithm I. The leftmost column shows a pair of unlit and lit frames, while the other columns show a series of unlit frames and their segmentation results.

Figure 7

Segmentation of the "Throwing Doll" sequence by the proposed algorithm I. The leftmost column shows a pair of unlit and lit frames, while the other columns show a series of unlit frames and their segmentation results.

Figure 8

Segmentation of the "Moving Object and Background" sequence by the proposed algorithm I. The leftmost column shows a pair of unlit and lit frames, while the other columns show a series of unlit frames and their segmentation results.

Figure 9

Segmentation of the "Woman in Red" sequence by the proposed algorithm I. The leftmost column shows a pair of unlit and lit frames, while the other columns show a series of unlit frames and their segmentation results.

Next, we examine how the three features, that is, the sensitivity, color, and temporal coherence terms in the energy function in (17), affect the segmentation results of the proposed algorithm I. Figure 10 compares the error rates of the segmentation results on every 5th frame in the sequences in Figures 6, 7, 8, and 9. To compute the error rates, every 5th frame is segmented using the GrabCut algorithm with human interactions [35], and the result is regarded as the ground truth data. Three cases are considered: first, only the sensitivity term among the three features is minimized; second, the sensitivity and color terms are minimized together; third, all three terms are considered. In all these cases, the smoothness term is included in the energy minimization. When only the sensitivity is considered, the average error rate over all frames and all sequences is about 0.75%. The average error rate is reduced to 0.51% if the color term is combined with the sensitivity term. It is further reduced to 0.50% if all three terms, including the temporal coherence term, are considered. The reduction of the average error rate due to the temporal coherence term appears minor. However, the temporal coherence term contributes to more reliable segmentation, as illustrated in Figures 11 and 12. When only the sensitivity term is employed, some pixels are misclassified, as indicated by the red ellipses in Figures 11 and 12. Those misclassified pixels are corrected when the color and temporal coherence terms are included in the energy minimization as well. In these examples, the background pixels, which are misclassified as object pixels, are corrected by adding the color term, and the object pixels on the dark hair are correctly classified only after the temporal coherence term is included. These results indicate that the proposed algorithm I achieves reliable segmentation by combining the three features efficiently.

Figure 10

The error rates of the proposed algorithm I on the (a) "Dolls," (b) "Throwing Doll," (c) "Moving Object and Background," and (d) "Woman in Red" sequences. Each curve corresponds to a different combination of features. In every curve, the smoothness term is included in the energy minimization.

Figure 11

Segmentation results of the proposed algorithm I on the "Moving Object and Background" sequence: (a) sensitivity term only, (b) sensitivity and color terms, and (c) sensitivity, color, and temporal coherence terms.

Figure 12

Segmentation results of the proposed algorithm I on the "Woman in Red" sequence: (a) sensitivity term only, (b) sensitivity and color terms, and (c) sensitivity, color, and temporal coherence terms.

Figure 13 compares the error rates of the proposed algorithm I with those of Criminisi et al.'s algorithm [11] and the proposed algorithm II. Criminisi et al.'s algorithm is one of the state-of-the-art segmentation methods. It constructs a model, composed of spatial prior, temporal prior, motion, contrast, and color cues, which is trained from the ground truth data for every 10th frame. Notice that Criminisi et al.'s algorithm does not use the information in lit frames, but it implicitly requires human interactions, since it uses ground truth data periodically. It does not provide reliable segmentation results, especially when an object and the background have the same color across their boundary. From Figure 13, we see that the proposed algorithm I consistently outperforms Criminisi et al.'s algorithm. We also observe that the proposed algorithm II provides error rates comparable to or lower than those of Criminisi et al.'s algorithm. However, it yields worse performance than the proposed algorithm I, since it approximates several procedures to reduce the computational complexity.

Figure 13

Comparison of the error rates on the (a) "Dolls," (b) "Throwing Doll," (c) "Moving Object and Background," and (d) "Woman in Red" sequences.

Since the proposed algorithm provides accurate segmentation results, it can be employed in various applications. As a simple example, we substitute the background of a test sequence in Figure 14. The background substitution can be used to protect privacy in mobile video communications, when a user does not want to reveal the background.

Figure 14

Background substitution. The 1st and 3rd columns show original frames, and the 2nd and 4th columns show the synthesized frames with the replaced background. The proposed algorithm I is used for the segmentation.

Figures 15 and 16 show the segmentation results of the proposed algorithm II. These sequences are captured at a frame rate of 30 frames/s, and only the unlit frames, at 15 frames/s, are segmented. In the "Tiger" sequence in Figure 15, some fingers and the background near the hand are not classified correctly, since they move very fast beyond the coherence strip. However, those pixels are classified correctly in the subsequent frames. The "Man in Blue" sequence in Figure 16 is segmented more accurately. In our PC implementation, the processing speed of the proposed algorithm II is about 15.5 frames/s. Thus, unlit frames can be segmented in real time while they are captured. Figure 17 compares the error rates on the sequences in Figures 15 and 16. It shows a similar tendency to Figure 13.

Figure 15

Segmentation of the "Tiger" sequence by the proposed algorithm II. The leftmost column shows a pair of unlit and lit frames, while the other columns show a series of unlit frames and their segmentation results.

Figure 16

Segmentation of the "Man in Blue" sequence by the proposed algorithm II. The leftmost column shows a pair of unlit and lit frames, while the other columns show a series of unlit frames and their segmentation results.

Figure 17

Comparison of the error rates on the (a) "Tiger" and (b) "Man in Blue" sequences.

7. Conclusions

In this paper, we proposed an automatic video segmentation algorithm, which can provide high quality results using the alternate flashing system. By comparing unlit frames with lit ones, the proposed algorithm obtains the sensitivity map indicating depth information. The proposed algorithm also obtains the color pdf's for the object and the background layers, and constructs coherence likelihoods. By minimizing the energy function, composed of the sensitivity, color, coherence, and smoothness terms, the proposed algorithm obtains accurate segmentation results. Moreover, we developed a faster version of the proposed algorithm, which reduces the computational complexity significantly to achieve real-time segmentation. Experimental results on various test sequences demonstrated that the proposed algorithm provides reliable and accurate segmentation results.

References

  1. Moeslund TB, Hilton A, Krüger V: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 2006, 104(2-3):90-126. 10.1016/j.cviu.2006.08.002


  2. Plänkers R, Fua P: Articulated soft objects for multiview shape and motion capture. IEEE Transactions on Pattern Analysis and Machine Intelligence 2003, 25(9):1182-1187. 10.1109/TPAMI.2003.1227995


  3. Carranza J, Theobalt C, Magnor MA, Seidel H-P: Free-viewpoint video of human actors. ACM Transactions on Graphics 2003, 22(3):569-577. 10.1145/882262.882309


  4. Agarwal A, Triggs B: 3D human pose from silhouettes by relevance vector regression. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), 2004 882-888.


  5. Sminchisescu C, Kanaujia A, Li Z, Metaxas D: Discriminative density propagation for 3D human motion estimation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), 2005 20-25.


  6. Cui Y, Weng J: Appearance-based hand sign recognition from intensity image sequences. Computer Vision and Image Understanding 2000, 78(2):157-176. 10.1006/cviu.2000.0837


  7. Song P, Yu H, Winkler S: Vision-based 3D finger interactions for mixed reality games with physics simulation. International Journal of Virtual Reality 2009, 8(2):1-6.


  8. Wang L, Tan T, Ning H, Hu W: Silhouette analysis-based gait recognition for human identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 2003, 25(12):1505-1518. 10.1109/TPAMI.2003.1251144


  9. Kale A, Sundaresan A, Rajagopalan AN, et al.: Identification of humans using gait. IEEE Transactions on Image Processing 2004, 13(9):1163-1173. 10.1109/TIP.2004.832865


  10. Sun J, Zhang W, Tang X, Shum H: Background cut. Proceedings of the European Conference on Computer Vision, 2006 628-641.


  11. Criminisi A, Cross G, Blake A, Kolmogorov V: Bilayer segmentation of live video. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), 2006 53-60.


  12. Liu Z, Shen L, Han Z, Zhang Z: A novel video object tracking approach based on kernel density estimation and Markov random field. Proceedings of the 14th IEEE International Conference on Image Processing (ICIP '07), 2007 373-376.


  13. Ahn J-K, Kim C-S: Real-time segmentation of objects from video sequences with non-stationary backgrounds using spatio-temporal coherence. Proceedings of the International Conference on Image Processing (ICIP '08), October 2008, San Diego, Calif, USA 1544-1547.


  14. Chien S-Y, Ma S-Y, Chen L-G: Efficient moving object segmentation algorithm using background registration technique. IEEE Transactions on Circuits and Systems for Video Technology 2002, 12(7):577-586. 10.1109/TCSVT.2002.800516


  15. Tsaig Y, Averbuch A: Automatic segmentation of moving objects in video sequences: a region labeling approach. IEEE Transactions on Circuits and Systems for Video Technology 2002, 12(7):597-612. 10.1109/TCSVT.2002.800513


  16. Yin P, Criminisi A, Winn J, Essa I: Tree-based classifiers for bilayer video segmentation. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'07), 2007


  17. Zhang G, Jia J, Xiong W, Wong T-T, Heng P-A, Bao H: Moving object extraction with a hand-held camera. Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV '07), 2007


  18. Kolmogorov V, Criminisi A, Blake A, Cross G, Rother C: Bi-layer segmentation of binocular stereo video. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), 2005 407-414.


  19. Sun J, Kang SB, Xu Z-B, Tang X, Shum H-Y: Flash cut: foreground extraction with flash and no-flash image pairs. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'07), 2007


  20. Sun J, Li Y, Kang SB, Shum H-Y: Flash matting. ACM Transactions on Graphics 2006, 25(1):772-778.


  21. Li H, Ngan KN: Unsupervised video segmentation with low depth of field. IEEE Transactions on Circuits and Systems for Video Technology 2007, 17(12):1742-1751.


  22. Point Grey Research: Triclops on-line manual. http://www.ptgrey.com/

  23. Baker H, Bhatti N, Tanguay D, et al.: Understanding performance in coliseum an immersive videoconferencing system. ACM Transactions on Multimedia Computing, Communications, and Applications 2005, 1(2):190-210. 10.1145/1062253.1062258


  24. Gharai L, Perkins C, Riley R, Mankin A: Large scale video conferencing: a digital amphitheater. Proceedings of the 8th International Conference on Distributed Multimedia Systems, 2002


  25. Soumare S, Ohya A, Yuta S: Real-time obstacle avoidance by an autonomous mobile robot using an active vision sensor and a vertically emitted laser slit. In Intelligent Autonomous Systems 7. IOS Press; 2002.


  26. Horn BKP, Schunck BG: Determining optical flow. Artificial Intelligence 1981, 17(1–3):185-203.


  27. Smolić A, Sikora T, Ohm J-R: Long-term global motion estimation and its application for sprite coding, content description, and segmentation. IEEE Transactions on Circuits and Systems for Video Technology 1999, 9(8):1227-1242. 10.1109/76.809158


  28. Raskar R, Tan K-H, Feris R, Yu J, Turk M: Non-photorealistic camera: depth edge detection and stylized rendering using multi-flash imaging. ACM Transactions on Graphics 2004, 23(3):679-688. 10.1145/1015706.1015779


  29. Agrawal A, Raskar R, Nayar SK, Li Y: Removing photography artifacts using gradient projection and flash-exposure sampling. ACM Transactions on Graphics 2005, 24(3):828-835. 10.1145/1073204.1073269


  30. Debevec PE, Malik J: Recovering high dynamic range radiance maps from photographs. Proceedings of the ACM Conference on Computer Graphics (SIGGRAPH '97), 1997 369-378.


  31. Moon TK: The expectation-maximization algorithm. IEEE Signal Processing Magazine 1996, 13(6):47-60. 10.1109/79.543975


  32. Silverman BW: Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, UK; 1986.


  33. Boykov YY, Jolly M-P: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. Proceedings of the 8th International Conference on Computer Vision, 2001 105-112.


  34. Boykov Y, Kolmogorov V: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 2004, 26(9):1124-1137. 10.1109/TPAMI.2004.60


  35. Rother C, Kolmogorov V, Blake A: "GrabCut"—interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics 2004, 23(3):309-314. 10.1145/1015706.1015720



Acknowledgments

This paper was supported partly by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education, Science and Technology (2009-0083495), and partly by the Seoul R&BD Program (no. ST090818).

Author information

Correspondence to Chang-Su Kim (EURASIP Member).

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Ahn, JK., Lee, DY., Lee (EURASIP Member), C. et al. Automatic Moving Object Segmentation from Video Sequences Using Alternate Flashing System. EURASIP J. Adv. Signal Process. 2010, 340717 (2010). https://doi.org/10.1155/2010/340717
