Denoising algorithm for the 3D depth map sequences based on multihypothesis motion estimation
© Jovanov et al; licensee Springer. 2011
Received: 5 June 2011
Accepted: 12 December 2011
Published: 12 December 2011
This article proposes an efficient wavelet-based depth video denoising approach based on a multihypothesis motion estimation aimed specifically at time-of-flight depth cameras. We first propose a novel bidirectional block matching search strategy, which uses information from the luminance as well as from the depth video sequence. Next, we present a new denoising technique based on weighted averaging and wavelet thresholding. Here we take into account the reliability of the estimated motion and the spatial variability of the noise standard deviation in both imaging modalities. The results demonstrate significantly improved performance over recently proposed depth sequence denoising methods and over state-of-the-art general video denoising methods applied to depth video sequences.
The perceived quality of multimedia content has become an important factor in the electronic entertainment industry. One of the most active topics in this area is 3D film and television. The future success of 3D TV crucially depends on practical techniques for high-quality capture of 3D content. Time-of-flight (TOF) sensors [1–3] are a promising technology for this purpose.
Depth images also have other important applications in the assembly and inspection of industrial products, autonomous robots interacting with humans and real objects, intelligent transportation systems, biometric authentication and in biomedical imaging, where they have an important role in compensating for unwanted motion of patients during imaging. These applications require even better accuracy of depth imaging than in the case of 3D TV, since the successful operation of various classification or motion analysis algorithms depends on the quality of input depth features.
One advantage of TOF depth sensors is that their successful operation depends less on scene content than that of other depth acquisition methods, such as disparity estimation and structure from motion. Another advantage is that TOF sensors directly output depth measurements, whereas other techniques estimate depth indirectly, using intensive and error-prone computations. TOF depth sensors can also operate in real time at quite high frame rates, e.g. 60 fps.
The main problems with the current TOF cameras are low resolution and rather high noise levels. These issues are related to the way the TOF sensors work. Most TOF sensors acquire depth information by emitting continuous-wave (CW) modulated infra-red light and measuring the phase difference between the sent (reference) and received light signals. Since the modulation frequency of the emitted light is known, the measured phase directly corresponds to the time of flight, i.e., the distance to the camera.
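The relation between measured phase and distance can be illustrated with a short sketch. It uses the standard continuous-wave relation d = c·φ / (4π·f), where the factor 4π accounts for the round trip of the light. The function names and the 20 MHz modulation frequency in the usage note are illustrative assumptions, not values taken from this article.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def phase_to_distance(phase_rad, modulation_hz):
    """Distance in metres for a measured phase shift (radians).

    The light travels to the object and back, hence 4*pi instead of 2*pi.
    """
    return SPEED_OF_LIGHT * phase_rad / (4.0 * math.pi * modulation_hz)


def unambiguous_range(modulation_hz):
    """Largest distance measurable before the phase wraps around at 2*pi."""
    return SPEED_OF_LIGHT / (2.0 * modulation_hz)
```

For an assumed modulation frequency of 20 MHz, `unambiguous_range(20e6)` is roughly 7.5 m, and a phase of π corresponds to half of that range.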
However, TOF sensors suffer from some drawbacks that are inherent to phase measurement techniques. The first group of depth image quality enhancement methods aims at correcting systematic errors of TOF sensors and distortions due to a non-ideal optical system, as in [4–7]. In this article, we address the most important problem related to TOF sensors, which limits the precision of depth measurements: signal-dependent noise. As shown in [1, 8], the noise variance in TOF depth sensors depends, among other factors, on the intensity of the emitted light, the reflectivity of the scene and the distance of the object in the scene.
A large number of methods have been proposed for spatio-temporal noise reduction in TOF images and in similar imaging modalities based on other 3D scanning techniques. Techniques based on non-local denoising [9, 10] were applied to sequences acquired using structured light methods. For a given spatial neighbourhood, they find the most similar spatio-temporal neighbourhoods in other parts of the sequence (e.g., earlier frames) and then compute a weighted average of these neighbourhoods, thus achieving noise reduction. Other non-local techniques, specifically aimed at TOF cameras, have been proposed in [8, 11, 12]. These techniques use luminance images as guidance for non-local and cross-bilateral filtering. The authors of [12–14] present a non-local technique for simultaneous denoising and up-sampling of depth images.
In this article, we propose a new method for denoising depth image sequences, taking into account information from the associated luminance sequences. The first novelty is in our motion estimation, which uses information from both imaging modalities and accounts for the spatially varying noise standard deviation. Moreover, we assign a reliability to the estimated motion and adapt the strength of temporal denoising according to this reliability. In particular, we use motion reliabilities derived from both depth and luminance as weighting factors for motion-compensated temporal filtering.
The use of luminance images brings multiple benefits. First, the goal of existing non-local techniques is to find similar observations in other parts of the depth sequence. In this article, however, we look for observations that are similar in both depth and luminance. The underlying idea is to average multiple observations of the same object segments. As luminance images have many more textural features than depth images, the located matches can be of better quality, which improves the denoising. Moreover, the luminance image is less noisy, which facilitates the search for similar blocks. We have confirmed this experimentally by calculating the peak signal-to-noise ratio (PSNR) of depth and luminance measurements, using ground truth images obtained by temporal averaging of 200 static frames. Typically, depth images acquired by the SwissRanger camera have PSNR values of about 34-37 dB, while PSNR values of luminance are about 54-56 dB. Theoretical models from  also confirm that noise variance in depth is larger than noise variance in luminance images.
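The PSNR comparison mentioned above can be sketched as follows. This is a minimal illustration, assuming images stored as nested lists and a caller-supplied peak value; the article does not state which peak value was used for depth or luminance, so `peak` is left as an explicit parameter.

```python
import math


def psnr(reference, noisy, peak):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images.

    `reference` is the (approximately) noise-free image, `noisy` the measurement,
    and `peak` the largest possible pixel value (an assumption of the caller).
    """
    squared_error = 0.0
    count = 0
    for ref_row, noisy_row in zip(reference, noisy):
        for ref_pixel, noisy_pixel in zip(ref_row, noisy_row):
            squared_error += (ref_pixel - noisy_pixel) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

With this definition, a mean squared error of 1 and a peak value of 10 give a PSNR of 20 dB.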
The article is organized as follows: In Section 2, we describe the noise properties of TOF sensors and a method for generating the ground truth sequences, used in our experiments. In Section 3, we describe the proposed method. In Section 4, we compare the proposed method experimentally to various reference methods in terms of visual and numerical quality. Finally, Section 5 concludes the article.
TOF cameras illuminate the scene with infrared light-emitting diodes. The optical power of this modulated light source has to be chosen as a compromise between image quality and eye safety: the larger the optical power, the more photoelectrons per pixel are generated, and hence the higher the signal-to-noise ratio and the accuracy of the range measurements. On the other hand, the power has to be limited to meet safety requirements. Due to the limited optical power, TOF depth images are rather noisy and therefore relatively inaccurate. Equally important is the influence of the different reflectivity of objects in the scene, which reduces the reflected optical power and increases the level of noise in the depth image. Interference can also be caused by external sources of light and by multiple reflections from different surfaces.
where A and B are the amplitude of the reflected signal and its offset, L is the measured distance and ΔL is the uncertainty of the depth measurement due to noise. As the equation shows, the depth uncertainty ΔL, i.e. the noise standard deviation, is inversely proportional to the demodulation amplitude A.
The signal-to-noise ratio of static parts of the scene (w.r.t. the camera) can be significantly improved through temporal filtering. If n successive frames are averaged, the noise variance is reduced by a factor of n, provided that the noise is independent from frame to frame. While this is of limited use in dynamic scenes, we exploit this principle to generate an approximately noise-free reference depth sequence of a static scene captured by a moving camera.
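The variance reduction by temporal averaging can be checked numerically with a small simulation. This is a sketch under simplifying assumptions that are not taken from the article: a flat scene at constant depth, independent zero-mean Gaussian noise with the same standard deviation in every pixel, and hypothetical function names.

```python
import random
import statistics


def average_frames(frames):
    """Pixel-wise mean of a list of equally sized 2D frames."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def simulate(n_frames, sigma=1.0, size=32, seed=0):
    """Return (variance of one noisy frame, variance of the averaged frame)."""
    rng = random.Random(seed)
    true_depth = 5.0  # assumed constant depth of a flat, static scene
    frames = [[[true_depth + rng.gauss(0.0, sigma) for _ in range(size)]
               for _ in range(size)]
              for _ in range(n_frames)]
    averaged = average_frames(frames)
    single_var = statistics.pvariance([p for row in frames[0] for p in row])
    averaged_var = statistics.pvariance([p for row in averaged for p in row])
    return single_var, averaged_var
```

Calling `simulate(16)` should give a variance ratio close to 16, up to the statistical fluctuation of the estimate.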
Each frame in the noise-free sequence is created as follows: the camera is kept static and 200 frames of the static scene are captured and temporally averaged. Then, the camera is moved slightly and the procedure is repeated, resulting in the second frame of the reference depth sequence. The result is an almost noise free sequence, simulating a static scene captured by a moving camera. This way we simulate translational motion of the camera. If the reference "noise-free" depth sequence contains k frames, k × 200 frames should be recorded.
The motion estimation step is followed by the wavelet decomposition step and by motion compensated filtering, which is performed in the wavelet domain, using a variable number of motion hypotheses (depending on their reliability) and data dependent weighted averaging. The weights used for temporal filtering are derived from the motion estimation reliabilities and from the noise standard deviation estimate. The remaining noise is removed using the spatial filter from , which operates in wavelet domain and uses luminance to restore lost details in the corresponding depth image.
The most successful video denoising methods use both temporal and spatial correlation of pixel intensities to suppress noise. Some of these methods are based on finding a number of good predictions for the currently denoised pixel in previous frames. Once found, these temporal predictions, termed motion-compensated hypotheses, are averaged with the current noisy pixel itself to suppress noise.
Our proposed method exploits the temporal redundancy in depth video sequences. It also takes into account that a similar context is more easily located in the luminance than in the depth image.
Each frame F(t) in both the depth and the luminance is divided into 8 × 8 non-overlapping blocks. For each block in the frame F(t), we perform a three-step search algorithm from  within some support region V_{t-1}.
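The block search can be illustrated with a minimal sketch of the classic three-step search. It is not the exact procedure of this article: it uses the sum of absolute differences on a single image as the matching cost, whereas the proposed method combines depth and luminance and accounts for noise. The function names, the initial step of 4 pixels and the boundary handling are assumptions made for illustration.

```python
def block_sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences between a block in `cur` and the block
    displaced by (dy, dx) in `ref`."""
    total = 0.0
    for r in range(size):
        for c in range(size):
            total += abs(cur[top + r][left + c] - ref[top + dy + r][left + dx + c])
    return total


def three_step_search(cur, ref, top, left, size=8, step=4):
    """Test 9 positions around the current best vector, recentre on the
    cheapest one, halve the step, and repeat until the step reaches 1."""
    rows, cols = len(ref), len(ref[0])
    best_dy, best_dx = 0, 0
    best_cost = block_sad(cur, ref, top, left, 0, 0, size)
    while step >= 1:
        centre_dy, centre_dx = best_dy, best_dx
        for ddy in (-step, 0, step):
            for ddx in (-step, 0, step):
                dy, dx = centre_dy + ddy, centre_dx + ddx
                # Skip candidates whose block would leave the reference frame.
                if not (0 <= top + dy and top + dy + size <= rows
                        and 0 <= left + dx and left + dx + size <= cols):
                    continue
                cost = block_sad(cur, ref, top, left, dy, dx, size)
                if cost < best_cost:
                    best_cost, best_dy, best_dx = cost, dy, dx
        step //= 2
    return (best_dy, best_dx), best_cost
```

With steps of 4, 2 and 1 pixels the search covers displacements of up to ±7 pixels while evaluating at most 27 candidates per block, far fewer than an exhaustive search over the same range.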
where dt ≤ N_f. In other words, for each block B_i in the frame F(t) we search for the blocks in the frames F(t - N_f), ..., F(t - 1), F(t + 1), ..., F(t + N_f) which maximize the similarity measure between blocks.
where V is the set of all possible motion vectors, excluding vectors that are previously found as best candidates.
where F(t) and F(t - dt) are the frames containing depth and luminance values for each pixel and v(t) is the motion vector between the frames F(t) and F(t - dt). The conditional probability that models how well the image F(t) can be described by the motion vector v(t) and the image F(t - dt) is denoted by P(F(t)|v(t), F(t - dt)). The prior probability of the motion vector v(t) is denoted by P(v(t)|F(t - dt)). We replace the probability P(F(t)|F(t - dt)) by a constant since it is not a function of the motion vector v(t) and therefore does not affect the maximization process over v.
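The quantities described above fit together through Bayes' rule. The following is a sketch reconstructed from the verbal description only; the hat notation for the selected motion vector is an assumption, and V is the candidate set defined earlier.

```latex
P\bigl(v(t) \mid F(t), F(t-dt)\bigr)
  = \frac{P\bigl(F(t) \mid v(t), F(t-dt)\bigr)\, P\bigl(v(t) \mid F(t-dt)\bigr)}
         {P\bigl(F(t) \mid F(t-dt)\bigr)},
\qquad
\hat{v}(t) = \arg\max_{v \in V}
  P\bigl(F(t) \mid v, F(t-dt)\bigr)\, P\bigl(v \mid F(t-dt)\bigr)
```

Because the denominator does not depend on v, dropping it leaves the maximizer unchanged, which is why it can be replaced by a constant.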
Here, the signal terms are the variances of the depth and luminance blocks and the noise terms are the noise variances in the depth and the luminance images, respectively; l is the vector containing the spatial coordinates of the current block, v is the motion vector of the current block, and F_L and F_D denote the luminance and the depth components of F. Variances of the displaced pixel differences contain two components: one due to the random noise and the other due to the motion compensation error. The variance due to the additive noise is derived from the locally estimated noise standard deviation in the depth image and from the global estimate of the noise standard deviation in the luminance image. The use of the variance as a reliability measure for motion estimation in noise-free sequences was studied in [22, 24].
Therefore, each of the motion hypotheses for the block in the central frame is assigned a reliability measure, which depends on the compensation error and the similarity of the current motion hypothesis to the best motion vectors from its spatial neighbourhood. The reason we introduce these penalties is that the motion compensation error grows with the temporal distance and the amount of texture in the sequence. From the previous equations, it can be concluded that the current motion vector candidate v is not reliable if it is significantly different from all motion vectors in its neighbourhood. Motion compensation errors of motion vectors in uniform areas are usually close to the motion compensation error of the best motion vector in the neighbourhood. However, in the occluded areas, estimated motion vectors have values which are inconsistent with the best motion vectors in their neighbourhood. Therefore, the motion vectors in the occluded areas usually have low a posteriori probabilities and thus low reliabilities.
In this section, we describe a new approach for temporal filtering along the estimated motion trajectories. The strength of the temporal filtering depends on the reliability of estimated motion.
The output of this filter is the temporally filtered version of the depth wavelet band at the location k of the frame that is in the middle of the temporal buffer. Furthermore, s_D(h, t) is the depth wavelet coefficient from the frame F(t) at the location h.
The amount of filtering is controlled through the weighting factors α(t, h), which depend on the reliability of the motion estimation defined in Equation 12. Weighting factors derived from conditional probabilities are also used in  for motion-compensated de-interlacing and in  for distributed video coding purposes. In the ideal case, motion estimation would be performed per wavelet band and reliabilities derived accordingly. Here we use the same motion vectors for all wavelet bands and calculate the reliability for each wavelet band separately, which can be justified by the fact that the underlying motion is the same in all bands.
where s(t) denotes the block of wavelet coefficients in the frame t, s(h, t - dt) denotes the motion hypothesis h in the frame t - dt, and the remaining symbol denotes the set of motion hypotheses for the current block. P(v|s(t), s(h, t - dt)) has the form given in Equation 12.
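The reliability-weighted temporal filtering described above can be sketched for a single wavelet coefficient. This is an illustrative sketch, not the filter of this article: it assumes that the weights are normalised so that they sum to one and that the current coefficient enters with its own weight. The function name and the `current_weight` parameter are hypothetical.

```python
def weighted_temporal_average(current, hypotheses, weights, current_weight=1.0):
    """Blend the current coefficient with its motion-compensated hypotheses.

    `hypotheses` are coefficients taken along the estimated motion trajectories
    and `weights` are their reliabilities; unreliable hypotheses (weight near
    zero) barely influence the result.
    """
    if len(hypotheses) != len(weights):
        raise ValueError("one weight per hypothesis is required")
    numerator = current_weight * current
    denominator = current_weight
    for value, weight in zip(hypotheses, weights):
        numerator += weight * value
        denominator += weight
    return numerator / denominator
```

When all weights are zero the function returns the current coefficient unchanged, which corresponds to switching temporal filtering off where motion estimation is unreliable.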
We estimate the noise level by assuming that the noise standard deviation at the location k is related to the inverse of the signal amplitude as σ_k = c_n / A.
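A per-pixel noise map following this inverse-amplitude model can be sketched as below. The constant c_n must be supplied by the caller (the article does not give its value here), and the lower bound on the amplitude is an added assumption that avoids division by zero in dark pixels.

```python
def local_noise_std(amplitude, c_n, min_amplitude=1e-6):
    """Noise standard deviation sigma_k = c_n / A_k for every pixel of a 2D
    amplitude image. Amplitudes below `min_amplitude` are clamped (assumption)."""
    return [[c_n / max(a, min_amplitude) for a in row] for row in amplitude]
```

Pixels with low amplitude, such as those on dark or distant surfaces, therefore receive a large noise estimate and are filtered more strongly.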
An important novelty is that the number of temporal candidate blocks used for denoising a block in the frame F(t) is variable. Using all the blocks within the support region of size w_s, V_t, t = T - w_s/2, ..., T + w_s/2, for weighted averaging may cause disturbing artefacts, especially in the case of occlusions and scene changes. In these cases, it is not possible to find blocks similar enough to the currently denoised block, which may cause over-smoothing or motion blur of details in the image. To prevent this, we only take into account the blocks whose average differences with the currently denoised block are smaller than some predetermined threshold Dmax.
We relate this maximum distance to the local estimate of the noise at the current location in the depth sequence and the motion reliability. The noise standard deviation in the luminance image is constant for the whole image. Moreover, it is much smaller than the noise standard deviation in the depth image. We found experimentally that a good choice for the maximum difference is . By introducing the local noise standard deviation into threshold Dmax, we are taking into account the fact that even if we find a perfect match of the current block within the previous frame F(t - 1), it will differ from the current block in the frame F(t), due to the noise.
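The selection of a variable number of candidate blocks can be sketched as follows. The threshold is passed in as `d_max` and is not computed here, because its expression is not reproduced in this text; the mean absolute difference used as the block distance is likewise an assumption for illustration.

```python
def mean_abs_difference(block_a, block_b):
    """Average absolute pixel difference between two equally sized 2D blocks."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(block_a, block_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def select_candidates(current_block, candidate_blocks, d_max):
    """Indices of candidates whose average difference to the current block is
    below d_max; the remaining candidates are excluded from the averaging."""
    kept = []
    for index, candidate in enumerate(candidate_blocks):
        if mean_abs_difference(current_block, candidate) < d_max:
            kept.append(index)
    return kept
```

In occluded regions the returned list may be empty, in which case the block is left to the subsequent spatial filter.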
The proposed temporal filtering is also applied on the low pass band of the wavelet decomposition of both sequences, but in a slightly different manner. In the case of the low pass wavelet band, we set the smoothing parameter to the local variance of noise at location l. The value of the smoothing parameter for the low pass wavelet band is less than for high pass wavelet bands, since the low pass band already contains much less noise due to the spatial low pass filtering. In this way, we address the appearance of low-frequency artefacts present in the regions of the sequence that contain less texture.
The amount of noise is significantly reduced after the proposed temporal filter. To suppress the remaining noise, we use our earlier method for the denoising of static depth images .
In this subsection, we analyse the computational complexity of the proposed algorithm. The motion estimation algorithm is performed over 7 depth and luminance frames, in a 24 × 24 pixel search window, on 8 × 8 pixel blocks. The main difference compared to classical gray-scale motion estimation algorithms is that the proposed algorithm calculates similarity metrics in both depth and luminance images, which doubles the number of arithmetical operations. In total, arithmetical operations are needed during the motion estimation step, where N_c = 2 is the number of best motion candidates, N_f = 7 is the number of frames, t is a time instant, N_s = 24 is the size of the search window, N_b is the size of the motion estimation block and N_blocks is the number of blocks in the frame. Then, we perform the wavelet transform and motion-compensated temporal filtering in the wavelet domain. This step requires arithmetical operations in total to calculate filtering weights and additions to perform filtering, where N_t is the total number of candidates which participate in filtering.
Finally, the spatial filtering step requires (4 + (2K + 1)^2)L additions, 6L subtractions, 3L divisions and 4L multiplications per image, where K is the window size and L is the number of image pixels.
Compared to the method of , the number of operations performed in a search step is approximately the same, since we calculate similarity measures using two imaging modalities and choose a set of best candidate blocks, while in  the search is performed twice, using only depth information: the first time on noisy depth pixels and the second time on hard-thresholded depth estimates. Similarly, the proposed motion-compensated filtering does not add much overhead, since filtering weights are calculated during the motion estimation step. In total, the number of operations performed by the proposed algorithm and by the method from  is comparable.
The processing time for the proposed technique was approximately 0.23 s per frame and 0.2 s per frame for  on a system based on Intel Core i3, 2.14 GHz processor with 4 GB RAM. We have implemented the search operation as a Matlab mex-file, while filtering was implemented as a Matlab script. The method of  was implemented as a Matlab mex file.
For the evaluation of the proposed method, we use both real sequences acquired using the Swiss Ranger SR3100 camera  and "noise-free" depth sequences acquired using ZCam 3D camera  with artificially added noise that approximates the characteristics of the TOF sensor noise.
We evaluate the proposed algorithm on two sequences with artificially added noise, namely "Interview" and "Orbit", and three sequences acquired using a Swiss Ranger SR3100 TOF camera. In the proposed approach, we use two levels of the non-decimated wavelet decomposition and Daubechies db4 wavelet.
Average PSNR values of the proposed and the reference algorithms in dB
On the other hand, the proposed method preserves details more effectively (see the details of the face in "Interview" sequence). Furthermore, the surface of the table is much better denoised and closer to the noise free frame than in the case of the reference methods. Similarly, the mask and the objects behind in "Orbit" are much better preserved, while the noise is uniformly removed. The boundaries of the object are also preserved rather well, and do not contain the blocking artefacts as in the case of block-wise non-local temporal denoising. In the other scenario, we set the value of the input noise variance for [10, 27] to the maximum local value of the estimated noise variance. Noise is now thoroughly removed. However, the sharp transitions in the depth image are severely degraded.
As in the previous cases, we compare the proposed method with the method of  for video sequences and with the method of  for denoising point clouds generated using structured light approach. The comparison is performed using objective measures and visually. The PSNR values of the different methods are shown in Figure 10. A visual comparison of the proposed methods is shown in Figure 13. The methods used for comparison [10, 27] take a noise standard deviation as an input parameter. To provide these algorithms with the noise variance estimate, we used the median of residuals noise estimator from . We can see from Figure 10 that the proposed method performs better than methods of [10, 27] in all frames of the sequence. This is clearly visible in Figure 13, especially at the borders of the images, where other methods fail to remove the noise of higher intensity, while the proposed method removes noise in these regions quite successfully. Moreover, the edges of the books on the shelf, small surfaces like chairs and circular object in the shelf are better preserved than when denoised with the reference methods.
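A global noise estimate of the kind supplied to the reference methods can be sketched as below. This assumes that the "median of residuals" estimator refers to the widely used robust median estimator, which divides the median absolute value of the finest-scale wavelet detail coefficients by 0.6745; the article's own reference is not reproduced here, so this identification is an assumption.

```python
import statistics


def mad_noise_std(detail_coefficients):
    """Robust noise standard deviation estimate from detail coefficients.

    For zero-mean Gaussian noise the median of |d| equals 0.6745 * sigma,
    so dividing by 0.6745 recovers sigma while ignoring a few large
    coefficients that belong to edges.
    """
    return statistics.median(abs(d) for d in detail_coefficients) / 0.6745
```

Because this estimator returns a single value per image, it cannot follow the spatial variation of the TOF noise, which is consistent with the behaviour of the reference methods at the image borders described above.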
In this article, we have presented a method for removing spatially variable and signal-dependent noise in depth images acquired using a depth camera based on the time-of-flight principle. The proposed method operates in the wavelet domain and uses multihypothesis motion estimation to perform temporal filtering. One important novelty of the proposed method is that the motion estimation is performed on both depth and luminance sequences in order to improve the accuracy of the estimated motion. Another important novelty is that we use motion estimation reliabilities derived from both the depth and the luminance to derive coefficients for motion-compensated filtering in the wavelet domain. Finally, our temporal noise suppression is locally adaptive, to account for the non-stationary character of the noise in depth sensors.
We have evaluated the proposed algorithm on several depth sequences. The results demonstrate an improvement in this application over some of the best available depth and video sequence denoising algorithms [10, 27].
In future work, we will investigate GPU-based implementation and motion estimation with a variable block size.
This study was funded by FWO through the Project 3G002105.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.