Probabilistic Approaches for RGB-D Video Enhancement and Facial Pose Tracking
SHENG, Lu
A Thesis Submitted in Partial Fulfilment
of the Requirements for the Degree of
Doctor of Philosophy
in
Electronic Engineering
The Chinese University of Hong Kong
October 2016
Dedication

To
my dear wife Shao Jing
&
our beloved parents
Abstract
Abstract of thesis entitled:
Probabilistic Approaches for RGB-D Video Enhancement and Facial Pose
Tracking
Submitted by SHENG, Lu
for the degree of Doctor of Philosophy
at The Chinese University of Hong Kong
Acquiring high-quality and well-defined depth data from real scenes has been a hot
research topic in multimedia and computer vision. With the prevalence of various 3D
computer vision applications, depth data has been used in virtual reality, 3DTV,
free-viewpoint TV, human-computer interaction and robot vision. Conventional passive
acquisition algorithms (e.g., stereoscopic vision, shape-from-X, etc.) mostly assume
that the captured 3D scene is simple and artificial, i.e., under constant lighting
conditions or other constraints, containing only static or slowly moving objects.
Fortunately, depth cameras, e.g., time-of-flight cameras, laser scanners or
structured-light sensors, are able to capture standard-resolution depth maps at video
frame rate, making real-time 3D
natural scene reconstruction, rendering, manipulation and interaction feasible. Nev-
ertheless, artifacts like noise, outliers, depth-missing regions and low resolution deter
direct usage of the raw depth data. Hence, there is an imperative need to develop a
unified and high-quality spatio-temporal depth video enhancement algorithm.
Accompanied by synchronized color videos offered by these sensors, the composed
RGB-D videos provide multi-modal structural features that are shared by both texture
and geometry, enabling effective guidance by texture features to regularize the depth
videos. Furthermore, such guidance and structure-sharing properties between
different kinds of feature maps (e.g., RGB maps versus depth maps) enable a series of
structure-preserving/propagation filters that not only handle depth data but are
also applicable to a much broader range of image/video processing, graphics and
computer vision tasks.
This thesis explores probabilistic approaches for efficient spatio-temporal RGB-D
video enhancement. In addition, probabilistic structure-preserving/propagation filters
for various image and video applications are designed. Moreover, applications based
on RGB-D videos, like 3D facial pose tracking, are effectively treated under the
probabilistic view as well. The depth videos employed in this thesis were captured by
a Kinect version 1 and a low-resolution time-of-flight camera.
The employed probabilistic approaches not only handle the uncertainties, e.g., noise,
outliers and other artifacts, but also enable compact and learnable models that yield
reliable predictions: enhanced depth videos for RGB-D video enhancement, tracking
parameters for rigid facial pose tracking, and face model descriptions for online face
model personalization.
This thesis first demonstrates spatial and temporal depth video enhancement
under the guidance of the synchronized color video. For spatial enhancement, a novel
hybrid strategy is first proposed to simultaneously smooth the depth surface and
preserve the discontinuities by combining joint bilateral filtering with segment-based
surface structure propagation. Secondly, a probabilistic approach is proposed to
accelerate the time-consuming local weighted distribution estimation in the weighted
median/mode filters, based on a novel separable kernel defined by a weighted
combination of a set of probabilistic generative models. It reduces the large number of
filtering operations in conventional algorithms to a small amount, and is also compactly
adaptive to the structure of the input image. This method is not only compatible
with RGB-D video enhancement, but also suitable for various image and video
applications, e.g., detail enhancement, structure extraction and JPEG artifact removal.
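To make the guided-filtering idea concrete, the following is a minimal sketch of a color-guided joint bilateral filter for depth maps, the basic building block of such hybrid spatial strategies. It is an illustrative simplification, not the thesis implementation: the function name, the parameter values and the hole-skipping rule are assumptions, and the segment-based structure propagation step is omitted entirely.

```python
import numpy as np

def joint_bilateral_depth(depth, color, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Smooth a depth map while preserving discontinuities, guided by a
    synchronized color image. Pixels with depth <= 0 are treated as holes
    and excluded from the weighted average (so small holes get filled)."""
    h, w = depth.shape
    out = np.zeros((h, w))
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2.0 * sigma_s**2))  # spatial kernel
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            d = depth[y0:y1, x0:x1]
            c = color[y0:y1, x0:x1]
            s = spatial[y0 - y + radius:y1 - y + radius,
                        x0 - x + radius:x1 - x + radius]
            # range kernel on the guidance color, not on the noisy depth
            rng = np.exp(-((c - color[y, x])**2).sum(-1) / (2.0 * sigma_r**2))
            wgt = s * rng * (d > 0)  # skip missing depth samples
            total = wgt.sum()
            out[y, x] = (wgt * d).sum() / total if total > 0 else 0.0
    return out
```

Because the range weights come from the color image, depth edges that coincide with texture edges are preserved, while flat but noisy regions are averaged.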
For temporal enhancement, an efficient online method is developed by introducing
a probabilistic intermediary that captures the static structure of the captured scene.
By applying a novel variational generative model with respect to the static structure,
the proposed method both maintains long-range temporal consistency in the static
scene and keeps the necessary depth variations in the dynamic content. With added
spatial refinement, it can produce flicker-free and spatially optimized depth videos with
reduced motion blur and depth distortion.
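The temporal idea can be illustrated with a toy per-pixel estimator. The sketch below is a drastic simplification under stated assumptions: it replaces the variational generative mixture model with a Kalman-style running Gaussian per pixel, and the class name, thresholds and reset rule are invented for illustration. Each incoming depth sample is classified against the current static-structure estimate (consistent, in front, or behind), and only consistent or behind samples modify the estimate, so passing foreground objects do not corrupt the static scene.

```python
import numpy as np

class StaticStructureSketch:
    """Toy per-pixel Gaussian estimate of the static scene depth.

    Simplified stand-in for a variational mixture model: state-I
    (consistent) samples refine the estimate, state-F (in front, i.e. a
    dynamic object) samples are ignored, and state-B (behind, i.e. a
    revealed farther surface) samples reset the estimate."""

    def __init__(self, first_frame, sigma0=50.0):
        self.mu = first_frame.astype(np.float64)        # depth estimate
        self.var = np.full(first_frame.shape, sigma0**2)
        self.sigma0 = sigma0

    def update(self, depth, sigma_n=10.0, k=2.0):
        z = (depth - self.mu) / np.sqrt(self.var + sigma_n**2)
        agree = np.abs(z) <= k          # state-I: consistent with structure
        front = z < -k                  # state-F: nearer -> dynamic, ignore
        behind = z > k                  # state-B: farther surface revealed
        # Kalman-style refinement where the sample agrees
        gain = self.var / (self.var + sigma_n**2)
        self.mu = np.where(agree, self.mu + gain * (depth - self.mu), self.mu)
        self.var = np.where(agree, (1.0 - gain) * self.var, self.var)
        # a farther sample means the old estimate was not the static scene
        self.mu = np.where(behind, depth, self.mu)
        self.var = np.where(behind, self.sigma0**2, self.var)
        return np.where(front, 1, np.where(behind, 2, 0))  # per-pixel state
```

Averaging only the consistent samples is what yields long-range temporal consistency without smearing moving content.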
Thirdly, an application is presented that uses RGB-D videos to track the 3D facial
pose with online face model personalization. Its inherent probabilistic model brings
about (1) robust estimation of the tracking parameters, which are less vulnerable in
uncontrolled scenes with heavy occlusions and facial expression variations, and (2)
reliable face model adaptation that avoids interference from occlusions and expression
changes. The experimental results show that the proposed approach is effective and
superior to the state-of-the-art methods.
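For intuition on the rigid part of the tracking, the snippet below sketches the classic weighted Procrustes/Kabsch solution for a rigid pose from 3D point correspondences. It is a generic textbook step, not the thesis's tracker: the per-point weights merely hint at how a probabilistic model can down-weight occluded or deforming facial regions, and the function name and signature are assumptions.

```python
import numpy as np

def rigid_pose(src, dst, weights=None):
    """Weighted least-squares rigid transform (R, t) with dst ~ R @ src + t.

    src, dst: (N, 3) corresponding 3D points; weights: optional (N,) array
    that down-weights unreliable correspondences (e.g. occluded regions)."""
    if weights is None:
        weights = np.ones(len(src))
    w = weights / weights.sum()
    mu_s = (w[:, None] * src).sum(axis=0)              # weighted centroids
    mu_d = (w[:, None] * dst).sum(axis=0)
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

Solving this in closed form per frame, with weights supplied by a probabilistic occlusion model, is one common design for robust rigid pose tracking.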
摘要 (Chinese Abstract)

In recent years, acquiring high-quality, highly detailed depth data from real scenes has become an increasingly active research topic in multimedia and computer vision. With the continued popularity of various 3D computer vision applications, depth data has been widely applied in virtual reality, 3DTV, free-viewpoint TV, human-computer interaction and robot vision. Conventional passive depth acquisition algorithms (e.g., binocular stereo vision, shape-from-X, etc.) mostly assume simple capture conditions, such as constant and uniform lighting, and scenes containing only static or slowly moving objects. Fortunately, depth cameras, such as time-of-flight cameras, laser scanners and structured-light sensors, can record standard-definition depth images in real time, making real-time 3D natural scene reconstruction, rendering, interaction and manipulation feasible. However, the noise and outliers in the measured depth data, the missing depth in certain regions, and the low resolution of the depth images make the raw depth data unsuitable for direct use. Therefore, a unified, high-quality spatio-temporal depth video restoration and enhancement algorithm is urgently needed as a necessary preprocessing step.

When depth videos are paired with synchronized color (RGB) videos, the resulting RGB-D videos provide multi-modal structural features shared by texture and geometry, so texture features can be used to constrain and guide the processing of the depth videos. Moreover, this structural guidance or structure sharing between different feature maps (e.g., between color maps and depth maps) has inspired a series of novel structure-preserving and structure-propagation filters. These filters not only handle depth data but also apply to the much broader fields of image/video processing, computer graphics and computer vision.

This thesis explores probabilistic algorithms for efficient spatio-temporal RGB-D video restoration and enhancement, and designs probabilistic structure-preserving and structure-propagation filters for various image and video applications. In addition, based on RGB-D video signals, the 3D facial pose tracking problem is studied from a probabilistic perspective. The probabilistic methods adopted in this thesis not only describe data uncertainties, such as noise, outliers and other artifacts, but also support compact, learnable models: they provide reliable depth video predictions for RGB-D video enhancement, and effective parameter estimation for head pose tracking and online face modeling.

The thesis first presents spatial- and temporal-domain depth video restoration and enhancement algorithms guided by synchronized color video. For spatial enhancement, a novel hybrid strategy combining joint bilateral filtering with superpixel-based surface structure propagation is proposed to simultaneously smooth depth surfaces and preserve the structure of discontinuities. Then, a probabilistic structure-preserving/propagation filter is proposed, which applies not only to RGB-D video enhancement but also to various image and video applications, such as detail enhancement, structure extraction and JPEG artifact removal. This method accelerates the computationally expensive local weighted distribution estimation in the weighted median and mode filters, based on a novel separable filter kernel defined by a weighted combination of a set of probabilistic generative models. The algorithm greatly reduces the filtering operations required by previous algorithms while compactly adapting to the structural features of the input image.

For temporal enhancement, an online enhancement algorithm is proposed that uses the static structure of the captured scene as a probabilistic intermediary. Through a novel variational generative model of the static structure, the method maintains long-term temporal consistency for the static scene while preserving the necessary depth variations in dynamic content. With added spatial refinement, the method produces flicker-free, spatially optimized depth videos with reduced motion blur and depth distortion.

Third, this thesis describes an application of RGB-D videos: 3D facial pose tracking. The probabilistic model adopted in this application not only makes the robust estimation of the tracking parameters resistant to interference from uncontrolled scenes and heavy occlusions, but also protects online face modeling from distortions caused by occlusions and expression changes. Experimental results show that the proposed algorithm is efficient and superior to the current state-of-the-art methods.
Acknowledgments
First and foremost, I wish to thank my supervisor Prof. King Ngi Ngan for his
encouragement, support and mentorship. He is an accomplished scholar in his field of
image and visual signal processing. Not only did he guide me to think creatively, but
he also provided plenty of innovative ideas that broadened my research horizons. No
achievement during my doctoral study could have been gained without his insightful
supervision. Moreover, his pursuit of perfection always motivates me to move on and
work harder.
My deep gratitude also goes to Prof. Jianfei Cai of the School of Computer
Science and Engineering at Nanyang Technological University (NTU), for his great
guidance and help during my six-month overseas research internship at NTU. He
provided me with a valuable opportunity to improve my research skills and broaden my
research vision. I would also like to thank Prof. Xiaogang Wang, Prof. Thierry Blu,
Prof. Wai Kuen Cham and Prof. Hung Tat Tsui, faculty members of the image
and video processing (IVP) laboratory. Their insightful suggestions and comments
gave me a more thorough understanding of my research topics, and introduced me
to a wealth of advanced knowledge in signal processing, computer vision and
machine learning.
I must express my appreciation to my colleagues in the IVP lab. Thanks go to Songnan
Li, Lin Ma, Wanli Ouyang, Qiang Liu, Qian Zhang, Feng Xue, Cong Zhao, Miaohui
Wang, Ran Shi, Chi Ho Cheung, Yichi Zhang, Tianhao Zhao, Fanzi Wu, Yu Zhang, Kai
Kang, Tong Xiao, Qinglong Han, Wei Li, Hanjie Pan, Xingyu Zeng, Zhisheng Huang,
Cong Zhang and others in IVP lab. I will treasure forever the time with them during
my PhD study. I also need to thank Jie Chen, Di Xu and Teng Deng for their help
both in academic and daily life when I was with Nanyang Technological University.
Last but not least, I am deeply indebted to my family. My most sincere gratitude
goes to my wife, Shao Jing, for her constant love, support, encouragement and
understanding. None of my achievements would have been possible without her. I
want to especially thank my parents for their unconditional love and support over the
past twenty-seven years. Their love motivates me to pursue my dreams with the
strongest resolve.
致謝 (Chinese Acknowledgments)

As my doctoral thesis nears completion, my life as a student is also coming to an end. Setting out from Ningbo, a water town south of the Yangtze, I moved to Hangzhou, "paradise on earth", and then to Hong Kong, the "Pearl of the Orient", a journey of study spanning more than twenty-one years. The five years at The Chinese University of Hong Kong have been short, but I have grown quickly and benefited greatly. Although my research output is not abundant, I have made great strides in research, knowledge and experience. Here, allow me to express my deep respect and heartfelt thanks to the teachers, classmates, friends and family members who have helped, encouraged and supported me.

First, I thank my supervisor, Prof. King Ngi Ngan, for his earnest teaching and careful guidance. Prof. Ngan is broadly learned and deeply accomplished; he not only gave me many key, inspiring suggestions but also helped me avoid many detours. Every achievement and every bit of progress during my doctoral study is inseparable from his tempering and encouragement. Moreover, his rigorous scholarship and meticulous working style have earned my deep admiration and will benefit me for life.

I also thank Prof. Jianfei Cai of the School of Computer Science and Engineering at Nanyang Technological University, as well as Prof. Xiaogang Wang, Prof. Thierry Blu, Prof. Wai Kuen Cham and Prof. Hung Tat Tsui of the Image and Video Processing laboratory at The Chinese University of Hong Kong. Their unique insights and incisive comments on many academic problems helped and inspired me greatly, deepened my understanding of my doctoral research topics, and taught me much cutting-edge knowledge in signal processing, computer vision and machine learning. I especially thank Prof. Jianfei Cai for his guidance and help during my exchange at Nanyang Technological University, which gave me a precious opportunity to improve my academic ability and broaden my horizons.

I also thank the labmates who worked alongside me. I thank my senior colleagues Songnan Li, Lin Ma, Wanli Ouyang, Qiang Liu, Qian Zhang, Feng Xue, Cong Zhao and Miaohui Wang for their guidance, and my junior colleagues Ran Shi, Chi Ho Cheung, Yichi Zhang, Tianhao Zhao, Fanzi Wu and Yu Zhang for their support. I also thank Kai Kang, Tong Xiao, Qinglong Han, Wei Li, Hanjie Pan, Xingyu Zeng, Zhisheng Huang, Cong Zhang and the other members of the image processing laboratory for their help and care, as well as Jie Chen, Di Xu and Teng Deng for their hospitality and help during my exchange at Nanyang Technological University. The time spent doing research and studying with them was fulfilling and happy, and gave me beautiful and unforgettable memories. Thank you for your company along the way; I wish you all bright futures and happy lives.

Finally, let me offer my most sincere thanks and gratitude to my family, my solid support and source of strength. My parents' diligent nurturing and earnest expectations have been a constant pillar throughout my long years of study. My wife, Shao Jing, who is also my labmate, always helps me when I am in difficulty and encourages me when I am weak; while taking care of our life she also discusses research problems with me, and all my achievements owe a share to her. Thank you for giving me the strength and courage to face difficulties and meet challenges.
Publications
Journal Papers
• Lu Sheng and King Ngi Ngan, “Weighted Structural Prior for Structure-preserving
Image and Video Applications”, IEEE Transactions on Image Processing (TIP),
U.S.A., in preparation.
• Lu Sheng, Jianfei Cai and King Ngi Ngan, “A Generative Model for Robust 3D
Facial Pose Tracking”, IEEE Transactions on Image Processing (TIP), U.S.A.,
in preparation.
• Lu Sheng, King Ngi Ngan, Chern-Loon Lim and Songnan Li, “Online Tempo-
rally Consistent Indoor Depth Video Enhancement via Static Structure”, IEEE
Transactions on Image Processing (TIP), U.S.A., vol. 24, no. 7, pp. 2197-2211,
Jul. 2015.
• Songnan Li, King Ngi Ngan, Raveendran Paramesran and Lu Sheng, “Real-
time Head Pose Tracking with Online Face Template Reconstruction”, IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), U.S.A.,
accepted.
Conference Papers
• Lu Sheng, Tak-Wai Hui and King Ngi Ngan, “Accelerating the Distribution Es-
timation for the Weighted Median/Mode Filters”, In Asian Conference on Com-
puter Vision (ACCV), Poster, Singapore, Nov. 1-5, 2014.
• Lu Sheng, Songnan Li and King Ngi Ngan, “Temporal Depth Video Enhance-
ment Based On Intrinsic Static Structure”, In IEEE International Conference on
Image Processing (ICIP), Oral, Paris, France, Oct. 27-30, 2014.
• Lu Sheng, King Ngi Ngan and Songnan Li, “Depth Enhancement Based On
Hybrid Geometric Hole Filling Strategy”, In IEEE International Conference on
Image Processing (ICIP), Poster, Melbourne, Australia, Sep. 15-18, 2013.
• Chi Ho Cheung, Lu Sheng and King Ngi Ngan, “A Disocclusion Filling Method
Using Multiple Sprites with Depth for Virtual View Synthesis”, In IEEE Interna-
tional Conference on Multimedia and Expo Workshop (ICMEW), Oral, Turin,
Italy, Jun. 29 - Jul. 3, 2015.
• Songnan Li, King Ngi Ngan and Lu Sheng, “Screen-camera Calibration Us-
ing a Thread”, In IEEE International Conference on Image Processing (ICIP),
Poster, Paris, France, Oct. 27-30, 2014.
• Songnan Li, King Ngi Ngan and Lu Sheng, “A Head Pose Tracking System Us-
ing RGB-D Camera”, In International Conference on Computer Vision Systems
(ICVS), Oral, St. Petersburg, Russia, Jul. 16-18, 2013.
Declaration
I hereby declare that this thesis was composed by myself and that its contents have
not been submitted to this or any other university for a degree. The material of some
chapters has been published in the following conferences or journals:
• Chapter 2:
– Lu Sheng, King Ngi Ngan and Songnan Li, “Depth Enhancement Based On
Hybrid Geometric Hole Filling Strategy”, In IEEE International Conference
on Image Processing (ICIP), Melbourne, Australia, Sep. 15-18, 2013.
• Chapter 3:
– Lu Sheng, Tak-Wai Hui and King Ngi Ngan, “Accelerating the Distribution
Estimation for the Weighted Median/Mode Filters”, In Asian Conference on
Computer Vision (ACCV), Singapore, Nov. 1-5, 2014.
• Chapter 4:
– Lu Sheng, Songnan Li and King Ngi Ngan, “Temporal Depth Video En-
hancement Based On Intrinsic Static Structure”, In IEEE International
Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.
– Lu Sheng, King Ngi Ngan, Chern-Loon Lim and Songnan Li, “Online Tem-
porally Consistent Indoor Depth Video Enhancement via Static Structure”,
IEEE Transactions on Image Processing (TIP), U.S.A., vol. 24, no. 7, pp.
2197-2211, Jul. 2015.
Contents

Dedication
Abstract
Acknowledgments
Publications
Declaration
Contents
List of Figures
List of Tables

1 Introduction and Background
1.1 RGB-D Video Enhancement
1.1.1 RGB-D Spatial Enhancement
1.1.2 RGB-D Temporal Enhancement
1.2 RGB-D Video Applications
1.3 The Probabilistic Models
1.4 Thesis Contributions
1.5 Outline

2 Hybrid Geometric Hole Filling Strategy for Spatial Enhancement
2.1 Introduction
2.2 Related Work
2.3 Proposed Method
2.3.1 Unreliable Region Detection and Invalidation
2.3.2 Hybrid Strategy of Geometric Hole Filling
2.4 Experiments
2.5 Summary

3 Weighted Structure Filters Based on Parametric Structural Decomposition
3.1 Introduction
3.2 Related Work
3.3 Motivation and Background
3.3.1 Non-parametric Representations of Local Image Statistics
3.3.2 Correlations across Local Structures
3.3.3 Complexity of the Local Statistics Estimation
3.4 Accelerating the Distribution Estimation
3.4.1 Kernel Definition
3.4.2 Probability Distribution Approximation
3.4.3 Gaussian Model for the Proposed Kernel
3.5 Accelerated Weighted Filters
3.5.1 Weighted Average Filter
3.5.2 Weighted Median Filter
3.5.3 Weighted Mode Filter
3.6 Experimental Results and Discussions
3.6.1 Implementation Notes
3.6.2 Performance Evaluation
3.6.3 Applications
3.7 Summary

4 Temporal Enhancement based on Static Structure
4.1 Introduction
4.2 Related Work
4.3 Approach
4.3.1 A Probabilistic Generative Mixture Model
4.3.2 Variational Approximation
4.3.3 Improvement with Color Video
4.3.4 Layer Assignment
4.3.5 Online Static Structure Update Scheme
4.3.6 Temporally Consistent Depth Video Enhancement
4.4 Experiments and Discussions
4.4.1 Numerical Evaluation of the Static Structure Estimation by Synthesized Data
4.4.2 Evaluation of the Static Structure Estimation by Real Data
4.4.3 Temporally Consistent Depth Video Enhancement
4.5 Limitations and Applications
4.5.1 Limitations
4.5.2 Applications
4.6 Summary

5 A Generative Model for Robust 3D Facial Pose Tracking
5.1 Introduction
5.2 Related Work
5.3 Probabilistic 3D Face Parameterization
5.3.1 Multilinear Face Model
5.3.2 A Statistical Prior
5.4 Probabilistic Facial Pose Tracking
5.4.1 Robust Facial Pose Tracking
5.4.2 Online Identity Adaptation
5.5 Experiments and Discussions
5.5.1 Datasets and System Setup
5.5.2 Quantitative and Qualitative Evaluations
5.5.3 Limitations
5.6 Summary

6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work

A Approximation for the Gaussian Kernel

B Generative Model for Static Structure
B.1 Probabilistic Generative Mixture Model
B.1.1 Likelihood
B.1.2 Prior Distributions
B.1.3 Joint Distribution
B.1.4 Data Evidence
B.1.5 Posteriors with First-order Markov Chain
B.2 Derivations of the Results in Variational Approximation
B.2.1 Approximated Joint Distributions
B.2.2 Approximated Data Evidence for the Observation
B.2.3 Parameter Updating for the Approximated Static Structure
B.2.4 Parameter Updating for the Approximated State Frequencies
B.2.5 Approximated Posterior for the State Frequencies

C The Choice of Depth Noise Standard Deviation
C.1 Depth Map from Stereo or Kinect
C.2 Depth Map from Other Sources

Bibliography
List of Figures
1.1 (a)-(b) Illustration of RGB-D image pairs. (c) Texture-rendered point clouds. Data captured from Kinect.
1.2 Applications based on RGB-D data.
1.3 Spatial distortions in raw depth images from Kinect version 1. (a)-(b) Raw RGB-D image pairs. (c) Depth mesh generated from the raw depth image, illustrating the noise and outliers. (d) Depth holes from various sources. The blue box indicates depth holes from occlusions, while the green box shows depth holes from light reflection and absorption.
1.4 Temporal distortions in raw and spatially enhanced depth videos. The videos were captured by Kinect version 1. (a) Raw depth videos suffer from temporal flickering due to inconsistent noise, outliers and depth holes. (b) Spatially enhanced depth videos still contain temporal artifacts from blurs around object boundaries and inconsistent spatial filtering operations between neighboring frames.
2.1 Framework of the proposed method.
2.2 Align the depth map to the color image coordinates and then partition the hole region into Ω_s and Ω_f. The test depth map comes from the Middlebury dataset.
2.3 Illustration of the patch matching process. The left image is the segmented color image; the right one is a close-up of the local region marked blue in the left image. P_u is the query patch and P_v is in the candidate patch set. A detailed description is given in the text.
2.4 Middlebury datasets employed for the experimental comparisons. Test scenes (from left to right) are Reindeer, Midd2 and Teddy.
2.5 Visual comparison on the Middlebury datasets. From top to bottom: color images, results by [1], [2] and the proposed method. Test scenes (from left to right) are Reindeer, Midd2 and Teddy.
3.1 Illustration of correlations among structures in local patches. (a) The sample image. Four patches A, B, C and D were selected from the area in the black box. (b) The histograms of the four patches, fitted by kernel regression. The revealed modes indicate the local structures, labeled #1 to #4. (c) The locations of these structures in each patch. The structures vary slowly in a local neighborhood and are shared among the patches.
3.2 Illustration of the proposed kernel. (a) A 1D signal and two pixels x and y. (b) The construction of κ(f_x, f_y), where the mean values of the three models are shown in three different colors. The kernel measures the similarity of f_x and f_y by evaluating the sum of their joint likelihoods w.r.t. each model.
3.3 Locally adaptive models (LAM) vs. uniformly quantized models (UQM). A 1D signal, extracted from the gray-scale image shown in the left column and marked in orange, is represented by both the LAM and UQM models (L = 3), shown in the right column. The top row uses the UQM models; the bottom row uses the LAM models. The LAM models adapt to the local structures and represent the signal better with a limited number of models (e.g., L = 3).
3.4 h(x, g) and ĥ(x, g) of patches C and D (from the image shown in Figure 3.1) under different conditions. The window size is |N(x)| = 11 × 11 and only the spatial weights are exploited. (a) h(x, g) estimated by the smoothed local histogram [3] under different data variances σ_n^2, with σ_n = 10^-1, 10^-2 and 10^-3. (b) ĥ(x, g) estimated by the proposed kernel under the same data variances as in (a). (c) ĥ(x, g) estimated under different numbers of models L, with the data variance fixed at σ_n = 10^-2. The y-axis is rescaled to show the subtle differences between the curves.
3.5 Execution time comparison for the distribution construction w.r.t. the number of models. The input is an 8-bit single-channel image and the guidance is a 3-channel image. The reference method is brute force and traverses 256 discretized bins.
3.6 The distribution of the number of necessary locally adaptive models in the BSDS300 dataset. Left: the window size is 21 × 21. Right: the window size is 11 × 11. The smaller the window size, the fewer locally adaptive models are necessary.
3.7 Depth map enhancement on tsukuba. The first row shows, from left to right, the raw input disparity map, the ground truth, and results by CT-median [4] and BF-mode [5]. Disparity maps in the 2nd and 3rd rows were obtained by the proposed weighted median filter and weighted mode filter under different numbers of models, generated by the LAM models. The error was evaluated as the bad pixel ratio with threshold 1. GF weights were chosen and the related parameters were fairly configured.
3.8 Results of the weighted mode filter with 7 models.
3.9 JPEG compression artifact removal by the weighted median filter. (a) The input degraded eyes image. (b) CT-median [4]. (c) The proposed weighted median filter with the LAM models and (d) with the UQM models. The second row shows the corresponding zoomed-in patches. The DF weights were chosen and all related parameters were fairly configured. Best viewed in the electronic version.
3.10 Detail enhancement by the proposed weighted median filter under the LAM models. From left to right: the original rock image, the result after edge-preserving smoothing, and the detail-enhanced image. GF weights were chosen.
3.11 Joint depth map upsampling. The input disparity map was 8× upsampled by the proposed weighted median filter and weighted mode filter under the LAM models. The raw input disparity map is shown in the top-left corner of the leftmost image. GF weights were chosen.
4.1 Illustration of the static structure in comparison with the input depth frame. (a) The input depth frame (blue curve) lies on the captured scene; (b) the static structure (black curve). The depth sensor is above the captured scene. The static structure includes the static objects as well as the static background.
4.2 Flowchart of the overall framework of the proposed method for static structure estimation and depth video enhancement. Please refer to the text for a detailed description.
4.3 Illustration of the three states of input depth measurements with respect to the static structure on one line of sight. The current static structure is the blue stick in the middle. Decision boundaries are marked as blue dotted lines. The depth measurement d is categorized as state-I when it lies around the static structure, as state-F when it is in front of the structure, and as state-B when it is far behind the static structure.
4.4 Variational approximation of the parameter set of the static structure for a 1D depth sequence. The number of frames is T = 500. (a) The expected depth sequence of the static structure versus the raw depth sequence, where the ideal Z_x = 50. (b) The confidence interval of Z_x^t; the interval is centered at μ_x^t and lies between μ_x^t ± 2σ_x^t with 95% confidence. (c) The evolution of the portions of the three states (defined by the expected value of ω_x at frame t, denoted by [ω_x^{I,t}, ω_x^{F,t}, ω_x^{B,t}]). The ideal portions are ω_x = [0.89, 0.1, 0.01]. (d) The estimated distribution q^T(d_x | P_x^{D,T}) versus the normalized histogram estimated from D_x^T when T = 500. The estimated depth of the static structure approaches the ideal value with only a few samples. Its confidence interval shrinks rapidly, which means the uncertainty is reduced very quickly. The portion of each state evolves with the raw depth sequence, and the portions match their ideal values given enough depth samples. When T = 500, the estimated data distribution fits the data histogram compactly.
4.5 A toy example illustrating the layer assignment. The cyan dotted line indicates the currently estimated depth structure of the static structure, and the red solid line is from the input depth frame. If color frames are available, they provide additional constraints to regularize the assignment, where the upper line corresponds to the currently estimated texture structure of the static structure, and the lower one refers to the input color frame.
4.6 Sample frames of the input depth video with two types of noise and outliers. (a) The sample color frame; (b) and (c) the contaminated depth frames with σ_n = 2 and ω_n = 10^-2, where (b) is type-I and (c) is type-II. Type-II error is worse than type-I error with the same parameters.
4.7 RMSE maps with varying u and σ under different noise and outlier parameter pairs (ω_n, σ_n). (a)-(c) were contaminated by type-I, while (d)-(f) were contaminated by type-II.
4.8 Performance comparison between the constant and depth-dependent ξ_x under different type-II noise and outlier parameter pairs (ω_n, σ_n). The red curve is by the depth-dependent ξ_x, and the blue curve by the constant ξ_x. Each curve is obtained at its own optimal parameter pair (u, σ), as shown in the legends.
4.9 Comparison with other methods on static structure estimation of the synthetic static scenes. Three levels of noise and outlier parameter pairs (ω_n, σ_n) were tested. (a), (c) and (e) are of type-I; (b), (d) and (f) are of type-II. The x-axis marks the frame order and the y-axis is the RMSE score.
4.10 Visual evaluation on real indoor static scenes. (a) is the result of a
real indoor scene Indoor Scene 1. The first row shows the raw depth
sequences and color sequences. The second row is the selected results
of the estimated static structures without spatial enhancement at frame
t = 0, 5, 10 respectively. The third row shows corresponding spatially en-
hanced static structure without texture information, while the last row
exhibits the results with the guidance of texture information. The yellow
color in the second row marks missed depth values (holes). Gray rep-
resents depth value, lighter meaning a nearer distance from the camera.
Best viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.11 Visual evaluation on real indoor static scenes. (b) shows the results of a
real indoor scene Indoor Scene 2. The first row shows the raw depth
sequences and color sequences. The second row is the selected results
of the estimated static structures without spatial enhancement at frame
t = 0, 5, 10 respectively. The third row shows corresponding spatially en-
hanced static structure without texture information, while the last row
exhibits the results with the guidance of texture information. The yellow
color in the second row marks missing depth values (holes). Gray represents
depth value, with lighter meaning a nearer distance from the camera.
Best viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.12 Reliability maps of two test sequences of indoor static scenes. . . . . . . 69
4.13 Static structure estimation on dyn kinect tl. (a) and (b) are the first
five frames of the input sequence. (c) shows the layer assignment results.
Red, green, blue denote l_iss, l_dyn, l_occ, respectively. (d) represents the
depth map of the static structure, and (e) shows the corresponding color
map. The first frame is for initialization. . . . . . . . . . . . . . . . . . . 70
4.14 Static structure estimation on dyn tof tl. (a) shows the first five frames of
the input sequence. (b) shows the layer assignment results. Red, green,
blue denote l_iss, l_dyn, l_occ, respectively. (c) represents the depth map of
the static structure. The first frame is for initialization. . . . . . . . . . 71
4.15 Comparison on depth video enhancement. (a) and (b) are selected
frames from the test RGB-D video sequences. From left to right: the
113th, 133rd, 153rd, 173rd, 193rd and 213th frames. (c) shows the results by
CSTF [1], (d) by WMF [5], and (e) by Lang et al. [6]. (f) is generated
by the proposed method. (g) compares the performances among these
methods in the enlarged sub-regions (shown in raster-scan order). Best
viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.16 Comparison on depth video enhancement. (a) are selected frames from
an RGB-D video sequence dyn kinect 2. From top to bottom: the RGB
frames, the raw depth frames, results by Lang et al. [6] and results by
the proposed method. Best viewed in color. . . . . . . . . . . . . . . . . 73
4.17 Comparison on depth video enhancement. (b) are selected frames from
an RGB-D video sequence dyn kinect 3. From top to bottom: the RGB
frames, the raw depth frames, results by Lang et al. [6] and results by
the proposed method. Best viewed in color. . . . . . . . . . . . . . . . . 74
4.18 Failure cases of the proposed method. (a) and (b) are two representa-
tive results. From left to right: color frame, raw depth frame and the
enhanced depth frame. Artifacts are bounded by the red dot boxes. . . 76
4.19 Examples of the background subtraction. Best viewed in color. . . . . . 77
4.20 Examples of the novel view synthesis. (a) and (b) are the input RGB and
depth frames. (c) is the enhanced depth frame by the proposed method.
(d) is the synthesized view from the raw depth frame and the RGB frame.
Image holes in (d) are filled by the static structure, as shown in (e). (f) is
the synthesized view based on the enhanced depth frame and the image
holes are also filled by the estimated static structure. Best viewed in color. 78
5.1 Sample face meshes in the FaceWarehouse dataset. This dataset contains
face meshes from a comprehensive set of expressions and a variety of
identities including different ages, genders and races. . . . . . . . . . . . 83
5.2 Illustration of the generic multilinear face model trained by the Face-
Warehouse dataset [7]. (a) The mean face f . (b) Illustration of per-
vertex shape variation caused jointly by wid and wexp. (c)–(d) Illustra-
tion of per-vertex shape variation with respect to wid and wexp, respec-
tively. The shape variation is represented as the standard deviation of
the marginalized per-vertex distribution. The shape variations in (b)–
(d) are overlaid on the same neutral face model μ_M. Best viewed in
electronic version. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 System overview. We propose a unified probabilistic framework for ro-
bust facial pose estimation and online identity adaption. In both threads,
the generative face model acts as the key intermediate and it is updated
immediately with the feedback of the identity adaptation. The input
data is the depth map, while the outputs are the rigid pose parameter θ^(t)
and the updated face identity parameters {μ_id^(t), Σ_id^(t)} that encode the
identity distribution p^(t)(w_id). . . . . . . . . . . . . . . . . . . . . . . . 87
5.4 Samples of the occluded faces. The occlusions are caused by multiple
factors. For instance, the face is occluded by itself, or the face is occluded
by other objects like hair, accessories, hands, etc. . . . . . . . . . . . . . 89
5.5 Illustration of the ray visibility constraint. A profiled face model and a
curve in the surface of the input point cloud are presented in front of
a depth camera. Three cases are presented. (a) Case-I: a partial face
region is fitted to the input point cloud, while the rest facial regions are
occluded. (b) Case-II: the face model is completely occluded. (c) Case-
III: a part of face region is visible and in front of the point cloud, and
the rest face regions are occluded. Best viewed in electronic version. . . 91
5.6 Examples of the proposed rigid pose estimation. (a) and (b) are the
color images and the corresponding point clouds. (c) shows the initial
alignment provided by the head detection method [8], and (d) visualizes
the proposed rigid pose estimation results. Notice that only the generic
face model is applied. It robustly estimates difficult face poses from
partial scans with heavy occlusions by hands and hair, as well as
profiled faces with strong self-occlusions. Best viewed in electronic version. 93
5.7 Comparison of the rigid pose estimation methods. (a) and (b) show the
color image and its corresponding point cloud. (c) depicts two views of the
initial alignment between the generic face model and the point cloud. (d)
visualizes the result by ICP [9], and (e) reports the result of maximizing
the likelihood modeled by the ray visibility constraint (RVC). (f) is
the proposed recursive method for minimizing the ray visibility
score (RVS), and (g) is the RVS method augmented by particle swarm
optimization (RVS+PSO). Refer to the text for details, and notice that only
the generic face model is applied. Best viewed in electronic version. . . . 95
5.8 Examples of face model adaptation. The proposed method can success-
fully personalize the face model to identities of different genders and
races. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.9 We continuously adapt the identities of the face model to different users.
(a)-(c) are two examples showing that the face model can be gradually
personalized when the facial depth data from different poses are captured
during the tracking process. The face model is initialized with the generic
face model as shown in Figure 5.2. . . . . . . . . . . . . . . . . . . . . . 98
5.10 Tracking results on the Biwi dataset with the personalized face mod-
els. Our system is robust to profiled faces due to large rotations and
occlusions from hair and accessories. The first and second rows show the
corresponding color and depth image pairs. The third row visualizes the
extracted point clouds of the head regions and the overlaid personalized
face models. Best viewed in electronic version. . . . . . . . . . . . . . . 100
5.11 Tracking results on the ICT-3DHP dataset. The proposed system is also
robust to expression variations. Best viewed in electronic version. . 101
5.12 The proposed system can automatically adapt a face model from one
identity to another. Top: Three identities are presented successively
in three adjacent frames. Bottom: The tracked face models, adapted
to the current identity. Note the differences in head and
nose shapes among the visualized face models. . . . . . . . . . . . . . . . 103
List of Tables
2.1 Comparison of bad pixel rate (%) . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Comparison of mean absolute difference . . . . . . . . . . . . . . . . . . 23
4.1 Per-frame running time comparison (MATLAB platform) . . . . . . . . 67
5.1 Summary of facial pose datasets . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Evaluations on Biwi dataset . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3 Evaluations on ICT-3DHP dataset . . . . . . . . . . . . . . . . . . . . . 104
Chapter 1
Introduction and Background
With the prevalence of various three-dimensional applications in manufacturing and
the entertainment industry, automatically acquiring dense and high-quality depth data
from the real world scenarios has been an essential requirement in 3D reconstruc-
tion, virtual reality and augmented reality (VR/AR), 3D and free-viewpoint televisions
(3DTV and FTV), human-computer interaction (HCI), robot vision, as well as a host
of high-level 3D learning tasks like 3D object/scene understanding and analysis.
Unlike most research in computer graphics, which relies on synthesizing scene-level
or object-level depth data, the computer vision community focuses on measuring
depth data from the real world. Recently, a variety of systems have been proposed
to obtain depth information of a real scene, from passive stereo vision and shape-from-
X systems to active sensors like real-time structured-light depth sensors (e.g., Kinect),
Time-of-Flight (ToF) cameras or laser scanners. The passive systems mostly require
simple or artificial environmental conditions (i.e., constant lighting, simple backgrounds,
static or slowly-moving objects, etc.) in the captured scenes, so as to keep their
performance as stable as possible. Fortunately, recent commodity active depth cameras
are able to capture standard-resolution depth maps at video frame rates, making
low-cost, real-time 3D applications possible.
Even though the depth data acquired by recent commodity depth sensors are of
low quality, they provide a more convenient and explicit way to model and understand
the geometric structures of the 3D world than the implicit inferences from the 2D
texture information offered by the RGB images and videos. A lot of 3D image/video
processing and computer vision tasks benefit from the usage of these depth sensors. To
name a few, 3DTV and FTV adopt “RGB + depth” video pairs from either dense or
sparse viewpoints to seamlessly render immersive and visually plausible novel-viewpoint
(a) RGB image (b) Depth image (c) Point cloud with texture
Figure 1.1: (a)-(b) Illustration of an RGB-D image pair. (c) Texture-rendered point cloud. Data captured by Kinect.
videos. VR/AR and HCI employ the streaming depth data to determine the user’s
head pose, facial expression, body pose and actions in real time. The 3D geometrical
data also give the researchers a new modality of cues in addition to the conventional
2D texture patterns, enabling a more thorough high-level analysis and understanding
of the 3D real world both from the viewpoints of appearance and geometry. As for the
field of robot vision, one example is the simultaneous localization and mapping (SLAM)
algorithms that explicitly utilize the point clouds from the depth sensors mounted on
the robots to concurrently reconstruct the 3D layouts of the scanned scenes and localize
the trajectories of the robots. As a consequence, the introduction of depth information
makes many tasks that were once difficult or intractable with texture information
alone much easier and far more accessible.
However, despite the advantages listed above, RGB-D video enhancement is
still an urgent issue, since the poor quality of the captured depth data, for example
from Kinect version 1, impedes depth-based tasks from reaching their full
potential. Moreover, depth data accompanied by texture information demands
treatment compatible with its 3D geometrical properties, rather than the conventional
methods designed particularly for texture patterns.
It means that methods dedicated to the depth data are necessary and essential for
3D image and video processing, as well as various 3D computer vision applications.
Therefore, on one hand, this thesis aims to propose reliable solutions for RGB-D video
enhancement for Kinect version 1, as a faithful preprocessing for various 3D applica-
tions. On the other hand, taking the 3D facial pose tracking as an example, this thesis
explores novel depth-based techniques to model 3D geometrical relationships and re-
construct 3D structures in an online fashion. This thesis unifies the tools for all these tasks
(a) Immersive 3DTV and FTV (b) 3D facial expression reenactment (c) RGBD SLAM
(d) 3D facial pose estimation (e) 3D body pose estimation
Figure 1.2: Applications based on RGB-D data.
based on parametric generative models, which are not only effective for modeling these
problems, with reliable uncertainty (or noise) compensation and faithful 2D/3D struc-
ture and motion prediction, but also computationally efficient enough for real-time
performance.
1.1 RGB-D Video Enhancement
Most commodity depth sensors offer only low-quality depth data and usually suffer
from various systematic distortions, depending on the mechanisms behind them.
The spatial distortions of depth videos can be roughly classified into three categories:
• Noise and outliers. For Kinect and other structured-light sensors, noise usually
comes from quantization errors in the disparity-to-depth conversion [10]. Outliers,
on the other hand, stem from strong light reflection off non-Lambertian
materials, light attenuation by light-absorbing materials, or interference across
multiple depth sensors or from ambient light. For ToF sensors, noise
and outliers usually result from differing light absorption across materials. For
both types of depth sensors, complex geometrical structures often produce un-
stable outliers, since depth measurements around the discontinuities between
distinct structures are erratic.
(a) RGB image (b) Depth image
(c) Noise and outliers (d) Depth holes from various sources
Figure 1.3: Spatial distortions in raw depth images from Kinect version 1. (a)-(b) Raw RGB-D image pair. (c) Depth mesh generated from the raw depth image, illustrating the noise and outliers. (d) Depth holes from various sources. The blue box indicates depth holes from occlusions, while the green box shows depth holes from light reflection and absorption.
• Holes without depth measurements. For the structured-light sensor, a part of
the depth holes is caused by occlusions. Similarly, light attenuation and
reflection also lead to depth holes without reliable depth measurements. Another
kind of hole occurs when the captured objects or scenes are out of the effective
range of the depth sensor.
• Low resolution. Although various types of depth sensors keep coming to
market with increasing resolutions, most still cannot compete
with commodity web cameras (usually 1920×1080 or larger). For example, Kinect
version 1 (structured-light) offers 320×240 pixels, Kinect version 2 (time-of-
flight) offers 512×424 pixels, and a popular ToF camera like the Swiss Ranger
only has 176×144 pixels.
• RGB-D misalignment. For a unified framework over both the depth video and
its synchronized RGB video, the misalignment errors between the depth and
RGB frames are an extra kind of spatial distortion that is critical for reliable
RGB-D video based tasks. It is more severe if there is resolution incompatibility
(a) Raw depth video
(b) Spatially enhanced depth video
Figure 1.4: Temporal distortions in raw and spatially enhanced depth videos. The videos were captured by Kinect version 1. (a) Raw depth videos suffer from temporal flickering due to inconsistent noise, outliers and depth holes. (b) Spatially enhanced depth videos still contain temporal artifacts from blurs around object boundaries and inconsistent spatial filtering between neighboring frames.
between the depth and RGB videos.
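As a rough illustration of the quantization noise mentioned in the first item, the disparity-to-depth conversion Z = f·b/d implies that a fixed disparity step produces a depth error that grows roughly quadratically with depth. The sketch below uses made-up calibration values; f, b and the disparity step are assumptions for illustration, not actual Kinect calibration data:

```python
import numpy as np

# Hypothetical structured-light calibration (assumed values, not from a real sensor).
F_PIXELS = 580.0        # focal length in pixels
BASELINE_M = 0.075      # projector-camera baseline in metres
DISPARITY_STEP = 0.125  # disparity quantization step in pixels

def quantized_depth(true_depth_m):
    """Round the disparity to the sensor's step size, then convert back to depth.

    Depth and disparity are related by Z = f * b / d, so a fixed disparity
    step yields a depth error growing roughly quadratically with Z.
    """
    true_depth_m = np.asarray(true_depth_m, dtype=np.float64)
    disparity = F_PIXELS * BASELINE_M / true_depth_m
    disparity_q = np.round(disparity / DISPARITY_STEP) * DISPARITY_STEP
    return F_PIXELS * BASELINE_M / disparity_q

def worst_case_error(depth_m):
    """First-order bound on the quantization error at depth Z: Z^2 * step / (f * b)."""
    return depth_m ** 2 * DISPARITY_STEP / (F_PIXELS * BASELINE_M)
```

This quadratic growth is why far-away surfaces in raw Kinect depth maps appear visibly banded while near surfaces look smooth.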
Apart from the spatial distortions, the temporal inconsistency problem is another
type of distortion that occurs in raw depth videos. Not only do the noise, outliers and
depth holes in adjacent frames introduce severe temporal flickering artifacts, but
the spatial enhancement applied to each single depth image also aggravates the
inconsistency between neighboring frames.
These shortcomings make it difficult to use the raw depth of RGB-D videos directly.
1.1.1 RGB-D Spatial Enhancement
To tackle these limitations, the spatial enhancement of depth videos has received
extensive research effort. A pioneering work in this field was done by Diebel et al. [11]. They
modeled the enhancement problem as a pixel-wise Markov Random Field (MRF) guided
by the RGB image with the assumptions that
• structure and texture discontinuities are co-aligned in the color and depth images;
• pixels with similar texture patterns have similar geometrical structures.
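The two assumptions above translate into a quadratic MRF energy with a data term on observed depths and a color-weighted smoothness term. The following is a minimal sketch; the parameter values, the scalar guide image, and the plain Jacobi-style solver are illustrative choices, not Diebel et al.'s exact formulation:

```python
import numpy as np

def mrf_depth_enhance(depth, color, mask, lam=0.5, sigma_c=10.0, iters=200):
    """Minimize a quadratic MRF energy (illustrative parameters):

        E(D) = sum_i m_i (D_i - Z_i)^2 + lam * sum_{i~j} w_ij (D_i - D_j)^2,

    where w_ij = exp(-|I_i - I_j|^2 / (2 sigma_c^2)) ties the smoothness
    strength to color similarity, implementing the two assumptions above.
    m_i = 0 marks depth holes, so they are filled purely by propagation.
    Solved by simple Jacobi sweeps; a real system would use a sparse solver.
    """
    H, W = depth.shape
    D = np.where(mask, depth, 0.0).astype(np.float64)
    for _ in range(iters):
        D_new = np.zeros_like(D)
        for y in range(H):
            for x in range(W):
                num = mask[y, x] * depth[y, x]   # data term (zero in holes)
                den = float(mask[y, x])
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        w = np.exp(-(color[y, x] - color[ny, nx]) ** 2
                                   / (2 * sigma_c ** 2))
                        num += lam * w * D[ny, nx]
                        den += lam * w
                D_new[y, x] = num / den
        D = D_new
    return D
```

With a constant-color guide, a hole pixel converges to the average of its neighbors; across a color edge, the smoothness weight collapses and depth discontinuities are preserved.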
Under similar assumptions, several augmented models were also proposed to handle
inpainting and super-resolution [12–16], with special choices of the data and smoothness
terms as well as additional regularization terms, for instance, effective image-guided
regularizations like the TV-ℓ1 norm [16], anisotropic total generalized variation [17],
mutual structure [18], and even a regularization term without texture information [16].
Modern global optimization methods also attempt joint static and dynamic
guidance [19], employ statistical inference for local structures [20; 21], or explicitly
enforce local geometric structures [22; 23]. However, the high computational
cost of these methods hinders real-time applications, except for some carefully designed
accelerations [24].
With similar assumptions as above, Kopf et al. [25] proposed the Joint Bilateral
Filter (JBF), a kind of high-dimensional filter, to efficiently filter the noisy and
low-resolution depth image under the guidance of the corresponding RGB image. It
extends the famous Bilateral Filter (BF) [26] by taking the structural guidance
from external feature maps, with weights defined by spatial nearness and
feature proximity. To reduce the texture-copying and edge-blurring artifacts underlying
the JBF, and to further strengthen its structural filtering, a list of JBF variants
has been proposed in the recent literature [1; 27–32]. The features can be texture/depth
intensities or patches [27; 31], among other specifically defined ones.
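A minimal sketch of the JBF idea follows, with weights combining spatial nearness and guide-feature proximity; the parameter values and the scalar (grayscale) guide are illustrative assumptions, not taken from the cited works:

```python
import numpy as np

def joint_bilateral_filter(depth, guide, radius=3, sigma_s=2.0, sigma_r=8.0):
    """Joint bilateral filtering of a depth map under a guide image.

    Each output pixel is a weighted mean of depth values in a window;
    weights combine spatial nearness (Gaussian in pixel distance) and
    feature proximity (Gaussian in guide-image difference), so depth is
    smoothed within similar-texture regions but not across guide edges.
    """
    H, W = depth.shape
    out = np.zeros((H, W), dtype=np.float64)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys ** 2 + xs ** 2) / (2 * sigma_s ** 2))
    pad = radius
    d = np.pad(depth.astype(np.float64), pad, mode='edge')
    g = np.pad(guide.astype(np.float64), pad, mode='edge')
    for y in range(H):
        for x in range(W):
            dp = d[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            gp = g[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            rng = np.exp(-(gp - g[y + pad, x + pad]) ** 2 / (2 * sigma_r ** 2))
            w = spatial * rng
            out[y, x] = (w * dp).sum() / w.sum()
    return out
```

Because the range weight collapses across strong guide edges, depth discontinuities co-aligned with texture edges survive the filtering, which is exactly the guidance assumption stated above.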
Another variant uses the median of the weighted depth-candidate histogram [3; 4;
33] instead of its mean as the JBF does, producing much more
robust results but suffering from quantization error and slower speed. Weighted
mode filtering [3; 5; 34] instead seeks the histogram's global mode, with similar
artifacts. To obtain satisfactory performance at lower computational cost, rather
than resorting to parallel computation units such as GPGPUs for a brute-force imple-
mentation, acceleration techniques should approximate the distribution (or histogram)
estimation by parametric formulations or other efficient non-parametric means.
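The difference between the weighted mean (JBF) and the weighted median can be sketched per pixel as follows; the candidates and weights would come from a local window, and this bare version ignores the quantization issue mentioned above:

```python
import numpy as np

def weighted_median(candidates, weights):
    """Median of a weighted histogram of depth candidates.

    The JBF output is the weighted mean of the candidates; the weighted
    median filter instead returns the candidate at which the cumulative
    weight first reaches half the total mass, which is far more robust
    to outlier candidates in the window.
    """
    order = np.argsort(candidates)
    c = np.asarray(candidates, dtype=np.float64)[order]
    w = np.asarray(weights, dtype=np.float64)[order]
    cum = np.cumsum(w)
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return c[idx]
```

For candidates [2.0, 2.1, 2.05, 9.0] with equal weights, the weighted median stays near 2.05 while the weighted mean is dragged to about 3.79 by the single outlier, which illustrates the robustness claim.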
In addition, the spatial enhancement, especially super-resolution and inpainting,
can be performed by patch matching throughout the depth image, which achieves
satisfactory visual results [35; 36] but at high computational cost.
The depth hole filling problem is strongly related to image inpainting and oc-
clusion handling in stereo vision. According to a recent work by Richardt et al. [1],
the standard JBF (and similarly a series of global optimization methods) can effi-
ciently and seamlessly interpolate these depth holes, but it is prone to producing
artifacts when a depth hole is too large or its corresponding texture patterns imply
unreliable structure inference. Moreover, the joint bilateral filtering operation cannot
preserve high-order surface structures unless extra specific settings are involved, be-
cause the extrapolated depth will always be piece-wise constant in a large hole. Many
stereo algorithms simply fill the depth holes from the background content, which
suffers from significant artifacts when the scene is too complex. Our work on hole filling
is related to the work of Wang et al. [2] on stereoscopic inpainting: they over-segmented
stereo images, fitted a plane to each segment with the estimated disparity, and then prop-
agated the parametric planes into holes by matching segments in a greedy way. Their
segment matching cost function relied heavily on the stereo images, which cannot be
exploited in the general case, while the plane regression technique is not precise enough
to estimate local surface structure.
In this thesis, the spatial enhancement is explored in two aspects. On one hand, a
hybrid strategy is proposed to upsample the raw depth image with interpolation and
faithfully complete the depth holes through structure propagation under the guidance
of the accompanying RGB image. On the other hand, a parameterized probabilistic
model is designed to approximate the weight distribution; the derived weighted
mode filter and weighted median filter perform similarly to the state-of-the-art
methods but require only a fraction of the runtime and computational complexity.
1.1.2 RGB-D Temporal Enhancement
Even though the spatial enhancement of depth maps has been extensively studied, as
discussed in the previous section, the temporal inconsistency problem is nevertheless
neglected in recent state-of-the-art methods, resulting in severe flickering artifacts
because the necessary temporal relationships between adjacent frames have not been
taken into consideration. Moreover, due to the various complex and even unpredictable
dynamic contents, as well as the spatial distortions in a depth video, it is not easy to
exactly locate the regions where temporal consistency should be enforced. Several existing
methods [1; 5] employ temporal texture similarity to extract 2D motion information,
but correct depth variation cannot always be maintained, causing severe motion blur
artifacts. In addition, typical treatments apply temporal consistency over a
short sequence (usually 2–3 frames), which is insufficient to generate stable and
temporally consistent results over hundreds of frames. Furthermore, over-smoothing
around the boundaries between dynamic objects and static scenes should be eliminated
to produce high-quality and well-defined depth videos.
This thesis presents an alternative method to enhance a depth video both spatially
and temporally by addressing two aspects of these problems:
• efficiently and effectively enforcing the temporal consistency where it is necessary,
• and enabling online processing.
A common fact is that regions in one frame with various motion patterns (e.g., static,
slowly/fast moving, etc.) belonging to different objects or structures require temporal
consistency at different levels. For instance, a static region needs long-range
temporal enhancement to ensure that it stays static over a long duration, while dynamic
regions with slow or rapid motions expect short-term or no temporal consistency. How-
ever, it is difficult to accurately enhance arbitrary and complex dynamic contents in
the temporal domain without apparent motion blur or depth distortion. Thus an
intuitive compromise forgoes temporal enhancement in the dynamic region as
long as its spatial enhancement is done sufficiently well, so that the necessary depth
variation is not distorted while temporal artifacts are not easily perceived in
the static region. Therefore, we aim to strengthen long-range temporal consistency
around the static region whilst maintaining the necessary depth variation in the dynamic
content. To accurately separate the static and dynamic regions, we track and incre-
mentally refine a probabilistic model called the static structure in an online fashion,
which acts as a medium to indicate the region that is static in the current frame. By
fusing the static region of the current frame into the static structure online, with an
efficient variational fusion scheme, this structure implicitly gathers all the temporal data
at and before the current frame that belong to it. Substituting the static region with
the updated static structure thus makes it temporally consistent and stable over a long
time span. Moreover, the scheme is also suitable for online processing of streaming
depth videos (3D teleconference, 3DTV, etc.) without the need to store long sequences
of adjacent frames, and is therefore memory- and computation-efficient.
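A heavily simplified stand-in for such per-pixel online fusion is a Gaussian precision-weighted update, which stores only the current estimate rather than any past frames. The thesis's actual variational scheme also maintains layer assignments and outlier handling; the observation noise variance r below is an assumed input:

```python
import numpy as np

class StaticStructure:
    """Per-pixel online fusion of depth observations into a static-scene estimate.

    Each static pixel keeps a Gaussian (mu, var); a new observation z with
    noise variance r is fused by the standard precision-weighted (Kalman-style)
    update, so only the current estimate is stored, never past frames.
    """
    def __init__(self, shape, prior_var=1e6):
        self.mu = np.zeros(shape)
        self.var = np.full(shape, prior_var)

    def fuse(self, z, static_mask, r=1.0):
        k = self.var / (self.var + r)            # per-pixel gain
        upd = static_mask & np.isfinite(z)       # skip dynamic pixels and holes
        self.mu[upd] += (k * (z - self.mu))[upd]
        self.var[upd] *= (1.0 - k)[upd]
        return self.mu
```

Each fused frame shrinks the per-pixel variance, so the static structure becomes increasingly stable over hundreds of frames while dynamic pixels are simply left out of the update.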
1.2 RGB-D Video Applications
Provided with RGB-D video sequences, many tasks that were once difficult using
only RGB sequences become possible and simpler. The depth videos either act
as an explicit data source for tasks that focus on the interpretation, manipulation or
inference of the geometrical structure of the captured content (for instance, 3D facial
pose estimation and tracking, 3D scene reconstruction, etc.), or provide implicit
geometrical cues for high-level learning or analysis of the 3D real world. In this thesis,
novel methods for geometrical manipulation and inference are explored, such as a
depth-based robust 3D facial pose tracking system with online face model personaliza-
tion, and two by-products of the proposed temporal enhancement: novel view synthesis
and background subtraction.
Conventional three-dimensional television (3DTV) systems require binocular or
multi-view stereo RGB videos as input. With the wide popularity of RGB-D
cameras, modern systems have become compatible with one or several synchronized
high-quality RGB-D video pairs for synthesizing a new frame on the screen from a
novel viewpoint. This is further extended to the free-viewpoint television (FTV) system
if it can synthesize a novel video from any viewpoint in front of the screen. However,
the trade-off between the transmission and storage budget and the complete coverage
of the captured 3D scene suggests a sparser RGB-D camera setup. To facili-
tate visually plausible novel view synthesis, sufficiently accurate depth videos should
be derived from the raw RGB-D videos, and the texture holes in the novel view
should be faithfully recovered. In this thesis, novel view synthesis is presented as
an application of the proposed spatio-temporal depth video enhancement, which inher-
ently includes an online 3D static scene reconstruction. The progressively updated 3D
static scene offers reliable inference of the content in the texture holes. In addition, the
resulting structure-optimized depth videos greatly reduce the misalignment errors be-
tween the depth and RGB frames, as well as the structure errors when filling depth holes,
while smoothing the noise and outliers. These advantages enable structure-optimized
novel view synthesis with reduced spatial distortions.
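The geometric core of depth-image-based novel view synthesis is a forward warp: back-project each pixel with its depth, apply the rigid change of viewpoint, and re-project. A sketch under the assumption of known intrinsics K and relative pose (R, t); hole filling and occlusion handling, which the text addresses via the static structure, are omitted:

```python
import numpy as np

def warp_to_novel_view(depth, K, R, t):
    """Forward-warp pixel coordinates of a depth map into a novel view.

    Each pixel (u, v) with depth Z is back-projected with intrinsics K,
    rigidly transformed by (R, t), and re-projected. Pixels falling outside
    the view or becoming occluded leave holes in the synthesized image.
    Returns the novel-view pixel coordinates and the transformed depths.
    """
    H, W = depth.shape
    vs, us = np.mgrid[0:H, 0:W]
    Z = depth.ravel()
    pix = np.stack([us.ravel() * Z, vs.ravel() * Z, Z])  # homogeneous coords * Z
    pts = np.linalg.inv(K) @ pix                         # 3D points, camera frame
    pts2 = R @ pts + t.reshape(3, 1)                     # novel camera frame
    proj = K @ pts2
    u2, v2 = proj[0] / proj[2], proj[1] / proj[2]
    return u2.reshape(H, W), v2.reshape(H, W), pts2[2].reshape(H, W)
```

With the identity pose every pixel maps to itself; a pure horizontal camera translation t_x shifts pixels by f·t_x/Z, which is why depth errors translate directly into warping artifacts in the novel view.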
Robust tracking of the 3D facial pose is an essential task in the fields of com-
puter vision and computer graphics, with applications in facial performance capture,
human-computer interaction, immersive 3DTV and FTV, as well as VR and AR sys-
tems. Traditionally, facial pose tracking has been successfully performed on RGB
videos [37–45] for well-constrained scenes, but illumination variations, shadows,
and large, severe occlusions hamper RGB-based facial pose tracking systems
from being employed in unconstrained scenarios. Unconstrained scenarios,
on the other hand, are much more common in numerous consumer applications, e.g.,
interactive games in VR/AR, virtual chat, etc. Fortunately, driven by the emer-
gence of commodity real-time range sensors, utilizing depth information has become
a new trend for robust 3D facial pose tracking, since the depth data explicitly encode
spatial relationships and give additional cues for occlusion reasoning. Although
promising results have been reported by leveraging both RGB video and depth data
to facilitate unconstrained facial pose tracking, these methods cannot reliably handle
occlusion when the RGB data alone are inadequate due to inconsistent or poor lighting
conditions. Therefore, exploring the depth data alone for robust 3D facial pose tracking
is meaningful as an alternative that is complementary to traditional tracking systems.
In unconstrained scenarios with depth cameras as input, there are new challenges: (1)
complex self-occlusions and object-to-face occlusions caused by hair, accessories, hands,
etc.; (2) the facial pose tracking algorithm should always be available and online-
adaptive to any user without manual calibration; (3) the tracking should be stable over
time and not vulnerable to users' expression variations. Unlike previous depth-based
approaches built on discriminative or data-driven methods [46–52] that require sophis-
ticated training or manual intervention, we leverage a parameterized generative face
model and robust occlusion-aware pose estimation to build a robust 3D facial
pose tracking system. It is designed to handle large and complex occlusions in uncon-
trolled scenes under inconsistent illumination changes or poor lighting conditions,
and enables simultaneous facial pose tracking and face model personalization on-the-fly.
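For context, the closed-form alignment step at the core of ICP-style rigid pose updates (ICP [9] is used as a comparison baseline in Chapter 5) is the Kabsch/Procrustes solution via SVD. The proposed tracker replaces the point matching and occlusion weighting around this step, not the step itself; this is a generic sketch, not the thesis's ray-visibility method:

```python
import numpy as np

def rigid_align(P, Q):
    """Least-squares rigid transform (R, t) mapping points P onto Q.

    P, Q: (N, 3) arrays of corresponding 3D points. This is the
    Kabsch/Procrustes step used inside ICP-style pose estimation:
    center both sets, take the SVD of the cross-covariance, and
    guard against reflections.
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # +1 rotation, -1 reflection
    S = np.diag([1.0, 1.0, d])
    R = Vt.T @ S @ U.T
    t = cq - R @ cp
    return R, t
```

Given correct correspondences this recovers the pose exactly; the hard part in practice, and the focus of the proposed occlusion-aware estimator, is deciding which depth points should count as correspondences at all.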
1.3 The Probabilistic Models
By interpreting the target problems in this thesis from a probabilistic view, not
only can we handle the uncertainties arising in each task, e.g., noise, outliers and other
artifacts, but the probabilistic view also encourages the formulation of compact and
learnable models with reliable predictions, as long as proper models are selected.
Furthermore, a generative model is a complete probabilistic model that describes
the distribution of the observations as well as the underlying priors. Its advantages
over the discriminative model are that it is a full probabilistic model of all variables,
and that it can simulate the inherent (or hidden) prior distributions and randomly
sample the observations. In contrast, the discriminative model only focuses on the
posterior and does not really care what the underlying model is. In particular, for the
tasks of temporal RGB-D video enhancement and 3D facial pose tracking, the genera-
tive model expresses more complex relationships between the observed depth data and
the hidden probabilistic models, such as the static structure and the 3D multilinear
morphable face model [7]. In addition, the generative model can faithfully predict on
its own when no observations are available.
The parameter estimation techniques with respect to a probabilistic generative
model vary from case to case. In this thesis, online variational Bayesian methods are
employed because of their effectiveness and memory-efficiency for online model adapta-
tion, as well as their ability to handle the intractable integrals arising in Bayesian inference.
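The memory-efficiency argument can be illustrated with the simplest conjugate case: an online Bayesian update of a Gaussian mean with known noise variance, where the posterior after each frame becomes the prior for the next. The thesis's models need variational approximations precisely because their posteriors are not conjugate like this toy case:

```python
def online_gaussian_update(mu0, var0, z, r):
    """Conjugate Bayesian update of a Gaussian prior N(mu0, var0) over an
    unknown mean, given one observation z with known noise variance r.

    The posterior is again Gaussian, so only two numbers per parameter are
    ever stored: the posterior (mu1, var1) simply becomes the next prior,
    which is the memory-efficiency property online methods exploit.
    """
    prec = 1.0 / var0 + 1.0 / r       # posterior precision
    var1 = 1.0 / prec
    mu1 = var1 * (mu0 / var0 + z / r)  # precision-weighted mean
    return mu1, var1
```

Running this update frame by frame gives exactly the same posterior as a batch computation over all frames, without ever storing them.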
1.4 Thesis Contributions
This thesis carries out research on spatial and temporal RGB-D video enhancement,
and robust 3D facial pose tracking with online face model personalization. The applied
RGB-D video data were captured by a Kinect (version 1) and a low-resolution time-of-flight camera. The contributions of the thesis are as follows:
1. In the part of spatial enhancement of RGB-D videos, this thesis presents two
practical solutions: (a) One approach is a hybrid strategy combining the segment-
based parametrized structure propagation and the depth interpolation with high-
dimensional guided filtering. A new arbitrary-shape patch matching method is
proposed to robustly extend neighboring patches’ structures into the query patch.
Experiments show that the proposed method outperforms the state-of-the-art
methods with respect to the depth hole filling problem. (b) The other approach is a novel parameterized probabilistic model for accelerating the estimation of the weighted distribution. Different from conventional methods, which need a large number of filtering operations to estimate a sufficiently accurate distribution, the proposed method only requires a small, finite number of filtering operations determined by the structure of the input image. The derived weighted mode and median filters are much faster yet as effective as state-of-the-art methods in various applications, such as the spatial enhancement of RGB-D videos, detailed contrast enhancement, and JPEG compression artifact removal.
2. The temporally consistent RGB-D video enhancement is performed by introduc-
ing the static structure of the captured scene, which is estimated online by a
probabilistic generative mixture model with efficient variational parameter ap-
proximation, spatial enhancement and update scheme. Based on this special
probabilistic static structure, the proposed enhancement aims at strengthening
the long-range temporal consistency around the static region whilst maintaining
necessary depth variation in the dynamic content. The proposed framework is compatible with online streaming RGB-D videos, so there is no need to store long sequences of adjacent frames; it is thus memory- and computation-efficient.
3. This thesis unifies the 3D facial pose tracking and online identity adaptation
based on a parameterized generative face model that integrates the descriptions
of shape, identity and expression. This face model not only effectively models the identity but also provides a statistical interpretation of the expression. By tracing the identity distribution from a generative perspective, the face model can be gradually adapted to the user with sequentially input depth frames.
The occlusion-aware pose estimation is achieved by minimizing an information-
theoretic ray visibility score that regularizes the visibility of the face model in the
current depth frame. This method does not need explicit correspondence detection, yet it both accurately estimates the facial pose and robustly handles the occlusion problem.
1.5 Outline
This thesis is organized into six chapters.
Chapter 2 provides the detailed algorithm about the spatial depth enhancement based
on a hybrid strategy combining the segment-based parameterized depth structure prop-
agation and depth interpolation based on high-dimensional guided filtering.
Chapter 3 presents a parameterized probabilistic approximation method for the acceleration of weighted median/mode filtering, which is much faster with barely any sacrifice in performance.
Chapter 4 proposes a temporally consistent depth video enhancement method based
on the online estimation of a probabilistic generative model called static structure.
Specifically, this chapter describes a two-stage procedure designed separately for the
static and dynamic regions of the current depth frame, both enabling long-term tem-
poral consistency and preserving necessary depth variations.
Chapter 5 presents a unified framework for robust 3D facial pose tracking and online face model personalization. The facial tracking thread consists of a novel correspondence-free and occlusion-aware rigid pose tracking method, while the generative face model in the online personalization thread effectively depicts the identity and is robust to the shape variations caused by expression changes.
Chapter 6 provides conclusions for the works listed above and suggests a number of
areas to be pursued as future work.
Chapter 2
Hybrid Geometric Hole Filling Strategy for
Spatial Enhancement
2.1 Introduction
Assume the raw depth image captured from a commodity depth sensor has a lower resolution than the corresponding color image, and contains noise, outliers and severe depth holes. This chapter tackles the low resolution, noise and outliers in a raw depth image, together with a special treatment of the large-hole filling problem. In particular, the depth holes originate from depth upsampling, unreliable-depth removal, and the depth-missing regions. In the first step, we invalidate
and remove unreliable depth pixels that are within vulnerable regions around complex
discontinuities or structures, then align the depth map with the color image and map it
into the color image’s coordinate. In the second step, a hybrid strategy is proposed to
fill in the depth hole by the combination of segment-based structure propagation and
depth interpolation. After that, a standard joint bilateral filter is applied to refine the
depth image. The overall framework is shown in Figure 2.1.
Figure 2.1: Framework of the proposed method. Given a depth map and a color image, the pipeline proceeds through alignment of the depth and image pair, invalidation of low-reliability depth, hole region partition, filtering-based depth interpolation and segment-based depth inference, and finally depth map refinement.
2.2 Related Work
Spatial depth map enhancement has been extensively studied for years. The most
studied problem is upsampling and smoothing. A pioneering work in this field was done by Diebel et al. [11]. They model the depth upsampling problem as a Markov Random Field (MRF) with the assumptions that 1) discontinuities in the color image and the corresponding depth map should be co-aligned, and 2) pixels with similar texture should have similar depth. In a similar fashion, many researchers [12; 13] also use MRF or auto-regression models to upsample and smooth the depth surface while preserving the discontinuities. Differences among their works mainly come from the smoothness or regularization terms in their objective functions. However, such energy minimization methods are always computationally expensive, which hinders a variety of real-time applications.
Based on similar assumptions, Kopf et al. [25] proposed Joint Bilateral Upsampling (JBU) for fast and effective upsampling and smoothing of low-resolution and noisy depth maps, as an extension of the famous bilateral filter [26]. To address the artifacts that occur in JBU, e.g., texture copying and edge blurring, numerous modified filters have been proposed [27; 29; 30] in recent years.
The depth hole filling problem is related to image inpainting and to occlusion handling in stereo vision. According to the recent work of Richardt et al. [1], the standard joint bilateral filter (JBF) can efficiently fill depth holes, but it is prone to producing artifacts when a hole region is too large. Moreover, depth interpolation by filtering methods cannot preserve the geometric surface structures, because the extrapolated depth surfaces can only be piecewise constant. On the other hand, many stereo algorithms simply fill the depth holes from the background, under the assumption that occlusions are usually located in background regions; these all suffer significant artifacts when the captured scene is complex. Our work on hole filling is related to the stereoscopic inpainting work of Wang et al. [2], who over-segment stereo images, fit a plane to each segment with the estimated disparities, and then propagate the 3D planes into holes by matching segments in a greedy way. Their cost function for segment matching heavily relies on the stereo images and thus cannot be exploited in general cases, and their plane fitting procedure is not precise enough to estimate the local surface structures.
Figure 2.2: Align the depth map into the color image coordinate and then partition the hole region into Ωs and Ωf. The test depth map comes from the Middlebury dataset.
2.3 Proposed Method
We take an image I and its corresponding depth map D as inputs. Define the set of
invisible (hole) pixels as Ω, and the set of visible pixels as Ψ.
2.3.1 Unreliable Region Detection and Invalidation
Before transforming the depth map into the image's coordinate, we need to invalidate unreliable depth pixels. The reliability can be measured by the depth gradient, as mentioned in [1], because unreliable pixels always occur along depth discontinuities or in neighbourhoods with high depth variance, where depth cameras cannot capture depth accurately. Moreover, most real-time depth sensors exhibit mismatching errors between color and depth edges due to calibration error between the color camera and the depth sensor. Invalidating such low-reliability regions and filling in the depth under the guidance of the image diminishes the edge mismatching problem and increases the reliability of the depth values.
In detail, a Sobel approximation is applied to compute the depth gradient, and we invalidate pixels whose gradient magnitude is larger than a given threshold τ.
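The invalidation step above can be sketched with SciPy's Sobel operator; the function name `invalidate_unreliable` and the default threshold `tau` are illustrative choices, not values from the thesis:

```python
import numpy as np
from scipy import ndimage

def invalidate_unreliable(depth, tau=3.0):
    """Mark depth pixels whose gradient magnitude exceeds tau as invalid (NaN)."""
    gx = ndimage.sobel(depth, axis=1, mode="nearest")  # Sobel approximation of d/dx
    gy = ndimage.sobel(depth, axis=0, mode="nearest")  # Sobel approximation of d/dy
    grad = np.hypot(gx, gy)
    out = depth.astype(float)
    out[grad > tau] = np.nan  # invalidated pixels join the hole set
    return out
```

Invalidated pixels are then refilled under image guidance by the hybrid strategy of Section 2.3.2.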
2.3.2 Hybrid Strategy of Geometric Hole Filling
After invalidating the unreliable regions and transforming the depth map into the color image's coordinate, the resultant depth map contains three types of holes: holes from occlusion and/or specular regions Ωo, from invalidation Ωd, and from sparse upsampling Ωu. Therefore, we define the hole set Ω as
Ω = {p | p ∈ Ωo ∪ Ωu ∪ Ωd}, (2.1)
where p indicates a pixel coordinate. Our proposed hybrid strategy is a combination of filtering and surface structure propagation. Filtering-based approaches are quite efficient for interpolating depth values when the hole region is small, but they may fail on large holes. In that case, we exploit the widely used segment constraint [2] to infer the structure: we segment a hole and its neighbors into several small patches according to the guiding color image, and assume each patch has a smooth surface structure without sudden depth variation. A patch with enough depth samples can then be modeled by a plane or a curved surface, and it is reasonable to propagate its surface parameters into neighboring patches with similar textures in the hole.
Our hole filling process first partitions the hole set Ω into two subsets Ωf and Ωs, and then employs depth interpolation in Ωf and depth inference in Ωs; see Figure 2.2.
Hole Region Partition
A pixel is considered to be in the region Ωf when its local w × w window has enough informative samples to interpolate its depth. We dilate the visible region Ψ by a square of width w, as Ψw = Dilation(Ψ, w); then any pixel p ∈ Ψw ∩ Ω has at least one sample. The sets Ωs and Ωf are

Ωs = Dilation((Ω − Ψw ∩ Ω), w) ∩ Ω    (2.2)
Ωf = Ω − Ωs    (2.3)

The dilation operation in Equation (2.2) safely excludes from Ωf the pixels that have insufficient depth samples in their neighborhood.
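Equations (2.2)–(2.3) map directly onto morphological operations; a minimal sketch follows (the function name `partition_holes` and the boolean-mask interface are assumptions):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def partition_holes(visible, w=5):
    """Split the hole set Omega into Omega_f (enough samples for filtering)
    and Omega_s (needs segment-based inference), per Eqs. (2.2)-(2.3).
    `visible` is a boolean mask of pixels with valid depth (Psi)."""
    omega = ~visible
    struct = np.ones((w, w), dtype=bool)               # square of width w
    psi_w = binary_dilation(visible, structure=struct)  # Psi_w = Dilation(Psi, w)
    inner = omega & ~psi_w                              # Omega - (Psi_w ∩ Omega)
    omega_s = binary_dilation(inner, structure=struct) & omega
    omega_f = omega & ~omega_s
    return omega_f, omega_s
```

Thin holes land entirely in Ωf, while the interiors of large holes (plus a safety margin) land in Ωs.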
Depth Interpolation by Filtering
To fill Ωf, a standard joint bilateral filter [25] is utilized. For each pixel p ∈ Ωf and its visible local neighbours q ∈ 𝒩p ∩ Ψ in a w × w window, the estimated depth is

Dp = (1/Np) Σ_{q ∈ 𝒩p ∩ Ψ} Gs(p, q) Gr(Ip, Iq) Dq    (2.4)
18 CHAP. 2. HYBRID GEOMETRIC HOLE FILLING STRATEGY FOR SPATIAL ENHANCEMENT
where Gs and Gr are Gaussian kernel functions with standard deviations σs and σr, measuring the spatial similarity and the range (color) similarity, respectively. Np is the normalization factor that ensures the weights sum to one.
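Equation (2.4) can be sketched as a brute-force loop over the hole pixels (a sketch only; practical implementations vectorize or use constant-time filters, and the names and defaults here are illustrative):

```python
import numpy as np

def jbf_fill(depth, image, hole_mask, w=7, sigma_s=2.0, sigma_r=0.1):
    """Fill each hole pixel p with Eq. (2.4): a Gaussian-weighted average of
    visible neighbours, weighted by spatial and colour (guidance) similarity."""
    H, W = depth.shape
    r = w // 2
    out = depth.copy()
    for py, px in zip(*np.nonzero(hole_mask)):
        y0, y1 = max(0, py - r), min(H, py + r + 1)
        x0, x1 = max(0, px - r), min(W, px + r + 1)
        d = depth[y0:y1, x0:x1]
        valid = ~np.isnan(d)                      # q in N_p ∩ Psi
        if not valid.any():
            continue
        gy, gx = np.mgrid[y0:y1, x0:x1]
        ws = np.exp(-((gy - py) ** 2 + (gx - px) ** 2) / (2 * sigma_s ** 2))
        wr = np.exp(-(image[y0:y1, x0:x1] - image[py, px]) ** 2 / (2 * sigma_r ** 2))
        wt = (ws * wr)[valid]
        out[py, px] = np.sum(wt * d[valid]) / np.sum(wt)  # weights normalized to one
    return out
```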
Depth Inference under Segment Constraint
Many successful super-pixel segmentation methods have been published recently; in this application we use a fast method called simple linear iterative clustering (SLIC) [53] to group pixels into a set of color patches, within which pixels share similar color or texture. Patches that overlap Ωs are then sorted into two sets Sv and Su: Sv is the set in which each patch has enough visible pixels (e.g., more than 50%) to infer its surface structure, and the patches in Su do not.
Surface model estimation for patches in Sv. For simplicity, we can just model
the surface by
D(u, v) = a0 + a1u + a2v, or    (2.5)
D(u, v) = a0 + a1u + a2v + a3u² + a4v² + a5uv    (2.6)
where Equation (2.5) is the linear form, and Equation (2.6) is the quadratic form.
We use RANSAC to robustly estimate each patch's surface model. Moreover, for the sake of accuracy, we can alternatively transform the depth map into the 3D metric coordinate (X, Y, Z) and model the function Z(X, Y) in a similar way. In this case, recovering pixel p's depth amounts to finding the intersection of the surface and the line-of-sight through p.
After estimating the surface models for the patches in Sv, their invisible pixels can be efficiently inferred. At the same time, the surface models of visible patches are also estimated. We can further refine them by merging patches with similar surface structure and then re-calculating their surface models.
Surface propagation for patches in Su. This turns out to be a patch matching problem. Here we propose a greedy algorithm that robustly finds the most similar patch pair according to a novel matching cost.
Our algorithm first selects a candidate patch set CSu = {Pv} for Su, where each Pv has an estimated surface model and lies near the hole Su, because the surface structure is more consistent and reliable near the hole boundary; the filling process thus proceeds from outer to inner patches. In each iteration, we find the best-matched patch in CSu, assign its surface model to the query patch Pu to fill in the depth, and then add Pu to CSu. This process continues until all patches in Su are filled.

Figure 2.3: Illustration of the patch matching process. The left image is the segmented color image; the right one is a close-up of the local region marked blue in the left image. Pu is the query patch; Pv is in the candidate patch set. A detailed description is in the text.
Given a patch Pv ∈ CSu and a patch Pu ∈ Su, we want to measure their similarity. Since each patch has an arbitrary shape, the commonly used MSE is inapplicable, and the mean intensity is not distinctive enough. Our proposed method randomly selects n pixels in Pu as p_u^j, j = 1, …, n, and k pixels in Pv as p_v^i, i = 1, …, k, and assigns an m × m square sub-patch to each selected pixel, denoted B_v^i in Pv and B_u^j in Pu, respectively. If two patches are similar, their sub-patch matching cost should be minimal. Sub-patch matching is valid because it considers the color and spatial distributions of texture while being able to handle patches of arbitrary shape.
To robustly estimate their similarity without introducing mismatches, we propose a shape-adapted sum-of-squares cost to measure the similarity between B_v^i and B_u^j:

E_{B_u^j}(B_v^i) = ‖K_v^i ◦ (B_v^i − B_u^j)‖²_F / N_v^i + ‖K_u^j ◦ (B_v^i − B_u^j)‖²_F / N_u^j    (2.7)

where K_v^i and K_u^j are bilateral kernels centred at pixels p_v^i and p_u^j, defined similarly to Equation (2.4), which measure the color and spatial similarity of the centre pixel against its neighbours; ◦ represents element-wise multiplication; and N_v^i and N_u^j are normalization factors as in Section 2.3.2. The cost between B_u^j and the patch Pv is then E_{B_u^j}(Pv) = (1/k) Σ_{i=1}^{k} E_{B_u^j}(B_v^i).
Therefore, given CSu and a query patch Pu, for each B_u^j in Pu we can find the best patch P_{v*} with the smallest cost. We then form a histogram in which each bin corresponds to a candidate patch, and whose bin value is the number of sub-patches in Pu that match that candidate patch. The bin with the largest value refers to the most similar patch. We normalize the histogram and denote it as H_{Pu}(Pv), where Pv ∈ CSu.
Since it is possible to find more than one patch with similar color, we further add a spatial constraint into our framework. In detail, we measure the Euclidean distance between the centre pixels of the two patches, d(Pu, Pv), and normalize the distance by an exponential function, giving the overall cost function

T_{Pu}(Pv) = H_{Pu}(Pv) · exp(−d(Pu, Pv)² / (2σ_d²))    (2.8)
The maximum value of T_{Pu}(Pv) indicates the optimal patch pair. Because patches with similar texture may have different surface structures, choosing only the best-matched patch may inevitably introduce errors. To mitigate this, we fill the query patch with candidates from the most similar to the least similar; once the filled patch is consistent with its local neighbours, the process stops.
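The vote histogram and the spatial prior of Eq. (2.8) combine into a simple candidate ranking; a sketch, with `rank_candidates` and its inputs (`votes`, `dists`) as hypothetical names:

```python
import numpy as np

def rank_candidates(votes, dists, sigma_d=20.0):
    """Score each candidate patch by Eq. (2.8),
    T(P_v) = H(P_v) * exp(-d(P_u, P_v)^2 / (2 sigma_d^2)),
    and return candidate indices from best to worst, so the query patch
    can be filled from the most similar candidate down until the result
    is locally consistent. `votes[i]` counts the sub-patches of P_u whose
    best match is candidate i; `dists[i]` is the centre-to-centre distance."""
    votes = np.asarray(votes, dtype=float)
    votes = votes / votes.sum()                      # normalized histogram H
    t = votes * np.exp(-np.asarray(dists, float) ** 2 / (2 * sigma_d ** 2))
    return np.argsort(-t)                            # descending by score T
```

Note how a candidate with many votes but a distant centre can still lose to a nearer, moderately voted one.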
Depth Map Refinement
After filling in all the missing pixels in the depth map, we can further refine it to reduce noise and artifacts, as well as enhance the depth structure according to the guidance image. We find that a standard joint bilateral filter is sufficient to provide effective and efficient results.
2.4 Experiments
In this section, we evaluate the performance of the proposed algorithm and compare it with existing methods. Since the main contribution of our work is the hole filling strategy, we compare its performance with other hole filling methods: the multi-resolution joint bilateral upsampling (MR-JBU) of Richardt et al. [1], and the method of Wang et al. [2]. Test scenes are from the Middlebury datasets¹. We
1http://vision.middlebury.edu/stereo/data/
Figure 2.4: Middlebury datasets employed for the experimental comparisons: (a) color images, (b) raw depth maps, (c) ground truth. Test scenes (from left to right) are Reindeer, Midd2 and Teddy.
choose the linear form to model the surface, similar to [2], for a fair comparison. The noisy depth map is constructed by introducing occlusions according to cross-checking of the stereo images, down-sampling (2×), and adding Gaussian noise.
A visual comparison is presented in Figure 2.5. JBU clearly suffers from texture copying and blurring artifacts, while Wang's greedy patch matching algorithm also produces apparent mismatching errors, since the stereo constraint is not applicable. Representative artifacts are shown in red boxes. Quantitative comparisons are made by measuring the average percentage of bad pixels (BPR, error ≥ 1) and the mean absolute difference (MAD); the results on the three test scenes are listed in Tables 2.1 and 2.2, and our method outperforms the other algorithms with the lowest BPR and MAD scores (in bold font).

Figure 2.5: Visual comparison on the Middlebury datasets: (a) color images, (b) MR-JBU [1], (c) Wang's [2] method, (d) the proposed method. Test scenes (from left to right) are Reindeer, Midd2 and Teddy.

According to the quantitative and qualitative comparisons, the proposed method performs satisfactorily and better than the other methods.
           MR-JBU [1]   Wang's [2]   Ours
Reindeer      8.35         3.65      3.33
Midd2        14.10         3.10      2.51
Teddy         7.23         4.09      3.66

Table 2.1: Comparison of bad pixel rate (%)

           MR-JBU [1]   Wang's [2]   Ours
Reindeer      1.13         0.98      0.47
Midd2         1.67         0.62      0.31
Teddy         0.68         0.64      0.40

Table 2.2: Comparison of mean absolute difference
2.5 Summary
In this chapter, a new depth image enhancement approach has been proposed as a hybrid strategy combining filtering- and segment-based structure propagation. Specifically, this thesis has presented a new arbitrary-shape patch matching method to robustly extend neighbouring patches' structures into the query patch. Experiments show that the proposed method outperforms the reference methods on the depth hole filling problem. In the future, we will pay more attention to improving the robustness of the depth inference model, so that the filled regions are seamless and contain fewer mismatching errors.
Chapter 3
Weighted Structure Filters Based on Parametric
Structural Decomposition
3.1 Introduction
A variety of popular image filters in computer vision are related to the local statistics
of the input image. For example, the median filter outputs the point that reaches
half of the local cumulative distribution [4; 54; 55]. The weighted mode filter [5; 56;
57] tries to find the global mode of the local distribution. Moreover, the widely popular bilateral filter [26] can be expressed as the mean of the local distribution estimated by a Gaussian kernel density estimator [58]. Provided a guidance feature map (e.g., image intensity, a patch, etc.), the weighted local distribution can be modified to jointly reflect the statistics of both the input image and the feature map, which further contributes to several kinds of structure- or style-transfer applications, like depth or disparity refinement in stereo matching [4; 5] and joint filtering [25].
Without explicitly estimating the local distribution, a number of approaches have been designed to accelerate the bilateral filter or similar weighted-average filters, such as the domain transform filter [32], the adaptive manifolds filter [31] and the guided filter [59]. However, efficient methods for direct estimation of the local distributions deserve further attention, because many applications require direct operations on these distributions. Although a brute-force implementation is still adopted in many computer vision systems, its high complexity limits its popularity and hampers real-time systems and applications. Constant-time algorithms for estimating the local distributions (or histograms) have been proposed in the literature, for instance the constant-time weighted median filter [4] and the smoothed local histogram filters [3]. The complexity of these methods depends on the number of bins used to generate the histograms, as well as the complexity of the filtering operation that calculates the value of each bin. Even though the complexity of each filtering operation has been reported as O(1) in the literature, an 8-bit single-channel grayscale image usually needs 256 bins to produce a sufficiently accurate result, not to mention continuous or high-precision images.
Related to but different from these methods, in this chapter we propose a novel distribution estimation method, designed for efficiency, to accelerate various image filters. It is based on kernel density estimation with a new separable kernel defined by a weighted combination of a series of probabilistic generative models. The resultant distribution requires a much reduced number of filtering operations, which is moreover independent of the values of the bins. The number of filtering operations is exactly the number of models used, and is usually smaller than the number of bins, which abates the computational complexity. The required models can be a uniform quantization of the domain of the input image, or locally adaptive to the structures of the input. Since a local patch of an image can almost always be decomposed into a limited number of distinct local structures, only a small number of the locally adaptive models are necessary, and the complexity is further reduced. We also accelerate the weighted mode filter and the weighted median filter by leveraging the proposed distribution estimation method. They achieve comparable performance in various applications but run faster than current state-of-the-art algorithms.
3.2 Related Work
Weighted-average filters, like the bilateral filter [25; 26], implicitly reflect properties of the local distribution. The brute-force implementation generally suffers from inefficiency. In [60], an approximate solution was proposed by formulating bilateral filtering as a high-dimensional low-pass filter, which can be accelerated by downsampling the space-range domain. Following this idea, different data structures have since been proposed to further speed up the filters [31; 61–63], among which the adaptive manifolds [31] caught our attention and inspired our construction of the locally adaptive models. The guided filter [59] is a popular and efficient constant-time alternative: it can imitate a filter response similar to that of the bilateral filter, but enforces a locally linear relationship between the filtering output and the guidance image. The domain transform filter [32] also produces a similar constant-time edge-preserving filter and achieves real-time performance without quantization or coarsening.
The median filter might be the first image filter that explicitly applies the local histogram (a discretized distribution). Unlike the weighted median filter, for which little work has focused on acceleration, its unweighted counterpart has received several constant-time solutions. One class of algorithms was presented in the literature to lessen the histogram update complexity [54; 55]. Another, introduced by Kass and Solomon [3], draws isotropic filtering into the construction of a so-called smoothed local histogram, which is a special case of kernel density estimation; the median and mode of this histogram are then estimated via a look-up table.
The weighted median filter, as well as the weighted mode filter, cannot directly duplicate this success, since the weights are spatially varying for each local window. Min et al. [5] proposed a weighted mode filter that adopts bilateral weights for depth video enhancement, but it lacks an efficient implementation. The constant-time weighted median filter [4] for disparity refinement is one of the most recent works that tries to accelerate the construction of the local distribution. This method performs edge-preserving filtering to produce the probability of each bin in the local histogram; the number of bins determines the number of filtering operations applied. Thus it is less effective when hundreds of intensity levels are required, especially when processing natural images.
3.3 Motivation and Background
3.3.1 Non-parametric Representations of Local Image Statistics
Given an input grayscale image¹ f and its corresponding feature map as its guidance, the intensity distribution h(x, ·) in a patch centered at pixel x can be represented non-parametrically by anisotropic kernel regression [64] as

h(x, g) = (1/Z(x)) Σ_{y ∈ Ωx} w(x, y) φ_x(g, f_y),    (3.1)

where Ωx is a local neighborhood centered at x, whose area is the same as the target patch. The kernel φ(·, ·) varies across applications; a common choice is the Gaussian kernel φ(u, v; λ) = √(λ/(2π)) exp{−(λ/2)‖u − v‖²}, where λ indicates its bandwidth and controls the
¹A color image stacks red, green and blue intensity maps, each of which has a similar non-parametric representation to that of a grayscale image.
Figure 3.1: Illustration of correlations among structures in local patches. (a) is the sample image; four patches A, B, C and D were selected from the area in the black box. (b) shows the histograms of the four patches, fitted by kernel regression; the revealed modes indicate the local structures, labeled #1 to #4. (c) indicates the locations of these structures in each patch. These structures vary slowly in a local neighborhood and are shared among the patches.
distribution smoothness. It is worth noting that λ → ∞ results in φ(u, v; ∞) = δ(u − v), where δ(·) is the Kronecker delta function, which renders h(x, ·) a weighted histogram [4]. The normalization factor is Z(x) = Σ_{y ∈ Ωx} w(x, y), while the weight w(x, y) measures the spatial nearness and guidance-feature affinity between x and y, controlling the impact of pixel y on the center pixel x. Thus this distribution is not only controlled by the intensity distribution but also adjusted by the guidance feature affinity. Despite the large amount of data needed to describe the local image statistics non-parametrically, this representation has the flexibility to compactly fit the distribution of almost any patch of a natural image.
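Equation (3.1) with the Gaussian kernel can be evaluated directly for one pixel; a numpy sketch under the same notation (the function and argument names are illustrative):

```python
import numpy as np

def local_distribution(f_patch, w_patch, grid, lam=50.0):
    """Evaluate h(x, g) of Eq. (3.1) on a grid of intensity values g, for one
    pixel whose neighbourhood intensities are `f_patch` and whose weights
    w(x, y) (spatial nearness times feature affinity) are `w_patch`."""
    f = np.asarray(f_patch, dtype=float).ravel()
    w = np.asarray(w_patch, dtype=float).ravel()
    # Gaussian kernel phi(g, f_y; lambda) = sqrt(lambda/(2 pi)) exp(-(lambda/2)(g - f_y)^2)
    phi = np.sqrt(lam / (2 * np.pi)) * np.exp(-0.5 * lam * (grid[:, None] - f[None, :]) ** 2)
    # weighted sum over the neighbourhood, divided by Z(x) = sum of weights
    return (phi * w[None, :]).sum(axis=1) / w.sum()
```

Each structure in the patch contributes one mode of the returned density, as in Figure 3.1(b).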
Generally speaking, a small patch of a natural image does not contain a large number of distinct structures, so the local distribution is generally sparse. As shown in Figure 3.1(b), the multi-modal distribution depicts a small number of distinct structures in one pixel's local neighborhood. Since each mode represents a subpopulation of intensities that one structure may possess, a number of structure-preserving operations on a patch can be conducted by analyzing and manipulating its distribution.
For instance, a variety of structure-preserving image smoothers are related to this non-parametric description. The weighted median filter (WMed) [4] outputs the median f_med for which the cumulative distribution of h(x, ·) equals 0.5. The weighted mode filter (WMod) [5] seeks the maximum mode of h(x, f). Moreover, the widely popular bilateral filter [26] estimates the mean of h(x, ·) [58].
3.3.2 Correlations across Local Structures
Figure 3.1(a) shows four patches A, B, C and D extracted from a natural grayscale image; their histograms, fitted by the kernel regression (3.1), are shown in Figure 3.1(b). Even though these patches lie at different locations, they actually share similar structures. For example, patches A and B share the same structure #4, referring to the "white lighthouse". Structure #3 represents the "cloud" and occurs in patches A and C. Likewise, #2 indicates the "sky" and is shared by patches B, C and D. Observing the similarity between the distributions generated for these patches, we notably find that two pixels x and y in a local neighborhood share similar responses to each structure, since the structures change subtly over a small neighborhood. To construct a coherent representation that accounts for both the local and global statistics of structures, the global consistencies or correlations among structures should be taken into account. We propose a parametric approach that explicitly represents the spatially varying structures as a series of low-dimensional manifolds [31], and formulates the weighted distribution by Gaussian mixture models with weights adjusted by guided feature maps. Therefore, we can utilize the local image statistics while constraining them with the global correlation among local structures, which enables a simple and effective image/video cue for various structure-preserving applications.
3.3.3 Complexity of the Local Statistics Estimation
Common local image statistics are the mean, mode and median of the weighted local distributions. However, as discussed in Section 3.3.1, although the calculation of the mean value is trivial (a bilateral filtering operation), the estimation of the mode or median carries a high computational budget. The approximated probability distribution is directly involved in the weighted median filter and the weighted mode filter, since they replace the value of a pixel by the median or the global mode of h(x, ·). The median is usually estimated by tracing the cumulative distribution [3]:
C(x, g) = ∫_{−∞}^{g} h(x, g′) dg′ = (1/Z(x)) Σ_{y ∈ Ωx} w(x, y) · ∫_{−∞}^{g} φ_x(g′, f_y) dg′    (3.2)
until it meets 0.5. Because it involves a high dimensional filtering operation in estimat-
ing C(x, g) at each g, too many samples of g will bring about heavy computational cost.
On the other hand, typical ways to find the mode are fixed-point iteration [56] or
sampling via a look-up table with interpolation [3]. The key element in either method
is the gradient of h(x, g),

∂h(x, g)/∂g |_{g=ḡ} = (1/Z(x)) ∑_{y∈Ω_x} w(x, y) ∂φ_x(g, f_y)/∂g |_{g=ḡ},   (3.3)

which is likewise the output of a filtering operation. A similar problem occurs, since the
number of filtering operations depends on the number of iterations until convergence or
on the sampling density of the look-up table.
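To make this cost concrete, the following Python sketch (a toy 1-D stand-in, not the thesis's MATLAB implementation: a box spatial weight and a Gaussian data kernel are simplifying assumptions) builds h(x, ·) on a dense grid of g and traces the cumulative distribution until it reaches 0.5, spending one window sum per sample of g:

```python
import math

def brute_force_weighted_median(f, x, radius=3, sigma=0.1, n_bins=256):
    """Weighted median at pixel x (cf. Eq. 3.2): build h(x, g) on n_bins
    samples of g, then trace the cumulative distribution until 0.5.
    The cost is O(n_bins * |Omega_x|): one window sum per sample of g."""
    ys = range(max(0, x - radius), min(len(f), x + radius + 1))
    gs = [k / (n_bins - 1) for k in range(n_bins)]
    # one "filtering operation" (sum over the window) for every sample of g
    h = [sum(math.exp(-0.5 * ((g - f[y]) / sigma) ** 2) for y in ys) for g in gs]
    Z = sum(h)
    c = 0.0
    for g, hv in zip(gs, h):
        c += hv / Z
        if c >= 0.5:
            return g
    return gs[-1]

signal = [0.1] * 8 + [0.9] * 8  # a step edge
m_left = brute_force_weighted_median(signal, 4)    # window inside the 0.1 region
m_right = brute_force_weighted_median(signal, 12)  # window inside the 0.9 region
```

Even for this toy example, 256 window sums are spent per pixel; the kernel proposed in the next section reduces this to a handful of filtering passes shared by all values of g.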
To address this issue, in the following sections we define a novel separable kernel as a
weighted combination of a series of probabilistic generative models, which decreases the
number of filtering operations required to represent the distribution, and we exploit
constant-time filters [32; 59] to reduce the complexity of each filtering operation.
3.4 Accelerating the Distribution Estimation
In this chapter, we propose a novel approach to approximate the probability distribution
by defining a new kernel based on a series of probabilistic generative models. The kernel
can be factorized explicitly, so the filtering operations can be extracted in advance,
before the distribution construction. With the proposed kernel, we introduce accelerated
versions of the weighted mode filter and the weighted median filter. We will show later
that they deliver excellent quality and efficiency in various applications.
3.4.1 Kernel Definition
CHAP. 3. WEIGHTED STRUCTURE FILTERS BASED ON PARAMETRIC STRUCTURAL DECOMPOSITION

Assume the input image is modeled by several (say, L) models over the whole pixel
domain, each governed by a distribution p(η_x | l), l ∈ L = {1, 2, . . . , L}, at each pixel x.

Figure 3.2: Illustration of the proposed kernel. (a) shows a 1D signal and two pixels x
and y. (b) represents the construction of κ(f_x, f_y), where the mean values of three
models are shown in three different colors. The kernel measures the similarity of f_x and
f_y by evaluating the sum of their joint likelihoods w.r.t. each model.

These models act as prior knowledge representing distinct local structures in the input
image. Two pixels x and y are similar if both have a high probability of agreeing with
the lth model (see Figure 3.2), which gives the kernel

κ_l(f_x, f_y) = p_x(f_x | l) p_y(f_y | l)   (3.4)
             = ∫_{η_x∈H_x} p(f_x | η_x) p(η_x | l) dη_x · ∫_{η_y∈H_y} p(f_y | η_y) p(η_y | l) dη_y,   (3.5)

where p(f_x | η_x) is the data likelihood, and H_x and H_y are the domains of η_x and η_y,
respectively.
When all L models are available, the overall kernel is defined as their weighted
combination:

κ(f_x, f_y) = ∑_{l=1}^{L} κ_l(f_x, f_y) p_{x,y}(l) = (1/L) ∑_{l=1}^{L} p_x(f_x | l) p_y(f_y | l),   (3.6)

where the prior p_{x,y}(l) is taken as uniform. By the Cauchy–Schwarz inequality,
κ(f_x, f_y) achieves its maximum value when the likelihood vectors
p(f_x | ·) = [p(f_x | l=1), . . . , p(f_x | l=L)]^⊤ and
p(f_y | ·) = [p(f_y | l=1), . . . , p(f_y | l=L)]^⊤ are linearly dependent. Hence, similar
likelihoods p(f_x | ·) and p(f_y | ·) with respect to each model indicate that f_x and f_y
are similar under the proposed kernel, as suggested in Section 3.3.2.
What’s more, we can prove that κ(fx, fy) is a valid kernel since it is the inner
product of the feature vectors p(fx|l) and p(fy|l), which act as the non-linear mapping
from f onto the feature space defined by the L models. Not only that, it is able to
reliably approximate some popular kernels like Gaussian kernel [31] or Kronecker delta
kernel [4]2.
2Please refer to the appendix for a detailed derivation
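To illustrate the inner-product view, a small Python sketch (the three 1-D Gaussian models and the test values are illustrative assumptions, not part of the thesis) evaluates the kernel as a dot product of likelihood vectors and checks symmetry and discrimination:

```python
import math

def likelihoods(f, mus, sigma=0.1):
    """Feature vector p(f | l) under L hypothetical 1-D Gaussian models."""
    return [math.exp(-0.5 * ((f - mu) / sigma) ** 2) for mu in mus]

def kappa(fx, fy, mus, sigma=0.1):
    """Proposed kernel: inner product of the two likelihood vectors (cf. Eq. 3.6)."""
    px, py = likelihoods(fx, mus, sigma), likelihoods(fy, mus, sigma)
    return sum(a * b for a, b in zip(px, py)) / len(mus)

mus = [0.0, 0.5, 1.0]                  # assumed model means
k_similar = kappa(0.45, 0.55, mus)     # both respond to the middle model
k_dissimilar = kappa(0.05, 0.95, mus)  # respond to different models
k_sym = kappa(0.55, 0.45, mus)         # symmetry check
```

Symmetry and positive semi-definiteness follow directly from the inner-product form; values agreeing with the same model score far higher than values agreeing with different models.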
3.4.2 Probability Distribution Approximation
The approximated distribution ĥ(x, g) is obtained from Equation (3.1) by replacing
φ_x(g, f_y) with the proposed kernel:

ĥ(x, g) ∝ ∑_{y∈Ω_x} w(x, y) ∑_{l=1}^{L} p_x(g | l) p_y(f_y | l) = ∑_{l=1}^{L} p_x(g | l) · ψ_x(l).   (3.7)

The filtering operation ψ_x(l) = ∑_{y∈Ω_x} w(x, y) p_y(f_y | l) is independent of g, so the
approximated distribution becomes a mixture of L densities. Instead of filtering
φ_x(g, f_y) anew for each g to obtain h(x, g), the proposed method precomputes ψ_x(l)
with merely L filtering operations in total and then evaluates ĥ(x, g) given the priors
p(g | l). The proposed kernel thus approximates the distribution by extracting the
filtering operations that are independent of g, reducing the complexity of the
distribution construction.
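The factorization in Equation (3.7) is exact: swapping the two sums lets ψ_x(l) be computed once and reused for every g. A minimal Python check (toy 1-D patch, box weights, two assumed Gaussian models; not the thesis implementation):

```python
import math

def N(v, mu, var):
    """1-D Gaussian density."""
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

f = [0.1, 0.12, 0.9, 0.88, 0.11]     # toy patch
w = [1.0] * len(f)                   # box weights w(x, y) for one pixel x
models = [(0.1, 0.01), (0.9, 0.01)]  # assumed (mu_l, var_l), L = 2

# L filtering operations, independent of g (Eq. 3.7)
psi = [sum(w[y] * N(f[y], mu, var) for y in range(len(f))) for mu, var in models]

def h_fast(g):
    """Mixture evaluation reusing the precomputed psi."""
    return sum(N(g, mu, var) * p for (mu, var), p in zip(models, psi))

def h_direct(g):
    """Naive route: one filtering pass for every queried g."""
    return sum(w[y] * sum(N(g, mu, var) * N(f[y], mu, var) for mu, var in models)
               for y in range(len(f)))
```

The two routes agree to machine precision for any g, but `h_fast` amortizes the window sums over all queries.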
The cumulative distribution is hence C(x, g) ∝ ∑_{l=1}^{L} ψ_x(l) ∫_{−∞}^{g} p(g′ | l) dg′,
and the gradient is ∂ĥ(x, g)/∂g |_{g=ḡ} ∝ ∑_{l=1}^{L} ψ_x(l) ∂p(g | l)/∂g |_{g=ḡ}. Neither
requires filtering operations beyond those for ψ_x(l), which makes it possible to
accelerate the weighted median and mode filters.
Relationship with the Constant Time Weighted Median Filter [4] (CT-median)

Let the L models be equally quantized levels μ_l, l ∈ L, of the intensity space, and set
p(η_x | l) = δ(η_x − μ_l) and p(f_x | η_x) = δ(f_x − η_x). The distribution becomes
ĥ(x, g) ∝ ∑_{l=1}^{L} δ(g − μ_l) · ∑_{y∈Ω_x} w(x, y) δ(f_y − μ_l), which is exactly the
form introduced in CT-median.
Relation with the Bilateral Weighted Mode Filter [5] (BF-mode)

As in the CT-median setup, the L models are equally quantized levels μ_l, l ∈ L, but we
set p(η_x | l) = N(η_x | μ_l, Σ_n) and p(f_x | η_x) = δ(f_x − η_x), where Σ_n is the data
variance. The estimated distribution is then ĥ(x, g) ∝ ∑_{l=1}^{L} N(g | μ_l, Σ_n) ψ_x(l),
where ψ_x(l) = ∑_{y∈Ω_x} w(x, y) N(f_y | μ_l, Σ_n). The histogram exploited in BF-mode,
however, is h_{BF-mode}(x, g) ∝ ∑_{l=1}^{L} δ(g − μ_l) ψ_x(l). The two share the same
coefficients ψ_x(l), but the proposed distribution employs the Gaussian kernel instead of
the Kronecker delta kernel used in BF-mode.

Figure 3.3: Locally adaptive models (LAM) vs. uniformly quantized models (UQM). A 1D
signal is extracted from a gray-scale image shown in the left column and marked in orange.
Both the LAM and UQM models (L = 3) are used to represent the signal, as shown in the
right column: the top row uses UQM, the bottom row LAM. The LAM models adapt to the
local structures and represent the signal better with a limited number of models (e.g., L = 3).
3.4.3 Gaussian Model for the Proposed Kernel
An essential element of the proposed kernel is to determine and estimate the mod-
els as the priors to represent the input image. In particular, we apply the Gaussian
distribution to define these models for its convenience and efficiency in various image
processing applications.
Locally Adaptive Models
A simple strategy to define the models is to equally quantize the domain of f, named
Uniformly Quantized Models (UQM). The mean of each model is a quantization level μ_l,
and the diagonal elements of Σ_l are set to the square of half the quantization interval.
For a multi-dimensional image, each channel shares the same process. Specifically,
μ_x^l = μ_l and Σ_x^l = Σ_l for all x. UQM can represent cartoon-style images and
disparity maps from fronto-parallel stereo well. However, more quantization levels are
required to represent a complex local structure with sufficient accuracy, as shown in
Figure 3.3.
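A minimal sketch of UQM construction for a normalized single-channel domain (the half-interval variance follows the text; placing each level at its bin center is an assumption of this sketch):

```python
def uqm_models(L):
    """Uniformly quantized models: equally spaced means over [0, 1],
    variance set to the square of half the quantization interval."""
    step = 1.0 / L
    half = step / 2.0
    return [((l + 0.5) * step, half ** 2) for l in range(L)]

models = uqm_models(4)  # means 0.125, 0.375, 0.625, 0.875; shared variance
```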
Locally adaptive models (LAM) are a superior choice, since they describe the local
structures with fewer models. The underlying idea is to assume a Gaussian mixture model
in any local patch, where each model acts as a local mean estimator. Therefore, the
number of models only needs to slightly exceed the number of modes in the local
distribution. For example, the natural image in Figure 3.3 is well represented by the
LAM models, whereas the UQM models cannot fit the local distribution when their number
is insufficient.
The popular EM algorithm [64] is not used to train the LAM models, due to its high
complexity and its instability in ensuring a good estimation. In this chapter, we adopt a
more efficient alternative. Similarly to [31], we use a hierarchical segmentation approach
to iteratively separate pixels of distinct structures, which act as local clusters, into
different models, denoted by the segments S_l, l ∈ L. This method involves only simple
low-pass filtering and fast PCA operations, and is thus efficient to implement [31]. The
mean and variance of each pixel x for the lth model are

μ_x^l = (1/W_x^l) ∑_{y∈Ω_x} θ_y^l f_y,   (3.8)
Σ_x^l = (1/W_x^l) ∑_{y∈Ω_x} θ_y^l f_y f_y^⊤ − μ_x^l (μ_x^l)^⊤,   (3.9)

where θ_y^l = 1[y∈S_l] is the mask indicating pixels inside S_l, and 1[·] is the indicator
function that equals 1 when its argument is true. The neighborhood Ω_x is the same local
window as in Equation (3.7), and W_x^l = ∑_{y∈Ω_x} θ_y^l is the normalization factor.
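A toy sketch of Equations (3.8)–(3.9) in the 1-D, single-channel case (the segmentation labels below are hand-made stand-ins for the hierarchical segmentation of [31]):

```python
def lam_stats(f, seg, l, x, radius):
    """Masked mean / variance of model l at pixel x (Eqs. 3.8-3.9, 1-D case)."""
    ys = [y for y in range(max(0, x - radius), min(len(f), x + radius + 1))
          if seg[y] == l]
    W = len(ys)                        # W_x^l: sum of the 0/1 mask theta_y^l
    if W == 0:
        return None, None
    mu = sum(f[y] for y in ys) / W
    var = sum(f[y] ** 2 for y in ys) / W - mu ** 2
    return mu, var

f   = [0.10, 0.12, 0.11, 0.90, 0.88]
seg = [0, 0, 0, 1, 1]                  # toy segment labels S_l
mu0, var0 = lam_stats(f, seg, 0, x=2, radius=2)  # model 0 around pixel 2
mu1, var1 = lam_stats(f, seg, 1, x=2, radius=2)  # model 1 around pixel 2
```

Each model's mean tracks its own cluster (here ≈0.11 and ≈0.89), so two adaptive models suffice where a UQM quantization of the same accuracy would need many levels.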
Kernel Specification
The prior probability of the lth model is p(η_x | l) = N(η_x | μ_x^l, Σ_x^l). Assume the
data likelihood p(f_x | η_x) = N(f_x | η_x, Σ_n), where Σ_n = σ_n² I_d denotes the noise
variance, I_d is the identity matrix, and d is the number of channels of the input image.
The kernel κ(f_x, f_y) is accordingly

κ(f_x, f_y) = (1/L) ∑_{l=1}^{L} N(f_x | μ_x^l, Σ_n + Σ_x^l) N(f_y | μ_y^l, Σ_n + Σ_y^l).   (3.10)
Distribution Approximation
The approximated probability distribution at each pixel x is

ĥ(x, g) = (1/Z(x)) ∑_{l=1}^{L} N(g | μ_x^l, Σ_n + Σ_x^l) ψ_x(l),   (3.11)

where ψ_x(l) = ∑_{y∈Ω_x} w(x, y) N(f_y | μ_y^l, Σ_n + Σ_y^l) and Z(x) = ∑_{l=1}^{L} ψ_x(l).
The coefficients ψ_x(l), l ∈ L, are estimated by filtering N(f_y | μ_y^l, Σ_n + Σ_y^l)
according to the properties of w(x, y). This weight defines a joint filtering guided by
the guidance image. In this chapter, we choose two kinds of filters: the guided filter
(GF) [59] and the domain-transform filter (DF) [32]. Both have O(1) complexity and
approximate the bilateral weight. GF is better at transferring local structures from the
guidance feature map to the target image, while DF handles higher-dimensional images
naturally. Different applications exploit different weights. We denote the parameters of
the filtering operation by ω: ω = {r, ε} for GF, where r is the spatial radius and ε the
fitting variance; ω = {σ_s, σ_r} for DF, where σ_s is the spatial standard deviation and
σ_r the range standard deviation.

The overall algorithm for the accelerated distribution approximation based on the
locally adaptive models is summarized in Algorithm 1.
Algorithm 1: Distribution Approximation Acceleration for the Locally Adaptive Models

Input: input image F_i, guidance image F_g, parameter set {L_th, r, σ_n, ω}
Output: approximated distribution ĥ(x, g)

// 1. Model generation
1: {S_l | l ∈ L} ← hierarchical segmentation [31] of F_i given L_th and r, σ_n
2: for l ← 1 to L do
3:   θ_y^l = 1[y∈S_l], W_x^l = ∑_{y∈Ω_x} θ_y^l
4:   μ_x^l ← (1/W_x^l) ∑_{y∈Ω_x} θ_y^l f_y,  Σ_x^l ← (1/W_x^l) ∑_{y∈Ω_x} θ_y^l f_y f_y^⊤ − μ_x^l (μ_x^l)^⊤
5:   M_l ← {μ_x^l, Σ_x^l | ∀x}  // model parameters
// 2. Distribution approximation
6: ψ_x(l) ← ∑_{y∈Ω_x} w(x, y) N(f_y | μ_y^l, Σ_n + Σ_y^l),  ψ_x(l) ← ψ_x(l) / ∑_{l=1}^{L} ψ_x(l)
7: ĥ(x, g) ← ∑_{l=1}^{L} N(g | μ_x^l, σ_n² I_d + Σ_x^l) ψ_x(l)
Figure 3.4: h(x, g) and the proposed approximation ĥ(x, g) for patches C and D (from the
image shown in Figure 3.1) under different conditions. The window size is
|N(x)| = 11 × 11 and only the spatial weights are exploited. (a) h(x, g) estimated by the
smoothed local histogram [3] under different data variances, σ_n = 10⁻¹, 10⁻², 10⁻³.
(b) ĥ(x, g) estimated by the proposed kernel under the same data variances, with L = 31.
(c) ĥ(x, g) under different numbers of models L ∈ {7, 15, 31, 63}, with the data variance
fixed at σ_n = 10⁻². The y-axis is rescaled to show the subtle differences between curves.
Parameters
The proposed kernel needs two parameters, σ_n and L. A larger σ_n suggests that fewer
LAM models are necessary, so as to reduce the overlap between different models, while a
smaller σ_n requires more models to cover all the available local structures. We therefore
adopt an automatic criterion [31] that stops generating LAM models once a high
percentage of pixels is close to at least one model; the closeness criterion is
‖f_x − μ_x^l‖_{Σ_n} ≤ 1. Together with a user-given threshold L_th, L is determined when
either the criterion or L_th is reached.
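The stopping test can be sketched as follows (toy 1-D values; with Σ_n = σ_n² I the Mahalanobis condition ‖f_x − μ_x^l‖_{Σ_n} ≤ 1 reduces to |f_x − μ| ≤ σ_n):

```python
def coverage(f, model_means, sigma_n):
    """Fraction of pixels within one noise standard deviation of at least
    one model mean, i.e. satisfying ||f_x - mu||_{Sigma_n} <= 1."""
    ok = sum(1 for v in f
             if any(abs(v - mu) <= sigma_n for mu in model_means))
    return ok / len(f)

f = [0.10, 0.11, 0.50, 0.90, 0.91]
c = coverage(f, [0.1, 0.9], sigma_n=0.05)  # the pixel at 0.5 is uncovered
# model generation would stop once c exceeds a high percentage, or L hits L_th
```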
Figure 3.4(a) shows h(x, g) obtained by the smoothed local histogram [3] under the same
window size but different values of σ_n; only spatial weights were adopted in w(x, y).
The larger σ_n is, the smoother the distribution. However, h(x, g) with a small
σ_n = 10⁻³ reports not only the structures but also the subtle texture variations,
whereas with σ_n = 10⁻¹ even modes that once referred to different structures are merged
into a common one. In contrast, since the proposed LAM models fit the local distribution
by the estimated local structures, the proposed approximation ĥ(x, g) systematically does
not record the textures sensitively: the ĥ(x, g) curves have similar shapes for
σ_n = 10⁻³ and 10⁻², although ĥ(x, g) with a large σ_n = 10⁻¹ also tends to merge
nearby modes, like h(x, g), as shown in Figure 3.4(b).

On the other hand, we estimated ĥ(x, g) under different numbers of models with a fixed
data variance σ_n² = 10⁻⁴, as shown in Figure 3.4(c). With a small L, ĥ(x, g) tries to
capture the main structure in the local window as much as possible but fails to extract
the detailed structures. With a large L, the added models describe the detailed
structures, and the distribution becomes more similar to h(x, g) under the same
configuration.

In summary, the proposed kernel prefers to describe the local structures rather than all
the information the local patch conveys. The parameters σ_n and L are complementary:
the more models there are, the more similar ĥ(x, g) is to h(x, g), but a large σ_n
discourages a large L, because too much overlap between different models destroys their
identifiability. Therefore, by incorporating the automatic stopping criterion and a manual
threshold L_th into the LAM model generation, the resulting distribution ĥ(x, g) is both
efficient and effective.
3.5 Accelerated Weighted Filters
In this section, we propose accelerated versions of the weighted median and mode filters
based on the kernel discussed previously. We will show later that they deliver excellent
quality and efficiency in various applications.
3.5.1 Weighted Average Filter
The weighted average filter estimates the mean of ĥ(x, g) at each pixel. The solution is
straightforward:

g_x^avg = E[ĥ(x, g)] = (1/Z(x)) ∑_{l=1}^{L} ψ_x(l) μ_x^l,   (3.12)

according to the properties of Gaussian mixture models [64].
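In code, Equation (3.12) is a one-liner over the mixture parameters (toy scalar ψ and means assumed for illustration):

```python
def weighted_average(psi, means):
    """Mean of the Gaussian mixture h(x, .): Z-normalized combination of
    the per-model means (Eq. 3.12, scalar case)."""
    Z = sum(psi)
    return sum(p * mu for p, mu in zip(psi, means)) / Z

g_avg = weighted_average([3.0, 1.0], [0.1, 0.9])  # pulled toward the heavier model
```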
This filter is closely related to the adaptive-manifolds filter (AM-average) [31], a fast
approximation of the bilateral filter [26]. AM-average computes the filter response at a
reduced set of sampling points and interpolates them to obtain the output image [31];
only a small number of low-cost filtering operations (equal to the number of sampling
points) is required, a similar idea to the proposed distribution approximation. However,
AM-average mimics the exponential range kernel in the weight by Gauss–Hermite
quadrature [31] given the sampling points. In contrast, our method can incorporate
various kernels (not only the bilateral one) as the weight and instead approximates the
data kernel φ_x(·, ·). The filter response of our method is a weighted combination of the
local structures; it therefore preserves local structures and behaves more like a robust
filter that suppresses outliers.
3.5.2 Weighted Median Filter
The weighted median filter finds the median of the given probability distribution. Since
the resulting distribution is a mixture of Gaussians, we propose an accelerated method
that evaluates the cumulative probability C(x, μ_x^l) only at the mean μ_x^l of each
model. The median is approximated by interpolating between the two adjacent cumulative
probabilities C(x, μ_x^k) ≤ 0.5 and C(x, μ_x^{k+1}) ≥ 0.5:

g_x^med ≈ [(0.5 − C(x, μ_x^k)) / (C(x, μ_x^{k+1}) − C(x, μ_x^k))] (μ_x^{k+1} − μ_x^k) + μ_x^k.   (3.13)

In practice, we find this method simple and effective. Note, however, that the median
should be tracked per channel for the UQM models.
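A scalar sketch of this interpolation (two assumed well-separated Gaussian models; the Gaussian CDF is evaluated with the error function):

```python
import math

def norm_cdf(g, mu, var):
    """CDF of a 1-D Gaussian via the error function."""
    return 0.5 * (1.0 + math.erf((g - mu) / math.sqrt(2.0 * var)))

def mixture_median(psi, models):
    """Approximate the median of h(x, .) = sum_l psi_l N(. | mu_l, var_l)
    by evaluating the mixture CDF only at the model means and linearly
    interpolating between the bracketing values (cf. Eq. 3.13)."""
    Z = sum(psi)
    order = sorted(range(len(models)), key=lambda l: models[l][0])
    mus = [models[l][0] for l in order]
    C = [sum(p * norm_cdf(m, mu, var) for p, (mu, var) in zip(psi, models)) / Z
         for m in mus]
    for k in range(len(C) - 1):
        if C[k] <= 0.5 <= C[k + 1]:
            t = (0.5 - C[k]) / (C[k + 1] - C[k])
            return mus[k] + t * (mus[k + 1] - mus[k])
    return mus[0] if C[0] > 0.5 else mus[-1]

models = [(0.1, 0.01), (0.9, 0.01)]      # assumed per-pixel models (mu_l, var_l)
m = mixture_median([3.0, 1.0], models)   # heavier mode pulls the median left
```

Only L CDF evaluations are needed per pixel, rather than one per intensity bin as in histogram-based median tracing.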
3.5.3 Weighted Mode Filter
The weighted mode filter finds the global mode of ĥ(x, g). A simple fixed-point iteration
suffices for the proposed Gaussian models. Setting the gradient ∂ĥ(x, g)/∂g = 0 yields
the fixed-point iteration

g_x^{n+1} = ( ∑_{l=1}^{L} B_x^l(g_x^n) (Σ_n + Σ_x^l)⁻¹ )⁻¹ ( ∑_{l=1}^{L} B_x^l(g_x^n) (Σ_n + Σ_x^l)⁻¹ μ_x^l ),   (3.14)

where B_x^l(g_x^n) = N(g_x^n | μ_x^l, Σ_n + Σ_x^l) ψ_x(l). Equation (3.14) converges to
the closest mode, so a good initialization g_x^0 is necessary to avoid being trapped in a
wrong local mode. In practice, setting g_x^0 = μ_x^{m*}, where
m* = argmax_m ∑_{l=1}^{L} B_x^l(μ_x^m), is both effective and reasonable.
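The iteration and its initialization reduce, in the scalar case, to a precision-weighted average; a minimal Python sketch (toy models assumed, not the thesis implementation):

```python
import math

def N(v, mu, var):
    """1-D Gaussian density."""
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_mode(psi, models, iters=10):
    """Fixed-point mode seeking (Eq. 3.14, scalar case): start from the mean
    of the strongest model, then iterate a precision-weighted average."""
    def B(g, l):
        mu, var = models[l]
        return N(g, mu, var) * psi[l]
    # g0 = mu_{m*}, with m* = argmax_m sum_l B(mu_m, l)
    g = max((mu for mu, _ in models),
            key=lambda m: sum(B(m, l) for l in range(len(models))))
    for _ in range(iters):
        num = sum(B(g, l) * models[l][0] / models[l][1] for l in range(len(models)))
        den = sum(B(g, l) / models[l][1] for l in range(len(models)))
        g = num / den
    return g

models = [(0.1, 0.01), (0.9, 0.01)]     # assumed per-pixel models (mu_l, var_l)
g_mode = mixture_mode([3.0, 1.0], models)  # converges to the stronger mode
```

Starting from the strongest model's mean makes the iteration land on the global rather than a spurious local mode.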
3.6 Experimental Results and Discussions
3.6.1 Implementation Notes
We implemented the proposed weighted mode filter and weighted median filter in MATLAB.
The reported results were measured on a 3.4 GHz Intel Core i7 processor with 16 GB RAM.
Parameter Definition
All input and guidance images were normalized to [0, 1] for convenience of parameter
definition. The data variance is Σ_n = σ_n² I_d, where σ_n is the noise standard
deviation, I_d is the identity matrix, and d is the dimension of the input image. The
guided filter (GF) and the domain-transform filter (DF) share the same parameter setting,
i.e., r = σ_s and ε = σ_r² (ω = {r, ε} for GF, ω = {σ_s, σ_r} for DF); r and σ_s are
measured in pixels. For fair comparison, the number of iterations in the weighted mode
filter was set to 10 in all experiments.
Number of Models
An automatic criterion [31] stops generating the LAM models when a high percentage of
pixels is close to at least one model; the closeness criterion is ‖f_x − μ_x^l‖_{Σ_n} ≤ 1.
Together with a user-given threshold L_th, the LAM model generation stops when either
the criterion or L_th is reached. The UQM models share the same threshold L_th, with no
automatic stopping criterion applied.

Figure 3.5: Execution-time comparison for the distribution construction w.r.t. the number
of models, showing the runtime ratios of LAM (ours), UQM (ours), and brute force. The
input is an 8-bit single-channel image and the guidance is a 3-channel image. The
reference method is brute force, traversing 256 discretized bins.
Compared Methods
We compared our proposed filters with two popular filters: the constant time weighted
median filter (CT-median) [4] and the bilateral weighted mode filter (BF-mode) [5].
The parameters of CT-median were given by the authors [4] and those of BF-mode
were optimized by exhaustive search. The number of bins in the reference methods was
fixed to 256 per-channel [3–5].
3.6.2 Performance Evaluation
Runtime Comparison
Figure 3.5 shows the execution-time comparison between our method and the brute-force
constant-time algorithm (cf. Equation (3.1)) with GF weights for constructing the
distribution. Both LAM and UQM models were evaluated, with related parameters fairly
configured. The y-axis is the ratio of the runtime of the proposed method to that of the
reference method, which uses 256 discretized bins; L was set manually without the
automatic stopping criterion. Both proposed variants take only a fraction of the reference
runtime, nearly proportional to the number of models. LAM spends slightly more time
because of the additional filtering operations in the model-generation step. Note that
even when L is around 50, the execution time of the proposed methods is only half that
of the reference.
The Number of Necessary LAM models
In fact, natural images, whether color images or disparity/depth maps, are locally
smooth, so there is little need to generate very many LAM models (e.g., more than 60)
to fit the local distribution. To validate this observation, we estimated the LAM models
for all color images in the published BSDS300 dataset [65] with the threshold L_th = 64
and examined the distribution of the necessary number of models. The automatic stopping
criterion was triggered when no less than 99.9% of pixels fulfilled the constraint in
Section 3.6.1.

Results are illustrated in Figure 3.6, where the left plot was obtained with a window
size of 21 × 21 (i.e., r = 10) and the right with 11 × 11 (i.e., r = 5);
Σ_n = 0.01 × I_3 in both cases. The majority of images required at most 50 models to
meet the criterion. Moreover, the smaller the window size, the fewer models are
necessary, which verifies the discussion in Section 3.4.1. Based on these results, we
conclude that, in general, the number of LAM models required for a natural image rarely
exceeds a certain value under a given window size. As a typical case, for window sizes
of 21 × 21 or smaller, we can safely set the threshold to L_th = 64, and the runtime of
the probability distribution construction is always less than half that of the
brute-force implementation, as shown in Figure 3.5.

In conclusion, the proposed method is generally 2–3× faster than the brute-force one for
gray-scale images, increasing to 6–9× for color images as the number of channels grows.
For disparity/depth maps and cartoon images, the number of necessary models can be
reduced even further because of their high structural homogeneity.
Figure 3.6: The distribution of the number of necessary locally adaptive models on the
BSDS300 dataset. Left: window size 21 × 21. Right: window size 11 × 11. The smaller the
window size, the fewer locally adaptive models are necessary.
Figure 3.7: Depth map enhancement on tsukuba. The first row shows, from left to right,
the raw input disparity map, the ground truth, and the results of CT-median [4]
(Err. 2.76) and BF-mode [5] (Err. 2.37). The second and third rows show disparity maps
obtained by the proposed weighted median filter (Err. 4.96, 3.96, 3.34) and weighted mode
filter (Err. 2.62, 2.36, 2.41) with L = 7, 15, and 31 models, respectively; the models
were generated by the LAM scheme. The error is the bad-pixel ratio with threshold 1. GF
weights were chosen and related parameters were fairly configured.
Figure 3.8: Results of the weighted mode filter with 7 models, comparing the LAM and
UQM models (L = 7) on two test disparity maps.
3.6.3 Applications
Depth Map Enhancement
Depth maps of low resolution and poor quality (e.g., with structural outliers, depth
holes, and noise) can be enhanced under the guidance of registered high-resolution
texture images [4; 5]. This is a popular and practical post-processing step for acquiring
visually plausible and highly accurate depth maps from various depth acquisition
techniques, such as stereo, ToF cameras, or Kinect. Two state-of-the-art approaches that
exploit the statistical information of the depth map are BF-mode [5] and CT-median [4].
Both our weighted mode filter and our weighted median filter achieve similar performance
at a much lower cost.
Figure 3.7 shows the results on the tsukuba disparity map. The raw input was generated
by simple box-filter aggregation [66] followed by a left-right check and hole filling.
LAM models were adopted for all these results, with the number of models fixed. A small
L (e.g., L = 7) prevents the LAM from defining enough models to cover all the local
structures, so the results tend to be slightly blurred or to contain incorrect values
compared with the reference methods. Fortunately, with a few more models the results
become stable and similar to the references. For instance, BF-mode in our implementation
required 15.09 s to process the tsukuba image, whereas the proposed weighted mode filter
with 31 LAM models cost only 5.23 s. Moreover, the bad-pixel ratio of the proposed method
(2.41) is similar to that of BF-mode (2.37), while its PSNR is higher (25.28 dB vs.
25.09 dB).

Although a small L of the LAM models cannot cover all the details of the input image,
it still outperforms the UQM models with the same L. As shown in Figure 3.8, with
L = 7 the LAM models captured more details of the two test disparity maps and produced
smoother outputs than the UQM models. The staircase artifact of the UQM models also
occurs in BF-mode and CT-median, since both are based on a discretized weighted
histogram: when the number of bins is insufficient, quantization artifacts appear around
smooth and slanted surfaces.
JPEG Artifact Removal
JPEG compression is a lossy compression scheme that usually introduces quantization
noise and block artifacts. CT-median has proven effective at eliminating these
compression artifacts in clip-art cartoon images [4]. However, since CT-median encourages
piecewise-constant intensities/colors, its drawback is apparent when processing natural
images.
As shown in Figure 3.9(b) and its zoomed-in patch, CT-median forces the eyes image into
several distinct layers, with pixels inside each layer nearly constant. In contrast,
exploiting the LAM models, our method produces a piecewise-smooth result, as shown in
Figure 3.9(c): not only is the compression artifact removed, but the structure of the
input image is preserved. The UQM models, unfortunately, perform slightly worse than
LAM, which is expected, since they also try to recover piecewise-constant colors. In
terms of runtime, both the LAM and UQM models spend only a small fraction of the
88.134 s that CT-median needs to obtain Figure 3.9(b): the LAM models, with L = 15,
Σ_n = 0.07² × I_3, and |N(x)| = 11 × 11, cost 16.74 s in total, while the UQM models,
also with L = 15, were slightly faster at 15.54 s.

Figure 3.9: JPEG compression artifact removal by the weighted median filter. (a) The
input degraded eyes image. (b) CT-median [4]. (c) The proposed weighted median filter
with the LAM models and (d) with the UQM models. The second row shows the corresponding
zoomed-in patches. DF weights were chosen and all related parameters were fairly
configured. Best viewed in the electronic version.
More Applications
We show two additional applications to indicate the potential of the proposed weighted
median and weighted mode filters. Figure 3.10 shows detail enhancement of a natural rock
image by the proposed weighted median filter under the LAM models; the result is
visually plausible without apparent artifacts. Figure 3.11 presents joint upsampling of a
low-resolution, noisy disparity map under the guidance of a registered high-resolution
image. Both proposed filters generate satisfactory results, but the result of the
weighted median filter tends to be smoother and introduces slight blurring, while that
of the weighted mode filter is sharper and contains a slight staircase artifact.
44CHAP. 3. WEIGHTED STRUCTURE FILTERS BASED ON PARAMETRIC STRUCTURAL
DECOMPOSITION
Figure 3.10: Detail enhancement by the proposed weighted median filter under the LAM
models. From left to right: the original rock image, after edge-preserving smoothing,
and the detail-enhanced image. GF weights were chosen.
Figure 3.11: Joint depth map upsampling (panels: ground truth, ours-median, ours-mode).
The input disparity map was 8× upsampled by the proposed weighted median filter and
weighted mode filter under the LAM models. The raw input disparity map is shown in the
top-left corner of the leftmost image. GF weights were chosen.
3.7 Summary
In this chapter, we proposed a novel distribution construction method for accelerating
the weighted median/mode filters by defining a new separable kernel based on
probabilistic generative models. Unlike traditional methods, which need a large number of
filtering operations to estimate a sufficiently accurate distribution, the proposed
approach requires only a small, finite number of filtering operations determined by the
structure of the input image. The accelerated weighted median and mode filters were then
introduced and applied to various tasks, including depth map enhancement, joint depth
upsampling, outlier removal, and detail enhancement.

As future work, the extension to video processing is interesting and meaningful. A more
robust and efficient way to estimate the locally adaptive models would be of great
benefit. Moreover, improving the efficiency of the median tracking and the mode seeking
could further accelerate the proposed filters.
Chapter 4
Temporal Enhancement based on Static
Structure
4.1 Introduction
In this chapter, we present a novel method to enhance a depth video both spatially and
temporally by addressing two aspects of the problem: 1) efficiently and effectively
enforcing temporal consistency where it is necessary, and 2) enabling online processing.
A common observation is that regions in one frame with different motion patterns (e.g.,
static, slowly moving, fast moving) belong to different objects or structures and require
different levels of temporal consistency. For instance, a static region needs a long-range
temporal enhancement to ensure that it stays static over a long duration, while dynamic
regions with slow or rapid motions expect short-term or no temporal consistency. However,
it is difficult to accurately enhance arbitrary and complex dynamic contents in the
temporal domain without apparent motion blur or depth distortion. Thus we propose an
intuitive compromise: we cancel the temporal enhancement in the dynamic region as long as
its spatial enhancement is sufficiently satisfactory, so that the necessary depth
variation is not distorted, while temporal artifacts there are less easily perceived than
those in the static region. Therefore, we aim at strengthening long-range temporal
consistency around the static region whilst maintaining necessary depth variation in the
dynamic content. To accurately separate the static and dynamic regions, we track online
and incrementally refine a probabilistic model called the static structure, which acts as
a medium to indicate the region that is static in the current frame. By fusing the static
region of the current frame into the static structure online with an efficient
variational fusion scheme, this structure implicitly gathers all the temporal data at and
before the current frame that belong to it. Substituting the static region by the
updated static structure thus makes it temporally consistent and
stable over a long range accordingly. Moreover, the method is suitable for online
processing of streaming depth videos (3D teleconferencing, 3DTV, etc.) without the need
to store large numbers of adjacent frames, and is thus memory- and computationally
efficient.
Overall, the temporally consistent depth video enhancement is performed in two
layers: 1) the static region of the input frame, which reveals the static structure, is
enhanced spatially and temporally by an online fusion technique that combines it with the
static structure, and 2) the dynamic content is enhanced spatially without temporal
smoothing. In addition to the aforementioned advantages, enhancing the static and
dynamic regions separately also effectively eliminates artifacts that frequently occur
in conventional depth video enhancement, such as blurring or unreliable depth
propagation across the boundaries between dynamic objects and static
objects/background. Furthermore, when the depth video contains severe holes, the
static structure can fill static holes convincingly and leave the remaining holes to be
filled by the dynamic content, so as to avoid inpainting artifacts. Since fully dynamic
depth videos usually have weak temporal consistency, our proposed algorithm degrades
to a spatial enhancement approach in that case, and does not force the enhanced depth
video to bear unnecessary temporal smoothness.
The rest of the chapter is organized as follows. Section 4.2 reviews existing work
on spatial and temporal depth video enhancement, as well as approaches to static scene
reconstruction, which is closely related to our formulation of the static structure. Sec-
tion 4.3 describes our proposed framework for online estimation of the static structure
and the approach to temporally consistent depth video enhancement. Experimental
results and discussions of our method can be found in Section 4.4. Discussions
of its limitations and applications are presented in Section 4.5. Concluding remarks
and a discussion of future work are given in Section 4.6.
4.2 Related Work
Spatial enhancement On the aspect of global optimization, the pioneering work was
done by Diebel et al. [11], who utilized a pixel-wise MRF model with the guidance of tex-
ture to denoise the depth map. Several augmented models were also proposed to handle
inpainting and super-resolution [12–16], with special choices of the data and smoothness
terms as well as additional regularization terms [16–24], enabling reasonable perfor-
mance even without texture information [16]. However, the high computational cost of these
methods hinders real-time applications. Another choice is high-dimensional filtering.
One variant is high-dimensional average filtering [1; 25; 27; 28; 30], whose weights are
defined by spatial nearness and feature proximity. The features can be texture/depth
intensities, patches [27; 31], or other user-defined quantities. The main problems here are
edge blurring and texture copying. Another variant uses the median of the depth
candidate histogram instead [4; 33], producing more robust results but suffering
from quantization error and slower speed. Weighted mode filtering [5; 34] instead
looks for the histogram's global mode, and exhibits similar artifacts. In addition, spatial
enhancement, especially super-resolution and inpainting, can be performed by patch
matching throughout the depth map, which achieves satisfactory visual results [35; 36]
but with high computational complexity.
Temporal enhancement Existing temporal enhancement approaches usually em-
ploy the guidance of temporal texture consistency, especially by fusing the previous
depth frame onto the current one according to the motion vectors estimated between
the corresponding adjacent color frames [1; 5]. However, neglecting the motion
component along the z-axis reduces the warping accuracy. 3D motion estimation is typ-
ically adopted to solve this problem [67–69]. Following these works, the temporal fusion
between the current and warped previous frames is usually based on weighted average
or weighted median filters, or on energy minimization [1; 5; 70; 71]. Therefore
the performance, on one hand, relies heavily on the accuracy of motion estimation,
which is difficult to guarantee. On the other hand, the temporal continuity is only
preserved among a few adjacent frames, which does not meet the demand of enforcing
long-range temporal consistency. To address this issue, Lang et al. [6] proposed to
filter, offline, the paths formed by all the pixels that correspond to the
motion of one scene point over time. It provides a practical and remarkable solution
to enhance a depth video with long-range temporal consistency both effectively and
efficiently. Our work is related to, but has essential differences from, the layer denoising
and completion proposed by Shen et al. [72], which trained background layer
models offline beforehand to label the foreground and background of the input depth frame,
without strengthening any temporal consistency. Conversely, our method estimates the
static structure in an online fashion, and there is no need for a series of depth frames
capturing a purely static scene. Moreover, the temporal consistency is maintained where
it is required. In addition, [72] only takes spatial enhancement into consideration.
Static scene reconstruction The static structure estimation is related to static
scene reconstruction by fusing a series of depth maps. A majority of these works are
offline methods [73–77] that fuse a set of depth maps into a single geomet-
ric structure, while the rest are online approaches that receive depth measurements
sequentially and incrementally estimate the current geometric structure. Offline meth-
ods always process a batch of depth frames together, so the complexity becomes
unbearable when the number of frames is large. One of the offline approaches, by Zit-
nick et al. [77], employed the consistency of both multi-view color and disparity,
which is analogous to our constraint of temporal consistency, to regularize the disparity
space distribution and thereby produce a refined disparity map. Most online methods
quantize the 3D space into grids [78–81] to reduce the memory and computational cost,
and are thus generally deficient in sub-grid accuracy; one family of approaches additionally
exploits a weighted sum of truncated signed distance functions (TSDF) [79; 80] over depth
measurements. However, this is sensitive to outliers and thus not robust for estimating a
static scene containing dynamic objects and heavy outliers. To robustly estimate a
static scene captured by noisy and cluttered data, researchers have proposed a
variety of measurement models with parameters describing the nature of the noise and
outliers. Several methods [78; 82] need parameters learned from ground-truth data
or tuned empirically. One successful model that requires fewer manually tuned
parameters is the generative model, which is able to derive the
noise and clutter characteristics from the input data. Vogiatzis et al. [83] proposed a
generative Gaussian-plus-uniform model that simultaneously infers the depth and out-
lier ratio per pixel using an efficient online variational scheme, which matches the clutter
characteristics of depth maps generated by stereo. Our static structure estimation is
similar in that it is an online generative model considering both noise and outliers, with a
special treatment of dynamic scenes.
Figure 4.1: Illustration of the static structure in comparison with the input depth frame. (a) The input depth frame (blue curve) lies on the captured scene; (b) the static structure (black curve). The depth sensor is above the captured scene. The static structure includes the static objects as well as the static background.
4.3 Approach
The static structure can be regarded as an intrinsic depth structure (and texture struc-
ture when the registered color video is available) underneath the captured scene¹, which
always lies on or behind the surface of the input depth frame. As shown in Figure 4.1,
any moving or foreground object stays in front of the static structure, whereas static
objects or the visible static background usually lie on it, i.e., the depth value of the
static structure at one pixel is always deeper than that of a dynamic object at the same
place. However, it differs from the "background" of a scene, because we focus more
on the "static" geometric structure than on the distance from the camera. Since
the temporal consistency around static or slowly moving regions is required to be
enforced, the notion of "static" is more useful than that of "background".
To handle artifacts like noise, outliers and holes, as well as complex dynamic con-
tents in the input depth frame, we propose a probabilistic generative mixture model
to describe the static structure together with the characteristics of noise and outliers (Sec-
tion 4.3.1). We also define an efficient layer assignment leveraging dense conditional
random fields to accurately label the input depth frame into dynamic and static regions
¹Within the scope of this chapter, we assume the target depth video is captured by a static depth sensor, hence the captured scene is static except for the dynamic objects. The enhancement of depth videos captured by moving cameras is a more general topic, which we leave to future work.
Figure 4.2: Flowchart of the overall framework of the proposed method for the estimation of the static structure and depth video enhancement. Please refer to the text for the detailed description.
(Section 4.3.4). For the sake of memory and computational efficiency, as well as the ability
to process streaming data, the static structure is updated online (Section 4.3.5) via
a variational approximation (Section 4.3.2) governed by a first-order Markov chain,
which effectively fuses the labeled static region of the current depth frame with the
previously estimated structure. It is further refined spatially to fill holes and regularize
the structure (Section 4.3.5). The updated static structure in turn substitutes the static
region of the input depth frame, resulting in a temporally consistent depth video en-
hancement (Section 4.3.6). The framework of the online static structure update scheme
and temporally consistent depth video enhancement is shown in the flowchart in
Figure 4.2.
Notation The data sequence is denoted as $\mathcal{S}$ and formed by a depth video $\mathcal{D} = \{D^t \mid t = 1, 2, \ldots, T\}$ as $\mathcal{S} = \mathcal{D}$, or by a pair of aligned depth plus color videos as $\mathcal{S} = \{\mathcal{D}, \mathcal{I}\}$, where $\mathcal{I} = \{I^t \mid t = 1, 2, \ldots, T\}$. The data in each frame is $S^t = D^t$ or $\{D^t, I^t\}$. The pixel location is denoted $x$, its depth value at time $t$ is $d_x^t$, and its corresponding color is $I_x^t$. The parameter set for the probabilistic model at each frame $t$ is denoted as $\mathcal{P}^{S,t}$, and $\mathcal{P}^{S,t}_x$ is defined for each pixel $x$, whose elements are defined in detail in the following sections.

Figure 4.3: Illustration of the three states of input depth measurements with respect to the static structure along one line of sight. The current static structure is the blue stick in the middle; decision boundaries are marked as blue dotted lines. The depth measurement $d$ is categorized into state-I when it lies around the static structure, state-F when it is in front of this structure, and state-B when it is far behind it.
4.3.1 A Probabilistic Generative Mixture Model
At the very beginning, we only consider the case where $\mathcal{S} = \mathcal{D}$. Denote the sequentially incoming depth samples of pixel $x$ at and before time $t$ as the set $\mathcal{D}_x^t = \{d_x^\tau \mid \tau = 1, 2, \ldots, t\}$. The depth value of the static structure at pixel $x$ is $Z_x$, whose noise is conveniently governed by a Gaussian distribution. We also propose two individual outlier distributions to describe the outliers in front of and behind the static structure, respectively. Hence, they not only describe the depth distribution but also provide evidence indicating the state to which the current depth sample belongs.
State Description
The three states Ψ = {I, F,B} are illustrated in Figure 4.3 and listed as follows.
State-I: Fitting the static structure If $d_x^t$ belongs to the static structure, we assume that it follows a Gaussian distribution centered at $Z_x$, $\mathcal{N}(d_x^t \mid Z_x, \xi_x^2)$, where $\xi_x$ denotes the noise standard deviation, predefined based on the systematic error of the depth sensor. For instance, the noise variance of Kinect is related to the depth, so it is appropriate to set $\xi_x$ depth-dependently.
State-F: Forward outliers On the other hand, depth measurements from moving objects or outliers in front of the static structure follow a clutter distribution $U_f(d_x^t \mid Z_x) = U_f \cdot \mathbf{1}[d_x^t < Z_x]$, where $\mathbf{1}[\cdot]$ is an indicator function that equals 1 when its argument is true, and 0 otherwise. This state is activated when $d_x^t$ is smaller than $Z_x$, and switched off when it is larger than $Z_x$. From this state we can infer not only outliers in front of the structure, but also dynamic objects at the given location.
State-B: Backward outliers Furthermore, it is possible that the input depth measurements are outliers lying behind the current estimate of the static structure. Another similar indicator distribution is introduced as $U_b(d_x^t \mid Z_x) = U_b \cdot \mathbf{1}[d_x^t > Z_x]$. It naturally represents outliers that have larger depth values than the given structure. Meanwhile, it provides a cue to infer whether the current static structure estimate is incorrect.
An additional hidden variable $\mathbf{m}_x = [m_x^I, m_x^F, m_x^B]^\top$ is introduced as the state indicator to represent these states, where $m_x^k \in \{0, 1\}$, $k \in \Psi$. Only one specific state has $m_x^k = 1$ while the rest are 0, thus $\sum_{k \in \Psi} m_x^k = 1$.
A Generative Model
The reason to introduce the generative model is that it can simulate the static structure
as well as its noise and outliers, so that even when there are no observed measurements at
the current frame (e.g., depth holes), we can still provide a reasonable static structure.
Moreover, given suitable parametric forms of these distributions, the generative model
can be estimated and refined online by updating the parameters with sequentially
incoming depth samples.
Likelihood Appending the state indicator $\mathbf{m}_x$, the likelihood of $d_x^t$ conditioned on $\mathbf{m}_x$ and the static structure $Z_x$ is a product of the distributions of the three states,
$p(d_x^t \mid \mathbf{m}_x, Z_x) = \mathcal{N}(d_x^t \mid Z_x, \xi_x^2)^{m_x^I}\, U_f(d_x^t \mid Z_x)^{m_x^F}\, U_b(d_x^t \mid Z_x)^{m_x^B}.$
It reduces to the distribution of one particular state by triggering the corresponding state indicator $m_x^k = 1$, $k \in \Psi$.
Prior Let the prior for $Z_x$ also be a Gaussian distribution with mean $\mu_x$ and standard deviation $\sigma_x$, written as $p(Z_x) = \mathcal{N}(Z_x \mid \mu_x, \sigma_x^2)$. Note that $\sigma_x$ is different from $\xi_x$, since it represents the possible range of the static structure rather than its noise level. The prior on the chance of activating one state is a categorical distribution $\mathrm{Cat}(\mathbf{m}_x \mid \boldsymbol{\omega}_x)$ [64], where $\boldsymbol{\omega}_x = [\omega_x^I, \omega_x^F, \omega_x^B]^\top$ with $\sum_{k \in \Psi} \omega_x^k = 1$ and $\omega_x^k \in (0, 1)$. This parameter reveals the propensity of each state in advance of the input depth samples. Finally, $\boldsymbol{\omega}_x$ is modeled by a Dirichlet distribution $p(\boldsymbol{\omega}_x) = \mathrm{Dir}(\boldsymbol{\omega}_x \mid \boldsymbol{\alpha}_x)$, where $\boldsymbol{\alpha}_x = [\alpha_x^I, \alpha_x^F, \alpha_x^B]^\top$, $\alpha_x^k \in \mathbb{R}^+$, and each $\alpha_x^k$ corresponds to $\omega_x^k$.
Posterior Two posteriors are essential for the static structure estimation. One is $p(Z_x, \boldsymbol{\omega}_x \mid \mathcal{D}_x^t)$, which jointly presents the depth distribution of the static structure and the population densities of the three states given the current and all previous depth frames. The other is the posterior of the state indicator, $p(\mathbf{m}_x \mid \mathcal{D}_x^t)$, which represents the possible states at the current frame. Based on the estimated posteriors, we can evaluate the most probable depth value of the static structure by calculating the expectation $\mathbb{E}_{p(Z_x \mid \mathcal{D}_x^t)}[Z_x]$. The reliability of the current estimation refers to $\mathbb{E}_{p(\boldsymbol{\omega}_x \mid \mathcal{D}_x^t)}[\omega_x^I]$: the larger the portion of input depth samples that agree with the model, the more reliable the estimation. The most probable state of $d_x^t$ is calculated straightforwardly as $\arg\max_{\mathbf{m}_x} p(\mathbf{m}_x \mid \mathcal{D}_x^t)$.
4.3.2 Variational Approximation
However, it is almost infeasible to solve these posteriors analytically, because $Z_x$ and $\boldsymbol{\omega}_x$ are not independent in $p(Z_x, \boldsymbol{\omega}_x \mid \mathcal{D}_x^t)$, and $p(Z_x \mid \mathcal{D}_x^t)$ and $p(\boldsymbol{\omega}_x \mid \mathcal{D}_x^t)$ no longer exactly follow Gaussian and Dirichlet distributions. Therefore, variational approximation [64] of the posteriors is introduced to provide sufficiently accurate approximate posteriors efficiently. It minimizes the Kullback-Leibler divergence between the approximate and the original posteriors. The variationally approximated posteriors are required to have the same parametric forms as the priors, so they also produce analytical approximations of $\mathbb{E}_{p(Z_x \mid \mathcal{D}_x^t)}[Z_x]$ and $\mathbb{E}_{p(\boldsymbol{\omega}_x \mid \mathcal{D}_x^t)}[\omega_x^I]$. The approximation starts by factorizing the posterior $p(Z_x, \boldsymbol{\omega}_x \mid \mathcal{D}_x^t)$ into the product of an independent Gaussian distribution $q^t(Z_x) = \mathcal{N}(Z_x \mid \mu_x^t, (\sigma_x^t)^2)$ and a Dirichlet distribution $q^t(\boldsymbol{\omega}_x) = \mathrm{Dir}(\boldsymbol{\omega}_x \mid \boldsymbol{\alpha}_x^t)$, as
$q^t(Z_x, \boldsymbol{\omega}_x) = q^t(Z_x)\, q^t(\boldsymbol{\omega}_x) \approx p(Z_x, \boldsymbol{\omega}_x \mid \mathcal{D}_x^t).$  (4.1)
Moreover, the exact estimation depends on all the previous depth samples $\mathcal{D}_x^t$; retaining many frames would incur unbearable complexity and memory requirements. We therefore adopt a first-order Markov chain in our framework to favor online estimation: the current posterior can be estimated from just the current likelihood and the posterior of the last frame, making the method memory- and computationally efficient. We reformulate the posterior as a sequential parameter
Figure 4.4: Variational approximation of the parameter set of the static structure for a 1D depth sequence with $T = 500$ frames. (a) The expected depth sequence of the static structure versus the raw depth sequence, where the ideal $Z_x = 50$. (b) The confidence interval of $Z_x^t$, centered at $\mu_x^t$ and spanning $\mu_x^t \pm 2\sigma_x^t$ with 95% confidence. (c) The evolution of the portions of the three states (the expected value of $\boldsymbol{\omega}_x$ at frame $t$, denoted $[\omega_x^{I,t}, \omega_x^{F,t}, \omega_x^{B,t}]$); the ideal portions are $\boldsymbol{\omega}_x = [0.89, 0.1, 0.01]$. (d) The estimated distribution $q^T(d_x \mid \mathcal{P}_x^{D,T})$ versus the normalized histogram of $\mathcal{D}_x^T$ at $T = 500$. The estimated depth of the static structure reaches the ideal value with only a few samples, and its confidence interval shrinks rapidly, meaning the uncertainty is reduced very fast. The portion of each state evolves with the raw depth sequence and matches its ideal value given enough depth samples. At $T = 500$, the estimated data distribution fits the data histogram compactly.
estimation problem
$q^t(Z_x, \boldsymbol{\omega}_x) \approx p(Z_x, \boldsymbol{\omega}_x \mid \mathcal{D}_x^t) \approx p(d_x^t \mid Z_x, \boldsymbol{\omega}_x)\, q^{t-1}(Z_x, \boldsymbol{\omega}_x) / q^t(d_x^t) = Q(Z_x, \boldsymbol{\omega}_x \mid d_x^t),$  (4.2)
where the parameters of the left-hand side are estimated by matching moments between the distributions on the left- and right-hand sides [64]. This considers only the current data samples and the previously estimated parameters to approximate the current parameters. We define the parameter set estimated at $t-1$ as $\mathcal{P}_x^{D,t-1} = \{\mu_x^{t-1}, \sigma_x^{t-1}, \boldsymbol{\alpha}_x^{t-1}\}$, while the required parameter set is $\mathcal{P}_x^{D,t}$. By matching the first and second moments between $Q(Z_x \mid d_x^t)$ and $q^t(Z_x)$, as well as those between $Q(\boldsymbol{\omega}_x \mid d_x^t)$ and $q^t(\boldsymbol{\omega}_x \mid d_x^t)$ [84], we obtain a closed-form solution for every parameter in $\mathcal{P}_x^{D,t}$. Please refer to the supplementary materials for the detailed derivations.

Hence, recalling the problem addressed in Section 4.3.1, the approximate posterior with respect to the state indicator $\mathbf{m}_x$ is $q^t(m_x^k = 1 \mid d_x^t)$, $k \in \Psi$, which is a suitable approximation of $p(\mathbf{m}_x \mid \mathcal{D}_x^t)$ and also has a closed-form solution.
Apart from that, the most probable depth value of the static structure at pixel $x$ and time $t$ is
$Z_x^t = \mathbb{E}_{p(Z_x \mid \mathcal{D}_x^t)}[Z_x] \approx \mu_x^t,$  (4.3)
and the reliability of the current estimate of the static structure is the expectation of $\omega_x^I$,
$r_x^t = \mathbb{E}_{p(\boldsymbol{\omega}_x \mid \mathcal{D}_x^t)}[\omega_x^I] \approx \alpha_x^{I,t} \big/ \textstyle\sum_{k \in \Psi} \alpha_x^{k,t}.$  (4.4)
As shown in Figure 4.4, an example of the variational approximation of the param-
eter set for a 1D depth sequence illustrates the potential of the proposed method to
capture the nature of the input depth sequence.
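The 1D behavior of Figure 4.4 can be mimicked with a simplified online update. The sketch below uses a responsibility-weighted conjugate update in the spirit of the Gaussian-plus-uniform model of Vogiatzis et al. [83]; it is not the exact moment-matching derivation (which is given in the supplementary materials), and all constants (`xi`, `U_f`, `U_b`, the prior values, the 10% outlier ratio) are illustrative.

```python
import math
import random

def update(params, d, xi=2.0, U_f=0.01, U_b=0.01):
    """One simplified online update of q(Z) q(omega) for a 1D depth stream.
    params = (mu, sigma2, alpha); a responsibility-weighted stand-in for
    the moment-matching step of Section 4.3.2, not the exact derivation."""
    mu, sigma2, alpha = params
    var = sigma2 + xi * xi                      # predictive variance of d
    g = math.exp(-0.5 * (d - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
    a_sum = sum(alpha)
    # Responsibilities of state-I / state-F / state-B for this sample.
    w = [alpha[0] / a_sum * g,
         alpha[1] / a_sum * (U_f if d < mu else 0.0),
         alpha[2] / a_sum * (U_b if d > mu else 0.0)]
    s = sum(w) or 1.0
    r = [wi / s for wi in w]
    # Conjugate Gaussian update, damped by the inlier responsibility r[0].
    k = r[0] * sigma2 / var
    mu, sigma2 = mu + k * (d - mu), sigma2 * (1.0 - k)
    # Dirichlet update: soft count of the inferred state, cf. eq. (4.4).
    alpha = [a + ri for a, ri in zip(alpha, r)]
    return mu, sigma2, alpha

random.seed(0)
params = (45.0, 100.0, [1.0, 1.0, 1.0])         # broad, uninformative prior
for _ in range(500):
    if random.random() < 0.1:                   # ~10% forward outliers
        d = random.uniform(0.0, 50.0)
    else:                                       # inliers around Z = 50
        d = random.gauss(50.0, 2.0)
    params = update(params, d)
mu, sigma2, alpha = params
reliability = alpha[0] / sum(alpha)             # estimate of omega_I
```

As in Figure 4.4, `mu` settles near the true structure depth, the variance shrinks quickly, and `reliability` approaches the inlier fraction.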
4.3.3 Improvement with Color Video
The above discussion only considers the estimation and update of the static structure
with the depth video. A more complete treatment is together with the registered color
video, in which case an improved probabilistic generative model can be formulated as
follows.
Prior We introduce another prior over $C_x$, the color value of the static structure at $x$, as $p(C_x) = \mathcal{N}(C_x \mid U_x, \Sigma_x)$, with two parameters: the mean $U_x$ and the variance $\Sigma_x$.
Likelihood The likelihood of the input depth and color samples $d_x^t$ and $I_x^t$ conditioned on $\mathbf{m}_x$, given $Z_x$ and $C_x$, is
$p(d_x^t, I_x^t \mid \mathbf{m}_x, Z_x, C_x) = U_f(d_x^t \mid Z_x)^{m_x^F}\, U_b(d_x^t \mid Z_x)^{m_x^B} \times \left[\mathcal{N}(d_x^t \mid Z_x, \xi_x^2)\, \mathcal{N}(I_x^t \mid C_x, \Xi_x)\right]^{m_x^I},$  (4.5)
where $\Xi_x$ denotes the variance matrix of the color noise. A step further, we have the likelihood of $d_x^t$ and $I_x^t$ conditioned on $Z_x$ and $C_x$ accordingly. This formulation improves the inference, since an input depth sample is assigned to the static structure only when both the depth and color samples agree with the previous model. Therefore, the risk of false estimation is reduced.
Posterior and variational approximation In a similar fashion to Section 4.3.2, we can derive the approximate posterior when a color video exists. The parameter set $\mathcal{P}_x^{S,t} = \{\mu_x^t, \sigma_x^t, U_x^t, \Sigma_x^t, \boldsymbol{\alpha}_x^t\}$, $\mathcal{S} = \{\mathcal{D}, \mathcal{I}\}$, can also be estimated online and analytically. Furthermore, the most probable depth $Z_x^t$ and color $C_x^t$ of the static structure are obtained from $\mu_x^t$ and $U_x^t$. The approximate posteriors $q^t(m_x^k \mid d_x^t, I_x^t)$, $k \in \Psi$, are derived accordingly.
4.3.4 Layer Assignment
In this section, we find the static region of the input depth frame so as to robustly update the model of the static structure, and we identify the dynamic region. Specifically, we label the input depth frame into three layers $\mathcal{L} = \{l_{\mathrm{iss}}, l_{\mathrm{dyn}}, l_{\mathrm{occ}}\}$:
• $l_{\mathrm{iss}}$: agrees with the estimated static structure;
• $l_{\mathrm{dyn}}$: belongs to a dynamic object in front of it; or
• $l_{\mathrm{occ}}$: refers to the once-occluded structure behind it.
The additional label $l_{\mathrm{occ}}$ is essential because regions belonging to the once-occluded structure do not fit the current model, yet they reveal the hidden structure behind the currently estimated static structure. It also indicates that the current estimation is biased in these regions, where the depth structure from the input depth frame $D^t$ would be a more reasonable substitute to rectify the previous estimation.
Figure 4.5: A toy example illustrating the layer assignment. The cyan dotted line indicates the currently estimated depth structure of the static structure, and the red solid line is from the input depth frame. If color frames are available, they provide additional constraints to regularize the assignment: the upper line corresponds to the currently estimated texture structure of the static structure, and the lower one refers to the input color frame.

One toy example is shown in Figure 4.5, where $D^t$ provides a different layout from the current static structure. Intuitively, $l_{\mathrm{occ}}$ occurs when the input depth frame provides
larger depth values and exposes the hidden static structure; $l_{\mathrm{dyn}}$, on the contrary, is encouraged by smaller depth values. Furthermore, inference failures due to depth holes, noise and outliers can be eliminated by introducing texture information, which also provides additional cues to regularize the spatial layout of the labels.
To improve the expressive power to label the complex structures frequently encountered in our case, we exploit a fully connected conditional random field (fully-connected CRF) [85] to strengthen long-range spatial relationships. Assume a random field $L = \{l_x \in \mathcal{L} \mid \forall x\}$ conditioned on the input data $S^t$ and the previous model parameter set $\mathcal{M} = \mathcal{P}^{S,t-1}$. The Gibbs energy of a label assignment $L$ is
$E(L \mid S^t, \mathcal{M}) = \sum_x \psi_u(l_x \mid S^t, \mathcal{M}) + \frac{1}{2} \sum_{x \neq y} \psi_p(l_x, l_y \mid S^t, \mathcal{M}),$  (4.6)
where $x$ and $y$ are pixel locations, $\psi_u(\cdot)$ and $\psi_p(\cdot,\cdot)$ denote the unary and pairwise potentials, and $S^t = D^t$ or $\{D^t, I^t\}$.
Definition of unary and pairwise potentials
We define the unary potentials and pairwise potentials as follows:
Unary potentials The unary potentials are the negative logarithms of the approximate posteriors $q^t(\mathbf{m}_x \mid S_x^t)$, indicating the chance that the current depth sample follows the previous estimation (i.e., $l_{\mathrm{iss}}$ requires $m_x^I = 1$), lies in front of it (i.e., $l_{\mathrm{dyn}}$ needs $m_x^F = 1$), or lies behind it (i.e., $l_{\mathrm{occ}}$ refers to $m_x^B = 1$). In detail, we have $\psi_u(l_x = l_k \mid S^t, \mathcal{M}) = -\ln q^t(m_x^k = 1 \mid S_x^t)$, where $l_k$ and $m_x^k$ follow the correspondences listed above.
Pairwise potentials The pairwise potential between pixels $x$ and $y$ is a weighted mixture of Gaussian kernels,
$\psi_p(l_x, l_y \mid S_x^t, \mathcal{M}_x) = \mathbf{1}[l_x \neq l_y] \cdot \left\{ w_s \exp\left(-\tau_\alpha \|x - y\|^2 / 2\right) + w_r \exp\left(-\|\Delta^t f_x - \Delta^t f_y\|_{\Sigma_\beta}^2 / 2 - \tau_\gamma \|x - y\|^2 / 2\right) \right\}.$  (4.7)
We define $\Delta^t f_x = f_x^{I,t-1} - f_x^t$ to measure the difference between the features of the static structure and those of the input data. When $S^t = D^t$, $f_x^t$ and $f_x^{I,t-1}$ are the normalized $d_x^t$ and $Z_x^{t-1}$, whitened by the overall variance $(\xi_x^t)^2 = (\sigma_x^{t-1})^2 + \xi_x^2$. If $S^t = \{D^t, I^t\}$, let $f_x^t$ and $f_x^{I,t-1}$ be the concatenations of the normalized vectors $[d_x^t; I_x^t]$ and $[Z_x^{t-1}; C_x^{t-1}]$, where the color features are normalized with the variance $\Xi_x^t = \Xi_x + \Sigma_x^{t-1}$.
The indicator function $\mathbf{1}[l_x \neq l_y]$ makes the pairwise potentials a Potts model: it penalizes nearby pixels that are assigned different labels despite having similar features. The first kernel is a smoothness kernel that removes small isolated regions and is adjusted by $\tau_\alpha$. The second kernel is a range kernel that encourages nearby pixels with similar depth and/or color variation to share the same label, with $\tau_\gamma$ setting the degree of nearness. $\|\Delta^t f_x - \Delta^t f_y\|_{\Sigma_\beta}^2$ is the Mahalanobis distance between $\Delta^t f_x$ and $\Delta^t f_y$, where the covariance matrix $\Sigma_\beta$ encodes the feature proximity. The weight of the range kernel is $w_r$. With only the range kernel the result tends to be noisy, while with only the smoothness kernel the structure cannot be well regularized.
Inference
We exploit an efficient mean-field inference method for fully-connected CRFs with Gaussian pairwise potentials [85]. It amounts to an iterative estimation process, each iteration involving several runs of real-time high-dimensional filtering characterized by the pairwise potentials (4.7).
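The mean-field update of [85] can be sketched in miniature. The fast algorithm evaluates the kernel sums by high-dimensional filtering; the brute-force $O(N^2)$ version below shows only the update rule, on arbitrary unary costs and kernel strengths supplied by the caller (not computed from a real frame).

```python
import math

def softmax(v):
    """Numerically stable softmax over a list of logits."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def mean_field(unary, kernel, n_iters=5):
    """Naive mean-field inference for a fully connected Potts CRF.
    unary[i][l]: -log posterior cost of label l at pixel i (cf. Sec. 4.3.4);
    kernel[i][j]: total pairwise kernel strength between pixels i and j."""
    n, L = len(unary), len(unary[0])
    Q = [softmax([-u for u in unary[i]]) for i in range(n)]
    for _ in range(n_iters):
        new_Q = []
        for i in range(n):
            logits = []
            for l in range(L):
                # Potts message: expected penalty from neighbors disagreeing.
                msg = sum(kernel[i][j] * (1.0 - Q[j][l])
                          for j in range(n) if j != i)
                logits.append(-unary[i][l] - msg)
            new_Q.append(softmax(logits))
        Q = new_Q
    return Q
```

With strong unaries on two pixels and a weak, contrary unary on a third pixel connected to them, the third pixel's marginal flips to agree with its neighbors, while an unconnected pixel keeps its own preference.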
4.3.5 Online Static Structure Update Scheme
The online static structure updating scheme is actually a sequential variational parameter estimation problem, with a layer assignment to exclude the dynamic objects and include the once-occluded static structure. A spatial enhancement is appended to regularize the spatial layout of the structure. The sketch of the algorithm is given in Algorithm 2.

Algorithm 2: Online Static Structure Update Scheme
Input: Data sequence $\mathcal{S} = \{S^\tau \mid \tau = 0, 1, 2, \ldots\}$; initial parameter set $\mathcal{P}^S_{\mathrm{init}}$
Output: Current parameter set $\mathcal{P}^{S,t}$
// initialization
1: $t \leftarrow 0$, $\mathcal{P}^{S,0} \leftarrow \mathrm{param\_init}(S^0, \mathcal{P}^S_{\mathrm{init}})$
2: while $\mathcal{S} \neq \emptyset$ do
3:   $t \leftarrow t + 1$
     // 1. layer assignment
4:   $\mathcal{M} \leftarrow \mathcal{P}^{S,t-1}$, $L \leftarrow \arg\min_L E(L \mid S^t, \mathcal{M})$
     // 2. parameter update
5:   for all $x$ do
6:     if $l_x = l_{\mathrm{iss}}$ then $\mathcal{P}^{S,t}_x \leftarrow \mathrm{vari\_approx}(S^t_x, \mathcal{P}^{S,t-1}_x)$
7:     else if $l_x = l_{\mathrm{occ}}$ then $\mathcal{P}^{S,t}_x \leftarrow \mathrm{param\_init}(S^t_x, \mathcal{P}^S_{\mathrm{init}})$
8:     else if $l_x = l_{\mathrm{dyn}}$ then $\mathcal{P}^{S,t}_x \leftarrow \mathcal{P}^{S,t-1}_x$
     // 3. spatial enhancement
9:   $Z^t_x \leftarrow \mu^t_x$, $\forall x$
10:  $Z^t \leftarrow \mathrm{spatial\_enhance}(Z^t, \mathcal{P}^{S,t})$, $\mu^t_x \leftarrow Z^t_x$, $\forall x$
An initialization of the parameter set $\mathcal{P}^S$ is necessary. We set the initial $\mu_x^0 = d_x^0$, where $d_x^0 \in D^0$ is from the first frame of the depth video. Similarly, let $U_x^0 = I_x^0$, where $I_x^0 \in I^0$ is from the color video. The noise parameters $\xi_x$ and $\Xi_x$ are user-specified constants, which should be large enough to accommodate sufficient variance of the input data. $\sigma_x^0$ and $\Sigma_x^0$ are initialized with large values as well. The parameters of $\boldsymbol{\omega}_x$ are also set up with given constants $\boldsymbol{\alpha}_x^0$; a convenient setup is $\alpha_x^{I,0} = \alpha_x^{F,0} = \alpha_x^{B,0}$. The user-given initialization parameter set is $\mathcal{P}^S_{\mathrm{init}} = \{\xi_x, \sigma_x^0, \boldsymbol{\alpha}_x^0 \mid \forall x\}$ when $\mathcal{S} = \mathcal{D}$, and $\mathcal{P}^S_{\mathrm{init}} = \{\xi_x, \sigma_x^0, \Xi_x, \Sigma_x^0, \boldsymbol{\alpha}_x^0 \mid \forall x\}$ when $\mathcal{S} = \{\mathcal{D}, \mathcal{I}\}$. In addition, the layer assignment is not applied in the initialization step.
At the $t$-th frame, the layer assignment is applied first, based on the previous parameter set $\mathcal{P}^{S,t-1}$ and the input data $S^t$. Regions where $l_x = l_{\mathrm{iss}}$ undergo the variational parameter estimation to obtain a renewed $\mathcal{P}^{S,t}_x$. If $l_x = l_{\mathrm{dyn}}$, the pixel belongs to a dynamic object, so $\mathcal{P}^{S,t}_x = \mathcal{P}^{S,t-1}_x$. On the other hand, if $l_x = l_{\mathrm{occ}}$, the parameter set of this pixel is re-initialized as in the initialization step, but with $\mu_x^t = d_x^t$ and $U_x^t = I_x^t$. Furthermore, it is a common phenomenon that the input depth frames contain holes without depth measurements; in this case, $\mu_x^t$ and $\lambda_x^t$ are not updated in these regions.
The spatial enhancement, including hole filling, smoothing and regularization, is necessary to generate a spatially refined static structure. It is performed after the parameter estimation in each frame, where we have obtained the most probable depth map $Z^t$ ($Z_x^t \in Z^t$). A variational inpainting method incorporating a TV-Huber norm and a data term based on the Mahalanobis distance with variance $(\xi_x^t)^2$ is employed for the spatial enhancement, which is iteratively solved by a primal-dual approach [16]. Since the solver requires hundreds of iterations to converge, a trade-off between speed and accuracy is adopted by fixing the number of iterations and using the spatially enhanced result of the last frame, $Z^{t-1}$, as the initialization. To reduce error propagation, unreliable pixels in the input depth map $Z^t$ are deleted according to the reliability check $r_x^t > 0.5$ (cf. equation (4.4)). Given the most probable color image $C^t$ of the current static structure, the spatial enhancement of $Z^t$ can absorb the texture information to guide the propagation of local structures. In the end, the enhanced depth map $Z_x^t$ substitutes $\mu_x^t$ in $\mathcal{P}^{S,t}_x$.
4.3.6 Temporally Consistent Depth Video Enhancement
Apart from spatial enhancement, it is preferred to employ temporal enhancement to
produce a flicker-free depth video. To enable long-range temporal consistency and allow
online processing, we exploit the static structure of the captured scene as a medium to
find the region in the input frame exhibiting long-range temporal connection. The static
region is enhanced by fusing the input depth measurements with the static structure
according to the online static structure update scheme in Section 4.3.5. Thus the static
regions are well-preserved and incrementally refined over time. The idea behind this
is that we enforce temporal consistency only around static regions
or slowly moving objects. This assumption is somewhat restrictive but still suitable
for processing typical depth videos. One additional advantage of the proposed method is
that it can prevent bleeding artifacts that propagate depth values from moving objects
into the static background as long as the layer assignment is robust.
Given the resulting layer assignment of the current frame, the static region is where
lx ∈ {liss, locc}, including the regions referring to the static structure and those belong-
ing to the once occluded static structure. They both expose the current visible static
structure of the captured scene, thus shall be enhanced separately from the dynamic
objects. The enhanced version is obtained by substituting it with its counterpart in
the static structure, which has already been updated in the temporal domain and en-
hanced in the spatial domain (see Section 4.3.5). The dynamic region can be enhanced
by various approaches explored in the literature, while in this chapter we exploit a con-
ventional joint bilateral filter, both to fill holes and to perform edge-preserving filtering
in the dynamic region.
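The composition of the enhanced frame can be sketched as below, assuming integer layer labels; the joint bilateral filtering of the dynamic region is left out, and the function name is illustrative.

```python
import numpy as np

L_ISS, L_DYN, L_OCC = 0, 1, 2  # layer labels: static, dynamic, occluded-static

def compose_enhanced_frame(labels, d_in, static_depth):
    """Static regions (l in {liss, locc}) take the updated static-structure
    depth; dynamic regions keep the input, to be filtered separately
    (e.g. by a joint bilateral filter)."""
    static_mask = (labels == L_ISS) | (labels == L_OCC)
    return np.where(static_mask, static_depth, d_in)
```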
The proposed method is both memory- and computationally efficient. The memory
it requires goes only to storing the parameter set of each pixel, so it can efficiently
process streaming videos or long high-quality sequences. Excepting
the cost of the spatial enhancement, the complexity for temporal enhancement hinges on
that of the online static structure update scheme, in which all the required parameters
have analytical solutions whilst the layer assignment is efficient thanks to the constant-
time implementations in solving the fully-connected CRF model. Provided with an
efficient spatial enhancement approach, for example, the domain transform filter [32]
or the proposed one with the help of multi-thread techniques or GPGPUs [86], the
entire temporally consistent depth video enhancement procedure can be achieved in
real-time.
4.4 Experiments and Discussions
In this section, we present our experiments on synthetic and real data to demonstrate
the effectiveness and robustness of our static structure estimation and depth video
enhancement.
Section 4.4.1 numerically evaluates the performance of our method for static struc-
ture estimation using synthetic depth videos2 generated from the Middlebury dataset [87;
88]. Our method is not sensitive to the user-given parameters, and outperforms various
static scene estimation methods with a running time comparable to temporal
median filtering.
2The depth of one pixel in the depth frame is proportional to the reciprocal of the disparity at the same place in the corresponding disparity frame.
Figure 4.6: Sample frames of the input depth video with two types of noise and outliers. (a) Reindeer: the sample color frame; (b) and (c) are the contaminated depth frames with σ_n = 2 and ω_n = 10−2, where (b) is type-I and (c) is type-II. Type-II error is worse than type-I error with the same parameters.
In Section 4.4.2, we evaluate the performance on real data captured by Kinect and
ToF cameras. Both static and dynamic indoor scenes are taken into consideration.
Apart from the estimation of static structure, we also evaluate the performance of
the static scene reconstruction and most importantly, the temporally consistent depth
video enhancement in Section 4.4.3.
Initial parameters are simply set as α^0_x = [1, 1, 1]^T, and σ^0_x is 10% of the depth range
of the input scene. The initial parameter Σ^0_x is a diagonal matrix in which each diagonal
entry is the square of 10% of the color range.
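A minimal sketch of this initialization, with hypothetical array shapes (one 3-vector α and one 3×3 Σ per pixel); the function name and argument names are our own.

```python
import numpy as np

def init_parameters(depth_range, color_range, shape):
    """Initialize the per-pixel parameter set: alpha^0 = [1,1,1]^T,
    sigma^0 = 10% of the depth range, Sigma^0 diagonal with entries
    (10% of the color range)^2."""
    h, w = shape
    alpha0 = np.ones((h, w, 3))
    sigma0 = np.full((h, w), 0.1 * depth_range)
    Sigma0 = np.broadcast_to(np.eye(3) * (0.1 * color_range) ** 2,
                             (h, w, 3, 3)).copy()
    return alpha0, sigma0, Sigma0
```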
4.4.1 Numerical Evaluation of the Static Structure Estimation By Synthesized Data
We used two types of noise and outliers, which are illustrated in Figure 4.6, to contam-
inate the depth video so that we could evaluate the performance of our method with
respect to different kinds of errors from different types of depth sensors.
Type-I: We contaminated the depth map via p(d_x | Z_x) = (1 − ω_n) N(d_x | Z_x, σ_n^2) +
ω_n U(d_x), where U(d_x) is the reciprocal of the depth range. It is a general model of
noise and outliers.
Type-II: We damaged the disparity map by p(d^disp_x | Z^disp_x) = (1 − ω_n) N(d^disp_x | Z^disp_x, σ_n^2) +
ω_n U(d^disp_x) and rounded it. The disparity map was then transformed into the depth map.
U(d^disp_x) is the reciprocal of the disparity range. This mimics the outliers in common
depth videos captured by stereo or Kinect.
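These two contamination models can be sketched as follows; the function names are our own, and clamping the rounded disparity to at least one (to avoid division by zero) is an assumption of this sketch.

```python
import numpy as np

def contaminate_type1(Z, sigma_n, omega_n, d_min, d_max, rng):
    """Type-I: Gaussian noise on the depth plus uniform outliers over the
    depth range [d_min, d_max]."""
    d = Z + sigma_n * rng.standard_normal(Z.shape)
    outlier = rng.random(Z.shape) < omega_n
    d[outlier] = rng.uniform(d_min, d_max, size=Z.shape)[outlier]
    return d

def contaminate_type2(Z, sigma_n, omega_n, f, B, rng):
    """Type-II: contaminate and round the disparity disp = f*B/Z, then
    convert back to depth; mimics stereo/Kinect quantization."""
    disp = f * B / Z
    d_disp = disp + sigma_n * rng.standard_normal(Z.shape)
    outlier = rng.random(Z.shape) < omega_n
    d_disp[outlier] = rng.uniform(disp.min(), disp.max(), size=Z.shape)[outlier]
    d_disp = np.maximum(np.rint(d_disp), 1.0)  # round; avoid divide-by-zero
    return f * B / d_disp
```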
[Figure 4.7 panels: RMSE (log scale, 1e0–1e3) over outlier parameter u ∈ [1e−5, 1e−1] and std parameter σ ∈ [0, 20]; (a) I : (10−3, 1), (b) I : (10−2, 2), (c) I : (10−1, 4), (d) II : (10−3, 1), (e) II : (10−2, 2), (f) II : (10−1, 4).]
Figure 4.7: RMSE maps with varying u and σ under different noise and outlier parameter pairs (ω_n, σ_n). (a)–(c) were contaminated by type-I noise, while (d)–(f) were contaminated by type-II.
[Figure 4.8 panels: RMSE (log scale) versus frame order (0–100). Optimal (u, σ) legend pairs: (a) (10−1, 4): (10−3.5, 20) and (10−3.3, 3.2); (b) (10−2, 2): (10−3.7, 20) and (10−3.7, 2.2); (c) (10−3, 1): (10−5, 20) and (10−4.8, 2.2).]
Figure 4.8: Performance comparisons between the constant and depth-dependent ξ_x under different type-II noise and outlier parameter pairs (ω_n, σ_n). The red curve uses the depth-dependent ξ_x, and the blue curve the constant ξ_x. Each curve is obtained at its own optimal parameter pair (u, σ), as shown in the legends.
Analysis of user-given parameters
We first evaluated the user-given parameters: the outlier parameters U_f, U_b and the
noise standard deviation ξ_x. In case-I, we set ξ_x = σ, a constant throughout the pixel
domain. For case-II, the choice of ξ_x should be able to dispose of the non-uniform
quantization error due to the disparity-depth conversion, so ξ_x = σ d_x^2/(fB).3 Meanwhile, we set
U_f = U_b = u. The experiments were evaluated by the RMSE score with varying u and σ
under different levels of noise (σ_n) and outliers (ω_n). The results are shown in Figure 4.7,
where the test video had 100 frames. We set σ ∈ [0, 20] and u ∈ [10−5, 10−1]. Notice
that the tested scene was static, so there was no need to perform layer assignment.
The spatial enhancement was also skipped.
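The depth-dependent standard deviation follows from differentiating d = fB/disp with respect to the disparity, |∂d/∂disp| = fB/disp² = d²/(fB), so unit-variance disparity noise scaled by σ maps to depth noise σd²/(fB). A one-line sketch (function name illustrative):

```python
def depth_dependent_std(depth, sigma, f, B):
    """xi_x = sigma * d_x^2 / (f*B): depth-noise std induced by disparity
    noise of std sigma, via the conversion d = f*B/disp."""
    return sigma * depth ** 2 / (f * B)
```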
The proposed method achieves satisfactory performance and is insensitive to ξ_x,
although a slightly larger ξ_x turns out to be more robust. On the other hand, we obtain low
RMSE scores when u is around or smaller than the reciprocal of the depth range (≤ 10−3
in the test depth videos). Although a smaller u can still achieve good performance, its
effective range tends to narrow as the noise level increases. In practice, setting U_f
and U_b to the reciprocal of the depth range is sufficient and convenient, since it
effectively means that outliers may occur uniformly inside the depth range.
In addition, the depth-dependent noise parameter ξ_x performs better than the constant
ξ_x in dealing with type-II error. As shown in Figure 4.8, comparisons of the results
obtained with the optimal parameter pairs (u, σ) of both cases4 reveal that a larger constant ξ_x is
required to capture the severer noise present at larger depth values, a property of
type-II error. Compared with the depth-dependent noise model, a constant ξ_x might be
sufficient for slightly noisy depth videos, as shown in Figure 4.8(c), but lacks the capability
to capture severe noise, as shown in Figure 4.8(a) and (b).
Comparison of synthetic static scenes
As some online 3D scene reconstruction methods can also successfully perform the
static scene estimation in an online fashion, we numerically compared several state-
of-the-art candidates, i.e., the truncated signed distance function (TSDF) [79; 80] in
KinectFusion, the temporal median filter (t-MF) and the generative model for depth
fusion (g-DF) [81], with our method. The grid number per pixel was set as 100, for
both TSDF and g-DF. The temporal window size of t-MF was 5 in our experiments.
As shown in Figure 4.9, our method, like all the others, tends to decrease the
RMSE progressively as more frames are included. However, our method is robust to the
3f is the focal length and B is the baseline, both of which are provided in the Middlebury dataset. The conversion relationship is derived in the supplementary materials.
4The optimal results were obtained by exhaustive search over 400 uniformly-sampled parameter pairs in the range σ ∈ [0, 20] and u ∈ [10−5, 10−1].
[Figure 4.9 panels: RMSE (log scale, 10−1–103) versus frame order (0–100) for Ours, TSDF, t-MF, g-DF and the input; (a) I : (10−3, 1), (b) II : (10−3, 1), (c) I : (10−2, 2), (d) II : (10−2, 2), (e) I : (10−1, 4), (f) II : (10−1, 4).]
Figure 4.9: Comparison with other methods on static structure estimation of the synthetic static scenes. Three levels of noise and outlier parameter pairs (ω_n, σ_n) were tested. (a), (c) and (e) were of type-I; (b), (d) and (f) were of type-II. The x-axis marks the frame order, and the y-axis the RMSE score.
noise and outliers for both the type-I and type-II errors, and converges faster, i.e., it
needs fewer frames to reach a stable performance. The severer the noise, the larger the
advantage of the proposed method. Because TSDF converges more slowly and g-DF
suffers from quantization errors, they usually cannot match the performance of our
method. In fact, with a very large window
[Figure 4.10 panels for Indoor_Scene_1: raw depth and color sequences, and the estimated static structure at t = 0, 5, 10 without spatial enhancement, with spatial enhancement (w/o texture), and with spatial enhancement (w/ texture).]
Figure 4.10: Visual evaluation on real indoor static scenes. (a) is the result of a real indoor scene Indoor Scene 1. The first row shows the raw depth sequences and color sequences. The second row shows selected results of the estimated static structures without spatial enhancement at frames t = 0, 5, 10, respectively. The third row shows the corresponding spatially enhanced static structure without texture information, while the last row exhibits the results with the guidance of texture information. The yellow color in the second row marks missing depth values (holes). Gray represents depth value, lighter meaning a nearer distance from the camera. Best viewed in color.
size, t-MF might obtain RMSE scores lower even than those of our method, but would
require more memory and will tend to be slower. Furthermore, t-MF does not provide
confidence of its output as our method does. Due to the quantization artifact of g-DF,
even in an optimal setting, g-DF generally exhibits a lower performance than the
proposed method; the occupancy grid prevents g-DF from reaching sub-grid
accuracy [81].
[Figure 4.11 panels for Indoor_Scene_2: raw depth and color sequences, and the estimated static structure at t = 0, 5, 10 without spatial enhancement, with spatial enhancement (w/o texture), and with spatial enhancement (w/ texture).]
Figure 4.11: Visual evaluation on real indoor static scenes. (b) shows the results of a real indoor scene Indoor Scene 2. The first row shows the raw depth sequences and color sequences. The second row shows selected results of the estimated static structures without spatial enhancement at frames t = 0, 5, 10, respectively. The third row shows the corresponding spatially enhanced static structure without texture information, while the last row exhibits the results with the guidance of texture information. The yellow color in the second row marks missing depth values (holes). Gray represents depth value, lighter meaning a nearer distance from the camera. Best viewed in color.
Algorithms        t-MF (w=5)   t-MF (w=10)   g-DF     TSDF     Ours
Running time (s)  0.0188       0.0309        1.9186   0.6847   0.0223

Table 4.1: Per-frame running time comparison (MATLAB platform)
The per-frame running time comparison is listed in table 4.1, where our method
is comparable with t-MF. The t-MF with window size 5 has a slightly smaller com-
putational cost, but when the window size is 10, its running time exceeds that of our
method. g-DF and TSDF require much more time to process a single frame, yet their
performance is still not comparable to that of our method.
4.4.2 Evaluation of the Static Structure Estimation By Real Data
To validate our algorithm with the real data, we picked several depth video sequences
captured by Kinect and ToF cameras. Both static and dynamic scenes were tested.
Static scenes
Figures 4.10 and 4.11 show the results of two real indoor scenes captured by Kinect. The
first row shows the raw depth and color video sequences. Notice that severe holes
are present, and fine details of the scene are prone to be missing or to take faulty
depth values. Nevertheless, the corresponding color frames are well-defined
everywhere, providing enough cues to regularize the structures.
We first estimate the static structure just by raw depth frames without spatial
enhancement. See the second rows in Figures 4.10 and 4.11. Our method can robustly
fill holes as long as sufficient depth samples in previous frames are available. In the
case where only depth video is applicable, spatial enhancement is only constrained
by the depth information. Even though the results are more spatially regular than
those without spatial enhancement, inpainting artifacts occur inside sufficiently large
holes, and edges are blurred. Furthermore, wrong measurements in the depth frames
will be retained in the static structure and cannot be eliminated. As illustrated in
the last rows of Figures 4.10 and 4.11, spatial enhancement based on both depth and
texture information produces refined static structures which are both reliable and user-
acceptable. The results in green boxes show the differences between two types of spatial
enhancements.
Directly employing spatial enhancement in raw depth frames cannot obtain stable
results since randomly occurring holes and outliers destroy the consistency between
frames and prevent the regularizing of the depth map into a temporally stable one.
The static structure, in contrast, enforces the long-range temporal connection and
incrementally refines the static scene. As shown in red circles in Figures 4.10 and 4.11,
(a) Indoor Scene 1
(b) Indoor Scene 2
Figure 4.12: Reliability maps of two test sequences of indoor static scenes.
the missing structures cannot be inferred satisfactorily by conventional methods alone,
but they are refined and converge as time goes on.
The reliability of the estimated static structure (shown in Figure 4.12) is measured
by the proportion of samples that agree with the static structure as per equation (4.4),
which indicates that flat or smooth surfaces in the static structure are of high reliability.
Simply marking pixels with r^t_x ≤ 0.5 as unreliable shows that many unreliable pixels lie around
depth discontinuities or occlusions. It is reasonable that measurements around such regions
tend to be unreliable due to the systematic limitations of Kinect and related depth
sensors. The static structure can be spatially regularized further in conjunction with
the reliability map by reducing the data confidence in the unreliable region. Our reli-
ability map is data-driven unlike those by heuristic methods [30] that need user-tuned
parameters.
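A sketch of such a data-driven reliability map follows. The fixed agreement tolerance stands in for the model-based test of equation (4.4) and, like the class name, is an assumption of this sketch.

```python
import numpy as np

class ReliabilityMap:
    """Per pixel, the fraction of depth samples observed so far that agree
    with the static structure."""
    def __init__(self, shape):
        self.agree = np.zeros(shape)
        self.total = np.zeros(shape)

    def update(self, d_t, mu, valid, tol):
        """Accumulate one frame: d_t is the input depth, mu the static
        structure, valid marks pixels that carry a measurement."""
        self.agree += valid & (np.abs(d_t - mu) < tol)
        self.total += valid
        return self.reliability()

    def reliability(self):
        # Pixels with no samples yet report 0 (fully unreliable).
        return self.agree / np.maximum(self.total, 1.0)
```

Thresholding the returned map at 0.5 reproduces the unreliable-pixel marking used above.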
Figure 4.13: Static structure estimation on dyn_kinect_tl. (a) and (b) are the first five frames of the input sequence. (c) shows the layer assignment results; red, green and blue denote l_iss, l_dyn and l_occ, respectively. (d) represents the depth map of the static structure, and (e) shows the corresponding color map. The first frame is used for initialization.
Dynamic Scenes
Our method can effectively extract the dynamic content from a static scene and further
estimate and refine the static structure in the static region. Two videos were evalu-
ated. One was captured by Kinect, a real indoor scene with people moving around
(dyn kinect tl). The second was a hand sequence by a ToF camera (dyn tof tl).
Kinect sequence. dyn_kinect_tl is a time-lapse (30×) Kinect sequence. Figure 4.13
shows the results of the first five frames. The parameter set for layer assignment was
wr = 5, ws = 10, τ_α = 16^{-2}, τ_γ = 3^{-2}, Σ_β = I. Our proposed method can rapidly
capture the static structure (both the depth and color) with very few frames. The
artifact in Figure 4.13(d) is partially due to unreliable initialization, and partially
because of the limited number of iterations of hole filling in the spatial enhancement.
Figure 4.14: Static structure estimation on dyn_tof_tl. (a) shows the first five frames of the input sequence. (b) shows the layer assignment results; red, green and blue denote l_iss, l_dyn and l_occ, respectively. (c) represents the depth map of the static structure. The first frame is used for initialization.
The latter issue is resolved gradually after a few frames, as shown in the 3rd and 4th
frames in (d). The former problem will be relieved by deleting unreliable areas in
future frames according to the reliability map.
ToF sequence. The ToF sequence dyn_tof_tl [1] is time-lapse (10×) and has no
color sequence embedded, as shown in Figure 4.14. The parameter set for layer
assignment was wr = 20, ws = 10, τ_α = 5^{-2}, τ_γ = 1^{-2}, Σ_β = I. Similar to the results
from dyn kinect tl, the layer assignment can effectively exclude depth values from
dynamic foregrounds (lx = ldyn) and include those from once occluded static structures
(lx = locc). Nevertheless, the blurs around boundaries and high noise level in the raw
depth frames lead to halo artifacts in the resultant static structures at the first few
frames, because in this case the layer assignment cannot definitively point out the ex-
act boundaries between layers. Fortunately later frames provide more reliable depth
samples in such regions, thus eliminating these artifacts. See the difference from the
3rd to the 5th frame in Figure 4.14 (c).
4.4.3 Temporally Consistent Depth Video Enhancement
Our depth video enhancement works in conjunction with the online static structure up-
date scheme. The quality of the static structure determines the resulting performance
Figure 4.15: Comparison on depth video enhancement. (a) and (b) are selected frames from the test RGB-D video sequences; from left to right: the 113th, 133rd, 153rd, 173rd, 193rd and 213th frames. (c) shows the results by CSTF [1], and (d) by WMF [5]. (e) shows the results by Lang et al. [6]. (f) is generated by the proposed method. (g) compares the performances among these methods in the enlarged sub-regions (shown in raster-scan order). Best viewed in color.
from enhancing the tested frame spatially and temporally. Thanks to the robustness
and effectiveness of our proposed method, this temporally consistent enhancement out-
performs most existing representative approaches and shows comparable results with
current state-of-the-art long-range temporally consistent depth video enhancement [6].
We tested several RGB-D sequences to verify our conclusion and highlight the advan-
tages of the proposed method. These videos and their results by the proposed method
and the reference approaches are available in the supplementary materials.
As shown in Figure 4.15, the selected frames from the sequence dyn kinect 1 are
(a) dyn_kinect_2
Figure 4.16: Comparison on depth video enhancement. (a) shows selected frames from an RGB-D video sequence dyn_kinect_2. From top to bottom: the RGB frames, the raw depth frames, results by Lang et al. [6], and results by the proposed method. Best viewed in color.
113th, 133rd, 153rd, 173rd, 193rd and 213th, from left to right. Severe holes occurring in
each frame are partially because of occlusion and partially due to absorbing or
reflective materials in the captured scene. Worse still, the depth values around the
boundaries of captured objects tend to be erratic. The raw depth and color frames
are shown in Figure 4.15(a) and (b). The reference methods are the coherent spatio-
temporal filtering [1] (CSTF), the weighted mode filtering [5] (WMF) and temporally
consistent depth upsampling by Lang et al. [6]. Their parameters were set to the
default values given in their papers. The reference results are shown in (c), (d) and
(e) of Figure 4.15 and the results of the proposed method are listed in Figure 4.15(f).
CSTF tends to blur more than the other methods, especially inside
the holes around the boundaries between the foreground objects and the background
scene. WMF needs to quantize the depth frame into finite bins (in this experiment, 256
(b) dyn_kinect_3
Figure 4.17: Comparison on depth video enhancement. (b) shows selected frames from an RGB-D video sequence dyn_kinect_3. From top to bottom: the RGB frames, the raw depth frames, results by Lang et al. [6], and results by the proposed method. Best viewed in color.
bins were applied), thus resulting in quantization artifacts even though it encourages
sharper boundaries without blurring. Referring to any frame in Figure 4.15(c) and
Figure 4.15(d), neither of these two methods can fill the depth holes with satisfactory
accuracy, and the latter one performs worse in stabilizing these holes. On one hand,
the reason is that they are not able to fill large holes without propagating wrong
depth structure when the texture is less informative. On the other hand, the temporal
consistency is enforced only within a small temporal window, so the structure inside
the holes cannot be preserved over a long time.
A recent remarkable improvement, attributable to Lang et al. [6], is a practical
long-range temporal consistency enhancement. Its results, shown in Figure 4.15(e),
demonstrate its superiority in both structure regularization and temporal stabilization
over the previous two reference methods. The method by Lang et al. not only
temporally stabilizes the static objects and background, but also enforces long-range
temporal consistency on the dynamic objects; in comparison, the proposed method does
not preserve temporal consistency inside the dynamic objects. However, with the method
of Lang et al., bleeding artifacts in the hole regions still cannot be eliminated immediately
and are liable to be propagated over adjacent frames. Although the method is
computationally efficient thanks to an approximate solver based on constant-time
domain transform filtering [32], it is globally optimized and thus often requires
storing all frames in memory.
Compared with the prior art, the proposed method outperforms CSTF and
WMF both spatially and temporally. Furthermore, it generally performs comparably
to the method of Lang et al., and is sometimes even superior around static holes be-
tween dynamic objects and the static background, and in stabilizing the static region
of each frame. Figure 4.15(g) compares the results of the enlarged sub-regions denoted
by the red boxes in the original frames, in which our method features superior per-
formance in regularizing these depth structures. In addition, by observing the static
background behind the moving people, the proposed method offers much more stable
results around regions where there were large holes, e.g., the black computer cases and
monitors placed on and under the white tables. It both preserves the long-range stabil-
ity of the depth structure in the holes of the static region and at the same time prevents
depth propagation from the dynamic objects to the static background. Meanwhile, the
spatially enhanced static structure by the proposed method can incrementally refine it-
self by following the guidance of the corresponding color map, and gradually converges
to a stable output, just as discussed in Section 4.4.2.
Two additional results by the proposed method and by Lang et al. [6] are presented
in Figures 4.16 and 4.17, in which the proposed method provides comparable quality
while encouraging even more delicate details around the hands and heads, as well as blur-
free boundaries between the human and the background, owing to the success of the layer
assignment in Section 4.3.4. However, because the proposed method cannot extract a
static foreground object from the static background, blurring artifacts or false depth
propagation may happen around their boundaries, just as with the aforementioned
state-of-the-art method by Lang et al. and the filtering-based approaches like CSTF
Figure 4.18: Failure cases of the proposed method. (a) and (b) are two representative results. From left to right: the color frame, the raw depth frame and the enhanced depth frame. Artifacts are bounded by the red dotted boxes.
and WMF. Referring to the standing person near the background in Figure 4.17,
both the proposed method and that of Lang et al. falsely propagated depth values
from his left arm onto the computer case in the background.
4.5 Limitations and Applications
4.5.1 Limitations
One limitation is that the proposed method has only been tested with indoor Kinect
and ToF depth videos. To verify the reliability and generality of the proposed method,
more diverse sources of depth data, e.g., depth videos capturing indoor or outdoor
scenes, by Kinect, ToF or laser scanners, as well as stereo vision, should be evaluated
thoroughly.
For RGB-D video enhancement, the proposed method is constrained by the as-
sumption that the static structure is “static” both in the depth and color channels.
The static structure estimation may thus fail if the captured scene has varying illu-
mination, in which case, the spatio-temporal enhancement turns into a conventional
spatial enhancement approach. Another possible drawback of the proposed method is
that the false estimation in the static structure cannot be eliminated if future frames
cannot provide enough reliable depth samples at the same location. For example, the
(a) RGB frame (b) Raw depth frame (c) Ours
(d) Lang et al. [6] (e) CSTF [1] (f) WMF [5]
Figure 4.19: Examples of the background subtraction. Best viewed in color.
artifacts marked by the red dotted boxes in the enhanced depth frames (c.f. Fig-
ure 4.18) correspond to the holes in the input depth frames. The input depth frames
cannot provide effective and reliable depth samples at these regions thus the artifacts
cannot explicitly be detected by the proposed model. One possible improvement might
heuristically define a threshold to delete such regions from the static structure when
no reliable depth samples are received within a sufficiently long time.
The proposed method models the captured scene with only dynamic and static
layers, and does not immediately extend to multiple (e.g., more than 3) layers.
Although it is a tough question to define and model such layers properly, we believe
that more accurate results are possible with this extension. For instance, the
relationship between different dynamic objects can be well-defined if multiple dynamic
layers compactly represent the local statistics of these objects. In this case, the spatial
enhancement of each object can be handled separately and/or hierarchically, while
the temporal enhancement can be adjusted to fit their distinctive motion patterns.
Therefore, this meaningful extension is worth exploring in depth as a future
topic.
Figure 4.20: Examples of novel view synthesis. (a) and (b) are the input RGB and depth frames. (c) is the enhanced depth frame by the proposed method. (d) is the view synthesized from the raw depth frame and the RGB frame; its image holes are filled by the static structure, as shown in (e). (f) is the view synthesized from the enhanced depth frame, with image holes likewise filled by the estimated static structure. Best viewed in color.
4.5.2 Applications
A high quality depth video improves various applications in the fields of image and
graphics processing, and computer vision as well. In the following two successful appli-
cations, the enhanced depth videos by the proposed method act as an effective cue to
improve performance.
Background Subtraction
We can use the processed RGB-D videos to improve the segmentation of foreground
objects from the background. As shown in Figure 4.19, we tested one pair of RGB-D
frames for background subtraction by simply extracting the region with depth values
smaller than a constant threshold (in this case, we set the threshold as 1500mm) and
replacing the background by blue color. Note that there was no boundary matting
applied in all the cases. The proposed method (c.f. Figure 4.19(c)) shows a much more
refined and complete foreground segment than those by the reference methods.
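The thresholding step can be sketched directly; the 1500 mm threshold matches the experiment, while the function name and the blue-background encoding are our own illustrative choices.

```python
import numpy as np

def subtract_background(rgb, depth, thresh_mm=1500.0, bg_color=(0, 0, 255)):
    """Keep pixels closer than the depth threshold; paint everything else
    with a flat background color (blue, as in Figure 4.19)."""
    fg = depth < thresh_mm
    out = np.empty_like(rgb)
    out[...] = np.asarray(bg_color, dtype=rgb.dtype)
    out[fg] = rgb[fg]
    return out
```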
Novel View Synthesis
A variant of novel view synthesis, named depth image-based rendering (DIBR) [89]
applies the depth information to guide the warping of the texture map of one view to
another synthesized view. It is a popular technique for immersive telecommunication
or 3DTV and free-viewpoint TV. However, the performance is hampered by the quality of
the depth video. As presented in Figure 4.20, the novel view generated by the raw
depth frame and the registered RGB frame contains severe holes and cracks, as well
as structure distortion. The static structure is appropriate to fill the image holes,
but it may replace the structure of the foreground objects by mistake. The enhanced
depth frame by the proposed method can preserve the depth structures well so that
less structure distortion occurs in its synthesized view. Thus the synthesized view is
visually plausible without apparent artifacts.
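A minimal DIBR forward warp under the assumption of a purely horizontal baseline (disparity fB/Z), with z-buffering and explicit hole marking; this is an illustrative sketch, not the renderer of [89], and all names are our own.

```python
import numpy as np

def dibr_forward_warp(rgb, depth, f, baseline):
    """Forward-warp each pixel by the horizontal disparity f*B/Z into a
    virtual view; z-buffering keeps the nearest surface, and unfilled
    pixels remain marked as holes."""
    h, w = depth.shape
    out = np.zeros_like(rgb)
    zbuf = np.full((h, w), np.inf)
    hole = np.ones((h, w), dtype=bool)
    disp = np.where(depth > 0, f * baseline / np.maximum(depth, 1e-6), 0.0)
    for y in range(h):
        for x in range(w):
            if depth[y, x] <= 0:
                continue  # input hole: nothing to warp
            xn = int(round(x - disp[y, x]))
            if 0 <= xn < w and depth[y, x] < zbuf[y, xn]:
                zbuf[y, xn] = depth[y, x]
                out[y, xn] = rgb[y, x]
                hole[y, xn] = False
    return out, hole  # holes may then be filled from the static structure
```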
4.6 Summary
In this chapter, we present a novel method for robust temporally consistent depth enhancement that introduces the static structure of the captured scene, estimated online by a probabilistic generative mixture model with efficient parameter estimation, spatial enhancement and an update scheme. After segmenting the input frame with an efficient fully-connected CRF model, the dynamic region is enhanced spatially while the static region is substituted by the updated static structure, so as to favor long-range spatio-temporal enhancement. Quantitative evaluation shows the robustness of the parameter estimation for the static structure and a superior performance in comparison with various static scene estimation approaches. Qualitative evaluation demonstrates that our method operates well on various indoor scenes and two kinds of sources (Kinect and ToF camera), and that the proposed temporally consistent depth video enhancement performs satisfactorily in comparison with existing methods.
As future work, an extension to deal with moving cameras will be a meaningful topic of study. Furthermore, we will improve the algorithm to reduce the effect of wrong estimates and design an efficient reliability check to increase the accuracy of the estimated static structure. Last but not least, a more general probabilistic framework that handles multiple dynamic and static layers is worth exploring to further increase the performance of the proposed method.
Chapter 5
A Generative Model for Robust 3D Facial Pose
Tracking
5.1 Introduction
Our approach unifies 3D facial pose tracking and online identity adaptation based on a parameterized generative face model. This generative model is parameterized by a 3D multilinear tensor model [7; 90] integrating descriptions of shape, identity and expression, which not only effectively models the identity but also provides a statistical interpretation of the expression. Unlike discriminative methods, the generative model possesses the flexibility to generate and predict the distribution and uncertainty underlying the face model. By tracing the identity distribution during tracking in a generative way, the face model is gradually adapted to the captured user with sequentially input depth frames. Occlusion-aware pose estimation is achieved by minimizing an information-theoretic ray visibility score that regularizes the visibility of the face model in the current depth frame. It is based on the intuition that visible face model points must overlap with the input point cloud, while the remaining points must be occluded by it. This method needs no explicit correspondence detection, yet it accurately estimates the facial pose and handles occlusions well. In each frame, we progressively adapt the face model to the current user after the facial pose has been successfully estimated. In summary, we make the following contributions:
• A framework that unifies pose tracking and face model adaptation on-the-fly, offering highly accurate, occlusion-aware and uninterrupted real-time 3D facial pose tracking.
• A generative multilinear face model that models both the identity and the expression, facilitating on-the-fly face model personalization without interference from expression variations.
• A ray visibility score that enables the correspondence-free and occlusion-aware
facial pose tracking.
5.2 Related Work
Conventionally, facial pose tracking and model regression employ monocular RGB video sequences due to their availability. These approaches often track the dynamics of sparse 2D or 3D facial features, e.g., face landmarks or optical flow, that correspond to parametric 2D or 3D face models [37–40]. Accompanied by reliable feature detection methods, the facial pose can be tracked well under moderate occlusions and motion patterns. Active appearance models (AAM) [41] and constrained local models (CLM) [42] enable real-time sparse 2D facial feature tracking in a data-driven manner, but they may fail under complex motions or large facial deformations, even when a user-specific training phase is involved. Recent advances in discriminative real-time 2D tracking based on random forests [43], landmark prediction [44] and supervised descent methods [45] have shown promising results in comparison with previous methods. In addition, explicit modeling of occlusions has been taken into account [43].
With the popularity of consumer-level depth sensors, a variety of 3D facial pose tracking and model personalization frameworks have been proposed. One category of approaches achieves reliable tracking performance without introducing a 3D model or template. Some of these methods employ depth features, such as facial features defined by surface curvatures [46], a nose detector [47], or triangular surface patch descriptors [48]. However, these methods fail when the features cannot be detected, e.g., with highly noisy depth data, extreme poses or large occlusions. The remaining methods in this category are discriminative. For instance, Fanelli et al. used random classification and regression forests with depth image patches for face detection and pose estimation [49; 50]. Riegler et al. [51] trained a deep Hough network to simultaneously detect the face and estimate the facial pose. Another kind of discriminative variant does not explicitly estimate the facial pose but instead determines the dense
correspondence field between the input depth image and a pre-defined canonical face model. The facial pose is then estimated by regressing the face model to the input depth image under this correspondence field. Inspired by the pioneering works of [91; 92], the dense correspondence field can be generated by random classification and regression forests with simple depth features, as proposed by Kazemi et al. [52]. Apart from random forests, convolutional neural networks (CNNs) are also capable of dense correspondence field estimation, and have already proven successful in human pose estimation and body reconstruction [93]. Although these methods can provide sufficiently accurate results, they require extensive and sophisticated supervised training with large-scale datasets. Moreover, they generalize poorly to depth data captured by an unfamiliar depth sensor that is not represented in the training dataset.
Another category matches a 3D face model to the input depth images with rigid or non-rigid registration methods. For example, a common strategy is to fit a user-specific face model with 3D morphable models [8; 94–103] or brute-force per-vertex 3D face reconstruction [104–106]. Although helpful for accurate facial tracking systems, most of these require offline initialization or user calibration to generate the user-specific face model. In contrast, there are prior arts that gradually refine the 3D morphable model as more data is collected, in parallel with the facial pose tracking thread [8; 100–102]. The proposed method falls into this category, and the whole pipeline is re-interpreted in a generative way. The 3D morphable models can be roughly categorized into three classes: (1) the wireframe model (WFM) [97; 98]; (2) the Basel face model (BFM) [8; 40; 52; 100; 101]; and (3) the multilinear face model (MFM) [7; 90; 95; 99] that models both identity and expression. Unlike the wireframe model, which is too sparse to produce a detailed face model, and the Basel face model, which cannot eliminate biased reconstructions caused by expression variations, the multilinear face model describes both identity and expression [7; 90]. By treating the multilinear face model in a generative way, the uncertainty of the expression variations can be explicitly modeled, and the reconstructed face model is thus less vulnerable to these distortions.
The occlusion-aware registration problem is a long-standing issue for facial pose
Figure 5.1: Sample face meshes in the FaceWarehouse dataset. This dataset contains face meshes covering a comprehensive set of expressions and a variety of identities of different ages, genders and races.
tracking. Apart from discriminative approaches that label the occlusions through face segmentation [101; 107] or patch-based feature learning [49–52], rigid or non-rigid ICP-based face model registration suffers from correspondence ambiguities when the distance or normal-vector compatibility criterion [9; 101; 105; 106] is applied. Possible remedies apply global optimization, e.g., particle swarm optimization [108], with carefully designed objective functions [8]. Assuming multi-view visibility consistency among partial depth scans, occlusions and partial registration can also be handled well [109]. The proposed ray visibility score observes the visibility constraint between the face model and the input point cloud, which is similar to the multi-view visibility consistency stated by Wang et al. [109]. The proposed ray visibility score is formulated in an information-theoretic manner from a generative perspective, making it more robust to uncertainties in the 3D morphable face model and less vulnerable to the local minima that frequently occur in ICP-based methods.
5.3 Probabilistic 3D Face Parameterization
This section introduces the 3D face model with a probabilistic interpretation, which
acts as an effective prior for head pose estimation and face identity adaptation from a
streaming depth video.
5.3.1 Multilinear Face Model
We apply the multilinear model [7; 90] to parametrically generate arbitrary 3D faces that adapt to different identities and expressions. It is controlled by a three-dimensional tensor $\mathcal{C} \in \mathbb{R}^{3N_M \times N_{id} \times N_{exp}}$ whose dimensions correspond to shape, identity and expression, respectively. The multilinear model represents a 3D face $\mathbf{f} = (x_1, y_1, z_1, \ldots, x_{N_M}, y_{N_M}, z_{N_M})^\top$ consisting of $N_M$ vertices $(x_n, y_n, z_n)^\top$ as
$$\mathbf{f} = \bar{\mathbf{f}} + \mathcal{C} \times_2 \mathbf{w}_{id}^\top \times_3 \mathbf{w}_{exp}^\top, \tag{5.1}$$
where $\mathbf{w}_{id} \in \mathbb{R}^{N_{id}}$ and $\mathbf{w}_{exp} \in \mathbb{R}^{N_{exp}}$ are linear weights for identity and expression, respectively, $\times_i$ denotes the $i$-th mode product, and $\bar{\mathbf{f}}$ is the mean face of the training dataset. The tensor $\mathcal{C}$, called the core tensor, encodes the subspaces that span the shape variations of faces; it is calculated by applying high-order singular value decomposition (HOSVD) to the training dataset, i.e., $\mathcal{C} = \mathcal{T} \times_2 \mathbf{U}_{id} \times_3 \mathbf{U}_{exp}$, where $\mathbf{U}_{id}$ and $\mathbf{U}_{exp}$ are the unitary matrices from the mode-2 and mode-3 HOSVD of the data tensor $\mathcal{T} \in \mathbb{R}^{3N_M \times N_{id} \times N_{exp}}$, which collects the offsets against the mean face $\bar{\mathbf{f}}$ of face meshes with varying identities and expressions in the training dataset.
To produce compact and complete representations of arbitrary faces by Equation (5.1), we train the mean face $\bar{\mathbf{f}}$ and the core tensor $\mathcal{C}$ on the well-known FaceWarehouse dataset [7]. As visualized in Figure 5.1, this dataset contains face meshes of 150 identities performing 47 expressions, covering different ages, genders and races. Its diversity yields a subspace of face shape variations that covers most common identities and expressions.

To represent a face model compactly yet efficiently, the core tensor $\mathcal{C}$ can be safely truncated along the identity and expression dimensions. As the principal shape variations are stored in the top-left $3N_M \times N_{id} \times N_{exp}$ sub-tensor $\mathcal{C}_r$, the face model $\mathbf{f}$ can still be reconstructed from $\mathcal{C}_r$ without apparent distortion.
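Concretely, the synthesis of Equation (5.1) amounts to two tensor contractions of the core tensor with the identity and expression weights; a minimal NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def synthesize_face(f_bar, core, w_id, w_exp):
    """Evaluate f = f_bar + C x_2 w_id x_3 w_exp for a core tensor of
    shape (3*N_M, N_id, N_exp): each mode product contracts one axis."""
    g = np.tensordot(core, w_id, axes=([1], [0]))    # -> (3*N_M, N_exp)
    g = np.tensordot(g, w_exp, axes=([1], [0]))      # -> (3*N_M,)
    return f_bar + g
```

Truncating the identity and expression axes of `core` leaves the same two contractions, only with shorter weight vectors.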
5.3.2 A Statistical Prior
Figure 5.2: Illustration of the generic multilinear face model trained on the FaceWarehouse dataset [7]. (a) The mean face $\bar{\mathbf{f}}$. (b) Per-vertex shape variation caused jointly by $\mathbf{w}_{id}$ and $\mathbf{w}_{exp}$. (c)–(d) Per-vertex shape variation with respect to $\mathbf{w}_{id}$ and $\mathbf{w}_{exp}$, respectively. The shape variation is represented as the standard deviation of the marginalized per-vertex distribution; the variations in (b)–(d) are overlaid on the same neutral face model $\boldsymbol{\mu}_{\mathcal{M}}$. Best viewed in electronic version.

It is sufficient to treat the multilinear tensor model as a statistical prior. Unlike conventional methods, we do not discriminatively employ one exact face template (which may disagree with the current user's face or be incompatible with local variations caused by the user's expression) to fit the target face point cloud or track its motion
with a set of heuristic parameters. With the help of statistical modeling, the face shape and its distribution can be generatively synthesized, and the dynamics of the tracked face can be reliably predicted. In addition, with the introduction of the statistical prior, the proposed system requires fewer user-provided parameters than conventional discriminative methods.
Identity and Expression Priors
It is tractable to assume that the identity weight $\mathbf{w}_{id}$ and the expression weight $\mathbf{w}_{exp}$ follow two independent Gaussian distributions, $\mathbf{w}_{id} = \boldsymbol{\mu}_{id} + \boldsymbol{\varepsilon}_{id}$ with $\boldsymbol{\varepsilon}_{id} \sim \mathcal{N}(\boldsymbol{\varepsilon}_{id}|\mathbf{0}, \boldsymbol{\Sigma}_{id})$, and $\mathbf{w}_{exp} = \boldsymbol{\mu}_{exp} + \boldsymbol{\varepsilon}_{exp}$ with $\boldsymbol{\varepsilon}_{exp} \sim \mathcal{N}(\boldsymbol{\varepsilon}_{exp}|\mathbf{0}, \boldsymbol{\Sigma}_{exp})$. These prior distributions can be estimated from the training data. Indeed, we simply apply $\boldsymbol{\mu}_{id} = \mathbf{U}_{id}^\top\mathbf{1}$ and $\boldsymbol{\mu}_{exp} = \mathbf{U}_{exp}^\top\mathbf{1}$. The covariance matrices are scaled identity matrices, i.e., $\boldsymbol{\Sigma}_{id} = \sigma_{id}^2\mathbf{I}$ and $\boldsymbol{\Sigma}_{exp} = \sigma_{exp}^2\mathbf{I}$, where the parameters $\sigma_{id}^2 = \frac{1}{N_{id}}$ and $\sigma_{exp}^2 = \frac{1}{N_{exp}}$ are learned from the training set.
Multilinear Face Model Prior
The canonical face model $\mathcal{M}$ with respect to $\mathbf{w}_{id}$ and $\mathbf{w}_{exp}$ is analogously of the form
$$\mathbf{f} = \bar{\mathbf{f}} + \mathcal{C} \times_2 \boldsymbol{\mu}_{id} \times_3 \boldsymbol{\mu}_{exp} + \mathcal{C} \times_2 \boldsymbol{\varepsilon}_{id} \times_3 \boldsymbol{\mu}_{exp} + \mathcal{C} \times_2 \boldsymbol{\mu}_{id} \times_3 \boldsymbol{\varepsilon}_{exp} + \mathcal{C} \times_2 \boldsymbol{\varepsilon}_{id} \times_3 \boldsymbol{\varepsilon}_{exp}. \tag{5.2}$$
If $\boldsymbol{\varepsilon}_{id}$ and $\boldsymbol{\varepsilon}_{exp}$ have smaller magnitudes than $\boldsymbol{\mu}_{id}$ and $\boldsymbol{\mu}_{exp}$, the last term can be eliminated from (5.2), since it usually produces much smaller shape variations than those caused solely by identity or expression variations. Therefore, the face model $\mathcal{M}$ approximately follows a Gaussian distribution,
$$p_{\mathcal{M}}(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\boldsymbol{\mu}_{\mathcal{M}}, \boldsymbol{\Sigma}_{\mathcal{M}}), \tag{5.3}$$
whose neutral face is $\boldsymbol{\mu}_{\mathcal{M}} = \bar{\mathbf{f}} + \mathcal{C} \times_2 \boldsymbol{\mu}_{id} \times_3 \boldsymbol{\mu}_{exp}$ and whose covariance matrix is $\boldsymbol{\Sigma}_{\mathcal{M}} = \mathbf{P}_{id}\boldsymbol{\Sigma}_{id}\mathbf{P}_{id}^\top + \mathbf{P}_{exp}\boldsymbol{\Sigma}_{exp}\mathbf{P}_{exp}^\top$. The projection matrices $\mathbf{P}_{id}$ and $\mathbf{P}_{exp}$ for identity and expression are obtained by permuting (denoted by the operation $\Pi(\cdot)$) the tensor expressions into matrix form: $\mathbf{P}_{id} = \Pi(\mathcal{C} \times_3 \boldsymbol{\mu}_{exp}) \in \mathbb{R}^{3N_M \times N_{id}}$ and $\mathbf{P}_{exp} = \Pi(\mathcal{C} \times_2 \boldsymbol{\mu}_{id}) \in \mathbb{R}^{3N_M \times N_{exp}}$.
Since in this work we are interested in facial pose tracking and identity adaptation that are insensitive to expression variations, the joint distribution of the face model and the identity parameter is introduced as
$$p(\mathbf{f}, \mathbf{w}_{id}) = p_{\mathcal{M}}(\mathbf{f}|\mathbf{w}_{id})\,p(\mathbf{w}_{id}) = \mathcal{N}(\mathbf{f}\,|\,\bar{\mathbf{f}} + \mathbf{P}_{id}\mathbf{w}_{id},\ \boldsymbol{\Sigma}_E)\,\mathcal{N}(\mathbf{w}_{id}|\boldsymbol{\mu}_{id}, \boldsymbol{\Sigma}_{id}), \tag{5.4}$$
where the expression covariance $\boldsymbol{\Sigma}_E = \mathbf{P}_{exp}\boldsymbol{\Sigma}_{exp}\mathbf{P}_{exp}^\top$ is absorbed into the likelihood $p(\mathbf{f}|\mathbf{w}_{id})$. The likelihood is therefore robust to local shape variations led by expression, and the posterior of $\mathbf{w}_{id}$ is less affected by the user's expression in the current frame. Moreover, the expression covariance $\boldsymbol{\Sigma}_E$ is adjusted by the identity, which is adapted to the current user and increases the robustness of pose estimation.
As shown in Figure 5.2, the joint shape variation given by $\boldsymbol{\Sigma}_{\mathcal{M}}$ varies from vertex to vertex, with the facial region bearing larger shape distortions than the rest of the head. By decomposing $\boldsymbol{\Sigma}_{\mathcal{M}}$ into the shape covariance by identity, $\boldsymbol{\Sigma}_I = \mathbf{P}_{id}\boldsymbol{\Sigma}_{id}\mathbf{P}_{id}^\top$, and the shape covariance by expression, $\boldsymbol{\Sigma}_E$, i.e., $\boldsymbol{\Sigma}_{\mathcal{M}} = \boldsymbol{\Sigma}_I + \boldsymbol{\Sigma}_E$, we can observe that the
Figure 5.3: System overview. We propose a unified probabilistic framework for robust facial pose estimation and online identity adaptation. In both threads, the generative face model acts as the key intermediate and is updated immediately with the feedback of the identity adaptation. The input is the depth map, while the output is the rigid pose parameter $\boldsymbol{\theta}^{(t)}$ and the updated face identity parameters $\{\boldsymbol{\mu}_{id}^{(t)}, \boldsymbol{\Sigma}_{id}^{(t)}\}$ that encode the identity distribution $p^{(t)}(\mathbf{w}_{id})$.
majority of shape variations are caused by the identities rather than the expressions. Conversely, the shape uncertainties caused by the expressions are localized around the mouth and chin, as well as the regions around the cheeks and eyebrows. Meanwhile, the neutral face $\boldsymbol{\mu}_{\mathcal{M}}$ is nearly the same as the mean face $\bar{\mathbf{f}}$ of the training dataset, which reveals that the priors of $\mathbf{w}_{id}$ and $\mathbf{w}_{exp}$ do not bias the face model $\mathcal{M}$ as a representation of the training dataset.
In comparison with the Basel Face Model (BFM) [40], which parameterizes the face model by principal component analysis on 200 3D face meshes with neutral expression, and the Blendshapes [110], which encode face expressions as a linear combination of user-specific basic expression units from the facial action coding system (FACS) [111], the proposed multilinear model explicitly describes both identity and expression variations through a fully generative interpretation. It conveys more descriptive power for a general human face and is robust to local shape variations arising from expression, benefiting both the facial pose tracking and the online identity adaptation.
5.4 Probabilistic Facial Pose Tracking
In this section, the pipeline of the proposed probabilistic facial pose tracking is intro-
duced. Our architecture is shown in Figure 5.3. There are two main components in
our system: robust facial pose tracking and online identity adaptation. The identity adaptation branch runs concurrently with the facial pose tracking branch, and both branches operate within a probabilistic framework.
Robust Facial Pose Tracking. The facial pose tracking is achieved by fitting a 3D face model to every captured 3D point cloud. In this work, the face model is the probabilistic multilinear model $\mathcal{M}$ depicted in Equations (5.1) and (5.3), with the prior over the identity parameter being updated to match the current identity, while the prior over the expression parameter is kept fixed. The rigid motion is estimated between the input data and the synthesized face model updated in the previous frame. Outliers and occlusions are robustly eliminated according to a novel ray visibility constraint, while the pose parameter is obtained by minimizing a ray visibility score based on the Kullback-Leibler divergence [64] between the face model and the surface distribution. The pose parameters $\boldsymbol{\theta}$ include not only the rotation angles $\boldsymbol{\omega}$ and the translation vector $\mathbf{t}$ but also, for the first few frames, the scale $s$, since the face model may not match the input point cloud because of scale differences; $s$ is fixed once the identity has converged.
Online Identity Adaptation. The face model $\mathcal{M}$ is initialized with the generic multilinear model trained on the FaceWarehouse dataset and described by Equation (5.3). It is gradually adapted to the user's identity during tracking. Accounting for the entire history of identity observations, the posterior over the identity parameter $\mathbf{w}_{id}$ is recursively updated based on assumed-density filtering and a first-order Markov chain [64]. As the face model takes the local shape variation caused by expression into account, as discussed in Section 5.3.2, the identity adaptation automatically alleviates these distortions.
5.4.1 Robust Facial Pose Tracking
Prior to tracking, we need to detect the face position in the first frame, or whenever tracking has failed. A variety of methods are applicable in our test scenario. In this work, we employ a simple head detection method by Meyer et al. [8], then crop the input depth map to obtain a depth patch centered at the detected head center within a radius of r = 100 pixels. Let $\mathcal{P}$ denote the point cloud extracted from this depth patch, with $N_P = |\mathcal{P}|$ the number of points in $\mathcal{P}$.
Figure 5.4: Samples of occluded faces (self-occlusion; occlusion by hair; occlusion by accessories; occlusion by hand/gesture). The occlusions are caused by multiple factors: for instance, the face may be occluded by itself, or by other objects such as hair, accessories and hands.
The pose parameters are $\boldsymbol{\theta} = \{\boldsymbol{\omega}, \mathbf{t}, \alpha\}$, indicating the rotation angles, the translation vector, and the logarithm of the scale $s$, i.e., $s = e^{\alpha} > 0,\ \forall \alpha \in \mathbb{R}$. A canonical face model point $\mathbf{f}_n$ is rigidly warped into $\mathbf{q}_n$, $n \in \{1, \ldots, N_M\}$, with the encoded orientation, position and scale,
$$\mathbf{q}_n = T(\boldsymbol{\theta}) \circ \mathbf{f}_n = e^{\alpha}\mathbf{R}(\boldsymbol{\omega})\mathbf{f}_n + \mathbf{t}, \tag{5.5}$$
where $\mathbf{R}(\boldsymbol{\omega})$ is the rotation matrix derived from $\boldsymbol{\omega}$, and the transformation $T(\boldsymbol{\theta}) \circ \mathbf{f}_n$ describes this rigid warping. Therefore, the warped face model $\mathcal{Q}$ possesses a similar distribution for each $\mathbf{q}_n \in \mathcal{Q}$, given the same prior as Equation (5.3):
$$p_{\mathcal{Q}}(\mathbf{q}_n;\boldsymbol{\theta}) = \mathcal{N}(\mathbf{q}_n\,|\,T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]},\ e^{2\alpha}\boldsymbol{\Sigma}_{\mathcal{M},[n]}), \tag{5.6}$$
where $\boldsymbol{\mu}_{\mathcal{M},[n]} = \bar{\mathbf{f}}_n + (\mathcal{C} \times_2 \boldsymbol{\mu}_{id} \times_3 \boldsymbol{\mu}_{exp})_{[n]}$ is the $n$-th vertex of the face model, the covariance is adapted by the scale factor, and $\boldsymbol{\Sigma}_{\mathcal{M},[n]}$ is the submatrix of $\boldsymbol{\Sigma}_{\mathcal{M}}$ corresponding to point $\mathbf{f}_n$. To find an optimal pose parameter that matches the warped face model $\mathcal{Q}$ to the input point cloud $\mathcal{P}$, we require the surface distribution of $\mathcal{P}$ to lie within the range spanned by the distribution of the face model $\mathcal{Q}$.
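The rigid warp $T(\boldsymbol{\theta}) \circ \mathbf{f}_n$ of Equation (5.5) can be sketched with a Rodrigues rotation (an illustrative NumPy sketch; the dictionary layout of $\boldsymbol{\theta}$ and the function names are assumptions of this sketch):

```python
import numpy as np

def rodrigues(omega):
    """Rotation matrix from an axis-angle vector via Rodrigues' formula."""
    angle = np.linalg.norm(omega)
    if angle < 1e-12:
        return np.eye(3)
    k = omega / angle
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def warp_points(f, theta):
    """Apply q_n = exp(alpha) * R(omega) * f_n + t (Eq. 5.5) to points f of
    shape (N, 3); theta is a dict with keys 'omega', 't' and 'alpha'."""
    R = rodrigues(theta['omega'])
    return np.exp(theta['alpha']) * (f @ R.T) + theta['t']
```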
Ray Visibility Constraint

It is well known that occlusions are inevitable in uncontrolled scenarios: occluded human faces always lie behind the occluding objects, such as hair, fingers/hands, glasses or accessories, as shown in Figure 5.4. Suppose the face model $\mathcal{Q}$ and the input point cloud $\mathcal{P}$ are correctly aligned: $\mathcal{Q}$ may be partially fitted to a subset of points in $\mathcal{P}$, while its other points must be occluded by points in $\mathcal{P}$. In other words, the only parts of $\mathcal{Q}$ that should be visible from the camera view are those that overlap with $\mathcal{P}$. Therefore, instead of correspondence-based methods such as spatial distance thresholds and normal-vector compatibility checks [101] that are commonly applied in 3D registration, we estimate the pose of the face model by proposing a ray visibility constraint (RVC) to regularize the visibility of each face model point.
Formally, let us define the ray connecting a face model point $\mathbf{q}_n$ and the camera center as $\vec{v}(\mathbf{q}_n,\mathbf{p}_n)$, where $\mathbf{p}_n$ is the point in $\mathcal{P}$ nearest to this ray. In this case, $\mathbf{p}_n$ can be found by matching pixel locations with $\mathbf{q}_n$, which is a lookup-table search in the depth map [101; 106]. If $\mathbf{q}_n$ is visible, it should lie near the surface generated from $\mathcal{P}$; otherwise $\mathbf{q}_n$ should be behind the surface and occluded. However, if $\mathbf{q}_n$ is in front of the surface point along the ray, it should incur a penalty that pushes the face model $\mathcal{Q}$ farther away so that $\mathbf{q}_n$ falls near the surface of $\mathcal{P}$. Eventually, the face model will be tightly and/or partially fitted to a subset of points in $\mathcal{P}$ while the rest of its points are left as occlusions.
One simple way to describe the surface of a point cloud is local linear regression, which is equivalent to fitting the points in a local neighborhood with a 3D plane. Thus, if a model point $\mathbf{q}_n$ is linked to an input point $\mathbf{p}_n$ through the ray $\vec{v}(\mathbf{q}_n,\mathbf{p}_n)$, the signed distance of $\mathbf{q}_n$ to the surface is
$$\Delta(\mathbf{q}_n;\mathbf{p}_n) = \mathbf{n}_n^\top\mathbf{q}_n + b_n, \tag{5.7}$$
where $\mathbf{n}_n$ and $b_n$ are the normal vector and the offset of the plane centered at point $\mathbf{p}_n$. Therefore, the signed distance $y_n$ of the face model point $\mathbf{q}_n = T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]}$ to the surface of $\mathcal{P}$ follows
$$p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_n;\boldsymbol{\theta}) = \mathcal{N}\big(y_n\,\big|\,\Delta(T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_n),\ \sigma_o^2 + e^{2\alpha}\,\mathbf{n}_n^\top\boldsymbol{\Sigma}_{\mathcal{M},[n]}\mathbf{n}_n\big), \tag{5.8}$$
Figure 5.5: Illustration of the ray visibility constraint ($\gamma_n = 1$: face point visible; $\gamma_n = 0$: face point occluded). A profiled face model and a curve on the surface of the input point cloud are presented in front of a depth camera. Three cases are shown. (a) Case-I: a partial face region is fitted to the input point cloud, while the remaining facial regions are occluded. (b) Case-II: the face model is completely occluded. (c) Case-III: part of the face region is visible and in front of the point cloud, and the remaining face regions are occluded. Best viewed in electronic version.
where $\sigma_o^2$ is the data noise variance from the surface modeling and the sensor's systematic error. This distribution is derived by marginalizing the face model distribution:
$$p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_n;\boldsymbol{\theta}) = \int \mathcal{N}\big(y_n\,|\,\Delta(\mathbf{q}_n;\mathbf{p}_n), \sigma_o^2\big)\, p_{\mathcal{Q}}(\mathbf{q}_n;\boldsymbol{\theta})\, d\mathbf{q}_n.$$
Therefore, we can classify each point $\mathbf{q}_n$ by its visibility according to the ray visibility constraint and assign labels $\boldsymbol{\gamma} = \{\gamma_n\}_{n=1}^{N_M}$, where $\gamma_n \in \{0, 1\}$:

i) The face model point is visible ($\gamma_n = 1$). If the point $\mathbf{q}_n$ is visible along the ray $\vec{v}(\mathbf{q}_n;\mathbf{p}_n)$, the majority of the possible signed distances $y_n$ should lie around or in front of the surface centered at $\mathbf{p}_n$. We can intuitively require that $\Delta(T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_n)$ is within the bandwidth of the distribution $p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_n)$ or is negative¹:
$$\Delta(T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_n) \le \sqrt{\sigma_o^2 + e^{2\alpha}\,\mathbf{n}_n^\top\boldsymbol{\Sigma}_{\mathcal{M},[n]}\mathbf{n}_n}.$$

ii) The face model point is occluded ($\gamma_n = 0$). Similarly, the point $\mathbf{q}_n$ is assumed to be occluded when its signed distance $y_n$ is positive and beyond the effective

¹We keep $\mathbf{n}_n$ pointing toward the captured scene; thus a negative signed distance $y_n$ means $\mathbf{q}_n$ is in front of the surface.
region of $p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_n;\boldsymbol{\theta})$:
$$\Delta(T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_n) > \sqrt{\sigma_o^2 + e^{2\alpha}\,\mathbf{n}_n^\top\boldsymbol{\Sigma}_{\mathcal{M},[n]}\mathbf{n}_n}.$$
The ray visibility constraint is associated with a ray visibility score that measures the compatibility between the visible face model points and the input point cloud, as well as the degree to which the face model is occluded.
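Both labelling rules reduce to one comparison of the signed distance against one standard deviation of the projected model distribution; a minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def classify_visibility(signed_dist, ray_var):
    """Ray visibility constraint: gamma_n = 1 (visible) when the signed
    distance is at most one standard deviation of the projected model
    distribution, gamma_n = 0 (occluded) otherwise.

    signed_dist: per-ray Delta(T(theta) o mu_M[n]; p_n)
    ray_var:     per-ray sigma_o^2 + e^{2 alpha} n^T Sigma_M[n] n
    """
    return (signed_dist <= np.sqrt(ray_var)).astype(int)
```

Negative distances (the point lies in front of the surface) always classify as visible, matching case i) above.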
Ray Visibility Score

By applying the ray visibility constraint to the current face model $\mathcal{Q}$ with pose parameter $\boldsymbol{\theta}$, we form a visibility label set $\boldsymbol{\gamma} = \{\gamma_n\}_{n=1}^{N_M}$. The ray visibility score (RVS) measures the compatibility between the distributions of the face model and the input point cloud.

Consider a ray $\vec{v}(\mathbf{q}_n,\mathbf{p}_n)$ connecting a face model point $\mathbf{q}_n$ and an input point $\mathbf{p}_n$. The distribution of $\mathbf{p}_n$ is simply
$$p_{\mathcal{P}}(y_n) = \mathcal{N}(y_n|0, \sigma_o^2)^{\gamma_n}\,U_O(y_n)^{1-\gamma_n}, \tag{5.9}$$
where $U_O(y_n) = U_O$ is a pseudo-uniform distribution. $p_{\mathcal{P}}(y_n)$ is controlled by $\gamma_n$: if $\mathbf{q}_n$ is visible, it should be near the surface centered at $\mathbf{p}_n$, i.e., compatible with the surface distribution $\mathcal{N}(y_n|0, \sigma_o^2)$; if $\mathbf{q}_n$ is occluded, the position of $\mathbf{p}_n$ can be arbitrary as long as it is in front of $\mathbf{q}_n$, so a uniform distribution $U_O(y_n)$ is suitable. Moreover, the projected face model distribution of $\mathbf{q}_n$ onto $\mathbf{p}_n$ is
$$p_{\mathcal{Q}}(y_n;\boldsymbol{\theta}) = \mathcal{N}\big(y_n\,\big|\,\Delta(T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_n),\ e^{2\alpha}\,\mathbf{n}_n^\top\boldsymbol{\Sigma}_{\mathcal{M},[n]}\mathbf{n}_n\big). \tag{5.10}$$

Therefore, for all rays $\{\vec{v}(\mathbf{q}_n,\mathbf{p}_n)\}_{n=1}^{N_M}$ intersecting the face model, the ray visibility score $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta})$ is defined to measure the similarity between $p_{\mathcal{P}}(\mathbf{y}) = \prod_{n=1}^{N_M} p_{\mathcal{P}}(y_n)$ and $p_{\mathcal{Q}}(\mathbf{y};\boldsymbol{\theta}) = \prod_{n=1}^{N_M} p_{\mathcal{Q}}(y_n;\boldsymbol{\theta})$. A convenient choice is the Kullback-Leibler divergence,
$$S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta}) = D_{KL}\left[p_{\mathcal{Q}}(\mathbf{y};\boldsymbol{\theta})\,\|\,p_{\mathcal{P}}(\mathbf{y})\right], \tag{5.11}$$
so that the more similar $p_{\mathcal{P}}(\mathbf{y})$ and $p_{\mathcal{Q}}(\mathbf{y};\boldsymbol{\theta})$ are, the smaller $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta})$ is. An optimal solution for $\boldsymbol{\theta}$ thus minimizes the ray visibility score: the distributions of the visible face model points $p_{\mathcal{Q}}(y_n;\boldsymbol{\theta})$ are
Figure 5.6: Examples of the proposed rigid pose estimation. (a) and (b) are the color images and the corresponding point clouds. (c) shows the initial alignment provided by the head detection method [8], and (d) visualizes the results of the proposed rigid pose estimation. Notice that only the generic face model is applied. It robustly estimates difficult face poses from partial scans with heavy occlusions by hands and hair, as well as profiled faces with strong self-occlusions. Best viewed in electronic version.
optimally matched to the surface distributions $p_{\mathcal{P}}(y_n)$, while each of the remaining points suffers a constant penalty introduced by the occlusion distribution $U_O(y_n)$. Despite the occlusion description in the ray visibility score, the occlusion distribution $U_O(y)$ still encourages a larger number of visible face model points than occluded ones; this is guaranteed as long as $D_{KL}[p_{\mathcal{Q}}(y_n;\boldsymbol{\theta})\,\|\,U_O(y_n)]$ is usually larger than the incompatibility between $p_{\mathcal{Q}}(y_n;\boldsymbol{\theta})$ and $\mathcal{N}(y_n|0, \sigma_o^2)$ caused by the local shape variation of the face model.
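Since the per-ray densities are univariate Gaussians, with a pseudo-uniform density on occluded rays, the score of Equation (5.11) decomposes into closed-form per-ray KL terms; a minimal NumPy sketch (the default value of $U_O$ and all names are assumptions of this sketch):

```python
import numpy as np

def ray_visibility_score(mu_q, var_q, gamma, sigma_o2, u_o=1e-3):
    """Sum of per-ray KL divergences (Eq. 5.11): closed form between two
    univariate Gaussians on visible rays, and against the pseudo-uniform
    occlusion density U_O on occluded rays.

    mu_q, var_q: mean / variance of the projected model density p_Q(y_n)
    gamma:       visibility labels (1 visible, 0 occluded)
    """
    # KL( N(mu_q, var_q) || N(0, sigma_o2) ) for visible rays.
    kl_vis = 0.5 * (np.log(sigma_o2 / var_q) + (var_q + mu_q**2) / sigma_o2 - 1.0)
    # KL( N(mu_q, var_q) || U_O ) = -entropy(N) - log U_O for occluded rays.
    kl_occ = -0.5 * np.log(2 * np.pi * np.e * var_q) - np.log(u_o)
    return np.sum(np.where(gamma == 1, kl_vis, kl_occ))
```

A visible ray whose projected distribution matches the surface distribution exactly contributes zero, while every occluded ray contributes the constant penalty mentioned above.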
Rigid Pose Estimation
The rigid pose of the current face model is estimated by minimizing the ray visibility score,
$$\boldsymbol{\theta}^{\star} = \arg\min_{\boldsymbol{\theta}}\ S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta}). \tag{5.12}$$
However, $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta})$ is highly nonlinear, so in principle there is no off-the-shelf closed-form solution.
In practice, we apply a recursive estimation method: in each loop we alternately solve two subproblems to estimate the intermediate parameters $\boldsymbol{\theta}^{(t)}$ and $\boldsymbol{\gamma}^{(t)}$. In the first subproblem, we apply a quasi-Newton update $\boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} + \Delta\boldsymbol{\theta}$ using a trust-region approach on the ray visibility score $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta}^{(t-1)})$ under the previous visibility label set $\boldsymbol{\gamma}^{(t-1)}$. The second subproblem updates the visibility label set $\boldsymbol{\gamma}^{(t)} = \{\gamma_n^{(t)}\}_{n=1}^{N_M}$ from the current pose parameters $\boldsymbol{\theta}^{(t)}$. This iterative process terminates upon convergence or after a pre-defined number of iterations.
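The alternation can be sketched as follows (an illustrative sketch in which a finite-difference gradient step stands in for the quasi-Newton trust-region update; `score_fn` and `label_fn` are hypothetical callbacks):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def estimate_pose(theta0, score_fn, label_fn, max_iters=50, step=0.1, tol=1e-8):
    """Alternate between (i) descending the ray visibility score under
    fixed visibility labels and (ii) re-labelling visibility from the pose.

    score_fn(theta, gamma) -> scalar ray visibility score
    label_fn(theta)        -> visibility labels gamma
    """
    theta = np.asarray(theta0, float)
    gamma = label_fn(theta)
    prev = np.inf
    for _ in range(max_iters):
        f = lambda th: score_fn(th, gamma)
        for _ in range(100):                 # inner descent, labels fixed
            theta = theta - step * numeric_grad(f, theta)
        gamma = label_fn(theta)              # re-label from the new pose
        cur = score_fn(theta, gamma)
        if abs(prev - cur) < tol:
            break
        prev = cur
    return theta, gamma
```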
To further improve the proposed rigid pose estimation, a random consensus method, particle swarm optimization (PSO) [8; 108], is brought into the system. In detail, among the randomly sampled initial particles around the initial pose parameters, a small set of seed particles, those with the lowest ray visibility scores that are mutually divergent, are updated using the recursive estimation described above. The remaining particles are clustered into several subsets according to their nearness to the seed particles and are updated by the standard PSO procedure. This augmentation effectively eliminates misalignment caused by poor initialization and rectifies wrong estimates that get stuck in poor local minima of the ray visibility score.
In comparison with common techniques such as iterative closest points (ICP) [9], the proposed rigid pose estimation only needs to find the set of rays $\mathcal{V} = \{\vec{v}(\mathbf{q}_n,\mathbf{p}_n)\}_{n=1}^{N_M}$ and does not require explicit correspondences. In addition, ICP fails to handle occlusions when a poor initial pose is chosen, as shown in Figure 5.7(d). Moreover, the ray visibility score is less vulnerable to bad local minima: it is analogous to approximating the point cloud distribution $p_{\mathcal{P}}(\mathbf{y})$ with the face model distribution $p_{\mathcal{Q}}(\mathbf{y};\boldsymbol{\theta})$ rather than with a
Figure 5.7: Comparison of rigid pose estimation methods. (a) and (b) show the color image and its corresponding point cloud. (c) depicts two views of the initial alignment between the generic face model and the point cloud. (d) visualizes the result of ICP [9], and (e) the result of maximizing the likelihood modeled by the ray visibility constraint (RVC). (f) is the proposed recursive minimization of the ray visibility score (RVS), and (g) is the RVS method augmented by particle swarm optimization (RVS+PSO). See the text for details and notice that only the generic face model is applied. Best viewed in electronic version.
point estimate like maximum likelihood (ML) or maximum a posteriori (MAP). For example, maximizing the likelihood $p_{\mathcal{Q}\to\mathcal{P}}(\mathbf{y};\boldsymbol{\theta}) = \prod_{n=1}^{N_M} p_{\mathcal{Q}\to\mathcal{P}}(\mathbf{y}_n;\boldsymbol{\theta})^{\gamma_n}\, U_O(\mathbf{y}_n)^{1-\gamma_n}$ may seek a local mode that does not represent the majority of the likelihood, as shown in Figure 5.7(e). On the contrary, the Kullback-Leibler divergence employed in the ray visibility score ensures that the optimal face model distribution under the estimated $\boldsymbol{\theta}$ covers the majority of the information conveyed in $p_{\mathcal{P}}(\mathbf{y})$, while the modified particle swarm optimization further refines the facial pose. Figures 5.6 and 5.7 illustrate the superiority of the proposed RVS and RVS+PSO methods in handling unconstrained facial poses with large rotations and heavy occlusions.
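A 1-D analogue (ours, not the thesis model) makes the contrast concrete: for a bimodal target density, the coverage-style Gaussian summary matches the moments of the target and spans both modes, while a point estimate commits to a single local mode.

```python
import math

# 1-D analogue (ours, not the thesis model) of the argument above: a bimodal
# "point cloud" density p(y) is summarized by a single Gaussian. The
# moment-matched Gaussian covers both modes, while a mode-seeking point
# estimate commits to one local mode.

def normal_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def p(y):  # two modes at 0 and 10 with weights 0.6 / 0.4
    return 0.6 * normal_pdf(y, 0.0, 1.0) + 0.4 * normal_pdf(y, 10.0, 1.0)

dy = 0.01
grid = [i * dy for i in range(-500, 1501)]
mean = sum(y * p(y) for y in grid) * dy              # coverage-style estimate
var = sum((y - mean) ** 2 * p(y) for y in grid) * dy
mode = max(grid, key=p)                              # point estimate: one mode

print(mean, var, mode)  # mean ~ 4, var ~ 25, mode ~ 0
```

The moment-matched Gaussian sits between the modes with a wide variance, whereas the point estimate ignores 40% of the probability mass entirely.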
5.4.2 Online Identity Adaptation
Together with the rigid facial pose tracking, the face model is progressively updated
to adapt to the user’s identity. Because the identity is not known in advance when
96 CHAP. 5. A GENERATIVE MODEL FOR ROBUST 3D FACIAL POSE TRACKING
Figure 5.8: Examples of face model adaptation. The proposed method can successfully personalize the face model to identities of different genders and races.
a new user is being captured, we begin with a generic face model M with the initial identity and expression priors. The identity is then gradually personalized. In this work, local shape variations caused by expressions are effectively removed from the face model generation, so the estimated identity is robust to the local distortions caused by expressions.
Variational Approximation
As depicted in Section 5.3.2, the face model for one particular user is identified by a unique identity distribution $p^{\star}(\mathbf{w}_{id}) = \mathcal{N}(\mathbf{w}_{id}|\boldsymbol{\mu}^{\star}_{id}, \boldsymbol{\Sigma}^{\star}_{id})$, from which the other parameters in the face model can be derived. However, the exact identity distribution $p^{\star}(\mathbf{w}_{id})$ is not known until adequate depth samples are available, so the face identity adaptation is performed through a sequential update algorithm, namely assumed-density filtering (ADF) [64]. It approximates the Gaussian distribution $p^{(t)}(\mathbf{w}_{id})$ from the posterior induced by the current likelihood $p_L(\mathbf{y}^{(t)}|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)})$ and the previous best estimate $p^{(t-1)}(\mathbf{w}_{id})$. Provided with sufficiently many depth frames $T$, we have $p^{\star}(\mathbf{w}_{id}) \simeq p^{(T)}(\mathbf{w}_{id})$.
We need a well-defined likelihood $p_L(\mathbf{y}^{(t)}|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)})$ that both models the distances from the face model points to the surface of $\mathcal{P}$ if the points are visible, and handles the occlusions if the points are occluded,
$$p_L(\mathbf{y}^{(t)}|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)}) = \sum_{\boldsymbol{\gamma}} \prod_{n=1}^{N_M} p_{\mathcal{Q}\to\mathcal{P}}(\mathbf{y}^{(t)}_n|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)})^{\gamma_n}\, U_O(\mathbf{y}^{(t)}_n)^{1-\gamma_n}\, p(\boldsymbol{\gamma}), \tag{5.13}$$
where $p^{(t)}(\boldsymbol{\gamma}) = \prod_{n=1}^{N_M} (\pi^{(t)}_n)^{\gamma_n} (1-\pi^{(t)}_n)^{1-\gamma_n}$ is a product of Bernoulli distributions, one for each model point. In contrast to the rigid pose estimation, the labels are not given deterministically but generated from a prior distribution, enabling a soft assignment of whether a face model point is occluded. The projection distribution $p_{\mathcal{Q}\to\mathcal{P}}(\mathbf{y}^{(t)}_n|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)})$ takes a form similar to $p_{\mathcal{Q}\to\mathcal{P}}(\mathbf{y}^{(t)}_n;\boldsymbol{\theta}^{(t)})$, but with the mean value
$$m_n = \Delta\big(\mathbf{T}(\boldsymbol{\theta}^{(t)}) \circ (\mathbf{f}_n + \mathbf{P}_{id}\mathbf{w}_{id});\, \mathbf{p}_n\big) \tag{5.14}$$
and covariance $\xi^2 = \sigma^2_o + e^{2\alpha^{(t)}_n}\,\mathbf{n}^{(t)\top}_n \boldsymbol{\Sigma}^{(t-1)}_{E,[n]} \mathbf{n}^{(t)}_n$. To suppress the quantization errors in the input depth image, we introduce a robust modification of the projection distance, $\tilde{\Delta}(\mathbf{q}_n;\mathbf{p}_n) = \operatorname{sign}(\Delta(\mathbf{q}_n;\mathbf{p}_n)) \max\{|\Delta(\mathbf{q}_n;\mathbf{p}_n)| - \epsilon, 0\}$.

The identity distribution $p^{(t)}(\mathbf{w}_{id}) = \mathcal{N}(\mathbf{w}_{id}|\boldsymbol{\mu}^{(t)}_{id},\boldsymbol{\Sigma}^{(t)}_{id})$ is estimated by minimizing the Kullback-Leibler divergence $D_{KL}[p^{(t)}(\mathbf{w}_{id})\,\|\,p(\mathbf{w}_{id}|\mathbf{y}^{(t)})]$ [64]; in other words, we expect the true posterior
$$p(\mathbf{w}_{id}|\mathbf{y}^{(t)}) = \frac{p_L(\mathbf{y}^{(t)}|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)})\, p^{(t-1)}(\mathbf{w}_{id})}{p(\mathbf{y}^{(t)})} \simeq p^{(t)}(\mathbf{w}_{id}). \tag{5.15}$$
The parameters of $p^{(t)}(\mathbf{w}_{id})$ are estimated through the variational Bayes framework [64]. We empirically find that this process converges within 3 to 5 iterations.
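The assumed-density-filtering step just described can be illustrated in a scalar setting. The sketch below (ours, not the thesis implementation) multiplies a Gaussian estimate by an occlusion-aware likelihood with a visible component and a uniform outlier component, then projects the exact posterior back to a Gaussian by matching its first two moments; the function name and all numeric values are illustrative assumptions.

```python
import math

# One assumed-density-filtering step in a scalar setting (our illustrative
# analogue, not the thesis implementation): a Gaussian estimate of w is
# multiplied by an occlusion-aware likelihood
#   l(d|w) = pi_vis * N(d|w, xi^2) + (1 - pi_vis) * U_out,
# and the posterior is projected back to a Gaussian by moment matching.

def npdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def adf_step(mu, var, d, xi2, pi_vis, u_out):
    """Return the moment-matched Gaussian (mu_new, var_new)."""
    ev = pi_vis * npdf(d, mu, xi2 + var) + (1 - pi_vis) * u_out  # evidence
    r = pi_vis * npdf(d, mu, xi2 + var) / ev   # responsibility of "visible"
    m1 = (xi2 * mu + var * d) / (xi2 + var)    # Gaussian-product mean
    v1 = xi2 * var / (xi2 + var)               # Gaussian-product variance
    mu_new = r * m1 + (1 - r) * mu             # outlier branch keeps the prior
    e2 = r * (m1 ** 2 + v1) + (1 - r) * (mu ** 2 + var)
    return mu_new, e2 - mu_new ** 2

mu_new, var_new = adf_step(mu=0.0, var=4.0, d=3.0, xi2=1.0,
                           pi_vis=0.8, u_out=1e-3)
print(mu_new, var_new)  # the mean moves toward d and the variance shrinks
```

The outlier component damps the update: when the observation is implausible under the visible component, the responsibility `r` drops and the previous estimate is largely retained.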
To quickly capture the identity of a new user after the face model has been personalized, we add a relaxation to the covariance matrix of $p^{(t)}(\mathbf{w}_{id})$ as $\boldsymbol{\Sigma}^{(t)}_{id} \leftarrow (\lambda+1)\boldsymbol{\Sigma}^{(t)}_{id}$ immediately after the identity adaptation. This is analogous to adding more variance around $\boldsymbol{\mu}^{(t)}_{id}$ along the identity space described by $\boldsymbol{\Sigma}^{(t)}_{id}$, so it neither loses the ability to describe a new face that differs from the current face model, nor fails to preserve the structure of the estimated identity space. The hyperparameter $\lambda$ is empirically set to 0.25.
Online Adaptation
The identity of the face model is adapted online through a two-step procedure for each
frame.
Figure 5.9: We continuously adapt the identity of the face model to different users. (a)-(c) show examples in which the face model is gradually personalized as facial depth data from different poses are captured during the tracking process. The face model is initialized with the generic face model shown in Figure 5.2.
At first, given the previous identity distribution $p^{(t-1)}(\mathbf{w}_{id})$, we generate the distribution $p^{(t-1)}_{\mathcal{M}}(\mathbf{f})$ of the face model $\mathcal{M}^{(t-1)}$ via Equation (5.3). From the ray visibility constraint based on the previous face model $p^{(t-1)}_{\mathcal{M}}(\mathbf{f})$ and the current surface model of $\mathcal{P}^{(t)}$, we obtain the ray visibility score $S(\mathcal{Q}^{(t-1)},\mathcal{P}^{(t)};\boldsymbol{\theta})$. After iterative optimization of this score, the current rigid facial pose $\boldsymbol{\theta}^{(t)}$ is obtained.
Secondly, the face model is updated through the variational approximation given the optimal rigid pose $\boldsymbol{\theta}^{(t)}$. In particular, $\boldsymbol{\pi}^{(t)}$ encourages a soft assignment between the face model points and the input point cloud, and the robust projection function reduces the quantization errors. In the end, the identity of the face model is updated to $p^{(t)}(\mathbf{w}_{id}) = \mathcal{N}(\mathbf{w}_{id}|\boldsymbol{\mu}^{(t)}_{id},\boldsymbol{\Sigma}^{(t)}_{id})$. We can further estimate the remaining parameters and gather them together with the identity parameters as the face model parameter set $\boldsymbol{\theta}^{(t)}_F = \{\boldsymbol{\mu}^{(t)}_{\mathcal{M}}, \boldsymbol{\Sigma}^{(t)}_{\mathcal{M}}, \boldsymbol{\mu}^{(t)}_{id}, \boldsymbol{\Sigma}^{(t)}_{id}, \boldsymbol{\Sigma}^{(t)}_E, \boldsymbol{\Sigma}^{(t)}_I\}$. These parameters generate the updated face model distribution $p^{(t)}_{\mathcal{M}}(\mathbf{f})$ and facilitate the rigid facial pose estimation and identity adaptation in the next frame.
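The per-frame two-step procedure can be sketched as the following skeleton. The function names and the trivial stub bodies are ours for illustration only; in the real system, step 1 minimizes the ray visibility score (with PSO augmentation) and step 2 runs the variational identity update.

```python
# Skeleton of the per-frame procedure above; `estimate_rigid_pose` and
# `update_identity` are hypothetical stubs standing in for the RVS(+PSO)
# optimization and the ADF identity update.

def estimate_rigid_pose(face_model, cloud, theta_prev):
    # Step 1 stub: pretend the ray visibility score optimization converged
    # to the cloud's "pose" field.
    return cloud["pose"]

def update_identity(identity, cloud, theta):
    # Step 2 stub: blend the identity toward the cloud's "identity" field,
    # mimicking a sequential (ADF-like) update.
    blend = 0.5
    return [(1 - blend) * a + blend * b
            for a, b in zip(identity, cloud["identity"])]

def track(frames, identity0, theta0):
    identity, theta = list(identity0), theta0
    for cloud in frames:                                      # one depth frame
        theta = estimate_rigid_pose(identity, cloud, theta)   # pose first
        identity = update_identity(identity, cloud, theta)    # then identity
    return identity, theta

frames = [{"pose": p, "identity": [1.0, 2.0]} for p in (0.1, 0.2, 0.3)]
identity, theta = track(frames, identity0=[0.0, 0.0], theta0=0.0)
print(identity, theta)  # identity drifts toward [1, 2]; theta follows last frame
```

The ordering matters: the identity update is conditioned on the pose found in the same frame, which is why the pose estimation runs first.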
5.5 Experiments and Discussions
In this section, we present the experiments on public depth-based facial pose datasets
and real scenarios to demonstrate the effectiveness of our robust 3D facial pose tracking
algorithm with the generative face model.
Section 5.5.1 introduces the datasets employed for evaluation and comparison, and the system setup of the proposed method. We then quantitatively evaluate the proposed method against state-of-the-art algorithms on these public datasets, and qualitatively visualize the performance of facial pose tracking and identity adaptation in Section 5.5.2. At the end of this section, we discuss some limitations in Section 5.5.3.
5.5.1 Datasets And System Setup
Datasets
We evaluate the performance of the proposed method and compare it with state-of-the-art algorithms on two public datasets, i.e., the Biwi Kinect head pose dataset [49] and the ICT 3D head pose (ICT-3DHP) dataset [94]. The dataset information is summarized in Table 5.1.

Biwi Dataset: The Biwi dataset contains over 15K RGB-D images of 20 subjects (different genders and races) in 24 sequences, with large ranges of rotations and translations. The recorded faces suffer occlusions from hair and face shape variations
Dataset         N_seq   N_frm   N_subj   Occlusions          ω_max
Biwi [49]       24      ~15K    20       accessories, hair   ±75° yaw, ±60° pitch
ICT-3DHP [94]   10      ~14K    10       accessories, hair   ±75° yaw, ±45° pitch

Table 5.1: Summary of the facial pose datasets.
Figure 5.10: Tracking results on the Biwi dataset with the personalized face models. Our system is robust to profiled faces due to large rotations, and to occlusions from hair and accessories. The 1st and 2nd rows show the corresponding color and depth image pairs. The third row visualizes the extracted point clouds of the head regions and the overlaid personalized face models. Best viewed in the electronic version.
from expressions. The Biwi dataset provides ground-truth head pose parameters for each frame, obtained with the off-the-shelf software Faceshift², as well as pixel-wise binary masks for the detected face regions.
ICT-3DHP Dataset: The ICT-3DHP dataset provides 10 Kinect RGB-D sequences covering 6 males and 4 females. The data contain occlusions and distortions similar to the Biwi dataset, and each subject also shows arbitrary expression variations. The ground-truth rotation parameters were measured externally by a Polhemus Fastrack flock-of-birds tracker [94] attached to a cap on each subject, but the translation parameters are not reliable.
² http://www.faceshift.com/
Figure 5.11: Tracking results on the ICT-3DHP dataset. The proposed system is also robust to expression variations. Best viewed in the electronic version.
System Setup
We implemented the proposed 3D facial pose tracking algorithm on a MATLAB platform
equipped with the parallel computing toolbox. The results were measured on a 3.4
GHz Intel Core i7 processor with 16GB RAM. No GPU acceleration was applied.
Here we define the hyperparameters utilized in the proposed system. The dimensions of the face model are $N_M = 11510$, $N_{id} = 150$ and $N_{exp} = 47$, while the truncated face model has smaller identity and expression dimensions $N_{id} = 28$ and $N_{exp} = 7$. The generic face model owns the identity and expression priors $\mathcal{N}(\mathbf{w}_{id}|\mathbf{U}^{\star}_{id}\mathbf{1}, \tfrac{1}{150}\mathbf{I})$ and $\mathcal{N}(\mathbf{w}_{exp}|\mathbf{U}^{\star}_{exp}\mathbf{1}, \tfrac{1}{47}\mathbf{I})$. The noise variance along the surface of the input point cloud is $\sigma^2_o = 25$, while the outlier distribution is characterized by $U_O(\mathbf{y}) = U_O = \tfrac{1}{2500}$. Note that the measurement unit used in this work is millimeters (mm).
The proposed algorithm adapts the identity online over a period of frames, stopping either when a pre-defined number of frames is reached (50 in this work) or when the evolution of the adapted face model has converged. The online face adaptation is performed every 10 frames, which not only captures different facial parts but also reduces the redundancy caused by the subtle differences in visible face coverage between adjacent frames.
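The adaptation schedule just described can be written as a small predicate (our sketch). The interval of 10 frames and the 50-frame budget are the values quoted above; the convergence flag is an assumption supplied by the caller.

```python
# Adaptation schedule (our sketch): adapt every `interval`-th frame until the
# frame budget is spent or the face model has converged. The interval of 10
# and the 50-frame budget are the values quoted in the text.

def should_adapt(frame_idx, budget=50, interval=10, converged=False):
    if converged or frame_idx > budget:
        return False
    return frame_idx % interval == 0

adapt_frames = [t for t in range(1, 101) if should_adapt(t)]
print(adapt_frames)  # [10, 20, 30, 40, 50]
```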
5.5.2 Quantitative and Qualitative Evaluations
Table 5.2 shows the average absolute errors of the rotation angles and the average Euclidean errors of the translation for the proposed method and the reference methods on the Biwi dataset. The rotational errors are further broken down into the average absolute errors of the yaw, pitch and roll angles, respectively. Similarly, Table 5.3 reports the average absolute angle errors for yaw, pitch and roll on the ICT-3DHP dataset.
The proposed method
Comparing the tracking performance of the generic and the personalized face models, the latter achieves better results on both the rotation and the translation metrics. By gradually adapting the face model to each subject, the personalized distributions of the shape and the expression enable the face model to fit compactly to the input point cloud and make the estimated facial pose robust to changes in the personalized expressions. Figure 5.10 and Figure 5.11 demonstrate successful tracking poses on the Biwi and ICT-3DHP datasets based on the personalized face models. The performance based on the generic face model also reveals its superiority in challenging cases such as occlusions and expression variations, as shown in Figure 5.6.
As for the rigid pose tracking, the proposed ray visibility constraint, as shown in Figures 5.10, 5.11, 5.6 and 5.7, efficiently infers the occlusions caused by hair, accessories and hands, as well as self-occlusions like profiled faces. In contrast, point-to-plane ICP [9] cannot always distinguish the occluders from the face model since it is not constrained by the visibility cue. In addition, the proposed ray visibility score inherently suggests that the more visible vertices the face model has, the lower $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta})$ will be; optimally, the number of visible facial points should be maximized. Similar observations have been explored and proven helpful for increasing pose tracking accuracy in the references [8; 47]. $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta})$ ensures an optimal coverage between the distribution of the warped face model and the surface of the input point cloud, and thus yields a more robust estimate than point-estimate solutions, such as MAP or ML estimation based on point-to-plane ICP or the ray visibility constraint.
The online identity adaptation progressively personalizes the face model to the test subject.
Figure 5.12: The proposed system can automatically adapt a face model from one identity to another. Top: three identities presented successively in adjacent frames. Bottom: the tracked face models adapted to the current identity. Please note the differences in head and nose shapes among the visualized face models.
Based on the properties of the assumed-density filter applied in the online identity adaptation, the speed of convergence depends on the portion of visible face model points revealed in each frame. For example, a subject with no occlusions usually converges faster than a heavily occluded subject. Moreover, the covariance $\boldsymbol{\Sigma}^{(t)}_{id}$ does not grow to infinity even with infinitely many frames of one subject entering the system, while the shape of the identity distribution described by $\boldsymbol{\Sigma}^{(t)}_{id}$ is preserved. This property offers the promising ability to switch the online identity adaptation from one subject to a new one with a smooth facial identity transfer, as visualized in Figure 5.12.
Comparison with the state of the art

A number of prior arts [8; 48; 49; 94; 102; 108; 112] for depth-based 3D facial pose tracking have also been evaluated on the Biwi [49] and ICT-3DHP [94] datasets (listed in Table 5.2 and Table 5.3), serving as references for the performance evaluation of the proposed method. The results of the reference methods are as reported by their authors.

On the Biwi dataset, the proposed method produced the lowest rotation errors among the depth-based head pose tracking algorithms, including discriminative methods like random forests [49], generative model-fitting methods like CLM-Z [94], Martin et al. [112] and Meyer et al. [8], as well as feature-based methods
Method         Yaw (°)   Pitch (°)   Roll (°)   Translation (mm)
ours           2.3       2.0         1.9        6.9
RF [49]        8.9       8.5         7.9        14.0
Martin [112]   3.6       2.5         2.6        5.8
CLM-Z [94]     14.8      12.0        23.3       16.7
TSP [48]       3.9       3.0         2.5        8.4
PSO [108]      11.1      6.6         6.7        13.8
Meyer [8]      2.1       2.1         2.4        5.9
Li⋆ [102]      2.2       1.7         3.2        −

Table 5.2: Evaluations on the Biwi dataset.
Method       Yaw (°)   Pitch (°)   Roll (°)
ours         3.4       3.2         3.3
RF [49]      7.2       9.4         7.5
CLM-Z [94]   6.9       7.1         10.5
Li⋆ [102]    3.3       3.1         2.9

Table 5.3: Evaluations on the ICT-3DHP dataset.
like the triangular surface patch method [48]. Although the missing appearance information introduces uncertainties in the estimated facial pose, the proposed approach performs comparably with the current state-of-the-art method [102] (marked with ⋆ in Tables 5.2 and 5.3), which employs RGB-D data. A similar conclusion can be drawn on the ICT-3DHP dataset, where the proposed method also shows superior performance in estimating the rotational parameters compared with depth-based approaches like random forests [49] and CLM-Z [94]. Its performance is similar to that of Li [102] even though no color information is used.
As for the translational parameters, the proposed method also delivers state-of-the-art performance compared with the depth-based approaches on the Biwi dataset³. The slight degradation relative to Meyer et al. [8] in the translation parameters may be due to the incompatibility between the model center of the ground-truth face model in the Biwi dataset and that of the proposed multilinear face model.

³No reliable ground-truth translation parameters are available for the ICT-3DHP dataset [94].
5.5.3 Limitations
The proposed system is inevitably vulnerable when the input depth video is contaminated by heavy noise, outliers and quantization errors. For example, a Kinect depth video capturing a long-distance user may severely quantize his/her facial structure and thus prevent stable facial pose estimation. On the other hand, effective cues like facial landmarks are inaccessible because the color information is not available, so hard facial poses receiving low confidence from the ray visibility constraint may still be unreliable. However, this kind of unreliability can be relieved by constraining the temporal coherency of facial poses among adjacent frames, e.g., by Kalman filtering or other temporal smoothing techniques.
5.6 Summary
We introduced a robust facial pose tracking method for commodity depth sensors that achieves state-of-the-art performance on two popular facial pose datasets. The proposed generative face model and the ray visibility score ensure robust tracking that effectively handles heavy occlusions, profiled faces due to large rotation angles, and expression variations. The generative model adapts to identities of different ages, races and genders. Its modeling of identity and expression uncertainties enables a groupwise optimization of the facial pose that is optimal over all identities and expressions encoded in the face model, and its separation of identity and expression parameters avoids interference from expression variations during face model personalization. The ray visibility constraint focuses on the visibility of face model points rather than explicit correspondences, and its information-theoretic ray visibility score offers a more robust treatment of the facial pose estimation.
A number of future directions could lead to a more stable and accurate facial pose tracking system. Effective temporal coherency deserves more attention since it provides smoother tracking trajectories and predicts reliable future facial poses from previous motion patterns. The scene flow problem is another interesting direction, as it provides subtle per-point motion variations from both the global rigid pose and the local expression variations, introducing new constraints for facial pose estimation and expression recognition. Moreover, developing more robust depth-based features would be helpful, as they would provide semantic correspondences between the face model and the measured face data.
Chapter 6
Conclusions and Future Work
This thesis has presented spatio-temporal RGB-D video enhancement and applications of image/video processing and computer vision based on RGB-D videos. In particular, using probabilistic generative models, this thesis addresses three problems: (1) spatial enhancement for eliminating the noise, outliers and depth-missing holes in a raw depth image; (2) temporal enhancement for long-range temporal consistency adaptive to the content of the RGB-D video; and (3) robust 3D facial pose tracking with online face model personalization under uncontrolled scenarios and heavy occlusions. We conclude our work in Section 6.1 and discuss future work in Section 6.2.
6.1 Conclusions
This thesis first demonstrates a new guided depth image enhancement approach, a hybrid strategy merging filtering-based depth interpolation with segment-based parametric structure propagation. Thanks to a novel arbitrary-shape and texture-constrained patch matching method for robust structure inference, the segments in the depth holes can be reliably aligned with parametric structures of similar texture and/or depth statistics. Experiments reveal that the proposed method outperforms the reference methods on the depth hole filling and surface smoothing problems.
Secondly, this thesis proposes novel weighted structure filters based on parametric structural decomposition. In detail, a novel distribution construction method is demonstrated that accelerates the weighted median/mode filters with a separable kernel, based on probabilistic generative models adaptive to the structure of the input image. Different from traditional brute-force methods requiring hundreds of filtering operations for sufficiently accurate results, the proposed approach only requires a very small number of filtering operations determined by the structure of the input image. The accelerated weighted median and weighted mode filters are effective in various applications, including depth map enhancement, joint depth upsampling and detail enhancement.
This thesis also presents a novel method for robust temporally consistent depth enhancement by introducing a probabilistic intermediate static structure. The dynamic region of the input depth video is enhanced spatially, while the static region is substituted by the updated static structure so as to favor long-range spatio-temporal enhancement. Quantitative evaluation shows the efficiency and robustness of the parameter estimation for the static structure and illustrates superior performance in comparison to various static scene estimation approaches. Qualitative evaluation reveals that the proposed method operates well on various indoor scenes and different depth cameras, and that the proposed temporally consistent depth video enhancement works satisfactorily in comparison with existing methods.
Finally, this thesis introduces a robust facial pose tracking system with adaptive face model personalization, designed for commodity depth sensors and achieving state-of-the-art performance on two popular facial pose datasets. The proposed generative face model and the visibility-constrained, information-theoretic rigid pose estimation techniques enable a more efficient and effective facial pose tracking method than the prior arts. Qualitative and quantitative results demonstrate that the proposed method can effectively handle unconstrained facial tracking cases like heavy occlusions, profiled faces with large rotation angles, and expression changes during the tracking procedure. Moreover, the proposed probabilistic multilinear face model possesses sufficient descriptive power for a variety of identities across different ages, races and genders with varying expressions.
6.2 Future Work
While we have listed potential future work for each problem at the end of its corresponding chapter, we highlight several other suggestions here.
In addition to the parametric mixture model for the weighted local distribution approximation, we can consider a non-parametric representation for describing the local image statistics. This weighted structural prior centered at a pixel $\mathbf{x}$ is analogous to a non-parametric kernel density estimator, but augmented with fully-connected pixel-wise relationships (similar to the fully-connected conditional random field discussed in Section 4.3.4) between any pair of pixels $\{\mathbf{x},\mathbf{y}\}$. Combined with a suitable data likelihood, the underlying structure map can be discovered by maximum a posteriori (MAP) estimation through efficient variational mean-field approximation. This prior guarantees that the extracted structure map is piece-wise smooth within the same piece of image structure but distinct across image discontinuities. It also responds to the observations and assumptions discussed in Section 3.3.
Appendix A
Approximation for the Gaussian Kernel
Given a set of manifolds within the domain of $\mathbf{f}$, $\{\boldsymbol{\eta}_k \in \mathbb{R}^d \,|\, k=1,2,\ldots,K\}$, the weighted distribution constructed by the Gaussian kernel can be further derived as
$$h(\mathbf{x}, \mathbf{f}) = \frac{1}{Z(\mathbf{x})} \sum_{\mathbf{y}\in\Omega_{\mathbf{x}}} w(\mathbf{x},\mathbf{y})\, \phi(\mathbf{f}_{\mathbf{y}} - \mathbf{f};\, \boldsymbol{\Sigma}_F) \tag{A.1}$$
$$\propto \sum_{\mathbf{y}\in\Omega_{\mathbf{x}}} w(\mathbf{x},\mathbf{y}) \int_{\boldsymbol{\eta}_{\mathbf{x}}\in\mathbb{R}^d} \phi(\mathbf{f}_{\mathbf{y}} - \boldsymbol{\eta}_{\mathbf{x}};\, \boldsymbol{\Sigma}_F - \boldsymbol{\Sigma}_{\mathbf{x}})\, \phi(\boldsymbol{\eta}_{\mathbf{x}} - \mathbf{f};\, \boldsymbol{\Sigma}_{\mathbf{x}})\, d\boldsymbol{\eta}_{\mathbf{x}} \tag{A.2}$$
$$\simeq \sum_{\mathbf{y}\in\Omega_{\mathbf{x}}} w(\mathbf{x},\mathbf{y}) \cdot s_K \cdot \sum_{k=1}^{K} \phi(\mathbf{f}_{\mathbf{y}} - \boldsymbol{\eta}_{k\mathbf{y}};\, \boldsymbol{\Sigma}_{k\mathbf{y}})\, \phi(\mathbf{f} - \boldsymbol{\eta}_{k\mathbf{x}};\, \boldsymbol{\Sigma}_{k\mathbf{x}}) \tag{A.3}$$
$$= \sum_{\mathbf{y}\in\Omega_{\mathbf{x}}} w(\mathbf{x},\mathbf{y}) \cdot s_K \cdot \sum_{k=1}^{K} p_{\mathbf{x}}(\mathbf{f}|k)\, p_{\mathbf{y}}(\mathbf{f}_{\mathbf{y}}|k). \tag{A.4}$$
Note that the Gauss-Hermite quadrature rule is applied in this derivation. The approximation is valid when the local manifolds $\boldsymbol{\eta}_k$ are sufficiently smooth and the summation of the variances at pixels $\mathbf{x}$ and $\mathbf{y}$ is around $\boldsymbol{\Sigma}_F$. For a detailed interpretation, please refer to [31].
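The identity underlying (A.2) — a Gaussian with covariance $\boldsymbol{\Sigma}_F$ is the convolution of two Gaussians whose covariances sum to $\boldsymbol{\Sigma}_F$ — can be checked numerically in 1-D. The sketch below (ours; the values of $f$, $f_y$ and the variance split are arbitrary) compares the direct kernel evaluation against a grid approximation of the convolution integral.

```python
import math

# Numerical check (ours) of the 1-D version of the identity used in (A.2):
# phi(f_y - f; S_F) equals the convolution of phi(.; S_F - S_x) with
# phi(.; S_x), for any split 0 < S_x < S_F.

def npdf(x, var):
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

fy, f = 1.3, 0.2
S_F, S_x = 2.0, 0.7

direct = npdf(fy - f, S_F)

step = 0.01
etas = [-20.0 + i * step for i in range(4001)]
conv = sum(npdf(fy - e, S_F - S_x) * npdf(e - f, S_x) for e in etas) * step

print(direct, conv)  # the two values agree up to discretization error
```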
Appendix B
Generative Model for Static Structure
B.1 Probabilistic Generative Mixture Model
The proposed static structure is modeled by a probabilistic generative mixture model. Three states are introduced to describe the cases that the input depth samples may occupy, each with its own distribution:

• State-I: fitting the static structure, $p(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, m^I_{\mathbf{x}}=1) = \mathcal{N}(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, \xi^2_{\mathbf{x}})$;

• State-F: forward outliers, $p(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, m^F_{\mathbf{x}}=1) = U_f(d^t_{\mathbf{x}}|Z_{\mathbf{x}}) = U_f \cdot \mathbb{1}[d^t_{\mathbf{x}} < Z_{\mathbf{x}}]$;

• State-B: backward outliers, $p(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, m^B_{\mathbf{x}}=1) = U_b(d^t_{\mathbf{x}}|Z_{\mathbf{x}}) = U_b \cdot \mathbb{1}[d^t_{\mathbf{x}} > Z_{\mathbf{x}}]$.
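The three per-state densities can be transcribed directly. In the sketch below (ours), the function names and the default values $\xi^2 = 25$, $U_f = U_b = 10^{-3}$ are illustrative assumptions; depths are in millimeters.

```python
import math

# The three per-state densities above (our transcription; default values are
# illustrative). U_f and U_b are uniform levels on the near/far side of the
# current static depth Z.

def normal_pdf(d, Z, xi2):
    return math.exp(-(d - Z) ** 2 / (2 * xi2)) / math.sqrt(2 * math.pi * xi2)

def state_density(state, d, Z, xi2=25.0, Uf=1e-3, Ub=1e-3):
    if state == "I":          # inlier: the sample fits the static structure
        return normal_pdf(d, Z, xi2)
    if state == "F":          # forward outlier: strictly in front of Z
        return Uf if d < Z else 0.0
    if state == "B":          # backward outlier: strictly behind Z
        return Ub if d > Z else 0.0
    raise ValueError(state)

print(state_density("I", 1000.0, 1000.0))  # Gaussian peak, 1/sqrt(2*pi*25)
print(state_density("F", 900.0, 1000.0))   # in front of the structure: Uf
print(state_density("B", 900.0, 1000.0))   # not behind the structure: 0
```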
To combine all three states into a unified model and describe the overall likelihood that the input depth samples fit the current static structure, we use a mixture model similar to the Gaussian mixture model [64]. Together with prior distributions of the hidden variable $m_{\mathbf{x}}$ and the static structure $Z_{\mathbf{x}}$, we can estimate the posterior with respect to $Z_{\mathbf{x}}$ to infer the most probable static structure given the input depth samples, and the posterior with respect to $m_{\mathbf{x}}$ to indicate the states that the input depth samples belong to.
B.1.1 Likelihood
The likelihood of the input depth sample $d^t_{\mathbf{x}}$ with respect to the depth value of the static structure $Z_{\mathbf{x}}$ and the hidden state indicator $m_{\mathbf{x}}$ is
$$p(d^t_{\mathbf{x}}|m_{\mathbf{x}}, Z_{\mathbf{x}}) = \mathcal{N}(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, \xi^2_{\mathbf{x}})^{m^I_{\mathbf{x}}} \cdot U_f(d^t_{\mathbf{x}}|Z_{\mathbf{x}})^{m^F_{\mathbf{x}}} \cdot U_b(d^t_{\mathbf{x}}|Z_{\mathbf{x}})^{m^B_{\mathbf{x}}}, \tag{B.1}$$
which switches among these states by setting one specific $m^k_{\mathbf{x}} = 1$, $k \in \Psi = \{I, F, B\}$, and the rest to 0.
B.1.2 Prior Distributions
Given the likelihood as well as suitable prior distributions, we will have a tractable
joint distribution. Thus the choices of the priors are essential to ensure tractable and
efficient estimation of the joint distribution as well as the posteriors.
To be compatible with the likelihood in Section B.1.1, we introduce a Gaussian distribution for $Z_{\mathbf{x}}$,
$$p(Z_{\mathbf{x}}) = \mathcal{N}(Z_{\mathbf{x}}|\mu_{\mathbf{x}}, \sigma^2_{\mathbf{x}}). \tag{B.2}$$
The prior for $m_{\mathbf{x}}$ needs to cope with the switching property that $m_{\mathbf{x}}$ offers. We thus employ the categorical distribution, which outputs a probability $\omega^k_{\mathbf{x}}$ when a state $m^k_{\mathbf{x}}$ is activated,
$$p(m_{\mathbf{x}}|\boldsymbol{\omega}_{\mathbf{x}}) = \mathrm{Cat}(m_{\mathbf{x}}|\boldsymbol{\omega}_{\mathbf{x}}) = \prod_{k\in\Psi} \big(\omega^k_{\mathbf{x}}\big)^{m^k_{\mathbf{x}}}, \quad \text{given } \sum_{k\in\Psi} \omega^k_{\mathbf{x}} = 1. \tag{B.3}$$
This distribution, however, introduces an additional parameter $\boldsymbol{\omega}_{\mathbf{x}}$, which itself needs an explicit distribution [64]. We apply the Dirichlet distribution,
$$p(\boldsymbol{\omega}_{\mathbf{x}}) = \mathrm{Dir}\big(\boldsymbol{\omega}_{\mathbf{x}}|\alpha^I_{\mathbf{x}}, \alpha^F_{\mathbf{x}}, \alpha^B_{\mathbf{x}}\big) = \mathrm{Dir}(\boldsymbol{\omega}_{\mathbf{x}}|\boldsymbol{\alpha}_{\mathbf{x}}), \quad \text{given } \alpha^k_{\mathbf{x}} \geq 0,\ k\in\Psi. \tag{B.4}$$
The reason to introduce $p(\boldsymbol{\omega}_{\mathbf{x}})$ is that we want to model the chance that each state may occur, so that we can judge the reliability of the estimated static structure. Furthermore, given a prior distribution for $\boldsymbol{\omega}_{\mathbf{x}}$, we can estimate the posterior with respect to $\boldsymbol{\omega}_{\mathbf{x}}$ as a series of data enter the model.
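A generative draw from these priors can be sketched as follows (ours; the outlier ranges and the inlier noise scale are arbitrary illustration values). The Dirichlet draw uses normalized Gamma variates from the standard library.

```python
import random

# A generative draw from the priors above (our sketch; outlier ranges and
# noise scale are arbitrary): omega ~ Dir(alpha), the state m ~ Cat(omega),
# then d from the chosen state density.

def sample_depth(alpha, mu, sigma, rng):
    Z = rng.gauss(mu, sigma)                            # Z ~ N(mu, sigma^2)
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]  # omega ~ Dir(alpha)
    total = sum(gammas)
    omega = [g / total for g in gammas]
    u, state = rng.random(), "B"                        # m ~ Cat(omega)
    if u < omega[0]:
        state = "I"
    elif u < omega[0] + omega[1]:
        state = "F"
    if state == "I":
        d = rng.gauss(Z, 5.0)                           # inlier noise
    elif state == "F":
        d = rng.uniform(Z - 500.0, Z)                   # forward outlier
    else:
        d = rng.uniform(Z, Z + 500.0)                   # backward outlier
    return d, state, Z

rng = random.Random(0)
draws = [sample_depth([8.0, 1.0, 1.0], 1000.0, 10.0, rng) for _ in range(2000)]
frac_inlier = sum(s == "I" for _, s, _ in draws) / len(draws)
print(frac_inlier)  # close to E[omega_I] = 8/10
```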
B.1.3 Joint Distribution
Given the input depth sample $d^t_{\mathbf{x}}$, the joint distribution can be written as
$$p(d^t_{\mathbf{x}}, Z_{\mathbf{x}}, m_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}) = p(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, m_{\mathbf{x}}; \xi_{\mathbf{x}})\, p(Z_{\mathbf{x}}; \mu_{\mathbf{x}}, \sigma_{\mathbf{x}})\, p(m_{\mathbf{x}}|\boldsymbol{\omega}_{\mathbf{x}})\, p(\boldsymbol{\omega}_{\mathbf{x}}; \boldsymbol{\alpha}_{\mathbf{x}}), \tag{B.5}$$
in which the parameter set is $\mathcal{P}_{\mathbf{x}} = \{\xi_{\mathbf{x}}, \mu_{\mathbf{x}}, \sigma_{\mathbf{x}}, \boldsymbol{\alpha}_{\mathbf{x}}\}$. By marginalizing the hidden variable, we obtain a joint distribution that only contains two variables, the depth value $Z_{\mathbf{x}}$ and the state chances $\boldsymbol{\omega}_{\mathbf{x}}$, together with the observation $d^t_{\mathbf{x}}$ and the parameters $\mathcal{P}_{\mathbf{x}}$:
$$p(d^t_{\mathbf{x}}, Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}) = p(Z_{\mathbf{x}}; \mu_{\mathbf{x}}, \sigma_{\mathbf{x}})\, p(\boldsymbol{\omega}_{\mathbf{x}}; \boldsymbol{\alpha}_{\mathbf{x}}) \times \big[\omega^I_{\mathbf{x}} \mathcal{N}(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, \xi^2_{\mathbf{x}}) + \omega^F_{\mathbf{x}} U_f\big(d^t_{\mathbf{x}}|Z_{\mathbf{x}}\big) + \omega^B_{\mathbf{x}} U_b\big(d^t_{\mathbf{x}}|Z_{\mathbf{x}}\big)\big], \tag{B.6}$$
which is a weighted combination of the three state densities multiplied with the prior distributions of $Z_{\mathbf{x}}$ and $\boldsymbol{\omega}_{\mathbf{x}}$.
B.1.4 Data Evidence
The data evidence $p(d^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})$ is calculated by further marginalizing the variables $Z_{\mathbf{x}}$ and $\boldsymbol{\omega}_{\mathbf{x}}$:
$$p(d^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}) = \int_{Z_{\mathbf{x}}} \int_{\boldsymbol{\omega}_{\mathbf{x}}} p(d^t_{\mathbf{x}}, Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})\, dZ_{\mathbf{x}}\, d\boldsymbol{\omega}_{\mathbf{x}}$$
$$= \frac{1}{\sum_{k\in\Psi}\alpha^k_{\mathbf{x}}} \left\{ \alpha^I_{\mathbf{x}}\, \mathcal{N}\big(d^t_{\mathbf{x}}|\mu_{\mathbf{x}}, \xi^2_{\mathbf{x}} + \sigma^2_{\mathbf{x}}\big) + \big(\alpha^B_{\mathbf{x}} U_b - \alpha^F_{\mathbf{x}} U_f\big)\, \Phi\!\left(\frac{d^t_{\mathbf{x}} - \mu_{\mathbf{x}}}{\sigma_{\mathbf{x}}}\right) + \alpha^F_{\mathbf{x}} U_f \right\}. \tag{B.7}$$
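The closed form (B.7) can be sanity-checked numerically (our sketch; the parameter values are arbitrary): marginalizing the three-state mixture over $Z \sim \mathcal{N}(\mu,\sigma^2)$ on a grid must reproduce the analytic value.

```python
import math

# Numerical sanity check (ours; arbitrary parameter values) of the closed-form
# data evidence (B.7) against grid-based marginalization over Z.

def npdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ncdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

aI, aF, aB = 8.0, 1.0, 1.0
mu, sigma, xi = 1000.0, 10.0, 5.0
Uf, Ub = 1e-3, 2e-3
d = 1008.0
a0 = aI + aF + aB

closed = (aI * npdf(d, mu, xi * xi + sigma * sigma)
          + (aB * Ub - aF * Uf) * ncdf(d, mu, sigma)
          + aF * Uf) / a0

step = 0.05
zs = [mu - 8 * sigma + i * step for i in range(int(16 * sigma / step) + 1)]
numeric = sum((aI * npdf(d, z, xi * xi)
               + aF * (Uf if d < z else 0.0)
               + aB * (Ub if d > z else 0.0)) / a0 * npdf(z, mu, sigma * sigma)
              for z in zs) * step

print(closed, numeric)  # agree up to discretization error
```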
B.1.5 Posteriors with First-order Markov Chain
In this work, we estimate the posterior in an online fashion; that is, the posterior is estimated frame by frame, with new data sequentially increasing the confidence of the static structure,
$$p(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}|\mathcal{D}^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}) = \frac{1}{p(d^t_{\mathbf{x}}|\mathcal{D}^{t-1}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})}\, p(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})\, p(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}|\mathcal{D}^{t-1}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}). \tag{B.8}$$
The posterior with respect to the hidden variable $m_{\mathbf{x}}$ indicates the distribution of the states that the input depth sample may occupy; similar to Equation (B.8),
$$p(m_{\mathbf{x}}|\mathcal{D}^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}) = \frac{1}{p(d^t_{\mathbf{x}}|\mathcal{D}^{t-1}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})} \times \int_{Z_{\mathbf{x}}} \int_{\boldsymbol{\omega}_{\mathbf{x}}} p(d^t_{\mathbf{x}}|m_{\mathbf{x}}, Z_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})\, p(m_{\mathbf{x}}|\boldsymbol{\omega}_{\mathbf{x}})\, p\big(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}|\mathcal{D}^{t-1}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}\big)\, dZ_{\mathbf{x}}\, d\boldsymbol{\omega}_{\mathbf{x}}. \tag{B.9}$$
These posteriors are complex and not easy to estimate, so we employ a variational approximation in which the posterior $p(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}|\mathcal{D}^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})$ is factorized into the product of an independent Gaussian distribution $q_t(Z_{\mathbf{x}})$ and an independent Dirichlet distribution $q_t(\boldsymbol{\omega}_{\mathbf{x}})$ with suitable parameters. The posterior $p(m_{\mathbf{x}}|\mathcal{D}^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})$ can also be rewritten by substituting $p(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}|\mathcal{D}^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})$ with the approximated posterior $q_t(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}})$.
B.2 Derivations of the Results in Variational Approximation
In this section, we show the detailed derivations of the results presented in Section 4.3.2. For brevity, we omit the related superscripts and subscripts of parameters and variables, writing $\{d^t_{\mathbf{x}}, Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}\}$ as $\{d, Z, \boldsymbol{\omega}\}$ and $\{\mu^{t-1}_{\mathbf{x}}, \sigma^{t-1}_{\mathbf{x}}, \mu^t_{\mathbf{x}}, \sigma^t_{\mathbf{x}}, \xi_{\mathbf{x}}\}$ as $\{\mu, \sigma, \mu_{\mathrm{new}}, \sigma_{\mathrm{new}}, \xi\}$. Moreover, $\{\alpha^{I,t-1}_{\mathbf{x}}, \alpha^{F,t-1}_{\mathbf{x}}, \alpha^{B,t-1}_{\mathbf{x}}, \alpha^{I,t}_{\mathbf{x}}, \alpha^{F,t}_{\mathbf{x}}, \alpha^{B,t}_{\mathbf{x}}, \sum_{k\in\Psi}\alpha^{k,t-1}_{\mathbf{x}}, \sum_{k\in\Psi}\alpha^{k,t}_{\mathbf{x}}\}$ is written as $\{\alpha_1, \alpha_2, \alpha_3, \alpha^{\mathrm{new}}_1, \alpha^{\mathrm{new}}_2, \alpha^{\mathrm{new}}_3, \alpha_0, \alpha^{\mathrm{new}}_0\}$.
B.2.1 Approximated Joint Distributions
Approximated Joint Distributions Q(Z,ω, d)
Incorporating the properties of the Gaussian and Dirichlet distributions, the approximated joint distribution is a mixture of products of Gaussian and Dirichlet distributions:
$$Q(Z,\boldsymbol{\omega},d) = p(d|Z,\boldsymbol{\omega})\, q_{t-1}(Z,\boldsymbol{\omega}) \tag{B.10}$$
$$= \big[\omega^I \mathcal{N}(d|Z,\xi^2) + \omega^F U_f(d|Z) + \omega^B U_b(d|Z)\big]\, \mathcal{N}(Z|\mu,\sigma^2)\, \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1,\alpha_2,\alpha_3)$$
$$= \frac{\alpha_1}{\alpha_0} \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big)\, \mathcal{N}\!\left(Z \,\Big|\, \frac{\xi^2\mu + \sigma^2 d}{\xi^2+\sigma^2},\, \frac{\xi^2\sigma^2}{\xi^2+\sigma^2}\right) \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1+1,\alpha_2,\alpha_3)$$
$$+ \frac{\alpha_2}{\alpha_0} U_f(d|Z)\, \mathcal{N}\big(Z|\mu,\sigma^2\big)\, \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1,\alpha_2+1,\alpha_3)$$
$$+ \frac{\alpha_3}{\alpha_0} U_b(d|Z)\, \mathcal{N}\big(Z|\mu,\sigma^2\big)\, \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1,\alpha_2,\alpha_3+1). \tag{B.11}$$
Approximated Joint Distributions Q(Z, d) and Q(ω, d)
It is easy to calculate the moments related to $Z$ and $\boldsymbol{\omega}$ by estimating the moments of the approximated posteriors $Q(Z|d)$ and $Q(\boldsymbol{\omega}|d)$. Specifically, we need the joint distribution with respect to $Z$ and $d$,
$$Q(Z,d) = \int_{\boldsymbol{\omega}} Q(Z,\boldsymbol{\omega},d)\, d\boldsymbol{\omega} \tag{B.12}$$
$$= \frac{\alpha_1}{\alpha_0} \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big)\, \mathcal{N}\!\left(Z \,\Big|\, \frac{\xi^2\mu + \sigma^2 d}{\xi^2+\sigma^2},\, \frac{\xi^2\sigma^2}{\xi^2+\sigma^2}\right) + \frac{\alpha_2}{\alpha_0} U_f(d|Z)\, \mathcal{N}\big(Z|\mu,\sigma^2\big) + \frac{\alpha_3}{\alpha_0} U_b(d|Z)\, \mathcal{N}\big(Z|\mu,\sigma^2\big), \tag{B.13}$$
and the joint distribution with respect to $\boldsymbol{\omega}$ and $d$,
$$Q(\boldsymbol{\omega},d) = \int_{Z} Q(Z,\boldsymbol{\omega},d)\, dZ$$
$$= \frac{\alpha_1}{\alpha_0} \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big)\, \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1+1,\alpha_2,\alpha_3) + \frac{\alpha_3}{\alpha_0} U_b\, \Phi\!\left(\frac{d-\mu}{\sigma}\right) \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1,\alpha_2,\alpha_3+1) + \frac{\alpha_2}{\alpha_0} U_f \left(1 - \Phi\!\left(\frac{d-\mu}{\sigma}\right)\right) \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1,\alpha_2+1,\alpha_3). \tag{B.14}$$
B.2.2 Approximated Data Evidence For The Observation
Similarly, the approximated data evidence is
$$q_t(d) = \int_{Z} \int_{\boldsymbol{\omega}} Q(Z,\boldsymbol{\omega},d)\, dZ\, d\boldsymbol{\omega} \tag{B.15}$$
$$= \frac{\alpha_1}{\alpha_0} \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big) + \frac{\alpha_3}{\alpha_0} U_b\, \Phi\!\left(\frac{d-\mu}{\sigma}\right) + \frac{\alpha_2}{\alpha_0} U_f \left(1 - \Phi\!\left(\frac{d-\mu}{\sigma}\right)\right), \tag{B.16}$$
which is also analytic as long as the parameters are known. The posteriors $Q(Z|d)$ and $Q(\boldsymbol{\omega}|d)$ are calculated accordingly, by dividing the joint distributions $Q(Z,d)$ and $Q(\boldsymbol{\omega},d)$ by the data evidence $q_t(d)$.
B.2.3 Parameter Updating for the Approximated Static Structure
The parameter estimation for $q_t(Z)$ matches the first and second moments of $q_t(Z)$ and $Q(Z|d)$. The first moment is
$$\mu_{\mathrm{new}} = \mathbb{E}_{Q(Z|d)}[Z] \tag{B.17}$$
$$= \frac{1}{q_t(d)\,\alpha_0} \left\{ \alpha_1 \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big)\, \frac{\xi^2\mu + \sigma^2 d}{\xi^2+\sigma^2} + \alpha_2 U_f \left[ \mu \left(1 - \Phi\!\left(\frac{d-\mu}{\sigma}\right)\right) + \sigma^2 \mathcal{N}\big(d|\mu,\sigma^2\big) \right] + \alpha_3 U_b \left[ \mu\, \Phi\!\left(\frac{d-\mu}{\sigma}\right) - \sigma^2 \mathcal{N}\big(d|\mu,\sigma^2\big) \right] \right\}, \tag{B.18}$$
which can be further written as
$$\mu_{\mathrm{new}} = \frac{1}{q_t(d)\,\alpha_0} \left\{ \alpha_1 \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big)\, \frac{\xi^2\mu + \sigma^2 d}{\xi^2+\sigma^2} + \alpha_2 U_f\, \mu + (\alpha_3 U_b - \alpha_2 U_f) \left[ \mu\, \Phi\!\left(\frac{d-\mu}{\sigma}\right) - \sigma^2 \mathcal{N}\big(d|\mu,\sigma^2\big) \right] \right\}.$$
The second moment follows in a similar fashion:
$$\mu^2_{\mathrm{new}} + \sigma^2_{\mathrm{new}} = \mathbb{E}_{Q(Z|d)}[Z^2] \tag{B.19}$$
$$= \frac{1}{q_t(d)\,\alpha_0} \left\{ \alpha_1 \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big) \left[ \left(\frac{\xi^2\mu + \sigma^2 d}{\xi^2+\sigma^2}\right)^2 + \frac{\xi^2\sigma^2}{\xi^2+\sigma^2} \right] + (\alpha_3 U_b - \alpha_2 U_f) \left[ (\mu^2+\sigma^2)\, \Phi\!\left(\frac{d-\mu}{\sigma}\right) - (d+\mu)\,\sigma^2\, \mathcal{N}\big(d|\mu,\sigma^2\big) \right] + \alpha_2 U_f \big(\mu^2 + \sigma^2\big) \right\}. \tag{B.20}$$
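The moment-matching formulas can be sanity-checked numerically (our sketch; the parameter values are arbitrary): the closed-form mean (B.18) and second moment (B.20), normalized by the evidence (B.16), must agree with grid-based moments of $Q(Z|d)$.

```python
import math

# Sanity check (ours; arbitrary parameter values) of (B.16), (B.18), (B.20)
# against grid-based moments of Q(Z|d).

def npdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ncdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

a1, a2, a3 = 8.0, 1.0, 1.0                 # alpha_I, alpha_F, alpha_B
mu, sigma, xi = 1000.0, 10.0, 5.0
Uf, Ub = 1e-3, 2e-3
d = 1008.0
a0 = a1 + a2 + a3
s2, v2 = sigma * sigma, xi * xi
Phi = ncdf(d, mu, sigma)

qd = (a1 * npdf(d, mu, v2 + s2) + a3 * Ub * Phi + a2 * Uf * (1 - Phi)) / a0  # (B.16)

m_prod = (v2 * mu + s2 * d) / (v2 + s2)    # Gaussian-product mean
v_prod = v2 * s2 / (v2 + s2)               # Gaussian-product variance
mu_new = (a1 * npdf(d, mu, v2 + s2) * m_prod
          + a2 * Uf * (mu * (1 - Phi) + s2 * npdf(d, mu, s2))
          + a3 * Ub * (mu * Phi - s2 * npdf(d, mu, s2))) / (qd * a0)         # (B.18)
e2 = (a1 * npdf(d, mu, v2 + s2) * (m_prod ** 2 + v_prod)
      + (a3 * Ub - a2 * Uf) * ((mu * mu + s2) * Phi
                               - (d + mu) * s2 * npdf(d, mu, s2))
      + a2 * Uf * (mu * mu + s2)) / (qd * a0)                                # (B.20)
var_new = e2 - mu_new ** 2

# grid-based reference moments of Q(Z|d)
step = 0.02
zs = [mu - 8 * sigma + i * step for i in range(int(16 * sigma / step) + 1)]
w = [((a1 * npdf(d, z, v2)
       + a2 * (Uf if d < z else 0.0)
       + a3 * (Ub if d > z else 0.0)) / a0) * npdf(z, mu, s2) for z in zs]
Zn = sum(w) * step
g_mean = sum(z * wi for z, wi in zip(zs, w)) * step / Zn
g_var = sum((z - g_mean) ** 2 * wi for z, wi in zip(zs, w)) * step / Zn

print(mu_new, g_mean, var_new, g_var)
```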
B.2.4 Parameter Updating for the Approximated State Frequencies
The parameters $\alpha_k^{\mathrm{new}}$, $k \in \{1, 2, 3\}$, are calculated by introducing auxiliary variables $m_i$ and $m_i^{(2)}$, $i = 1, 2, 3$, which denote the first and second moments of $q_t(\omega)$ with respect to $\omega$ [84]. The first moments follow from the properties of the Dirichlet distribution:
\begin{align}
m_1 = \frac{\alpha_1^{\mathrm{new}}}{\alpha_0^{\mathrm{new}}} = \mathbb{E}_{Q(\omega_1|d)}[\omega_1] &= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{\alpha_1+1}{\alpha_0+1} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{\alpha_1}{\alpha_0+1} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{\alpha_1}{\alpha_0+1}, \tag{B.21}\\
m_2 = \frac{\alpha_2^{\mathrm{new}}}{\alpha_0^{\mathrm{new}}} = \mathbb{E}_{Q(\omega_2|d)}[\omega_2] &= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{\alpha_2}{\alpha_0+1} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{\alpha_2+1}{\alpha_0+1} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{\alpha_2}{\alpha_0+1}, \tag{B.22}\\
m_3 = \frac{\alpha_3^{\mathrm{new}}}{\alpha_0^{\mathrm{new}}} = \mathbb{E}_{Q(\omega_3|d)}[\omega_3] &= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{\alpha_3}{\alpha_0+1} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{\alpha_3}{\alpha_0+1} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{\alpha_3+1}{\alpha_0+1}. \tag{B.23}
\end{align}
The second moments are calculated as follows:
\begin{align}
m_1^{(2)} = \mathbb{E}_{Q(\omega_1|d)}[\omega_1^2] &= \frac{\alpha_1^{\mathrm{new}}(\alpha_1^{\mathrm{new}}+1)}{\alpha_0^{\mathrm{new}}(\alpha_0^{\mathrm{new}}+1)} \tag{B.24}\\
&= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{(\alpha_1+1)(\alpha_1+2)}{(\alpha_0+1)(\alpha_0+2)} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{\alpha_1(\alpha_1+1)}{(\alpha_0+1)(\alpha_0+2)} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{\alpha_1(\alpha_1+1)}{(\alpha_0+1)(\alpha_0+2)}, \tag{B.25}\\
m_2^{(2)} = \mathbb{E}_{Q(\omega_2|d)}[\omega_2^2] &= \frac{\alpha_2^{\mathrm{new}}(\alpha_2^{\mathrm{new}}+1)}{\alpha_0^{\mathrm{new}}(\alpha_0^{\mathrm{new}}+1)} \tag{B.26}\\
&= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{\alpha_2(\alpha_2+1)}{(\alpha_0+1)(\alpha_0+2)} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{(\alpha_2+1)(\alpha_2+2)}{(\alpha_0+1)(\alpha_0+2)} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{\alpha_2(\alpha_2+1)}{(\alpha_0+1)(\alpha_0+2)}, \tag{B.27}\\
m_3^{(2)} = \mathbb{E}_{Q(\omega_3|d)}[\omega_3^2] &= \frac{\alpha_3^{\mathrm{new}}(\alpha_3^{\mathrm{new}}+1)}{\alpha_0^{\mathrm{new}}(\alpha_0^{\mathrm{new}}+1)} \tag{B.28}\\
&= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{\alpha_3(\alpha_3+1)}{(\alpha_0+1)(\alpha_0+2)} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{\alpha_3(\alpha_3+1)}{(\alpha_0+1)(\alpha_0+2)} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{(\alpha_3+1)(\alpha_3+2)}{(\alpha_0+1)(\alpha_0+2)}. \tag{B.29}
\end{align}
The parameters are then recovered from the introduced moment variables as
\begin{align}
\alpha_0^{\mathrm{new}} = \frac{\sum_{i=1}^3 \left(m_i - m_i^{(2)}\right)}{\sum_{i=1}^3 \left(m_i^{(2)} - m_i^2\right)}, \qquad \alpha_i^{\mathrm{new}} = \alpha_0^{\mathrm{new}}\, m_i, \quad i = 1, 2, 3. \tag{B.30}
\end{align}
B.2.5 Approximated Posterior for the State Frequencies
Similarly, the approximated posterior with respect to each state is
• State-I: fitting the static structure
\begin{equation}
q_t(m_x = \mathrm{I}\,|\,d_x^t) = \frac{\alpha_x^{\mathrm{I}}\,\mathcal{N}\!\left(d_x^t\,\middle|\,\mu_x^{t-1},\,\xi_x^2+(\sigma_x^{t-1})^2\right)}{q_t(d_x^t)\sum_{k\in\Psi}\alpha_x^k}; \tag{B.31}
\end{equation}
• State-F: forward outliers
\begin{equation}
q_t(m_x = \mathrm{F}\,|\,d_x^t) = \frac{\alpha_x^{\mathrm{F}}\,U_f\left(1-\Phi\!\left((d_x^t-\mu_x^{t-1})/\sigma_x^{t-1}\right)\right)}{q_t(d_x^t)\sum_{k\in\Psi}\alpha_x^k}; \tag{B.32}
\end{equation}
• State-B: backward outliers
\begin{equation}
q_t(m_x = \mathrm{B}\,|\,d_x^t) = \frac{\alpha_x^{\mathrm{B}}\,U_b\,\Phi\!\left((d_x^t-\mu_x^{t-1})/\sigma_x^{t-1}\right)}{q_t(d_x^t)\sum_{k\in\Psi}\alpha_x^k}. \tag{B.33}
\end{equation}
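Since the three responsibilities share the common denominator $q_t(d_x^t)\sum_{k\in\Psi}\alpha_x^k$, they can be computed by normalizing the per-state numerators. A minimal sketch, with our own naming and the outlier densities $U_f$, $U_b$ treated as constants:

```python
import math

def norm_pdf(x, mu, var):
    # Gaussian density N(x | mu, var), parameterized by the variance.
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(x, mu, sigma):
    # Gaussian CDF Phi((x - mu) / sigma).
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def state_posteriors(d, mu, sigma, xi, alpha, U_f, U_b):
    """Responsibilities of states (I, F, B), Eqs. (B.31)-(B.33)."""
    aI, aF, aB = alpha
    phi = norm_cdf(d, mu, sigma)
    w = [aI * norm_pdf(d, mu, xi ** 2 + sigma ** 2),  # State-I: static structure
         aF * U_f * (1.0 - phi),                      # State-F: forward outliers
         aB * U_b * phi]                              # State-B: backward outliers
    s = sum(w)  # equals q_t(d) * sum_k alpha_k, so the terms normalize to one
    return [wi / s for wi in w]
```

A sample far behind the current static-structure estimate (large $d - \mu$) should be absorbed almost entirely by the backward-outlier state, which the sketch reproduces.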
Appendix C
The Choice of Depth Noise Standard Deviation
C.1 Depth Map from Stereo or Kinect
Since a depth map obtained from a stereo rig or Kinect is actually estimated via disparity estimation, the conversion between depth and disparity is
\begin{equation}
\frac{d^{\mathrm{disp}}}{B} = \frac{f}{d} \;\Longrightarrow\; d = \frac{fB}{d^{\mathrm{disp}}}, \tag{C.1}
\end{equation}
where $d^{\mathrm{disp}}$ is the disparity, $d$ is the depth, $f$ is the focal length of the camera, and $B$ is the baseline between the stereo sensors.
The noise and outliers in the depth map originate from errors in the disparity map. Assuming Gaussian noise and uniform outliers in the disparity map, we derive their characteristics in the corresponding depth map. Given a universal Gaussian noise standard deviation $\sigma_n^{\mathrm{disp}}$ in the disparity map, a noisy disparity value $d_n^{\mathrm{disp}}$ deviates from its mean $\mu_n^{\mathrm{disp}}$. Converting the noisy disparity value into depth, we have
\begin{align}
d_n &= \frac{fB}{d_n^{\mathrm{disp}}} \tag{C.2}\\
&= \frac{fB}{\mu_n^{\mathrm{disp}} + (d_n^{\mathrm{disp}} - \mu_n^{\mathrm{disp}})} \tag{C.3}\\
&= \frac{fB}{\mu_n^{\mathrm{disp}}}\cdot\frac{1}{1 + (d_n^{\mathrm{disp}} - \mu_n^{\mathrm{disp}})/\mu_n^{\mathrm{disp}}} \approx \frac{fB}{\mu_n^{\mathrm{disp}}}\left(1 + \frac{\mu_n^{\mathrm{disp}} - d_n^{\mathrm{disp}}}{\mu_n^{\mathrm{disp}}}\right) = 2\mu_n - \mu_n\frac{d_n^{\mathrm{disp}}}{\mu_n^{\mathrm{disp}}}. \tag{C.4}
\end{align}
Here the mean $\mu_n = fB/\mu_n^{\mathrm{disp}}$. The first-order approximation requires the constraint $|\mu_n^{\mathrm{disp}} - d_n^{\mathrm{disp}}| < \mu_n^{\mathrm{disp}}$, which is satisfied in general settings. Thus the mean value of $d_n$ is $\mathbb{E}[d_n] = 2\mu_n - \mu_n\mathbb{E}[d_n^{\mathrm{disp}}]/\mu_n^{\mathrm{disp}} = \mu_n$, and its variance is
\begin{align}
\sigma_n^2 &= \mathbb{E}\left[(d_n - \mu_n)^2\right] \tag{C.5}\\
&= \frac{\mu_n^2}{(\mu_n^{\mathrm{disp}})^2}\,\mathbb{E}\left[(\mu_n^{\mathrm{disp}} - d_n^{\mathrm{disp}})^2\right] \tag{C.6}\\
&= \left(\frac{\mu_n}{\mu_n^{\mathrm{disp}}}\right)^2\left(\sigma_n^{\mathrm{disp}}\right)^2 \tag{C.7}\\
&= \frac{\mu_n^4}{(fB)^2}\left(\sigma_n^{\mathrm{disp}}\right)^2 \;\Longrightarrow\; \sigma_n = \sigma_n^{\mathrm{disp}}\,\frac{\mu_n^2}{fB}. \tag{C.8}
\end{align}
The outliers in the depth map are still modeled by a uniform distribution.
Therefore, to better model the static structure estimation, we set the depth noise standard deviation $\xi_x \propto (d_x^t)^2/(fB)$, a function of the depth sample $d_x^t$. Samples with larger depth values require larger standard deviations to fit their noise.
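The quadratic growth of $\sigma_n$ with depth can be checked empirically: perturb a disparity with Gaussian noise, convert to depth, and compare the induced spread against the prediction of Eq. (C.8). The numeric values of $f$, $B$, and the disparity statistics below are hypothetical, chosen only for illustration.

```python
import random
import statistics

def depth_noise_std(depth, f, B, sigma_disp):
    """Predicted depth noise std of Eq. (C.8): sigma_n = sigma_disp * mu_n^2 / (f B)."""
    return sigma_disp * depth ** 2 / (f * B)

# Monte-Carlo check of the first-order approximation.
f, B = 570.0, 0.075              # hypothetical focal length (px) and baseline (m)
mu_disp, sigma_disp = 40.0, 0.5  # hypothetical disparity mean and noise std (px)
random.seed(0)
depths = [f * B / (mu_disp + random.gauss(0.0, sigma_disp)) for _ in range(100000)]
empirical = statistics.stdev(depths)
predicted = depth_noise_std(f * B / mu_disp, f, B, sigma_disp)
```

With a relative disparity noise of about 1%, the empirical depth spread matches the first-order prediction to within a few percent; the approximation degrades only when $\sigma_n^{\mathrm{disp}}$ becomes comparable to $\mu_n^{\mathrm{disp}}$.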
C.2 Depth Map from Other Sources
For depth map obtained by other sources, the noise standard deviation ξx = σ is a
constant over the image domain. If the property of the systematic error for a depth
sensor is available, the standard deviation ξx can be modeled more specifically.
Bibliography
[1] C. Richardt, C. Stoll, N. A. Dodgson, H.-P. Seidel, and C. Theobalt, "Coherent spatio-temporal filtering, upsampling and rendering of RGBZ videos," Computer Graphics Forum (Proceedings of Eurographics), vol. 31, no. 2, May 2012.
[2] L. Wang, H. Jin, R. Yang, and M. Gong, "Stereoscopic inpainting: Joint color and depth completion from stereo images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2008, pp. 1–8.
[3] M. Kass and J. Solomon, "Smoothed local histogram filters," ACM Trans. Graph., vol. 29, no. 4, p. 100, 2010.
[4] Z. Ma, K. He, Y. Wei, J. Sun, and E. Wu, "Constant time weighted median filtering for stereo matching and beyond," in Proc. IEEE Int. Conf. Comput. Vis., 2013.
[5] D. Min, J. Lu, and M. Do, "Depth video enhancement based on weighted mode filtering," vol. 21, no. 3, pp. 1176–1190, March 2012.
[6] M. Lang, O. Wang, T. Aydin, A. Smolic, and M. Gross, "Practical temporal consistency for image-based graphics applications," ACM Trans. Graph., vol. 31, no. 4, pp. 34:1–34:8, Jul. 2012.
[7] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou, "FaceWarehouse: A 3D facial expression database for visual computing," vol. 20, no. 3, pp. 413–425, 2014.
[8] G. P. Meyer, S. Gupta, I. Frosio, D. Reddy, and J. Kautz, "Robust model-based 3D head pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 3649–3657.
[9] S. Rusinkiewicz and M. Levoy, "Efficient variants of the ICP algorithm," in 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on. IEEE, 2001, pp. 145–152.
[10] J. Smisek, M. Jancosek, and T. Pajdla, 3D with Kinect. London: Springer London, 2013, pp. 3–25. [Online]. Available: http://dx.doi.org/10.1007/978-1-4471-4640-7_1
[11] J. Diebel and S. Thrun, "An application of Markov random fields to range sensing," in Advances in Neural Information Processing Systems, vol. 18. MIT Press, 2005, pp. 291–298.
[12] J. Yang, X. Ye, K. Li, and C. Hou, "Depth recovery using an adaptive color-guided auto-regressive model," in Proc. Euro. Conf. Comput. Vis. Springer, 2012, pp. 158–171.
[13] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, "High quality depth map upsampling for 3D-ToF cameras," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 1623–1630.
[14] D. Herrera, J. Kannala, J. Heikkila et al., "Depth map inpainting under a second-order smoothness prior," in Image Analysis. Springer, 2013, pp. 555–566.
[15] C. D. Herrera, J. Kannala, P. Sturm, and J. Heikkila, "A learned joint depth and intensity prior using Markov random fields," in Proc. IEEE 3DTV-CON, 2013, pp. 17–24.
[16] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120–145, 2011.
[17] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruther, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 993–1000.
[18] X. Shen, C. Zhou, L. Xu, and J. Jia, "Mutual-structure for joint filtering," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 3406–3414.
[19] B. Ham, M. Cho, and J. Ponce, "Robust image filtering using joint static and dynamic guidance," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2015, pp. 4823–4831.
[20] Y. Kim, B. Ham, C. Oh, and K. Sohn, "Structure selective depth superresolution for RGB-D cameras," vol. 25, no. 11, pp. 5227–5238, 2016.
[21] B. Ham, D. Min, and K. Sohn, "Depth superresolution by transduction," vol. 24, no. 5, pp. 1524–1535, 2015.
[22] S. Lu, X. Ren, and F. Liu, "Depth enhancement via low-rank matrix completion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 3390–3397.
[23] K. Matsuo and Y. Aoki, "Depth image enhancement using local tangent plane approximations," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2015, pp. 3574–3583.
[24] D. Min, S. Choi, J. Lu, B. Ham, K. Sohn, and M. N. Do, "Fast global image smoothing based on weighted least squares," vol. 23, no. 12, pp. 5638–5653, 2014.
[25] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, "Joint bilateral upsampling," in ACM Trans. Graph., vol. 26, no. 3, 2007, p. 96.
[26] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Proc. IEEE Int. Conf. Comput. Vis., 1998, pp. 839–846.
[27] B. Huhle, T. Schairer, P. Jenke, and W. Straßer, "Fusion of range and color images for denoising and resolution enhancement with a non-local filter," Comput. Vis. Image Understanding, vol. 114, no. 12, pp. 1336–1345, 2010.
[28] J. Dolson, J. Baek, C. Plagemann, and S. Thrun, "Upsampling range data in dynamic environments," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1141–1148.
[29] D. Chan, H. Buisman, C. Theobalt, S. Thrun et al., "A noise-aware filter for real-time depth upsampling," in Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications - M2SFA2, 2008.
[30] F. Garcia, B. Mirbach, B. Ottersten, F. Grandidier, and A. Cuesta, "Pixel weighted average strategy for depth sensor data fusion," in Proc. IEEE Int. Conf. Image Process., 2010, pp. 2805–2808.
[31] E. S. L. Gastal and M. M. Oliveira, "Adaptive manifolds for real-time high-dimensional filtering," ACM Trans. Graph., vol. 31, no. 4, pp. 33:1–33:13, 2012.
[32] ——, "Domain transform for edge-aware image and video processing," ACM Trans. Graph., vol. 30, no. 4, pp. 69:1–69:12, Jul. 2011.
[33] Q. Yang, N. Ahuja, R. Yang, K.-H. Tan, J. Davis, B. Culbertson, J. Apostolopoulos, and G. Wang, "Fusion of median and bilateral filtering for range image upsampling," vol. 22, no. 12, pp. 4841–4852, Dec 2013.
[34] Q. Yang, R. Yang, J. Davis, and D. Nister, "Spatial-depth super resolution for range images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2007, pp. 1–8.
[35] J. Lu, H. Yang, D. Min, and M. Do, "Patch match filter: Efficient edge-aware filtering meets randomized search for fast correspondence field estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2013, pp. 1854–1861.
[36] L. Sheng and K. N. Ngan, "Depth enhancement based on hybrid geometric hole filling strategy," in Proc. IEEE Int. Conf. Image Process., Sept 2013, pp. 2173–2176.
[37] H. Li, P. Roivainen, and R. Forchheimer, "3-D motion estimation in model-based facial image coding," vol. 15, no. 6, pp. 545–555, Jun 1993.
[38] M. J. Black and Y. Yacoob, "Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun 1995, pp. 374–381.
[39] D. DeCarlo and D. Metaxas, "Optical flow constraints on deformable models with applications to face tracking," International Journal of Computer Vision, vol. 38, no. 2, pp. 99–127, 2000. [Online]. Available: http://dx.doi.org/10.1023/A:1008122917811
[40] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," in Proceedings of the 26th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 1999, pp. 187–194.
[41] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," no. 6, pp. 681–685, 2001.
[42] D. Cristinacce and T. Cootes, "Automatic feature localisation with constrained local models," Pattern Recognition, vol. 41, no. 10, pp. 3054–3067, 2008.
[43] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1867–1874.
[44] J. M. Saragih, S. Lucey, and J. F. Cohn, "Deformable model fitting by regularized landmark mean-shift," International Journal of Computer Vision, vol. 91, no. 2, pp. 200–215, 2011.
[45] X. Xiong and F. Torre, "Supervised descent method and its applications to face alignment," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
[46] Y. Sun and L. Yin, "Automatic pose estimation of 3D facial models," in Proc. IEEE Int. Conf. Pattern Recognit. IEEE, 2008, pp. 1–4.
[47] M. D. Breitenstein, D. Kuettel, T. Weise, L. Van Gool, and H. Pfister, "Real-time face pose estimation from single range images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2008, pp. 1–8.
[48] C. Papazov, T. K. Marks, and M. Jones, "Real-time 3D head pose and facial landmark estimation from depth images using triangular surface patch features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4722–4730.
[49] G. Fanelli, T. Weise, J. Gall, and L. Van Gool, "Real time head pose estimation from consumer depth cameras," in Pattern Recognition. Springer, 2011, pp. 101–110.
[50] G. Fanelli, J. Gall, and L. Van Gool, "Real time head pose estimation with random regression forests," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2011, pp. 617–624.
[51] G. Riegler, D. Ferstl, M. Ruther, and H. Bischof, "Hough networks for head pose estimation and facial feature localization," in Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
[52] V. Kazemi, C. Keskin, J. Taylor, P. Kohli, and S. Izadi, "Real-time face reconstruction from a single depth image," in 3D Vision (3DV), 2014 2nd International Conference on, vol. 1. IEEE, 2014, pp. 369–376.
[53] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," vol. 34, no. 11, pp. 2274–2282, Nov 2012.
[54] S. Perreault and P. Hebert, "Median filtering in constant time," vol. 16, no. 9, pp. 2389–2394, 2007.
[55] D. Cline, K. White, and P. Egbert, "Fast 8-bit median filtering based on separability," in Proc. IEEE Int. Conf. Image Process., vol. 5, Sept 2007, pp. V-281–V-284.
[56] J. Van de Weijer and R. Van den Boomgaard, "Local mode filtering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2, 2001, pp. II-428.
[57] E. Parzen, "On estimation of a probability density function and mode," Annals of Mathematical Statistics, vol. 33, pp. 1065–1076, Sep. 1962.
[58] D. Barash and D. Comaniciu, "A common framework for nonlinear diffusion, adaptive smoothing, bilateral filtering and mean shift," Image and Vision Computing, vol. 22, no. 1, pp. 73–81, 2004.
[59] K. He, J. Sun, and X. Tang, "Guided image filtering," in Proc. Euro. Conf. Comput. Vis. Springer, 2010, pp. 1–14.
[60] S. Paris and F. Durand, "A fast approximation of the bilateral filter using a signal processing approach," in Proc. Euro. Conf. Comput. Vis. Springer, 2006, pp. 568–580.
[61] J. Chen, S. Paris, and F. Durand, "Real-time edge-aware image processing with the bilateral grid," in ACM Trans. Graph., vol. 26, no. 3, 2007, p. 103.
[62] A. Adams, N. Gelfand, J. Dolson, and M. Levoy, "Gaussian kd-trees for fast high-dimensional filtering," in ACM Trans. Graph., vol. 28, no. 3, 2009, p. 21.
[63] A. Adams, J. Baek, and M. A. Davis, "Fast high-dimensional filtering using the permutohedral lattice," in Computer Graphics Forum, vol. 29, no. 2, 2010, pp. 753–762.
[64] C. M. Bishop and N. M. Nasrabadi, Pattern Recognition and Machine Learning. Springer, 2006.
[65] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, July 2001, pp. 416–423.
[66] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," Int. Journal of Comput. Vis., vol. 47, no. 1-3, pp. 7–42, 2002.
[67] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade, "Three-dimensional scene flow," in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, 1999, pp. 722–729.
[68] C. Vogel, K. Schindler, and S. Roth, "3D scene flow estimation with a rigid motion prior," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 1291–1298.
[69] ——, "Piecewise rigid scene flow," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 1377–1384.
[70] S.-Y. Kim, J.-H. Cho, A. Koschan, and M. Abidi, "Spatial and temporal enhancement of depth images captured by a time-of-flight depth sensor," in Proc. IEEE Int. Conf. Pattern Recognit., Aug 2010, pp. 2358–2361.
[71] J. Zhu, L. Wang, J. Gao, and R. Yang, "Spatial-temporal fusion for high accuracy depth maps using dynamic MRFs," vol. 32, no. 5, pp. 899–909, 2010.
[72] J. Shen and S.-C. S. Cheung, "Layer depth denoising and completion for structured-light RGB-D cameras," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1187–1194.
[73] R. Szeliski, "A multi-view approach to motion and stereo," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1, 1999, pp. 157–163.
[74] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nister, and M. Pollefeys, "Real-time visibility-based fusion of depth maps," in Proc. IEEE Int. Conf. Comput. Vis., Oct 2007, pp. 1–8.
[75] S. Liu and D. Cooper, "A complete statistical inverse ray tracing approach to multi-view stereo," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2011, pp. 913–920.
[76] Y. M. Kim, C. Theobalt, J. Diebel, J. Kosecka, B. Miscusik, and S. Thrun, "Multi-view image and ToF sensor fusion for dense 3D reconstruction," in Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2009, pp. 1542–1549.
[77] C. Zitnick, S. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High-quality video view interpolation using a layered representation," in ACM SIGGRAPH, vol. 23, no. 3, August 2004, pp. 600–608.
[78] K. Pathak, A. Birk, J. Poppinga, and S. Schwertfeger, "3D forward sensor modeling and application to occupancy grid based sensor fusion," in Proc. IEEE/RSJ Int. Conf. Intell. Robots. Syst., 2007, pp. 2059–2064.
[79] B. Curless and M. Levoy, "A volumetric method for building complex models from range images," in Proc. ACM SIGGRAPH, 1996, pp. 303–312.
[80] R. A. Newcombe, A. J. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in Proc. IEEE Int. Symp. Mixed Augmented Reality, 2011, pp. 127–136.
[81] O. J. Woodford and G. Vogiatzis, "A generative model for online depth fusion," in Proc. Euro. Conf. Comput. Vis. Springer, 2012, pp. 144–157.
[82] S. Thrun, "Learning occupancy grids with forward models," in Proc. IEEE/RSJ Int. Conf. Intell. Robots. Syst., vol. 3, 2001, pp. 1676–1681.
[83] G. Vogiatzis and C. Hernandez, "Video-based, real-time multi-view stereo," Image and Vision Computing, vol. 29, no. 7, pp. 434–441, 2011.
[84] T. P. Minka, "A family of algorithms for approximate Bayesian inference," Ph.D. dissertation, Massachusetts Institute of Technology, 2001.
[85] P. Krahenbuhl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems. MIT Press, 2011, pp. 109–117.
[86] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," in Proc. IEEE Int. Conf. Comput. Vis., December 2013.
[87] D. Scharstein and C. Pal, "Learning conditional random fields for stereo," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2007, pp. 1–8.
[88] H. Hirschmuller and D. Scharstein, "Evaluation of cost functions for stereo matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2007, pp. 1–8.
[89] C. Fehn, "Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV," in Electronic Imaging. International Society for Optics and Photonics, 2004, pp. 93–104.
[90] D. Vlasic, M. Brand, H. Pfister, and J. Popovic, "Face transfer with multilinear models," in ACM Trans. Graph., vol. 24, no. 3. ACM, 2005, pp. 426–433.
[91] J. Taylor, J. Shotton, T. Sharp, and A. Fitzgibbon, "The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2012, pp. 103–110.
[92] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from a single depth image," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, June 2011.
[93] L. Wei, Q. Huang, D. Ceylan, E. Vouga, and H. Li, "Dense human body correspondences using convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[94] T. Baltrusaitis, P. Robinson, and L.-P. Morency, "3D constrained local model for rigid and non-rigid facial tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2012, pp. 2610–2617.
[95] C. Cao, Y. Weng, S. Lin, and K. Zhou, "3D shape regression for real-time facial animation," ACM Trans. Graph., vol. 32, no. 4, p. 41, 2013.
[96] Y. Cai, M. Yang, and Z. Li, "Robust head pose estimation using a 3D morphable model," Mathematical Problems in Engineering, vol. 2015, 2015.
[97] Q. Cai, D. Gallup, C. Zhang, and Z. Zhang, "3D deformable face tracking with a commodity depth camera," in Proc. Euro. Conf. Comput. Vis. Springer, 2010, pp. 229–242.
[98] C. Chen, H. X. Pham, V. Pavlovic, J. Cai, and G. Shi, "Depth recovery with face priors," in Proc. Asia Conf. Comput. Vis. Springer, 2014, pp. 336–351.
[99] A. Brunton, A. Salazar, T. Bolkart, and S. Wuhrer, "Review of statistical shape spaces for 3D data with comparative analysis for human faces," Computer Vision and Image Understanding, vol. 128, pp. 1–17, 2014.
[100] S. Bouaziz, Y. Wang, and M. Pauly, "Online modeling for realtime facial animation," ACM Trans. Graph., vol. 32, no. 4, p. 40, 2013.
[101] P.-L. Hsieh, C. Ma, J. Yu, and H. Li, "Unconstrained realtime facial performance capture," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1675–1683.
[102] S. Li, K. Ngan, R. Paramesran, and L. Sheng, "Real-time head pose tracking with online face template reconstruction." 2015.
[103] M. Storer, M. Urschler, and H. Bischof, "3D-MAM: 3D morphable appearance model for efficient fine head pose estimation from still images," in Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 192–199.
[104] S. Tulyakov, R.-L. Vieriu, S. Semeniuta, and N. Sebe, "Robust real-time extreme head pose estimation," in Proc. IEEE Int. Conf. Pattern Recognit. IEEE, 2014, pp. 2263–2268.
[105] T. Weise, S. Bouaziz, H. Li, and M. Pauly, "Realtime performance-based facial animation," in ACM Trans. Graph., vol. 30, no. 4. ACM, 2011, p. 77.
[106] H. Li, J. Yu, Y. Ye, and C. Bregler, "Realtime facial animation with on-the-fly correctives," ACM Trans. Graph., vol. 32, no. 4, pp. 42:1, 2013.
[107] S. Saito, T. Li, and H. Li, "Real-time facial segmentation and performance capture from RGB input," arXiv preprint arXiv:1604.02647, 2016.
[108] P. Padeleris, X. Zabulis, and A. A. Argyros, "Head pose estimation on depth data based on particle swarm optimization," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on. IEEE, 2012, pp. 42–49.
[109] R. Wang, L. Wei, E. Vouga, Q. Huang, D. Ceylan, G. Medioni, and H. Li, "Capturing dynamic textured surfaces of moving targets," arXiv preprint arXiv:1604.02801, 2016.
[110] H. Li, T. Weise, and M. Pauly, "Example-based facial rigging," ACM Trans. Graph., vol. 29, no. 4, p. 32, 2010.
[111] P. Ekman and W. Friesen, "Facial action coding system: a technique for the measurement of facial movement," Consulting Psychologists, San Francisco, 1978.
[112] M. Martin, F. Van De Camp, and R. Stiefelhagen, "Real time head model creation and head pose estimation on consumer depth cameras," in 3D Vision (3DV), 2014 2nd International Conference on, vol. 1. IEEE, 2014, pp. 641–648.