Probabilistic Approaches for RGB-D Video Enhancement and Facial Pose Tracking
SHENG, Lu
A Thesis Submitted in Partial Fulfilment
of the Requirements for the Degree of
Doctor of Philosophy
in
Electronic Engineering
The Chinese University of Hong Kong
October 2016
Dedication

To
my dear wife Shao Jing
&
our beloved parents
Abstract
Abstract of thesis entitled:
Probabilistic Approaches for RGB-D Video Enhancement and Facial Pose
Tracking
Submitted by SHENG, Lu
for the degree of Doctor of Philosophy
at The Chinese University of Hong Kong
Acquiring high-quality and well-defined depth data from real scenes has been a hot
research topic in multimedia and computer vision. With the prevalence of various 3D
computer vision applications, depth data has been used in virtual reality, 3DTV,
free-viewpoint TV, human-computer interaction and robot vision. Conventional passive
acquisition algorithms (e.g., stereoscopic vision, shape-from-X, etc.) mostly assume
that the captured 3D scene is simple and artificial, i.e., under constant lighting
conditions or other constraints, containing only static or slowly moving objects.
Fortunately, depth cameras, e.g., time-of-flight cameras, laser scanners or
structured-light sensors, are able to capture standard-resolution depth maps at video
frame rate, making real-time 3D
natural scene reconstruction, rendering, manipulation and interaction feasible. Nev-
ertheless, artifacts like noise, outliers, depth-missing regions and low resolution deter
direct usage of the raw depth data. Hence, there is an imperative need to develop a
unified and high-quality spatio-temporal depth video enhancement algorithm.
Accompanied by synchronized color videos offered by these sensors, the composed
RGB-D videos provide multi-modal structural features that are shared by both texture
and geometry, enabling effective guidance by texture features to regularize the depth
videos. Furthermore, such guidance and structure-sharing properties between
different kinds of feature maps (e.g., RGB maps versus depth maps) enable a series of
structure-preserving/propagation filters that not only handle depth data but are
also applicable to a much broader range of image/video processing, graphics and
computer vision tasks.
This thesis explores probabilistic approaches for efficient spatio-temporal RGB-D
video enhancement. In addition, probabilistic structure-preserving/propagation filters
for various image and video applications are designed. Moreover, applications based
on RGB-D videos, like 3D facial pose tracking, are effectively treated under the
probabilistic view as well. The depth videos employed in this thesis were captured by
a Kinect version 1 and a low-resolution time-of-flight camera.
The employed probabilistic approaches not only handle the uncertainties, e.g., noise,
outliers and other artifacts, but also enable compact and learnable models that yield
reliable predictions: enhanced depth videos for RGB-D video enhancement, tracking
parameters for rigid facial pose tracking, and face model descriptions for online face
model personalization.
This thesis first demonstrates spatial and temporal depth video enhancement
under the guidance of the synchronized color video. For spatial enhancement, a novel
hybrid strategy is first proposed to simultaneously smooth the depth surface and
preserve the discontinuities by combining joint bilateral filtering with segment-based
surface structure propagation. Secondly, a probabilistic approach is proposed to
accelerate the time-consuming local weighted distribution estimation in the weighted
median/mode filters, based on a novel separable kernel defined by a weighted
combination of a set of probabilistic generative models. It reduces the large number of
filtering operations in conventional algorithms to a small amount, and is also compactly
adaptive to the structure of the input image. This method is not only compatible
with RGB-D video enhancement, but also suitable for various image and video
applications, e.g., detail enhancement, structure extraction and JPEG artifact removal.
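To make the guided-filtering idea concrete, the following is a minimal sketch of a color-guided joint bilateral filter for depth maps, the basic building block of such hybrid spatial strategies. It is an illustrative simplification, not the thesis implementation: the function name, the parameter values and the hole-skipping rule are assumptions, and the segment-based structure propagation step is omitted entirely.

```python
import numpy as np

def joint_bilateral_depth(depth, color, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Smooth a depth map while preserving discontinuities, guided by a
    synchronized color image. Pixels with depth <= 0 are treated as holes
    and excluded from the weighted average (so small holes get filled)."""
    h, w = depth.shape
    out = np.zeros((h, w))
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2.0 * sigma_s**2))  # spatial kernel
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            d = depth[y0:y1, x0:x1]
            c = color[y0:y1, x0:x1]
            s = spatial[y0 - y + radius:y1 - y + radius,
                        x0 - x + radius:x1 - x + radius]
            # range kernel on the guidance color, not on the noisy depth
            rng = np.exp(-((c - color[y, x])**2).sum(-1) / (2.0 * sigma_r**2))
            wgt = s * rng * (d > 0)  # skip missing depth samples
            total = wgt.sum()
            out[y, x] = (wgt * d).sum() / total if total > 0 else 0.0
    return out
```

Because the range weights come from the color image, depth edges that coincide with texture edges are preserved, while flat but noisy regions are averaged.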
For temporal enhancement, an efficient online method is developed by introducing
a probabilistic intermediary that captures the static structure of the captured scene.
By applying a novel variational generative model with respect to the static structure,
the proposed method both maintains long-range temporal consistency in the static
scene and keeps the necessary depth variations in the dynamic content. With added
spatial refinement, it can produce flicker-free and spatially optimized depth videos with
reduced motion blur and depth distortion.
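The temporal idea can be illustrated with a toy per-pixel estimator. The sketch below is a drastic simplification under stated assumptions: it replaces the variational generative mixture model with a Kalman-style running Gaussian per pixel, and the class name, thresholds and reset rule are invented for illustration. Each incoming depth sample is classified against the current static-structure estimate (consistent, in front, or behind), and only consistent or behind samples modify the estimate, so passing foreground objects do not corrupt the static scene.

```python
import numpy as np

class StaticStructureSketch:
    """Toy per-pixel Gaussian estimate of the static scene depth.

    Simplified stand-in for a variational mixture model: state-I
    (consistent) samples refine the estimate, state-F (in front, i.e. a
    dynamic object) samples are ignored, and state-B (behind, i.e. a
    revealed farther surface) samples reset the estimate."""

    def __init__(self, first_frame, sigma0=50.0):
        self.mu = first_frame.astype(np.float64)        # depth estimate
        self.var = np.full(first_frame.shape, sigma0**2)
        self.sigma0 = sigma0

    def update(self, depth, sigma_n=10.0, k=2.0):
        z = (depth - self.mu) / np.sqrt(self.var + sigma_n**2)
        agree = np.abs(z) <= k          # state-I: consistent with structure
        front = z < -k                  # state-F: nearer -> dynamic, ignore
        behind = z > k                  # state-B: farther surface revealed
        # Kalman-style refinement where the sample agrees
        gain = self.var / (self.var + sigma_n**2)
        self.mu = np.where(agree, self.mu + gain * (depth - self.mu), self.mu)
        self.var = np.where(agree, (1.0 - gain) * self.var, self.var)
        # a farther sample means the old estimate was not the static scene
        self.mu = np.where(behind, depth, self.mu)
        self.var = np.where(behind, self.sigma0**2, self.var)
        return np.where(front, 1, np.where(behind, 2, 0))  # per-pixel state
```

Averaging only the consistent samples is what yields long-range temporal consistency without smearing moving content.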
Thirdly, an application is presented that uses RGB-D videos to track the 3D facial
pose with online face model personalization. Its inherent probabilistic model brings
about (1) robust estimation of the tracking parameters, which are less vulnerable in
uncontrolled scenes with heavy occlusions and facial expression variations, and (2)
reliable face model adaptation that avoids interference from occlusions and expression
changes. The experimental results show that the proposed approach is effective and
superior to the state-of-the-art methods.
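For intuition on the rigid part of the tracking, the snippet below sketches the classic weighted Procrustes/Kabsch solution for a rigid pose from 3D point correspondences. It is a generic textbook step, not the thesis's tracker: the per-point weights merely hint at how a probabilistic model can down-weight occluded or deforming facial regions, and the function name and signature are assumptions.

```python
import numpy as np

def rigid_pose(src, dst, weights=None):
    """Weighted least-squares rigid transform (R, t) with dst ~ R @ src + t.

    src, dst: (N, 3) corresponding 3D points; weights: optional (N,) array
    that down-weights unreliable correspondences (e.g. occluded regions)."""
    if weights is None:
        weights = np.ones(len(src))
    w = weights / weights.sum()
    mu_s = (w[:, None] * src).sum(axis=0)              # weighted centroids
    mu_d = (w[:, None] * dst).sum(axis=0)
    H = (src - mu_s).T @ (w[:, None] * (dst - mu_d))   # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

Solving this in closed form per frame, with weights supplied by a probabilistic occlusion model, is one common design for robust rigid pose tracking.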
摘要 (Chinese Abstract)

In recent years, acquiring high-quality, highly detailed depth data from real scenes has become an increasingly active research topic in multimedia and computer vision. With the continued popularity of various 3D computer vision applications, depth data has been widely applied in virtual reality, 3DTV, free-viewpoint TV, human-computer interaction and robot vision. Conventional passive depth acquisition algorithms (e.g., binocular stereo vision, shape-from-X, etc.) mostly assume simple capture conditions, such as constant and uniform lighting, and scenes containing only static or slowly moving objects. Fortunately, depth cameras, such as time-of-flight cameras, laser scanners and structured-light sensors, can record standard-definition depth images in real time, making real-time 3D natural scene reconstruction, rendering, interaction and manipulation feasible. However, the noise and outliers in the measured depth data, the missing depth in certain regions, and the low resolution of the depth images make the raw depth data unsuitable for direct use. Therefore, a unified, high-quality spatio-temporal depth video restoration and enhancement algorithm is urgently needed as a necessary preprocessing step.

When depth videos are paired with synchronized color (RGB) videos, the resulting RGB-D videos provide multi-modal structural features shared by texture and geometry, so texture features can be used to constrain and guide the processing of the depth videos. Moreover, this structural guidance or structure sharing between different feature maps (e.g., between color maps and depth maps) has inspired a series of novel structure-preserving and structure-propagation filters. These filters not only handle depth data but also apply to the much broader fields of image/video processing, computer graphics and computer vision.

This thesis explores probabilistic algorithms for efficient spatio-temporal RGB-D video restoration and enhancement, and designs probabilistic structure-preserving and structure-propagation filters for various image and video applications. In addition, based on RGB-D video signals, the 3D facial pose tracking problem is studied from a probabilistic perspective. The probabilistic methods adopted in this thesis not only describe data uncertainties, such as noise, outliers and other artifacts, but also support compact, learnable models: they provide reliable depth video predictions for RGB-D video enhancement, and effective parameter estimation for head pose tracking and online face modeling.

The thesis first presents spatial- and temporal-domain depth video restoration and enhancement algorithms guided by synchronized color video. For spatial enhancement, a novel hybrid strategy combining joint bilateral filtering with superpixel-based surface structure propagation is proposed to simultaneously smooth depth surfaces and preserve the structure of discontinuities. Then, a probabilistic structure-preserving/propagation filter is proposed, which applies not only to RGB-D video enhancement but also to various image and video applications, such as detail enhancement, structure extraction and JPEG artifact removal. This method accelerates the computationally expensive local weighted distribution estimation in the weighted median and mode filters, based on a novel separable filter kernel defined by a weighted combination of a set of probabilistic generative models. The algorithm greatly reduces the filtering operations required by previous algorithms while compactly adapting to the structural features of the input image.

For temporal enhancement, an online enhancement algorithm is proposed that uses the static structure of the captured scene as a probabilistic intermediary. Through a novel variational generative model of the static structure, the method maintains long-term temporal consistency for the static scene while preserving the necessary depth variations in dynamic content. With added spatial refinement, the method produces flicker-free, spatially optimized depth videos with reduced motion blur and depth distortion.

Third, this thesis describes an application of RGB-D videos: 3D facial pose tracking. The probabilistic model adopted in this application not only makes the robust estimation of the tracking parameters resistant to interference from uncontrolled scenes and heavy occlusions, but also protects online face modeling from distortions caused by occlusions and expression changes. Experimental results show that the proposed algorithm is efficient and superior to the current state-of-the-art methods.
Acknowledgments
First and foremost, I wish to thank my supervisor Prof. King Ngi Ngan for his
encouragement, support and mentorship. He is an accomplished scholar in his field of
image and visual signal processing. Not only did he guide me to think creatively, but
he also provided plenty of innovative ideas that broadened my research horizons. No
achievement during my doctoral study could have been gained without his insightful
supervision. Moreover, his pursuit of perfection always motivates me to move on and
work harder.
My deep gratitude also goes to Prof. Jianfei Cai of the School of Computer
Science and Engineering at Nanyang Technological University (NTU), for his great
guidance and help during my six-month overseas research internship at NTU. He
provided me with a valuable opportunity to improve my research skills and broaden my
research vision. I would also like to thank Prof. Xiaogang Wang, Prof. Thierry Blu,
Prof. Wai Kuen Cham and Prof. Hung Tat Tsui, faculty members of the image
and video processing (IVP) laboratory. Their insightful suggestions and comments
gave me a more thorough understanding of my research topics, and introduced me
to a wealth of advanced knowledge in signal processing, computer vision and
machine learning.
I must express my appreciation to my colleagues in the IVP lab. Thanks go to Songnan
Li, Lin Ma, Wanli Ouyang, Qiang Liu, Qian Zhang, Feng Xue, Cong Zhao, Miaohui
Wang, Ran Shi, Chi Ho Cheung, Yichi Zhang, Tianhao Zhao, Fanzi Wu, Yu Zhang, Kai
Kang, Tong Xiao, Qinglong Han, Wei Li, Hanjie Pan, Xingyu Zeng, Zhisheng Huang,
Cong Zhang and others in IVP lab. I will treasure forever the time with them during
my PhD study. I also need to thank Jie Chen, Di Xu and Teng Deng for their help
both in academic and daily life when I was with Nanyang Technological University.
Last but not least, I am deeply indebted to my family. My most sincere gratitude
goes to my wife, Shao Jing, for her constant love, support, encouragement and
understanding. None of my achievements would have been possible without her. I
want to especially thank my parents for their unconditional love and support over the
past twenty-seven years. Their love motivates me to pursue my dreams with the
strongest resolve.
致謝 (Chinese Acknowledgments)

As my doctoral thesis nears completion, my life as a student is also coming to an end. Setting out from Ningbo, a water town south of the Yangtze, I moved to Hangzhou, "paradise on earth", and then to Hong Kong, the "Pearl of the Orient", a journey of study spanning more than twenty-one years. The five years at The Chinese University of Hong Kong have been short, but I have grown quickly and benefited greatly. Although my research output is not abundant, I have made great strides in research, knowledge and experience. Here, allow me to express my deep respect and heartfelt thanks to the teachers, classmates, friends and family members who have helped, encouraged and supported me.

First, I thank my supervisor, Prof. King Ngi Ngan, for his earnest teaching and careful guidance. Prof. Ngan is broadly learned and deeply accomplished; he not only gave me many key, inspiring suggestions but also helped me avoid many detours. Every achievement and every bit of progress during my doctoral study is inseparable from his tempering and encouragement. Moreover, his rigorous scholarship and meticulous working style have earned my deep admiration and will benefit me for life.

I also thank Prof. Jianfei Cai of the School of Computer Science and Engineering at Nanyang Technological University, as well as Prof. Xiaogang Wang, Prof. Thierry Blu, Prof. Wai Kuen Cham and Prof. Hung Tat Tsui of the Image and Video Processing laboratory at The Chinese University of Hong Kong. Their unique insights and incisive comments on many academic problems helped and inspired me greatly, deepened my understanding of my doctoral research topics, and taught me much cutting-edge knowledge in signal processing, computer vision and machine learning. I especially thank Prof. Jianfei Cai for his guidance and help during my exchange at Nanyang Technological University, which gave me a precious opportunity to improve my academic ability and broaden my horizons.

I also thank the labmates who worked alongside me. I thank my senior colleagues Songnan Li, Lin Ma, Wanli Ouyang, Qiang Liu, Qian Zhang, Feng Xue, Cong Zhao and Miaohui Wang for their guidance, and my junior colleagues Ran Shi, Chi Ho Cheung, Yichi Zhang, Tianhao Zhao, Fanzi Wu and Yu Zhang for their support. I also thank Kai Kang, Tong Xiao, Qinglong Han, Wei Li, Hanjie Pan, Xingyu Zeng, Zhisheng Huang, Cong Zhang and the other members of the image processing laboratory for their help and care, as well as Jie Chen, Di Xu and Teng Deng for their hospitality and help during my exchange at Nanyang Technological University. The time spent doing research and studying with them was fulfilling and happy, and gave me beautiful and unforgettable memories. Thank you for your company along the way; I wish you all bright futures and happy lives.

Finally, let me offer my most sincere thanks and gratitude to my family, my solid support and source of strength. My parents' diligent nurturing and earnest expectations have been a constant pillar throughout my long years of study. My wife, Shao Jing, who is also my labmate, always helps me when I am in difficulty and encourages me when I am weak; while taking care of our life she also discusses research problems with me, and all my achievements owe a share to her. Thank you for giving me the strength and courage to face difficulties and meet challenges.
Publications
Journal Papers
• Lu Sheng and King Ngi Ngan, “Weighted Structural Prior for Structure-preserving
Image and Video Applications”, IEEE Transactions on Image Processing (TIP),
U.S.A., in preparation.
• Lu Sheng, Jianfei Cai and King Ngi Ngan, “A Generative Model for Robust 3D
Facial Pose Tracking”, IEEE Transactions on Image Processing (TIP), U.S.A.,
in preparation.
• Lu Sheng, King Ngi Ngan, Chern-Loon Lim and Songnan Li, “Online Tempo-
rally Consistent Indoor Depth Video Enhancement via Static Structure”, IEEE
Transactions on Image Processing (TIP), U.S.A., vol. 24, no. 7, pp. 2197-2211,
Jul. 2015.
• Songnan Li, King Ngi Ngan, Raveendran Paramesran and Lu Sheng, “Real-
time Head Pose Tracking with Online Face Template Reconstruction”, IEEE
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), U.S.A.,
accepted.
Conference Papers
• Lu Sheng, Tak-Wai Hui and King Ngi Ngan, “Accelerating the Distribution Es-
timation for the Weighted Median/Mode Filters”, In Asian Conference on Com-
puter Vision (ACCV), Poster, Singapore, Nov. 1-5, 2014.
• Lu Sheng, Songnan Li and King Ngi Ngan, “Temporal Depth Video Enhance-
ment Based On Intrinsic Static Structure”, In IEEE International Conference on
Image Processing (ICIP), Oral, Paris, France, Oct. 27-30, 2014.
• Lu Sheng, King Ngi Ngan and Songnan Li, “Depth Enhancement Based On
Hybrid Geometric Hole Filling Strategy”, In IEEE International Conference on
Image Processing (ICIP), Poster, Melbourne, Australia, Sep. 15-18, 2013.
• Chi Ho Cheung, Lu Sheng and King Ngi Ngan, “A Disocclusion Filling Method
Using Multiple Sprites with Depth for Virtual View Synthesis”, In IEEE Interna-
tional Conference on Multimedia and Expo Workshop (ICMEW), Oral, Turin,
Italy, Jun. 29 - Jul. 3, 2015.
• Songnan Li, King Ngi Ngan and Lu Sheng, “Screen-camera Calibration Us-
ing a Thread”, In IEEE International Conference on Image Processing (ICIP),
Poster, Paris, France, Oct. 27-30, 2014.
• Songnan Li, King Ngi Ngan and Lu Sheng, “A Head Pose Tracking System Us-
ing RGB-D Camera”, In International Conference on Computer Vision Systems
(ICVS), Oral, St. Petersburg, Russia, Jul. 16-18, 2013.
Declaration
I hereby declare that this thesis was composed by myself and that its contents have
not been submitted to this or any other university for a degree. The material of some
chapters has been published in the following conferences or journals:
• Chapter 2:
– Lu Sheng, King Ngi Ngan and Songnan Li, “Depth Enhancement Based On
Hybrid Geometric Hole Filling Strategy”, In IEEE International Conference
on Image Processing (ICIP), Melbourne, Australia, Sep. 15-18, 2013.
• Chapter 3:
– Lu Sheng, Tak-Wai Hui and King Ngi Ngan, “Accelerating the Distribution
Estimation for the Weighted Median/Mode Filters”, In Asian Conference on
Computer Vision (ACCV), Singapore, Nov. 1-5, 2014.
• Chapter 4:
– Lu Sheng, Songnan Li and King Ngi Ngan, “Temporal Depth Video En-
hancement Based On Intrinsic Static Structure”, In IEEE International
Conference on Image Processing (ICIP), Paris, France, Oct. 27-30, 2014.
– Lu Sheng, King Ngi Ngan, Chern-Loon Lim and Songnan Li, “Online Tem-
porally Consistent Indoor Depth Video Enhancement via Static Structure”,
IEEE Transactions on Image Processing (TIP), U.S.A., vol. 24, no. 7, pp.
2197-2211, Jul. 2015.
Contents

Dedication
Abstract
Acknowledgments
Publications
Declaration
Contents
List of Figures
List of Tables

1 Introduction and Background
1.1 RGB-D Video Enhancement
1.1.1 RGB-D Spatial Enhancement
1.1.2 RGB-D Temporal Enhancement
1.2 RGB-D Video Applications
1.3 The Probabilistic Models
1.4 Thesis Contributions
1.5 Outline

2 Hybrid Geometric Hole Filling Strategy for Spatial Enhancement
2.1 Introduction
2.2 Related Work
2.3 Proposed Method
2.3.1 Unreliable Region Detection and Invalidation
2.3.2 Hybrid Strategy of Geometric Hole Filling
2.4 Experiments
2.5 Summary

3 Weighted Structure Filters Based on Parametric Structural Decomposition
3.1 Introduction
3.2 Related Work
3.3 Motivation and Background
3.3.1 Non-parametric Representations of Local Image Statistics
3.3.2 Correlations across Local Structures
3.3.3 Complexity of the Local Statistics Estimation
3.4 Accelerating the Distribution Estimation
3.4.1 Kernel Definition
3.4.2 Probability Distribution Approximation
3.4.3 Gaussian Model for the Proposed Kernel
3.5 Accelerated Weighted Filters
3.5.1 Weighted Average Filter
3.5.2 Weighted Median Filter
3.5.3 Weighted Mode Filter
3.6 Experimental Results and Discussions
3.6.1 Implementation Notes
3.6.2 Performance Evaluation
3.6.3 Applications
3.7 Summary

4 Temporal Enhancement based on Static Structure
4.1 Introduction
4.2 Related Work
4.3 Approach
4.3.1 A Probabilistic Generative Mixture Model
4.3.2 Variational Approximation
4.3.3 Improvement with Color Video
4.3.4 Layer Assignment
4.3.5 Online Static Structure Update Scheme
4.3.6 Temporally Consistent Depth Video Enhancement
4.4 Experiments and Discussions
4.4.1 Numerical Evaluation of the Static Structure Estimation by Synthesized Data
4.4.2 Evaluation of the Static Structure Estimation by Real Data
4.4.3 Temporally Consistent Depth Video Enhancement
4.5 Limitations and Applications
4.5.1 Limitations
4.5.2 Applications
4.6 Summary

5 A Generative Model for Robust 3D Facial Pose Tracking
5.1 Introduction
5.2 Related Work
5.3 Probabilistic 3D Face Parameterization
5.3.1 Multilinear Face Model
5.3.2 A Statistical Prior
5.4 Probabilistic Facial Pose Tracking
5.4.1 Robust Facial Pose Tracking
5.4.2 Online Identity Adaptation
5.5 Experiments and Discussions
5.5.1 Datasets and System Setup
5.5.2 Quantitative and Qualitative Evaluations
5.5.3 Limitations
5.6 Summary

6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work

A Approximation for the Gaussian Kernel

B Generative Model for Static Structure
B.1 Probabilistic Generative Mixture Model
B.1.1 Likelihood
B.1.2 Prior Distributions
B.1.3 Joint Distribution
B.1.4 Data Evidence
B.1.5 Posteriors with First-order Markov Chain
B.2 Derivations of the Results in Variational Approximation
B.2.1 Approximated Joint Distributions
B.2.2 Approximated Data Evidence for the Observation
B.2.3 Parameter Updating for the Approximated Static Structure
B.2.4 Parameter Updating for the Approximated State Frequencies
B.2.5 Approximated Posterior for the State Frequencies

C The Choice of Depth Noise Standard Deviation
C.1 Depth Map from Stereo or Kinect
C.2 Depth Map from Other Sources

Bibliography
List of Figures
1.1 (a)-(b) Illustration of RGB-D image pairs. (c) Texture-rendered point clouds. Data captured from Kinect.
1.2 Applications based on RGB-D data.
1.3 Spatial distortions in raw depth images from Kinect version 1. (a)-(b) Raw RGB-D image pairs. (c) Depth mesh generated from the raw depth image, illustrating the noise and outliers. (d) Depth holes from various sources. The blue box indicates depth holes from occlusions, while the green box shows depth holes from light reflection and absorption.
1.4 Temporal distortions in raw and spatially enhanced depth videos. The videos were captured by Kinect version 1. (a) Raw depth videos suffer from temporal flickering due to inconsistent noise, outliers and depth holes. (b) Spatially enhanced depth videos still contain temporal artifacts from blurs around object boundaries and inconsistent spatial filtering operations between neighboring frames.
2.1 Framework of the proposed method.
2.2 Align the depth map to the color image coordinates and then partition the hole region into Ω_s and Ω_f. The test depth map comes from the Middlebury dataset.
2.3 Illustration of the patch matching process. The left image is the segmented color image; the right one is a close-up of the local region marked blue in the left image. P_u is the query patch and P_v is in the candidate patch set. A detailed description is given in the text.
2.4 Middlebury datasets employed for the experimental comparisons. Test scenes (from left to right) are Reindeer, Midd2 and Teddy.
2.5 Visual comparison on the Middlebury datasets. From top to bottom: color images, results by [1], [2] and the proposed method. Test scenes (from left to right) are Reindeer, Midd2 and Teddy.
3.1 Illustration of correlations among structures in local patches. (a) The sample image. Four patches A, B, C and D were selected from the area in the black box. (b) The histograms of the four patches, fitted by kernel regression. The revealed modes indicate the local structures, labeled #1 to #4. (c) The locations of these structures in each patch. The structures vary slowly in a local neighborhood and are shared among the patches.
3.2 Illustration of the proposed kernel. (a) A 1D signal and two pixels x and y. (b) The construction of κ(f_x, f_y), where the mean values of the three models are shown in three different colors. The kernel measures the similarity of f_x and f_y by evaluating the sum of their joint likelihoods w.r.t. each model.
3.3 Locally adaptive models (LAM) vs. uniformly quantized models (UQM). A 1D signal, extracted from the gray-scale image shown in the left column and marked in orange, is represented by both the LAM and UQM models (L = 3), shown in the right column. The top row uses the UQM models; the bottom row uses the LAM models. The LAM models adapt to the local structures and represent the signal better with a limited number of models (e.g., L = 3).
3.4 h(x, g) and ĥ(x, g) of patches C and D (from the image shown in Figure 3.1) under different conditions. The window size is |N(x)| = 11 × 11 and only the spatial weights are exploited. (a) h(x, g) estimated by the smoothed local histogram [3] under different data variances σ_n^2, with σ_n = 10^-1, 10^-2 and 10^-3. (b) ĥ(x, g) estimated by the proposed kernel under the same data variances as in (a). (c) ĥ(x, g) estimated under different numbers of models L, with the data variance fixed at σ_n = 10^-2. The y-axis is rescaled to show the subtle differences between the curves.
3.5 Execution time comparison for the distribution construction w.r.t. the number of models. The input is an 8-bit single-channel image and the guidance is a 3-channel image. The reference method is brute force and traverses 256 discretized bins.
3.6 The distribution of the number of necessary locally adaptive models in the BSDS300 dataset. Left: the window size is 21 × 21. Right: the window size is 11 × 11. The smaller the window size, the fewer locally adaptive models are necessary.
3.7 Depth map enhancement on tsukuba. The first row shows, from left to right, the raw input disparity map, the ground truth, and results by CT-median [4] and BF-mode [5]. Disparity maps in the 2nd and 3rd rows were obtained by the proposed weighted median filter and weighted mode filter under different numbers of models, generated by the LAM models. The error was evaluated as the bad pixel ratio with threshold 1. GF weights were chosen and the related parameters were fairly configured.
3.8 Results of the weighted mode filter with 7 models.
3.9 JPEG compression artifact removal by the weighted median filter. (a) The input degraded eyes image. (b) CT-median [4]. (c) The proposed weighted median filter with the LAM models and (d) with the UQM models. The second row shows the corresponding zoomed-in patches. The DF weights were chosen and all related parameters were fairly configured. Best viewed in the electronic version.
3.10 Detail enhancement by the proposed weighted median filter under the LAM models. From left to right: the original rock image, the result after edge-preserving smoothing, and the detail-enhanced image. GF weights were chosen.
3.11 Joint depth map upsampling. The input disparity map was 8× upsampled by the proposed weighted median filter and weighted mode filter under the LAM models. The raw input disparity map is shown in the top-left corner of the leftmost image. GF weights were chosen.
4.1 Illustration of the static structure in comparison with the input depth frame. (a) The input depth frame (blue curve) lies on the captured scene; (b) the static structure (black curve). The depth sensor is above the captured scene. The static structure includes the static objects as well as the static background.
4.2 Flowchart of the overall framework of the proposed method for static structure estimation and depth video enhancement. Please refer to the text for a detailed description.
4.3 Illustration of the three states of input depth measurements with respect to the static structure on one line of sight. The current static structure is the blue stick in the middle. Decision boundaries are marked as blue dotted lines. The depth measurement d is categorized as state-I when it lies around the static structure, as state-F when it is in front of the structure, and as state-B when it is far behind the static structure.
4.4 Variational approximation of the parameter set of the static structure for a 1D depth sequence. The number of frames is T = 500. (a) The expected depth sequence of the static structure versus the raw depth sequence, where the ideal Z_x = 50. (b) The confidence interval of Z_x^t; the interval is centered at μ_x^t and lies between μ_x^t ± 2σ_x^t with 95% confidence. (c) The evolution of the portions of the three states (defined by the expected value of ω_x at frame t, denoted by [ω_x^{I,t}, ω_x^{F,t}, ω_x^{B,t}]). The ideal portions are ω_x = [0.89, 0.1, 0.01]. (d) The estimated distribution q^T(d_x | P_x^{D,T}) versus the normalized histogram estimated from D_x^T when T = 500. The estimated depth of the static structure approaches the ideal value with only a few samples. Its confidence interval shrinks rapidly, which means the uncertainty is reduced very quickly. The portion of each state evolves with the raw depth sequence, and the portions match their ideal values given enough depth samples. When T = 500, the estimated data distribution fits the data histogram compactly.
4.5 A toy example illustrating the layer assignment. The cyan dotted line indicates the currently estimated depth structure of the static structure, and the red solid line is from the input depth frame. If color frames are available, they provide additional constraints to regularize the assignment, where the upper line corresponds to the currently estimated texture structure of the static structure, and the lower one refers to the input color frame.
4.6 Sample frames of the input depth video with two types of noise and outliers. (a) The sample color frame; (b) and (c) the contaminated depth frames with σ_n = 2 and ω_n = 10^-2, where (b) is type-I and (c) is type-II. Type-II error is worse than type-I error with the same parameters.
4.7 RMSE maps with varying u and σ under different noise and outlier parameter pairs (ω_n, σ_n). (a)-(c) were contaminated by type-I, while (d)-(f) were contaminated by type-II.
4.8 Performance comparison between the constant and depth-dependent ξ_x under different type-II noise and outlier parameter pairs (ω_n, σ_n). The red curve is by the depth-dependent ξ_x, and the blue curve by the constant ξ_x. Each curve is obtained at its own optimal parameter pair (u, σ), as shown in the legends.
4.9 Comparison with other methods on static structure estimation of the synthetic static scenes. Three levels of noise and outlier parameter pairs (ω_n, σ_n) were tested. (a), (c) and (e) are of type-I; (b), (d) and (f) are of type-II. The x-axis marks the frame order and the y-axis is the RMSE score.
4.10 Visual evaluation on real indoor static scenes. (a) is the result of a
real indoor scene Indoor Scene 1. The first row shows the raw depth
sequences and color sequences. The second row is the selected results
of the estimated static structures without spatial enhancement at frame
t = 0, 5, 10 respectively. The third row shows corresponding spatially en-
hanced static structure without texture information, while the last row
exhibits the results with the guidance of texture information. The yellow
color in the second row marks missed depth values (holes). Gray rep-
resents depth value, lighter meaning a nearer distance from the camera.
Best viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.11 Visual evaluation on real indoor static scenes. (b) shows the results of a
real indoor scene Indoor Scene 2. The first row shows the raw depth
sequences and color sequences. The second row is the selected results
of the estimated static structures without spatial enhancement at frame
t = 0, 5, 10 respectively. The third row shows corresponding spatially en-
hanced static structure without texture information, while the last row
exhibits the results with the guidance of texture information. The yellow
color in the second row marks missing depth values (holes). Gray represents
depth value, with lighter meaning a nearer distance from the camera.
Best viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.12 Reliability maps of two test sequences of indoor static scenes. . . . . . . 69
4.13 Static structure estimation on dyn kinect tl. (a) and (b) are the first
five frames of the input sequence. (c) shows the layer assignment results.
Red, green, blue denote l_iss, l_dyn, l_occ, respectively. (d) represents the
depth map of the static structure, and (e) shows the corresponding color
map. The first frame is for initialization. . . . . . . . . . . . . . . . . . . 70
4.14 Static structure estimation on dyn tof tl. (a) shows the first five frames of
the input sequence. (b) shows the layer assignment results. Red, green,
blue denote l_iss, l_dyn, l_occ, respectively. (c) represents the depth map of
the static structure. The first frame is for initialization. . . . . . . . . . 71
4.15 Comparison on depth video enhancement. (a) and (b) are selected
frames from the test RGB-D video sequences. From left to right: the
113th, 133rd, 153rd, 173rd, 193rd and 213th frames. (c) shows the results by
CSTF [1], (d) by WMF [5], and (e) by Lang et al. [6]. (f) is generated
by the proposed method. (g) compares the performances among these
methods in the enlarged sub-regions (shown in raster-scan order). Best
viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.16 Comparison on depth video enhancement. (a) are selected frames from
an RGB-D video sequence dyn kinect 2. From top to bottom: the RGB
frames, the raw depth frames, results by Lang et al. [6] and results by
the proposed method. Best viewed in color. . . . . . . . . . . . . . . . . 73
4.17 Comparison on depth video enhancement. (b) are selected frames from
an RGB-D video sequence dyn kinect 3. From top to bottom: the RGB
frames, the raw depth frames, results by Lang et al. [6] and results by
the proposed method. Best viewed in color. . . . . . . . . . . . . . . . . 74
4.18 Failure cases of the proposed method. (a) and (b) are two representa-
tive results. From left to right: color frame, raw depth frame and the
enhanced depth frame. Artifacts are bounded by the red dot boxes. . . 76
4.19 Examples of the background subtraction. Best viewed in color. . . . . . 77
4.20 Examples of the novel view synthesis. (a) and (b) are the input RGB and
depth frames. (c) is the enhanced depth frame by the proposed method.
(d) is the synthesized view from the raw depth frame and the RGB frame.
Image holes in (d) are filled by the static structure, as shown in (e). (f) is
the synthesized view based on the enhanced depth frame and the image
holes are also filled by the estimated static structure. Best viewed in color. 78
5.1 Sample face meshes in the FaceWarehouse dataset. This dataset contains
face meshes from a comprehensive set of expressions and a variety of
identities including different ages, genders and races. . . . . . . . . . . . 83
5.2 Illustration of the generic multilinear face model trained by the Face-
Warehouse dataset [7]. (a) The mean face f . (b) Illustration of per-
vertex shape variation caused jointly by wid and wexp. (c)–(d) Illustra-
tion of per-vertex shape variation with respect to wid and wexp, respec-
tively. The shape variation is represented as the standard deviation of
the marginalized per-vertex distribution. The shape variations in (b)–
(d) are overlaid on the same neutral face model μ_M. Best viewed in
electronic version. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 System overview. We propose a unified probabilistic framework for ro-
bust facial pose estimation and online identity adaption. In both threads,
the generative face model acts as the key intermediate and it is updated
immediately with the feedback of the identity adaptation. The input
data is the depth map, while the outputs are the rigid pose parameter θ^(t)
and the updated face identity parameters {μ_id^(t), Σ_id^(t)} that encode the
identity distribution p^(t)(w_id). . . . . . . . . . . . . . . . . . . . . . . . 87
5.4 Samples of the occluded faces. The occlusions are caused by multiple
factors. For instance, the face is occluded by itself, or the face is occluded
by other objects like hair, accessories, hands, etc. . . . . . . . . . . . . . 89
5.5 Illustration of the ray visibility constraint. A profiled face model and a
curve in the surface of the input point cloud are presented in front of
a depth camera. Three cases are presented. (a) Case-I: a partial face
region is fitted to the input point cloud, while the rest facial regions are
occluded. (b) Case-II: the face model is completely occluded. (c) Case-
III: a part of face region is visible and in front of the point cloud, and
the rest face regions are occluded. Best viewed in electronic version. . . 91
5.6 Examples of the proposed rigid pose estimation. (a) and (b) are the
color images and the corresponding point clouds. (c) shows the initial
alignment provided by the head detection method [8], and (d) visualizes
the proposed rigid pose estimation results. Notice that only the generic
face model is applied. It robustly estimates difficult face poses from
partial scans with heavy occlusions by hands and hair, as well as
profiled faces with strong self-occlusions. Best viewed in electronic version. 93
5.7 Comparison of the rigid pose estimation methods. (a) and (b) show the
color image and its corresponding point cloud. (c) depicts two views of the
initial alignment between the generic face model and the point cloud. (d)
visualizes the result by ICP [9], and (e) reports the result of maximizing
the likelihood modeled by the ray visibility constraint (RVC). (f) is
the proposed recursive method for minimizing the ray visibility
score (RVS), and (g) is the RVS method augmented by particle swarm
optimization (RVS+PSO). Refer to the text for details, and notice that only
the generic face model is applied. Best viewed in electronic version. . . . 95
5.8 Examples of face model adaptation. The proposed method can success-
fully personalize the face model to identities of different genders and
races. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.9 We continuously adapt the identities of the face model to different users.
(a)-(c) are two examples showing that the face model can be gradually
personalized when the facial depth data from different poses are captured
during the tracking process. The face model is initialized with the generic
face model as shown in Figure 5.2. . . . . . . . . . . . . . . . . . . . . . 98
5.10 Tracking results on the Biwi dataset with the personalized face mod-
els. Our system is robust to profiled faces due to large rotations and
occlusions from hair and accessories. The first and second rows show the
corresponding color and depth image pairs. The third row visualizes the
extracted point clouds of the head regions and the overlaid personalized
face models. Best viewed in electronic version. . . . . . . . . . . . . . . 100
5.11 Tracking results on the ICT-3DHP dataset. The proposed system is also
robust to expression variations. Best viewed in electronic version. . 101
5.12 The proposed system can automatically adapt a face model from one
identity to another. Top: Three identities are presented successively
in three adjacent frames. Bottom: The tracked face models, adapted
to the current identity. Note the differences in head and
nose shapes among the visualized face models. . . . . . . . . . . . . . . . 103
List of Tables
2.1 Comparison of bad pixel rate (%) . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Comparison of mean absolute difference . . . . . . . . . . . . . . . . . . 23
4.1 Per-frame running time comparison (MATLAB platform) . . . . . . . . 67
5.1 Summary of facial pose datasets . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Evaluations on Biwi dataset . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3 Evaluations on ICT-3DHP dataset . . . . . . . . . . . . . . . . . . . . . 104
Chapter 1
Introduction and Background
With the prevalence of various three-dimensional applications in manufacturing and
the entertainment industry, automatically acquiring dense and high-quality depth data
from the real world scenarios has been an essential requirement in 3D reconstruc-
tion, virtual reality and augmented reality (VR/AR), 3D and free-viewpoint televisions
(3DTV and FTV), human-computer interaction (HCI), robot vision, as well as a host
of high-level 3D learning tasks like 3D object/scene understanding and analysis.
Unlike most research in computer graphics, which relies on synthesizing scene-level
or object-level depth data, the computer vision community focuses on measuring
depth data from the real world. Recently, a variety of systems have been proposed
to obtain depth information of a real scene, from passive stereo vision and shape-from-
X systems to active sensors like real-time structured-light depth sensors (e.g., Kinect),
Time-of-Flight (ToF) cameras or laser scanners. The passive systems mostly require
simple or artificial environmental conditions (i.e., constant lighting, simple backgrounds,
static or slowly-moving objects, etc.) in the captured scenes, so as to keep their
performance as stable as possible. Fortunately, recent commodity active depth cameras
are able to capture standard-resolution depth maps at video frame rates, making
low-cost, real-time 3D applications possible.
Even though the depth data acquired by recent commodity depth sensors are of
low quality, they provide a more convenient and explicit way to model and understand
the geometric structures of the 3D world than the implicit inferences from the 2D
texture information offered by the RGB images and videos. A lot of 3D image/video
processing and computer vision tasks benefit from the usage of these depth sensors. To
name a few, 3DTV and FTV adopt “RGB + depth” video pairs from either dense or
sparse viewpoints to seamlessly render immersive and visually plausible novel-viewpoint
(a) RGB image (b) Depth image (c) Point cloud with texture
Figure 1.1: (a)-(b) Illustration of an RGB-D image pair. (c) Texture-rendered point cloud. Data captured by Kinect.
videos. VR/AR and HCI employ the streaming depth data to determine the user’s
head pose, facial expression, body pose and actions in real time. The 3D geometrical
data also give the researchers a new modality of cues in addition to the conventional
2D texture patterns, enabling a more thorough high-level analysis and understanding
of the 3D real world both from the viewpoints of appearance and geometry. As for the
field of robot vision, one example is the simultaneous localization and mapping (SLAM)
algorithms that explicitly utilize the point clouds from the depth sensors mounted on
the robots to concurrently reconstruct the 3D layouts of the scanned scenes and localize
the trajectories of the robots. As a consequence, the introduction of depth information
makes many tasks that were once difficult or intractable with texture information
alone much easier and far more accessible.
However, despite the advantages listed above, RGB-D video enhancement is
still an urgent issue, since the poor quality of the captured depth data, for example
from Kinect version 1, impedes depth-based tasks from reaching their full
potential. Moreover, depth data accompanied by texture information demands
treatment compatible with its 3D geometrical properties, rather than the conventional
methods designed particularly for texture patterns.
It means that methods dedicated to the depth data are necessary and essential for
3D image and video processing, as well as various 3D computer vision applications.
Therefore, on one hand, this thesis aims to propose reliable solutions for RGB-D video
enhancement for Kinect version 1, as a faithful preprocessing for various 3D applica-
tions. On the other hand, taking the 3D facial pose tracking as an example, this thesis
explores novel depth-based techniques to model 3D geometrical relationships and re-
construct 3D structures in an online fashion. This thesis unifies the tools for all these tasks
(a) Immersive 3DTV and FTV (b) 3D facial expression reenactment (c) RGBD SLAM
(d) 3D facial pose estimation (e) 3D body pose estimation
Figure 1.2: Applications based on RGB-D data.
based on parametric generative models, which are not only effective for modeling these
problems, with reliable uncertainty (or noise) compensation and faithful 2D/3D struc-
ture and motion prediction, but also computationally efficient enough for real-time
performance.
1.1 RGB-D Video Enhancement
Most commodity depth sensors offer only low-quality depth data and usually suffer
from various systematic distortions, depending on the mechanisms behind them.
The spatial distortions of depth videos can be roughly classified into three categories:
• Noise and outliers. For Kinect and other structured-light sensors, noise usually
comes from quantization errors in the disparity-to-depth conversion [10]. Outliers,
on the other hand, stem from strong light reflection off non-Lambertian
materials, light attenuation by light-absorbing materials, or interference across
multiple depth sensors or from ambient light. For ToF sensors, noise
and outliers usually result from differing light absorption across materials. For
both types of depth sensors, complex geometrical structures often produce un-
stable outliers, since depth measurements around the discontinuities between
distinct structures are erratic.
(a) RGB image (b) Depth image
(c) Noise and outliers (d) Depth holes from various sources
Figure 1.3: Spatial distortions in raw depth images from Kinect version 1. (a)-(b) Raw RGB-D image pair. (c) Depth mesh generated from the raw depth image, illustrating the noise and outliers. (d) Depth holes from various sources. The blue box indicates depth holes from occlusions, while the green box shows depth holes from light reflection and absorption.
• Holes without depth measurements. For the structured-light sensor, a part of
the depth holes is caused by occlusions. Similarly, light attenuation and
reflection also lead to depth holes without reliable depth measurements. Another
kind of hole occurs when the captured objects or scenes are out of the effective
range of the depth sensor.
• Low resolution. Although various types of depth sensors keep coming to
market with increasing resolutions, most still cannot compete
with commodity web cameras (usually 1920×1080 or larger). For example, Kinect
version 1 (structured-light) offers 320×240 pixels, Kinect version 2 (time-of-
flight) offers 512×424 pixels, and a popular ToF camera like the Swiss Ranger
only has 176×144 pixels.
• RGB-D misalignment. For a unified framework over both the depth video and
its synchronized RGB video, the misalignment errors between the depth and
RGB frames are an extra kind of spatial distortion that is critical for reliable
RGB-D video based tasks. It is more severe if there is resolution incompatibility
(a) Raw depth video
(b) Spatially enhanced depth video
Figure 1.4: Temporal distortions in raw and spatially enhanced depth videos. The videos were captured by Kinect version 1. (a) Raw depth videos suffer from temporal flickering due to inconsistent noise, outliers and depth holes. (b) Spatially enhanced depth videos still contain temporal artifacts from blurs around object boundaries and inconsistent spatial filtering between neighboring frames.
between the depth and RGB videos.
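As a rough illustration of the quantization noise mentioned in the first item, the disparity-to-depth conversion Z = f·b/d implies that a fixed disparity step produces a depth error that grows roughly quadratically with depth. The sketch below uses made-up calibration values; f, b and the disparity step are assumptions for illustration, not actual Kinect calibration data:

```python
import numpy as np

# Hypothetical structured-light calibration (assumed values, not from a real sensor).
F_PIXELS = 580.0        # focal length in pixels
BASELINE_M = 0.075      # projector-camera baseline in metres
DISPARITY_STEP = 0.125  # disparity quantization step in pixels

def quantized_depth(true_depth_m):
    """Round the disparity to the sensor's step size, then convert back to depth.

    Depth and disparity are related by Z = f * b / d, so a fixed disparity
    step yields a depth error growing roughly quadratically with Z.
    """
    true_depth_m = np.asarray(true_depth_m, dtype=np.float64)
    disparity = F_PIXELS * BASELINE_M / true_depth_m
    disparity_q = np.round(disparity / DISPARITY_STEP) * DISPARITY_STEP
    return F_PIXELS * BASELINE_M / disparity_q

def worst_case_error(depth_m):
    """First-order bound on the quantization error at depth Z: Z^2 * step / (f * b)."""
    return depth_m ** 2 * DISPARITY_STEP / (F_PIXELS * BASELINE_M)
```

This quadratic growth is why far-away surfaces in raw Kinect depth maps appear visibly banded while near surfaces look smooth.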
Apart from the spatial distortions, the temporal inconsistency problem is another
type of distortion that occurs in raw depth videos. Not only do the noise, outliers and
depth holes in adjacent frames introduce severe temporal flickering artifacts, but
the spatial enhancement applied to each single depth image also aggravates the
inconsistency between neighboring frames.
These shortcomings make it difficult to use the raw depth of RGB-D videos directly.
1.1.1 RGB-D Spatial Enhancement
To tackle these limitations, the spatial enhancement of depth videos has received
extensive research effort. A pioneering work in this field was done by Diebel et al. [11]. They
modeled the enhancement problem as a pixel-wise Markov Random Field (MRF) guided
by the RGB image with the assumptions that
• structure and texture discontinuities are co-aligned in the color and depth images;
• pixels with similar texture patterns have similar geometrical structures.
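The two assumptions above translate into a quadratic MRF energy with a data term on observed depths and a color-weighted smoothness term. The following is a minimal sketch; the parameter values, the scalar guide image, and the plain Jacobi-style solver are illustrative choices, not Diebel et al.'s exact formulation:

```python
import numpy as np

def mrf_depth_enhance(depth, color, mask, lam=0.5, sigma_c=10.0, iters=200):
    """Minimize a quadratic MRF energy (illustrative parameters):

        E(D) = sum_i m_i (D_i - Z_i)^2 + lam * sum_{i~j} w_ij (D_i - D_j)^2,

    where w_ij = exp(-|I_i - I_j|^2 / (2 sigma_c^2)) ties the smoothness
    strength to color similarity, implementing the two assumptions above.
    m_i = 0 marks depth holes, so they are filled purely by propagation.
    Solved by simple Jacobi sweeps; a real system would use a sparse solver.
    """
    H, W = depth.shape
    D = np.where(mask, depth, 0.0).astype(np.float64)
    for _ in range(iters):
        D_new = np.zeros_like(D)
        for y in range(H):
            for x in range(W):
                num = mask[y, x] * depth[y, x]   # data term (zero in holes)
                den = float(mask[y, x])
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        w = np.exp(-(color[y, x] - color[ny, nx]) ** 2
                                   / (2 * sigma_c ** 2))
                        num += lam * w * D[ny, nx]
                        den += lam * w
                D_new[y, x] = num / den
        D = D_new
    return D
```

With a constant-color guide, a hole pixel converges to the average of its neighbors; across a color edge, the smoothness weight collapses and depth discontinuities are preserved.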
Under similar assumptions, several augmented models were also proposed to handle
inpainting and super-resolution [12–16], with special choices of the data and smoothness
terms as well as additional regularization terms, for instance, effective image-guided
regularizations like the TV-ℓ1 norm [16], anisotropic total generalized variation [17],
mutual structure [18], and even a regularization term without texture information [16].
Modern global optimization methods also attempt joint static and dynamic
guidance [19], employ statistical inference for local structures [20; 21], or explicitly
enforce local geometric structures [22; 23]. However, the high computational
cost of these methods hinders real-time applications, except for some carefully designed
accelerations [24].
With similar assumptions as above, Kopf et al. [25] proposed the Joint Bilateral
Filter (JBF), a kind of high-dimensional filter, to efficiently filter the noisy and
low-resolution depth image under the guidance of the corresponding RGB image. It
extends the famous Bilateral Filter (BF) [26] by taking the structural guidance
from external feature maps, with weights defined by spatial nearness and
feature proximity. To reduce the texture-copying and edge-blurring artifacts underlying
the JBF, and to further strengthen its structural filtering, a list of JBF variants
has been proposed in the recent literature [1; 27–32]. The features can be texture/depth
intensities or patches [27; 31], among other specifically defined ones.
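A minimal sketch of the JBF idea follows, with weights combining spatial nearness and guide-feature proximity; the parameter values and the scalar (grayscale) guide are illustrative assumptions, not taken from the cited works:

```python
import numpy as np

def joint_bilateral_filter(depth, guide, radius=3, sigma_s=2.0, sigma_r=8.0):
    """Joint bilateral filtering of a depth map under a guide image.

    Each output pixel is a weighted mean of depth values in a window;
    weights combine spatial nearness (Gaussian in pixel distance) and
    feature proximity (Gaussian in guide-image difference), so depth is
    smoothed within similar-texture regions but not across guide edges.
    """
    H, W = depth.shape
    out = np.zeros((H, W), dtype=np.float64)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys ** 2 + xs ** 2) / (2 * sigma_s ** 2))
    pad = radius
    d = np.pad(depth.astype(np.float64), pad, mode='edge')
    g = np.pad(guide.astype(np.float64), pad, mode='edge')
    for y in range(H):
        for x in range(W):
            dp = d[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            gp = g[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            rng = np.exp(-(gp - g[y + pad, x + pad]) ** 2 / (2 * sigma_r ** 2))
            w = spatial * rng
            out[y, x] = (w * dp).sum() / w.sum()
    return out
```

Because the range weight collapses across strong guide edges, depth discontinuities co-aligned with texture edges survive the filtering, which is exactly the guidance assumption stated above.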
Another variant uses the median of the weighted depth-candidate histogram [3; 4;
33] instead of its mean as the JBF does, producing much more
robust results but suffering from quantization error and slower speed. Weighted
mode filtering [3; 5; 34] instead seeks the histogram's global mode, with similar
artifacts. To obtain satisfactory performance at lower computational cost, rather
than resorting to parallel computation units such as GPGPUs for a brute-force imple-
mentation, acceleration techniques should approximate the distribution (or histogram)
estimation by parametric formulations or other efficient non-parametric means.
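The difference between the weighted mean (JBF) and the weighted median can be sketched per pixel as follows; the candidates and weights would come from a local window, and this bare version ignores the quantization issue mentioned above:

```python
import numpy as np

def weighted_median(candidates, weights):
    """Median of a weighted histogram of depth candidates.

    The JBF output is the weighted mean of the candidates; the weighted
    median filter instead returns the candidate at which the cumulative
    weight first reaches half the total mass, which is far more robust
    to outlier candidates in the window.
    """
    order = np.argsort(candidates)
    c = np.asarray(candidates, dtype=np.float64)[order]
    w = np.asarray(weights, dtype=np.float64)[order]
    cum = np.cumsum(w)
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return c[idx]
```

For candidates [2.0, 2.1, 2.05, 9.0] with equal weights, the weighted median stays near 2.05 while the weighted mean is dragged to about 3.79 by the single outlier, which illustrates the robustness claim.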
In addition, the spatial enhancement, especially super-resolution and inpainting,
can be performed by patch matching throughout the depth image, which achieves
satisfactory visual results [35; 36] but at high computational cost.
The depth hole filling problem is strongly related to image inpainting and oc-
clusion handling in stereo vision. According to a recent work by Richardt et al. [1],
the standard JBF (and similarly a series of global optimization methods) can effi-
ciently and seamlessly interpolate these depth holes, but it is prone to producing
artifacts when a depth hole is too large or its corresponding texture patterns imply
unreliable structure inference. Moreover, the joint bilateral filtering operation cannot
preserve high-order surface structures unless extra specific settings are involved, be-
cause the extrapolated depth will always be piece-wise constant in a large hole. Many
stereo algorithms simply fill the depth holes from the background content, which
suffers from significant artifacts when the scene is too complex. Our work on hole filling
is related to the work of Wang et al. [2] on stereoscopic inpainting: they over-segmented
stereo images, fitted a plane to each segment with the estimated disparity, and then prop-
agated the parametric planes into holes by matching segments in a greedy way. Their
segment matching cost function relied heavily on the stereo images, which cannot be
exploited in the general case, while the plane regression technique is not precise enough
to estimate local surface structure.
In this thesis, the spatial enhancement is explored in two aspects. On one hand, a
hybrid strategy is proposed to upsample the raw depth image with interpolation and
faithfully complete the depth holes through structure propagation under the guidance
of the accompanying RGB image. On the other hand, a parameterized probabilistic
model is designed to approximate the weight distribution; the derived weighted
mode filter and weighted median filter perform similarly to the state-of-the-art
methods but require only a fraction of the runtime and computational complexity.
1.1.2 RGB-D Temporal Enhancement
Even though the spatial enhancement of depth maps has been extensively studied, as
discussed in the previous section, the temporal inconsistency problem is nevertheless
neglected in recent state-of-the-art methods, resulting in severe flickering artifacts
because the necessary temporal relationships between adjacent frames have not been
taken into consideration. Moreover, due to the various complex and even unpredictable
dynamic contents, as well as the spatial distortions in a depth video, it is not easy to
exactly locate the regions where temporal consistency should be enforced. Several existing
methods [1; 5] employ temporal texture similarity to extract 2D motion information,
but correct depth variation cannot always be maintained, causing severe motion blur
artifacts. In addition, typical treatments apply temporal consistency over a
short sequence (usually 2–3 frames), which is insufficient to generate stable and
temporally consistent results over hundreds of frames. Furthermore, over-smoothing
around the boundaries between dynamic objects and static scenes should be eliminated
to produce high-quality and well-defined depth videos.
This thesis presents an alternative method to enhance a depth video both spatially
and temporally by addressing two aspects of these problems:
• efficiently and effectively enforcing the temporal consistency where it is necessary,
• and enabling online processing.
A common fact is that regions in one frame with various motion patterns (e.g., static,
slowly/fast moving, etc.) belonging to different objects or structures require temporal
consistency at different levels. For instance, a static region needs long-range
temporal enhancement to ensure that it stays static over a long duration, while dynamic
regions with slow or rapid motions expect short-term or no temporal consistency. How-
ever, it is difficult to accurately enhance arbitrary and complex dynamic contents in
the temporal domain without apparent motion blur or depth distortion. Thus an
intuitive compromise forgoes temporal enhancement in the dynamic region as
long as its spatial enhancement is done sufficiently well, so that the necessary depth
variation is not distorted while temporal artifacts are not easily perceived in
the static region. Therefore, we aim to strengthen long-range temporal consistency
around the static region whilst maintaining the necessary depth variation in the dynamic
content. To accurately separate the static and dynamic regions, we track and incre-
mentally refine a probabilistic model called the static structure in an online fashion,
which acts as a medium to indicate the region that is static in the current frame. By
fusing the static region of the current frame into the static structure online, with an
efficient variational fusion scheme, this structure implicitly gathers all the temporal data
at and before the current frame that belong to it. Substituting the static region with
the updated static structure thus makes it temporally consistent and stable over a long
time span. Moreover, the scheme is also suitable for online processing of streaming
depth videos (3D teleconference, 3DTV, etc.) without the need to store long sequences
of adjacent frames, and is therefore memory- and computation-efficient.
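A heavily simplified stand-in for such per-pixel online fusion is a Gaussian precision-weighted update, which stores only the current estimate rather than any past frames. The thesis's actual variational scheme also maintains layer assignments and outlier handling; the observation noise variance r below is an assumed input:

```python
import numpy as np

class StaticStructure:
    """Per-pixel online fusion of depth observations into a static-scene estimate.

    Each static pixel keeps a Gaussian (mu, var); a new observation z with
    noise variance r is fused by the standard precision-weighted (Kalman-style)
    update, so only the current estimate is stored, never past frames.
    """
    def __init__(self, shape, prior_var=1e6):
        self.mu = np.zeros(shape)
        self.var = np.full(shape, prior_var)

    def fuse(self, z, static_mask, r=1.0):
        k = self.var / (self.var + r)            # per-pixel gain
        upd = static_mask & np.isfinite(z)       # skip dynamic pixels and holes
        self.mu[upd] += (k * (z - self.mu))[upd]
        self.var[upd] *= (1.0 - k)[upd]
        return self.mu
```

Each fused frame shrinks the per-pixel variance, so the static structure becomes increasingly stable over hundreds of frames while dynamic pixels are simply left out of the update.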
1.2 RGB-D Video Applications
Provided with RGB-D video sequences, many tasks that were once difficult using
only RGB sequences become possible and simpler. The depth videos either act
as an explicit data source for tasks that focus on the interpretation, manipulation or
inference of the geometrical structure of the captured content (for instance, 3D facial
pose estimation and tracking, 3D scene reconstruction, etc.), or provide implicit
geometrical cues for high-level learning or analysis of the 3D real world. In this thesis,
novel methods for geometrical manipulation and inference are explored, such as a
depth-based robust 3D facial pose tracking system with online face model personaliza-
tion, and two by-products of the proposed temporal enhancement: novel view synthesis
and background subtraction.
Conventional three-dimensional television (3DTV) systems require binocular or
multi-view stereo RGB videos as input. With the wide popularity of RGB-D
cameras, modern systems have become compatible with one or several synchronized
high-quality RGB-D video pairs for synthesizing a new frame on the screen from a
novel viewpoint. This is further extended to the free-viewpoint television (FTV) system
if it can synthesize a novel video from any viewpoint in front of the screen. However,
the trade-off between the transmission and storage budget and the complete coverage
of the captured 3D scene suggests a sparser RGB-D camera setup. To facili-
tate visually plausible novel view synthesis, sufficiently accurate depth videos should
be derived from the raw RGB-D videos, and the texture holes in the novel view
should be faithfully recovered. In this thesis, novel view synthesis is presented as
an application of the proposed spatio-temporal depth video enhancement, which inher-
ently includes an online 3D static scene reconstruction. The progressively updated 3D
static scene offers reliable inference of the content in the texture holes. In addition, the
resulting structure-optimized depth videos greatly reduce the misalignment errors be-
tween the depth and RGB frames, as well as the structure errors when filling depth holes,
while smoothing the noise and outliers. These advantages enable structure-optimized
novel view synthesis with reduced spatial distortions.
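The geometric core of depth-image-based novel view synthesis is a forward warp: back-project each pixel with its depth, apply the rigid change of viewpoint, and re-project. A sketch under the assumption of known intrinsics K and relative pose (R, t); hole filling and occlusion handling, which the text addresses via the static structure, are omitted:

```python
import numpy as np

def warp_to_novel_view(depth, K, R, t):
    """Forward-warp pixel coordinates of a depth map into a novel view.

    Each pixel (u, v) with depth Z is back-projected with intrinsics K,
    rigidly transformed by (R, t), and re-projected. Pixels falling outside
    the view or becoming occluded leave holes in the synthesized image.
    Returns the novel-view pixel coordinates and the transformed depths.
    """
    H, W = depth.shape
    vs, us = np.mgrid[0:H, 0:W]
    Z = depth.ravel()
    pix = np.stack([us.ravel() * Z, vs.ravel() * Z, Z])  # homogeneous coords * Z
    pts = np.linalg.inv(K) @ pix                         # 3D points, camera frame
    pts2 = R @ pts + t.reshape(3, 1)                     # novel camera frame
    proj = K @ pts2
    u2, v2 = proj[0] / proj[2], proj[1] / proj[2]
    return u2.reshape(H, W), v2.reshape(H, W), pts2[2].reshape(H, W)
```

With the identity pose every pixel maps to itself; a pure horizontal camera translation t_x shifts pixels by f·t_x/Z, which is why depth errors translate directly into warping artifacts in the novel view.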
Robust tracking of the 3D facial pose is an essential task in the fields of com-
puter vision and computer graphics, with applications in facial performance capture,
human-computer interaction, immersive 3DTV and FTV, as well as VR and AR sys-
tems. Traditionally, facial pose tracking has been successfully performed on RGB
videos [37–45] for well-constrained scenes, but illumination variations, shadows,
and large, severe occlusions hamper RGB-based facial pose tracking systems
from being employed in unconstrained scenarios. Unconstrained scenarios,
on the other hand, are much more common in numerous consumer applications, e.g.,
interactive games in VR/AR, virtual chat, etc. Fortunately, driven by the emer-
gence of commodity real-time range sensors, utilizing depth information has become
a new trend for robust 3D facial pose tracking, since the depth data explicitly encode
spatial relationships and give additional cues for occlusion reasoning. Although
promising results have been reported by leveraging both RGB video and depth data
to facilitate unconstrained facial pose tracking, these methods cannot reliably handle
occlusion when the RGB data alone are inadequate due to inconsistent or poor lighting
conditions. Therefore, exploring the depth data alone for robust 3D facial pose tracking
is meaningful as an alternative that is complementary to traditional tracking systems.
In unconstrained scenarios with depth cameras as input, there are new challenges: (1)
complex self-occlusions and object-to-face occlusions caused by hair, accessories, hands,
etc.; (2) the facial pose tracking algorithm should always be available and online-
adaptive to any user without manual calibration; (3) the tracking should be stable over
time and not vulnerable to users' expression variations. Unlike previous depth-based
approaches built on discriminative or data-driven methods [46–52] that require sophis-
ticated training or manual intervention, we leverage a parameterized generative face
model and robust occlusion-aware pose estimation to build a robust 3D facial
pose tracking system. It is designed to handle large and complex occlusions in uncon-
trolled scenes under inconsistent illumination changes or poor lighting conditions,
and enables simultaneous facial pose tracking and face model personalization on-the-fly.
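For context, the closed-form alignment step at the core of ICP-style rigid pose updates (ICP [9] is used as a comparison baseline in Chapter 5) is the Kabsch/Procrustes solution via SVD. The proposed tracker replaces the point matching and occlusion weighting around this step, not the step itself; this is a generic sketch, not the thesis's ray-visibility method:

```python
import numpy as np

def rigid_align(P, Q):
    """Least-squares rigid transform (R, t) mapping points P onto Q.

    P, Q: (N, 3) arrays of corresponding 3D points. This is the
    Kabsch/Procrustes step used inside ICP-style pose estimation:
    center both sets, take the SVD of the cross-covariance, and
    guard against reflections.
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # +1 rotation, -1 reflection
    S = np.diag([1.0, 1.0, d])
    R = Vt.T @ S @ U.T
    t = cq - R @ cp
    return R, t
```

Given correct correspondences this recovers the pose exactly; the hard part in practice, and the focus of the proposed occlusion-aware estimator, is deciding which depth points should count as correspondences at all.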
1.3 The Probabilistic Models
By interpreting the target problems in this thesis from a probabilistic view, not
only can we handle the uncertainties arising in each task, e.g., noise, outliers and other
artifacts, but the probabilistic view also encourages the formulation of compact and
learnable models with reliable predictions, as long as proper models are selected.
Furthermore, a generative model is a complete probabilistic model that describes
the distribution of the observations as well as the underlying priors. Its advantages
over the discriminative model are that it is a full probabilistic model of all variables,
and that it can simulate the inherent (or hidden) prior distributions and randomly
sample the observations. In contrast, the discriminative model only focuses on the
posterior and does not really care what the underlying model is. In particular, for the
tasks of temporal RGB-D video enhancement and 3D facial pose tracking, the genera-
tive model expresses more complex relationships between the observed depth data and
the hidden probabilistic models, such as the static structure and the 3D multilinear
morphable face model [7]. In addition, the generative model can faithfully predict on
its own when no observations are available.
The parameter estimation techniques with respect to a probabilistic generative
model vary from case to case. In this thesis, online variational Bayesian methods are
employed because of their effectiveness and memory-efficiency for online model adapta-
tion, as well as their ability to handle the intractable integrals arising in Bayesian inference.
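The memory-efficiency argument can be illustrated with the simplest conjugate case: an online Bayesian update of a Gaussian mean with known noise variance, where the posterior after each frame becomes the prior for the next. The thesis's models need variational approximations precisely because their posteriors are not conjugate like this toy case:

```python
def online_gaussian_update(mu0, var0, z, r):
    """Conjugate Bayesian update of a Gaussian prior N(mu0, var0) over an
    unknown mean, given one observation z with known noise variance r.

    The posterior is again Gaussian, so only two numbers per parameter are
    ever stored: the posterior (mu1, var1) simply becomes the next prior,
    which is the memory-efficiency property online methods exploit.
    """
    prec = 1.0 / var0 + 1.0 / r       # posterior precision
    var1 = 1.0 / prec
    mu1 = var1 * (mu0 / var0 + z / r)  # precision-weighted mean
    return mu1, var1
```

Running this update frame by frame gives exactly the same posterior as a batch computation over all frames, without ever storing them.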
1.4 Thesis Contributions
This thesis carries out research on spatial and temporal RGB-D video enhancement,
and robust 3D facial pose tracking with online face model personalization. The applied
RGB-D video data were captured by a Kinect (version 1) and a low-resolution time-of-flight camera. The contributions of the thesis are as follows:
1. In the part of spatial enhancement of RGB-D videos, this thesis presents two
practical solutions: (a) One approach is a hybrid strategy combining the segment-
based parametrized structure propagation and the depth interpolation with high-
dimensional guided filtering. A new arbitrary-shape patch matching method is
proposed to robustly extend neighboring patches’ structures into the query patch.
Experiments show that the proposed method outperforms the state-of-the-art
methods with respect to the depth hole filling problem. (b) The other approach is a novel parameterized probabilistic model for accelerating the estimation of the weighted distribution. Different from conventional methods, which need a large number of filtering operations to estimate a sufficiently accurate distribution, the proposed method only requires a small, finite number of filtering operations determined by the structure of the input image. The derived weighted mode and median filters are much faster yet as effective as state-of-the-art methods in various applications, such as the spatial enhancement of RGB-D videos, detailed contrast enhancement, and JPEG compression artifact removal.
2. The temporally consistent RGB-D video enhancement is performed by introduc-
ing the static structure of the captured scene, which is estimated online by a
probabilistic generative mixture model with efficient variational parameter ap-
proximation, spatial enhancement and update scheme. Based on this special
probabilistic static structure, the proposed enhancement aims at strengthening
the long-range temporal consistency around the static region whilst maintaining
necessary depth variation in the dynamic content. The proposed framework is compatible with online streaming RGB-D videos, so there is no need to store long sequences of adjacent frames; it is thus memory- and computation-efficient.
3. This thesis unifies the 3D facial pose tracking and online identity adaptation
based on a parameterized generative face model that integrates the descriptions
of shape, identity and expression. This face model not only effectively models the identity but also provides a statistical interpretation of the expression. By tracing the identity distribution from a generative perspective, the face model can be gradually adapted to the user with sequentially input depth frames.
The occlusion-aware pose estimation is achieved by minimizing an information-
theoretic ray visibility score that regularizes the visibility of the face model in the
current depth frame. This method does not need explicit correspondence detection, yet it both accurately estimates the facial pose and robustly handles the occlusion problem.
1.5 Outline
This thesis is organized into six chapters.
Chapter 2 provides the detailed algorithm about the spatial depth enhancement based
on a hybrid strategy combining the segment-based parameterized depth structure prop-
agation and depth interpolation based on high-dimensional guided filtering.
Chapter 3 presents a parameterized probabilistic approximation method for the acceleration of weighted median/mode filtering, which is much faster with barely any sacrifice in performance.
Chapter 4 proposes a temporally consistent depth video enhancement method based
on the online estimation of a probabilistic generative model called static structure.
Specifically, this chapter describes a two-stage procedure designed separately for the
static and dynamic regions of the current depth frame, both enabling long-term tem-
poral consistency and preserving necessary depth variations.
Chapter 5 presents a unified framework for robust 3D facial pose tracking and online face model personalization. The facial tracking thread consists of a novel correspondence-free and occlusion-aware rigid pose tracking method, while the generative face model in the online personalization thread effectively depicts the identity and is robust to the shape variations caused by expression changes.
Chapter 6 provides conclusions for the works listed above and suggests a number of
areas to be pursued as future work.
Chapter 2
Hybrid Geometric Hole Filling Strategy for
Spatial Enhancement
2.1 Introduction
Assume the raw depth image captured from a commodity depth sensor has a lower resolution than the corresponding color image, and contains noise, outliers and severe depth holes. This chapter tackles the low resolution, noise and outliers in a raw depth image, together with a special treatment of the large-hole filling problem. In particular, the depth holes originate from depth upsampling, unreliable-depth removal, and the depth-missing regions. In the first step, we invalidate
and remove unreliable depth pixels that are within vulnerable regions around complex
discontinuities or structures, then align the depth map with the color image and map it
into the color image’s coordinate. In the second step, a hybrid strategy is proposed to
fill in the depth hole by the combination of segment-based structure propagation and
depth interpolation. After that, a standard joint bilateral filter is applied to refine the
depth image. The overall framework is shown in Figure 2.1.
Figure 2.1: Framework of the proposed method. Given a depth map and a color image, the pipeline proceeds through alignment of the depth and image pair, invalidation of low-reliability depth, hole region partition, filtering-based depth interpolation and segment-based depth inference, and finally depth map refinement.
2.2 Related Work
Spatial depth map enhancement has been extensively studied for years. The most
studied problem is upsampling and smoothing. A pioneering work in this field was done by Diebel et al. [11]. They model the depth upsampling problem as a Markov Random Field (MRF) with the assumptions that 1) discontinuities in the color image and the corresponding depth map should be co-aligned, and 2) pixels with similar texture should have similar depth. In a similar fashion, many researchers [12; 13] also use MRF or auto-regression models to upsample and smooth the depth surface while preserving the discontinuities. Differences among their works mainly come from the smoothness or regularization terms in their objective functions. However, such energy minimization methods are always computationally expensive, which hinders a variety of real-time applications.
Based on similar assumptions, Kopf et al. [25] proposed Joint Bilateral Upsampling (JBU) for fast and effective upsampling and smoothing of low-resolution and noisy depth maps, as an extension of the famous bilateral filter [26]. To address the artifacts that occur in JBU, e.g., texture copying and edge blurring, numerous modified filters have been proposed [27; 29; 30] in recent years.
The depth hole filling problem is related to image inpainting and to occlusion handling in stereo vision. According to the recent work of Richardt et al. [1], the standard joint bilateral filter (JBF) can efficiently fill depth holes, but it is prone to producing artifacts when a hole region is too large. Moreover, depth interpolation by filtering methods cannot preserve the geometric surface structures, because the extrapolated depth surfaces can only be piecewise constant. On the other hand, many stereo algorithms simply fill the depth holes from the background, under the assumption that occlusions are usually located in background regions; these all suffer significant artifacts when the captured scene is complex. Our work on hole filling is related to the stereoscopic inpainting work of Wang et al. [2], who over-segment stereo images, fit a plane to each segment with the estimated disparities, and then propagate the 3D planes into holes by matching segments in a greedy way. Their cost function for segment matching heavily relies on the stereo images and thus cannot be exploited in general cases, and their plane fitting procedure is not precise enough to estimate the local surface structures.
Figure 2.2: Align the depth map into the color image coordinate and then partition the hole region into Ωs and Ωf. The test depth map comes from the Middlebury dataset.
2.3 Proposed Method
We take an image I and its corresponding depth map D as inputs. Define the set of
invisible (hole) pixels as Ω, and the set of visible pixels as Ψ.
2.3.1 Unreliable Region Detection and Invalidation
Before transforming the depth map into the image's coordinate, we need to invalidate unreliable depth pixels. The reliability can be measured by the depth gradient, as mentioned in [1], because unreliable pixels always occur along depth discontinuities or in neighbourhoods with high depth variance, where depth cameras cannot capture depth accurately. Moreover, most real-time depth sensors exhibit mismatching errors between color and depth edges due to calibration error between the color camera and the depth sensor. Invalidating such low-reliability regions and filling in the depth under the guidance of the image diminishes the edge mismatching problem and increases the reliability of the depth values.
In detail, a Sobel approximation is applied to compute the depth gradient, and we invalidate pixels whose gradient magnitude is larger than a given threshold τ.
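The invalidation step above can be sketched with SciPy's Sobel operator; the function name `invalidate_unreliable` and the default threshold `tau` are illustrative choices, not values from the thesis:

```python
import numpy as np
from scipy import ndimage

def invalidate_unreliable(depth, tau=3.0):
    """Mark depth pixels whose gradient magnitude exceeds tau as invalid (NaN)."""
    gx = ndimage.sobel(depth, axis=1, mode="nearest")  # Sobel approximation of d/dx
    gy = ndimage.sobel(depth, axis=0, mode="nearest")  # Sobel approximation of d/dy
    grad = np.hypot(gx, gy)
    out = depth.astype(float)
    out[grad > tau] = np.nan  # invalidated pixels join the hole set
    return out
```

Invalidated pixels are then refilled under image guidance by the hybrid strategy of Section 2.3.2.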
2.3.2 Hybrid Strategy of Geometric Hole Filling
After invalidating the unreliable regions and transforming the depth map into the color image's coordinate, the resultant depth map contains three types of holes: holes from occlusion and/or specular regions Ωo, from invalidation Ωd, and from sparse upsampling Ωu. Therefore, we define the hole set Ω as
Ω = {p | p ∈ Ωo ∪ Ωu ∪ Ωd}, (2.1)
where p indicates a pixel coordinate. Our proposed hybrid strategy is a combination of filtering and surface structure propagation. Filtering-based approaches are quite efficient for interpolating depth values when the hole region is small, but they may fail on large holes. In that case, we exploit the widely used segment constraint [2] to infer the structure: we segment a hole and its neighbors into several small patches according to the guiding color image, and assume each patch has a smooth surface structure without sudden depth variation. A patch with enough depth samples can then be modeled by a plane or a curved surface, and it is reasonable to propagate its surface parameters into neighboring patches with similar textures in the hole.
Our hole filling process first partitions the hole set Ω into two subsets Ωf and Ωs, and then employs depth interpolation in Ωf and depth inference in Ωs; see Figure 2.2.
Hole Region Partition
A pixel is considered to be in the region Ωf when its local w × w window has enough informative samples to interpolate its depth. We dilate the visible region Ψ by a square of width w, as Ψw = Dilation(Ψ, w); then any pixel p ∈ Ψw ∩ Ω has at least one sample. The sets Ωs and Ωf are

Ωs = Dilation((Ω − Ψw ∩ Ω), w) ∩ Ω    (2.2)
Ωf = Ω − Ωs    (2.3)

The dilation operation in Equation (2.2) safely excludes from Ωf the pixels that have insufficient depth samples in their neighborhood.
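Equations (2.2)–(2.3) map directly onto morphological operations; a minimal sketch follows (the function name `partition_holes` and the boolean-mask interface are assumptions):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def partition_holes(visible, w=5):
    """Split the hole set Omega into Omega_f (enough samples for filtering)
    and Omega_s (needs segment-based inference), per Eqs. (2.2)-(2.3).
    `visible` is a boolean mask of pixels with valid depth (Psi)."""
    omega = ~visible
    struct = np.ones((w, w), dtype=bool)               # square of width w
    psi_w = binary_dilation(visible, structure=struct)  # Psi_w = Dilation(Psi, w)
    inner = omega & ~psi_w                              # Omega - (Psi_w ∩ Omega)
    omega_s = binary_dilation(inner, structure=struct) & omega
    omega_f = omega & ~omega_s
    return omega_f, omega_s
```

Thin holes land entirely in Ωf, while the interiors of large holes (plus a safety margin) land in Ωs.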
Depth Interpolation by Filtering
To fill Ωf, a standard joint bilateral filter [25] is utilized. For each pixel p ∈ Ωf and its visible local neighbours q ∈ 𝒩p ∩ Ψ in a w × w window, the estimated depth is

Dp = (1/Np) Σ_{q ∈ 𝒩p ∩ Ψ} Gs(p, q) Gr(Ip, Iq) Dq    (2.4)
18 CHAP. 2. HYBRID GEOMETRIC HOLE FILLING STRATEGY FOR SPATIAL ENHANCEMENT
where Gs and Gr are Gaussian kernel functions with standard deviations σs and σr, measuring the spatial similarity and the range (color) similarity, respectively. Np is the normalization factor that ensures the weights sum to one.
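Equation (2.4) can be sketched as a brute-force loop over the hole pixels (a sketch only; practical implementations vectorize or use constant-time filters, and the names and defaults here are illustrative):

```python
import numpy as np

def jbf_fill(depth, image, hole_mask, w=7, sigma_s=2.0, sigma_r=0.1):
    """Fill each hole pixel p with Eq. (2.4): a Gaussian-weighted average of
    visible neighbours, weighted by spatial and colour (guidance) similarity."""
    H, W = depth.shape
    r = w // 2
    out = depth.copy()
    for py, px in zip(*np.nonzero(hole_mask)):
        y0, y1 = max(0, py - r), min(H, py + r + 1)
        x0, x1 = max(0, px - r), min(W, px + r + 1)
        d = depth[y0:y1, x0:x1]
        valid = ~np.isnan(d)                      # q in N_p ∩ Psi
        if not valid.any():
            continue
        gy, gx = np.mgrid[y0:y1, x0:x1]
        ws = np.exp(-((gy - py) ** 2 + (gx - px) ** 2) / (2 * sigma_s ** 2))
        wr = np.exp(-(image[y0:y1, x0:x1] - image[py, px]) ** 2 / (2 * sigma_r ** 2))
        wt = (ws * wr)[valid]
        out[py, px] = np.sum(wt * d[valid]) / np.sum(wt)  # weights normalized to one
    return out
```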
Depth Inference under Segment Constraint
Many successful super-pixel segmentation methods have been published recently; in this application we use a fast method called simple linear iterative clustering (SLIC) [53] to group pixels into a set of color patches, within which pixels share similar color or texture. Patches that overlap Ωs are then sorted into two sets Sv and Su: Sv is the set in which each patch has enough visible pixels (e.g., more than 50%) to infer its surface structure, and the patches in Su do not.
Surface model estimation for patches in Sv. For simplicity, we can just model
the surface by
D(u, v) = a0 + a1u + a2v, or    (2.5)
D(u, v) = a0 + a1u + a2v + a3u² + a4v² + a5uv    (2.6)
where Equation (2.5) is the linear form, and Equation (2.6) is the quadratic form.
We use RANSAC to robustly estimate each patch's surface model. Moreover, for the sake of accuracy, we can alternatively transform the depth map into the 3D metric coordinate (X, Y, Z) and model the function Z(X, Y) in a similar way. In this case, recovering pixel p's depth amounts to finding the intersection of the surface and the line-of-sight through p.
After estimating the surface models for the patches in Sv, their invisible pixels can be efficiently inferred. At the same time, the surface models of visible patches are also estimated. We can further refine them by merging patches with similar surface structure and then re-calculating their surface models.
Surface propagation for patches in Su. This turns out to be a patch matching problem. Here we propose a greedy algorithm that robustly finds the most similar patch pair according to a novel matching cost.
Our algorithm first selects a candidate patch set CSu = {Pv} for Su, where each Pv has an estimated surface model and lies near the hole Su, because the surface structure is more consistent and reliable near the hole boundary; the filling process thus proceeds from outer to inner patches. In each iteration, we find the best-matched patch in CSu, assign its surface model to the query patch Pu to fill in the depth, and then add Pu to CSu. This process continues until all patches in Su are filled.

Figure 2.3: Illustration of the patch matching process. The left image is the segmented color image; the right one is a close-up of the local region marked blue in the left image. Pu is the query patch; Pv is in the candidate patch set. A detailed description is in the text.
Given a patch Pv ∈ CSu and a patch Pu ∈ Su, we want to measure their similarity. Since each patch has an arbitrary shape, the commonly used MSE is inapplicable, and the mean intensity is not distinctive enough. Our proposed method randomly selects n pixels in Pu as p_u^j, j = 1, …, n, and k pixels in Pv as p_v^i, i = 1, …, k, and assigns an m × m square sub-patch to each selected pixel, denoted B_v^i in Pv and B_u^j in Pu, respectively. If two patches are similar, their sub-patch matching cost should be minimal. Sub-patch matching is valid because it considers the color and spatial distributions of texture while being able to handle patches of arbitrary shape.
To robustly estimate their similarity without introducing mismatches, we propose a shape-adapted sum-of-squares cost to measure the similarity between B_v^i and B_u^j:

E_{B_u^j}(B_v^i) = ‖K_v^i ◦ (B_v^i − B_u^j)‖²_F / N_v^i + ‖K_u^j ◦ (B_v^i − B_u^j)‖²_F / N_u^j    (2.7)

where K_v^i and K_u^j are bilateral kernels centred at pixels p_v^i and p_u^j, defined similarly to Equation (2.4), which measure the color and spatial similarity of the centre pixel against its neighbours; ◦ represents element-wise multiplication; and N_v^i and N_u^j are normalization factors as in Section 2.3.2. The cost between B_u^j and the patch Pv is then E_{B_u^j}(Pv) = (1/k) Σ_{i=1}^{k} E_{B_u^j}(B_v^i).
Therefore, given CSu and a query patch Pu, for each B_u^j in Pu we can find the best patch P_{v*} with the smallest cost. We then form a histogram in which each bin corresponds to a candidate patch, and whose bin value is the number of sub-patches in Pu that match that candidate patch. The bin with the largest value refers to the most similar patch. We normalize the histogram and denote it as H_{Pu}(Pv), where Pv ∈ CSu.
Since it is possible to find more than one patch with similar color, we further add a spatial constraint into our framework. In detail, we measure the Euclidean distance between the centre pixels of the two patches, d(Pu, Pv), and normalize the distance by an exponential function, giving the overall cost function

T_{Pu}(Pv) = H_{Pu}(Pv) · exp(−d(Pu, Pv)² / (2σ_d²))    (2.8)
The maximum value of T_{Pu}(Pv) indicates the optimal patch pair. Because patches with similar texture may have different surface structures, choosing only the best-matched patch may inevitably introduce errors. To mitigate this, we fill the query patch with candidates from the most similar to the least similar; once the filled patch is consistent with its local neighbours, the process stops.
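The vote histogram and the spatial prior of Eq. (2.8) combine into a simple candidate ranking; a sketch, with `rank_candidates` and its inputs (`votes`, `dists`) as hypothetical names:

```python
import numpy as np

def rank_candidates(votes, dists, sigma_d=20.0):
    """Score each candidate patch by Eq. (2.8),
    T(P_v) = H(P_v) * exp(-d(P_u, P_v)^2 / (2 sigma_d^2)),
    and return candidate indices from best to worst, so the query patch
    can be filled from the most similar candidate down until the result
    is locally consistent. `votes[i]` counts the sub-patches of P_u whose
    best match is candidate i; `dists[i]` is the centre-to-centre distance."""
    votes = np.asarray(votes, dtype=float)
    votes = votes / votes.sum()                      # normalized histogram H
    t = votes * np.exp(-np.asarray(dists, float) ** 2 / (2 * sigma_d ** 2))
    return np.argsort(-t)                            # descending by score T
```

Note how a candidate with many votes but a distant centre can still lose to a nearer, moderately voted one.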
Depth Map Refinement
After filling in all the missing pixels in the depth map, we can further refine it to reduce noise and artifacts, as well as enhance the depth structure according to the guidance image. We find that a standard joint bilateral filter is sufficient to provide effective and efficient results.
2.4 Experiments
In this section, we evaluate the performance of the proposed algorithm and compare it with existing methods. Since the main contribution of our work is the hole filling strategy, we compare its performance with other hole filling methods: the multi-resolution joint bilateral upsampling (MR-JBU) of Richardt et al. [1], and the method of Wang et al. [2]. Test scenes are from the Middlebury datasets¹. We
1http://vision.middlebury.edu/stereo/data/
Figure 2.4: Middlebury datasets employed for the experimental comparisons: (a) color images, (b) raw depth maps, (c) ground truth. Test scenes (from left to right) are Reindeer, Midd2 and Teddy.
choose the linear form to model the surface, similar to [2], for a fair comparison. The noisy depth map is constructed by introducing occlusions according to cross-checking of the stereo images, down-sampling (2×), and adding Gaussian noise.
A visual comparison is presented in Figure 2.5. JBU clearly suffers from texture copying and blurring artifacts, while Wang's greedy patch matching algorithm also produces apparent mismatching errors, since the stereo constraint is not applicable. Representative artifacts are shown in red boxes. Quantitative comparisons are made by measuring the average percentage of bad pixels (BPR, error ≥ 1) and the mean absolute difference (MAD); the results on the three test scenes are listed in Tables 2.1 and 2.2, and our method outperforms the other algorithms with the lowest BPR and MAD scores (in bold font).

Figure 2.5: Visual comparison on the Middlebury datasets: (a) color images, (b) MR-JBU [1], (c) Wang's [2] method, (d) the proposed method. Test scenes (from left to right) are Reindeer, Midd2 and Teddy.

According to the quantitative and qualitative comparisons, the proposed method performs satisfactorily and better than the other methods.
           MR-JBU [1]   Wang's [2]   Ours
Reindeer      8.35         3.65      3.33
Midd2        14.10         3.10      2.51
Teddy         7.23         4.09      3.66

Table 2.1: Comparison of bad pixel rate (%)

           MR-JBU [1]   Wang's [2]   Ours
Reindeer      1.13         0.98      0.47
Midd2         1.67         0.62      0.31
Teddy         0.68         0.64      0.40

Table 2.2: Comparison of mean absolute difference
2.5 Summary
In this chapter, a new depth image enhancement approach has been proposed as a hybrid strategy combining filtering- and segment-based structure propagation. Specifically, this thesis has presented a new arbitrary-shape patch matching method to robustly extend neighbouring patches' structures into the query patch. Experiments show that the proposed method outperforms the reference methods on the depth hole filling problem. In the future, we will pay more attention to improving the robustness of the depth inference model, so that the filled regions are seamless and contain fewer mismatching errors.
Chapter 3
Weighted Structure Filters Based on Parametric
Structural Decomposition
3.1 Introduction
A variety of popular image filters in computer vision are related to the local statistics
of the input image. For example, the median filter outputs the point that reaches
half of the local cumulative distribution [4; 54; 55]. The weighted mode filter [5; 56;
57] tries to find the global mode of the local distribution. Moreover, the widely popular bilateral filter [26] can be expressed as the mean of the local distribution estimated by a Gaussian kernel density estimator [58]. Provided a guidance feature map (e.g., image intensity, a patch, etc.), the weighted local distribution can be modified to jointly reflect the statistics of both the input image and the feature map, which further contributes to several kinds of structure- or style-transfer applications, like depth or disparity refinement in stereo matching [4; 5] and joint filtering [25].
Without explicitly estimating the local distribution, a number of approaches have been designed to accelerate the bilateral filter or similar weighted-average filters, such as the domain transform filter [32], the adaptive manifolds filter [31] and the guided filter [59]. However, efficient methods for direct estimation of the local distributions deserve further attention, because many applications require direct operations on these distributions. Although a brute-force implementation is still adopted in many computer vision systems, its high complexity limits its popularity and hampers real-time systems and applications. Constant-time algorithms for estimating the local distributions (or histograms) have been proposed in the literature, for instance the constant-time weighted median filter [4] and the smoothed local histogram filters [3]. The complexity of these methods depends on the number of bins used to generate the histograms, as well as the complexity of the filtering operation that calculates the value of each bin. Even though the complexity of each filtering operation has been reported as O(1) in the literature, an 8-bit single-channel grayscale image usually needs 256 bins to produce a sufficiently accurate result, not to mention continuous or high-precision images.
Related to but different from these methods, in this chapter we propose a novel distribution estimation method, designed for efficiency, to accelerate various image filters. It is based on kernel density estimation with a new separable kernel defined by a weighted combination of a series of probabilistic generative models. The resultant distribution requires a much reduced number of filtering operations, which is moreover independent of the values of the bins. The number of filtering operations is exactly the number of models used, and is usually smaller than the number of bins, which abates the computational complexity. The required models can be a uniform quantization of the domain of the input image, or locally adaptive to the structures of the input. Since a local patch of an image can almost always be decomposed into a limited number of distinct local structures, only a small number of the locally adaptive models are necessary, and the complexity is further reduced. We also accelerate the weighted mode filter and the weighted median filter by leveraging the proposed distribution estimation method. They achieve comparable performance in various applications but run faster than current state-of-the-art algorithms.
3.2 Related Work
Weighted-average filters, like the bilateral filter [25; 26], implicitly reflect properties of the local distribution. The brute-force implementation generally suffers from inefficiency. In [60], an approximate solution was proposed by formulating bilateral filtering as a high-dimensional low-pass filter, which can be accelerated by downsampling the space-range domain. Following this idea, different data structures have since been proposed to further speed up the filters [31; 61–63], among which the adaptive manifolds [31] caught our attention and inspired our construction of the locally adaptive models. The guided filter [59] is a popular and efficient constant-time alternative: it can imitate a filter response similar to that of the bilateral filter, but enforces a locally linear relationship between the filtering output and the guidance image. The domain transform filter [32] also produces a similar constant-time edge-preserving filter and achieves real-time performance without quantization or coarsening.
The median filter might be the first image filter that explicitly applies the local histogram (a discretized distribution). Unlike the weighted median filter, for which little work has focused on acceleration, its unweighted counterpart has received several constant-time solutions. One class of algorithms was presented in the literature to lessen the histogram update complexity [54; 55]. Another, introduced by Kass and Solomon [3], draws isotropic filtering into the construction of a so-called smoothed local histogram, which is a special case of kernel density estimation; the median and mode of this histogram are then estimated via a look-up table.
The weighted median filter, as well as the weighted mode filter, cannot directly duplicate this success, since the weights are spatially varying for each local window. Min et al. [5] proposed a weighted mode filter that adopts bilateral weights for depth video enhancement, but it lacks an efficient implementation. The constant-time weighted median filter [4] for disparity refinement is one of the most recent works that tries to accelerate the construction of the local distribution. This method performs edge-preserving filtering to produce the probability of each bin in the local histogram; the number of bins determines the number of filtering operations applied. Thus it is less effective when hundreds of intensity levels are required, especially when processing natural images.
3.3 Motivation and Background
3.3.1 Non-parametric Representations of Local Image Statistics
Given an input grayscale image¹ f and its corresponding feature map as its guidance, the intensity distribution h(x, ·) in a patch centered at pixel x can be represented non-parametrically by anisotropic kernel regression [64] as

h(x, g) = (1/Z(x)) Σ_{y ∈ Ωx} w(x, y) φ_x(g, f_y),    (3.1)

where Ωx is a local neighborhood centered at x, whose area is the same as the target patch. The kernel φ(·, ·) varies across applications; a common choice is the Gaussian kernel φ(u, v; λ) = √(λ/(2π)) exp{−(λ/2)‖u − v‖²}, where λ indicates its bandwidth and controls the
¹A color image stacks red, green and blue intensity maps, each of which has a similar non-parametric representation to that of a grayscale image.
Figure 3.1: Illustration of correlations among structures in local patches. (a) is the sample image; four patches A, B, C and D were selected from the area in the black box. (b) shows the histograms of the four patches, fitted by kernel regression; the revealed modes indicate the local structures, labeled #1 to #4. (c) indicates the locations of these structures in each patch. These structures vary slowly in a local neighborhood and are shared among the patches.
distribution smoothness. It is worth noting that λ → ∞ results in φ(u, v; ∞) = δ(u − v), where δ(·) is the Kronecker delta function, which renders h(x, ·) a weighted histogram [4]. The normalization factor is Z(x) = Σ_{y ∈ Ωx} w(x, y), while the weight w(x, y) measures the spatial nearness and guidance-feature affinity between x and y, controlling the impact of pixel y on the center pixel x. Thus this distribution is not only controlled by the intensity distribution but also adjusted by the guidance feature affinity. Despite the large amount of data needed to describe the local image statistics non-parametrically, this representation has the flexibility to compactly fit the distribution of almost any patch of a natural image.
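Equation (3.1) with the Gaussian kernel can be evaluated directly for one pixel; a numpy sketch under the same notation (the function and argument names are illustrative):

```python
import numpy as np

def local_distribution(f_patch, w_patch, grid, lam=50.0):
    """Evaluate h(x, g) of Eq. (3.1) on a grid of intensity values g, for one
    pixel whose neighbourhood intensities are `f_patch` and whose weights
    w(x, y) (spatial nearness times feature affinity) are `w_patch`."""
    f = np.asarray(f_patch, dtype=float).ravel()
    w = np.asarray(w_patch, dtype=float).ravel()
    # Gaussian kernel phi(g, f_y; lambda) = sqrt(lambda/(2 pi)) exp(-(lambda/2)(g - f_y)^2)
    phi = np.sqrt(lam / (2 * np.pi)) * np.exp(-0.5 * lam * (grid[:, None] - f[None, :]) ** 2)
    # weighted sum over the neighbourhood, divided by Z(x) = sum of weights
    return (phi * w[None, :]).sum(axis=1) / w.sum()
```

Each structure in the patch contributes one mode of the returned density, as in Figure 3.1(b).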
Generally speaking, a small patch of a natural image does not contain a large number of distinct structures, so the local distribution is generally sparse. As shown in Figure 3.1(b), the multi-modal distribution depicts a small number of distinct structures in one pixel's local neighborhood. Since each mode represents a subpopulation of intensities that one structure may possess, a number of structure-preserving operations on a patch can be conducted by analyzing and manipulating its distribution.
For instance, a variety of structure-preserving image smoothers are related to this non-parametric description. The weighted median filter (WMed) [4] outputs the median f_med for which the cumulative distribution of h(x, ·) equals 0.5. The weighted mode filter (WMod) [5] seeks the maximum mode of h(x, f). Moreover, the widely popular bilateral filter [26] estimates the mean of h(x, ·) [58].
3.3.2 Correlations across Local Structures
Figure 3.1(a) shows four patches A, B, C and D extracted from a natural grayscale image; their histograms, fitted by the kernel regression (3.1), are shown in Figure 3.1(b). Even though these patches lie at different locations, they actually share similar structures. For example, patches A and B share the same structure #4, referring to the "white lighthouse". Structure #3 represents the "cloud" and occurs in patches A and C. Likewise, #2 indicates the "sky" and is shared by patches B, C and D. Observing the similarity between the distributions generated for these patches, we notably find that two pixels x and y in a local neighborhood share similar responses to each structure, since the structures change subtly over a small neighborhood. To construct a coherent representation that accounts for both the local and global statistics of structures, the global consistencies or correlations among structures should be taken into account. We propose a parametric approach that explicitly represents the spatially varying structures as a series of low-dimensional manifolds [31], and formulates the weighted distribution by Gaussian mixture models with weights adjusted by guided feature maps. Therefore, we can utilize the local image statistics while constraining them with the global correlation among local structures, which enables a simple and effective image/video cue for various structure-preserving applications.
3.3.3 Complexity of the Local Statistics Estimation
Common local image statistics are the mean, mode and median of the weighted local distributions. However, as discussed in Section 3.3.1, although the calculation of the mean value is trivial (a bilateral filtering operation), the estimation of the mode or median carries a high computational budget. The approximated probability distribution is directly involved in the weighted median filter and the weighted mode filter, since they replace the value of a pixel by the median or the global mode of h(x, ·). The median is usually estimated by tracing the cumulative distribution [3]:
C(x, g) = ∫_{−∞}^{g} h(x, g′) dg′ = (1/Z(x)) Σ_{y ∈ Ωx} w(x, y) · ∫_{−∞}^{g} φ_x(g′, f_y) dg′    (3.2)
until it meets 0.5. Because it involves a high dimensional filtering operation in estimat-
ing C(x, g) at each g, too many samples of g will bring about heavy computational cost.
On the other hand, typical ways to find the mode are fixed-point iteration [56] or
sampling via a look-up table with interpolation [3]. The key element in either method
is the gradient of h(x, g),

∂h(x, g)/∂g |_{g=ḡ} = (1/Z(x)) ∑_{y∈Ω_x} w(x, y) ∂φ_x(g, f_y)/∂g |_{g=ḡ},   (3.3)

which is likewise the output of a filtering operation. A similar problem occurs, since the
number of filtering operations depends on the number of iterations until convergence or
on the sampling density of the look-up table.
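To make this cost concrete, the following Python sketch (a toy 1-D stand-in, not the thesis's MATLAB implementation: a box spatial weight and a Gaussian data kernel are simplifying assumptions) builds h(x, ·) on a dense grid of g and traces the cumulative distribution until it reaches 0.5, spending one window sum per sample of g:

```python
import math

def brute_force_weighted_median(f, x, radius=3, sigma=0.1, n_bins=256):
    """Weighted median at pixel x (cf. Eq. 3.2): build h(x, g) on n_bins
    samples of g, then trace the cumulative distribution until 0.5.
    The cost is O(n_bins * |Omega_x|): one window sum per sample of g."""
    ys = range(max(0, x - radius), min(len(f), x + radius + 1))
    gs = [k / (n_bins - 1) for k in range(n_bins)]
    # one "filtering operation" (sum over the window) for every sample of g
    h = [sum(math.exp(-0.5 * ((g - f[y]) / sigma) ** 2) for y in ys) for g in gs]
    Z = sum(h)
    c = 0.0
    for g, hv in zip(gs, h):
        c += hv / Z
        if c >= 0.5:
            return g
    return gs[-1]

signal = [0.1] * 8 + [0.9] * 8  # a step edge
m_left = brute_force_weighted_median(signal, 4)    # window inside the 0.1 region
m_right = brute_force_weighted_median(signal, 12)  # window inside the 0.9 region
```

Even for this toy example, 256 window sums are spent per pixel; the kernel proposed in the next section reduces this to a handful of filtering passes shared by all values of g.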
To address this issue, in the following sections we define a novel separable kernel as a
weighted combination of a series of probabilistic generative models, which decreases the
number of filtering operations required to represent the distribution, and we exploit
constant-time filters [32; 59] to reduce the complexity of each filtering operation.
3.4 Accelerating the Distribution Estimation
In this chapter, we propose a novel approach to approximate the probability distribution
by defining a new kernel based on a series of probabilistic generative models. The kernel
can be factorized explicitly, so the filtering operations can be extracted in advance,
before the distribution construction. With the proposed kernel, we introduce accelerated
versions of the weighted mode filter and the weighted median filter. We will show later
that they deliver excellent quality and efficiency in various applications.
3.4.1 Kernel Definition
CHAP. 3. WEIGHTED STRUCTURE FILTERS BASED ON PARAMETRIC STRUCTURAL DECOMPOSITION

Assume the input image is modeled by several (say, L) models over the whole pixel
domain, each governed by a distribution p(η_x | l), l ∈ L = {1, 2, . . . , L}, at each pixel x.

Figure 3.2: Illustration of the proposed kernel. (a) shows a 1D signal and two pixels x
and y. (b) represents the construction of κ(f_x, f_y), where the mean values of three
models are shown in three different colors. The kernel measures the similarity of f_x and
f_y by evaluating the sum of their joint likelihoods w.r.t. each model.

These models act as prior knowledge representing distinct local structures in the input
image. Two pixels x and y are similar if both have a high probability of agreeing with
the lth model (see Figure 3.2), which gives the kernel

κ_l(f_x, f_y) = p_x(f_x | l) p_y(f_y | l)   (3.4)
             = ∫_{η_x∈H_x} p(f_x | η_x) p(η_x | l) dη_x · ∫_{η_y∈H_y} p(f_y | η_y) p(η_y | l) dη_y,   (3.5)

where p(f_x | η_x) is the data likelihood, and H_x and H_y are the domains of η_x and η_y,
respectively.
When all L models are available, the overall kernel is defined as their weighted
combination:

κ(f_x, f_y) = ∑_{l=1}^{L} κ_l(f_x, f_y) p_{x,y}(l) = (1/L) ∑_{l=1}^{L} p_x(f_x | l) p_y(f_y | l),   (3.6)

where the prior p_{x,y}(l) is taken as uniform. By the Cauchy–Schwarz inequality,
κ(f_x, f_y) achieves its maximum value when the likelihood vectors
p(f_x | ·) = [p(f_x | l=1), . . . , p(f_x | l=L)]^⊤ and
p(f_y | ·) = [p(f_y | l=1), . . . , p(f_y | l=L)]^⊤ are linearly dependent. Hence, similar
likelihoods p(f_x | ·) and p(f_y | ·) with respect to each model indicate that f_x and f_y
are similar under the proposed kernel, as suggested in Section 3.3.2.
What’s more, we can prove that κ(fx, fy) is a valid kernel since it is the inner
product of the feature vectors p(fx|l) and p(fy|l), which act as the non-linear mapping
from f onto the feature space defined by the L models. Not only that, it is able to
reliably approximate some popular kernels like Gaussian kernel [31] or Kronecker delta
kernel [4]2.
2Please refer to the appendix for a detailed derivation
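To illustrate the inner-product view, a small Python sketch (the three 1-D Gaussian models and the test values are illustrative assumptions, not part of the thesis) evaluates the kernel as a dot product of likelihood vectors and checks symmetry and discrimination:

```python
import math

def likelihoods(f, mus, sigma=0.1):
    """Feature vector p(f | l) under L hypothetical 1-D Gaussian models."""
    return [math.exp(-0.5 * ((f - mu) / sigma) ** 2) for mu in mus]

def kappa(fx, fy, mus, sigma=0.1):
    """Proposed kernel: inner product of the two likelihood vectors (cf. Eq. 3.6)."""
    px, py = likelihoods(fx, mus, sigma), likelihoods(fy, mus, sigma)
    return sum(a * b for a, b in zip(px, py)) / len(mus)

mus = [0.0, 0.5, 1.0]                  # assumed model means
k_similar = kappa(0.45, 0.55, mus)     # both respond to the middle model
k_dissimilar = kappa(0.05, 0.95, mus)  # respond to different models
k_sym = kappa(0.55, 0.45, mus)         # symmetry check
```

Symmetry and positive semi-definiteness follow directly from the inner-product form; values agreeing with the same model score far higher than values agreeing with different models.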
3.4.2 Probability Distribution Approximation
The approximated distribution ĥ(x, g) is obtained from Equation (3.1) by replacing
φ_x(g, f_y) with the proposed kernel:

ĥ(x, g) ∝ ∑_{y∈Ω_x} w(x, y) ∑_{l=1}^{L} p_x(g | l) p_y(f_y | l) = ∑_{l=1}^{L} p_x(g | l) · ψ_x(l).   (3.7)

The filtering operation ψ_x(l) = ∑_{y∈Ω_x} w(x, y) p_y(f_y | l) is independent of g, so the
approximated distribution becomes a mixture of L densities. Instead of filtering
φ_x(g, f_y) anew for each g to obtain h(x, g), the proposed method precomputes ψ_x(l)
with merely L filtering operations in total and then evaluates ĥ(x, g) given the priors
p(g | l). The proposed kernel thus approximates the distribution by extracting the
filtering operations that are independent of g, reducing the complexity of the
distribution construction.
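The factorization in Equation (3.7) is exact: swapping the two sums lets ψ_x(l) be computed once and reused for every g. A minimal Python check (toy 1-D patch, box weights, two assumed Gaussian models; not the thesis implementation):

```python
import math

def N(v, mu, var):
    """1-D Gaussian density."""
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

f = [0.1, 0.12, 0.9, 0.88, 0.11]     # toy patch
w = [1.0] * len(f)                   # box weights w(x, y) for one pixel x
models = [(0.1, 0.01), (0.9, 0.01)]  # assumed (mu_l, var_l), L = 2

# L filtering operations, independent of g (Eq. 3.7)
psi = [sum(w[y] * N(f[y], mu, var) for y in range(len(f))) for mu, var in models]

def h_fast(g):
    """Mixture evaluation reusing the precomputed psi."""
    return sum(N(g, mu, var) * p for (mu, var), p in zip(models, psi))

def h_direct(g):
    """Naive route: one filtering pass for every queried g."""
    return sum(w[y] * sum(N(g, mu, var) * N(f[y], mu, var) for mu, var in models)
               for y in range(len(f)))
```

The two routes agree to machine precision for any g, but `h_fast` amortizes the window sums over all queries.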
The cumulative distribution is hence C(x, g) ∝ ∑_{l=1}^{L} ψ_x(l) ∫_{−∞}^{g} p(g′ | l) dg′,
and the gradient is ∂ĥ(x, g)/∂g |_{g=ḡ} ∝ ∑_{l=1}^{L} ψ_x(l) ∂p(g | l)/∂g |_{g=ḡ}. Neither
requires filtering operations beyond those for ψ_x(l), which makes it possible to
accelerate the weighted median and mode filters.
Relationship with the Constant Time Weighted Median Filter [4] (CT-median)

Let the L models be equally quantized levels μ_l, l ∈ L, of the intensity space, and set
p(η_x | l) = δ(η_x − μ_l) and p(f_x | η_x) = δ(f_x − η_x). The distribution becomes
ĥ(x, g) ∝ ∑_{l=1}^{L} δ(g − μ_l) · ∑_{y∈Ω_x} w(x, y) δ(f_y − μ_l), which is exactly the
form introduced in CT-median.
Relation with the Bilateral Weighted Mode Filter [5] (BF-mode)

As in the CT-median setup, the L models are equally quantized levels μ_l, l ∈ L, but we
set p(η_x | l) = N(η_x | μ_l, Σ_n) and p(f_x | η_x) = δ(f_x − η_x), where Σ_n is the data
variance. The estimated distribution is then ĥ(x, g) ∝ ∑_{l=1}^{L} N(g | μ_l, Σ_n) ψ_x(l),
where ψ_x(l) = ∑_{y∈Ω_x} w(x, y) N(f_y | μ_l, Σ_n). The histogram exploited in BF-mode,
however, is h_{BF-mode}(x, g) ∝ ∑_{l=1}^{L} δ(g − μ_l) ψ_x(l). The two share the same
coefficients ψ_x(l), but the proposed distribution employs the Gaussian kernel instead of
the Kronecker delta kernel used in BF-mode.

Figure 3.3: Locally adaptive models (LAM) vs. uniformly quantized models (UQM). A 1D
signal is extracted from a gray-scale image shown in the left column and marked in orange.
Both the LAM and UQM models (L = 3) are used to represent the signal, as shown in the
right column: the top row uses UQM, the bottom row LAM. The LAM models adapt to the
local structures and represent the signal better with a limited number of models (e.g., L = 3).
3.4.3 Gaussian Model for the Proposed Kernel
An essential element of the proposed kernel is to determine and estimate the mod-
els as the priors to represent the input image. In particular, we apply the Gaussian
distribution to define these models for its convenience and efficiency in various image
processing applications.
Locally Adaptive Models
A simple strategy to define the models is to equally quantize the domain of f, named
Uniformly Quantized Models (UQM). The mean of each model is a quantization level μ_l,
and the diagonal elements of Σ_l are set to the square of half the quantization interval.
For a multi-dimensional image, each channel shares the same process. Specifically,
μ_x^l = μ_l and Σ_x^l = Σ_l for all x. UQM can represent cartoon-style images and
disparity maps from fronto-parallel stereo well. However, more quantization levels are
required to represent a complex local structure with sufficient accuracy, as shown in
Figure 3.3.
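A minimal sketch of UQM construction for a normalized single-channel domain (the half-interval variance follows the text; placing each level at its bin center is an assumption of this sketch):

```python
def uqm_models(L):
    """Uniformly quantized models: equally spaced means over [0, 1],
    variance set to the square of half the quantization interval."""
    step = 1.0 / L
    half = step / 2.0
    return [((l + 0.5) * step, half ** 2) for l in range(L)]

models = uqm_models(4)  # means 0.125, 0.375, 0.625, 0.875; shared variance
```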
Locally adaptive models (LAM) are a superior choice, since they describe the local
structures with fewer models. The underlying idea is to assume a Gaussian mixture model
in any local patch, where each model acts as a local mean estimator. Therefore, the
number of models only needs to slightly exceed the number of modes in the local
distribution. For example, the natural image in Figure 3.3 is well represented by the
LAM models, whereas the UQM models cannot fit the local distribution when their number
is insufficient.
The popular EM algorithm [64] is not used to train the LAM models, due to its high
complexity and its instability in ensuring a good estimation. In this chapter, we adopt a
more efficient alternative. Similarly to [31], we use a hierarchical segmentation approach
to iteratively separate pixels of distinct structures, which act as local clusters, into
different models, denoted by the segments S_l, l ∈ L. This method involves only simple
low-pass filtering and fast PCA operations, and is thus efficient to implement [31]. The
mean and variance of each pixel x for the lth model are

μ_x^l = (1/W_x^l) ∑_{y∈Ω_x} θ_y^l f_y,   (3.8)
Σ_x^l = (1/W_x^l) ∑_{y∈Ω_x} θ_y^l f_y f_y^⊤ − μ_x^l (μ_x^l)^⊤,   (3.9)

where θ_y^l = 1[y∈S_l] is the mask indicating pixels inside S_l, and 1[·] is the indicator
function that equals 1 when its argument is true. The neighborhood Ω_x is the same local
window as in Equation (3.7), and W_x^l = ∑_{y∈Ω_x} θ_y^l is the normalization factor.
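A toy sketch of Equations (3.8)–(3.9) in the 1-D, single-channel case (the segmentation labels below are hand-made stand-ins for the hierarchical segmentation of [31]):

```python
def lam_stats(f, seg, l, x, radius):
    """Masked mean / variance of model l at pixel x (Eqs. 3.8-3.9, 1-D case)."""
    ys = [y for y in range(max(0, x - radius), min(len(f), x + radius + 1))
          if seg[y] == l]
    W = len(ys)                        # W_x^l: sum of the 0/1 mask theta_y^l
    if W == 0:
        return None, None
    mu = sum(f[y] for y in ys) / W
    var = sum(f[y] ** 2 for y in ys) / W - mu ** 2
    return mu, var

f   = [0.10, 0.12, 0.11, 0.90, 0.88]
seg = [0, 0, 0, 1, 1]                  # toy segment labels S_l
mu0, var0 = lam_stats(f, seg, 0, x=2, radius=2)  # model 0 around pixel 2
mu1, var1 = lam_stats(f, seg, 1, x=2, radius=2)  # model 1 around pixel 2
```

Each model's mean tracks its own cluster (here ≈0.11 and ≈0.89), so two adaptive models suffice where a UQM quantization of the same accuracy would need many levels.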
Kernel Specification
The prior probability of the lth model is p(η_x | l) = N(η_x | μ_x^l, Σ_x^l). Assume the
data likelihood p(f_x | η_x) = N(f_x | η_x, Σ_n), where Σ_n = σ_n² I_d denotes the noise
variance, I_d is the identity matrix, and d is the number of channels of the input image.
The kernel κ(f_x, f_y) is accordingly

κ(f_x, f_y) = (1/L) ∑_{l=1}^{L} N(f_x | μ_x^l, Σ_n + Σ_x^l) N(f_y | μ_y^l, Σ_n + Σ_y^l).   (3.10)
Distribution Approximation
The approximated probability distribution at each pixel x is

ĥ(x, g) = (1/Z(x)) ∑_{l=1}^{L} N(g | μ_x^l, Σ_n + Σ_x^l) ψ_x(l),   (3.11)

where ψ_x(l) = ∑_{y∈Ω_x} w(x, y) N(f_y | μ_y^l, Σ_n + Σ_y^l) and Z(x) = ∑_{l=1}^{L} ψ_x(l).
The coefficients ψ_x(l), l ∈ L, are estimated by filtering N(f_y | μ_y^l, Σ_n + Σ_y^l)
according to the properties of w(x, y). This weight defines a joint filtering guided by
the guidance image. In this chapter, we choose two kinds of filters: the guided filter
(GF) [59] and the domain-transform filter (DF) [32]. Both have O(1) complexity and
approximate the bilateral weight. GF is better at transferring local structures from the
guidance feature map to the target image, while DF handles higher-dimensional images
naturally. Different applications exploit different weights. We denote the parameters of
the filtering operation by ω: ω = {r, ε} for GF, where r is the spatial radius and ε the
fitting variance; ω = {σ_s, σ_r} for DF, where σ_s is the spatial standard deviation and
σ_r the range standard deviation.

The overall algorithm for the accelerated distribution approximation based on the
locally adaptive models is summarized in Algorithm 1.
Algorithm 1: Distribution Approximation Acceleration for the Locally Adaptive Models

Input: input image F_i, guidance image F_g, parameter set {L_th, r, σ_n, ω}
Output: approximated distribution ĥ(x, g)

// 1. Model generation
1: {S_l | l ∈ L} ← hierarchical segmentation [31] of F_i given L_th and r, σ_n
2: for l ← 1 to L do
3:   θ_y^l = 1[y∈S_l], W_x^l = ∑_{y∈Ω_x} θ_y^l
4:   μ_x^l ← (1/W_x^l) ∑_{y∈Ω_x} θ_y^l f_y,  Σ_x^l ← (1/W_x^l) ∑_{y∈Ω_x} θ_y^l f_y f_y^⊤ − μ_x^l (μ_x^l)^⊤
5:   M_l ← {μ_x^l, Σ_x^l | ∀x}  // model parameters
// 2. Distribution approximation
6: ψ_x(l) ← ∑_{y∈Ω_x} w(x, y) N(f_y | μ_y^l, Σ_n + Σ_y^l),  ψ_x(l) ← ψ_x(l) / ∑_{l=1}^{L} ψ_x(l)
7: ĥ(x, g) ← ∑_{l=1}^{L} N(g | μ_x^l, σ_n² I_d + Σ_x^l) ψ_x(l)
Figure 3.4: h(x, g) and the proposed approximation ĥ(x, g) for patches C and D (from the
image shown in Figure 3.1) under different conditions. The window size is
|N(x)| = 11 × 11 and only the spatial weights are exploited. (a) h(x, g) estimated by the
smoothed local histogram [3] under different data variances, σ_n = 10⁻¹, 10⁻², 10⁻³.
(b) ĥ(x, g) estimated by the proposed kernel under the same data variances, with L = 31.
(c) ĥ(x, g) under different numbers of models L ∈ {7, 15, 31, 63}, with the data variance
fixed at σ_n = 10⁻². The y-axis is rescaled to show the subtle differences between curves.
Parameters
The proposed kernel needs two parameters, σ_n and L. A larger σ_n suggests that fewer
LAM models are necessary, so as to reduce the overlap between different models, while a
smaller σ_n requires more models to cover all the available local structures. We therefore
adopt an automatic criterion [31] that stops generating LAM models once a high
percentage of pixels is close to at least one model; the closeness criterion is
‖f_x − μ_x^l‖_{Σ_n} ≤ 1. Together with a user-given threshold L_th, L is determined when
either the criterion or L_th is reached.
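The stopping test can be sketched as follows (toy 1-D values; with Σ_n = σ_n² I the Mahalanobis condition ‖f_x − μ_x^l‖_{Σ_n} ≤ 1 reduces to |f_x − μ| ≤ σ_n):

```python
def coverage(f, model_means, sigma_n):
    """Fraction of pixels within one noise standard deviation of at least
    one model mean, i.e. satisfying ||f_x - mu||_{Sigma_n} <= 1."""
    ok = sum(1 for v in f
             if any(abs(v - mu) <= sigma_n for mu in model_means))
    return ok / len(f)

f = [0.10, 0.11, 0.50, 0.90, 0.91]
c = coverage(f, [0.1, 0.9], sigma_n=0.05)  # the pixel at 0.5 is uncovered
# model generation would stop once c exceeds a high percentage, or L hits L_th
```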
Figure 3.4(a) shows h(x, g) obtained by the smoothed local histogram [3] under the same
window size but different values of σ_n; only spatial weights were adopted in w(x, y).
The larger σ_n is, the smoother the distribution. However, h(x, g) with a small
σ_n = 10⁻³ reports not only the structures but also the subtle texture variations,
whereas with σ_n = 10⁻¹ even modes that once referred to different structures are merged
into a common one. In contrast, since the proposed LAM models fit the local distribution
by the estimated local structures, the proposed approximation ĥ(x, g) systematically does
not record the textures sensitively: the ĥ(x, g) curves have similar shapes for
σ_n = 10⁻³ and 10⁻², although ĥ(x, g) with a large σ_n = 10⁻¹ also tends to merge
nearby modes, like h(x, g), as shown in Figure 3.4(b).

On the other hand, we estimated ĥ(x, g) under different numbers of models with a fixed
data variance σ_n² = 10⁻⁴, as shown in Figure 3.4(c). With a small L, ĥ(x, g) tries to
capture the main structure in the local window as much as possible but fails to extract
the detailed structures. With a large L, the added models describe the detailed
structures, and the distribution becomes more similar to h(x, g) under the same
configuration.

In summary, the proposed kernel prefers to describe the local structures rather than all
the information the local patch conveys. The parameters σ_n and L are complementary:
the more models there are, the more similar ĥ(x, g) is to h(x, g), but a large σ_n
discourages a large L, because too much overlap between different models destroys their
identifiability. Therefore, by incorporating the automatic stopping criterion and a manual
threshold L_th into the LAM model generation, the resulting distribution ĥ(x, g) is both
efficient and effective.
3.5 Accelerated Weighted Filters
In this section, we propose accelerated versions of the weighted median and mode filters
based on the kernel discussed previously. We will show later that they deliver excellent
quality and efficiency in various applications.
3.5.1 Weighted Average Filter
The weighted average filter estimates the mean of ĥ(x, g) at each pixel. The solution is
straightforward:

g_x^avg = E[ĥ(x, g)] = (1/Z(x)) ∑_{l=1}^{L} ψ_x(l) μ_x^l,   (3.12)

according to the properties of Gaussian mixture models [64].
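In code, Equation (3.12) is a one-liner over the mixture parameters (toy scalar ψ and means assumed for illustration):

```python
def weighted_average(psi, means):
    """Mean of the Gaussian mixture h(x, .): Z-normalized combination of
    the per-model means (Eq. 3.12, scalar case)."""
    Z = sum(psi)
    return sum(p * mu for p, mu in zip(psi, means)) / Z

g_avg = weighted_average([3.0, 1.0], [0.1, 0.9])  # pulled toward the heavier model
```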
This filter is closely related to the adaptive-manifolds filter (AM-average) [31], a fast
approximation of the bilateral filter [26]. AM-average computes the filter response at a
reduced set of sampling points and interpolates them to obtain the output image [31];
only a small number of low-cost filtering operations (equal to the number of sampling
points) is required, a similar idea to the proposed distribution approximation. However,
AM-average mimics the exponential range kernel in the weight by Gauss–Hermite
quadrature [31] given the sampling points. In contrast, our method can incorporate
various kernels (not only the bilateral one) as the weight and instead approximates the
data kernel φ_x(·, ·). The filter response of our method is a weighted combination of the
local structures; it therefore preserves local structures and behaves more like a robust
filter that suppresses outliers.
3.5.2 Weighted Median Filter
The weighted median filter finds the median of the given probability distribution. Since
the resulting distribution is a mixture of Gaussians, we propose an accelerated method
that evaluates the cumulative probability C(x, μ_x^l) only at the mean μ_x^l of each
model. The median is approximated by interpolating between the two adjacent cumulative
probabilities C(x, μ_x^k) ≤ 0.5 and C(x, μ_x^{k+1}) ≥ 0.5:

g_x^med ≈ [(0.5 − C(x, μ_x^k)) / (C(x, μ_x^{k+1}) − C(x, μ_x^k))] (μ_x^{k+1} − μ_x^k) + μ_x^k.   (3.13)

In practice, we find this method simple and effective. Note, however, that the median
should be tracked per channel for the UQM models.
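A scalar sketch of this interpolation (two assumed well-separated Gaussian models; the Gaussian CDF is evaluated with the error function):

```python
import math

def norm_cdf(g, mu, var):
    """CDF of a 1-D Gaussian via the error function."""
    return 0.5 * (1.0 + math.erf((g - mu) / math.sqrt(2.0 * var)))

def mixture_median(psi, models):
    """Approximate the median of h(x, .) = sum_l psi_l N(. | mu_l, var_l)
    by evaluating the mixture CDF only at the model means and linearly
    interpolating between the bracketing values (cf. Eq. 3.13)."""
    Z = sum(psi)
    order = sorted(range(len(models)), key=lambda l: models[l][0])
    mus = [models[l][0] for l in order]
    C = [sum(p * norm_cdf(m, mu, var) for p, (mu, var) in zip(psi, models)) / Z
         for m in mus]
    for k in range(len(C) - 1):
        if C[k] <= 0.5 <= C[k + 1]:
            t = (0.5 - C[k]) / (C[k + 1] - C[k])
            return mus[k] + t * (mus[k + 1] - mus[k])
    return mus[0] if C[0] > 0.5 else mus[-1]

models = [(0.1, 0.01), (0.9, 0.01)]      # assumed per-pixel models (mu_l, var_l)
m = mixture_median([3.0, 1.0], models)   # heavier mode pulls the median left
```

Only L CDF evaluations are needed per pixel, rather than one per intensity bin as in histogram-based median tracing.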
3.5.3 Weighted Mode Filter
The weighted mode filter finds the global mode of ĥ(x, g). A simple fixed-point iteration
suffices for the proposed Gaussian models. Setting the gradient ∂ĥ(x, g)/∂g = 0 yields
the fixed-point iteration

g_x^{n+1} = ( ∑_{l=1}^{L} B_x^l(g_x^n) (Σ_n + Σ_x^l)⁻¹ )⁻¹ ( ∑_{l=1}^{L} B_x^l(g_x^n) (Σ_n + Σ_x^l)⁻¹ μ_x^l ),   (3.14)

where B_x^l(g_x^n) = N(g_x^n | μ_x^l, Σ_n + Σ_x^l) ψ_x(l). Equation (3.14) converges to
the closest mode, so a good initialization g_x^0 is necessary to avoid being trapped in a
wrong local mode. In practice, setting g_x^0 = μ_x^{m*}, where
m* = argmax_m ∑_{l=1}^{L} B_x^l(μ_x^m), is both effective and reasonable.
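The iteration and its initialization reduce, in the scalar case, to a precision-weighted average; a minimal Python sketch (toy models assumed, not the thesis implementation):

```python
import math

def N(v, mu, var):
    """1-D Gaussian density."""
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def mixture_mode(psi, models, iters=10):
    """Fixed-point mode seeking (Eq. 3.14, scalar case): start from the mean
    of the strongest model, then iterate a precision-weighted average."""
    def B(g, l):
        mu, var = models[l]
        return N(g, mu, var) * psi[l]
    # g0 = mu_{m*}, with m* = argmax_m sum_l B(mu_m, l)
    g = max((mu for mu, _ in models),
            key=lambda m: sum(B(m, l) for l in range(len(models))))
    for _ in range(iters):
        num = sum(B(g, l) * models[l][0] / models[l][1] for l in range(len(models)))
        den = sum(B(g, l) / models[l][1] for l in range(len(models)))
        g = num / den
    return g

models = [(0.1, 0.01), (0.9, 0.01)]     # assumed per-pixel models (mu_l, var_l)
g_mode = mixture_mode([3.0, 1.0], models)  # converges to the stronger mode
```

Starting from the strongest model's mean makes the iteration land on the global rather than a spurious local mode.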
3.6 Experimental Results and Discussions
3.6.1 Implementation Notes
We implemented the proposed weighted mode filter and weighted median filter in MATLAB.
The reported results were measured on a 3.4 GHz Intel Core i7 processor with 16 GB RAM.
Parameter Definition
All input and guidance images were normalized to [0, 1] for convenience of parameter
definition. The data variance is Σ_n = σ_n² I_d, where σ_n is the noise standard
deviation, I_d is the identity matrix, and d is the dimension of the input image. The
guided filter (GF) and the domain-transform filter (DF) share the same parameter setting,
i.e., r = σ_s and ε = σ_r² (ω = {r, ε} for GF, ω = {σ_s, σ_r} for DF); r and σ_s are
measured in pixels. For fair comparison, the number of iterations in the weighted mode
filter was set to 10 in all experiments.
Number of Models
An automatic criterion [31] stops generating the LAM models when a high percentage of
pixels is close to at least one model; the closeness criterion is ‖f_x − μ_x^l‖_{Σ_n} ≤ 1.
Together with a user-given threshold L_th, the LAM model generation stops when either
the criterion or L_th is reached. The UQM models share the same threshold L_th, with no
automatic stopping criterion applied.

Figure 3.5: Execution-time comparison for the distribution construction w.r.t. the number
of models, showing the runtime ratios of LAM (ours), UQM (ours), and brute force. The
input is an 8-bit single-channel image and the guidance is a 3-channel image. The
reference method is brute force, traversing 256 discretized bins.
Compared Methods
We compared our proposed filters with two popular filters: the constant time weighted
median filter (CT-median) [4] and the bilateral weighted mode filter (BF-mode) [5].
The parameters of CT-median were given by the authors [4] and those of BF-mode
were optimized by exhaustive search. The number of bins in the reference methods was
fixed to 256 per-channel [3–5].
3.6.2 Performance Evaluation
Runtime Comparison
Figure 3.5 shows the execution-time comparison between our method and the brute-force
constant-time algorithm (cf. Equation (3.1)) with GF weights for constructing the
distribution. Both LAM and UQM models were evaluated, with related parameters fairly
configured. The y-axis is the ratio of the runtime of the proposed method to that of the
reference method, which uses 256 discretized bins; L was set manually without the
automatic stopping criterion. Both proposed variants take only a fraction of the reference
runtime, nearly proportional to the number of models. LAM spends slightly more time
because of the additional filtering operations in the model-generation step. Note that
even when L is around 50, the execution time of the proposed methods is only half that
of the reference.
The Number of Necessary LAM models
In fact, natural images, whether color images or disparity/depth maps, are locally
smooth, so there is little need to generate very many LAM models (e.g., more than 60)
to fit the local distribution. To validate this observation, we estimated the LAM models
for all color images in the published BSDS300 dataset [65] with the threshold L_th = 64
and examined the distribution of the necessary number of models. The automatic stopping
criterion was triggered when no less than 99.9% of pixels fulfilled the constraint in
Section 3.6.1.

Results are illustrated in Figure 3.6, where the left plot was obtained with a window
size of 21 × 21 (i.e., r = 10) and the right with 11 × 11 (i.e., r = 5);
Σ_n = 0.01 × I_3 in both cases. The majority of images required at most 50 models to
meet the criterion. Moreover, the smaller the window size, the fewer models are
necessary, which verifies the discussion in Section 3.4.1. Based on these results, we
conclude that, in general, the number of LAM models required for a natural image rarely
exceeds a certain value under a given window size. As a typical case, for window sizes
of 21 × 21 or smaller, we can safely set the threshold to L_th = 64, and the runtime of
the probability distribution construction is always less than half that of the
brute-force implementation, as shown in Figure 3.5.

In conclusion, the proposed method is generally 2–3× faster than the brute-force one for
gray-scale images, increasing to 6–9× for color images as the number of channels grows.
For disparity/depth maps and cartoon images, the number of necessary models can be
reduced even further because of their high structural homogeneity.
Figure 3.6: The distribution of the number of necessary locally adaptive models on the
BSDS300 dataset. Left: window size 21 × 21. Right: window size 11 × 11. The smaller the
window size, the fewer locally adaptive models are necessary.
Figure 3.7: Depth map enhancement on tsukuba. The first row shows, from left to right,
the raw input disparity map, the ground truth, and the results of CT-median [4]
(Err. 2.76) and BF-mode [5] (Err. 2.37). The second and third rows show disparity maps
obtained by the proposed weighted median filter (Err. 4.96, 3.96, 3.34) and weighted mode
filter (Err. 2.62, 2.36, 2.41) with L = 7, 15, and 31 models, respectively; the models
were generated by the LAM scheme. The error is the bad-pixel ratio with threshold 1. GF
weights were chosen and related parameters were fairly configured.
Figure 3.8: Results of the weighted mode filter with 7 models, comparing the LAM and
UQM models (L = 7) on two test disparity maps.
3.6.3 Applications
Depth Map Enhancement
Depth maps of low resolution and poor quality (e.g., with structural outliers, depth
holes, and noise) can be enhanced under the guidance of registered high-resolution
texture images [4; 5]. This is a popular and practical post-processing step for acquiring
visually plausible and highly accurate depth maps from various depth acquisition
techniques, such as stereo, ToF cameras, or Kinect. Two state-of-the-art approaches that
exploit the statistical information of the depth map are BF-mode [5] and CT-median [4].
Both our weighted mode filter and our weighted median filter achieve similar performance
at a much lower cost.
Figure 3.7 shows the results on the tsukuba disparity map. The raw input was generated
by simple box-filter aggregation [66] followed by a left-right check and hole filling.
LAM models were adopted for all these results, with the number of models fixed. A small
L (e.g., L = 7) prevents the LAM from defining enough models to cover all the local
structures, so the results tend to be slightly blurred or to contain incorrect values
compared with the reference methods. Fortunately, with a few more models the results
become stable and similar to the references. For instance, BF-mode in our implementation
required 15.09 s to process the tsukuba image, whereas the proposed weighted mode filter
with 31 LAM models cost only 5.23 s. Moreover, the bad-pixel ratio of the proposed method
(2.41) is similar to that of BF-mode (2.37), while its PSNR is higher (25.28 dB vs.
25.09 dB).

Although a small L of the LAM models cannot cover all the details of the input image,
it still outperforms the UQM models with the same L. As shown in Figure 3.8, with
L = 7 the LAM models captured more details of the two test disparity maps and produced
smoother outputs than the UQM models. The staircase artifact of the UQM models also
occurs in BF-mode and CT-median, since both are based on a discretized weighted
histogram: when the number of bins is insufficient, quantization artifacts appear around
smooth and slanted surfaces.
JPEG Artifact Removal
JPEG compression is a lossy compression scheme that usually introduces quantization
noise and block artifacts. CT-median has proven effective at eliminating these
compression artifacts in clip-art cartoon images [4]. However, since CT-median encourages
piecewise-constant intensities/colors, its drawback is apparent when processing natural
images.
As shown in Figure 3.9(b) and its zoomed-in patch, CT-median forces the eyes image into
several distinct layers, with pixels inside each layer nearly constant. In contrast,
exploiting the LAM models, our method produces a piecewise-smooth result, as shown in
Figure 3.9(c): not only is the compression artifact removed, but the structure of the
input image is preserved. The UQM models, unfortunately, perform slightly worse than
LAM, which is expected, since they also try to recover piecewise-constant colors. In
terms of runtime, both the LAM and UQM models spend only a small fraction of the
88.134 s that CT-median needs to obtain Figure 3.9(b): the LAM models, with L = 15,
Σ_n = 0.07² × I_3, and |N(x)| = 11 × 11, cost 16.74 s in total, while the UQM models,
also with L = 15, were slightly faster at 15.54 s.

Figure 3.9: JPEG compression artifact removal by the weighted median filter. (a) The
input degraded eyes image. (b) CT-median [4]. (c) The proposed weighted median filter
with the LAM models and (d) with the UQM models. The second row shows the corresponding
zoomed-in patches. DF weights were chosen and all related parameters were fairly
configured. Best viewed in the electronic version.
More Applications
We show two additional applications to indicate the potential of the proposed weighted
median and weighted mode filters. Figure 3.10 shows detail enhancement of a natural rock
image by the proposed weighted median filter under the LAM models; the result is
visually plausible without apparent artifacts. Figure 3.11 presents joint upsampling of a
low-resolution, noisy disparity map under the guidance of a registered high-resolution
image. Both proposed filters generate satisfactory results, but the result of the
weighted median filter tends to be smoother and introduces slight blurring, while that
of the weighted mode filter is sharper and contains a slight staircase artifact.
44CHAP. 3. WEIGHTED STRUCTURE FILTERS BASED ON PARAMETRIC STRUCTURAL
DECOMPOSITION
Figure 3.10: Detail enhancement by the proposed weighted median filter under the LAM
models. From left to right: the original rock image, after edge-preserving smoothing,
and the detail-enhanced image. GF weights were chosen.
Figure 3.11: Joint depth map upsampling (panels: ground truth, ours-median, ours-mode).
The input disparity map was 8× upsampled by the proposed weighted median filter and
weighted mode filter under the LAM models. The raw input disparity map is shown in the
top-left corner of the leftmost image. GF weights were chosen.
3.7 Summary
In this chapter, we proposed a novel distribution construction method for accelerating
the weighted median/mode filters by defining a new separable kernel based on
probabilistic generative models. Unlike traditional methods, which need a large number of
filtering operations to estimate a sufficiently accurate distribution, the proposed
approach requires only a small, finite number of filtering operations determined by the
structure of the input image. The accelerated weighted median and mode filters were then
introduced and applied to various tasks, including depth map enhancement, joint depth
upsampling, outlier removal, and detail enhancement.

As future work, the extension to video processing is interesting and meaningful. A more
robust and efficient way to estimate the locally adaptive models would be of great
benefit. Moreover, improving the efficiency of the median tracking and the mode seeking
could further accelerate the proposed filters.
Chapter 4
Temporal Enhancement based on Static
Structure
4.1 Introduction
In this chapter, we present a novel method to enhance a depth video both spatially and
temporally by addressing two aspects of the problem: 1) efficiently and effectively
enforcing temporal consistency where it is necessary, and 2) enabling online processing.
A common observation is that regions in one frame with different motion patterns (e.g.,
static, slowly moving, fast moving) belong to different objects or structures and require
different levels of temporal consistency. For instance, a static region needs a long-range
temporal enhancement to ensure that it stays static over a long duration, while dynamic
regions with slow or rapid motions expect short-term or no temporal consistency. However,
it is difficult to accurately enhance arbitrary and complex dynamic contents in the
temporal domain without apparent motion blur or depth distortion. Thus we propose an
intuitive compromise: we cancel the temporal enhancement in the dynamic region as long as
its spatial enhancement is sufficiently satisfactory, so that the necessary depth
variation is not distorted, while temporal artifacts there are less easily perceived than
those in the static region. Therefore, we aim at strengthening long-range temporal
consistency around the static region whilst maintaining necessary depth variation in the
dynamic content. To accurately separate the static and dynamic regions, we track online
and incrementally refine a probabilistic model called the static structure, which acts as
a medium to indicate the region that is static in the current frame. By fusing the static
region of the current frame into the static structure online with an efficient
variational fusion scheme, this structure implicitly gathers all the temporal data at and
before the current frame that belong to it. Substituting the static region by the
updated static structure thus makes it temporally consistent and
stable over a long range accordingly. Moreover, the method is suitable for online
processing of streaming depth videos (3D teleconferencing, 3DTV, etc.) without the need
to store large numbers of adjacent frames, and is thus memory- and computationally
efficient.
Overall, the temporally consistent depth video enhancement is performed in two
layers: 1) the static region of the input frame, which reveals the static structure, is
enhanced spatially and temporally by an online fusion technique that combines it with the
static structure, and 2) the dynamic content is enhanced spatially without temporal
smoothing. In addition to the aforementioned advantages, enhancing the static and
dynamic regions separately also effectively eliminates artifacts that frequently occur
in conventional depth video enhancement, such as blurring or unreliable depth
propagation across the boundaries between dynamic objects and static
objects/background. Furthermore, when the depth video contains severe holes, the
static structure can fill static holes convincingly and leave the remaining holes to be
filled by the dynamic content, so as to avoid inpainting artifacts. Since fully dynamic
depth videos usually have weak temporal consistency, our proposed algorithm degrades
to a spatial enhancement approach in that case, and does not force the enhanced depth
video to bear unnecessary temporal smoothness.
The rest of the chapter is organized as follows. Section 4.2 reviews existing work
on spatial and temporal depth video enhancement, as well as approaches to static scene
reconstruction, which is closely related to our formulation of the static structure. Sec-
tion 4.3 describes our proposed framework for online estimation of the static structure
and the approach to temporally consistent depth video enhancement. Experimental
results and discussions of our method can be found in Section 4.4. Discussions
of its limitations and applications are presented in Section 4.5. Concluding remarks
and a discussion of future work are given in Section 4.6.
4.2 Related Work
Spatial enhancement On the aspect of global optimization, the pioneering work was
done by Diebel et al. [11], who utilized a pixel-wise MRF model with the guidance of tex-
ture to denoise the depth map. Several augmented models were also proposed to handle
inpainting and super-resolution [12–16], with special choices of the data and smoothness
terms as well as additional regularization terms [16–24], enabling reasonable perfor-
mance even without texture information [16]. However, the high computational cost of these
methods hinders real-time applications. Another choice is high-dimensional filtering.
One variant is high-dimensional average filtering [1; 25; 27; 28; 30], whose weights are
defined by spatial nearness and feature proximity. The features can be texture/depth
intensities, patches [27; 31], or other user-defined quantities. The main problems here are
edge blurring and texture copying. Another variant uses the median of the depth
candidate histogram instead [4; 33], producing more robust results but suffering
from quantization error and slower speed. Weighted mode filtering [5; 34] instead
looks for the histogram's global mode, and exhibits similar artifacts. In addition, spatial
enhancement, especially super-resolution and inpainting, can be performed by patch
matching throughout the depth map, which achieves satisfactory visual results [35; 36]
but with high computational complexity.
Temporal enhancement Existing temporal enhancement approaches usually em-
ploy the guidance of temporal texture consistency, especially by fusing the previous
depth frame onto the current one according to the motion vectors estimated between
the corresponding adjacent color frames [1; 5]. However, neglecting the motion
component along the z-axis reduces the warping accuracy. 3D motion estimation is typ-
ically adopted to solve this problem [67–69]. Following these works, the temporal fusion
between the current and warped previous frames is usually based on weighted average
or weighted median filters, or on energy minimization [1; 5; 70; 71]. Therefore
the performance, on one hand, relies heavily on the accuracy of motion estimation,
which is difficult to guarantee. On the other hand, the temporal continuity is only
preserved among a few adjacent frames, which does not meet the demand of enforcing
long-range temporal consistency. To address this issue, Lang et al. [6] proposed to
filter, offline, the paths formed by all the pixels that correspond to the
motion of one scene point over time. It provides a practical and remarkable solution
to enhance a depth video with long-range temporal consistency both effectively and
efficiently. Our work is related to, but has essential differences from, the layer denoising
and completion proposed by Shen et al. [72], which trained background layer
models offline beforehand to label the foreground and background of the input depth frame,
without strengthening any temporal consistency. Conversely, our method estimates the
static structure in an online fashion, and there is no need for a series of depth frames
capturing a purely static scene. Moreover, the temporal consistency is maintained where
it is required. In addition, [72] only takes spatial enhancement into consideration.
Static scene reconstruction The static structure estimation is related to static
scene reconstruction by fusing a series of depth maps. A majority of these works are
offline methods [73–77] that fuse a set of depth maps into a single geomet-
ric structure, while the rest are online approaches that receive depth measurements
sequentially and incrementally estimate the current geometric structure. Offline meth-
ods always process a batch of depth frames together, so the complexity becomes
unbearable when the number of frames is large. One of the offline approaches, by Zit-
nick et al. [77], employed the consistency of both multi-view color and disparity,
which is analogous to our constraint of temporal consistency, to regularize the disparity
space distribution and thereby produce a refined disparity map. Most online methods
quantize the 3D space into grids [78–81] to reduce the memory and computational cost,
and are thus generally deficient in sub-grid accuracy; one family of approaches additionally
exploits a weighted sum of truncated signed distance functions (TSDF) [79; 80] over depth
measurements. However, this is sensitive to outliers and thus not robust for estimating a
static scene containing dynamic objects and heavy outliers. To robustly estimate a
static scene captured by noisy and cluttered data, researchers have proposed a
variety of measurement models with parameters describing the nature of the noise and
outliers. Several methods [78; 82] need parameters learned from ground-truth data
or tuned empirically. One successful model that requires fewer manually tuned
parameters is the generative model, which is able to derive the
noise and clutter characteristics from the input data. Vogiatzis et al. [83] proposed a
generative Gaussian-plus-uniform model that simultaneously infers the depth and out-
lier ratio per pixel using an efficient online variational scheme, which matches the clutter
characteristics of depth maps generated by stereo. Our static structure estimation is
similar in that it is an online generative model considering both noise and outliers, with a
special treatment of dynamic scenes.
Figure 4.1: Illustration of the static structure in comparison with the input depth frame. (a) The input depth frame (blue curve) lies on the captured scene; (b) the static structure (black curve). The depth sensor is above the captured scene. The static structure includes the static objects as well as the static background.
4.3 Approach
The static structure can be regarded as an intrinsic depth structure (and texture struc-
ture when the registered color video is available) underneath the captured scene¹, which
always lies on or behind the surface of the input depth frame. As shown in Figure 4.1,
any moving or foreground object stays in front of the static structure, whereas static
objects or the visible static background usually lie on it, i.e., the depth value of the
static structure at one pixel is always deeper than that of a dynamic object at the same
place. However, it differs from the "background" of a scene, because we focus more
on the "static" geometric structure than on the distance from the camera. Since
the temporal consistency around static or slowly moving regions is required to be
enforced, the notion of "static" is more useful than that of "background".
To handle artifacts like noise, outliers and holes, as well as complex dynamic con-
tents in the input depth frame, we propose a probabilistic generative mixture model
to describe the static structure together with the characteristics of noise and outliers (Sec-
tion 4.3.1). We also define an efficient layer assignment leveraging dense conditional
random fields to accurately label the input depth frame into dynamic and static regions
¹Within the scope of this chapter, we assume the target depth video is captured by a static depth sensor, hence the captured scene is static except for the dynamic objects. The enhancement of depth videos captured by moving cameras is a more general topic, which we leave to future work.
Figure 4.2: Flowchart of the overall framework of the proposed method for the estimation of the static structure and depth video enhancement. Please refer to the text for the detailed description.
(Section 4.3.4). For the sake of memory and computational efficiency, as well as the ability
to process streaming data, the static structure is updated online (Section 4.3.5) via
a variational approximation (Section 4.3.2) governed by a first-order Markov chain,
which effectively fuses the labeled static region of the current depth frame with the
previously estimated structure. It is further refined spatially to fill holes and regularize
the structure (Section 4.3.5). The updated static structure in turn substitutes the static
region of the input depth frame, resulting in a temporally consistent depth video en-
hancement (Section 4.3.6). The framework of the online static structure update scheme
and temporally consistent depth video enhancement is shown in the flowchart in
Figure 4.2.
Notation The data sequence is denoted as $\mathcal{S}$ and formed by a depth video $\mathcal{D} = \{D^t \mid t = 1, 2, \ldots, T\}$ as $\mathcal{S} = \mathcal{D}$, or by a pair of aligned depth plus color videos as $\mathcal{S} = \{\mathcal{D}, \mathcal{I}\}$, where $\mathcal{I} = \{I^t \mid t = 1, 2, \ldots, T\}$. The data in each frame is $S^t = D^t$ or $\{D^t, I^t\}$. The pixel location is denoted $x$, its depth value at time $t$ is $d_x^t$, and its corresponding color is $I_x^t$. The parameter set for the probabilistic model at each frame $t$ is denoted as $\mathcal{P}^{S,t}$, and $\mathcal{P}^{S,t}_x$ is defined for each pixel $x$, whose elements are defined in detail in the following sections.

Figure 4.3: Illustration of the three states of input depth measurements with respect to the static structure along one line of sight. The current static structure is the blue stick in the middle; decision boundaries are marked as blue dotted lines. The depth measurement $d$ is categorized into state-I when it lies around the static structure, state-F when it is in front of this structure, and state-B when it is far behind it.
4.3.1 A Probabilistic Generative Mixture Model
At the very beginning, we only consider the case where $\mathcal{S} = \mathcal{D}$. Denote the sequentially incoming depth samples of pixel $x$ at and before time $t$ as the set $\mathcal{D}_x^t = \{d_x^\tau \mid \tau = 1, 2, \ldots, t\}$. The depth value of the static structure at pixel $x$ is $Z_x$, whose noise is conveniently governed by a Gaussian distribution. We also propose two individual outlier distributions to describe the outliers in front of and behind the static structure, respectively. Hence, they not only describe the depth distribution but also provide evidence indicating the state to which the current depth sample belongs.
State Description
The three states Ψ = {I, F,B} are illustrated in Figure 4.3 and listed as follows.
State-I: Fitting the static structure If $d_x^t$ belongs to the static structure, we assume that it follows a Gaussian distribution centered at $Z_x$, $\mathcal{N}(d_x^t \mid Z_x, \xi_x^2)$, where $\xi_x$ denotes the noise standard deviation, predefined based on the systematic error of the depth sensor. For instance, the noise variance of Kinect is related to the depth, so it is appropriate to set $\xi_x$ depth-dependently.
State-F: Forward outliers On the other hand, depth measurements from moving objects or outliers in front of the static structure follow a clutter distribution $U_f(d_x^t \mid Z_x) = U_f \cdot \mathbf{1}[d_x^t < Z_x]$, where $\mathbf{1}[\cdot]$ is an indicator function that equals 1 when its argument is true, and 0 otherwise. This state is activated when $d_x^t$ is smaller than $Z_x$, and switched off when it is larger than $Z_x$. From this state we can infer not only outliers in front of the structure, but also dynamic objects at the given location.
State-B: Backward outliers Furthermore, it is possible that the input depth measurements are outliers lying behind the current estimate of the static structure. Another similar indicator distribution is introduced as $U_b(d_x^t \mid Z_x) = U_b \cdot \mathbf{1}[d_x^t > Z_x]$. It naturally represents outliers that have larger depth values than the given structure. Meanwhile, it provides a cue to infer whether the current static structure estimate is incorrect.
An additional hidden variable $\mathbf{m}_x = [m_x^I, m_x^F, m_x^B]^\top$ is introduced as the state indicator to represent these states, where $m_x^k \in \{0, 1\}$, $k \in \Psi$. Only one specific state has $m_x^k = 1$ while the rest are 0, thus $\sum_{k \in \Psi} m_x^k = 1$.
A Generative Model
The reason to introduce the generative model is that it can simulate the static structure
as well as its noise and outliers, so that even when there are no observed measurements at
the current frame (e.g., depth holes), we can still provide a reasonable static structure.
Moreover, given suitable parametric forms of these distributions, the generative model
can be estimated and refined online by updating the parameters with sequentially
incoming depth samples.
Likelihood Appending the state indicator $\mathbf{m}_x$, the likelihood of $d_x^t$ conditioned on $\mathbf{m}_x$ and the static structure $Z_x$ is a product of the distributions of the three states,
$p(d_x^t \mid \mathbf{m}_x, Z_x) = \mathcal{N}(d_x^t \mid Z_x, \xi_x^2)^{m_x^I}\, U_f(d_x^t \mid Z_x)^{m_x^F}\, U_b(d_x^t \mid Z_x)^{m_x^B}.$
It reduces to the distribution of one particular state by triggering the corresponding state indicator $m_x^k = 1$, $k \in \Psi$.
Prior Let the prior for $Z_x$ also be a Gaussian distribution with mean $\mu_x$ and standard deviation $\sigma_x$, written as $p(Z_x) = \mathcal{N}(Z_x \mid \mu_x, \sigma_x^2)$. Note that $\sigma_x$ is different from $\xi_x$, since it represents the possible range of the static structure rather than its noise level. The prior on the chance of activating one state is a categorical distribution $\mathrm{Cat}(\mathbf{m}_x \mid \boldsymbol{\omega}_x)$ [64], where $\boldsymbol{\omega}_x = [\omega_x^I, \omega_x^F, \omega_x^B]^\top$ with $\sum_{k \in \Psi} \omega_x^k = 1$ and $\omega_x^k \in (0, 1)$. This parameter reveals the propensity of each state in advance of the input depth samples. Finally, $\boldsymbol{\omega}_x$ is modeled by a Dirichlet distribution $p(\boldsymbol{\omega}_x) = \mathrm{Dir}(\boldsymbol{\omega}_x \mid \boldsymbol{\alpha}_x)$, where $\boldsymbol{\alpha}_x = [\alpha_x^I, \alpha_x^F, \alpha_x^B]^\top$, $\alpha_x^k \in \mathbb{R}^+$, and each $\alpha_x^k$ corresponds to $\omega_x^k$.
Posterior Two posteriors are essential for the static structure estimation. One is $p(Z_x, \boldsymbol{\omega}_x \mid \mathcal{D}_x^t)$, which jointly presents the depth distribution of the static structure and the population densities of the three states given the current and all previous depth frames. The other is the posterior of the state indicator, $p(\mathbf{m}_x \mid \mathcal{D}_x^t)$, which represents the possible states at the current frame. Based on the estimated posteriors, we can evaluate the most probable depth value of the static structure by calculating the expectation $\mathbb{E}_{p(Z_x \mid \mathcal{D}_x^t)}[Z_x]$. The reliability of the current estimation refers to $\mathbb{E}_{p(\boldsymbol{\omega}_x \mid \mathcal{D}_x^t)}[\omega_x^I]$: the larger the portion of input depth samples that agree with the model, the more reliable the estimation. The most probable state of $d_x^t$ is calculated straightforwardly as $\arg\max_{\mathbf{m}_x} p(\mathbf{m}_x \mid \mathcal{D}_x^t)$.
4.3.2 Variational Approximation
However, it is almost infeasible to solve these posteriors analytically, because $Z_x$ and $\boldsymbol{\omega}_x$ are not independent in $p(Z_x, \boldsymbol{\omega}_x \mid \mathcal{D}_x^t)$, and $p(Z_x \mid \mathcal{D}_x^t)$ and $p(\boldsymbol{\omega}_x \mid \mathcal{D}_x^t)$ no longer exactly follow Gaussian and Dirichlet distributions. Therefore, variational approximation [64] of the posteriors is introduced to provide sufficiently accurate approximate posteriors efficiently. It minimizes the Kullback-Leibler divergence between the approximate and the original posteriors. The variationally approximated posteriors are required to have the same parametric forms as the priors, so they also produce analytical approximations of $\mathbb{E}_{p(Z_x \mid \mathcal{D}_x^t)}[Z_x]$ and $\mathbb{E}_{p(\boldsymbol{\omega}_x \mid \mathcal{D}_x^t)}[\omega_x^I]$. The approximation starts by factorizing the posterior $p(Z_x, \boldsymbol{\omega}_x \mid \mathcal{D}_x^t)$ into the product of an independent Gaussian distribution $q^t(Z_x) = \mathcal{N}(Z_x \mid \mu_x^t, (\sigma_x^t)^2)$ and a Dirichlet distribution $q^t(\boldsymbol{\omega}_x) = \mathrm{Dir}(\boldsymbol{\omega}_x \mid \boldsymbol{\alpha}_x^t)$, as
$q^t(Z_x, \boldsymbol{\omega}_x) = q^t(Z_x)\, q^t(\boldsymbol{\omega}_x) \approx p(Z_x, \boldsymbol{\omega}_x \mid \mathcal{D}_x^t).$  (4.1)
Moreover, the exact estimation depends on all the previous depth samples $\mathcal{D}_x^t$; retaining many frames would incur unbearable complexity and memory requirements. We therefore adopt a first-order Markov chain in our framework to favor online estimation: the current posterior can be estimated from just the current likelihood and the posterior of the last frame, making the method memory- and computationally efficient. We reformulate the posterior as a sequential parameter
Figure 4.4: Variational approximation of the parameter set of the static structure for a 1D depth sequence with $T = 500$ frames. (a) The expected depth sequence of the static structure versus the raw depth sequence, where the ideal $Z_x = 50$. (b) The confidence interval of $Z_x^t$, centered at $\mu_x^t$ and spanning $\mu_x^t \pm 2\sigma_x^t$ with 95% confidence. (c) The evolution of the portions of the three states (the expected value of $\boldsymbol{\omega}_x$ at frame $t$, denoted $[\omega_x^{I,t}, \omega_x^{F,t}, \omega_x^{B,t}]$); the ideal portions are $\boldsymbol{\omega}_x = [0.89, 0.1, 0.01]$. (d) The estimated distribution $q^T(d_x \mid \mathcal{P}_x^{D,T})$ versus the normalized histogram of $\mathcal{D}_x^T$ at $T = 500$. The estimated depth of the static structure reaches the ideal value with only a few samples, and its confidence interval shrinks rapidly, meaning the uncertainty is reduced very fast. The portion of each state evolves with the raw depth sequence and matches its ideal value given enough depth samples. At $T = 500$, the estimated data distribution fits the data histogram compactly.
estimation problem
$q^t(Z_x, \boldsymbol{\omega}_x) \approx p(Z_x, \boldsymbol{\omega}_x \mid \mathcal{D}_x^t) \approx p(d_x^t \mid Z_x, \boldsymbol{\omega}_x)\, q^{t-1}(Z_x, \boldsymbol{\omega}_x) / q^t(d_x^t) = Q(Z_x, \boldsymbol{\omega}_x \mid d_x^t),$  (4.2)
where the parameters of the left-hand side are estimated by matching moments between the distributions on the left- and right-hand sides [64]. This considers only the current data samples and the previously estimated parameters to approximate the current parameters. We define the parameter set estimated at $t-1$ as $\mathcal{P}_x^{D,t-1} = \{\mu_x^{t-1}, \sigma_x^{t-1}, \boldsymbol{\alpha}_x^{t-1}\}$, while the required parameter set is $\mathcal{P}_x^{D,t}$. By matching the first and second moments between $Q(Z_x \mid d_x^t)$ and $q^t(Z_x)$, as well as those between $Q(\boldsymbol{\omega}_x \mid d_x^t)$ and $q^t(\boldsymbol{\omega}_x \mid d_x^t)$ [84], we obtain a closed-form solution for every parameter in $\mathcal{P}_x^{D,t}$. Please refer to the supplementary materials for the detailed derivations.

Hence, recalling the problem addressed in Section 4.3.1, the approximate posterior with respect to the state indicator $\mathbf{m}_x$ is $q^t(m_x^k = 1 \mid d_x^t)$, $k \in \Psi$, which is a suitable approximation of $p(\mathbf{m}_x \mid \mathcal{D}_x^t)$ and also has a closed-form solution.
Apart from that, the most probable depth value of the static structure at pixel $x$ and time $t$ is
$Z_x^t = \mathbb{E}_{p(Z_x \mid \mathcal{D}_x^t)}[Z_x] \approx \mu_x^t,$  (4.3)
and the reliability of the current estimate of the static structure is the expectation of $\omega_x^I$,
$r_x^t = \mathbb{E}_{p(\boldsymbol{\omega}_x \mid \mathcal{D}_x^t)}[\omega_x^I] \approx \alpha_x^{I,t} \big/ \textstyle\sum_{k \in \Psi} \alpha_x^{k,t}.$  (4.4)
As shown in Figure 4.4, an example of the variational approximation of the param-
eter set for a 1D depth sequence illustrates the potential of the proposed method to
capture the nature of the input depth sequence.
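The 1D behavior of Figure 4.4 can be mimicked with a simplified online update. The sketch below uses a responsibility-weighted conjugate update in the spirit of the Gaussian-plus-uniform model of Vogiatzis et al. [83]; it is not the exact moment-matching derivation (which is given in the supplementary materials), and all constants (`xi`, `U_f`, `U_b`, the prior values, the 10% outlier ratio) are illustrative.

```python
import math
import random

def update(params, d, xi=2.0, U_f=0.01, U_b=0.01):
    """One simplified online update of q(Z) q(omega) for a 1D depth stream.
    params = (mu, sigma2, alpha); a responsibility-weighted stand-in for
    the moment-matching step of Section 4.3.2, not the exact derivation."""
    mu, sigma2, alpha = params
    var = sigma2 + xi * xi                      # predictive variance of d
    g = math.exp(-0.5 * (d - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
    a_sum = sum(alpha)
    # Responsibilities of state-I / state-F / state-B for this sample.
    w = [alpha[0] / a_sum * g,
         alpha[1] / a_sum * (U_f if d < mu else 0.0),
         alpha[2] / a_sum * (U_b if d > mu else 0.0)]
    s = sum(w) or 1.0
    r = [wi / s for wi in w]
    # Conjugate Gaussian update, damped by the inlier responsibility r[0].
    k = r[0] * sigma2 / var
    mu, sigma2 = mu + k * (d - mu), sigma2 * (1.0 - k)
    # Dirichlet update: soft count of the inferred state, cf. eq. (4.4).
    alpha = [a + ri for a, ri in zip(alpha, r)]
    return mu, sigma2, alpha

random.seed(0)
params = (45.0, 100.0, [1.0, 1.0, 1.0])         # broad, uninformative prior
for _ in range(500):
    if random.random() < 0.1:                   # ~10% forward outliers
        d = random.uniform(0.0, 50.0)
    else:                                       # inliers around Z = 50
        d = random.gauss(50.0, 2.0)
    params = update(params, d)
mu, sigma2, alpha = params
reliability = alpha[0] / sum(alpha)             # estimate of omega_I
```

As in Figure 4.4, `mu` settles near the true structure depth, the variance shrinks quickly, and `reliability` approaches the inlier fraction.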
4.3.3 Improvement with Color Video
The above discussion only considers the estimation and update of the static structure
with the depth video. A more complete treatment is together with the registered color
video, in which case an improved probabilistic generative model can be formulated as
follows.
Prior We introduce another prior over $C_x$, the color value of the static structure at $x$, as $p(C_x) = \mathcal{N}(C_x \mid U_x, \Sigma_x)$, with two parameters: the mean $U_x$ and the variance $\Sigma_x$.
Likelihood The likelihood of the input depth and color samples $d_x^t$ and $I_x^t$ conditioned on $\mathbf{m}_x$, given $Z_x$ and $C_x$, is
$p(d_x^t, I_x^t \mid \mathbf{m}_x, Z_x, C_x) = U_f(d_x^t \mid Z_x)^{m_x^F}\, U_b(d_x^t \mid Z_x)^{m_x^B} \times \left[\mathcal{N}(d_x^t \mid Z_x, \xi_x^2)\, \mathcal{N}(I_x^t \mid C_x, \Xi_x)\right]^{m_x^I},$  (4.5)
where $\Xi_x$ denotes the variance matrix of the color noise. A step further, we have the likelihood of $d_x^t$ and $I_x^t$ conditioned on $Z_x$ and $C_x$ accordingly. This formulation improves the inference, since an input depth sample is assigned to the static structure only when both the depth and color samples agree with the previous model. Therefore, the risk of false estimation is reduced.
Posterior and variational approximation In a similar fashion to Section 4.3.2, we can derive the approximate posterior when a color video exists. The parameter set $\mathcal{P}_x^{S,t} = \{\mu_x^t, \sigma_x^t, U_x^t, \Sigma_x^t, \boldsymbol{\alpha}_x^t\}$, $\mathcal{S} = \{\mathcal{D}, \mathcal{I}\}$, can also be estimated online and analytically. Furthermore, the most probable depth $Z_x^t$ and color $C_x^t$ of the static structure are obtained from $\mu_x^t$ and $U_x^t$. The approximate posteriors $q^t(m_x^k \mid d_x^t, I_x^t)$, $k \in \Psi$, are derived accordingly.
4.3.4 Layer Assignment
In this section, we find the static region of the input depth frame so as to robustly update the model of the static structure, and we identify the dynamic region. Specifically, we label the input depth frame into three layers $\mathcal{L} = \{l_{\mathrm{iss}}, l_{\mathrm{dyn}}, l_{\mathrm{occ}}\}$:
• $l_{\mathrm{iss}}$: agrees with the estimated static structure;
• $l_{\mathrm{dyn}}$: belongs to a dynamic object in front of it; or
• $l_{\mathrm{occ}}$: refers to the once-occluded structure behind it.
The additional label $l_{\mathrm{occ}}$ is essential because regions belonging to the once-occluded structure do not fit the current model, yet they reveal the hidden structure behind the currently estimated static structure. It also indicates that the current estimation is biased in these regions, where the depth structure from the input depth frame $D^t$ would be a more reasonable substitute to rectify the previous estimation.
Figure 4.5: A toy example illustrating the layer assignment. The cyan dotted line indicates the currently estimated depth structure of the static structure, and the red solid line is from the input depth frame. If color frames are available, they provide additional constraints to regularize the assignment: the upper line corresponds to the currently estimated texture structure of the static structure, and the lower one refers to the input color frame.

One toy example is shown in Figure 4.5, where $D^t$ provides a different layout from the current static structure. Intuitively, $l_{\mathrm{occ}}$ occurs when the input depth frame provides
larger depth values and exposes the hidden static structure; $l_{\mathrm{dyn}}$, on the contrary, is encouraged by smaller depth values. Furthermore, inference failures due to depth holes, noise and outliers can be eliminated by introducing texture information, which also provides additional cues to regularize the spatial layout of the labels.
To improve the expressive power to label the complex structures frequently encountered in our case, we exploit a fully connected conditional random field (fully-connected CRF) [85] to strengthen long-range spatial relationships. Assume a random field $L = \{l_x \in \mathcal{L} \mid \forall x\}$ conditioned on the input data $S^t$ and the previous model parameter set $\mathcal{M} = \mathcal{P}^{S,t-1}$. The Gibbs energy of a label assignment $L$ is
$E(L \mid S^t, \mathcal{M}) = \sum_x \psi_u(l_x \mid S^t, \mathcal{M}) + \frac{1}{2} \sum_{x \neq y} \psi_p(l_x, l_y \mid S^t, \mathcal{M}),$  (4.6)
where $x$ and $y$ are pixel locations, $\psi_u(\cdot)$ and $\psi_p(\cdot,\cdot)$ denote the unary and pairwise potentials, and $S^t = D^t$ or $\{D^t, I^t\}$.
Definition of unary and pairwise potentials
We define the unary potentials and pairwise potentials as follows:
Unary potentials The unary potentials are the negative logarithms of the approximate posteriors $q^t(\mathbf{m}_x \mid S_x^t)$, indicating the chance that the current depth sample follows the previous estimation (i.e., $l_{\mathrm{iss}}$ requires $m_x^I = 1$), lies in front of it (i.e., $l_{\mathrm{dyn}}$ needs $m_x^F = 1$), or lies behind it (i.e., $l_{\mathrm{occ}}$ refers to $m_x^B = 1$). In detail, we have $\psi_u(l_x = l_k \mid S^t, \mathcal{M}) = -\ln q^t(m_x^k = 1 \mid S_x^t)$, where $l_k$ and $m_x^k$ follow the correspondences listed above.
Pairwise potentials The pairwise potential between pixels $x$ and $y$ is a weighted mixture of Gaussian kernels,
$\psi_p(l_x, l_y \mid S_x^t, \mathcal{M}_x) = \mathbf{1}[l_x \neq l_y] \cdot \left\{ w_s \exp\left(-\tau_\alpha \|x - y\|^2 / 2\right) + w_r \exp\left(-\|\Delta^t f_x - \Delta^t f_y\|_{\Sigma_\beta}^2 / 2 - \tau_\gamma \|x - y\|^2 / 2\right) \right\}.$  (4.7)
We define $\Delta^t f_x = f_x^{I,t-1} - f_x^t$ to measure the difference between the features of the static structure and those of the input data. When $S^t = D^t$, $f_x^t$ and $f_x^{I,t-1}$ are the normalized $d_x^t$ and $Z_x^{t-1}$, whitened by the overall variance $(\xi_x^t)^2 = (\sigma_x^{t-1})^2 + \xi_x^2$. If $S^t = \{D^t, I^t\}$, let $f_x^t$ and $f_x^{I,t-1}$ be the concatenations of the normalized vectors $[d_x^t; I_x^t]$ and $[Z_x^{t-1}; C_x^{t-1}]$, where the color features are normalized with the variance $\Xi_x^t = \Xi_x + \Sigma_x^{t-1}$.
The indicator function $\mathbf{1}[l_x \neq l_y]$ makes the pairwise potentials a Potts model: it penalizes nearby pixels that are assigned different labels despite having similar features. The first kernel is a smoothness kernel that removes small isolated regions and is adjusted by $\tau_\alpha$. The second kernel is a range kernel that encourages nearby pixels with similar depth and/or color variation to share the same label, with $\tau_\gamma$ setting the degree of nearness. $\|\Delta^t f_x - \Delta^t f_y\|_{\Sigma_\beta}^2$ is the Mahalanobis distance between $\Delta^t f_x$ and $\Delta^t f_y$, where the covariance matrix $\Sigma_\beta$ encodes the feature proximity. The weight of the range kernel is $w_r$. With only the range kernel the result tends to be noisy, while with only the smoothness kernel the structure cannot be well regularized.
Inference
We exploit an efficient mean-field inference method for fully-connected CRFs with Gaussian pairwise potentials [85]. It amounts to an iterative estimation process, each iteration involving several runs of real-time high-dimensional filtering characterized by the pairwise potentials (4.7).
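The mean-field update of [85] can be sketched in miniature. The fast algorithm evaluates the kernel sums by high-dimensional filtering; the brute-force $O(N^2)$ version below shows only the update rule, on arbitrary unary costs and kernel strengths supplied by the caller (not computed from a real frame).

```python
import math

def softmax(v):
    """Numerically stable softmax over a list of logits."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def mean_field(unary, kernel, n_iters=5):
    """Naive mean-field inference for a fully connected Potts CRF.
    unary[i][l]: -log posterior cost of label l at pixel i (cf. Sec. 4.3.4);
    kernel[i][j]: total pairwise kernel strength between pixels i and j."""
    n, L = len(unary), len(unary[0])
    Q = [softmax([-u for u in unary[i]]) for i in range(n)]
    for _ in range(n_iters):
        new_Q = []
        for i in range(n):
            logits = []
            for l in range(L):
                # Potts message: expected penalty from neighbors disagreeing.
                msg = sum(kernel[i][j] * (1.0 - Q[j][l])
                          for j in range(n) if j != i)
                logits.append(-unary[i][l] - msg)
            new_Q.append(softmax(logits))
        Q = new_Q
    return Q
```

With strong unaries on two pixels and a weak, contrary unary on a third pixel connected to them, the third pixel's marginal flips to agree with its neighbors, while an unconnected pixel keeps its own preference.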
4.3.5 Online Static Structure Update Scheme
The online static structure updating scheme is actually a sequential variational parameter estimation problem, with a layer assignment to exclude the dynamic objects and include the once-occluded static structure. A spatial enhancement is appended to regularize the spatial layout of the structure. The sketch of the algorithm is given in Algorithm 2.

Algorithm 2: Online Static Structure Update Scheme
Input: Data sequence $\mathcal{S} = \{S^\tau \mid \tau = 0, 1, 2, \ldots\}$; initial parameter set $\mathcal{P}^S_{\mathrm{init}}$
Output: Current parameter set $\mathcal{P}^{S,t}$
// initialization
1: $t \leftarrow 0$, $\mathcal{P}^{S,0} \leftarrow \mathrm{param\_init}(S^0, \mathcal{P}^S_{\mathrm{init}})$
2: while $\mathcal{S} \neq \emptyset$ do
3:   $t \leftarrow t + 1$
     // 1. layer assignment
4:   $\mathcal{M} \leftarrow \mathcal{P}^{S,t-1}$, $L \leftarrow \arg\min_L E(L \mid S^t, \mathcal{M})$
     // 2. parameter update
5:   for all $x$ do
6:     if $l_x = l_{\mathrm{iss}}$ then $\mathcal{P}^{S,t}_x \leftarrow \mathrm{vari\_approx}(S^t_x, \mathcal{P}^{S,t-1}_x)$
7:     else if $l_x = l_{\mathrm{occ}}$ then $\mathcal{P}^{S,t}_x \leftarrow \mathrm{param\_init}(S^t_x, \mathcal{P}^S_{\mathrm{init}})$
8:     else if $l_x = l_{\mathrm{dyn}}$ then $\mathcal{P}^{S,t}_x \leftarrow \mathcal{P}^{S,t-1}_x$
     // 3. spatial enhancement
9:   $Z^t_x \leftarrow \mu^t_x$, $\forall x$
10:  $Z^t \leftarrow \mathrm{spatial\_enhance}(Z^t, \mathcal{P}^{S,t})$, $\mu^t_x \leftarrow Z^t_x$, $\forall x$
An initialization of the parameter set $\mathcal{P}^S$ is necessary. We set the initial $\mu_x^0 = d_x^0$, where $d_x^0 \in D^0$ is from the first frame of the depth video. Similarly, let $U_x^0 = I_x^0$, where $I_x^0 \in I^0$ is from the color video. The noise parameters $\xi_x$ and $\Xi_x$ are user-specified constants, which should be large enough to accommodate sufficient variance of the input data. $\sigma_x^0$ and $\Sigma_x^0$ are initialized with large values as well. The parameters of $\boldsymbol{\omega}_x$ are also set up with given constants $\boldsymbol{\alpha}_x^0$; a convenient setup is $\alpha_x^{I,0} = \alpha_x^{F,0} = \alpha_x^{B,0}$. The user-given initialization parameter set is $\mathcal{P}^S_{\mathrm{init}} = \{\xi_x, \sigma_x^0, \boldsymbol{\alpha}_x^0 \mid \forall x\}$ when $\mathcal{S} = \mathcal{D}$, and $\mathcal{P}^S_{\mathrm{init}} = \{\xi_x, \sigma_x^0, \Xi_x, \Sigma_x^0, \boldsymbol{\alpha}_x^0 \mid \forall x\}$ when $\mathcal{S} = \{\mathcal{D}, \mathcal{I}\}$. In addition, the layer assignment is not applied in the initialization step.
At the $t$-th frame, the layer assignment is applied first, based on the previous parameter set $\mathcal{P}^{S,t-1}$ and the input data $S^t$. Regions where $l_x = l_{\mathrm{iss}}$ undergo the variational parameter estimation to obtain a renewed $\mathcal{P}^{S,t}_x$. If $l_x = l_{\mathrm{dyn}}$, the pixel belongs to a dynamic object, so $\mathcal{P}^{S,t}_x = \mathcal{P}^{S,t-1}_x$. On the other hand, if $l_x = l_{\mathrm{occ}}$, the parameter set of this pixel is re-initialized as in the initialization step, but with $\mu_x^t = d_x^t$ and $U_x^t = I_x^t$. Furthermore, it is a common phenomenon that the input depth frames contain holes without depth measurements; in this case, $\mu_x^t$ and $\lambda_x^t$ are not updated in these regions.
The spatial enhancement, including hole filling, smoothing and regularization, is necessary to generate a spatially refined static structure. It is performed after the parameter estimation in each frame, where we have obtained the most probable depth map $Z^t$ ($Z_x^t \in Z^t$). A variational inpainting method incorporating a TV-Huber norm and a data term based on the Mahalanobis distance with variance $(\xi_x^t)^2$ is employed for the spatial enhancement, which is iteratively solved by a primal-dual approach [16]. Since the solver requires hundreds of iterations to converge, a trade-off between speed and accuracy is adopted by fixing the number of iterations and using the spatially enhanced result of the last frame, $Z^{t-1}$, as the initialization. To reduce error propagation, unreliable pixels in the input depth map $Z^t$ are deleted according to the reliability check $r_x^t > 0.5$ (cf. equation (4.4)). Given the most probable color image $C^t$ of the current static structure, the spatial enhancement of $Z^t$ can absorb the texture information to guide the propagation of local structures. In the end, the enhanced depth map $Z_x^t$ substitutes $\mu_x^t$ in $\mathcal{P}^{S,t}_x$.
4.3.6 Temporally Consistent Depth Video Enhancement
Apart from spatial enhancement, it is preferred to employ temporal enhancement to
produce a flicker-free depth video. To enable long-range temporal consistency and allow
online processing, we exploit the static structure of the captured scene as a medium to
find the region in the input frame exhibiting long-range temporal connection. The static
region is enhanced by fusing the input depth measurements with the static structure
according to the online static structure update scheme in Section 4.3.5. Thus the static
regions are well-preserved and incrementally refined over time. The idea behind this
is that we enforce temporal consistency only around static regions
or slowly moving objects. This assumption is somewhat restrictive but still suitable
for processing typical depth videos. One additional advantage of the proposed method is
that it can prevent bleeding artifacts that propagate depth values from moving objects
into the static background as long as the layer assignment is robust.
Given the resulting layer assignment of the current frame, the static region is where
lx ∈ {liss, locc}, including the regions referring to the static structure and those belong-
ing to the once occluded static structure. They both expose the current visible static
structure of the captured scene, thus shall be enhanced separately from the dynamic
objects. The enhanced version is obtained by substituting it with its counterpart in
the static structure, which has already been updated in the temporal domain and en-
hanced in the spatial domain (see Section 4.3.5). The dynamic region can be enhanced
by various approaches explored in the literature, while in this chapter we exploit a con-
ventional joint bilateral filter, both to fill holes and to perform edge-preserving filtering
in the dynamic region.
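The composition of the enhanced frame can be sketched as below, assuming integer layer labels; the joint bilateral filtering of the dynamic region is left out, and the function name is illustrative.

```python
import numpy as np

L_ISS, L_DYN, L_OCC = 0, 1, 2  # layer labels: static, dynamic, occluded-static

def compose_enhanced_frame(labels, d_in, static_depth):
    """Static regions (l in {liss, locc}) take the updated static-structure
    depth; dynamic regions keep the input, to be filtered separately
    (e.g. by a joint bilateral filter)."""
    static_mask = (labels == L_ISS) | (labels == L_OCC)
    return np.where(static_mask, static_depth, d_in)
```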
The proposed method is both memory- and computationally efficient. The memory
it requires goes only to storing the parameter set of each pixel, so it can efficiently
process streaming videos or long high-quality sequences. Excepting
the cost of the spatial enhancement, the complexity for temporal enhancement hinges on
that of the online static structure update scheme, in which all the required parameters
have analytical solutions whilst the layer assignment is efficient thanks to the constant-
time implementations in solving the fully-connected CRF model. Provided with an
efficient spatial enhancement approach, for example, the domain transform filter [32]
or the proposed one with the help of multi-thread techniques or GPGPUs [86], the
entire temporally consistent depth video enhancement procedure can be achieved in
real-time.
4.4 Experiments and Discussions
In this section, we present our experiments on synthetic and real data to demonstrate
the effectiveness and robustness of our static structure estimation and depth video
enhancement.
Section 4.4.1 numerically evaluates the performance of our method for static struc-
ture estimation using synthetic depth videos2 generated from the Middlebury dataset [87;
88]. Our method is not sensitive to the user-given parameters, and outperforms various
static scene estimation methods with a running time comparable to temporal
median filtering.
2The depth of one pixel in the depth frame is proportional to the reciprocal of the disparity at the same place in the corresponding disparity frame.
Figure 4.6: Sample frames of the input depth video with two types of noise and outliers. (a) Reindeer: the sample color frame; (b) and (c) are the contaminated depth frames with σ_n = 2 and ω_n = 10−2, where (b) is type-I and (c) is type-II. Type-II error is worse than type-I error with the same parameters.
In Section 4.4.2, we evaluate the performance on real data captured by Kinect and
ToF cameras. Both static and dynamic indoor scenes are taken into consideration.
Apart from the estimation of static structure, we also evaluate the performance of
the static scene reconstruction and most importantly, the temporally consistent depth
video enhancement in Section 4.4.3.
Initial parameters are simply set as α^0_x = [1, 1, 1]^T, and σ^0_x is 10% of the depth range
of the input scene. The initial parameter Σ^0_x is a diagonal matrix in which each diagonal
entry is the square of 10% of the color range.
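A minimal sketch of this initialization, with hypothetical array shapes (one 3-vector α and one 3×3 Σ per pixel); the function name and argument names are our own.

```python
import numpy as np

def init_parameters(depth_range, color_range, shape):
    """Initialize the per-pixel parameter set: alpha^0 = [1,1,1]^T,
    sigma^0 = 10% of the depth range, Sigma^0 diagonal with entries
    (10% of the color range)^2."""
    h, w = shape
    alpha0 = np.ones((h, w, 3))
    sigma0 = np.full((h, w), 0.1 * depth_range)
    Sigma0 = np.broadcast_to(np.eye(3) * (0.1 * color_range) ** 2,
                             (h, w, 3, 3)).copy()
    return alpha0, sigma0, Sigma0
```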
4.4.1 Numerical Evaluation of the Static Structure Estimation By Synthesized Data
We used two types of noise and outliers, which are illustrated in Figure 4.6, to contam-
inate the depth video so that we could evaluate the performance of our method with
respect to different kinds of errors from different types of depth sensors.
Type-I: We contaminated the depth map via p(d_x | Z_x) = (1 − ω_n) N(d_x | Z_x, σ_n^2) +
ω_n U(d_x), where U(d_x) is the reciprocal of the depth range. It is a general model of
noise and outliers.
Type-II: We damaged the disparity map by p(d^disp_x | Z^disp_x) = (1 − ω_n) N(d^disp_x | Z^disp_x, σ_n^2) +
ω_n U(d^disp_x) and rounded it. The disparity map was then transformed into the depth map.
U(d^disp_x) is the reciprocal of the disparity range. This mimics the outliers in common
depth videos captured by stereo or Kinect.
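These two contamination models can be sketched as follows; the function names are our own, and clamping the rounded disparity to at least one (to avoid division by zero) is an assumption of this sketch.

```python
import numpy as np

def contaminate_type1(Z, sigma_n, omega_n, d_min, d_max, rng):
    """Type-I: Gaussian noise on the depth plus uniform outliers over the
    depth range [d_min, d_max]."""
    d = Z + sigma_n * rng.standard_normal(Z.shape)
    outlier = rng.random(Z.shape) < omega_n
    d[outlier] = rng.uniform(d_min, d_max, size=Z.shape)[outlier]
    return d

def contaminate_type2(Z, sigma_n, omega_n, f, B, rng):
    """Type-II: contaminate and round the disparity disp = f*B/Z, then
    convert back to depth; mimics stereo/Kinect quantization."""
    disp = f * B / Z
    d_disp = disp + sigma_n * rng.standard_normal(Z.shape)
    outlier = rng.random(Z.shape) < omega_n
    d_disp[outlier] = rng.uniform(disp.min(), disp.max(), size=Z.shape)[outlier]
    d_disp = np.maximum(np.rint(d_disp), 1.0)  # round; avoid divide-by-zero
    return f * B / d_disp
```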
[Figure 4.7 panels: RMSE (log scale, 1e0–1e3) over outlier parameter u ∈ [1e−5, 1e−1] and std parameter σ ∈ [0, 20]; (a) I : (10−3, 1), (b) I : (10−2, 2), (c) I : (10−1, 4), (d) II : (10−3, 1), (e) II : (10−2, 2), (f) II : (10−1, 4).]
Figure 4.7: RMSE maps with varying u and σ under different noise and outlier parameter pairs (ω_n, σ_n). (a)–(c) were contaminated by type-I noise, while (d)–(f) were contaminated by type-II.
[Figure 4.8 panels: RMSE (log scale) versus frame order (0–100). Optimal (u, σ) legend pairs: (a) (10−1, 4): (10−3.5, 20) and (10−3.3, 3.2); (b) (10−2, 2): (10−3.7, 20) and (10−3.7, 2.2); (c) (10−3, 1): (10−5, 20) and (10−4.8, 2.2).]
Figure 4.8: Performance comparisons between the constant and depth-dependent ξ_x under different type-II noise and outlier parameter pairs (ω_n, σ_n). The red curve uses the depth-dependent ξ_x, and the blue curve the constant ξ_x. Each curve is obtained at its own optimal parameter pair (u, σ), as shown in the legends.
Analysis of user-given parameters
We first evaluated the user-given parameters: the outlier parameters U_f, U_b and the
noise standard deviation ξ_x. In case-I, we set ξ_x = σ, a constant throughout the pixel
domain. For case-II, the choice of ξ_x should be able to dispose of the non-uniform
quantization error due to the disparity-depth conversion, so ξ_x = σ d_x^2/(fB).3 Meanwhile, we set
U_f = U_b = u. The experiments were evaluated by the RMSE score with varying u and σ
under different levels of noise (σ_n) and outliers (ω_n). The results are shown in Figure 4.7,
where the test video had 100 frames. We set σ ∈ [0, 20] and u ∈ [10−5, 10−1]. Notice
that the tested scene was static, so there was no need to perform layer assignment.
The spatial enhancement was also skipped.
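The depth-dependent standard deviation follows from differentiating d = fB/disp with respect to the disparity, |∂d/∂disp| = fB/disp² = d²/(fB), so unit-variance disparity noise scaled by σ maps to depth noise σd²/(fB). A one-line sketch (function name illustrative):

```python
def depth_dependent_std(depth, sigma, f, B):
    """xi_x = sigma * d_x^2 / (f*B): depth-noise std induced by disparity
    noise of std sigma, via the conversion d = f*B/disp."""
    return sigma * depth ** 2 / (f * B)
```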
The proposed method achieves satisfactory performance and is insensitive to ξ_x,
although a slightly larger ξ_x turns out to be more robust. On the other hand, we obtain low
RMSE scores when u is around or smaller than the reciprocal of the depth range (≤ 10−3
in the test depth videos). Although a smaller u can still achieve good performance, its
effective range tends to narrow as the noise level increases. In practice, setting U_f
and U_b to the reciprocal of the depth range is sufficient and convenient, since it
effectively means that outliers may occur uniformly inside the depth range.
In addition, the depth-dependent noise parameter ξ_x performs better than the constant
ξ_x in dealing with type-II error. As shown in Figure 4.8, comparisons of the results
obtained with the optimal parameter pairs (u, σ) of both cases4 reveal that a larger constant ξ_x is
required to capture the severer noise present at larger depth values, a property of
type-II error. Compared with the depth-dependent noise model, a constant ξ_x might be
sufficient for slightly noisy depth videos, as shown in Figure 4.8(c), but lacks the capability
to capture severe noise, as shown in Figure 4.8(a) and (b).
Comparison of synthetic static scenes
As some online 3D scene reconstruction methods can also successfully perform the
static scene estimation in an online fashion, we numerically compared several state-
of-the-art candidates, i.e., the truncated signed distance function (TSDF) [79; 80] in
KinectFusion, the temporal median filter (t-MF) and the generative model for depth
fusion (g-DF) [81], with our method. The grid number per pixel was set as 100, for
both TSDF and g-DF. The temporal window size of t-MF was 5 in our experiments.
As shown in Figure 4.9, our method, like all the others, tends to decrease the
RMSE progressively as more frames are included. However, our method is robust to the
3f is the focal length and B is the baseline, both of which are provided in the Middlebury dataset. The conversion relationship is derived in the supplementary materials.
4The optimal results were obtained by exhaustive search over 400 uniformly-sampled parameter pairs in the range σ ∈ [0, 20] and u ∈ [10−5, 10−1].
[Figure 4.9 panels: RMSE (log scale, 10−1–103) versus frame order (0–100) for Ours, TSDF, t-MF, g-DF and the input; (a) I : (10−3, 1), (b) II : (10−3, 1), (c) I : (10−2, 2), (d) II : (10−2, 2), (e) I : (10−1, 4), (f) II : (10−1, 4).]
Figure 4.9: Comparison with other methods on static structure estimation of the synthetic static scenes. Three levels of noise and outlier parameter pairs (ω_n, σ_n) were tested. (a), (c) and (e) were of type-I; (b), (d) and (f) were of type-II. The x-axis marks the frame order, and the y-axis the RMSE score.
noise and outliers for both the type-I and type-II errors, and converges faster, i.e., it
needs fewer frames to reach a stable performance. The severer the noise, the larger the
advantage of the proposed method. Because TSDF converges more slowly and g-DF
suffers from quantization errors, they usually cannot match the performance of our
method. In fact, with a very large window
[Figure 4.10 panels for Indoor_Scene_1: raw depth and color sequences, and the estimated static structure at t = 0, 5, 10 without spatial enhancement, with spatial enhancement (w/o texture), and with spatial enhancement (w/ texture).]
Figure 4.10: Visual evaluation on real indoor static scenes. (a) is the result of a real indoor scene Indoor Scene 1. The first row shows the raw depth sequences and color sequences. The second row shows selected results of the estimated static structures without spatial enhancement at frames t = 0, 5, 10, respectively. The third row shows the corresponding spatially enhanced static structure without texture information, while the last row exhibits the results with the guidance of texture information. The yellow color in the second row marks missing depth values (holes). Gray represents depth value, lighter meaning a nearer distance from the camera. Best viewed in color.
size, t-MF might obtain RMSE scores lower even than those of our method, but would
require more memory and will tend to be slower. Furthermore, t-MF does not provide
confidence of its output as our method does. Due to the quantization artifact of g-DF,
even in an optimal setting, g-DF generally exhibits a lower performance than the
proposed method; the occupancy grid prevents g-DF from reaching sub-grid
accuracy [81].
[Figure 4.11 panels for Indoor_Scene_2: raw depth and color sequences, and the estimated static structure at t = 0, 5, 10 without spatial enhancement, with spatial enhancement (w/o texture), and with spatial enhancement (w/ texture).]
Figure 4.11: Visual evaluation on real indoor static scenes. (b) shows the results of a real indoor scene Indoor Scene 2. The first row shows the raw depth sequences and color sequences. The second row shows selected results of the estimated static structures without spatial enhancement at frames t = 0, 5, 10, respectively. The third row shows the corresponding spatially enhanced static structure without texture information, while the last row exhibits the results with the guidance of texture information. The yellow color in the second row marks missing depth values (holes). Gray represents depth value, lighter meaning a nearer distance from the camera. Best viewed in color.
Algorithms        t-MF (w=5)   t-MF (w=10)   g-DF     TSDF     Ours
Running time (s)  0.0188       0.0309        1.9186   0.6847   0.0223

Table 4.1: Per-frame running time comparison (MATLAB platform)
The per-frame running time comparison is listed in table 4.1, where our method
is comparable with t-MF. The t-MF with window size 5 has a slightly smaller com-
putational cost, but when the window size is 10, its running time exceeds that of our
method. g-DF and TSDF require much more time to process a single frame, yet their
performance is still not comparable to that of our method.
4.4.2 Evaluation of the Static Structure Estimation By Real Data
To validate our algorithm with the real data, we picked several depth video sequences
captured by Kinect and ToF cameras. Both static and dynamic scenes were tested.
Static scenes
Figures 4.10 and 4.11 show the results of two real indoor scenes captured by Kinect. The
first row shows the raw depth and color video sequences. Notice that severe holes
are present, and fine details of the scene are prone to be missing or to take faulty
depth values. Nevertheless, the corresponding color frames are well-defined
everywhere, providing enough cues to regularize the structures.
We first estimate the static structure just by raw depth frames without spatial
enhancement. See the second rows in Figures 4.10 and 4.11. Our method can robustly
fill holes as long as sufficient depth samples in previous frames are available. In the
case where only depth video is applicable, spatial enhancement is only constrained
by the depth information. Even though the results are more spatially regular than
those without spatial enhancement, inpainting artifacts occur inside sufficiently large
holes, and edges are blurred. Furthermore, wrong measurements in the depth frames
will be retained in the static structure and cannot be eliminated. As illustrated in
the last rows of Figures 4.10 and 4.11, spatial enhancement based on both depth and
texture information produces refined static structures which are both reliable and user-
acceptable. The results in green boxes show the differences between two types of spatial
enhancements.
Directly employing spatial enhancement in raw depth frames cannot obtain stable
results since randomly occurring holes and outliers destroy the consistency between
frames and prevent the regularizing of the depth map into a temporally stable one.
The static structure, in contrast, enforces the long-range temporal connection and
incrementally refines the static scene. As shown in red circles in Figures 4.10 and 4.11,
(a) Indoor Scene 1
(b) Indoor Scene 2
Figure 4.12: Reliability maps of two test sequences of indoor static scenes.
the missing structures cannot be inferred satisfactorily by conventional methods alone,
but they are refined and converge as time goes on.
The reliability of the estimated static structure (shown in Figure 4.12) is measured
by the proportion of samples that agree with the static structure as per equation (4.4),
which indicates that flat or smooth surfaces in the static structure are of high reliability.
Simply marking pixels with r^t_x ≤ 0.5 as unreliable shows that many unreliable pixels lie around
depth discontinuities or occlusions. It is reasonable that measurements around such regions
tend to be unreliable due to the systematic limitations of Kinect and related depth
sensors. The static structure can be spatially regularized further in conjunction with
the reliability map by reducing the data confidence in the unreliable region. Our reli-
ability map is data-driven unlike those by heuristic methods [30] that need user-tuned
parameters.
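A sketch of such a data-driven reliability map follows. The fixed agreement tolerance stands in for the model-based test of equation (4.4) and, like the class name, is an assumption of this sketch.

```python
import numpy as np

class ReliabilityMap:
    """Per pixel, the fraction of depth samples observed so far that agree
    with the static structure."""
    def __init__(self, shape):
        self.agree = np.zeros(shape)
        self.total = np.zeros(shape)

    def update(self, d_t, mu, valid, tol):
        """Accumulate one frame: d_t is the input depth, mu the static
        structure, valid marks pixels that carry a measurement."""
        self.agree += valid & (np.abs(d_t - mu) < tol)
        self.total += valid
        return self.reliability()

    def reliability(self):
        # Pixels with no samples yet report 0 (fully unreliable).
        return self.agree / np.maximum(self.total, 1.0)
```

Thresholding the returned map at 0.5 reproduces the unreliable-pixel marking used above.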
Figure 4.13: Static structure estimation on dyn_kinect_tl. (a) and (b) are the first five frames of the input sequence. (c) shows the layer assignment results; red, green and blue denote l_iss, l_dyn and l_occ, respectively. (d) represents the depth map of the static structure, and (e) shows the corresponding color map. The first frame is used for initialization.
Dynamic Scenes
Our method can effectively extract the dynamic content from a static scene and further
estimate and refine the static structure in the static region. Two videos were evalu-
ated. One was captured by Kinect, a real indoor scene with people moving around
(dyn kinect tl). The second was a hand sequence by a ToF camera (dyn tof tl).
Kinect sequence. dyn_kinect_tl is a time-lapse (30×) Kinect sequence. Figure 4.13
shows the results of the first five frames. The parameter set for layer assignment was
wr = 5, ws = 10, τ_α = 16^{-2}, τ_γ = 3^{-2}, Σ_β = I. Our proposed method can rapidly
capture the static structure (both the depth and color) with very few frames. The
artifact in Figure 4.13(d) is partially due to unreliable initialization, and partially
because of the limited number of iterations of hole filling in the spatial enhancement.
Figure 4.14: Static structure estimation on dyn_tof_tl. (a) shows the first five frames of the input sequence. (b) shows the layer assignment results; red, green and blue denote l_iss, l_dyn and l_occ, respectively. (c) represents the depth map of the static structure. The first frame is used for initialization.
The latter issue is resolved gradually after a few frames, as shown in the 3rd and 4th
frames in (d). The former problem will be relieved by deleting unreliable areas in
future frames according to the reliability map.
ToF sequence. The ToF sequence dyn_tof_tl [1] is time-lapse (10×) and has no
color sequence embedded, as shown in Figure 4.14. The parameter set for layer
assignment was wr = 20, ws = 10, τ_α = 5^{-2}, τ_γ = 1^{-2}, Σ_β = I. Similar to the results
from dyn kinect tl, the layer assignment can effectively exclude depth values from
dynamic foregrounds (lx = ldyn) and include those from once occluded static structures
(lx = locc). Nevertheless, the blurs around boundaries and high noise level in the raw
depth frames lead to halo artifacts in the resultant static structures at the first few
frames, because in this case the layer assignment cannot definitively point out the ex-
act boundaries between layers. Fortunately later frames provide more reliable depth
samples in such regions, thus eliminating these artifacts. See the difference from the
3rd to the 5th frame in Figure 4.14 (c).
4.4.3 Temporally Consistent Depth Video Enhancement
Our depth video enhancement works in conjunction with the online static structure up-
date scheme. The quality of the static structure determines the resulting performance
Figure 4.15: Comparison on depth video enhancement. (a) and (b) are selected frames from the test RGB-D video sequences; from left to right: the 113th, 133rd, 153rd, 173rd, 193rd and 213th frames. (c) shows the results by CSTF [1], and (d) by WMF [5]. (e) shows the results by Lang et al. [6]. (f) is generated by the proposed method. (g) compares the performances among these methods in the enlarged sub-regions (shown in raster-scan order). Best viewed in color.
from enhancing the tested frame spatially and temporally. Thanks to the robustness
and effectiveness of our proposed method, this temporally consistent enhancement out-
performs most existing representative approaches and shows comparable results with
current state-of-the-art long-range temporally consistent depth video enhancement [6].
We tested several RGB-D sequences to verify our conclusion and highlight the advan-
tages of the proposed method. These videos and their results by the proposed method
and the reference approaches are available in the supplementary materials.
As shown in Figure 4.15, the selected frames from the sequence dyn kinect 1 are
(a) dyn_kinect_2
Figure 4.16: Comparison on depth video enhancement. (a) shows selected frames from an RGB-D video sequence dyn_kinect_2. From top to bottom: the RGB frames, the raw depth frames, results by Lang et al. [6], and results by the proposed method. Best viewed in color.
113th, 133rd, 153rd, 173rd, 193rd and 213th, from left to right. Severe holes occurring in
each frame are partially because of occlusion and partially due to absorbing or
reflective materials in the captured scene. Worse still, the depth values around the
boundaries of captured objects tend to be erratic. The raw depth and color frames
are shown in Figure 4.15(a) and (b). The reference methods are the coherent spatio-
temporal filtering [1] (CSTF), the weighted mode filtering [5] (WMF) and temporally
consistent depth upsampling by Lang et al. [6]. Their parameters were set to the
default values given in their papers. The reference results are shown in (c), (d) and
(e) of Figure 4.15 and the results of the proposed method are listed in Figure 4.15(f).
CSTF tends to blur more than the other methods, especially inside
the holes around the boundaries between the foreground objects and the background
scene. WMF needs to quantize the depth frame into finite bins (in this experiment, 256
(b) dyn_kinect_3
Figure 4.17: Comparison on depth video enhancement. (b) shows selected frames from an RGB-D video sequence dyn_kinect_3. From top to bottom: the RGB frames, the raw depth frames, results by Lang et al. [6], and results by the proposed method. Best viewed in color.
bins were applied), thus resulting in quantization artifacts even though it encourages
sharper boundaries without blurring. Referring to any frame in Figure 4.15(c) and
Figure 4.15(d), neither of these two methods can fill the depth holes with satisfactory
accuracy, and the latter one performs worse in stabilizing these holes. On one hand,
the reason is that they are not able to fill large holes without propagating wrong
depth structure when the texture is less informative. On the other hand, the temporal
consistency is enforced only within a small temporal window, so the structure inside
the holes cannot be preserved over a long time.
A recent remarkable improvement, attributable to Lang et al. [6], is a practical
long-range temporal consistency enhancement. Its results, shown in Figure 4.15(e),
demonstrate its superiority in both structure regularization and temporal stabilization
over the previous two reference methods. The method by Lang et al. not only
temporally stabilizes the static objects and background, but also enforces long-range
temporal consistency on the dynamic objects; in comparison, the proposed method does
not preserve temporal consistency inside the dynamic objects. However, with the method
of Lang et al., bleeding artifacts in the hole regions still cannot be eliminated immediately
and are liable to be propagated over adjacent frames. Although the method is
computationally efficient thanks to an approximate solver based on constant-time
domain transform filtering [32], it is globally optimized and thus often requires
storing all frames in memory.
Compared with the prior art, the proposed method outperforms CSTF and
WMF both spatially and temporally. Furthermore, it generally performs comparably
to the method of Lang et al., and is sometimes even superior around static holes be-
tween dynamic objects and the static background, and in stabilizing the static region
of each frame. Figure 4.15(g) compares the results of the enlarged sub-regions denoted
by the red boxes in the original frames, in which our method features superior per-
formance in regularizing these depth structures. In addition, by observing the static
background behind the moving people, the proposed method offers much more stable
results around regions where there were large holes, e.g., the black computer cases and
monitors placed on and under the white tables. It both preserves the long-range stabil-
ity of the depth structure in the holes of the static region and at the same time prevents
depth propagation from the dynamic objects to the static background. Meanwhile, the
spatially enhanced static structure by the proposed method can incrementally refine it-
self by following the guidance of the corresponding color map, and gradually converges
to a stable output, just as discussed in Section 4.4.2.
Two additional results by the proposed method and by Lang et al. [6] are presented
in Figures 4.16 and 4.17, in which the proposed method provides comparable quality
while encouraging even more delicate details around the hands and heads, as well as blur-
free boundaries between the human and the background, owing to the success of the layer
assignment in Section 4.3.4. However, because the proposed method cannot extract a
static foreground object from the static background, blurring artifacts or false depth
propagation may happen around their boundaries, just as with the aforementioned
state-of-the-art method by Lang et al. and the filtering-based approaches like CSTF
Figure 4.18: Failure cases of the proposed method. (a) and (b) are two representative results. From left to right: the color frame, the raw depth frame and the enhanced depth frame. Artifacts are bounded by the red dotted boxes.
and WMF. Referring to the standing person near the background in Figure 4.17,
both the proposed method and that of Lang et al. falsely propagated depth values
from his left arm onto the computer case in the background.
4.5 Limitations and Applications
4.5.1 Limitations
One limitation is that the proposed method has only been tested with indoor Kinect
and ToF depth videos. To verify the reliability and generality of the proposed method,
more diverse sources of depth data, e.g., depth videos capturing indoor or outdoor
scenes, by Kinect, ToF or laser scanners, as well as stereo vision, should be evaluated
thoroughly.
For RGB-D video enhancement, the proposed method is constrained by the as-
sumption that the static structure is “static” both in the depth and color channels.
The static structure estimation may thus fail if the captured scene has varying illu-
mination, in which case, the spatio-temporal enhancement turns into a conventional
spatial enhancement approach. Another possible drawback of the proposed method is
that the false estimation in the static structure cannot be eliminated if future frames
cannot provide enough reliable depth samples at the same location. For example, the
(a) RGB frame (b) Raw depth frame (c) Ours
(d) Lang et al. [6] (e) CSTF [1] (f) WMF [5]
Figure 4.19: Examples of the background subtraction. Best viewed in color.
artifacts marked by the red dotted boxes in the enhanced depth frames (c.f. Fig-
ure 4.18) correspond to the holes in the input depth frames. The input depth frames
cannot provide effective and reliable depth samples at these regions thus the artifacts
cannot explicitly be detected by the proposed model. One possible improvement might
heuristically define a threshold to delete such regions from the static structure when
no reliable depth samples are received within a sufficiently long time.
The proposed method models the captured scene with only dynamic and static
layers, and does not immediately extend to multiple (e.g., more than 3) layers.
Although it is a tough question to define and model such layers properly, we believe
that more accurate results are possible with this extension. For instance, the
relationship between different dynamic objects can be well-defined if multiple dynamic
layers compactly represent the local statistics of these objects. In this case, the spatial
enhancement of each object can be handled separately and/or hierarchically, while
the temporal enhancement can be adjusted to fit their distinctive motion patterns.
Therefore, this meaningful extension is worth exploring in depth as a future
topic.
Figure 4.20: Examples of novel view synthesis. (a) and (b) are the input RGB and depth frames. (c) is the enhanced depth frame by the proposed method. (d) is the view synthesized from the raw depth frame and the RGB frame; its image holes are filled by the static structure, as shown in (e). (f) is the view synthesized from the enhanced depth frame, with image holes likewise filled by the estimated static structure. Best viewed in color.
4.5.2 Applications
A high quality depth video improves various applications in the fields of image and
graphics processing, and computer vision as well. In the following two successful appli-
cations, the enhanced depth videos by the proposed method act as an effective cue to
improve performance.
Background Subtraction
We can use the processed RGB-D videos to improve the segmentation of foreground
objects from the background. As shown in Figure 4.19, we tested one pair of RGB-D
frames for background subtraction by simply extracting the region with depth values
smaller than a constant threshold (in this case, we set the threshold as 1500mm) and
replacing the background by blue color. Note that there was no boundary matting
applied in all the cases. The proposed method (c.f. Figure 4.19(c)) shows a much more
refined and complete foreground segment than those by the reference methods.
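The thresholding step can be sketched directly; the 1500 mm threshold matches the experiment, while the function name and the blue-background encoding are our own illustrative choices.

```python
import numpy as np

def subtract_background(rgb, depth, thresh_mm=1500.0, bg_color=(0, 0, 255)):
    """Keep pixels closer than the depth threshold; paint everything else
    with a flat background color (blue, as in Figure 4.19)."""
    fg = depth < thresh_mm
    out = np.empty_like(rgb)
    out[...] = np.asarray(bg_color, dtype=rgb.dtype)
    out[fg] = rgb[fg]
    return out
```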
Novel View Synthesis
A variant of novel view synthesis, named depth image-based rendering (DIBR) [89]
applies the depth information to guide the warping of the texture map of one view to
another synthesized view. It is a popular technique for immersive telecommunication
or 3DTV and free-viewpoint TV. However, the performance is hampered by the quality of
the depth video. As presented in Figure 4.20, the novel view generated by the raw
depth frame and the registered RGB frame contains severe holes and cracks, as well
as structure distortion. The static structure is appropriate to fill the image holes,
but it may replace the structure of the foreground objects by mistake. The enhanced
depth frame by the proposed method can preserve the depth structures well so that
less structure distortion occurs in its synthesized view. Thus the synthesized view is
visually plausible without apparent artifacts.
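A minimal DIBR forward warp under the assumption of a purely horizontal baseline (disparity fB/Z), with z-buffering and explicit hole marking; this is an illustrative sketch, not the renderer of [89], and all names are our own.

```python
import numpy as np

def dibr_forward_warp(rgb, depth, f, baseline):
    """Forward-warp each pixel by the horizontal disparity f*B/Z into a
    virtual view; z-buffering keeps the nearest surface, and unfilled
    pixels remain marked as holes."""
    h, w = depth.shape
    out = np.zeros_like(rgb)
    zbuf = np.full((h, w), np.inf)
    hole = np.ones((h, w), dtype=bool)
    disp = np.where(depth > 0, f * baseline / np.maximum(depth, 1e-6), 0.0)
    for y in range(h):
        for x in range(w):
            if depth[y, x] <= 0:
                continue  # input hole: nothing to warp
            xn = int(round(x - disp[y, x]))
            if 0 <= xn < w and depth[y, x] < zbuf[y, xn]:
                zbuf[y, xn] = depth[y, x]
                out[y, xn] = rgb[y, x]
                hole[y, xn] = False
    return out, hole  # holes may then be filled from the static structure
```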
4.6 Summary
In this chapter, we present a novel method for robust temporally consistent depth enhancement that introduces the static structure of the captured scene, estimated online by a probabilistic generative mixture model with efficient parameter estimation, spatial enhancement and an update scheme. After segmenting the input frame with an efficient fully-connected CRF model, the dynamic region is enhanced spatially while the static region is substituted by the updated static structure, so as to favor long-range spatio-temporal enhancement. Quantitative evaluation shows the robustness of the parameter estimation for the static structure and a superior performance in comparison with various static scene estimation approaches. Qualitative evaluation demonstrates that our method operates well on various indoor scenes and two kinds of sources (Kinect and ToF camera), and that the proposed temporally consistent depth video enhancement performs satisfactorily in comparison with existing methods.
As future work, an extension to deal with moving cameras will be a meaningful topic of study. Furthermore, we will improve the algorithm to reduce the effect of wrong estimates and design an efficient reliability check to increase the accuracy of the estimated static structure. Last but not least, a more general probabilistic framework that handles multiple dynamic and static layers is worth exploring to further increase the performance of the proposed method.
Chapter 5
A Generative Model for Robust 3D Facial Pose
Tracking
5.1 Introduction
Our approach unifies 3D facial pose tracking and online identity adaptation based on a parameterized generative face model. This generative model is parameterized by a 3D multilinear tensor model [7; 90] integrating descriptions of shape, identity and expression, which not only effectively models the identity but also provides a statistical interpretation of the expression. Unlike discriminative methods, the generative model possesses the flexibility to generate and predict the distribution and uncertainty underlying the face model. By tracing the identity distribution during tracking in a generative way, the face model is gradually adapted to the captured user with sequentially input depth frames. Occlusion-aware pose estimation is achieved by minimizing an information-theoretic ray visibility score that regularizes the visibility of the face model in the current depth frame. It is based on the intuition that visible face model points must overlap with the input point cloud, while the remaining points must be occluded by it. This method needs no explicit correspondence detection, yet it accurately estimates the facial pose and handles occlusions well. In each frame, we progressively adapt the face model to the current user after the facial pose has been successfully estimated. In summary, we make the following contributions:
• A framework that unifies pose tracking and face model adaptation on-the-fly, offering highly accurate, occlusion-aware and uninterrupted real-time 3D facial pose tracking.
• A generative multilinear face model that models both the identity and the expression, facilitating on-the-fly face model personalization without interference from expression variations.
• A ray visibility score that enables the correspondence-free and occlusion-aware
facial pose tracking.
5.2 Related Work
Conventionally, facial pose tracking and model regression employ monocular RGB video sequences due to their availability. These approaches often track the dynamics of sparse 2D or 3D facial features, e.g., face landmarks or optical flow, that correspond to parametric 2D or 3D face models [37–40]. Accompanied by reliable feature detection methods, the facial pose can be tracked well under moderate occlusions and motion patterns. Active appearance models (AAM) [41] and constrained local models (CLM) [42] enable real-time sparse 2D facial feature tracking in a data-driven manner, but they may fail under complex motions or large facial deformations, even when a user-specific training phase is involved. Recent advances in discriminative real-time 2D tracking based on random forests [43], landmark prediction [44] and supervised descent methods [45] have shown promising results in comparison with previous methods. In addition, explicit modeling of occlusions has been taken into account [43].
With the popularity of consumer-level depth sensors, a variety of 3D facial pose tracking and model personalization frameworks have been proposed. One category of approaches achieves reliable tracking performance without introducing a 3D model or template. Some of these methods employ depth features, such as facial features defined by surface curvatures [46], a nose detector [47], or triangular surface patch descriptors [48]. However, these methods fail when the features cannot be detected, e.g., with highly noisy depth data, extreme poses or large occlusions. The remaining methods in this category are discriminative. For instance, Fanelli et al. used random classification and regression forests with depth image patches for face detection and pose estimation [49; 50]. Riegler et al. [51] trained a deep Hough network to simultaneously detect the face and estimate the facial pose. Another kind of discriminative variant does not explicitly estimate the facial pose but instead determines the dense
correspondence field between the input depth image and a pre-defined canonical face model. The facial pose is then estimated by regressing the face model to the input depth image under this correspondence field. Inspired by the pioneering works of [91; 92], the dense correspondence field can be generated by random classification and regression forests with simple depth features, as proposed by Kazemi et al. [52]. Apart from random forests, convolutional neural networks (CNNs) are also capable of dense correspondence field estimation, and have already proven successful in human pose estimation and body reconstruction [93]. Although these methods can provide sufficiently accurate results, they require extensive and sophisticated supervised training with large-scale datasets. Moreover, they generalize poorly to depth data captured by an unfamiliar depth sensor that is not represented in the training dataset.
Another category matches a 3D face model to the input depth images with rigid or non-rigid registration methods. For example, a common strategy is to fit a user-specific face model with 3D morphable models [8; 94–103] or brute-force per-vertex 3D face reconstruction [104–106]. Although helpful for accurate facial tracking systems, most of these require offline initialization or user calibration to generate the user-specific face model. In contrast, there are prior arts that gradually refine the 3D morphable model as more data is collected, in parallel with the facial pose tracking thread [8; 100–102]. The proposed method falls into this category, and the whole pipeline is re-interpreted in a generative way. The 3D morphable models can be roughly categorized into three classes: (1) the wireframe model (WFM) [97; 98]; (2) the Basel face model (BFM) [8; 40; 52; 100; 101]; and (3) the multilinear face model (MFM) [7; 90; 95; 99] that models both identity and expression. Unlike the wireframe model, which is too sparse to produce a detailed face model, and the Basel face model, which cannot eliminate biased reconstructions caused by expression variations, the multilinear face model describes both identity and expression [7; 90]. By treating the multilinear face model in a generative way, the uncertainty of the expression variations can be explicitly modeled, and the reconstructed face model is thus less vulnerable to these distortions.
The occlusion-aware registration problem is a long-standing issue for facial pose
Figure 5.1: Sample face meshes in the FaceWarehouse dataset. This dataset contains face meshes covering a comprehensive set of expressions and a variety of identities of different ages, genders and races.
tracking. Apart from discriminative approaches that label the occlusions through face segmentation [101; 107] or patch-based feature learning [49–52], rigid or non-rigid ICP-based face model registration suffers from correspondence ambiguities when the distance or normal-vector compatibility criterion [9; 101; 105; 106] is applied. Possible remedies apply global optimization, e.g., particle swarm optimization [108], with carefully designed objective functions [8]. Assuming multi-view visibility consistency among partial depth scans, occlusions and partial registration can also be handled well [109]. The proposed ray visibility score observes the visibility constraint between the face model and the input point cloud, which is similar to the multi-view visibility consistency stated by Wang et al. [109]. The proposed ray visibility score is formulated in an information-theoretic manner from a generative perspective, making it more robust to uncertainties in the 3D morphable face model and less vulnerable to the local minima that frequently occur in ICP-based methods.
5.3 Probabilistic 3D Face Parameterization
This section introduces the 3D face model with a probabilistic interpretation, which
acts as an effective prior for head pose estimation and face identity adaptation from a
streaming depth video.
5.3.1 Multilinear Face Model
We apply the multilinear model [7; 90] to parametrically generate arbitrary 3D faces that adapt to different identities and expressions. It is controlled by a three-dimensional tensor $\mathcal{C} \in \mathbb{R}^{3N_M \times N_{id} \times N_{exp}}$ whose dimensions correspond to shape, identity and expression, respectively. The multilinear model represents a 3D face $\mathbf{f} = (x_1, y_1, z_1, \ldots, x_{N_M}, y_{N_M}, z_{N_M})^\top$ consisting of $N_M$ vertices $(x_n, y_n, z_n)^\top$ as
$$\mathbf{f} = \bar{\mathbf{f}} + \mathcal{C} \times_2 \mathbf{w}_{id}^\top \times_3 \mathbf{w}_{exp}^\top, \tag{5.1}$$
where $\mathbf{w}_{id} \in \mathbb{R}^{N_{id}}$ and $\mathbf{w}_{exp} \in \mathbb{R}^{N_{exp}}$ are linear weights for identity and expression, respectively, $\times_i$ denotes the $i$-th mode product, and $\bar{\mathbf{f}}$ is the mean face of the training dataset. The tensor $\mathcal{C}$, called the core tensor, encodes the subspaces that span the shape variations of faces; it is calculated by applying high-order singular value decomposition (HOSVD) to the training dataset, i.e., $\mathcal{C} = \mathcal{T} \times_2 \mathbf{U}_{id} \times_3 \mathbf{U}_{exp}$, where $\mathbf{U}_{id}$ and $\mathbf{U}_{exp}$ are the unitary matrices from the mode-2 and mode-3 HOSVD of the data tensor $\mathcal{T} \in \mathbb{R}^{3N_M \times N_{id} \times N_{exp}}$, which collects the offsets against the mean face $\bar{\mathbf{f}}$ of face meshes with varying identities and expressions in the training dataset.
To produce compact and complete representations of arbitrary faces by Equation (5.1), we train the mean face $\bar{\mathbf{f}}$ and the core tensor $\mathcal{C}$ on the well-known FaceWarehouse dataset [7]. As visualized in Figure 5.1, this dataset contains face meshes of 150 identities performing 47 expressions, covering different ages, genders and races. Its diversity yields a subspace of face shape variations that covers most common identities and expressions.

To represent a face model compactly yet efficiently, the core tensor $\mathcal{C}$ can be safely truncated along the identity and expression dimensions. As the principal shape variations are stored in the top-left $3N_M \times N_{id} \times N_{exp}$ sub-tensor $\mathcal{C}_r$, the face model $\mathbf{f}$ can still be reconstructed from $\mathcal{C}_r$ without apparent distortion.
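Concretely, the synthesis of Equation (5.1) amounts to two tensor contractions of the core tensor with the identity and expression weights; a minimal NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def synthesize_face(f_bar, core, w_id, w_exp):
    """Evaluate f = f_bar + C x_2 w_id x_3 w_exp for a core tensor of
    shape (3*N_M, N_id, N_exp): each mode product contracts one axis."""
    g = np.tensordot(core, w_id, axes=([1], [0]))    # -> (3*N_M, N_exp)
    g = np.tensordot(g, w_exp, axes=([1], [0]))      # -> (3*N_M,)
    return f_bar + g
```

Truncating the identity and expression axes of `core` leaves the same two contractions, only with shorter weight vectors.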
5.3.2 A Statistical Prior
Figure 5.2: Illustration of the generic multilinear face model trained on the FaceWarehouse dataset [7]. (a) The mean face $\bar{\mathbf{f}}$. (b) Per-vertex shape variation caused jointly by $\mathbf{w}_{id}$ and $\mathbf{w}_{exp}$. (c)–(d) Per-vertex shape variation with respect to $\mathbf{w}_{id}$ and $\mathbf{w}_{exp}$, respectively. The shape variation is represented as the standard deviation of the marginalized per-vertex distribution; the variations in (b)–(d) are overlaid on the same neutral face model $\boldsymbol{\mu}_{\mathcal{M}}$. Best viewed in electronic version.

It is sufficient to treat the multilinear tensor model as a statistical prior. Unlike conventional methods, we do not discriminatively employ one exact face template (which may disagree with the current user's face or be incompatible with local variations caused by the user's expression) to fit the target face point cloud or track its motion
with a set of heuristic parameters. With the help of statistical modeling, the face shape and its distribution can be generatively synthesized, and the dynamics of the tracked face can be reliably predicted. In addition, with the introduction of the statistical prior, the proposed system requires fewer user-provided parameters than conventional discriminative methods.
Identity and Expression Priors
It is tractable to assume that the identity weight $\mathbf{w}_{id}$ and the expression weight $\mathbf{w}_{exp}$ follow two independent Gaussian distributions, $\mathbf{w}_{id} = \boldsymbol{\mu}_{id} + \boldsymbol{\varepsilon}_{id}$ with $\boldsymbol{\varepsilon}_{id} \sim \mathcal{N}(\boldsymbol{\varepsilon}_{id}|\mathbf{0}, \boldsymbol{\Sigma}_{id})$, and $\mathbf{w}_{exp} = \boldsymbol{\mu}_{exp} + \boldsymbol{\varepsilon}_{exp}$ with $\boldsymbol{\varepsilon}_{exp} \sim \mathcal{N}(\boldsymbol{\varepsilon}_{exp}|\mathbf{0}, \boldsymbol{\Sigma}_{exp})$. These prior distributions can be estimated from the training data. Indeed, we simply apply $\boldsymbol{\mu}_{id} = \mathbf{U}_{id}^\top\mathbf{1}$ and $\boldsymbol{\mu}_{exp} = \mathbf{U}_{exp}^\top\mathbf{1}$. The covariance matrices are scaled identity matrices, i.e., $\boldsymbol{\Sigma}_{id} = \sigma_{id}^2\mathbf{I}$ and $\boldsymbol{\Sigma}_{exp} = \sigma_{exp}^2\mathbf{I}$, where the parameters $\sigma_{id}^2 = \frac{1}{N_{id}}$ and $\sigma_{exp}^2 = \frac{1}{N_{exp}}$ are learned from the training set.
Multilinear Face Model Prior
The canonical face model $\mathcal{M}$ with respect to $\mathbf{w}_{id}$ and $\mathbf{w}_{exp}$ is analogously of the form
$$\mathbf{f} = \bar{\mathbf{f}} + \mathcal{C} \times_2 \boldsymbol{\mu}_{id} \times_3 \boldsymbol{\mu}_{exp} + \mathcal{C} \times_2 \boldsymbol{\varepsilon}_{id} \times_3 \boldsymbol{\mu}_{exp} + \mathcal{C} \times_2 \boldsymbol{\mu}_{id} \times_3 \boldsymbol{\varepsilon}_{exp} + \mathcal{C} \times_2 \boldsymbol{\varepsilon}_{id} \times_3 \boldsymbol{\varepsilon}_{exp}. \tag{5.2}$$
If $\boldsymbol{\varepsilon}_{id}$ and $\boldsymbol{\varepsilon}_{exp}$ have smaller magnitudes than $\boldsymbol{\mu}_{id}$ and $\boldsymbol{\mu}_{exp}$, the last term can be eliminated from (5.2), since it usually produces much smaller shape variations than those caused solely by identity or expression variations. Therefore, the face model $\mathcal{M}$ approximately follows a Gaussian distribution,
$$p_{\mathcal{M}}(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\boldsymbol{\mu}_{\mathcal{M}}, \boldsymbol{\Sigma}_{\mathcal{M}}), \tag{5.3}$$
whose neutral face is $\boldsymbol{\mu}_{\mathcal{M}} = \bar{\mathbf{f}} + \mathcal{C} \times_2 \boldsymbol{\mu}_{id} \times_3 \boldsymbol{\mu}_{exp}$ and whose covariance matrix is $\boldsymbol{\Sigma}_{\mathcal{M}} = \mathbf{P}_{id}\boldsymbol{\Sigma}_{id}\mathbf{P}_{id}^\top + \mathbf{P}_{exp}\boldsymbol{\Sigma}_{exp}\mathbf{P}_{exp}^\top$. The projection matrices $\mathbf{P}_{id}$ and $\mathbf{P}_{exp}$ for identity and expression are obtained by permuting (denoted by the operation $\Pi(\cdot)$) the tensor expressions into matrix form: $\mathbf{P}_{id} = \Pi(\mathcal{C} \times_3 \boldsymbol{\mu}_{exp}) \in \mathbb{R}^{3N_M \times N_{id}}$ and $\mathbf{P}_{exp} = \Pi(\mathcal{C} \times_2 \boldsymbol{\mu}_{id}) \in \mathbb{R}^{3N_M \times N_{exp}}$.
Since in this work we are interested in facial pose tracking and identity adaptation that are insensitive to expression variations, the joint distribution of the face model and the identity parameter is introduced as
$$p(\mathbf{f}, \mathbf{w}_{id}) = p_{\mathcal{M}}(\mathbf{f}|\mathbf{w}_{id})\,p(\mathbf{w}_{id}) = \mathcal{N}(\mathbf{f}\,|\,\bar{\mathbf{f}} + \mathbf{P}_{id}\mathbf{w}_{id},\ \boldsymbol{\Sigma}_E)\,\mathcal{N}(\mathbf{w}_{id}|\boldsymbol{\mu}_{id}, \boldsymbol{\Sigma}_{id}), \tag{5.4}$$
where the expression covariance $\boldsymbol{\Sigma}_E = \mathbf{P}_{exp}\boldsymbol{\Sigma}_{exp}\mathbf{P}_{exp}^\top$ is absorbed into the likelihood $p(\mathbf{f}|\mathbf{w}_{id})$. The likelihood is therefore robust to local shape variations led by expression, and the posterior of $\mathbf{w}_{id}$ is less affected by the user's expression in the current frame. Moreover, the expression covariance $\boldsymbol{\Sigma}_E$ is adjusted by the identity, which is adapted to the current user and increases the robustness of pose estimation.
As shown in Figure 5.2, the joint shape variation given by $\boldsymbol{\Sigma}_{\mathcal{M}}$ varies from vertex to vertex, with the facial region bearing larger shape distortions than the rest of the head. By decomposing $\boldsymbol{\Sigma}_{\mathcal{M}}$ into the shape covariance by identity, $\boldsymbol{\Sigma}_I = \mathbf{P}_{id}\boldsymbol{\Sigma}_{id}\mathbf{P}_{id}^\top$, and the shape covariance by expression, $\boldsymbol{\Sigma}_E$, i.e., $\boldsymbol{\Sigma}_{\mathcal{M}} = \boldsymbol{\Sigma}_I + \boldsymbol{\Sigma}_E$, we can observe that the
Figure 5.3: System overview. We propose a unified probabilistic framework for robust facial pose estimation and online identity adaptation. In both threads, the generative face model acts as the key intermediate and is updated immediately with the feedback of the identity adaptation. The input is the depth map, while the output is the rigid pose parameter $\boldsymbol{\theta}^{(t)}$ and the updated face identity parameters $\{\boldsymbol{\mu}_{id}^{(t)}, \boldsymbol{\Sigma}_{id}^{(t)}\}$ that encode the identity distribution $p^{(t)}(\mathbf{w}_{id})$.
majority of shape variations are caused by the identities rather than the expressions. Conversely, the shape uncertainties caused by the expressions are localized around the mouth and chin, as well as the regions around the cheeks and eyebrows. Meanwhile, the neutral face $\boldsymbol{\mu}_{\mathcal{M}}$ is nearly the same as the mean face $\bar{\mathbf{f}}$ of the training dataset, which reveals that the priors of $\mathbf{w}_{id}$ and $\mathbf{w}_{exp}$ do not bias the face model $\mathcal{M}$ as a representation of the training dataset.
In comparison with the Basel Face Model (BFM) [40], which parameterizes the face model by principal component analysis on 200 3D face meshes with neutral expression, and the Blendshapes [110], which encode face expressions as a linear combination of user-specific basic expression units from the facial action coding system (FACS) [111], the proposed multilinear model explicitly describes both identity and expression variations through a fully generative interpretation. It conveys more descriptive power for a general human face and is robust to local shape variations arising from expression, benefiting both the facial pose tracking and the online identity adaptation.
5.4 Probabilistic Facial Pose Tracking
In this section, the pipeline of the proposed probabilistic facial pose tracking is intro-
duced. Our architecture is shown in Figure 5.3. There are two main components in
our system: robust facial pose tracking and online identity adaptation. The identity adaptation branch runs concurrently with the facial pose tracking branch, and both branches operate within a probabilistic framework.
Robust Facial Pose Tracking. The facial pose tracking is achieved by fitting a 3D face model to every captured 3D point cloud. In this work, the face model is the probabilistic multilinear model $\mathcal{M}$ depicted in Equations (5.1) and (5.3), with the prior over the identity parameter being updated to match the current identity, while the prior over the expression parameter is kept fixed. The rigid motion is estimated between the input data and the synthesized face model updated in the previous frame. Outliers and occlusions are robustly eliminated according to a novel ray visibility constraint, while the pose parameter is obtained by minimizing a ray visibility score based on the Kullback-Leibler divergence [64] between the face model and the surface distribution. The pose parameters $\boldsymbol{\theta}$ include not only the rotation angles $\boldsymbol{\omega}$ and the translation vector $\mathbf{t}$ but also, for the first few frames, the scale $s$, since the face model may not match the input point cloud because of scale differences; $s$ is fixed once the identity has converged.
Online Identity Adaptation. The face model $\mathcal{M}$ is initialized with the generic multilinear model trained on the FaceWarehouse dataset and described by Equation (5.3). It is gradually adapted to the user's identity during tracking. Accounting for the entire history of identity observations, the posterior over the identity parameter $\mathbf{w}_{id}$ is recursively updated based on assumed-density filtering and a first-order Markov chain [64]. As the face model takes the local shape variation caused by expression into account, as discussed in Section 5.3.2, the identity adaptation automatically alleviates these distortions.
5.4.1 Robust Facial Pose Tracking
Prior to tracking, we need to detect the face position in the first frame, or whenever tracking has failed. A variety of methods are applicable in our test scenario. In this work, we employ a simple head detection method by Meyer et al. [8], then crop the input depth map to obtain a depth patch centered at the detected head center within a radius of r = 100 pixels. Let $\mathcal{P}$ denote the point cloud extracted from this depth patch, with $N_P = |\mathcal{P}|$ the number of points in $\mathcal{P}$.
Figure 5.4: Samples of occluded faces (self-occlusion; occlusion by hair; occlusion by accessories; occlusion by hand/gesture). The occlusions are caused by multiple factors: for instance, the face may be occluded by itself, or by other objects such as hair, accessories and hands.
The pose parameters are $\boldsymbol{\theta} = \{\boldsymbol{\omega}, \mathbf{t}, \alpha\}$, indicating the rotation angles, the translation vector, and the logarithm of the scale $s$, i.e., $s = e^{\alpha} > 0,\ \forall \alpha \in \mathbb{R}$. A canonical face model point $\mathbf{f}_n$ is rigidly warped into $\mathbf{q}_n$, $n \in \{1, \ldots, N_M\}$, with the encoded orientation, position and scale,
$$\mathbf{q}_n = T(\boldsymbol{\theta}) \circ \mathbf{f}_n = e^{\alpha}\mathbf{R}(\boldsymbol{\omega})\mathbf{f}_n + \mathbf{t}, \tag{5.5}$$
where $\mathbf{R}(\boldsymbol{\omega})$ is the rotation matrix derived from $\boldsymbol{\omega}$, and the transformation $T(\boldsymbol{\theta}) \circ \mathbf{f}_n$ describes this rigid warping. Therefore, the warped face model $\mathcal{Q}$ possesses a similar distribution for each $\mathbf{q}_n \in \mathcal{Q}$, given the same prior as Equation (5.3):
$$p_{\mathcal{Q}}(\mathbf{q}_n;\boldsymbol{\theta}) = \mathcal{N}(\mathbf{q}_n\,|\,T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]},\ e^{2\alpha}\boldsymbol{\Sigma}_{\mathcal{M},[n]}), \tag{5.6}$$
where $\boldsymbol{\mu}_{\mathcal{M},[n]} = \bar{\mathbf{f}}_n + (\mathcal{C} \times_2 \boldsymbol{\mu}_{id} \times_3 \boldsymbol{\mu}_{exp})_{[n]}$ is the $n$-th vertex of the face model, the covariance is adapted by the scale factor, and $\boldsymbol{\Sigma}_{\mathcal{M},[n]}$ is the submatrix of $\boldsymbol{\Sigma}_{\mathcal{M}}$ corresponding to point $\mathbf{f}_n$. To find an optimal pose parameter that matches the warped face model $\mathcal{Q}$ to the input point cloud $\mathcal{P}$, we require the surface distribution of $\mathcal{P}$ to lie within the range spanned by the distribution of the face model $\mathcal{Q}$.
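The rigid warp $T(\boldsymbol{\theta}) \circ \mathbf{f}_n$ of Equation (5.5) can be sketched with a Rodrigues rotation (an illustrative NumPy sketch; the dictionary layout of $\boldsymbol{\theta}$ and the function names are assumptions of this sketch):

```python
import numpy as np

def rodrigues(omega):
    """Rotation matrix from an axis-angle vector via Rodrigues' formula."""
    angle = np.linalg.norm(omega)
    if angle < 1e-12:
        return np.eye(3)
    k = omega / angle
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def warp_points(f, theta):
    """Apply q_n = exp(alpha) * R(omega) * f_n + t (Eq. 5.5) to points f of
    shape (N, 3); theta is a dict with keys 'omega', 't' and 'alpha'."""
    R = rodrigues(theta['omega'])
    return np.exp(theta['alpha']) * (f @ R.T) + theta['t']
```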
Ray Visibility Constraint

It is well known that occlusions are inevitable in uncontrolled scenarios: occluded human faces always lie behind the occluding objects, such as hair, fingers/hands, glasses or accessories, as shown in Figure 5.4. Suppose the face model $\mathcal{Q}$ and the input point cloud $\mathcal{P}$ are correctly aligned: $\mathcal{Q}$ may be partially fitted to a subset of points in $\mathcal{P}$, while its other points must be occluded by points in $\mathcal{P}$. In other words, the only parts of $\mathcal{Q}$ that should be visible from the camera view are those that overlap with $\mathcal{P}$. Therefore, instead of correspondence-based methods such as spatial distance thresholds and normal-vector compatibility checks [101] that are commonly applied in 3D registration, we estimate the pose of the face model by proposing a ray visibility constraint (RVC) to regularize the visibility of each face model point.
Formally, let us define the ray connecting a face model point $\mathbf{q}_n$ and the camera center as $\vec{v}(\mathbf{q}_n,\mathbf{p}_n)$, where $\mathbf{p}_n$ is the point in $\mathcal{P}$ nearest to this ray. In this case, $\mathbf{p}_n$ can be found by matching pixel locations with $\mathbf{q}_n$, which is a lookup-table search in the depth map [101; 106]. If $\mathbf{q}_n$ is visible, it should lie near the surface generated from $\mathcal{P}$; otherwise $\mathbf{q}_n$ should be behind the surface and occluded. However, if $\mathbf{q}_n$ is in front of the surface point along the ray, it should incur a penalty that pushes the face model $\mathcal{Q}$ farther away so that $\mathbf{q}_n$ falls near the surface of $\mathcal{P}$. Eventually, the face model will be tightly and/or partially fitted to a subset of points in $\mathcal{P}$ while the rest of its points are left as occlusions.
One simple way to describe the surface of a point cloud is local linear regression, which is equivalent to fitting the points in a local neighborhood with a 3D plane. Thus, if a model point $\mathbf{q}_n$ is linked to an input point $\mathbf{p}_n$ through the ray $\vec{v}(\mathbf{q}_n,\mathbf{p}_n)$, the signed distance of $\mathbf{q}_n$ to the surface is
$$\Delta(\mathbf{q}_n;\mathbf{p}_n) = \mathbf{n}_n^\top\mathbf{q}_n + b_n, \tag{5.7}$$
where $\mathbf{n}_n$ and $b_n$ are the normal vector and the offset of the plane centered at point $\mathbf{p}_n$. Therefore, the signed distance $y_n$ of the face model point $\mathbf{q}_n = T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]}$ to the surface of $\mathcal{P}$ follows
$$p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_n;\boldsymbol{\theta}) = \mathcal{N}\big(y_n\,\big|\,\Delta(T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_n),\ \sigma_o^2 + e^{2\alpha}\,\mathbf{n}_n^\top\boldsymbol{\Sigma}_{\mathcal{M},[n]}\mathbf{n}_n\big), \tag{5.8}$$
Figure 5.5: Illustration of the ray visibility constraint ($\gamma_n = 1$: face point visible; $\gamma_n = 0$: face point occluded). A profiled face model and a curve on the surface of the input point cloud are presented in front of a depth camera. Three cases are shown. (a) Case-I: a partial face region is fitted to the input point cloud, while the remaining facial regions are occluded. (b) Case-II: the face model is completely occluded. (c) Case-III: part of the face region is visible and in front of the point cloud, and the remaining face regions are occluded. Best viewed in electronic version.
where $\sigma_o^2$ is the data noise variance from the surface modeling and the sensor's systematic error. This distribution is derived by marginalizing the face model distribution:
$$p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_n;\boldsymbol{\theta}) = \int \mathcal{N}\big(y_n\,|\,\Delta(\mathbf{q}_n;\mathbf{p}_n), \sigma_o^2\big)\, p_{\mathcal{Q}}(\mathbf{q}_n;\boldsymbol{\theta})\, d\mathbf{q}_n.$$
Therefore, we can classify each point $\mathbf{q}_n$ by its visibility according to the ray visibility constraint and assign labels $\boldsymbol{\gamma} = \{\gamma_n\}_{n=1}^{N_M}$, where $\gamma_n \in \{0, 1\}$:

i) The face model point is visible ($\gamma_n = 1$). If the point $\mathbf{q}_n$ is visible along the ray $\vec{v}(\mathbf{q}_n;\mathbf{p}_n)$, the majority of the possible signed distances $y_n$ should lie around or in front of the surface centered at $\mathbf{p}_n$. We can intuitively require that $\Delta(T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_n)$ is within the bandwidth of the distribution $p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_n)$ or is negative¹:
$$\Delta(T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_n) \le \sqrt{\sigma_o^2 + e^{2\alpha}\,\mathbf{n}_n^\top\boldsymbol{\Sigma}_{\mathcal{M},[n]}\mathbf{n}_n}.$$

ii) The face model point is occluded ($\gamma_n = 0$). Similarly, the point $\mathbf{q}_n$ is assumed to be occluded when its signed distance $y_n$ is positive and beyond the effective

¹We keep $\mathbf{n}_n$ pointing toward the captured scene; thus a negative signed distance $y_n$ means $\mathbf{q}_n$ is in front of the surface.
region of $p_{\mathcal{Q}\rightarrow\mathcal{P}}(y_n;\boldsymbol{\theta})$:
$$\Delta(T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_n) > \sqrt{\sigma_o^2 + e^{2\alpha}\,\mathbf{n}_n^\top\boldsymbol{\Sigma}_{\mathcal{M},[n]}\mathbf{n}_n}.$$
The ray visibility constraint is associated with a ray visibility score that measures the compatibility between the visible face model points and the input point cloud, as well as the degree to which the face model is occluded.
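Both labelling rules reduce to one comparison of the signed distance against one standard deviation of the projected model distribution; a minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def classify_visibility(signed_dist, ray_var):
    """Ray visibility constraint: gamma_n = 1 (visible) when the signed
    distance is at most one standard deviation of the projected model
    distribution, gamma_n = 0 (occluded) otherwise.

    signed_dist: per-ray Delta(T(theta) o mu_M[n]; p_n)
    ray_var:     per-ray sigma_o^2 + e^{2 alpha} n^T Sigma_M[n] n
    """
    return (signed_dist <= np.sqrt(ray_var)).astype(int)
```

Negative distances (the point lies in front of the surface) always classify as visible, matching case i) above.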
Ray Visibility Score

By applying the ray visibility constraint to the current face model $\mathcal{Q}$ with pose parameter $\boldsymbol{\theta}$, we form a visibility label set $\boldsymbol{\gamma} = \{\gamma_n\}_{n=1}^{N_M}$. The ray visibility score (RVS) measures the compatibility between the distributions of the face model and the input point cloud.

Consider a ray $\vec{v}(\mathbf{q}_n,\mathbf{p}_n)$ connecting a face model point $\mathbf{q}_n$ and an input point $\mathbf{p}_n$. The distribution of $\mathbf{p}_n$ is simply
$$p_{\mathcal{P}}(y_n) = \mathcal{N}(y_n|0, \sigma_o^2)^{\gamma_n}\,U_O(y_n)^{1-\gamma_n}, \tag{5.9}$$
where $U_O(y_n) = U_O$ is a pseudo-uniform distribution. $p_{\mathcal{P}}(y_n)$ is controlled by $\gamma_n$: if $\mathbf{q}_n$ is visible, it should be near the surface centered at $\mathbf{p}_n$, i.e., compatible with the surface distribution $\mathcal{N}(y_n|0, \sigma_o^2)$; if $\mathbf{q}_n$ is occluded, the position of $\mathbf{p}_n$ can be arbitrary as long as it is in front of $\mathbf{q}_n$, so a uniform distribution $U_O(y_n)$ is suitable. Moreover, the projected face model distribution of $\mathbf{q}_n$ onto $\mathbf{p}_n$ is
$$p_{\mathcal{Q}}(y_n;\boldsymbol{\theta}) = \mathcal{N}\big(y_n\,\big|\,\Delta(T(\boldsymbol{\theta}) \circ \boldsymbol{\mu}_{\mathcal{M},[n]};\mathbf{p}_n),\ e^{2\alpha}\,\mathbf{n}_n^\top\boldsymbol{\Sigma}_{\mathcal{M},[n]}\mathbf{n}_n\big). \tag{5.10}$$

Therefore, for all rays $\{\vec{v}(\mathbf{q}_n,\mathbf{p}_n)\}_{n=1}^{N_M}$ intersecting the face model, the ray visibility score $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta})$ is defined to measure the similarity between $p_{\mathcal{P}}(\mathbf{y}) = \prod_{n=1}^{N_M} p_{\mathcal{P}}(y_n)$ and $p_{\mathcal{Q}}(\mathbf{y};\boldsymbol{\theta}) = \prod_{n=1}^{N_M} p_{\mathcal{Q}}(y_n;\boldsymbol{\theta})$. A convenient choice is the Kullback-Leibler divergence,
$$S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta}) = D_{KL}\left[p_{\mathcal{Q}}(\mathbf{y};\boldsymbol{\theta})\,\|\,p_{\mathcal{P}}(\mathbf{y})\right], \tag{5.11}$$
so that the more similar $p_{\mathcal{P}}(\mathbf{y})$ and $p_{\mathcal{Q}}(\mathbf{y};\boldsymbol{\theta})$ are, the smaller $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta})$ is. An optimal solution for $\boldsymbol{\theta}$ thus minimizes the ray visibility score: the distributions of the visible face model points $p_{\mathcal{Q}}(y_n;\boldsymbol{\theta})$ are
Figure 5.6: Examples of the proposed rigid pose estimation. (a) and (b) are the color images and the corresponding point clouds. (c) shows the initial alignment provided by the head detection method [8], and (d) visualizes the results of the proposed rigid pose estimation. Notice that only the generic face model is applied. It robustly estimates difficult face poses from partial scans with heavy occlusions by hands and hair, as well as profiled faces with strong self-occlusions. Best viewed in electronic version.
optimally matched to the surface distributions $p_{\mathcal{P}}(y_n)$, while each of the remaining points suffers a constant penalty introduced by the occlusion distribution $U_O(y_n)$. Despite the occlusion description in the ray visibility score, the occlusion distribution $U_O(y)$ still encourages a larger number of visible face model points than occluded ones; this is guaranteed as long as $D_{KL}[p_{\mathcal{Q}}(y_n;\boldsymbol{\theta})\,\|\,U_O(y_n)]$ is usually larger than the incompatibility between $p_{\mathcal{Q}}(y_n;\boldsymbol{\theta})$ and $\mathcal{N}(y_n|0, \sigma_o^2)$ caused by the local shape variation of the face model.
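Since the per-ray densities are univariate Gaussians, with a pseudo-uniform density on occluded rays, the score of Equation (5.11) decomposes into closed-form per-ray KL terms; a minimal NumPy sketch (the default value of $U_O$ and all names are assumptions of this sketch):

```python
import numpy as np

def ray_visibility_score(mu_q, var_q, gamma, sigma_o2, u_o=1e-3):
    """Sum of per-ray KL divergences (Eq. 5.11): closed form between two
    univariate Gaussians on visible rays, and against the pseudo-uniform
    occlusion density U_O on occluded rays.

    mu_q, var_q: mean / variance of the projected model density p_Q(y_n)
    gamma:       visibility labels (1 visible, 0 occluded)
    """
    # KL( N(mu_q, var_q) || N(0, sigma_o2) ) for visible rays.
    kl_vis = 0.5 * (np.log(sigma_o2 / var_q) + (var_q + mu_q**2) / sigma_o2 - 1.0)
    # KL( N(mu_q, var_q) || U_O ) = -entropy(N) - log U_O for occluded rays.
    kl_occ = -0.5 * np.log(2 * np.pi * np.e * var_q) - np.log(u_o)
    return np.sum(np.where(gamma == 1, kl_vis, kl_occ))
```

A visible ray whose projected distribution matches the surface distribution exactly contributes zero, while every occluded ray contributes the constant penalty mentioned above.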
Rigid Pose Estimation
The rigid pose of the current face model is estimated by minimizing the ray visibility score,
$$\boldsymbol{\theta}^{\star} = \arg\min_{\boldsymbol{\theta}}\ S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta}). \tag{5.12}$$
However, $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta})$ is highly nonlinear, so in principle there is no off-the-shelf closed-form solution.
In practice, we apply a recursive estimation method: in each loop we alternately solve two subproblems to estimate the intermediate parameters $\boldsymbol{\theta}^{(t)}$ and $\boldsymbol{\gamma}^{(t)}$. In the first subproblem, we apply a quasi-Newton update $\boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} + \Delta\boldsymbol{\theta}$ using a trust-region approach on the ray visibility score $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta}^{(t-1)})$ under the previous visibility label set $\boldsymbol{\gamma}^{(t-1)}$. The second subproblem updates the visibility label set $\boldsymbol{\gamma}^{(t)} = \{\gamma_n^{(t)}\}_{n=1}^{N_M}$ from the current pose parameters $\boldsymbol{\theta}^{(t)}$. This iterative process terminates upon convergence or after a pre-defined number of iterations.
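The alternation can be sketched as follows (an illustrative sketch in which a finite-difference gradient step stands in for the quasi-Newton trust-region update; `score_fn` and `label_fn` are hypothetical callbacks):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def estimate_pose(theta0, score_fn, label_fn, max_iters=50, step=0.1, tol=1e-8):
    """Alternate between (i) descending the ray visibility score under
    fixed visibility labels and (ii) re-labelling visibility from the pose.

    score_fn(theta, gamma) -> scalar ray visibility score
    label_fn(theta)        -> visibility labels gamma
    """
    theta = np.asarray(theta0, float)
    gamma = label_fn(theta)
    prev = np.inf
    for _ in range(max_iters):
        f = lambda th: score_fn(th, gamma)
        for _ in range(100):                 # inner descent, labels fixed
            theta = theta - step * numeric_grad(f, theta)
        gamma = label_fn(theta)              # re-label from the new pose
        cur = score_fn(theta, gamma)
        if abs(prev - cur) < tol:
            break
        prev = cur
    return theta, gamma
```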
To further improve the proposed rigid pose estimation, a random consensus method, particle swarm optimization (PSO) [8; 108], is brought into the system. In detail, among the randomly sampled initial particles around the initial pose parameters, a small set of seed particles, those with the lowest ray visibility scores that are mutually divergent, are updated using the recursive estimation described above. The remaining particles are clustered into several subsets according to their nearness to the seed particles and are updated by the standard PSO procedure. This augmentation effectively eliminates misalignment caused by poor initialization and rectifies wrong estimates that get stuck in poor local minima of the ray visibility score.
In comparison with common techniques such as iterative closest points (ICP) [9], the proposed rigid pose estimation only needs to find the set of rays $\mathcal{V} = \{\vec{v}(\mathbf{q}_n,\mathbf{p}_n)\}_{n=1}^{N_M}$ and does not require explicit correspondences. In addition, ICP fails to handle occlusions when a poor initial pose is chosen, as shown in Figure 5.7(d). Moreover, the ray visibility score is less vulnerable to bad local minima: it is analogous to approximating the point cloud distribution $p_{\mathcal{P}}(\mathbf{y})$ with the face model distribution $p_{\mathcal{Q}}(\mathbf{y};\boldsymbol{\theta})$ rather than with a
Figure 5.7: Comparison of rigid pose estimation methods. (a) and (b) show the color image and its corresponding point cloud. (c) depicts two views of the initial alignment between the generic face model and the point cloud. (d) visualizes the result of ICP [9], and (e) the result of maximizing the likelihood modeled by the ray visibility constraint (RVC). (f) is the proposed recursive minimization of the ray visibility score (RVS), and (g) is the RVS method augmented by particle swarm optimization (RVS+PSO). See the text for details and notice that only the generic face model is applied. Best viewed in electronic version.
point estimate like maximum likelihood (ML) or maximum a posteriori (MAP). For example, maximizing the likelihood $p_{\mathcal{Q}\to\mathcal{P}}(\mathbf{y};\boldsymbol{\theta}) = \prod_{n=1}^{N_M} p_{\mathcal{Q}\to\mathcal{P}}(\mathbf{y}_n;\boldsymbol{\theta})^{\gamma_n}\, U_O(\mathbf{y}_n)^{1-\gamma_n}$ may seek a local mode that does not represent the majority of the likelihood, as shown in Figure 5.7(e). On the contrary, the Kullback-Leibler divergence employed in the ray visibility score ensures that the optimal face model distribution under the estimated $\boldsymbol{\theta}$ covers the majority of the information conveyed in $p_{\mathcal{P}}(\mathbf{y})$, while the modified particle swarm optimization further refines the facial pose. Figures 5.6 and 5.7 illustrate the superiority of the proposed RVS and RVS+PSO methods in handling unconstrained facial poses with large rotations and heavy occlusions.
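A 1-D analogue (ours, not the thesis model) makes the contrast concrete: for a bimodal target density, the coverage-style Gaussian summary matches the moments of the target and spans both modes, while a point estimate commits to a single local mode.

```python
import math

# 1-D analogue (ours, not the thesis model) of the argument above: a bimodal
# "point cloud" density p(y) is summarized by a single Gaussian. The
# moment-matched Gaussian covers both modes, while a mode-seeking point
# estimate commits to one local mode.

def normal_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def p(y):  # two modes at 0 and 10 with weights 0.6 / 0.4
    return 0.6 * normal_pdf(y, 0.0, 1.0) + 0.4 * normal_pdf(y, 10.0, 1.0)

dy = 0.01
grid = [i * dy for i in range(-500, 1501)]
mean = sum(y * p(y) for y in grid) * dy              # coverage-style estimate
var = sum((y - mean) ** 2 * p(y) for y in grid) * dy
mode = max(grid, key=p)                              # point estimate: one mode

print(mean, var, mode)  # mean ~ 4, var ~ 25, mode ~ 0
```

The moment-matched Gaussian sits between the modes with a wide variance, whereas the point estimate ignores 40% of the probability mass entirely.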
5.4.2 Online Identity Adaptation
Together with the rigid facial pose tracking, the face model is progressively updated
to adapt to the user’s identity. Because the identity is not known in advance when
96 CHAP. 5. A GENERATIVE MODEL FOR ROBUST 3D FACIAL POSE TRACKING
Figure 5.8: Examples of face model adaptation. The proposed method can successfully personalize the face model to identities of different genders and races.
a new user is being captured, we begin with a generic face model M with the initial identity and expression priors. The identity is then gradually personalized. In this work, local shape variations caused by expressions are effectively removed from the face model generation, so the estimated identity is robust to the local distortions caused by expressions.
Variational Approximation
As depicted in Section 5.3.2, the face model for one particular user is identified by a unique identity distribution $p^{\star}(\mathbf{w}_{id}) = \mathcal{N}(\mathbf{w}_{id}|\boldsymbol{\mu}^{\star}_{id}, \boldsymbol{\Sigma}^{\star}_{id})$, from which the other parameters in the face model can be derived. However, the exact identity distribution $p^{\star}(\mathbf{w}_{id})$ is not known until adequate depth samples are available, so the face identity adaptation is performed through a sequential update algorithm, namely assumed-density filtering (ADF) [64]. It approximates the Gaussian distribution $p^{(t)}(\mathbf{w}_{id})$ from the posterior induced by the current likelihood $p_L(\mathbf{y}^{(t)}|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)})$ and the previous best estimate $p^{(t-1)}(\mathbf{w}_{id})$. Provided with sufficiently many depth frames $T$, we have $p^{\star}(\mathbf{w}_{id}) \simeq p^{(T)}(\mathbf{w}_{id})$.
We need a well-defined likelihood $p_L(\mathbf{y}^{(t)}|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)})$ that both models the distances from the face model points to the surface of $\mathcal{P}$ if the points are visible, and handles the occlusions if the points are occluded,
$$p_L(\mathbf{y}^{(t)}|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)}) = \sum_{\boldsymbol{\gamma}} \prod_{n=1}^{N_M} p_{\mathcal{Q}\to\mathcal{P}}(\mathbf{y}^{(t)}_n|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)})^{\gamma_n}\, U_O(\mathbf{y}^{(t)}_n)^{1-\gamma_n}\, p(\boldsymbol{\gamma}), \tag{5.13}$$
where $p^{(t)}(\boldsymbol{\gamma}) = \prod_{n=1}^{N_M} (\pi^{(t)}_n)^{\gamma_n} (1-\pi^{(t)}_n)^{1-\gamma_n}$ is a product of Bernoulli distributions, one for each model point. In contrast to the rigid pose estimation, the labels are not given deterministically but generated from a prior distribution, enabling a soft assignment of whether a face model point is occluded. The projection distribution $p_{\mathcal{Q}\to\mathcal{P}}(\mathbf{y}^{(t)}_n|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)})$ takes a form similar to $p_{\mathcal{Q}\to\mathcal{P}}(\mathbf{y}^{(t)}_n;\boldsymbol{\theta}^{(t)})$, but with the mean value
$$m_n = \Delta\big(\mathbf{T}(\boldsymbol{\theta}^{(t)}) \circ (\mathbf{f}_n + \mathbf{P}_{id}\mathbf{w}_{id});\, \mathbf{p}_n\big) \tag{5.14}$$
and covariance $\xi^2 = \sigma^2_o + e^{2\alpha^{(t)}_n}\,\mathbf{n}^{(t)\top}_n \boldsymbol{\Sigma}^{(t-1)}_{E,[n]} \mathbf{n}^{(t)}_n$. To suppress the quantization errors in the input depth image, we introduce a robust modification of the projection distance, $\tilde{\Delta}(\mathbf{q}_n;\mathbf{p}_n) = \operatorname{sign}(\Delta(\mathbf{q}_n;\mathbf{p}_n)) \max\{|\Delta(\mathbf{q}_n;\mathbf{p}_n)| - \epsilon, 0\}$.

The identity distribution $p^{(t)}(\mathbf{w}_{id}) = \mathcal{N}(\mathbf{w}_{id}|\boldsymbol{\mu}^{(t)}_{id},\boldsymbol{\Sigma}^{(t)}_{id})$ is estimated by minimizing the Kullback-Leibler divergence $D_{KL}[p^{(t)}(\mathbf{w}_{id})\,\|\,p(\mathbf{w}_{id}|\mathbf{y}^{(t)})]$ [64]; in other words, we expect the true posterior
$$p(\mathbf{w}_{id}|\mathbf{y}^{(t)}) = \frac{p_L(\mathbf{y}^{(t)}|\mathbf{w}_{id};\boldsymbol{\theta}^{(t)})\, p^{(t-1)}(\mathbf{w}_{id})}{p(\mathbf{y}^{(t)})} \simeq p^{(t)}(\mathbf{w}_{id}). \tag{5.15}$$
The parameters of $p^{(t)}(\mathbf{w}_{id})$ are estimated through the variational Bayes framework [64]. We empirically find that this process converges within 3 to 5 iterations.
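The assumed-density-filtering step just described can be illustrated in a scalar setting. The sketch below (ours, not the thesis implementation) multiplies a Gaussian estimate by an occlusion-aware likelihood with a visible component and a uniform outlier component, then projects the exact posterior back to a Gaussian by matching its first two moments; the function name and all numeric values are illustrative assumptions.

```python
import math

# One assumed-density-filtering step in a scalar setting (our illustrative
# analogue, not the thesis implementation): a Gaussian estimate of w is
# multiplied by an occlusion-aware likelihood
#   l(d|w) = pi_vis * N(d|w, xi^2) + (1 - pi_vis) * U_out,
# and the posterior is projected back to a Gaussian by moment matching.

def npdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def adf_step(mu, var, d, xi2, pi_vis, u_out):
    """Return the moment-matched Gaussian (mu_new, var_new)."""
    ev = pi_vis * npdf(d, mu, xi2 + var) + (1 - pi_vis) * u_out  # evidence
    r = pi_vis * npdf(d, mu, xi2 + var) / ev   # responsibility of "visible"
    m1 = (xi2 * mu + var * d) / (xi2 + var)    # Gaussian-product mean
    v1 = xi2 * var / (xi2 + var)               # Gaussian-product variance
    mu_new = r * m1 + (1 - r) * mu             # outlier branch keeps the prior
    e2 = r * (m1 ** 2 + v1) + (1 - r) * (mu ** 2 + var)
    return mu_new, e2 - mu_new ** 2

mu_new, var_new = adf_step(mu=0.0, var=4.0, d=3.0, xi2=1.0,
                           pi_vis=0.8, u_out=1e-3)
print(mu_new, var_new)  # the mean moves toward d and the variance shrinks
```

The outlier component damps the update: when the observation is implausible under the visible component, the responsibility `r` drops and the previous estimate is largely retained.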
To quickly capture the identity of a new user after the face model has been personalized, we add a relaxation to the covariance matrix of $p^{(t)}(\mathbf{w}_{id})$ as $\boldsymbol{\Sigma}^{(t)}_{id} \leftarrow (\lambda+1)\boldsymbol{\Sigma}^{(t)}_{id}$ immediately after the identity adaptation. This is analogous to adding more variance around $\boldsymbol{\mu}^{(t)}_{id}$ along the identity space described by $\boldsymbol{\Sigma}^{(t)}_{id}$, so it neither loses the ability to describe a new face that differs from the current face model, nor fails to preserve the structure of the estimated identity space. The hyperparameter $\lambda$ is empirically set to 0.25.
Online Adaptation
The identity of the face model is adapted online through a two-step procedure for each
frame.
Figure 5.9: We continuously adapt the identity of the face model to different users. (a)-(c) show examples in which the face model is gradually personalized as facial depth data from different poses are captured during the tracking process. The face model is initialized with the generic face model shown in Figure 5.2.
At first, given the previous identity distribution $p^{(t-1)}(\mathbf{w}_{id})$, we generate the distribution $p^{(t-1)}_{\mathcal{M}}(\mathbf{f})$ of the face model $\mathcal{M}^{(t-1)}$ via Equation (5.3). From the ray visibility constraint based on the previous face model $p^{(t-1)}_{\mathcal{M}}(\mathbf{f})$ and the current surface model of $\mathcal{P}^{(t)}$, we obtain the ray visibility score $S(\mathcal{Q}^{(t-1)},\mathcal{P}^{(t)};\boldsymbol{\theta})$. After iterative optimization of this score, the current rigid facial pose $\boldsymbol{\theta}^{(t)}$ is obtained.
Secondly, the face model is updated through the variational approximation given the optimal rigid pose $\boldsymbol{\theta}^{(t)}$. In particular, $\boldsymbol{\pi}^{(t)}$ encourages a soft assignment between the face model points and the input point cloud, and the robust projection function reduces the quantization errors. In the end, the identity of the face model is updated to $p^{(t)}(\mathbf{w}_{id}) = \mathcal{N}(\mathbf{w}_{id}|\boldsymbol{\mu}^{(t)}_{id},\boldsymbol{\Sigma}^{(t)}_{id})$. We can further estimate the remaining parameters and gather them together with the identity parameters as the face model parameter set $\boldsymbol{\theta}^{(t)}_F = \{\boldsymbol{\mu}^{(t)}_{\mathcal{M}}, \boldsymbol{\Sigma}^{(t)}_{\mathcal{M}}, \boldsymbol{\mu}^{(t)}_{id}, \boldsymbol{\Sigma}^{(t)}_{id}, \boldsymbol{\Sigma}^{(t)}_E, \boldsymbol{\Sigma}^{(t)}_I\}$. These parameters generate the updated face model distribution $p^{(t)}_{\mathcal{M}}(\mathbf{f})$ and facilitate the rigid facial pose estimation and identity adaptation in the next frame.
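The per-frame two-step procedure can be sketched as the following skeleton. The function names and the trivial stub bodies are ours for illustration only; in the real system, step 1 minimizes the ray visibility score (with PSO augmentation) and step 2 runs the variational identity update.

```python
# Skeleton of the per-frame procedure above; `estimate_rigid_pose` and
# `update_identity` are hypothetical stubs standing in for the RVS(+PSO)
# optimization and the ADF identity update.

def estimate_rigid_pose(face_model, cloud, theta_prev):
    # Step 1 stub: pretend the ray visibility score optimization converged
    # to the cloud's "pose" field.
    return cloud["pose"]

def update_identity(identity, cloud, theta):
    # Step 2 stub: blend the identity toward the cloud's "identity" field,
    # mimicking a sequential (ADF-like) update.
    blend = 0.5
    return [(1 - blend) * a + blend * b
            for a, b in zip(identity, cloud["identity"])]

def track(frames, identity0, theta0):
    identity, theta = list(identity0), theta0
    for cloud in frames:                                      # one depth frame
        theta = estimate_rigid_pose(identity, cloud, theta)   # pose first
        identity = update_identity(identity, cloud, theta)    # then identity
    return identity, theta

frames = [{"pose": p, "identity": [1.0, 2.0]} for p in (0.1, 0.2, 0.3)]
identity, theta = track(frames, identity0=[0.0, 0.0], theta0=0.0)
print(identity, theta)  # identity drifts toward [1, 2]; theta follows last frame
```

The ordering matters: the identity update is conditioned on the pose found in the same frame, which is why the pose estimation runs first.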
5.5 Experiments and Discussions
In this section, we present the experiments on public depth-based facial pose datasets
and real scenarios to demonstrate the effectiveness of our robust 3D facial pose tracking
algorithm with the generative face model.
Section 5.5.1 introduces the datasets employed for evaluation and comparison, and the system setup of the proposed method. We then quantitatively evaluate the proposed method against state-of-the-art algorithms on these public datasets, and qualitatively visualize the performance of facial pose tracking and identity adaptation in Section 5.5.2. At the end of this section, we discuss some limitations in Section 5.5.3.
5.5.1 Datasets And System Setup
Datasets
We evaluate the performance of the proposed method and compare it with state-of-the-art algorithms on two public datasets, i.e., the Biwi Kinect head pose dataset [49] and the ICT 3D head pose (ICT-3DHP) dataset [94]. The dataset information is summarized in Table 5.1.

Biwi Dataset: The Biwi dataset contains over 15K RGB-D images of 20 subjects (different genders and races) in 24 sequences, with large ranges of rotations and translations. The recorded faces suffer occlusions from hair and face shape variations
Dataset         N_seq   N_frm   N_subj   Occlusions          ω_max
Biwi [49]       24      ~15K    20       accessories, hair   ±75° yaw, ±60° pitch
ICT-3DHP [94]   10      ~14K    10       accessories, hair   ±75° yaw, ±45° pitch

Table 5.1: Summary of the facial pose datasets.
Figure 5.10: Tracking results on the Biwi dataset with the personalized face models. Our system is robust to profiled faces due to large rotations, and to occlusions from hair and accessories. The 1st and 2nd rows show the corresponding color and depth image pairs. The third row visualizes the extracted point clouds of the head regions and the overlaid personalized face models. Best viewed in the electronic version.
from expressions. The Biwi dataset provides ground-truth head pose parameters for each frame, obtained with the off-the-shelf software Faceshift², as well as pixel-wise binary masks for the detected face regions.
ICT-3DHP Dataset: The ICT-3DHP dataset provides 10 Kinect RGB-D sequences covering 6 males and 4 females. The data contain occlusions and distortions similar to the Biwi dataset, and each subject also shows arbitrary expression variations. The ground-truth rotation parameters were measured externally by a Polhemus Fastrack flock-of-birds tracker [94] attached to a cap on each subject, but the translation parameters are not reliable.
² http://www.faceshift.com/
Figure 5.11: Tracking results on the ICT-3DHP dataset. The proposed system is also robust to expression variations. Best viewed in the electronic version.
System Setup
We implemented the proposed 3D facial pose tracking algorithm on a MATLAB platform
equipped with the parallel computing toolbox. The results were measured on a 3.4
GHz Intel Core i7 processor with 16GB RAM. No GPU acceleration was applied.
Here we define the hyperparameters utilized in the proposed system. The dimensions of the face model are $N_M = 11510$, $N_{id} = 150$ and $N_{exp} = 47$, while the truncated face model has smaller identity and expression dimensions $N_{id} = 28$ and $N_{exp} = 7$. The generic face model owns the identity and expression priors $\mathcal{N}(\mathbf{w}_{id}|\mathbf{U}^{\star}_{id}\mathbf{1}, \tfrac{1}{150}\mathbf{I})$ and $\mathcal{N}(\mathbf{w}_{exp}|\mathbf{U}^{\star}_{exp}\mathbf{1}, \tfrac{1}{47}\mathbf{I})$. The noise variance along the surface of the input point cloud is $\sigma^2_o = 25$, while the outlier distribution is characterized by $U_O(\mathbf{y}) = U_O = \tfrac{1}{2500}$. Note that the measurement unit used in this work is millimeters (mm).
The proposed algorithm adapts the identity online over a period of frames, stopping either when a pre-defined number of frames is reached (50 in this work) or when the evolution of the adapted face model has converged. The online face adaptation is performed every 10 frames, which not only captures different facial parts but also reduces the redundancy caused by the subtle differences in visible face coverage between adjacent frames.
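The adaptation schedule just described can be written as a small predicate (our sketch). The interval of 10 frames and the 50-frame budget are the values quoted above; the convergence flag is an assumption supplied by the caller.

```python
# Adaptation schedule (our sketch): adapt every `interval`-th frame until the
# frame budget is spent or the face model has converged. The interval of 10
# and the 50-frame budget are the values quoted in the text.

def should_adapt(frame_idx, budget=50, interval=10, converged=False):
    if converged or frame_idx > budget:
        return False
    return frame_idx % interval == 0

adapt_frames = [t for t in range(1, 101) if should_adapt(t)]
print(adapt_frames)  # [10, 20, 30, 40, 50]
```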
5.5.2 Quantitative and Qualitative Evaluations
Table 5.2 shows the average absolute errors of the rotation angles and the average Euclidean errors of the translation for the proposed method and the reference methods on the Biwi dataset. The rotational errors are further broken down into the average absolute errors of the yaw, pitch and roll angles, respectively. Similarly, Table 5.3 reports the average absolute angle errors for yaw, pitch and roll on the ICT-3DHP dataset.
The proposed method
Comparing the tracking performance of the generic and the personalized face models, the latter achieves better results on both the rotation and the translation metrics. By gradually adapting the face model to each subject, the personalized distributions of the shape and the expression enable the face model to fit compactly to the input point cloud and make the estimated facial pose robust to changes in the personalized expressions. Figure 5.10 and Figure 5.11 demonstrate successful tracking poses on the Biwi and ICT-3DHP datasets based on the personalized face models. The performance based on the generic face model also reveals its superiority in challenging cases such as occlusions and expression variations, as shown in Figure 5.6.
As for the rigid pose tracking, the proposed ray visibility constraint, as shown in Figures 5.10, 5.11, 5.6 and 5.7, efficiently infers the occlusions caused by hair, accessories and hands, as well as self-occlusions like profiled faces. In contrast, point-to-plane ICP [9] cannot always distinguish the occluders from the face model since it is not constrained by the visibility cue. In addition, the proposed ray visibility score inherently suggests that the more visible vertices the face model has, the lower $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta})$ will be; optimally, the number of visible facial points should be maximized. Similar observations have been explored and proven helpful for increasing pose tracking accuracy in the references [8; 47]. $S(\mathcal{Q},\mathcal{P};\boldsymbol{\theta})$ ensures an optimal coverage between the distribution of the warped face model and the surface of the input point cloud, and thus yields a more robust estimate than point-estimate solutions, such as MAP or ML estimation based on point-to-plane ICP or the ray visibility constraint.
The online identity adaptation progressively personalizes the face model to the test subject.
Figure 5.12: The proposed system can automatically adapt a face model from one identity to another. Top: three identities presented successively in adjacent frames. Bottom: the tracked face models adapted to the current identity. Please note the differences in head and nose shapes among the visualized face models.
Based on the properties of the assumed-density filter applied in the online identity adaptation, the speed of convergence depends on the portion of visible face model points revealed in each frame. For example, a subject with no occlusions usually converges faster than a heavily occluded subject. Moreover, the covariance $\boldsymbol{\Sigma}^{(t)}_{id}$ does not grow to infinity even with infinitely many frames of one subject entering the system, while the shape of the identity distribution described by $\boldsymbol{\Sigma}^{(t)}_{id}$ is preserved. This property offers the promising ability to switch the online identity adaptation from one subject to a new one with a smooth facial identity transfer, as visualized in Figure 5.12.
Comparison with the state of the art

A number of prior arts [8; 48; 49; 94; 102; 108; 112] for depth-based 3D facial pose tracking have also been evaluated on the Biwi [49] and ICT-3DHP [94] datasets (listed in Table 5.2 and Table 5.3), serving as references for the performance evaluation of the proposed method. The results of the reference methods are as reported by their authors.

On the Biwi dataset, the proposed method produced the lowest rotation errors among the depth-based head pose tracking algorithms, including discriminative methods like random forests [49], generative model-fitting methods like CLM-Z [94], Martin et al. [112] and Meyer et al. [8], as well as feature-based methods
Method         Yaw (°)   Pitch (°)   Roll (°)   Translation (mm)
ours           2.3       2.0         1.9        6.9
RF [49]        8.9       8.5         7.9        14.0
Martin [112]   3.6       2.5         2.6        5.8
CLM-Z [94]     14.8      12.0        23.3       16.7
TSP [48]       3.9       3.0         2.5        8.4
PSO [108]      11.1      6.6         6.7        13.8
Meyer [8]      2.1       2.1         2.4        5.9
Li⋆ [102]      2.2       1.7         3.2        −

Table 5.2: Evaluations on the Biwi dataset.
Method       Yaw (°)   Pitch (°)   Roll (°)
ours         3.4       3.2         3.3
RF [49]      7.2       9.4         7.5
CLM-Z [94]   6.9       7.1         10.5
Li⋆ [102]    3.3       3.1         2.9

Table 5.3: Evaluations on the ICT-3DHP dataset.
like the triangular surface patch method [48]. Although the missing appearance information introduces uncertainties in the estimated facial pose, the proposed approach performs comparably with the current state-of-the-art method [102] (marked with ⋆ in Tables 5.2 and 5.3), which employs RGB-D data. A similar conclusion can be drawn on the ICT-3DHP dataset, where the proposed method also shows superior performance in estimating the rotational parameters compared with depth-based approaches like random forests [49] and CLM-Z [94]. Its performance is similar to that of Li [102] even though no color information is used.
As for the translational parameters, the proposed method also delivers state-of-the-art performance compared with the depth-based approaches on the Biwi dataset³. The slight degradation relative to Meyer et al. [8] in the translation parameters may be due to the incompatibility between the model center of the ground-truth face model in the Biwi dataset and that of the proposed multilinear face model.

³No reliable ground-truth translation parameters are available for the ICT-3DHP dataset [94].
5.5.3 Limitations
The proposed system is inevitably vulnerable when the input depth video is contaminated by heavy noise, outliers and quantization errors. For example, a Kinect depth video capturing a long-distance user may severely quantize his/her facial structure and thus prevent stable facial pose estimation. On the other hand, effective cues like facial landmarks are inaccessible because the color information is not available, so hard facial poses receiving low confidence from the ray visibility constraint may still be unreliable. However, this kind of unreliability can be relieved by constraining the temporal coherency of facial poses among adjacent frames, e.g., by Kalman filtering or other temporal smoothing techniques.
5.6 Summary
We introduced a robust facial pose tracking method for commodity depth sensors that achieves state-of-the-art performance on two popular facial pose datasets. The proposed generative face model and the ray visibility score ensure robust tracking that effectively handles heavy occlusions, profiled faces due to large rotation angles, and expression variations. The generative model adapts to identities of different ages, races and genders. Its modeling of identity and expression uncertainties enables a groupwise optimization of the facial pose that is optimal over all identities and expressions encoded in the face model, and its separation of identity and expression parameters avoids interference from expression variations during face model personalization. The ray visibility constraint focuses on the visibility of face model points rather than explicit correspondences, and its information-theoretic ray visibility score offers a more robust treatment of the facial pose estimation.
A number of future directions could lead to a more stable and accurate facial pose tracking system. Effective temporal coherency deserves more attention since it provides smoother tracking trajectories and predicts reliable future facial poses from previous motion patterns. The scene flow problem is another interesting direction, as it provides subtle per-point motion variations from both the global rigid pose and the local expression variations, introducing new constraints for facial pose estimation and expression recognition. Moreover, developing more robust depth-based features would be helpful, as they would provide semantic correspondences between the face model and the measured face data.
Chapter 6
Conclusions and Future Work
This thesis has presented spatio-temporal RGB-D video enhancement and applications of image/video processing and computer vision based on RGB-D videos. In particular, using probabilistic generative models, this thesis addresses three problems: (1) spatial enhancement for eliminating the noise, outliers and depth-missing holes in a raw depth image; (2) temporal enhancement for long-range temporal consistency adaptive to the content of the RGB-D video; and (3) robust 3D facial pose tracking with online face model personalization under uncontrolled scenarios and heavy occlusions. We conclude our work in Section 6.1 and discuss future work in Section 6.2.
6.1 Conclusions
This thesis first demonstrates a new guided depth image enhancement approach, a hybrid strategy merging filtering-based depth interpolation with segment-based parametric structure propagation. Thanks to a novel arbitrary-shape and texture-constrained patch matching method for robust structure inference, the segments in the depth holes can be reliably aligned with parametric structures of similar texture and/or depth statistics. Experiments reveal that the proposed method outperforms the reference methods on the depth hole filling and surface smoothing problems.
Secondly, this thesis proposes novel weighted structure filters based on parametric structural decomposition. In detail, a novel distribution construction method is demonstrated that accelerates the weighted median/mode filters with a separable kernel, based on probabilistic generative models adaptive to the structure of the input image. Different from traditional brute-force methods requiring hundreds of filtering operations for sufficiently accurate results, the proposed approach only requires a very small number of filtering operations determined by the structure of the input image. The accelerated weighted median and weighted mode filters are effective in various applications, including depth map enhancement, joint depth upsampling and detail enhancement.
This thesis also presents a novel method for robust temporally consistent depth enhancement by introducing a probabilistic intermediate static structure. The dynamic region of the input depth video is enhanced spatially, while the static region is substituted by the updated static structure so as to favor long-range spatio-temporal enhancement. Quantitative evaluation shows the efficiency and robustness of the parameter estimation for the static structure and illustrates superior performance in comparison to various static scene estimation approaches. Qualitative evaluation reveals that the proposed method operates well on various indoor scenes and different depth cameras, and that the proposed temporally consistent depth video enhancement works satisfactorily in comparison with existing methods.
Finally, this thesis introduces a robust facial pose tracking system with adaptive face model personalization, designed for commodity depth sensors and achieving state-of-the-art performance on two popular facial pose datasets. The proposed generative face model and the visibility-constrained, information-theoretic rigid pose estimation techniques enable a more efficient and effective facial pose tracking method than the prior arts. Qualitative and quantitative results demonstrate that the proposed method can effectively handle unconstrained facial tracking cases like heavy occlusions, profiled faces with large rotation angles, and expression changes during the tracking procedure. Moreover, the proposed probabilistic multilinear face model possesses sufficient descriptive power for a variety of identities across different ages, races and genders with varying expressions.
6.2 Future Work
While we have listed potential future work for each problem at the end of its corresponding chapter, we highlight several other suggestions here.
In addition to the parametric mixture model for the weighted local distribution approximation, we can consider a non-parametric representation for describing the local image statistics. This weighted structural prior centered at a pixel $\mathbf{x}$ is analogous to a non-parametric kernel density estimator, but augmented with fully-connected pixel-wise relationships (similar to the fully-connected conditional random field discussed in Section 4.3.4) between any pair of pixels $\{\mathbf{x},\mathbf{y}\}$. Combined with a suitable data likelihood, the underlying structure map can be discovered by maximum a posteriori (MAP) estimation through efficient variational mean-field approximation. This prior guarantees that the extracted structure map is piece-wise smooth within the same piece of image structure but distinct across image discontinuities. It also responds to the observations and assumptions discussed in Section 3.3.
Appendix A
Approximation for the Gaussian Kernel
Given a set of manifolds within the domain of $\mathbf{f}$, $\{\boldsymbol{\eta}_k \in \mathbb{R}^d \,|\, k=1,2,\ldots,K\}$, the weighted distribution constructed by the Gaussian kernel can be further derived as
$$h(\mathbf{x}, \mathbf{f}) = \frac{1}{Z(\mathbf{x})} \sum_{\mathbf{y}\in\Omega_{\mathbf{x}}} w(\mathbf{x},\mathbf{y})\, \phi(\mathbf{f}_{\mathbf{y}} - \mathbf{f};\, \boldsymbol{\Sigma}_F) \tag{A.1}$$
$$\propto \sum_{\mathbf{y}\in\Omega_{\mathbf{x}}} w(\mathbf{x},\mathbf{y}) \int_{\boldsymbol{\eta}_{\mathbf{x}}\in\mathbb{R}^d} \phi(\mathbf{f}_{\mathbf{y}} - \boldsymbol{\eta}_{\mathbf{x}};\, \boldsymbol{\Sigma}_F - \boldsymbol{\Sigma}_{\mathbf{x}})\, \phi(\boldsymbol{\eta}_{\mathbf{x}} - \mathbf{f};\, \boldsymbol{\Sigma}_{\mathbf{x}})\, d\boldsymbol{\eta}_{\mathbf{x}} \tag{A.2}$$
$$\simeq \sum_{\mathbf{y}\in\Omega_{\mathbf{x}}} w(\mathbf{x},\mathbf{y}) \cdot s_K \cdot \sum_{k=1}^{K} \phi(\mathbf{f}_{\mathbf{y}} - \boldsymbol{\eta}_{k\mathbf{y}};\, \boldsymbol{\Sigma}_{k\mathbf{y}})\, \phi(\mathbf{f} - \boldsymbol{\eta}_{k\mathbf{x}};\, \boldsymbol{\Sigma}_{k\mathbf{x}}) \tag{A.3}$$
$$= \sum_{\mathbf{y}\in\Omega_{\mathbf{x}}} w(\mathbf{x},\mathbf{y}) \cdot s_K \cdot \sum_{k=1}^{K} p_{\mathbf{x}}(\mathbf{f}|k)\, p_{\mathbf{y}}(\mathbf{f}_{\mathbf{y}}|k). \tag{A.4}$$
Note that the Gauss-Hermite quadrature rule is applied in this derivation. The approximation is valid when the local manifolds $\boldsymbol{\eta}_k$ are sufficiently smooth and the summation of the variances at pixels $\mathbf{x}$ and $\mathbf{y}$ is around $\boldsymbol{\Sigma}_F$. For a detailed interpretation, please refer to [31].
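The identity underlying (A.2) — a Gaussian with covariance $\boldsymbol{\Sigma}_F$ is the convolution of two Gaussians whose covariances sum to $\boldsymbol{\Sigma}_F$ — can be checked numerically in 1-D. The sketch below (ours; the values of $f$, $f_y$ and the variance split are arbitrary) compares the direct kernel evaluation against a grid approximation of the convolution integral.

```python
import math

# Numerical check (ours) of the 1-D version of the identity used in (A.2):
# phi(f_y - f; S_F) equals the convolution of phi(.; S_F - S_x) with
# phi(.; S_x), for any split 0 < S_x < S_F.

def npdf(x, var):
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

fy, f = 1.3, 0.2
S_F, S_x = 2.0, 0.7

direct = npdf(fy - f, S_F)

step = 0.01
etas = [-20.0 + i * step for i in range(4001)]
conv = sum(npdf(fy - e, S_F - S_x) * npdf(e - f, S_x) for e in etas) * step

print(direct, conv)  # the two values agree up to discretization error
```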
Appendix B
Generative Model for Static Structure
B.1 Probabilistic Generative Mixture Model
The proposed static structure is modeled by a probabilistic generative mixture model. Three states are introduced to describe the cases that the input depth samples may occupy, each with its own distribution:

• State-I: fitting the static structure, $p(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, m^I_{\mathbf{x}}=1) = \mathcal{N}(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, \xi^2_{\mathbf{x}})$;

• State-F: forward outliers, $p(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, m^F_{\mathbf{x}}=1) = U_f(d^t_{\mathbf{x}}|Z_{\mathbf{x}}) = U_f \cdot \mathbb{1}[d^t_{\mathbf{x}} < Z_{\mathbf{x}}]$;

• State-B: backward outliers, $p(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, m^B_{\mathbf{x}}=1) = U_b(d^t_{\mathbf{x}}|Z_{\mathbf{x}}) = U_b \cdot \mathbb{1}[d^t_{\mathbf{x}} > Z_{\mathbf{x}}]$.
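The three per-state densities can be transcribed directly. In the sketch below (ours), the function names and the default values $\xi^2 = 25$, $U_f = U_b = 10^{-3}$ are illustrative assumptions; depths are in millimeters.

```python
import math

# The three per-state densities above (our transcription; default values are
# illustrative). U_f and U_b are uniform levels on the near/far side of the
# current static depth Z.

def normal_pdf(d, Z, xi2):
    return math.exp(-(d - Z) ** 2 / (2 * xi2)) / math.sqrt(2 * math.pi * xi2)

def state_density(state, d, Z, xi2=25.0, Uf=1e-3, Ub=1e-3):
    if state == "I":          # inlier: the sample fits the static structure
        return normal_pdf(d, Z, xi2)
    if state == "F":          # forward outlier: strictly in front of Z
        return Uf if d < Z else 0.0
    if state == "B":          # backward outlier: strictly behind Z
        return Ub if d > Z else 0.0
    raise ValueError(state)

print(state_density("I", 1000.0, 1000.0))  # Gaussian peak, 1/sqrt(2*pi*25)
print(state_density("F", 900.0, 1000.0))   # in front of the structure: Uf
print(state_density("B", 900.0, 1000.0))   # not behind the structure: 0
```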
To combine all three states into a unified model and describe the overall likelihood that the input depth samples fit the current static structure, we use a mixture model similar to the Gaussian mixture model [64]. Together with prior distributions of the hidden variable $m_{\mathbf{x}}$ and the static structure $Z_{\mathbf{x}}$, we can estimate the posterior with respect to $Z_{\mathbf{x}}$ to infer the most probable static structure given the input depth samples, and the posterior with respect to $m_{\mathbf{x}}$ to indicate the states that the input depth samples belong to.
B.1.1 Likelihood
The likelihood of the input depth sample $d^t_{\mathbf{x}}$ with respect to the depth value of the static structure $Z_{\mathbf{x}}$ and the hidden state indicator $m_{\mathbf{x}}$ is
$$p(d^t_{\mathbf{x}}|m_{\mathbf{x}}, Z_{\mathbf{x}}) = \mathcal{N}(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, \xi^2_{\mathbf{x}})^{m^I_{\mathbf{x}}} \cdot U_f(d^t_{\mathbf{x}}|Z_{\mathbf{x}})^{m^F_{\mathbf{x}}} \cdot U_b(d^t_{\mathbf{x}}|Z_{\mathbf{x}})^{m^B_{\mathbf{x}}}, \tag{B.1}$$
which switches among these states by setting one specific $m^k_{\mathbf{x}} = 1$, $k \in \Psi = \{I, F, B\}$, and the rest to 0.
B.1.2 Prior Distributions
Given the likelihood as well as suitable prior distributions, we will have a tractable
joint distribution. Thus the choices of the priors are essential to ensure tractable and
efficient estimation of the joint distribution as well as the posteriors.
To be compatible with the likelihood in Section B.1.1, we introduce a Gaussian distribution for $Z_{\mathbf{x}}$,
$$p(Z_{\mathbf{x}}) = \mathcal{N}(Z_{\mathbf{x}}|\mu_{\mathbf{x}}, \sigma^2_{\mathbf{x}}). \tag{B.2}$$
The prior for $m_{\mathbf{x}}$ needs to cope with the switching property that $m_{\mathbf{x}}$ offers. We thus employ the categorical distribution, which outputs a probability $\omega^k_{\mathbf{x}}$ when a state $m^k_{\mathbf{x}}$ is activated,
$$p(m_{\mathbf{x}}|\boldsymbol{\omega}_{\mathbf{x}}) = \mathrm{Cat}(m_{\mathbf{x}}|\boldsymbol{\omega}_{\mathbf{x}}) = \prod_{k\in\Psi} \big(\omega^k_{\mathbf{x}}\big)^{m^k_{\mathbf{x}}}, \quad \text{given } \sum_{k\in\Psi} \omega^k_{\mathbf{x}} = 1. \tag{B.3}$$
This distribution, however, introduces an additional parameter $\boldsymbol{\omega}_{\mathbf{x}}$, which itself needs an explicit distribution [64]. We apply the Dirichlet distribution,
$$p(\boldsymbol{\omega}_{\mathbf{x}}) = \mathrm{Dir}\big(\boldsymbol{\omega}_{\mathbf{x}}|\alpha^I_{\mathbf{x}}, \alpha^F_{\mathbf{x}}, \alpha^B_{\mathbf{x}}\big) = \mathrm{Dir}(\boldsymbol{\omega}_{\mathbf{x}}|\boldsymbol{\alpha}_{\mathbf{x}}), \quad \text{given } \alpha^k_{\mathbf{x}} \geq 0,\ k\in\Psi. \tag{B.4}$$
The reason to introduce $p(\boldsymbol{\omega}_{\mathbf{x}})$ is that we want to model the chance that each state may occur, so that we can judge the reliability of the estimated static structure. Furthermore, given a prior distribution for $\boldsymbol{\omega}_{\mathbf{x}}$, we can estimate the posterior with respect to $\boldsymbol{\omega}_{\mathbf{x}}$ as a series of data enter the model.
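A generative draw from these priors can be sketched as follows (ours; the outlier ranges and the inlier noise scale are arbitrary illustration values). The Dirichlet draw uses normalized Gamma variates from the standard library.

```python
import random

# A generative draw from the priors above (our sketch; outlier ranges and
# noise scale are arbitrary): omega ~ Dir(alpha), the state m ~ Cat(omega),
# then d from the chosen state density.

def sample_depth(alpha, mu, sigma, rng):
    Z = rng.gauss(mu, sigma)                            # Z ~ N(mu, sigma^2)
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]  # omega ~ Dir(alpha)
    total = sum(gammas)
    omega = [g / total for g in gammas]
    u, state = rng.random(), "B"                        # m ~ Cat(omega)
    if u < omega[0]:
        state = "I"
    elif u < omega[0] + omega[1]:
        state = "F"
    if state == "I":
        d = rng.gauss(Z, 5.0)                           # inlier noise
    elif state == "F":
        d = rng.uniform(Z - 500.0, Z)                   # forward outlier
    else:
        d = rng.uniform(Z, Z + 500.0)                   # backward outlier
    return d, state, Z

rng = random.Random(0)
draws = [sample_depth([8.0, 1.0, 1.0], 1000.0, 10.0, rng) for _ in range(2000)]
frac_inlier = sum(s == "I" for _, s, _ in draws) / len(draws)
print(frac_inlier)  # close to E[omega_I] = 8/10
```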
B.1.3 Joint Distribution
Given the input depth sample $d^t_{\mathbf{x}}$, the joint distribution can be written as
$$p(d^t_{\mathbf{x}}, Z_{\mathbf{x}}, m_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}) = p(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, m_{\mathbf{x}}; \xi_{\mathbf{x}})\, p(Z_{\mathbf{x}}; \mu_{\mathbf{x}}, \sigma_{\mathbf{x}})\, p(m_{\mathbf{x}}|\boldsymbol{\omega}_{\mathbf{x}})\, p(\boldsymbol{\omega}_{\mathbf{x}}; \boldsymbol{\alpha}_{\mathbf{x}}), \tag{B.5}$$
in which the parameter set is $\mathcal{P}_{\mathbf{x}} = \{\xi_{\mathbf{x}}, \mu_{\mathbf{x}}, \sigma_{\mathbf{x}}, \boldsymbol{\alpha}_{\mathbf{x}}\}$. By marginalizing the hidden variable, we obtain a joint distribution that only contains two variables, the depth value $Z_{\mathbf{x}}$ and the state chances $\boldsymbol{\omega}_{\mathbf{x}}$, together with the observation $d^t_{\mathbf{x}}$ and the parameters $\mathcal{P}_{\mathbf{x}}$:
$$p(d^t_{\mathbf{x}}, Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}) = p(Z_{\mathbf{x}}; \mu_{\mathbf{x}}, \sigma_{\mathbf{x}})\, p(\boldsymbol{\omega}_{\mathbf{x}}; \boldsymbol{\alpha}_{\mathbf{x}}) \times \big[\omega^I_{\mathbf{x}} \mathcal{N}(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, \xi^2_{\mathbf{x}}) + \omega^F_{\mathbf{x}} U_f\big(d^t_{\mathbf{x}}|Z_{\mathbf{x}}\big) + \omega^B_{\mathbf{x}} U_b\big(d^t_{\mathbf{x}}|Z_{\mathbf{x}}\big)\big], \tag{B.6}$$
which is a weighted combination of the three state densities multiplied with the prior distributions of $Z_{\mathbf{x}}$ and $\boldsymbol{\omega}_{\mathbf{x}}$.
B.1.4 Data Evidence
The data evidence $p(d^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})$ is calculated by further marginalizing the variables $Z_{\mathbf{x}}$ and $\boldsymbol{\omega}_{\mathbf{x}}$:
$$p(d^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}) = \int_{Z_{\mathbf{x}}} \int_{\boldsymbol{\omega}_{\mathbf{x}}} p(d^t_{\mathbf{x}}, Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})\, dZ_{\mathbf{x}}\, d\boldsymbol{\omega}_{\mathbf{x}}$$
$$= \frac{1}{\sum_{k\in\Psi}\alpha^k_{\mathbf{x}}} \left\{ \alpha^I_{\mathbf{x}}\, \mathcal{N}\big(d^t_{\mathbf{x}}|\mu_{\mathbf{x}}, \xi^2_{\mathbf{x}} + \sigma^2_{\mathbf{x}}\big) + \big(\alpha^B_{\mathbf{x}} U_b - \alpha^F_{\mathbf{x}} U_f\big)\, \Phi\!\left(\frac{d^t_{\mathbf{x}} - \mu_{\mathbf{x}}}{\sigma_{\mathbf{x}}}\right) + \alpha^F_{\mathbf{x}} U_f \right\}. \tag{B.7}$$
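The closed form (B.7) can be sanity-checked numerically (our sketch; the parameter values are arbitrary): marginalizing the three-state mixture over $Z \sim \mathcal{N}(\mu,\sigma^2)$ on a grid must reproduce the analytic value.

```python
import math

# Numerical sanity check (ours; arbitrary parameter values) of the closed-form
# data evidence (B.7) against grid-based marginalization over Z.

def npdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ncdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

aI, aF, aB = 8.0, 1.0, 1.0
mu, sigma, xi = 1000.0, 10.0, 5.0
Uf, Ub = 1e-3, 2e-3
d = 1008.0
a0 = aI + aF + aB

closed = (aI * npdf(d, mu, xi * xi + sigma * sigma)
          + (aB * Ub - aF * Uf) * ncdf(d, mu, sigma)
          + aF * Uf) / a0

step = 0.05
zs = [mu - 8 * sigma + i * step for i in range(int(16 * sigma / step) + 1)]
numeric = sum((aI * npdf(d, z, xi * xi)
               + aF * (Uf if d < z else 0.0)
               + aB * (Ub if d > z else 0.0)) / a0 * npdf(z, mu, sigma * sigma)
              for z in zs) * step

print(closed, numeric)  # agree up to discretization error
```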
B.1.5 Posteriors with First-order Markov Chain
In this work, we estimate the posterior in an online fashion; that is, the posterior is estimated frame by frame, with new data sequentially increasing the confidence of the static structure,
$$p(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}|\mathcal{D}^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}) = \frac{1}{p(d^t_{\mathbf{x}}|\mathcal{D}^{t-1}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})}\, p(d^t_{\mathbf{x}}|Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})\, p(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}|\mathcal{D}^{t-1}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}). \tag{B.8}$$
The posterior with respect to the hidden variable $m_{\mathbf{x}}$ indicates the distribution of the states that the input depth sample may occupy; similar to Equation (B.8),
$$p(m_{\mathbf{x}}|\mathcal{D}^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}) = \frac{1}{p(d^t_{\mathbf{x}}|\mathcal{D}^{t-1}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})} \times \int_{Z_{\mathbf{x}}} \int_{\boldsymbol{\omega}_{\mathbf{x}}} p(d^t_{\mathbf{x}}|m_{\mathbf{x}}, Z_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})\, p(m_{\mathbf{x}}|\boldsymbol{\omega}_{\mathbf{x}})\, p\big(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}|\mathcal{D}^{t-1}_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}}\big)\, dZ_{\mathbf{x}}\, d\boldsymbol{\omega}_{\mathbf{x}}. \tag{B.9}$$
These posteriors are complex and not easy to estimate, so we employ a variational approximation in which the posterior $p(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}|\mathcal{D}^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})$ is factorized into the product of an independent Gaussian distribution $q_t(Z_{\mathbf{x}})$ and an independent Dirichlet distribution $q_t(\boldsymbol{\omega}_{\mathbf{x}})$ with suitable parameters. The posterior $p(m_{\mathbf{x}}|\mathcal{D}^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})$ can also be rewritten by substituting $p(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}|\mathcal{D}^t_{\mathbf{x}}; \mathcal{P}_{\mathbf{x}})$ with the approximated posterior $q_t(Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}})$.
B.2 Derivations of the Results in Variational Approximation
In this section, we show the detailed derivations of the results presented in Section 4.3.2. For brevity, we omit the related superscripts and subscripts of parameters and variables, writing $\{d^t_{\mathbf{x}}, Z_{\mathbf{x}}, \boldsymbol{\omega}_{\mathbf{x}}\}$ as $\{d, Z, \boldsymbol{\omega}\}$ and $\{\mu^{t-1}_{\mathbf{x}}, \sigma^{t-1}_{\mathbf{x}}, \mu^t_{\mathbf{x}}, \sigma^t_{\mathbf{x}}, \xi_{\mathbf{x}}\}$ as $\{\mu, \sigma, \mu_{\mathrm{new}}, \sigma_{\mathrm{new}}, \xi\}$. Moreover, $\{\alpha^{I,t-1}_{\mathbf{x}}, \alpha^{F,t-1}_{\mathbf{x}}, \alpha^{B,t-1}_{\mathbf{x}}, \alpha^{I,t}_{\mathbf{x}}, \alpha^{F,t}_{\mathbf{x}}, \alpha^{B,t}_{\mathbf{x}}, \sum_{k\in\Psi}\alpha^{k,t-1}_{\mathbf{x}}, \sum_{k\in\Psi}\alpha^{k,t}_{\mathbf{x}}\}$ is written as $\{\alpha_1, \alpha_2, \alpha_3, \alpha^{\mathrm{new}}_1, \alpha^{\mathrm{new}}_2, \alpha^{\mathrm{new}}_3, \alpha_0, \alpha^{\mathrm{new}}_0\}$.
B.2.1 Approximated Joint Distributions
Approximated Joint Distributions Q(Z,ω, d)
Incorporating the properties of the Gaussian and Dirichlet distributions, the approximated joint distribution is a mixture of products of Gaussian and Dirichlet distributions:
$$Q(Z,\boldsymbol{\omega},d) = p(d|Z,\boldsymbol{\omega})\, q_{t-1}(Z,\boldsymbol{\omega}) \tag{B.10}$$
$$= \big[\omega^I \mathcal{N}(d|Z,\xi^2) + \omega^F U_f(d|Z) + \omega^B U_b(d|Z)\big]\, \mathcal{N}(Z|\mu,\sigma^2)\, \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1,\alpha_2,\alpha_3)$$
$$= \frac{\alpha_1}{\alpha_0} \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big)\, \mathcal{N}\!\left(Z \,\Big|\, \frac{\xi^2\mu + \sigma^2 d}{\xi^2+\sigma^2},\, \frac{\xi^2\sigma^2}{\xi^2+\sigma^2}\right) \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1+1,\alpha_2,\alpha_3)$$
$$+ \frac{\alpha_2}{\alpha_0} U_f(d|Z)\, \mathcal{N}\big(Z|\mu,\sigma^2\big)\, \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1,\alpha_2+1,\alpha_3)$$
$$+ \frac{\alpha_3}{\alpha_0} U_b(d|Z)\, \mathcal{N}\big(Z|\mu,\sigma^2\big)\, \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1,\alpha_2,\alpha_3+1). \tag{B.11}$$
Approximated Joint Distributions Q(Z, d) and Q(ω, d)
It is easy to calculate the moments related to $Z$ and $\boldsymbol{\omega}$ by estimating the moments of the approximated posteriors $Q(Z|d)$ and $Q(\boldsymbol{\omega}|d)$. Specifically, we need the joint distribution with respect to $Z$ and $d$,
$$Q(Z,d) = \int_{\boldsymbol{\omega}} Q(Z,\boldsymbol{\omega},d)\, d\boldsymbol{\omega} \tag{B.12}$$
$$= \frac{\alpha_1}{\alpha_0} \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big)\, \mathcal{N}\!\left(Z \,\Big|\, \frac{\xi^2\mu + \sigma^2 d}{\xi^2+\sigma^2},\, \frac{\xi^2\sigma^2}{\xi^2+\sigma^2}\right) + \frac{\alpha_2}{\alpha_0} U_f(d|Z)\, \mathcal{N}\big(Z|\mu,\sigma^2\big) + \frac{\alpha_3}{\alpha_0} U_b(d|Z)\, \mathcal{N}\big(Z|\mu,\sigma^2\big), \tag{B.13}$$
and the joint distribution with respect to $\boldsymbol{\omega}$ and $d$,
$$Q(\boldsymbol{\omega},d) = \int_{Z} Q(Z,\boldsymbol{\omega},d)\, dZ$$
$$= \frac{\alpha_1}{\alpha_0} \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big)\, \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1+1,\alpha_2,\alpha_3) + \frac{\alpha_3}{\alpha_0} U_b\, \Phi\!\left(\frac{d-\mu}{\sigma}\right) \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1,\alpha_2,\alpha_3+1) + \frac{\alpha_2}{\alpha_0} U_f \left(1 - \Phi\!\left(\frac{d-\mu}{\sigma}\right)\right) \mathrm{Dir}(\boldsymbol{\omega}|\alpha_1,\alpha_2+1,\alpha_3). \tag{B.14}$$
B.2.2 Approximated Data Evidence For The Observation
Similarly, the approximated data evidence is
$$q_t(d) = \int_{Z} \int_{\boldsymbol{\omega}} Q(Z,\boldsymbol{\omega},d)\, dZ\, d\boldsymbol{\omega} \tag{B.15}$$
$$= \frac{\alpha_1}{\alpha_0} \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big) + \frac{\alpha_3}{\alpha_0} U_b\, \Phi\!\left(\frac{d-\mu}{\sigma}\right) + \frac{\alpha_2}{\alpha_0} U_f \left(1 - \Phi\!\left(\frac{d-\mu}{\sigma}\right)\right), \tag{B.16}$$
which is also analytic as long as the parameters are known. The posteriors $Q(Z|d)$ and $Q(\boldsymbol{\omega}|d)$ are calculated accordingly, by dividing the joint distributions $Q(Z,d)$ and $Q(\boldsymbol{\omega},d)$ by the data evidence $q_t(d)$.
B.2.3 Parameter Updating for the Approximated Static Structure
The parameter estimation for $q_t(Z)$ matches the first and second moments of $q_t(Z)$ and $Q(Z|d)$. The first moment is
$$\mu_{\mathrm{new}} = \mathbb{E}_{Q(Z|d)}[Z] \tag{B.17}$$
$$= \frac{1}{q_t(d)\,\alpha_0} \left\{ \alpha_1 \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big)\, \frac{\xi^2\mu + \sigma^2 d}{\xi^2+\sigma^2} + \alpha_2 U_f \left[ \mu \left(1 - \Phi\!\left(\frac{d-\mu}{\sigma}\right)\right) + \sigma^2 \mathcal{N}\big(d|\mu,\sigma^2\big) \right] + \alpha_3 U_b \left[ \mu\, \Phi\!\left(\frac{d-\mu}{\sigma}\right) - \sigma^2 \mathcal{N}\big(d|\mu,\sigma^2\big) \right] \right\}, \tag{B.18}$$
which can be further written as
$$\mu_{\mathrm{new}} = \frac{1}{q_t(d)\,\alpha_0} \left\{ \alpha_1 \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big)\, \frac{\xi^2\mu + \sigma^2 d}{\xi^2+\sigma^2} + \alpha_2 U_f\, \mu + (\alpha_3 U_b - \alpha_2 U_f) \left[ \mu\, \Phi\!\left(\frac{d-\mu}{\sigma}\right) - \sigma^2 \mathcal{N}\big(d|\mu,\sigma^2\big) \right] \right\}.$$
The second moment follows in a similar fashion:
$$\mu^2_{\mathrm{new}} + \sigma^2_{\mathrm{new}} = \mathbb{E}_{Q(Z|d)}[Z^2] \tag{B.19}$$
$$= \frac{1}{q_t(d)\,\alpha_0} \left\{ \alpha_1 \mathcal{N}\big(d|\mu, \xi^2+\sigma^2\big) \left[ \left(\frac{\xi^2\mu + \sigma^2 d}{\xi^2+\sigma^2}\right)^2 + \frac{\xi^2\sigma^2}{\xi^2+\sigma^2} \right] + (\alpha_3 U_b - \alpha_2 U_f) \left[ (\mu^2+\sigma^2)\, \Phi\!\left(\frac{d-\mu}{\sigma}\right) - (d+\mu)\,\sigma^2\, \mathcal{N}\big(d|\mu,\sigma^2\big) \right] + \alpha_2 U_f \big(\mu^2 + \sigma^2\big) \right\}. \tag{B.20}$$
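The moment-matching formulas can be sanity-checked numerically (our sketch; the parameter values are arbitrary): the closed-form mean (B.18) and second moment (B.20), normalized by the evidence (B.16), must agree with grid-based moments of $Q(Z|d)$.

```python
import math

# Sanity check (ours; arbitrary parameter values) of (B.16), (B.18), (B.20)
# against grid-based moments of Q(Z|d).

def npdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ncdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

a1, a2, a3 = 8.0, 1.0, 1.0                 # alpha_I, alpha_F, alpha_B
mu, sigma, xi = 1000.0, 10.0, 5.0
Uf, Ub = 1e-3, 2e-3
d = 1008.0
a0 = a1 + a2 + a3
s2, v2 = sigma * sigma, xi * xi
Phi = ncdf(d, mu, sigma)

qd = (a1 * npdf(d, mu, v2 + s2) + a3 * Ub * Phi + a2 * Uf * (1 - Phi)) / a0  # (B.16)

m_prod = (v2 * mu + s2 * d) / (v2 + s2)    # Gaussian-product mean
v_prod = v2 * s2 / (v2 + s2)               # Gaussian-product variance
mu_new = (a1 * npdf(d, mu, v2 + s2) * m_prod
          + a2 * Uf * (mu * (1 - Phi) + s2 * npdf(d, mu, s2))
          + a3 * Ub * (mu * Phi - s2 * npdf(d, mu, s2))) / (qd * a0)         # (B.18)
e2 = (a1 * npdf(d, mu, v2 + s2) * (m_prod ** 2 + v_prod)
      + (a3 * Ub - a2 * Uf) * ((mu * mu + s2) * Phi
                               - (d + mu) * s2 * npdf(d, mu, s2))
      + a2 * Uf * (mu * mu + s2)) / (qd * a0)                                # (B.20)
var_new = e2 - mu_new ** 2

# grid-based reference moments of Q(Z|d)
step = 0.02
zs = [mu - 8 * sigma + i * step for i in range(int(16 * sigma / step) + 1)]
w = [((a1 * npdf(d, z, v2)
       + a2 * (Uf if d < z else 0.0)
       + a3 * (Ub if d > z else 0.0)) / a0) * npdf(z, mu, s2) for z in zs]
Zn = sum(w) * step
g_mean = sum(z * wi for z, wi in zip(zs, w)) * step / Zn
g_var = sum((z - g_mean) ** 2 * wi for z, wi in zip(zs, w)) * step / Zn

print(mu_new, g_mean, var_new, g_var)
```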
B.2.4 Parameter Updating for the Approximated State Frequencies
The parameters $\alpha_k^{\mathrm{new}}$, $k \in \{1, 2, 3\}$, are calculated by introducing auxiliary variables $m_i$ and $m_i^{(2)}$, $i = 1, 2, 3$, which denote the first and second moments of $q_t(\omega)$ with respect to $\omega$ [84]. The first moments follow from the properties of the Dirichlet distribution:
\begin{align}
m_1 = \frac{\alpha_1^{\mathrm{new}}}{\alpha_0^{\mathrm{new}}} = \mathbb{E}_{Q(\omega_1|d)}[\omega_1] &= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{\alpha_1+1}{\alpha_0+1} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{\alpha_1}{\alpha_0+1} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{\alpha_1}{\alpha_0+1}, \tag{B.21}\\
m_2 = \frac{\alpha_2^{\mathrm{new}}}{\alpha_0^{\mathrm{new}}} = \mathbb{E}_{Q(\omega_2|d)}[\omega_2] &= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{\alpha_2}{\alpha_0+1} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{\alpha_2+1}{\alpha_0+1} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{\alpha_2}{\alpha_0+1}, \tag{B.22}\\
m_3 = \frac{\alpha_3^{\mathrm{new}}}{\alpha_0^{\mathrm{new}}} = \mathbb{E}_{Q(\omega_3|d)}[\omega_3] &= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{\alpha_3}{\alpha_0+1} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{\alpha_3}{\alpha_0+1} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{\alpha_3+1}{\alpha_0+1}. \tag{B.23}
\end{align}
The second moments are calculated as follows:
\begin{align}
m_1^{(2)} = \mathbb{E}_{Q(\omega_1|d)}[\omega_1^2] &= \frac{\alpha_1^{\mathrm{new}}(\alpha_1^{\mathrm{new}}+1)}{\alpha_0^{\mathrm{new}}(\alpha_0^{\mathrm{new}}+1)} \tag{B.24}\\
&= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{(\alpha_1+1)(\alpha_1+2)}{(\alpha_0+1)(\alpha_0+2)} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{\alpha_1(\alpha_1+1)}{(\alpha_0+1)(\alpha_0+2)} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{\alpha_1(\alpha_1+1)}{(\alpha_0+1)(\alpha_0+2)}, \tag{B.25}\\
m_2^{(2)} = \mathbb{E}_{Q(\omega_2|d)}[\omega_2^2] &= \frac{\alpha_2^{\mathrm{new}}(\alpha_2^{\mathrm{new}}+1)}{\alpha_0^{\mathrm{new}}(\alpha_0^{\mathrm{new}}+1)} \tag{B.26}\\
&= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{\alpha_2(\alpha_2+1)}{(\alpha_0+1)(\alpha_0+2)} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{(\alpha_2+1)(\alpha_2+2)}{(\alpha_0+1)(\alpha_0+2)} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{\alpha_2(\alpha_2+1)}{(\alpha_0+1)(\alpha_0+2)}, \tag{B.27}\\
m_3^{(2)} = \mathbb{E}_{Q(\omega_3|d)}[\omega_3^2] &= \frac{\alpha_3^{\mathrm{new}}(\alpha_3^{\mathrm{new}}+1)}{\alpha_0^{\mathrm{new}}(\alpha_0^{\mathrm{new}}+1)} \tag{B.28}\\
&= \frac{\alpha_1}{\alpha_0\,q_t(d)}\,\mathcal{N}\!\left(d\,\middle|\,\mu,\,\xi^2+\sigma^2\right)\frac{\alpha_3(\alpha_3+1)}{(\alpha_0+1)(\alpha_0+2)} \notag\\
&\quad + \frac{\alpha_2}{\alpha_0\,q_t(d)}\,U_f\left(1-\Phi\!\left(\frac{d-\mu}{\sigma}\right)\right)\frac{\alpha_3(\alpha_3+1)}{(\alpha_0+1)(\alpha_0+2)} + \frac{\alpha_3}{\alpha_0\,q_t(d)}\,U_b\,\Phi\!\left(\frac{d-\mu}{\sigma}\right)\frac{(\alpha_3+1)(\alpha_3+2)}{(\alpha_0+1)(\alpha_0+2)}. \tag{B.29}
\end{align}
The parameters are then recovered from the introduced moment variables as
\begin{align}
\alpha_0^{\mathrm{new}} = \frac{\sum_{i=1}^3 \left(m_i - m_i^{(2)}\right)}{\sum_{i=1}^3 \left(m_i^{(2)} - m_i^2\right)}, \qquad \alpha_i^{\mathrm{new}} = \alpha_0^{\mathrm{new}}\, m_i, \quad i = 1, 2, 3. \tag{B.30}
\end{align}
B.2.5 Approximated Posterior for the State Frequencies
Similarly, the approximated posterior with respect to each state is
• State-I: fitting the static structure
\begin{equation}
q_t(m_x = \mathrm{I}\,|\,d_x^t) = \frac{\alpha_x^{\mathrm{I}}\,\mathcal{N}\!\left(d_x^t\,\middle|\,\mu_x^{t-1},\,\xi_x^2+(\sigma_x^{t-1})^2\right)}{q_t(d_x^t)\sum_{k\in\Psi}\alpha_x^k}; \tag{B.31}
\end{equation}
• State-F: forward outliers
\begin{equation}
q_t(m_x = \mathrm{F}\,|\,d_x^t) = \frac{\alpha_x^{\mathrm{F}}\,U_f\left(1-\Phi\!\left((d_x^t-\mu_x^{t-1})/\sigma_x^{t-1}\right)\right)}{q_t(d_x^t)\sum_{k\in\Psi}\alpha_x^k}; \tag{B.32}
\end{equation}
• State-B: backward outliers
\begin{equation}
q_t(m_x = \mathrm{B}\,|\,d_x^t) = \frac{\alpha_x^{\mathrm{B}}\,U_b\,\Phi\!\left((d_x^t-\mu_x^{t-1})/\sigma_x^{t-1}\right)}{q_t(d_x^t)\sum_{k\in\Psi}\alpha_x^k}. \tag{B.33}
\end{equation}
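Since the three responsibilities share the common denominator $q_t(d_x^t)\sum_{k\in\Psi}\alpha_x^k$, they can be computed by normalizing the per-state numerators. A minimal sketch, with our own naming and the outlier densities $U_f$, $U_b$ treated as constants:

```python
import math

def norm_pdf(x, mu, var):
    # Gaussian density N(x | mu, var), parameterized by the variance.
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(x, mu, sigma):
    # Gaussian CDF Phi((x - mu) / sigma).
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def state_posteriors(d, mu, sigma, xi, alpha, U_f, U_b):
    """Responsibilities of states (I, F, B), Eqs. (B.31)-(B.33)."""
    aI, aF, aB = alpha
    phi = norm_cdf(d, mu, sigma)
    w = [aI * norm_pdf(d, mu, xi ** 2 + sigma ** 2),  # State-I: static structure
         aF * U_f * (1.0 - phi),                      # State-F: forward outliers
         aB * U_b * phi]                              # State-B: backward outliers
    s = sum(w)  # equals q_t(d) * sum_k alpha_k, so the terms normalize to one
    return [wi / s for wi in w]
```

A sample far behind the current static-structure estimate (large $d - \mu$) should be absorbed almost entirely by the backward-outlier state, which the sketch reproduces.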
Appendix C
The Choice of Depth Noise Standard Deviation
C.1 Depth Map from Stereo or Kinect
Since a depth map obtained from a stereo rig or Kinect is actually estimated via disparity estimation, the conversion between depth and disparity is
\begin{equation}
\frac{d^{\mathrm{disp}}}{B} = \frac{f}{d} \;\Longrightarrow\; d = \frac{fB}{d^{\mathrm{disp}}}, \tag{C.1}
\end{equation}
where $d^{\mathrm{disp}}$ is the disparity, $d$ is the depth, $f$ is the focal length of the camera, and $B$ is the baseline between the stereo sensors.
The noise and outliers in the depth map originate from errors in the disparity map. Assuming Gaussian noise and uniform outliers in the disparity map, we derive their characteristics in the corresponding depth map. Given a universal Gaussian noise standard deviation $\sigma_n^{\mathrm{disp}}$ in the disparity map, a noisy disparity value $d_n^{\mathrm{disp}}$ deviates from its mean $\mu_n^{\mathrm{disp}}$. Converting the noisy disparity value into depth, we have
\begin{align}
d_n &= \frac{fB}{d_n^{\mathrm{disp}}} \tag{C.2}\\
&= \frac{fB}{\mu_n^{\mathrm{disp}} + (d_n^{\mathrm{disp}} - \mu_n^{\mathrm{disp}})} \tag{C.3}\\
&= \frac{fB}{\mu_n^{\mathrm{disp}}}\cdot\frac{1}{1 + (d_n^{\mathrm{disp}} - \mu_n^{\mathrm{disp}})/\mu_n^{\mathrm{disp}}} \approx \frac{fB}{\mu_n^{\mathrm{disp}}}\left(1 + \frac{\mu_n^{\mathrm{disp}} - d_n^{\mathrm{disp}}}{\mu_n^{\mathrm{disp}}}\right) = 2\mu_n - \mu_n\frac{d_n^{\mathrm{disp}}}{\mu_n^{\mathrm{disp}}}. \tag{C.4}
\end{align}
Here the mean $\mu_n = fB/\mu_n^{\mathrm{disp}}$. The first-order approximation requires the constraint $|\mu_n^{\mathrm{disp}} - d_n^{\mathrm{disp}}| < \mu_n^{\mathrm{disp}}$, which is satisfied in general settings. Thus the mean value of $d_n$ is $\mathbb{E}[d_n] = 2\mu_n - \mu_n\mathbb{E}[d_n^{\mathrm{disp}}]/\mu_n^{\mathrm{disp}} = \mu_n$, and its variance is
\begin{align}
\sigma_n^2 &= \mathbb{E}\left[(d_n - \mu_n)^2\right] \tag{C.5}\\
&= \frac{\mu_n^2}{(\mu_n^{\mathrm{disp}})^2}\,\mathbb{E}\left[(\mu_n^{\mathrm{disp}} - d_n^{\mathrm{disp}})^2\right] \tag{C.6}\\
&= \left(\frac{\mu_n}{\mu_n^{\mathrm{disp}}}\right)^2\left(\sigma_n^{\mathrm{disp}}\right)^2 \tag{C.7}\\
&= \frac{\mu_n^4}{(fB)^2}\left(\sigma_n^{\mathrm{disp}}\right)^2 \;\Longrightarrow\; \sigma_n = \sigma_n^{\mathrm{disp}}\,\frac{\mu_n^2}{fB}. \tag{C.8}
\end{align}
The outliers in the depth map are still modeled by a uniform distribution.
Therefore, to better model the static structure estimation, we set the depth noise standard deviation $\xi_x \propto (d_x^t)^2/(fB)$, a function of the depth sample $d_x^t$. Samples with larger depth values require larger standard deviations to fit their noise.
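The quadratic growth of $\sigma_n$ with depth can be checked empirically: perturb a disparity with Gaussian noise, convert to depth, and compare the induced spread against the prediction of Eq. (C.8). The numeric values of $f$, $B$, and the disparity statistics below are hypothetical, chosen only for illustration.

```python
import random
import statistics

def depth_noise_std(depth, f, B, sigma_disp):
    """Predicted depth noise std of Eq. (C.8): sigma_n = sigma_disp * mu_n^2 / (f B)."""
    return sigma_disp * depth ** 2 / (f * B)

# Monte-Carlo check of the first-order approximation.
f, B = 570.0, 0.075              # hypothetical focal length (px) and baseline (m)
mu_disp, sigma_disp = 40.0, 0.5  # hypothetical disparity mean and noise std (px)
random.seed(0)
depths = [f * B / (mu_disp + random.gauss(0.0, sigma_disp)) for _ in range(100000)]
empirical = statistics.stdev(depths)
predicted = depth_noise_std(f * B / mu_disp, f, B, sigma_disp)
```

With a relative disparity noise of about 1%, the empirical depth spread matches the first-order prediction to within a few percent; the approximation degrades only when $\sigma_n^{\mathrm{disp}}$ becomes comparable to $\mu_n^{\mathrm{disp}}$.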
C.2 Depth Map from Other Sources
For depth map obtained by other sources, the noise standard deviation ξx = σ is a
constant over the image domain. If the property of the systematic error for a depth
sensor is available, the standard deviation ξx can be modeled more specifically.
Bibliography
[1] C. Richardt, C. Stoll, N. A. Dodgson, H.-P. Seidel, and C. Theobalt, "Coherent spatio-temporal filtering, upsampling and rendering of RGBZ videos," Computer Graphics Forum (Proceedings of Eurographics), vol. 31, no. 2, May 2012.
[2] L. Wang, H. Jin, R. Yang, and M. Gong, "Stereoscopic inpainting: Joint color and depth completion from stereo images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2008, pp. 1–8.
[3] M. Kass and J. Solomon, "Smoothed local histogram filters," ACM Trans. Graph., vol. 29, no. 4, p. 100, 2010.
[4] Z. Ma, K. He, Y. Wei, J. Sun, and E. Wu, "Constant time weighted median filtering for stereo matching and beyond," in Proc. IEEE Int. Conf. Comput. Vis., 2013.
[5] D. Min, J. Lu, and M. Do, "Depth video enhancement based on weighted mode filtering," vol. 21, no. 3, pp. 1176–1190, March 2012.
[6] M. Lang, O. Wang, T. Aydin, A. Smolic, and M. Gross, "Practical temporal consistency for image-based graphics applications," ACM Trans. Graph., vol. 31, no. 4, pp. 34:1–34:8, Jul. 2012.
[7] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou, "FaceWarehouse: A 3D facial expression database for visual computing," vol. 20, no. 3, pp. 413–425, 2014.
[8] G. P. Meyer, S. Gupta, I. Frosio, D. Reddy, and J. Kautz, "Robust model-based 3D head pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 3649–3657.
[9] S. Rusinkiewicz and M. Levoy, "Efficient variants of the ICP algorithm," in 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on. IEEE, 2001, pp. 145–152.
[10] J. Smisek, M. Jancosek, and T. Pajdla, 3D with Kinect. London: Springer London, 2013, pp. 3–25. [Online]. Available: http://dx.doi.org/10.1007/978-1-4471-4640-7_1
[11] J. Diebel and S. Thrun, "An application of Markov random fields to range sensing," in Advances in Neural Information Processing Systems, vol. 18. MIT Press, 2005, pp. 291–298.
[12] J. Yang, X. Ye, K. Li, and C. Hou, "Depth recovery using an adaptive color-guided auto-regressive model," in Proc. Euro. Conf. Comput. Vis. Springer, 2012, pp. 158–171.
[13] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, "High quality depth map upsampling for 3D-ToF cameras," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 1623–1630.
[14] D. Herrera, J. Kannala, J. Heikkila et al., "Depth map inpainting under a second-order smoothness prior," in Image Analysis. Springer, 2013, pp. 555–566.
[15] C. D. Herrera, J. Kannala, P. Sturm, and J. Heikkila, "A learned joint depth and intensity prior using Markov random fields," in Proc. IEEE 3DTV-CON, 2013, pp. 17–24.
[16] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120–145, 2011.
[17] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruther, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 993–1000.
[18] X. Shen, C. Zhou, L. Xu, and J. Jia, "Mutual-structure for joint filtering," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 3406–3414.
[19] B. Ham, M. Cho, and J. Ponce, "Robust image filtering using joint static and dynamic guidance," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2015, pp. 4823–4831.
[20] Y. Kim, B. Ham, C. Oh, and K. Sohn, "Structure selective depth superresolution for RGB-D cameras," vol. 25, no. 11, pp. 5227–5238, 2016.
[21] B. Ham, D. Min, and K. Sohn, "Depth superresolution by transduction," vol. 24, no. 5, pp. 1524–1535, 2015.
[22] S. Lu, X. Ren, and F. Liu, "Depth enhancement via low-rank matrix completion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 3390–3397.
[23] K. Matsuo and Y. Aoki, "Depth image enhancement using local tangent plane approximations," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2015, pp. 3574–3583.
[24] D. Min, S. Choi, J. Lu, B. Ham, K. Sohn, and M. N. Do, "Fast global image smoothing based on weighted least squares," vol. 23, no. 12, pp. 5638–5653, 2014.
[25] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, "Joint bilateral upsampling," in ACM Trans. Graph., vol. 26, no. 3, 2007, p. 96.
[26] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Proc. IEEE Int. Conf. Comput. Vis., 1998, pp. 839–846.
[27] B. Huhle, T. Schairer, P. Jenke, and W. Straßer, "Fusion of range and color images for denoising and resolution enhancement with a non-local filter," Comput. Vis. Image Understanding, vol. 114, no. 12, pp. 1336–1345, 2010.
[28] J. Dolson, J. Baek, C. Plagemann, and S. Thrun, "Upsampling range data in dynamic environments," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1141–1148.
[29] D. Chan, H. Buisman, C. Theobalt, S. Thrun et al., "A noise-aware filter for real-time depth upsampling," in Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications - M2SFA2, 2008.
[30] F. Garcia, B. Mirbach, B. Ottersten, F. Grandidier, and A. Cuesta, "Pixel weighted average strategy for depth sensor data fusion," in Proc. IEEE Int. Conf. Image Process., 2010, pp. 2805–2808.
[31] E. S. L. Gastal and M. M. Oliveira, "Adaptive manifolds for real-time high-dimensional filtering," ACM Trans. Graph., vol. 31, no. 4, pp. 33:1–33:13, 2012.
[32] ——, "Domain transform for edge-aware image and video processing," ACM Trans. Graph., vol. 30, no. 4, pp. 69:1–69:12, Jul. 2011.
[33] Q. Yang, N. Ahuja, R. Yang, K.-H. Tan, J. Davis, B. Culbertson, J. Apostolopoulos, and G. Wang, "Fusion of median and bilateral filtering for range image upsampling," vol. 22, no. 12, pp. 4841–4852, Dec 2013.
[34] Q. Yang, R. Yang, J. Davis, and D. Nister, "Spatial-depth super resolution for range images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2007, pp. 1–8.
[35] J. Lu, H. Yang, D. Min, and M. Do, "Patch match filter: Efficient edge-aware filtering meets randomized search for fast correspondence field estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2013, pp. 1854–1861.
[36] L. Sheng and K. N. Ngan, "Depth enhancement based on hybrid geometric hole filling strategy," in Proc. IEEE Int. Conf. Image Process., Sept 2013, pp. 2173–2176.
[37] H. Li, P. Roivainen, and R. Forchheimer, "3-D motion estimation in model-based facial image coding," vol. 15, no. 6, pp. 545–555, Jun 1993.
[38] M. J. Black and Y. Yacoob, "Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun 1995, pp. 374–381.
[39] D. DeCarlo and D. Metaxas, "Optical flow constraints on deformable models with applications to face tracking," International Journal of Computer Vision, vol. 38, no. 2, pp. 99–127, 2000. [Online]. Available: http://dx.doi.org/10.1023/A:1008122917811
[40] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," in Proceedings of the 26th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 1999, pp. 187–194.
[41] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," no. 6, pp. 681–685, 2001.
[42] D. Cristinacce and T. Cootes, "Automatic feature localisation with constrained local models," Pattern Recognition, vol. 41, no. 10, pp. 3054–3067, 2008.
[43] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1867–1874.
[44] J. M. Saragih, S. Lucey, and J. F. Cohn, "Deformable model fitting by regularized landmark mean-shift," International Journal of Computer Vision, vol. 91, no. 2, pp. 200–215, 2011.
[45] X. Xiong and F. Torre, "Supervised descent method and its applications to face alignment," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
[46] Y. Sun and L. Yin, "Automatic pose estimation of 3D facial models," in Proc. IEEE Int. Conf. Pattern Recognit. IEEE, 2008, pp. 1–4.
[47] M. D. Breitenstein, D. Kuettel, T. Weise, L. Van Gool, and H. Pfister, "Real-time face pose estimation from single range images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2008, pp. 1–8.
[48] C. Papazov, T. K. Marks, and M. Jones, "Real-time 3D head pose and facial landmark estimation from depth images using triangular surface patch features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 4722–4730.
[49] G. Fanelli, T. Weise, J. Gall, and L. Van Gool, "Real time head pose estimation from consumer depth cameras," in Pattern Recognition. Springer, 2011, pp. 101–110.
[50] G. Fanelli, J. Gall, and L. Van Gool, "Real time head pose estimation with random regression forests," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2011, pp. 617–624.
[51] G. Riegler, D. Ferstl, M. Ruther, and H. Bischof, "Hough networks for head pose estimation and facial feature localization," in Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
[52] V. Kazemi, C. Keskin, J. Taylor, P. Kohli, and S. Izadi, "Real-time face reconstruction from a single depth image," in 3D Vision (3DV), 2014 2nd International Conference on, vol. 1. IEEE, 2014, pp. 369–376.
[53] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," vol. 34, no. 11, pp. 2274–2282, Nov 2012.
[54] S. Perreault and P. Hebert, "Median filtering in constant time," vol. 16, no. 9, pp. 2389–2394, 2007.
[55] D. Cline, K. White, and P. Egbert, "Fast 8-bit median filtering based on separability," in Proc. IEEE Int. Conf. Image Process., vol. 5, Sept 2007, pp. V-281–V-284.
[56] J. Van de Weijer and R. Van den Boomgaard, "Local mode filtering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2, 2001, pp. II-428.
[57] E. Parzen, "On estimation of a probability density function and mode," Annals of Mathematical Statistics, vol. 33, pp. 1065–1076, Sep. 1962.
[58] D. Barash and D. Comaniciu, "A common framework for nonlinear diffusion, adaptive smoothing, bilateral filtering and mean shift," Image and Vision Computing, vol. 22, no. 1, pp. 73–81, 2004.
[59] K. He, J. Sun, and X. Tang, "Guided image filtering," in Proc. Euro. Conf. Comput. Vis. Springer, 2010, pp. 1–14.
[60] S. Paris and F. Durand, "A fast approximation of the bilateral filter using a signal processing approach," in Proc. Euro. Conf. Comput. Vis. Springer, 2006, pp. 568–580.
[61] J. Chen, S. Paris, and F. Durand, "Real-time edge-aware image processing with the bilateral grid," in ACM Trans. Graph., vol. 26, no. 3, 2007, p. 103.
[62] A. Adams, N. Gelfand, J. Dolson, and M. Levoy, "Gaussian kd-trees for fast high-dimensional filtering," in ACM Trans. Graph., vol. 28, no. 3, 2009, p. 21.
[63] A. Adams, J. Baek, and M. A. Davis, "Fast high-dimensional filtering using the permutohedral lattice," in Computer Graphics Forum, vol. 29, no. 2, 2010, pp. 753–762.
[64] C. M. Bishop and N. M. Nasrabadi, Pattern Recognition and Machine Learning. Springer, 2006.
[65] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, July 2001, pp. 416–423.
[66] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," Int. Journal of Comput. Vis., vol. 47, no. 1-3, pp. 7–42, 2002.
[67] S. Vedula, S. Baker, P. Rander, R. Collins, and T. Kanade, "Three-dimensional scene flow," in Proc. IEEE Int. Conf. Comput. Vis., vol. 2, 1999, pp. 722–729.
[68] C. Vogel, K. Schindler, and S. Roth, "3D scene flow estimation with a rigid motion prior," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 1291–1298.
[69] ——, "Piecewise rigid scene flow," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 1377–1384.
[70] S.-Y. Kim, J.-H. Cho, A. Koschan, and M. Abidi, "Spatial and temporal enhancement of depth images captured by a time-of-flight depth sensor," in Proc. IEEE Int. Conf. Pattern Recognit., Aug 2010, pp. 2358–2361.
[71] J. Zhu, L. Wang, J. Gao, and R. Yang, "Spatial-temporal fusion for high accuracy depth maps using dynamic MRFs," vol. 32, no. 5, pp. 899–909, 2010.
[72] J. Shen and S.-C. S. Cheung, "Layer depth denoising and completion for structured-light RGB-D cameras," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1187–1194.
[73] R. Szeliski, "A multi-view approach to motion and stereo," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1, 1999, pp. 157–163.
[74] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nister, and M. Pollefeys, "Real-time visibility-based fusion of depth maps," in Proc. IEEE Int. Conf. Comput. Vis., Oct 2007, pp. 1–8.
[75] S. Liu and D. Cooper, "A complete statistical inverse ray tracing approach to multi-view stereo," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2011, pp. 913–920.
[76] Y. M. Kim, C. Theobalt, J. Diebel, J. Kosecka, B. Miscusik, and S. Thrun, "Multi-view image and ToF sensor fusion for dense 3D reconstruction," in Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2009, pp. 1542–1549.
[77] C. Zitnick, S. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High-quality video view interpolation using a layered representation," in ACM SIGGRAPH, vol. 23, no. 3, August 2004, pp. 600–608.
[78] K. Pathak, A. Birk, J. Poppinga, and S. Schwertfeger, "3D forward sensor modeling and application to occupancy grid based sensor fusion," in Proc. IEEE/RSJ Int. Conf. Intell. Robots. Syst., 2007, pp. 2059–2064.
[79] B. Curless and M. Levoy, "A volumetric method for building complex models from range images," in Proc. ACM SIGGRAPH, 1996, pp. 303–312.
[80] R. A. Newcombe, A. J. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in Proc. IEEE Int. Symp. Mixed Augmented Reality, 2011, pp. 127–136.
[81] O. J. Woodford and G. Vogiatzis, "A generative model for online depth fusion," in Proc. Euro. Conf. Comput. Vis. Springer, 2012, pp. 144–157.
[82] S. Thrun, "Learning occupancy grids with forward models," in Proc. IEEE/RSJ Int. Conf. Intell. Robots. Syst., vol. 3, 2001, pp. 1676–1681.
[83] G. Vogiatzis and C. Hernandez, "Video-based, real-time multi-view stereo," Image and Vision Computing, vol. 29, no. 7, pp. 434–441, 2011.
[84] T. P. Minka, "A family of algorithms for approximate Bayesian inference," Ph.D. dissertation, Massachusetts Institute of Technology, 2001.
[85] P. Krahenbuhl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems. MIT Press, 2011, pp. 109–117.
[86] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruether, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," in Proc. IEEE Int. Conf. Comput. Vis., December 2013.
[87] D. Scharstein and C. Pal, "Learning conditional random fields for stereo," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2007, pp. 1–8.
[88] H. Hirschmuller and D. Scharstein, "Evaluation of cost functions for stereo matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2007, pp. 1–8.
[89] C. Fehn, "Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV," in Electronic Imaging. International Society for Optics and Photonics, 2004, pp. 93–104.
[90] D. Vlasic, M. Brand, H. Pfister, and J. Popovic, "Face transfer with multilinear models," in ACM Trans. Graph., vol. 24, no. 3. ACM, 2005, pp. 426–433.
[91] J. Taylor, J. Shotton, T. Sharp, and A. Fitzgibbon, "The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2012, pp. 103–110.
[92] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from a single depth image," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, June 2011.
[93] L. Wei, Q. Huang, D. Ceylan, E. Vouga, and H. Li, "Dense human body correspondences using convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[94] T. Baltrusaitis, P. Robinson, and L.-P. Morency, "3D constrained local model for rigid and non-rigid facial tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2012, pp. 2610–2617.
[95] C. Cao, Y. Weng, S. Lin, and K. Zhou, "3D shape regression for real-time facial animation," ACM Trans. Graph., vol. 32, no. 4, p. 41, 2013.
[96] Y. Cai, M. Yang, and Z. Li, "Robust head pose estimation using a 3D morphable model," Mathematical Problems in Engineering, vol. 2015, 2015.
[97] Q. Cai, D. Gallup, C. Zhang, and Z. Zhang, "3D deformable face tracking with a commodity depth camera," in Proc. Euro. Conf. Comput. Vis. Springer, 2010, pp. 229–242.
[98] C. Chen, H. X. Pham, V. Pavlovic, J. Cai, and G. Shi, "Depth recovery with face priors," in Proc. Asia Conf. Comput. Vis. Springer, 2014, pp. 336–351.
[99] A. Brunton, A. Salazar, T. Bolkart, and S. Wuhrer, "Review of statistical shape spaces for 3D data with comparative analysis for human faces," Computer Vision and Image Understanding, vol. 128, pp. 1–17, 2014.
[100] S. Bouaziz, Y. Wang, and M. Pauly, "Online modeling for realtime facial animation," ACM Trans. Graph., vol. 32, no. 4, p. 40, 2013.
[101] P.-L. Hsieh, C. Ma, J. Yu, and H. Li, "Unconstrained realtime facial performance capture," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1675–1683.
[102] S. Li, K. Ngan, R. Paramesran, and L. Sheng, "Real-time head pose tracking with online face template reconstruction." 2015.
[103] M. Storer, M. Urschler, and H. Bischof, "3D-MAM: 3D morphable appearance model for efficient fine head pose estimation from still images," in Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 192–199.
[104] S. Tulyakov, R.-L. Vieriu, S. Semeniuta, and N. Sebe, "Robust real-time extreme head pose estimation," in Proc. IEEE Int. Conf. Pattern Recognit. IEEE, 2014, pp. 2263–2268.
[105] T. Weise, S. Bouaziz, H. Li, and M. Pauly, "Realtime performance-based facial animation," in ACM Trans. Graph., vol. 30, no. 4. ACM, 2011, p. 77.
[106] H. Li, J. Yu, Y. Ye, and C. Bregler, "Realtime facial animation with on-the-fly correctives," ACM Trans. Graph., vol. 32, no. 4, pp. 42:1, 2013.
[107] S. Saito, T. Li, and H. Li, "Real-time facial segmentation and performance capture from RGB input," arXiv preprint arXiv:1604.02647, 2016.
[108] P. Padeleris, X. Zabulis, and A. A. Argyros, "Head pose estimation on depth data based on particle swarm optimization," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on. IEEE, 2012, pp. 42–49.
[109] R. Wang, L. Wei, E. Vouga, Q. Huang, D. Ceylan, G. Medioni, and H. Li, "Capturing dynamic textured surfaces of moving targets," arXiv preprint arXiv:1604.02801, 2016.
[110] H. Li, T. Weise, and M. Pauly, "Example-based facial rigging," ACM Trans. Graph., vol. 29, no. 4, p. 32, 2010.
[111] P. Ekman and W. Friesen, "Facial action coding system: a technique for the measurement of facial movement," Consulting Psychologists, San Francisco, 1978.
[112] M. Martin, F. Van De Camp, and R. Stiefelhagen, "Real time head model creation and head pose estimation on consumer depth cameras," in 3D Vision (3DV), 2014 2nd International Conference on, vol. 1. IEEE, 2014, pp. 641–648.