


Subject Literature Bulletin, Computer Science (ScienceDirect Database), Issue 6 (No. 15 overall), October 15, 2017

(To open the full text of an article, place the cursor on its title and Ctrl-click.)

Computer Vision and Image Understanding

Volume 162, Pages 1-184 (September 2017)

1. Face alignment in-the-wild: A Survey
Original Research Article, Pages 1-22
Xin Jin, Xiaoyang Tan
Abstract: Over the last two decades, face alignment, i.e., localizing fiducial facial points in 2D images, has received increasing attention owing to its wide range of applications in automatic face analysis. However, the task has proven extremely challenging in unconstrained environments due to many confounding factors, such as pose, occlusion, expression, and illumination. While numerous techniques have been developed to address these challenges, the problem is still far from solved. In this survey, we present an up-to-date critical review of the existing literature on face alignment, focusing on methods that address the difficulties and challenges of this task under uncontrolled conditions. Specifically, we categorize existing face alignment techniques, present detailed descriptions of the prominent algorithms within each category, and discuss their advantages and disadvantages. Furthermore, we devote special discussion to the practical aspects of face alignment in-the-wild, towards the development of a robust face alignment system. In addition, we report performance statistics of the state of the art and conclude with several promising directions for future research.

2. Efficient single image dehazing and denoising: An efficient multi-scale correlated wavelet approach
Original Research Article, Pages 23-33
Xin Liu, He Zhang, Yiu-ming Cheung, Xinge You, Yuan Yan Tang
Abstract: Images of outdoor scenes captured in bad weather are often plagued by limited visibility and poor contrast, and such degradations are spatially varying. Unlike most previous dehazing approaches, which remove the haze effect in the spatial domain and often suffer from noise, this paper presents an efficient multi-scale correlated wavelet approach that solves the image dehazing and denoising problem in the frequency domain. To this end, we have heuristically found a generic regularity in natural images: haze is typically distributed in the low-frequency spectrum of a multi-scale wavelet decomposition. Benefiting from this separation, we first propose an open dark channel model (ODCM) to remove the haze effect in the low-frequency part. Then, considering the coefficient relationships between the low- and high-frequency parts, we employ soft-thresholding to reduce noise and simultaneously use the transmission estimated in ODCM to adaptively enhance texture details in the high-frequency parts. Finally, the haze-free image is restored via wavelet reconstruction of the recovered low-frequency part and the enhanced high-frequency parts. The proposed approach aims not only to significantly increase perceptual visibility but also to preserve more texture detail and reduce noise. Extensive experiments show that the proposed approach yields comparable and even better performance than state-of-the-art competing techniques.

Editor: Lu Xuemei; Tel. ext. 3548; E-mail: [email protected]
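The frequency-domain split described in entry 2 can be illustrated with a minimal numpy sketch: a single-level Haar decomposition separates a low-frequency sub-band (where the paper's ODCM would remove haze; here it is left untouched for brevity) from high-frequency sub-bands that are soft-thresholded for denoising. The function names and the threshold value are illustrative, not from the paper.

```python
import numpy as np

def haar2d(img):
    """Single-level 2D Haar decomposition into a low-frequency band (LL)
    and three high-frequency bands (LH, HL, HH)."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # vertical average
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # vertical detail
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Exact inverse of haar2d (perfect reconstruction)."""
    h, w = ll.shape
    a = np.empty((h, 2 * w)); d = np.empty((h, 2 * w))
    a[:, 0::2] = ll + lh; a[:, 1::2] = ll - lh
    d[:, 0::2] = hl + hh; d[:, 1::2] = hl - hh
    out = np.empty((2 * h, 2 * w))
    out[0::2, :] = a + d; out[1::2, :] = a - d
    return out

def soft_threshold(c, t):
    """Shrink wavelet coefficients toward zero to suppress noise."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def dehaze_denoise(img, t=0.05):
    """Denoise the high-frequency bands; the ODCM haze removal on the
    low-frequency band is replaced by identity in this sketch."""
    ll, lh, hl, hh = haar2d(img)
    return ihaar2d(ll, soft_threshold(lh, t),
                   soft_threshold(hl, t), soft_threshold(hh, t))
```

The key design point mirrored here is that haze and noise live in different sub-bands, so they can be treated independently before reconstruction.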

3. Three-layer graph framework with the sumD feature for alpha matting
Original Research Article, Pages 34-45
Chao Li, Ping Wang, Xiangyu Zhu, Huali Pi
Abstract: Alpha matting, the process of extracting the opacity mask of the foreground in an image, is an important task in image and video editing. All matting methods need to exploit the relationships between pixels. Traditional propagation-based methods construct constraints based on the nonlocal principle and the color-line model to reflect these relationships; however, they produce artifacts when the constraints are unreliable. We improve on this in three ways. First, we design a novel feature, called the sumD feature, to increase pixel discrimination; it is simple and encourages pixels with similar texture to have similar feature values. Second, we design a three-layer graph framework to construct nonlocal constraints; it finds constraints at multiple scales, selects reliable ones, and unifies them according to their reliabilities. Third, we develop a new label-extension method to add hard constraints. Experimental results confirm the effectiveness of the three changes, and the proposed method ranks highly on the benchmark dataset.

4. Naturally combined shape-color moment invariants under affine transformations
Original Research Article, Pages 46-56
Ming Gong, You Hao, Hanlin Mo, Hua Li
Abstract: We propose a class of naturally combined shape-color affine moment invariants (SCAMI), which consider both shape and color affine transformations simultaneously in a single system. In real scenes, color and shape deformations always coexist in images, and simple shape invariants or color invariants alone cannot handle this situation. The conventional approach is a simple linear combination of the two factors, where the manual selection of weights is a complex issue. Our construction method is based on a multiple-integration framework in which the integral kernel is the product of the shape and color invariant cores. To our knowledge, this is the first time an invariant to dual affine transformations of shape and color has been derived directly. Manual weight selection is no longer necessary, and both the shape and color transformations are extended to the affine transformation group. With various invariant cores, a set of lower-order invariants is constructed, and their completeness and independence are discussed in detail. A set of SCAMIs, called SCAMI24, is recommended, and its effectiveness and robustness have been evaluated on both synthetic and real datasets.

5. A local feature with multiple line descriptors and its speeded-up matching algorithm
Original Research Article, Pages 57-70
Jiacha Shi, Xuanyin Wang
Abstract: This paper introduces a local feature with multiple line descriptors and a matching algorithm tailored to it. Previous approaches describe a local feature by an image patch roughly centered on a single feature point, but such a patch carries no accurate orientation or scale information, whereas a line segment does. For this reason, we extract a line descriptor from a line segment linking two randomly selected feature points. A mesh topology forms because each line descriptor links two feature points while each feature point links multiple line descriptors. The price is a large number of line descriptors, which slows descriptor matching. To speed up matching, we design a matching algorithm that exploits the mesh topology. The results show that the local feature with multiple line descriptors outperforms other classical patch-based features in robustness.

6. mdBRIEF: a fast online-adaptable, distorted binary descriptor for real-time applications using calibrated wide-angle or fisheye cameras
Original Research Article, Pages 71-86
Steffen Urban, Martin Weinmann, Stefan Hinz
Abstract: Fast binary descriptors form the core of many vision-based applications with real-time demands, such as object detection, visual odometry, and SLAM. It is commonly assumed that the acquired images, and thus the patches extracted around keypoints, originate from a perspective projection, ignoring image distortion or entirely different projections such as omnidirectional or fisheye. Usually, deviations from a perfect perspective projection are corrected with standard undistortion models; these, however, introduce artifacts as the camera's field of view grows. In addition, many applications (e.g., monocular SLAM) require only undistorted points, so holistic undistortion of every image just for descriptor extraction could be avoided. In this paper, we propose distorted and masked versions of the BRIEF descriptor for calibrated cameras, called dBRIEF and mdBRIEF respectively. Instead of correcting the distortion holistically, we distort the binary tests and thus adapt the descriptor to different image regions. The implementation of the proposed method, along with evaluation scripts, can be found online at https://github.com/urbste/mdBRIEF.
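The core idea of entry 6, bending the binary test pattern through the lens model instead of undistorting the whole image, can be sketched as follows. The one-parameter radial model and the coefficient k1 are stand-ins for the calibrated fisheye model the paper assumes; all names are illustrative.

```python
import numpy as np

def radial_distort(offset, k1=-0.2):
    """Map an ideal sampling offset (normalized patch coordinates in
    [-1, 1]) through a one-parameter radial model p_d = p * (1 + k1*r^2).
    Stand-in for a full calibrated fisheye model; k1 is made up."""
    r2 = float(np.dot(offset, offset))
    return offset * (1.0 + k1 * r2)

def distorted_brief(img, keypoint, tests, radius=8, k1=-0.2):
    """BRIEF-style pairwise intensity tests where each test offset is
    distorted, so the descriptor adapts to the local lens geometry."""
    bits = []
    for t in tests:                               # t = (dy1, dx1, dy2, dx2)
        p1 = radial_distort(t[:2], k1) * radius   # first sample offset (pixels)
        p2 = radial_distort(t[2:], k1) * radius   # second sample offset (pixels)
        r1, c1 = np.rint(keypoint + p1).astype(int)
        r2, c2 = np.rint(keypoint + p2).astype(int)
        bits.append(1 if img[r1, c1] < img[r2, c2] else 0)
    return np.array(bits, dtype=np.uint8)
```

In a real pipeline the distorted offsets would be precomputed per image region, since they depend only on the calibration, not on the image content.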


7. Stand-alone quality estimation of background subtraction algorithms
Original Research Article, Pages 87-102
Diego Ortego, Juan C. SanMiguel, José M. Martínez
Abstract: Foreground segmentation is a key stage in many computer vision applications, where existing algorithms are commonly evaluated against ground-truth data. Reference-free, or stand-alone, evaluations that estimate the quality of the segmented foreground are an alternative methodology that overcomes the limitations inherent in ground-truth-based evaluation. In this work, we survey and explore existing stand-alone measures proposed in related research areas to determine which object properties are good for estimating segmentation quality in background subtraction algorithms. We propose a new taxonomy for stand-alone evaluation measures and analyze 21 proposals. We demonstrate the utility of the selected measures by evaluating the segmentation masks of eight background subtraction algorithms. The experiments, performed on a large heterogeneous dataset with varied challenges (CDNET2014), identify which properties of the measures are most effective for estimating quality. They also demonstrate that qualitative performance levels can be distinguished and background subtraction algorithms can be ranked without ground truth.

8. Multi-object tracking through learning relational appearance features and motion patterns
Original Research Article, Pages 103-115
Jeonghwan Gwak
Abstract: Multi-object tracking (MOT) simultaneously tracks multiple targets, e.g., pedestrians in this work, by locating them and maintaining their identities to form individual trajectories. Despite recent advances in object detection, MOT based on the tracking-by-detection principle remains a challenging task in complex and crowded conditions. For example, due to occlusion, missed detections, and objects frequently entering and leaving the scene, tracking failures such as identity switches and trajectory fragmentation often occur. To tackle these issues, a new data association approach, relational appearance features and motion patterns learning (RAFMPL)-based data association, is proposed. In RAFMPL-MOT, the relational-features-based appearance model differs from conventional approaches in that it generates tracklets from relational information, selecting one reference object and using the feature differences between it and the other objects. In addition, the motion-patterns-learning-based motion model allows both linear and nonlinear confident motion patterns to be considered in data association. The proposed approach effectively covers the key difficulties of MOT. In particular, RAFMPL-MOT can reassign the same identity to an object that disappears, even for a moderately long period, and then reappears in the scene, and it improves robustness to the occlusions that frequently occur in real situations. Experimental results show that RAFMPL-MOT generally outperforms existing competitive MOT approaches.


9. Cluster-based adaptive SVM: A latent subdomains discovery method for domain adaptation problems
Original Research Article, Pages 116-134
Azadeh Sadat Mozafari, Mansour Jamzad
Abstract: Machine learning algorithms often generalize poorly to testing domains, especially when the training (source) and test (target) domains do not have similar distributions. To address this problem, several domain adaptation techniques have been proposed to improve the performance of learning algorithms that face accuracy degradation caused by domain shift. In this paper, we focus on non-homogeneously distributed target domains and propose a new latent subdomain discovery model that divides the target domain into subdomains while adapting to them. Applying adaptation per subdomain is expected to increase the detection rate compared with treating the target domain as a single domain. The proposed division method considers each subdomain as a cluster with a definite ratio of positive to negative samples, linear discriminability, and conditional distribution similarity to the source domain. It divides the target domain into subdomains while adapting the trained target classifier for each subdomain using the Adapt-SVM adaptation method, and it includes a simple procedure for selecting the appropriate number of subdomains. We call the proposed method Cluster-based Adaptive SVM, or CA-SVM for short. We test CA-SVM on two computer vision problems, pedestrian detection and image classification; the experimental results show an accuracy advantage for our approach over several baselines.

10. Stylizing face images via multiple exemplars
Original Research Article, Pages 135-145
Yibing Song, Linchao Bao, Shengfeng He, Qingxiong Yang, Ming-Hsuan Yang
Abstract: We address the problem of transferring the style of a headshot photo to face images. Existing methods that use a single exemplar produce inaccurate results when the exemplar does not contain sufficient stylized facial components for a given photo. In this work, we propose an algorithm that stylizes face images using multiple exemplars containing different subjects in the same style. Patch correspondences between an input photo and the exemplars are established using a Markov random field (MRF), which enables accurate local energy transfer via Laplacian stacks. Because image patches from multiple exemplars are used, the boundaries of facial components on the target image are inevitably inconsistent; these artifacts are removed by a post-processing step using an edge-preserving filter. Experimental results show that the proposed algorithm consistently produces visually pleasing results.

11. Non-rigid registration based model-free 3D facial expression recognition
Original Research Article, Pages 146-165
Arman Savran, Bülent Sankur
Abstract: We propose a novel feature extraction approach for 3D facial expression recognition that incorporates non-rigid registration into face-model-free analysis, which in turn makes feasible data-driven, i.e., feature-model-free, recognition of expressions. The resulting simplicity of the feature representation stems from adapting facial information to the input faces via shape-model-free dense registration, which provides a dynamic feature extraction mechanism. This approach eliminates the need for the complex feature representations required by static feature extraction methods, where complexity arises from modeling the local context; a higher degree of complexity persists in deep feature hierarchies enabled by end-to-end learning on large-scale datasets. Face-model-free recognition implies independence from the limitations and biases of committed face models, bypasses the complications of model fitting, and avoids the burden of manual model construction. We show via information-gain maps that non-rigid registration enables extraction of highly informative features, as it provides invariance to local shifts due to physiognomy (subject invariance) and residual pose misalignment; in addition, it allows estimation of local correspondences of expressions. To maximize the recognition rate, we employ a rich but computationally manageable set of local correspondence structures, and to this end we propose a framework to optimally select multiple registration references. Our features are re-sampled surface curvature values at individual coordinates chosen per expression class and per reference pair. We show the superior performance of our dynamic feature extraction approach on three distinct recognition problems: action unit detection, basic expression recognition, and emotion dimension recognition.

12. Self-calibration of omnidirectional multi-cameras including synchronization and rolling shutter
Original Research Article, Pages 166-184
Thanh-Tin Nguyen, Maxime Lhuillier
Abstract: 360° and spherical cameras have become popular and are convenient for applications such as immersive video. They are often built by fixing together several fisheye cameras pointing in different directions, but their complete self-calibration is not easy, since consumer fisheyes are rolling-shutter cameras that can be unsynchronized. Our approach does not require a calibration pattern. First, the multi-camera model is initialized using assumptions suited to an omnidirectional camera without a privileged direction: the cameras share the same settings and are roughly equiangular. Second, a frame-accurate synchronization is estimated from the instantaneous angular velocities of each camera provided by monocular structure-from-motion. Third, both inter-camera poses and intrinsic parameters are refined using multi-camera structure-from-motion and bundle adjustment. Last, we introduce a bundle adjustment that estimates not only the usual parameters but also a sub-frame-accurate synchronization and the rolling shutter. We experiment on videos taken by consumer cameras mounted on a helmet and moving along trajectories of several hundred meters or kilometers, and compare our results to ground truth.
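The frame-accurate synchronization step in entry 12 can be sketched as a search over integer frame lags between the two cameras' angular-speed sequences; the mean-cross-correlation score and the max_lag bound are illustrative choices, not the paper's.

```python
import numpy as np

def frame_offset(w1, w2, max_lag=10):
    """Return the integer frame lag that best aligns two angular-speed
    sequences, scored by mean cross-correlation over the overlap."""
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = w1[lag:], w2[:len(w2) - lag]
        else:
            a, b = w1[:len(w1) + lag], w2[-lag:]
        n = min(len(a), len(b))
        score = float(np.dot(a[:n], b[:n])) / n   # mean correlation at this lag
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

This only resolves synchronization to a whole frame; the sub-frame offset and rolling shutter are what the paper's final bundle adjustment refines.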

Journal of Visual Communication and Image Representation


Volume 47, Pages 1-72 (August 2017)

13. A joint dictionary learning and regression model for intensity estimation of facial AUs
Original Research Article, Pages 1-9
M.R. Mohammadi, E. Fatemizadeh, M.H. Mahoor
Abstract: Automated intensity estimation of spontaneous facial action units (AUs), as defined by the Facial Action Coding System (FACS), is a relatively new and challenging problem. This paper presents a joint supervised dictionary learning (SDL) and regression model for solving it. The model is cast as an optimization over two terms: the first represents the facial images in a sparse domain via dictionary learning, while the second estimates AU intensities using a linear regression model in that sparse domain. The regression model accounts for disagreement between raters through a constant biasing factor in the measured AU intensity values. Furthermore, since AU intensity is non-negative (values range from 0 to 5), we impose a non-negativity constraint on the estimated intensities by restricting the search space of the dictionary learning and the regression function. Our experimental results on the DISFA and FERA2015 databases show that this approach is very promising for automated measurement of spontaneous facial AUs.
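The two ingredients of entry 13, sparse coding over a dictionary and intensity regression constrained to the valid range, can be sketched as follows. ISTA and the hard clipping to [0, 5] are generic stand-ins for the paper's joint optimization; all parameter values are illustrative.

```python
import numpy as np

def ista_code(D, x, lam=0.1, n_iter=200):
    """Sparse-code signal x over dictionary D with ISTA: a gradient step
    on the reconstruction error followed by soft-thresholding."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = a - D.T @ (D @ a - x) / L      # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # shrinkage
    return a

def estimate_intensity(w, b, code):
    """Linear regression in the sparse domain, clipped to the valid
    FACS intensity range [0, 5]."""
    return float(np.clip(w @ code + b, 0.0, 5.0))
```

In the paper the dictionary and the regression weights are learned jointly; here they would simply be given.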

14. Per-pixel mirror-based method for high-speed video acquisition
Original Research Article, Pages 23-35
J.A.S. Lima, C.J. Miosso, M.C.Q. Farias
Abstract: High-speed imaging requires high-bandwidth, fast image sensors that are generally available only in high-end specialized cameras. Nevertheless, using compressive sensing theory and computational photography techniques, new methods have emerged that use spatial light modulators to reconstruct high-speed videos from low-speed sensors. Although these methods are a big step for the field, they still have limitations, such as low light efficiency and measurements with time dependency. To tackle these problems, we propose a per-pixel mirror-based acquisition method built on a new kind of light modulator. The proposed method uses moving mirrors to scramble the light coming from different positions, ensuring better light efficiency and generating time-independent measurements. Our results show that the proposed method and its variations outperform methods in the literature, generating videos that are less noisy and display better content separation.

15. A new method for inpainting of depth maps from time-of-flight sensors based on a modified closing by reconstruction algorithm
Original Research Article, Pages 36-47
Marco Antonio Garduño-Ramón, Ivan Ramon Terol-Villalobos, Roque Alfredo Osornio-Rios, Luis Alberto Morales-Hernandez
Abstract: Time-of-flight (ToF) sensors are popular devices that extract 3D information from a scene but are susceptible to noise and data loss, creating holes and gaps at object boundaries. The most common approaches to this problem rely on color images, with good results; however, not all ToF devices produce color information. Mathematical morphology provides operators that can manage noise in single depth frames. In this paper, a new method for filtering single depth maps when no color image is available is presented, based on a modification of the morphological closing by reconstruction algorithm. The proposed method eliminates noise while emphasizing contour preservation, and it is compared, both qualitatively and quantitatively, with other state-of-the-art filters. It represents an improvement to the closing by reconstruction algorithm that can be applied to filter depth maps from ToF devices.
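The classical (unmodified) closing by reconstruction that entry 15 builds on can be shown on a 1-D signal: a plain closing serves as the marker, which is then iteratively eroded while being constrained from below by the original signal. Narrow pits (ToF holes) get filled while wide structures, i.e., real contours, are preserved. This is the textbook operator, not the authors' modified version.

```python
import numpy as np

def dilate1d(x, size=3):
    """Grey-level dilation of a 1-D signal with a flat structuring element."""
    pad = size // 2
    xp = np.pad(x, pad, mode='edge')
    return np.max([xp[i:i + len(x)] for i in range(size)], axis=0)

def erode1d(x, size=3):
    """Grey-level erosion, the dual of dilate1d."""
    pad = size // 2
    xp = np.pad(x, pad, mode='edge')
    return np.min([xp[i:i + len(x)] for i in range(size)], axis=0)

def closing_by_reconstruction(x, size=3, n_iter=100):
    """Close, then reconstruct by erosion constrained from below by x:
    fills narrow pits (sensor holes) while preserving wide valleys."""
    marker = erode1d(dilate1d(x, size), size)   # plain closing as marker
    for _ in range(n_iter):
        nxt = np.maximum(erode1d(marker, size), x)
        if np.array_equal(nxt, marker):         # reached stability
            break
        marker = nxt
    return marker
```

On a real depth map the same operators would run in 2-D; the contour-preserving behavior is exactly what the abstract emphasizes.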

16. Video synthesis from stereo videos with iterative depth refinement
Original Research Article, Pages 48-61
Chen-Hao Wei, Shang-Hong Lai, Chen-Kuo Chiang
Abstract: We propose a novel depth map refinement algorithm that generates multi-view video sequences from two-view video sequences for modern autostereoscopic displays. To synthesize realistic content for virtual views, high-quality depth maps are critical, so refining the depth maps is the main challenge of the task. We propose an iterative depth refinement algorithm, comprising error detection and error correction, to correct errors in the depth map. Error detection targets two error types: across-view color-depth-inconsistency errors and local color-depth-inconsistency errors. Error pixels are then corrected by sampling local candidates. A trilateral filter, which incorporates intensity, spatial, and temporal terms into the filter weighting, is applied to enhance spatial and temporal consistency across frames, so that virtual views can be better synthesized from the refined depth maps. To combine the warped images, disparity-based view interpolation is introduced to alleviate translucent artifacts, and finally a directional filter is applied to reduce aliasing around object boundaries, generating multiple high-quality virtual views between the two input views. We demonstrate the superior image quality of the synthesized virtual views over state-of-the-art view synthesis methods through experiments on benchmark image and video datasets.
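The trilateral weighting in entry 16 extends the familiar bilateral filter with a third, temporal term. A minimal per-pixel sketch, with Gaussian kernels and made-up sigma values (the paper does not specify its exact kernels here):

```python
import numpy as np

def trilateral_filter(depth, color, prev_depth,
                      sigma_s=2.0, sigma_i=10.0, sigma_t=5.0, radius=2):
    """Weight each neighbor by spatial closeness (sigma_s), color
    similarity (sigma_i), and temporal consistency with the previous
    frame's depth (sigma_t), then take the weighted average."""
    h, w = depth.shape
    out = np.empty_like(depth)
    for r in range(h):
        for c in range(w):
            num = den = 0.0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr < h and 0 <= cc < w):
                        continue  # skip neighbors outside the frame
                    ws = np.exp(-(dr * dr + dc * dc) / (2 * sigma_s ** 2))
                    wi = np.exp(-(color[rr, cc] - color[r, c]) ** 2 / (2 * sigma_i ** 2))
                    wt = np.exp(-(depth[rr, cc] - prev_depth[rr, cc]) ** 2 / (2 * sigma_t ** 2))
                    wgt = ws * wi * wt
                    num += wgt * depth[rr, cc]
                    den += wgt
            out[r, c] = num / den
    return out
```

The temporal term down-weights neighbors whose depth disagrees with the previous frame, which is what suppresses flicker across frames.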

17. Performance evaluation of local descriptors for maximally stable extremal regions
Original Research Article, Pages 62-72
Man Hee Lee, In Kyu Park
Abstract: Visual feature descriptors are widely used in most computer vision applications. Over the past several decades, local feature descriptors robust to challenging environments have been proposed, and because their characteristics differ with imaging conditions, it is necessary to compare their performance consistently. However, little research has attempted to establish a benchmark for performance evaluation, especially for affine region detectors, which are mainly used in object classification and recognition. This paper presents an intensive and informative performance evaluation of local descriptors for a state-of-the-art affine-invariant region detector, the maximally stable extremal region detector. We evaluate patch-based and binary descriptors, including SIFT, SURF, BRIEF, FREAK, the shape descriptor, LIOP, DAISY, GSURF, RFDg, and CNN descriptors. The experimental results reveal the relative performance and characteristics of each descriptor.

Volume 48, Pages 1-514 (October 2017)

18. Internet cross-media retrieval based on deep learning
Original Research Article, Pages 356-366
Bin Jiang, Jiachen Yang, Zhihan Lv, Kun Tian, Qinggang Meng, Yan Yan
Abstract: With the development of the Internet, multimedia information such as images and video is widely used. How to find the required multimedia data quickly and accurately among a large number of resources has therefore become a research focus in information processing. In this paper, we propose a real-time Internet cross-media retrieval method based on deep learning, with improvements in both feature extraction and distance computation. After obtaining a large number of image feature vectors, we sort the elements in each vector according to their contribution and eliminate unnecessary features. Experiments show that our method achieves high precision in image-text cross-media retrieval with less retrieval time, and it has broad application potential in cross-media retrieval.
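The feature pruning step in entry 18, sorting vector elements by contribution and dropping the rest, can be sketched as follows. The abstract does not define its contribution score, so variance across the corpus is used here purely as an illustrative stand-in.

```python
import numpy as np

def prune_features(feats, k):
    """Rank feature dimensions by their variance across the corpus (one
    plausible 'contribution' score; the paper's exact score is not
    given) and keep only the top-k dimensions."""
    order = np.argsort(feats.var(axis=0))[::-1]  # most variable dimension first
    keep = np.sort(order[:k])                    # preserve original dimension order
    return feats[:, keep], keep
```

Dropping low-contribution dimensions shrinks the vectors, which is what buys the reduced retrieval time the abstract reports.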

19. Hybrid textual-visual relevance learning for content-based image retrieval
Original Research Article, Pages 367-374
Chaoran Cui, Peiguang Lin, Xiushan Nie, Yilong Yin, Qingfeng Zhu
Abstract: Learning effective relevance measures plays a crucial role in improving the performance of content-based image retrieval (CBIR) systems. Despite extensive research efforts over decades, discovering and incorporating the semantic information of images still poses a formidable challenge for real-world CBIR systems. In this paper, we propose a novel hybrid textual-visual relevance learning method that mines textual relevance from image tags and combines textual and visual relevance for CBIR. To alleviate the sparsity and unreliability of tags, we first perform tag completion to fill in missing tags and correct noisy ones. We then capture users' semantic cognition of images by representing each image as a probability distribution over permutations of its tags. Finally, instead of early fusion, a ranking aggregation strategy is adopted to sew textual and visual relevance together seamlessly. Extensive experiments on two benchmark datasets verify the promise of our approach.
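The late-fusion idea in entry 19, aggregating a textual ranking and a visual ranking rather than fusing features early, can be illustrated with a Borda-style rule. The paper only says a ranking aggregation strategy is used; Borda counting is an assumed, simple instance of it.

```python
def aggregate_rankings(text_rank, visual_rank):
    """Borda-style late fusion: sum each image's positions in the
    textual and visual rankings and re-sort by the combined score
    (lower total position = more relevant overall)."""
    score = {img: text_rank.index(img) + visual_rank.index(img)
             for img in text_rank}
    return sorted(text_rank, key=lambda img: score[img])
```

Because only rank positions are combined, the two relevance scores never need to share a scale, which is the usual argument for rank aggregation over early fusion.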


20. Multimedia venue semantic modeling based on multimodal data
Original Research Article, Pages 375-385
Wei-Zhi Nie, Wen-Juan Peng, Xiang-yu Wang, Yi-liang Zhao, Yu-Ting Su
Abstract: A huge amount of text and multimedia (image and video) data concerning venues is constantly being generated. To model the semantics of these venues, it is essential to analyze both text and multimedia user-generated content (UGC) in an integrated manner. This task, however, is difficult for location-based social networks (LBSNs) because their text and multimedia UGC tend to be uncorrelated. In this paper, we propose a novel multimedia location topic modeling approach to address this problem. We first utilize recurrent convolutional networks to build correlations between multimedia UGC and text, then structure a graph model according to these correlations, and employ graph clustering to detect the latent multimedia topics of each venue. Based on the obtained venue semantics, we propose techniques for modeling multimedia location topics and performing semantics-based location summarization, venue prediction, and image description. Extensive experiments on a cross-platform dataset yield promising results that demonstrate the superiority of the proposed method.

21. Multimedia annotation via semi-supervised shared-subspace feature selection
Original Research Article, Pages 386-395
Zhiqiang Zeng, Xiaodong Wang, Yuming Chen
Abstract: With the rapid development of social networks and computer technologies, we constantly confront high-dimensional multimedia data, and organizing such a large amount of data manually is time-consuming and unrealistic. Most existing methods are not appropriate for large-scale data because their Laplacian matrix depends on the training data. Moreover, a given multimedia sample is usually associated with multiple labels that are inherently correlated; although traditional methods can translate this into several single-label problems, doing so ignores the correlation among labels. In this paper, we propose a novel semi-supervised feature selection method and apply it to multimedia annotation. Both labeled and unlabeled samples are fully utilized without graph construction, and the information shared between multiple labels is simultaneously uncovered. We apply the proposed algorithm to both web page and image annotation, and experimental results demonstrate the effectiveness of our method.

22. Representative band selection for hyperspectral image classification
Original Research Article, Pages 396-403
Ronglu Yang, Lifan Su, Xibin Zhao, Hai Wan, Jiaguang Sun
Abstract: The curse of dimensionality is a major challenge in hyperspectral image classification. In this work, we introduce a novel spectral band selection method based on representative band mining. In the proposed method, the distance between two spectral bands is measured using disjoint information. For band selection, all spectral bands are first grouped into clusters, and representative bands are selected from these clusters. Unlike existing clustering-based band selection methods, which select bands from each cluster individually, the proposed method selects representative bands simultaneously by exploring the relationships among all band clusters. The optimal representative bands are chosen by minimizing the distance inside each cluster and maximizing the distance among different representative bands. The selected bands can then be applied to hyperspectral image classification. Experiments are conducted on the 92AV3C Indian Pines data set. Experimental results show that the disjoint-information-based spectral band distance measure is effective and that the proposed representative band selection approach outperforms state-of-the-art methods for high-dimensional image classification.

Editor: Lu Xuemei, Tel. (3548), E-mail: [email protected], page 10 of 21
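The cluster-then-pick-representatives pipeline described in this abstract can be illustrated in a few lines. This is only a toy reconstruction: plain k-means with Euclidean distance stands in for the paper's disjoint-information band distance, and the joint optimization over all clusters is reduced to a per-cluster nearest-to-centroid pick.

```python
import numpy as np

def select_representative_bands(cube, n_clusters=5, n_iter=20, seed=0):
    """Toy band selection for a (rows, cols, bands) hyperspectral cube."""
    rng = np.random.default_rng(seed)
    bands = cube.reshape(-1, cube.shape[2]).T.astype(float)    # (bands, pixels)
    centroids = bands[rng.choice(len(bands), n_clusters, replace=False)].copy()
    for _ in range(n_iter):                                    # plain k-means
        d = np.linalg.norm(bands[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            members = bands[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    reps = []
    for k in range(n_clusters):                                # band nearest each centroid
        idx = np.flatnonzero(labels == k)
        if idx.size:
            reps.append(int(idx[np.linalg.norm(bands[idx] - centroids[k], axis=1).argmin()]))
    return sorted(reps)
```

The representatives are actual band indices, so they can be fed directly to any downstream classifier.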

23. Contextual aerial image categorization using codebook
Original Research Article, Pages 404-410
Zhijun Meng, Yan Wang, Xinyu Wu, Yating Yin, Teng Li
Abstract: Effective categorization of the millions of aerial images captured by unmanned aircraft is a useful technique with several important applications. Previous methods for this task typically encounter two problems: (1) it is hard to efficiently represent the topologies of aerial images, which, rather than conventional appearance, are the key feature for distinguishing them; and (2) the computational load is usually too high to build a real-time image categorization system. To address these problems, this paper proposes an efficient and effective aerial image categorization method based on a contextual topological codebook. The codebook of aerial images is learned within a multi-task learning framework. The topology of each aerial image is represented with a region adjacency graph (RAG). Furthermore, a codebook containing topologies is learned by jointly modeling the contextual information, based on the extracted discriminative graphlets. These graphlets are integrated into a Bag-of-Words (BoW) representation for predicting aerial image categories. Contextual relations among local patches are taken into account during categorization to yield high performance. Experimental results show that our approach is both effective and efficient.
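The region adjacency graph (RAG) mentioned above is a standard structure; a minimal sketch of how one can be built from a superpixel label map (illustrative only, not the authors' code):

```python
import numpy as np

def region_adjacency_graph(labels):
    """Edge set of the RAG of a 2D label map.

    Two regions are adjacent if their labels touch horizontally or
    vertically; edges are returned as sorted (a, b) label pairs.
    """
    edges = set()
    for shifted in (labels[1:, :], labels[:, 1:]):       # vertical, horizontal neighbors
        base = labels[: shifted.shape[0], : shifted.shape[1]]
        for a, b in zip(base.ravel(), shifted.ravel()):
            if a != b:
                edges.add((int(min(a, b)), int(max(a, b))))
    return edges
```

Graphlets such as those used in the paper would then be mined as small subgraphs of this edge set.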

24. Graph-regularized concept factorization for multi-view document clustering
Original Research Article, Pages 411-418
Kun Zhan, Jinhui Shi, Jing Wang, Feng Tian
Abstract: We propose a novel multi-view document clustering method based on graph-regularized concept factorization (MVCF). MVCF makes full use of multi-view features for a more comprehensive understanding of the data and adaptively learns a weight for each view. It also preserves the local geometrical structure of the manifolds for multi-view clustering. We derive an efficient optimization algorithm to solve the MVCF objective function and prove its convergence using the auxiliary function method. Experiments carried out on three benchmark datasets demonstrate the effectiveness of MVCF in comparison with several state-of-the-art approaches in terms of accuracy, normalized mutual information and purity.
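Purity, one of the three evaluation measures named above, is straightforward to compute; a small self-contained sketch (the label encodings are assumed to be nonnegative integers):

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Clustering purity: each predicted cluster votes for its majority class."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()   # size of the majority class in cluster c
    return total / labels_true.size
```

A purity of 1.0 means every cluster contains documents of a single class; random assignments score near the largest class prior.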


25. Combining passive visual cameras and active IMU sensors for persistent pedestrian tracking

Original Research Article, Pages 419-431
Wenchao Jiang, Zhaozheng Yin
Abstract: Vision-based pedestrian tracking becomes a hard problem when long-term or heavy occlusion occurs, or when a pedestrian temporarily moves out of the visual field. In this paper, a novel persistent pedestrian tracking system is presented that combines visual signals from surveillance cameras with sensor signals from an Inertial Measurement Unit (IMU) carried by the pedestrians themselves. IMU tracking performs Dead Reckoning (DR) using the accelerometer, gyroscope and magnetometer. Because IMU tracking is unaffected by visual occlusion, it keeps working even when pedestrians are visually occluded. Meanwhile, visual tracking assists in calibrating the IMU to counter bias drift during DR. The experimental results show that IMU and visual tracking are complementary, and their combination yields robust pedestrian tracking in many challenging scenarios.
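The dead-reckoning idea can be illustrated with a toy 2D position integrator. Headings and step lengths are assumed inputs here, whereas the paper derives them from the accelerometer, gyroscope and magnetometer; the bias drift that the visual tracker corrects would accumulate in exactly this integration.

```python
import math

def dead_reckon(start, steps):
    """Integrate (heading_rad, step_length) pairs from a start position.

    Returns the full track, including the start point, as (x, y) tuples.
    """
    x, y = start
    track = [(x, y)]
    for heading, length in steps:
        x += length * math.cos(heading)   # advance along current heading
        y += length * math.sin(heading)
        track.append((x, y))
    return track
```

Any constant heading bias compounds step after step, which is why an absolute reference (here, the camera) is needed for calibration.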

26. High-level background prior based salient object detection
Original Research Article, Pages 432-441
Gang Wang, Yongdong Zhang, Jintao Li
Abstract: Salient object detection is a fundamental problem in computer vision. Existing methods using only low-level features fail to uniformly highlight the salient object regions. To combine high-level saliency priors and low-level appearance cues, we propose a novel Background Prior based Salient detection method (BPS) for high-quality salient object detection. Unlike other background prior based methods, a background estimation step is added before performing saliency detection: we utilize the distribution of bounding boxes generated by a generic object proposal method to obtain background information. Three background priors are considered to model the saliency, namely the background connectivity prior, the background contrast prior and the spatial distribution prior, allowing the proposed method to highlight the salient object as a whole and suppress background clutter. Experiments conducted on two benchmark datasets validate that our method outperforms 11 state-of-the-art methods, while being more efficient than most leading methods.

27. Collaborative sparse representation learning model for RGB-D action recognition
Original Research Article, Pages 442-452
Z. Gao, S.H. Li, Y.J. Zhu, C. Wang, H. Zhang
Abstract: Multi-modality action recognition has become a hot research topic, and this paper proposes a collaborative sparse representation learning model for RGB-D action recognition in which RGB and depth information are adaptively fused. Specifically, dense trajectory features are first extracted and a Bag-of-Words (BoW) weighting scheme is employed for the RGB modality; for the depth modality, the human pose representation model (HPM) and temporal modeling (TM) representation are utilized. Meanwhile, the collaborative reconstruction structure and corresponding objective functions for the multiple modalities are designed, and the proposed model is then collaboratively optimized to discover the latent complementary information between RGB and depth data. Finally, the collaborative reconstruction error is employed as our classification scheme. Large-scale experimental results on the challenging public DHA, M2I and Northwestern-UCLA action datasets show that our model performs much better with both modalities than with either single modality alone, boosting human action recognition performance by taking advantage of the complementary characteristics of the RGB and depth modalities.

28. Multi-view representation learning for multi-view action recognition
Original Research Article, Pages 453-460
Tong Hao, Dan Wu, Qian Wang, Jin-Sheng Sun
Abstract: Although multiple methods have been proposed for human action recognition, existing multi-view approaches cannot effectively discover meaningful relationships among multiple action categories from different views. To handle this problem, this paper proposes a multi-view learning approach for multi-view action recognition. First, the proposed method leverages popular visual representation methods, bag-of-visual-words (BoVW) and Fisher vectors (FV), to represent individual videos in each view. Second, a sparse coding algorithm is utilized to transfer the low-level features of the various views into a discriminative, high-level semantic space. Third, we employ a multi-task learning (MTL) approach for joint action modeling and discovery of latent relationships among different action categories. Extensive experimental results on the M2I and IXMAS datasets demonstrate the effectiveness of our proposed approach. Moreover, the experiments further demonstrate that the discovered latent relationships can benefit multi-view model learning and augment action recognition performance.

29. Bilinear dynamics for crowd video analysis
Original Research Article, Pages 461-470
Shuang Wu, Hang Su, Hua Yang, Shibao Zheng, Yawen Fan, Qin Zhou
Abstract: In this paper, a novel crowd descriptor, termed the bilinear CD (Curl and Divergence) descriptor, is proposed based on the bilinear interaction of curl and divergence. Specifically, curl and divergence activation maps are computed from the normalized average flow. A local curl patch and the corresponding divergence patch are cropped from the activation maps, and the outer product of the two local patches is defined as the bilinear CD vector. By sliding a window over the activation maps, we obtain hundreds to thousands of local bilinear CD vectors. To encode them into a compact representation, Fisher vector pooling and PCA are applied to the local descriptors. Experiments on the CUHK crowd dataset show that the proposed bilinear dynamics improve the performance of video classification and retrieval by a noticeable margin compared with existing crowd features.
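The curl and divergence activation maps and the bilinear outer product can be sketched directly with finite differences. This is an illustrative reconstruction from the standard vector-calculus definitions, not the authors' implementation; the Fisher vector pooling and PCA stages are omitted.

```python
import numpy as np

def curl_divergence(u, v):
    """Curl and divergence maps of a 2D flow field with components (u, v)."""
    du_dy, du_dx = np.gradient(u)    # np.gradient returns per-axis derivatives
    dv_dy, dv_dx = np.gradient(v)
    div = du_dx + dv_dy
    curl = dv_dx - du_dy
    return curl, div

def bilinear_cd(curl_patch, div_patch):
    """Bilinear CD vector: outer product of the flattened local patches."""
    return np.outer(curl_patch.ravel(), div_patch.ravel()).ravel()
```

For a k x k window, each local descriptor has k^4 entries, which is why a compact encoding stage is needed afterwards.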

30. Identifying source camera using guided image estimation and block weighted average
Original Research Article, Pages 471-479
Le-Bing Zhang, Fei Peng, Min Long
Abstract: Sensor pattern noise (SPN) has been widely used in source camera identification. However, the SPN extracted from a natural image may be contaminated by the image content, which degrades identification accuracy. In this paper, an effective source camera identification scheme based on guided image estimation and block weighted averaging is proposed. Before SPN extraction, an adaptive SPN estimator based on image content is applied to reduce the influence of the image scene and improve the quality of the SPN. Furthermore, a novel method for constructing the camera reference SPN from ordinary images, instead of the blue-sky images used in previous schemes, is put forward, and a block weighted average approach is used to suppress the influence of image scenes on the reference SPN. Experimental results and analysis indicate that the proposed method can effectively identify the source of a natural image, especially in practical forensic environments with a small number of images.

31. Hierarchical image resampling detection based on blind deconvolution
Original Research Article, Pages 480-490
Yuting Su, Xiao Jin, Chengqian Zhang, Yawei Chen
Abstract: Resampling detection is a helpful tool in multimedia forensics; however, it is a challenging task in the presence of compression and noise. In this paper, by modeling the recovery of edited images as an inverse filtering process, we propose a novel resampling detection framework based on blind deconvolution. Our algorithm can distinguish the different interpolation types used in the resampling process, which is significant for practical forensic scenarios. Furthermore, in contrast to traditional resampling detection algorithms, our method effectively avoids interference caused by JPEG block artifacts. As the experimental results show, our method is more robust than other state-of-the-art approaches under strong JPEG compression and substantial Gaussian noise.

32. Improved marching tetrahedra algorithm based on hierarchical signed distance field and multi-scale depth map fusion for 3D reconstruction

Original Research Article, Pages 491-501
Dan Guo, Chuanqing Li, Lu Wu, Jianzhong Yang
Abstract: 3D reconstruction systems have been advanced by developments in both computer hardware and computing technologies, yet problems such as high cost, low efficiency and inaccuracy remain. For large-scale scenes in particular, failing to make full use of multi-scale depth information causes blurred and unrealistic reconstruction results. To solve this problem, we construct a hierarchical signed distance field (H-SDF) structure and design an improved marching tetrahedra algorithm for multi-scale depth map fusion. In addition, to improve efficiency, we propose a two-phase search strategy for image feature matching: the bag-of-features (BOF) model is adopted in a coarse search to narrow the search scope, and the SIFT descriptor is then used in exact matching to pick reconstruction image points. Experimental results indicate that the coarse search shortens matching time, and that fusing multi-scale depth maps with the H-SDF and extracting isosurfaces with the improved marching tetrahedra algorithm improve the visual quality.


33. A two-stage convolutional sparse prior model for image restoration
Review Article, Pages 268-280
Jiaojiao Xiong, Qiegen Liu, Yuhao Wang, Xiaoling Xu
Abstract: Image restoration (IR) from noisy, blurred and/or incomplete observed measurements is one of the important tasks in the image processing community, and the image prior is of utmost importance for recovering a high-quality image. In this paper, we present a two-stage convolutional sparse prior model for efficient image restoration. A multi-view feature prior is first obtained by convolving the image with Fields-of-Experts (FoE) filters, and the resulting multi-view features are then represented by a convolutional sparse coding (CSC) prior. By taking advantage of convolutional filters, the proposed two-stage model inherits the strengths of both the multi-view feature and CSC priors. The assembled multi-view features contain high-frequency content, redundancy, and a large range of feature orientations, which makes them well suited to representation by CSC and consequently to better image recovery. The augmented Lagrangian and alternating direction method of multipliers are employed to decouple the nonlinear optimization problem and iteratively approach the optimal solution. Results of various experiments on image deblurring and compressed sensing magnetic resonance imaging (CS-MRI) reconstruction consistently demonstrate that the proposed algorithm efficiently recovers images and offers advantages over current leading restoration approaches.

34. Fast super-resolution algorithm using rotation-invariant ELBP classifier and hierarchical pattern matching

Original Research Article, Pages 1-15
Dong Yoon Choi, Byung Cheol Song
Abstract: This paper proposes a fast super-resolution (SR) algorithm using content-adaptive two-dimensional (2D) finite impulse response (FIR) filters based on a rotation-invariant classifier. The proposed algorithm consists of a learning stage and an inference stage. In the learning stage, we cluster a sufficient number of low-resolution (LR) and high-resolution (HR) patch pairs into groups using the rotation-invariant classifier and choose a specific number of dominant clusters. We then compute, for each cluster, the optimal 2D FIR filter(s) for synthesizing a high-quality HR patch from an LR patch, and store the patch-adaptive 2D FIR filters in a dictionary. We also present a smart hierarchical addressing method for effective dictionary exploration in the inference stage. In the inference stage, the ELBP of each input LR patch is extracted in the same way as in the learning stage, and the FIR filter(s) best matching the input LR patch are found in the dictionary by hierarchical addressing. Finally, we synthesize the HR patch using the optimal 2D FIR filter. Experimental results show that the proposed algorithm produces better HR images than existing SR methods while running fast.

35. Using 3D face priors for depth recovery
Original Research Article, Pages 16-29
Chongyu Chen, Hai Xuan Pham, Vladimir Pavlovic, Jianfei Cai, Guangming Shi, Yuefang Gao, Hui Cheng
Abstract: To repair inaccurate depth measurements from commodity RGB-D sensors, existing depth recovery methods rely primarily on low-level, rigid prior information. However, as depth quality deteriorates, the recovered depth maps become increasingly unreliable, especially for non-rigid objects, so additional high-level, non-rigid information is needed to improve recovery quality. Taking as a starting point the human face, the primary prior available in many high-level tasks, we incorporate face priors into the depth recovery process. In particular, we propose a joint optimization framework that consists of two main steps: transforming the face model for better alignment, and applying face priors for improved depth recovery. Face priors from both sparse and dense 3D face models are studied. Comparing with the baseline method on benchmark datasets, we demonstrate that the proposed method achieves up to a 23.8% improvement in depth recovery with more accurate face registration, offering inspiration for both non-rigid object modeling and analysis.

36. The use of IMUs for video object retrieval in lightweight devices
Original Research Article, Pages 30-42
László Czúni, Metwally Rashad
Abstract: We introduce a new object retrieval approach in which Inertial Measurement Unit (IMU) sensors are used alongside cameras for the retrieval of 3D objects. In contrast to computationally intensive deep learning recognition and retrieval solutions, we focus on lightweight methods that can run on handheld devices and autonomous systems equipped with moderate computing power and memory. We use fast, robust, compact image descriptors together with the relative orientation of the camera to build multi-view-centered retrieval object models. For retrieval, the Hough transform paradigm is used to evaluate the results of queries applied to several frames of a video. We analyze the performance of our lightweight approach on several test datasets and in different comparisons, including automatic tracking for query generation. These experiments show the advantages of the proposed techniques, which significantly increase the retrieval rate.

37. Non-texture image inpainting using histogram of oriented gradients
Original Research Article, Pages 43-53
Vahid K. Alilou, Farzin Yaghmaee
Abstract: This paper presents a novel and efficient algorithm for non-texture inpainting of images based on the dominant orientation of local gradients. It first introduces the concept of a new matrix, called the orientation matrix, and then uses it for faster and better inpainting. The process of propagating information is carried out using a new formulation that leads to much more efficient computation than previous methods. The gain is both in computational complexity and in visual quality. Promising results on text, scratch, and block-loss inpainting demonstrate the effectiveness of the proposed method.
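The dominant local-gradient orientation that drives the propagation can be estimated with a simple magnitude-weighted orientation histogram. The bin count and patch handling below are illustrative choices, not the paper's formulation:

```python
import numpy as np

def dominant_orientation(patch, n_bins=9):
    """Dominant gradient orientation of a grayscale patch, in [0, pi).

    Orientations (not signed directions) are binned, each pixel weighted
    by its gradient magnitude; the center of the strongest bin is returned.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi          # fold direction into orientation
    hist, edges = np.histogram(ang, bins=n_bins, range=(0, np.pi), weights=mag)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[hist.argmax()]
```

For an edge, this returns the gradient orientation (perpendicular to the edge), which is the direction along which isophote-style propagation would be steered.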


38. Complex impulse noise removal from color images based on super pixel segmentation
Original Research Article, Pages 54-65
Lianghai Jin
Abstract: Impulse noise sometimes appears as blobs or granular shapes in images, irregularly shaped and typically several pixels wide in different directions. Most existing methods are designed to remove only single-point impulse noise and usually perform poorly when applied to blob noise. This paper presents a new method to suppress such complex blob noise of varying size and irregular shape in color images. First, the noisy image is segmented into superpixels by mean shift filtering followed by a clustering operation based on quaternion color distance. Then, by analyzing the characteristics of the superpixels, image pixels are classified as noise-free, blob-noisy, or single-point impulse pixels. Finally, a selected recursive vector median filter with adaptive window sizes is applied to the detected noisy pixels. The experimental results demonstrate the validity of the proposed solution, showing excellent denoising performance compared with other color image denoising methods.
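The core operation of the vector median filter, the vector median of a window of color vectors, can be sketched as follows; the adaptive windowing and recursion of the paper's filter are omitted:

```python
import numpy as np

def vector_median(window):
    """Vector median of a set of color vectors (one per row of `window`).

    Returns the vector minimizing the summed Euclidean distance to all
    others, so an impulse outlier can never be selected.
    """
    w = np.asarray(window, dtype=float)
    d = np.linalg.norm(w[:, None, :] - w[None, :, :], axis=2).sum(axis=1)
    return w[d.argmin()]
```

Unlike a per-channel median, the output is always one of the input vectors, so no new (possibly unnatural) colors are introduced.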

39. Thermal-image processing and statistical analysis for vehicle category in nighttime traffic
Original Research Article, Pages 88-109
Apiwat Sangnoree, Kosin Chamnongthai
Abstract: Automatic tollgates at highway entrances and exits need to categorize vehicles in order to collect tolls, especially at night. This paper proposes a method for vehicle categorization in nighttime traffic using thermal-image processing and statistical analysis. To recognize vehicle types, the method exploits the statistical relations between thermal features of engine heat, the windscreen and other regions. First, appropriate threshold values for classifying the thermal features are determined automatically; the entire thermal image is then divided into blocks, and the thermal features classified in all blocks by the threshold values are finally integrated for vehicle type categorization. To evaluate the performance of the proposed method, experiments categorizing 2937 samples of cars, vans and trucks were conducted, and the results reveal approximately 95.51% accuracy.

40. Combining multi-layer integration algorithm with background prior and label propagation for saliency detection

Original Research Article, Pages 110-121
Chenxing Xia, Hanling Zhang, Xiuju Gao
Abstract: In this paper, we propose a novel approach to automatically detect salient regions in an image. First, corner superpixels serve as background labels, and the saliency of the other superpixels is determined by ranking their similarities to the background labels with a ranking algorithm. Subsequently, we employ an objectness measure to pick out and propagate foreground labels. Furthermore, an integration algorithm is devised to fuse the background-based and foreground-based saliency maps, with an energy function serving as a refinement before integration. Finally, multiscale saliency maps are integrated to further improve detection performance. Experimental results on five benchmark datasets demonstrate the effectiveness of the proposed method: it produces more accurate saliency maps, with better precision-recall curves, higher F-measure and lower mean absolute error than 13 other state-of-the-art approaches on the ASD, SED, ECSSD, iCoSeg and PASCAL-S datasets.

41. A 3D-DCT video encoder using advanced coding techniques for low power mobile device

Original Research Article, Pages 122-135
Jeoong Sung Park, Tokunbo Ogunfunmi
Abstract: The three-dimensional discrete cosine transform (3D-DCT) has been researched as an alternative to the dominant video standards based on motion estimation and compensation. Since it requires no macroblock search for inter/intra prediction, 3D-DCT has a substantial complexity advantage. However, it has not been well developed because of poor video quality, while video standards such as H.263(+) and HEVC have flourished. In this paper, we propose a new 3D-DCT video coding scheme as a video solution for low-power mobile technologies such as the Internet of Things (IoT) and drones. We focus on overcoming the drawbacks reported in previous research, building a complete 3D-DCT video coding system by adopting existing advanced techniques and devising new coding algorithms to improve the overall performance of 3D-DCT. Experimental results show that the proposed 3D-DCT outperforms H.264 low-power profiles while requiring less complexity; in terms of GBD-PSNR, it provides better performance by 4.6 dB on average.
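A plain separable 3D-DCT, the transform at the heart of such a codec, can be sketched with an orthonormal DCT-II matrix applied along each axis of a video cube. This is the textbook transform only, not the proposed encoder with its additional coding tools:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n (rows are basis functions)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

def dct3(cube):
    """Separable 3D-DCT of a (t, h, w) cube, applied axis by axis."""
    out = cube.astype(float)
    for axis in range(3):
        m = dct_matrix(out.shape[axis])
        out = np.moveaxis(np.tensordot(m, np.moveaxis(out, axis, 0), axes=1), 0, axis)
    return out
```

Because the transform is orthonormal, energy is preserved, and for slowly varying content it compacts that energy into a few low-frequency coefficients, which is what makes quantization and entropy coding effective.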

42. Boundary points based scale invariant 3D point feature
Original Research Article, Pages 136-148
Baowei Lin, Fasheng Wang, Yi Sun, Wen Qu, Zheng Chen, Shuo Zhang
Abstract: In this paper, we propose a method for encoding scale-invariant 3D point features. We extract a set of boundary points from a point cloud and apply the scale-space concept to the boundary points to detect the scale-invariant point border. Three orthogonal axes are determined as the local reference frame. Three distribution matrices are generated by applying the strategy of the SPIN image method, and a one-row descriptor vector is finally calculated. Experimental results on simulated and real-scene point clouds demonstrate that our method effectively encodes the scale-invariant features of 3D point clouds.

43. Frame-wise detection of relocated I-frames in double compressed H.264 videos based on convolutional neural network

Original Research Article, Pages 149-158
Peisong He, Xinghao Jiang, Tanfeng Sun, Shilin Wang, Bin Li, Yi Dong
Abstract: Relocated I-frames are a key type of abnormal inter-coded frame in double compressed videos with shifted GOP structures. In this work, a frame-wise detection method for relocated I-frames is proposed based on a convolutional neural network (CNN). The proposed detection framework contains a novel network architecture that begins with a preprocessing layer followed by a well-designed CNN. In the preprocessing layer, a high-frequency component extraction operation is applied to eliminate the influence of diverse video content. To mitigate overfitting, several advanced structures, such as 1 × 1 convolutional filters and a global average-pooling layer, are carefully introduced into the CNN architecture. Publicly available YUV sequences are collected to construct a dataset of double compressed videos with different coding parameters. According to the experiments, the proposed framework achieves more promising relocated I-frame detection performance than a well-known CNN structure (AlexNet) and a method based on average prediction residuals.

44. Image contrast enhancement based on intensity expansion-compression
Original Research Article, Pages 169-181
Shilong Liu, Md Arifur Rahman, Ching-Feng Lin, Chin Yeow Wong, Guannan Jiang, San Chi Liu, Ngaiming Kwok, Haiyan Shi
Abstract: In most image-based applications, input images with high information content are required to ensure satisfactory performance in subsequent processing. Manipulating the intensity distribution is one of the most widely employed approaches; however, this conventional procedure often generates undesirable artifacts and reduces the information content. Here, an approach based on expanding and compressing the intensity dynamic range is proposed. By expanding the intensity according to the polarity of local edges, an intermediate image with a continuous intensity spectrum is obtained; then, by compressing this image into the allowed intensity dynamic range, an increase in information content is ensured. The combination of edge-guided expansion with compression also preserves fine details in the input image. Experimental results show that, in terms of image contrast enhancement, the proposed method outperforms other approaches based on histogram division and clipping.
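The expand-then-compress idea can be caricatured with an unsharp-mask-style expansion followed by linear range compression. The `gain` parameter and the 4-neighbor local mean are illustrative stand-ins for the paper's edge-polarity-guided expansion:

```python
import numpy as np

def expand_compress(img, gain=0.5):
    """Toy intensity expansion-compression contrast enhancement.

    Expansion pushes each pixel away from its 4-neighbor local mean
    (amplifying local edges); compression rescales back into [0, 255].
    """
    f = img.astype(float)
    local_mean = (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                  np.roll(f, 1, 1) + np.roll(f, -1, 1)) / 4.0
    expanded = f + gain * (f - local_mean)       # edge-driven expansion
    lo, hi = expanded.min(), expanded.max()
    if hi == lo:                                  # flat image: nothing to enhance
        return img.copy()
    return np.round((expanded - lo) * 255.0 / (hi - lo)).astype(np.uint8)
```

Because the compression step maps the expanded range linearly onto [0, 255], no clipping occurs, which is one way such a scheme avoids the saturation artifacts of histogram clipping.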

45. Perceptual stereoscopic video coding using disparity just-noticeable-distortion model
Original Research Article, Pages 195-204
Cheolkon Jung, Qingtao Fu, Fei Xue
Abstract: In this paper, we propose perceptual stereoscopic video coding using a disparity just-noticeable-distortion (JND) model. We obtain the disparity JND model for stereo videos from the disparity masking effects of the human visual system (HVS); the model represents the maximum distortion of stereo perception that the HVS cannot perceive. Based on the disparity JND model, we adjust prediction residuals to remove the perceptual redundancy of stereo videos, achieving significant bit-rate savings while maintaining visual quality. Experimental results demonstrate that the proposed method significantly improves coding efficiency without loss of stereoscopic perceptual quality.

46. Salient object detection via boosting object-level distinctiveness and saliency refinement
Original Research Article, Pages 224-237
Xiaoyun Yan, Yuehuan Wang, Qiong Song, Kaiheng Dai
Abstract: Many salient object detection approaches share a common drawback: they cannot uniformly highlight heterogeneous regions of salient objects, so parts of the salient objects are not discriminated from background regions in the saliency map. In this paper, we focus on this drawback and propose a novel algorithm that highlights the entire salient object more uniformly than many existing approaches. Our method consists of two stages: boosting object-level distinctiveness, and saliency refinement. In the first stage, a coarse object-level saliency map is generated by boosting the distinctiveness of the object proposals in the test images, using a set of object-level features and the Modest AdaBoost algorithm. In the second stage, several saliency refinement steps are executed to obtain a final saliency map in which the boundaries of salient objects are preserved. Quantitative and qualitative comparisons with state-of-the-art approaches demonstrate the superior performance of our approach.

47. Detecting image seam carving with low scaling ratio using multi-scale spatial and spectral entropies

Original Research Article, Pages 281-291
Dengyong Zhang, Ting Yin, Gaobo Yang, Ming Xia, Leida Li, Xingming Sun
Abstract: Seam carving is the most popular content-aware image retargeting technique. However, it may also be used to correct poor photo composition in photography competitions or to remove objects from images for malicious purposes. A blind detection approach is presented for seam-carved images with low scaling ratios (LSR). It exploits spatial and spectral entropies (SSE) on multi-scale images (the candidate image and its down-sampled versions). We observe that when a few seams are deleted from an original image, its SSE distribution changes greatly. Forty-two features are designed to capture the statistical properties of the SSE in terms of centralized tendency, dispersion and distribution; they are combined with local binary pattern (LBP)-based energy features to form ninety-six features in total. Finally, a support vector machine (SVM) classifier determines whether an image is original or has undergone seam carving. Experimental results show that the proposed approach achieves superior detection accuracy over state-of-the-art works, especially for images resized by seam carving at low scaling ratios. Moreover, it is robust against JPEG compression and seam insertion.
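Multi-scale spatial and spectral entropies of the kind described can be sketched as follows. The bin count, the scales, and the use of FFT magnitudes for the spectral side are illustrative assumptions; the paper's forty-two tendency features are not reproduced here:

```python
import numpy as np

def shannon_entropy(values, n_bins=32):
    """Shannon entropy (bits) of a histogram over the given values."""
    hist, _ = np.histogram(values, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def multiscale_sse(img, scales=(1, 2)):
    """Spatial and spectral entropies of `img` at several sub-sampled scales.

    Spatial entropy is taken over pixel intensities, spectral entropy over
    2D FFT magnitudes; returns [spatial_s1, spectral_s1, spatial_s2, ...].
    """
    feats = []
    for s in scales:
        im = img[::s, ::s].astype(float)
        feats.append(shannon_entropy(im))
        feats.append(shannon_entropy(np.abs(np.fft.fft2(im))))
    return feats
```

Removing seams deletes low-energy content, which shifts both the intensity and spectral-magnitude histograms, and hence these entropy features.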

48. Contour segment grouping for object detection
Original Research Article
Pages 292-309
Hui Wei, Chengzhuan Yang, Qian Yu
Abstract
In this paper, we propose a novel framework for object detection and recognition in cluttered images, given a single hand-drawn example as the model. Compared with previous work, our contribution is threefold. (1) Three preprocessing procedures are proposed to reduce the number of irrelevant edge fragments that are often generated during edge detection in cluttered real images. (2) A novel shape descriptor is introduced for conducting partial matching between edge fragments and

Editor: Lu Xuemei  Tel: (3548)  E-mail: [email protected]  Page 20 of 21


model contours. (3) An efficient search strategy is adopted to identify the locations of target object hypotheses. In the hypothesis verification stage, an appearance-based method (a support vector machine on the pyramid histogram of oriented gradients feature) is adopted to verify each hypothesis, identify the object, and refine its location. We conduct extensive experiments on several benchmark datasets, including ETHZ shape classes, INRIA horses, Weizmann horses, and two classes (anchors and cups) from Caltech 101. Experimental results show that the proposed method can significantly improve the accuracy of object detection. Comparisons with other recent shape-based methods further demonstrate the effectiveness and robustness of the proposed method.
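The pyramid histogram of oriented gradients (PHOG) feature used in the verification stage can be sketched in a few lines: orientation histograms, weighted by gradient magnitude, are pooled over a spatial pyramid and concatenated. The level and bin counts below are illustrative defaults, not necessarily the paper's settings.

```python
import numpy as np

def phog(gray, levels=2, bins=8):
    # Minimal PHOG sketch: gradient-orientation histograms pooled over
    # a spatial pyramid (1x1, 2x2, 4x4 cells) and concatenated.
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned orientation in [0, pi)
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level
        h = np.linspace(0, gray.shape[0], cells + 1, dtype=int)
        w = np.linspace(0, gray.shape[1], cells + 1, dtype=int)
        for i in range(cells):
            for j in range(cells):
                m = mag[h[i]:h[i + 1], w[j]:w[j + 1]]
                a = ang[h[i]:h[i + 1], w[j]:w[j + 1]]
                hist, _ = np.histogram(a, bins=bins, range=(0, np.pi),
                                       weights=m)
                feats.append(hist)
    f = np.concatenate(feats)
    return f / (f.sum() + 1e-12)              # L1-normalized descriptor
```

The resulting fixed-length vector (here 8 x (1 + 4 + 16) = 168 dimensions) is what an SVM would consume to verify each object hypothesis window.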

49. Sum-of-gradient based fast intra coding in 3D-HEVC for depth map sequences (SOG-FDIC)

Original Research Article
Pages 329-339
Jing Chen, Bohan Wang, Huanqiang Zeng, Canhui Cai, Kai-Kuang Ma
Abstract
As the latest video coding standard for multi-view plus depth video, 3D-HEVC yields high coding efficiency but at the cost of heavy computational complexity. To reduce the computational complexity, a fast intra coding algorithm based on a sum-of-gradient criterion for depth map coding in 3D-HEVC, named SOG-FDIC, is proposed in this paper. Based on the observation that DMM modes and smaller partitioning sizes are rarely used in flat regions, the sum of gradients is used to determine whether the current block belongs to a flat region, so as to skip unnecessary checking of DMMs and smaller partitioning sizes. Experimental results show that the proposed algorithm saves about 21.8% of coding time while keeping almost the same coding efficiency and reconstructed video quality of depth maps and synthesized views, compared with the original 3D-HEVC. Moreover, it has been verified that the proposed method outperforms state-of-the-art methods.
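The flat-region test at the core of this speed-up is simple: sum the absolute horizontal and vertical gradients inside a block and compare against a threshold. The sketch below assumes a plain finite-difference gradient and an arbitrary threshold; the actual criterion and threshold in 3D-HEVC's mode-decision loop may differ.

```python
import numpy as np

def is_flat_block(block, threshold=100.0):
    # Sum of absolute horizontal and vertical finite differences.
    # A small sum suggests a flat depth region, where DMM modes and
    # smaller partitioning sizes can be skipped during intra coding.
    b = block.astype(np.float64)
    gx = np.abs(np.diff(b, axis=1)).sum()   # horizontal gradients
    gy = np.abs(np.diff(b, axis=0)).sum()   # vertical gradients
    return gx + gy < threshold
```

Because depth maps are dominated by large flat areas separated by sharp object edges, this cheap per-block check prunes most of the expensive DMM and partition-size evaluations.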

50. A generic denoising framework via guided principal component analysis
Original Research Article
Pages 340-352
Tao Dai, Zhiya Xu, Haoyi Liang, Ke Gu, Qingtao Tang, Yisen Wang, Weizhi Lu, Shu-Tao Xia
Abstract
Although existing state-of-the-art denoising algorithms, such as BM3D, LPG-PCA and DDF, obtain remarkable results, these methods are not good at preserving details at high noise levels, sometimes even introducing non-existent artifacts. To improve the performance of these denoising methods at high noise levels, a generic denoising framework based on guided principal component analysis (GPCA) is proposed in this paper. The proposed framework can be split into two stages. First, we use a statistical test to generate an initial denoised image through back projection, where the statistical test detects the significantly relevant information between the denoised image and the corresponding residual image. Second, similar image patches are collected to form different patch groups, and local bases are learned from each patch group by principal component analysis. Experimental results on natural images, contaminated with Gaussian and non-Gaussian noise, verify the effectiveness of the proposed framework.
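The second stage, learning a local PCA basis per patch group and reconstructing from the leading components, can be sketched as below. This is a bare-bones illustration of PCA patch denoising, not the authors' guided framework: the grouping step, the back-projection stage, and the choice of how many components to keep are all outside this sketch.

```python
import numpy as np

def pca_denoise_group(patches, keep=4):
    # patches: (n_patches, patch_dim) array of similar, vectorized
    # patches. Learn principal components from the group and
    # reconstruct each patch from the top-'keep' components; the
    # discarded low-variance directions mostly carry noise.
    mean = patches.mean(axis=0)
    X = patches - mean
    cov = X.T @ X / len(X)                 # covariance in patch space
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues ascending
    basis = vecs[:, -keep:]                # leading components
    return (X @ basis) @ basis.T + mean    # project and reconstruct
```

With `keep` equal to the full patch dimension, the reconstruction is exact; shrinking it trades detail for noise suppression, which is where the paper's guidance from the initial denoised image comes in.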
