
Handcrafted and Deep Trackers: Recent Visual Object Tracking Approaches and Trends

MUSTANSAR FIAZ, Kyungpook National University
ARIF MAHMOOD, Information Technology University
SAJID JAVED, University of Warwick
SOON KI JUNG, Kyungpook National University

In recent years visual object tracking has become a very active research area, and an increasing number of tracking algorithms are being proposed each year. This is because tracking has wide applications in various real-world problems such as human-computer interaction, autonomous vehicles, robotics, surveillance, and security, just to name a few. In the current study, we review the latest trends and advances in the tracking area and evaluate the robustness of different trackers based on their feature extraction methods. The first part of this work comprises a comprehensive survey of recently proposed trackers. We broadly categorize trackers into Correlation Filter based Trackers (CFTs) and Non-CFTs. Each category is further classified into various types based on the architecture and the tracking mechanism. In the second part, we experimentally evaluate 24 recent trackers for robustness and compare handcrafted and deep feature based trackers. We observe that trackers using deep features performed better, though in some cases a fusion of both increased performance significantly. In order to overcome the drawbacks of the existing benchmarks, a new benchmark, Object Tracking and Temple Color (OTTC), has also been proposed and used in the evaluation of different algorithms. We analyze the performance of trackers over eleven different challenges in OTTC and three other benchmarks. Our study concludes that Discriminative Correlation Filter (DCF) based trackers perform better than the others. Our study also reveals that the inclusion of different types of regularization over DCF often results in boosted tracking performance. Finally, we sum up our study by pointing out some insights and indicating future trends in the visual object tracking field.

CCS Concepts: • Computing methodologies → Artificial intelligence; Computer vision; Computer vision problems; Tracking;

Additional Key Words and Phrases: Robustness of Tracking Algorithms, Object Tracking, Surveillance, Tracking Evaluation

ACM Reference format:
Mustansar Fiaz, Arif Mahmood, Sajid Javed, and Soon Ki Jung. 0000. Handcrafted and Deep Trackers: Recent Visual Object Tracking Approaches and Trends. ACM Comput. Surv. 0, 0, Article 0 (0000), 36 pages. DOI: 0000001.0000001

ACM acknowledges that this contribution was authored or co-authored by an employee, or contractor of the national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. Permission to make digital or hard copies for personal or classroom use is granted. Copies must bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. To copy otherwise, distribute, republish, or post, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 0000 ACM. 0360-0300/0000/0-ART0 $15.00. DOI: 0000001.0000001

1 INTRODUCTION

Visual Object Tracking (VOT) is a promising but difficult sub-field of computer vision. It has attained much attention because of its widespread use in different applications, for instance autonomous vehicles [92], traffic flow monitoring [147], surveillance and security [5], human-machine interaction [138], medical diagnostic systems [150], and activity recognition [4]. VOT is an attractive research area of computer vision due to its opportunities and diverse tracking challenges. In the past few decades, remarkable endeavors have been made by the research community, but VOT still has much potential to be explored. The difficulty of VOT lies in the myriad of challenges, such as occlusion, background clutter, illumination changes, scale variation, low resolution, fast motion, out-of-view targets, motion blur, deformation, and in-plane and out-of-plane rotation [164, 165].

VOT is the process of identifying a region of interest in a sequence and consists of four sequential elements: target initialization, appearance modelling, motion prediction, and target positioning. Target initialization is the process of annotating the object position, or region of interest, with any of the following representations: object bounding box, ellipse, centroid, object skeleton, object contour, or object silhouette. Usually, an object bounding box is provided in the initial frame of a sequence and the tracking algorithm estimates the target position in the remaining frames. Appearance modelling comprises identifying visual object features for better representation of the region of interest and effectively constructing mathematical models to detect objects using learning techniques. In motion prediction, the target location is estimated in subsequent frames. The target positioning operation involves maximum posterior prediction or greedy search. Tracking problems can be simplified by constraints imposed on the appearance and motion models. During tracking, the new target appearance is integrated by updating the appearance and motion models.
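To make these four elements concrete, the following minimal Python sketch (our illustration, not taken from any surveyed tracker; all names are hypothetical) wires target initialization, a naive sum-of-squared-differences appearance model, a greedy motion search, and target positioning into one loop:

```python
import numpy as np

def track(frames, init_box):
    """frames: list of 2-D grayscale arrays; init_box: (x, y, w, h)."""
    x, y, w, h = init_box                                  # target initialization
    template = frames[0][y:y+h, x:x+w].astype(float)       # appearance model
    boxes = [init_box]
    for frame in frames[1:]:
        best_cost, best_xy = np.inf, (x, y)
        # motion prediction: greedy search in a window around the last position
        for dy in range(-8, 9):
            for dx in range(-8, 9):
                nx, ny = x + dx, y + dy
                if 0 <= ny <= frame.shape[0] - h and 0 <= nx <= frame.shape[1] - w:
                    cand = frame[ny:ny+h, nx:nx+w].astype(float)
                    cost = np.sum((cand - template) ** 2)  # appearance match
                    if cost < best_cost:
                        best_cost, best_xy = cost, (nx, ny)
        x, y = best_xy                                     # target positioning
        # model update: integrate the new target appearance
        template = 0.9 * template + 0.1 * frame[y:y+h, x:x+w]
        boxes.append((x, y, w, h))
    return boxes
```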

In the current experimental study, we focus on monocular, model-free, single-target, causal, and short-term trackers. The model-free property means that the only supervised training example is the bounding box provided in the initial frame. Causality means that the tracker predicts the target location in the current frame without prior knowledge of future frames. Short-term means that if the tracker is lost (fails) during tracking, re-detection is not possible. The tracker output is specified by a bounding box.

Literature shows that much research has been performed on object tracking and various surveys have been published. An excellent and extensive review of tracking algorithms is presented in [175], along with feature representations and challenges; however, the field has greatly advanced in recent years. Cannons et al. [19] covered the fundamentals of object tracking problems and discussed the building blocks of object tracking algorithms, the evolution of feature representations, and different tracking evaluation techniques. Smeulders et al. [141] compared the performance of tracking algorithms and introduced a new benchmark. Li et al. [102] and Yang et al. [169] discussed object appearance representations and surveyed online generative and discriminative learning. Most of these surveys are somewhat outdated and restricted to traditional tracking methods. Recently, the performance of tracking algorithms has been boosted by the inclusion of deep learning techniques. Li et al. [101] classified deep trackers by Network Structure (NS), Network Function (NF), and Network Training (NT). Moreover, the VOT challenge [84-88] provides an efficient comparison of various trackers along with their brief introduction. Our study differs in two aspects: (1) coverage of recent tracking approaches, and (2) a comparative study of trackers based on their feature extraction methods.

The objective of the current study is to provide an overview of the recent progress and research trends, and to categorize existing tracking algorithms. Our motivation is to provide interested readers an organized reference about the diverse tracking algorithms being developed, to help them find research gaps, and to provide insights for developing new tracking algorithms.

Features play an important role in the performance of a tracker. There are two broad categories of features used by tracking algorithms: HandCrafted (HC) and deep features. HC features such as Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), Local Binary Pattern (LBP), and colour names were commonly used to represent target appearance. Recently, researchers have shifted their focus to deep features. Deep learning has shown remarkable success in various computer vision tasks such as object recognition and tracking, image segmentation, pose estimation, and image captioning. Deep features have many advantages over HC features because they have more potential to encode multi-level information and exhibit more invariance to target appearance variations. There are various deep feature extraction methods, including Recurrent Neural Networks (RNN) [59], Convolutional Neural Networks (CNN) [140], Residual Networks [68], and Auto-encoders [198]. In contrast to HC approaches, deep models are data-hungry, requiring a lot of training data. In applications with scarce training data, deep features are extracted using off-the-shelf pre-trained models such as VGGNet [140]. Although deep features have achieved much success in single object tracking [13, 33, 35], HC features [36, 110] still produce competitive results and are being employed in tracking algorithms. We investigate the performance of different trackers on the basis of HC features, deep features, and their combination, to obtain a broader view of the role of features in tracking performance.

As mentioned earlier, visual object tracking faces several challenges, and numerous algorithms have therefore been introduced. For example, Zhang et al. [192], Pan and Hu [127], and Yilmaz et al. [176] proposed tracking algorithms to handle occlusion in videos. Similarly, to handle illumination variations, algorithms have been proposed by Zhang et al. [197], Adam et al. [3], and Babenko et al. [8]. Moreover, Mei et al. [119], Kalal et al. [79], and Kwon et al. [89] handled the problem of cluttered backgrounds. Likewise, various tracking techniques have been developed to deal with other tracking challenges. Hence, there is a dire need to organize the literature associated with these challenges, to analyze the robustness of the trackers, and to categorize these algorithms according to the challenges available in the existing benchmarks. In the current work, we categorize the trackers according to feature representation schemes, i.e. handcrafted and deep feature based trackers, and analyze their performance over eleven different challenges.

The rest of this paper is organized as follows: Section 2 discusses related work; the classification of recent tracking algorithms and their brief introduction is presented in Section 3; the experimental investigation is described in Section 4; and conclusions and future directions are given in Section 5.

2 RELATED WORK

The research community has shown keen interest in VOT and developed various state-of-the-art tracking algorithms. Therefore, an overview of research methodologies and techniques will be helpful in organizing domain knowledge. Trackers can be categorized as single-object vs. multiple-object trackers, generative vs. discriminative, context-aware vs. non-aware, and online vs. offline learning algorithms. Single-object trackers [93, 94] are algorithms that track only one object in a sequence, while multi-object trackers [11, 91] simultaneously track multiple targets and follow their trajectories. In generative models, the tracking task is carried out by searching for the best-matched window, while discriminative models discriminate the target patch from the background [131, 169, 177]. In the current paper, recent tracking algorithms are classified as Correlation-Filter based Trackers (CFTs) and Non-CFTs (NCFTs). As the names indicate, CFTs [21, 62, 144] utilize correlation filters, while non-correlation trackers use other techniques [57, 60, 83].

Yilmaz et al. [175] presented a taxonomy of tracking algorithms and discussed tracking methodologies, feature representations, data association, and various challenges. Yang et al. [169] presented an overview of the local and global feature descriptors used to represent object appearance, and reviewed online learning techniques such as generative versus discriminative approaches, Monte Carlo sampling techniques, and the integration of contextual information for tracking. Cannons [19] discussed the object tracking components: initialization, representation, adaptation, association, and estimation. He discussed the advantages and disadvantages of different feature representations and their combinations. Smeulders et al. [141] performed an analysis and evaluation of different trackers with respect to a variety of tracking challenges. They found sparse and local features more suited to handle illumination variations, background clutter, and occlusion. They used various evaluation techniques, such as survival curves, Grubbs testing, and Kaplan-Meier statistics, and provided evidence that the F-score is the best measure of tracking performance. Li et al. [102] gave a detailed summary of target appearance models. Their study included local and global feature representations, and discriminative, generative, and hybrid learning techniques. In recent times, deep learning has shown significant progress in visual tracking, and Li et al. [101] categorized deep learning trackers along three aspects: (1) network structure, (2) network function, and (3) network training.


Fig. 1. Taxonomy of tracking algorithms. The chart divides trackers into Correlation Filter Trackers (Basic-CFTs, Regularized CFTs, Part-Based, Siamese-Based, and Fusion-Based) and Non-Correlation Filter Trackers (Patch Learning, Sparsity-Based, Multiple-Instance Learning, Superpixel-Based, Graph-Based, Part-Based, and Siamese-Based), listing representative trackers under each category.

Some relatively limited or focused reviews include the following works. Qi et al. [106] focused on the classification of online single-target trackers. Zhang et al. [189] discussed tracking based on sparse coding and classified sparse trackers. Ali et al. [5] discussed some classical tracking algorithms. Yang et al. [170] considered the context of the tracking scene, treating auxiliary objects [171] as the target context. Chen et al. [24] examined only CFTs. Arulampalam et al. [7] presented Bayesian tracking methods using particle filters. Most of these studies are outdated or consider only a few algorithms and are thus limited in scope. In contrast, we present a more comprehensive survey of recent contributions. We classify tracking algorithms as CFTs and NCFTs, evaluate the performance accuracy of HC and deep trackers, and examine tracking robustness over different challenges.

3 CLASSIFICATION OF TRACKING ALGORITHMS

In this section, recent tracking algorithms are studied, most of which were proposed during the last four years. Each algorithm presents a different method to exploit target structure for predicting the target location in a sequence. By analyzing the tracking procedure, we arrange these algorithms in a hierarchy and classify them into two main categories, Correlation Filter Trackers (CFTs) and Non-CFTs (NCFTs), with a number of subcategories in each class.

3.1 Correlation Filter Trackers

Discriminative Correlation Filters (DCF) are actively utilized in various computer vision applications, including object recognition [48], image registration [44], face verification [136], and action recognition [134]. In object tracking, Correlation Filters (CF) have been used to improve robustness and efficiency. Initially, the requirement of training made CF inappropriate for online tracking. Later, the development of the Minimum Output Sum of Squared Error (MOSSE) filter [15], which allows efficient adaptive training, changed the situation.


The objective of the MOSSE filter is to minimize the sum of the squared error between the desired output and the actual output in the Fourier domain. MOSSE is an improved version of the Average of Synthetic Exact Filters (ASEF) [16], which is trained offline to detect objects. ASEF averages a set of exact filters, each computed from a different training image of the same object, to obtain the best filter at the output. Later on, many state-of-the-art CFTs were proposed based on MOSSE. Traditionally, the aim in designing a CF is to yield a response map that has low values for the background and a high value at the region of interest in the scene. One such algorithm is the Circulant Structure with Kernels (CSK) tracker [70], which exploits the circulant structure of the target appearance and is trained using kernel regularized least squares.

CF-based tracking schemes perform computation in the frequency domain to manage the computational cost. The general architecture of these algorithms follows the "tracking-by-detection" approach and is presented in Fig. 2. Correlation filters are initialized from the initial frame of the sequence with a target patch cropped at the target position. During tracking, the target location in each new frame is estimated using the position estimated in the previous frame. To effectively represent the appearance of the target, an appropriate feature extraction method is employed to construct a feature map from the input patch. Boundaries are smoothed by applying a cosine window. A correlation operation is performed instead of an exhaustive convolution: the response map is computed by element-wise multiplication between the adaptive filter and the extracted features in the frequency domain, where the Discrete Fourier Transform (DFT) is computed efficiently with the Fast Fourier Transform (FFT). A confidence map is obtained in the spatial domain by applying the inverse FFT (IFFT) to the response map. The maximum confidence score gives the new estimated target position. Finally, the target appearance at the newly predicted location is used to update the correlation filters by extracting features there.

Let h be a correlation filter and x be the current frame, which may consist of extracted features or raw image pixels. CNN convolutional filters behave similarly to correlation filters in the Fourier domain. According to the convolution theorem, correlation in the frequency domain, which computes a response map by element-wise multiplication of the Fourier transforms of x and h, is equivalent to circulant convolution in the spatial domain. Often h is of much smaller size than x; therefore, before the transformation to the Fourier domain, zero padding is applied so that the transformed sizes of both are the same:

x \otimes h = \mathcal{F}^{-1}(\hat{x} \odot \hat{h}^{*}),   (1)

where \mathcal{F}^{-1} indicates the IFFT, \hat{\cdot} denotes the Fourier representation, \otimes represents convolution, \odot means element-wise multiplication, and {}^{*} is the complex conjugate. Equation (1) yields a confidence map between x and h. To update the correlation filter, the estimated target around the maximum confidence position is selected. Assume y is the desired output; the correlation filter h must then satisfy, for a new target appearance z,

y = \mathcal{F}^{-1}(\hat{z} \odot \hat{h}^{*}), \qquad \hat{h}^{*} = \hat{y} \oslash \hat{z},   (2)

where \hat{y} denotes the desired output y in the frequency domain and \oslash denotes element-wise division. The FFT reduces the computational cost: circulant convolution has a complexity of O(n^4) for an n x n image, while the FFT requires only O(n^2 log n).
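Equations (1) and (2) can be exercised directly in a few lines of NumPy. Below is a bare, single-channel MOSSE-style sketch of ours: train a filter from one patch and a Gaussian desired output, then detect by taking the peak of the response map. A practical tracker would add cosine windowing, regularization, and a running-average filter update.

```python
import numpy as np

def gaussian_peak(h, w, sigma=2.0):
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))

def train_filter(z, y, eps=1e-4):
    # Eq. (2): h_hat* = y_hat / z_hat (element-wise; eps avoids division by zero)
    return np.fft.fft2(y) / (np.fft.fft2(z) + eps)

def detect(x, h_conj_hat):
    # Eq. (1): response = IFFT(x_hat ⊙ h_hat*); the argmax is the new position
    r = np.real(np.fft.ifft2(np.fft.fft2(x) * h_conj_hat))
    return np.unravel_index(np.argmax(r), r.shape), r

patch = np.random.rand(64, 64)                 # target patch from the first frame
h_conj_hat = train_filter(patch, gaussian_peak(64, 64))
(py, px), resp = detect(np.roll(patch, (3, 5), axis=(0, 1)), h_conj_hat)
```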

CF-based tracking frameworks face different difficulties. One is modelling the target appearance (orientation and shape), as it may change over time. Another challenge is the selection of an efficient feature representation, as powerful features may improve the performance of CFTs. Another important challenge for CFTs is scale adaptation, since the size of the correlation filter is fixed during tracking while the target may change its scale over time. Furthermore, if the target is lost, it cannot be recovered. CFTs are further divided into basic CFTs (B-CFTs), regularized CFTs, part-based, Siamese-based, and fusion-based CFTs, as explained below.

3.1.1 Basic Correlation Filter based Trackers. Basic-CFTs are trackers that use Kernelized Correlation Filters (KCF) [71] as their baseline tracker. Trackers may use different features such as HOG, colour names (CN) [39], and deep features from Recurrent Neural Networks (RNN) [162], Convolutional Neural Networks (CNN) [140], and Residual Networks [68]. Numerous trackers have been developed using KCF as the base tracker, including [12, 26, 34, 45, 72, 113-116, 130, 142, 156, 157, 187, 188, 195].

Fig. 2. The framework of correlation filter for visual object tracking [24].

Fig. 3. CF2 framework [114].

The KCF [71] algorithm performs tracking using a Gaussian kernel function to distinguish between the target object and its surroundings. KCF employs HOG descriptors with a cell size of four. During tracking, an image patch is cropped in the new frame, HOG features are computed for that patch, and a response map is computed by multiplying the adaptive filter with the input features in the Fourier domain. The new target position is predicted at the position of the maximum confidence score in the confidence map, obtained by applying the inverse Fourier transform to the response map. A new patch containing the object is then cropped, and HOG features are recomputed to update the CF.
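To illustrate the kernel trick that gives KCF its speed, the sketch below (our single-channel simplification, not the reference implementation) evaluates the Gaussian kernel for all cyclic shifts with one FFT pair, trains via kernel ridge regression in the Fourier domain, and detects at the response peak:

```python
import numpy as np

def gaussian_correlation(x, z, sigma=0.5):
    # kernel value k(x, shifted z) for every cyclic shift, via a single FFT pair
    c = np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(z))))
    d = (np.sum(x ** 2) + np.sum(z ** 2) - 2 * c) / x.size
    return np.exp(-np.maximum(d, 0) / (sigma ** 2))

def kcf_train(x, y, lam=1e-4):
    # alpha_hat = y_hat / (k_hat + lambda): kernel ridge regression, Fourier domain
    k_hat = np.fft.fft2(gaussian_correlation(x, x))
    return np.fft.fft2(y) / (k_hat + lam)

def kcf_detect(alpha_hat, x_model, z_patch):
    k_hat = np.fft.fft2(gaussian_correlation(x_model, z_patch))
    response = np.real(np.fft.ifft2(alpha_hat * k_hat))
    return np.unravel_index(np.argmax(response), response.shape)

x = np.random.rand(64, 64)                             # training patch (feature map)
ys, xs = np.mgrid[0:64, 0:64]
y = np.exp(-((ys - 32) ** 2 + (xs - 32) ** 2) / 8.0)   # Gaussian regression target
pos = kcf_detect(kcf_train(x, y), x, np.roll(x, (2, 4), axis=(0, 1)))
```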

Ma et al. [114] exploited rich hierarchical convolutional features using CF, denoted CF2 (Fig. 3). For every subsequent frame, the search area is cropped at the center of the previously predicted target position. Three hierarchical convolutional features are extracted from the VGGNet layers conv3-4, conv4-4, and conv5-4 to exploit the target appearance. Deep features are resized to the same size using bilinear interpolation. An independent adaptive CF is utilized for each CNN feature map, and response maps are computed. A coarse-to-fine methodology is applied over the set of correlation response maps to estimate the new target position. The adaptive hierarchical CFs are updated at the newly predicted target location. Ma et al. [113] also proposed the Hierarchical Correlation Feature based Tracker (HCFT*), an extension of CF2 that integrates re-detection and scale estimation of the target.

The Hedged Deep Tracking (HDT) [130] algorithm takes advantage of multiple levels of CNN features. In HDT, the authors hedge many weak trackers together to attain a single strong tracker. During tracking, the target position in the previous frame is used to crop a new image and compute six deep feature maps using VGGNet. Each deep feature map is fed to an individual CF to compute a response map; these constitute the weak experts. The target position is estimated by each weak tracker, and the loss for each expert is computed. A standard hedge algorithm is used to estimate the final position: all weak trackers are hedged together into a single strong tracker by applying an adaptive online decision algorithm, and the weight of each weak tracker is updated during online tracking. In the adaptive hedge algorithm, a regret measure is computed for all weak trackers as a weighted average loss, and a stability measure is computed for each expert based on the regret. The hedge algorithm strives to minimize the cumulative regret of the weak trackers based on their historical information.
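A fixed-rate version of the standard hedge update that HDT builds on can be sketched as follows; the paper's adaptive, regret-based rule replaces the constant learning rate eta used here, and all variable names are ours:

```python
import numpy as np

def hedge_step(weights, losses, eta=0.1):
    # experts with larger loss are exponentially down-weighted
    w = weights * np.exp(-eta * losses)
    return w / w.sum()

weights = np.ones(6) / 6                            # six CNN-layer experts
positions = np.random.rand(6, 2) * 100              # each expert's predicted (y, x)
fused = (weights[:, None] * positions).sum(axis=0)  # weighted fusion of predictions
losses = np.linalg.norm(positions - fused, axis=1)  # distance to fused position
weights = hedge_step(weights, losses)
```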

The Long-term Correlation Tracking (LCT) [116] algorithm predicts target translation and scale separately using correlation filters, and a random fern classifier [126] is used for online re-detection of the target during tracking. In LCT, the search window is cropped at the previously estimated target location and a feature map is computed. Translation estimation is performed using adaptive translation correlation filters. A target pyramid is generated at the newly predicted translation position, and scale is estimated using a separate regression correlation model. In case of failure, the LCT tracking algorithm performs re-detection: if the estimated target score is less than a threshold, re-detection is performed using the online random fern classifier. The average response is computed using posteriors from all the ferns, and LCT selects the positive samples to predict the new target patch using a k-nearest-neighbour (KNN) classifier. The authors further improved LCT (ILCT) [115] by using an SVM classifier instead of the fern classifier for re-detection.

The Multi-task Correlation Particle Filter (MCPF) proposed by Zhang et al. [195] employs a particle filter framework. The MCPF shepherds particles in the search region by exploiting all states of the target; it computes response maps of particles, and the target position is estimated as the weighted sum of those response maps. Danelljan et al. [34] proposed Discriminative Scale Space Tracking (DSST) to estimate translation and scale separately by learning independent CFs. Scale estimation is done by learning the target sample at various scale variations. In the proposed framework, translation is first predicted by applying a standard translation filter to every incoming frame. After translation estimation, the target size is approximated by employing the trained scale filter at the target location obtained by the translation filter. In this way, the proposed strategy learns the scale-induced target appearance rather than using exhaustive target-size search methodologies. The authors further improved the computational performance, without sacrificing robustness and accuracy, in fast DSST (fDSST), which employs sub-grid interpolation to compute correlation scores.
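The DSST-style scale search can be sketched as below. The trained scale filter is stubbed out as score_fn, and the scale step a = 1.02 over S = 33 scales is an illustrative configuration, not necessarily the paper's exact one:

```python
import numpy as np

def scale_pyramid_sizes(w, h, S=33, a=1.02):
    exps = np.arange(S) - S // 2
    return [(int(round(w * a ** e)), int(round(h * a ** e))) for e in exps]

def estimate_scale(score_fn, frame, cx, cy, w, h):
    best_score, best_size = -np.inf, (w, h)
    for sw, sh in scale_pyramid_sizes(w, h):
        x0, y0 = max(cx - sw // 2, 0), max(cy - sh // 2, 0)
        patch = frame[y0:y0 + sh, x0:x0 + sw]
        if patch.size == 0:
            continue
        s = score_fn(patch)        # correlation score of the trained scale filter
        if s > best_score:
            best_score, best_size = s, (sw, sh)
    return best_size

frame = np.random.rand(240, 320)
score_fn = lambda p: float(p.mean())   # dummy stand-in for the scale filter
best_w, best_h = estimate_scale(score_fn, frame, 160, 120, 40, 60)
```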

The Sum of Template And Pixel-wise LEarners (STAPLE) [12] employs two separate regression models to solve the tracking problem, utilizing the inherent structure of each target representation. The design takes advantage of two complementary factors from two different patch representations to train a model: HOG features and global colour histograms are used to represent the target. For the colour model, foreground and background regions are computed at the previously estimated location, the frequency of each colour bin for object and background is updated, and a regression model for the colour template is computed; a per-pixel score is then calculated in the search area, and the integral image is used to compute the histogram response. For the HOG template, HOG features are extracted at the position predicted in the previous frame and the CFs are updated. At every incoming frame, a search region centred at the previously estimated location is extracted and its HOG features are convolved with the CF to obtain a dense template response. The target position is estimated from the template and histogram response scores as a linear combination, so the final estimated location is influenced more by the model with the higher score.
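The fusion step itself reduces to a convex combination of the two dense responses. A minimal sketch, with the merge factor treated as an assumed constant:

```python
import numpy as np

def staple_response(template_resp, hist_resp, merge_factor=0.3):
    # convex combination of the HOG-template and colour-histogram responses
    assert template_resp.shape == hist_resp.shape
    return (1.0 - merge_factor) * template_resp + merge_factor * hist_resp

resp = staple_response(np.random.rand(50, 50), np.random.rand(50, 50))
new_pos = np.unravel_index(np.argmax(resp), resp.shape)
```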

The Convolutional RESidual learning for visual Tracking (CREST) algorithm [142] utilizes residual learning [68] to adapt to target appearance changes and also performs scale estimation by searching patches at different scales. During tracking, the search patch is cropped at the previous location and convolutional features are computed. Residual and base mappings are used to compute the response map, and the maximum response value gives the new estimated target position. Scale is estimated by exploring patches at different scales around the newly estimated target centre. The Parallel Tracking And Verifying (PTAV) framework [45] is composed of two major modules, a tracker and a verifier. The tracker module computes real-time inference and estimates tracking results, while the verifier checks whether those results are correct. The Multi-Store tracker (MUSTer) [72] avoids drifting and stabilizes tracking by aggregating image information using short-term and long-term stores, based on the Atkinson-Shiffrin memory model. Short-term storage involves an integrated correlation filter to incorporate spatio-temporal consistency, while long-term storage involves integrated RANSAC estimation and keypoint-match tracking to control the output.

3.1.2 Regularized Correlation Filter Trackers. Discriminative CF (DCF) tracking algorithms are limited in their detection range because they require the filter size and patch size to be equal. The DCF may learn the background for irregularly-shaped target objects. The DCF is formulated from a periodic assumption and learns from a set of training samples, and thus may learn negative training patches. DCF response maps have accurate scores close to the centre, while other scores are influenced by the periodic assumption, degrading DCF performance. Another limitation of DCFs is that they are restricted to a fixed search region. DCF trackers perform poorly under target deformation due to overfitting of the model, which is caused by learning from target training samples while missing the negative samples; thus, the tracker fails to re-detect in case of occlusion. A larger search region may solve the occlusion problem, but the model will then learn background information, which degrades the discriminative power of the tracker. Therefore, there is a need to incorporate regularization to address these DCF limitations; such trackers are classified as Regularized Correlation Filter Trackers (R-CFTs). Several R-CFTs have been proposed, such as [27, 29, 33, 35-38, 56, 73, 77, 97, 110, 121].

Fig. 4. Difference between standard DCF and SRDCF [36].

Danelljan et al. [36] presented the Spatially Regularized DCF (SRDCF) by introducing spatial regularization into DCF learning. During tracking, the regularization component weakens the background information, as shown in Fig. 4. Spatial regularization constrains the filter coefficients based on spatial information: the background is suppressed by assigning higher regularization weights to the coefficients located outside the target region, and vice versa. The SRDCF framework has been updated with deep CNN features in deepSRDCF [35]. The SRDCF framework has also been modified to handle contaminated training samples in SRDCFdecon [37], which down-weights corrupted training samples and favours good-quality samples. SRDCFdecon extracts training samples from previous frames and then assigns higher weights to correct training patches, performing joint adaptation of both the appearance model and the weights of the training samples.
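The effect of spatial regularization can be sketched with a quadratic weight map that is small over the target and grows on the background; SRDCF penalizes the filter coefficients weighted by this map instead of a constant-weight L2 term. The constants below are illustrative, not the paper's tuned values:

```python
import numpy as np

def srdcf_weights(H, W, target_h, target_w, mu=0.1, beta=3.0):
    ys, xs = np.mgrid[0:H, 0:W]
    dy = (ys - H / 2) / (target_h / 2)       # distance in units of target half-size
    dx = (xs - W / 2) / (target_w / 2)
    return mu + beta * (dy ** 2 + dx ** 2)   # low inside the target, high outside

w = srdcf_weights(100, 100, 40, 30)
# regularized loss: ||correlation error||^2 + sum(w * filter_coefficients**2)
```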

Li et al. [97] introduced temporal regularization into SRDCF, yielding the Spatial-Temporal Regularized CF (STRCF). Temporal regularization is induced via passive-aggressive learning applied to SRDCF on a single image. Recently, deep motion features have been used for activity recognition [78]. Motion features are obtained from the optical flow computed between images: a CNN is applied to the optical flow to obtain deep motion features. Gladh et al. [56] presented Deep Motion SRDCF (DMSRDCF), which fuses deep motion features with hand-crafted appearance features using SRDCF as the baseline tracker. Motion features are computed as reported by [25]: optical flow is calculated between each frame and the previous frame; the x and y components and the magnitude of the optical flow constitute three channels of the flow map, which is normalized to the range 0-255 and fed to the CNN to compute deep motion features.

Danelljan et al. [38] proposed learning multi-resolution feature maps, which they name Continuous Convolutional Operators for Tracking (CCOT). The convolutional filters are learned in a continuous sequence of resolutions, which generates a sequence of response maps; these multiple response maps are then fused into a final unified response map to estimate the target position. The Efficient Convolution Operators (ECO) [33] tracking scheme is an improved version of CCOT. CCOT learns a large number of filters to capture the target representation from high-dimensional features and updates the filters for every frame, which involves training on a large number of sample sets. In contrast, ECO constructs a smaller set of filters that efficiently capture the target representation using matrix factorization. CCOT learns over consecutive samples in a sequence, which forgets target appearance over longer periods, causing overfitting to the most recent appearances and leading to high computational cost. In contrast, ECO uses a Gaussian Mixture Model (GMM) to represent diverse target appearances. Whenever a new appearance is found during tracking, a new GMM component is initialized. If the maximum number of components is reached, a GMM component with minimum weight is discarded if its weight is below a threshold; otherwise, the two closest components are merged into one.
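The GMM-based sample management can be sketched as follows. Distances over raw flattened patches stand in for ECO's feature space, and the capacity and weight thresholds are assumptions:

```python
import numpy as np

def update_components(means, weights, sample, K=8, lr=0.01, min_w=0.01):
    means.append(sample)
    weights = [w * (1 - lr) for w in weights] + [lr]     # decay old, add new
    if len(means) > K:
        if min(weights[:-1]) < min_w:                    # drop the weakest component
            i = int(np.argmin(weights[:-1]))
            means.pop(i); weights.pop(i)
        else:                                            # merge the two closest
            dists = [(np.linalg.norm(means[i] - means[j]), i, j)
                     for i in range(len(means)) for j in range(i + 1, len(means))]
            _, i, j = min(dists)
            wi, wj = weights[i], weights[j]
            means[i] = (wi * means[i] + wj * means[j]) / (wi + wj)
            weights[i] = wi + wj
            means.pop(j); weights.pop(j)
    return means, weights

means, weights = [], []
for _ in range(12):                                      # stream of appearances
    means, weights = update_components(means, weights, np.random.rand(16))
```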

The Channel Spatial Reliability DCF (CSRDCF) [110] tracking algorithm integrates channel and spatial reliability with DCF. Training patches contain unwanted background information in addition to the required target information; therefore, DCFs may also learn background information, which may lead to the drift problem. In CSRDCF, spatial reliability is ensured by estimating a spatial binary map at the current target position so that only target information is learned. Foreground and background models, retained as colour histograms, are used to compute the appearance likelihood using Bayes' rule. A constrained CF is obtained by convolving the CF with the spatial reliability map, which indicates which pixels should be ignored. Channel reliability is measured as the product of a channel reliability measure and a detection reliability measure. The channel reliability measure is the maximum response of the channel filter; the channel detection reliability is computed from the ratio between the second and first major modes of the response map, clamped at 0.5. The target position is estimated at the maximum response of the search patch features with the constrained CF, weighted by channel reliability.

Fig. 5. Siamese CNN topologies [90].

Fig. 6. The framework of SiameseFC [13].
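The two channel reliability cues can be sketched as below; taking the second-largest response value as the "second major mode" is a deliberate simplification of the mode estimation described in the paper:

```python
import numpy as np

def channel_reliability(response):
    # response: (H, W) map produced by one channel's filter
    flat = np.sort(response.ravel())
    peak1, peak2 = flat[-1], flat[-2]          # crude stand-in for the two modes
    learn_rel = peak1                          # learning cue: maximum response
    detect_rel = max(1.0 - peak2 / max(peak1, 1e-8), 0.5)  # ratio clamped at 0.5
    return learn_rel * detect_rel

responses = np.random.rand(3, 40, 40)          # e.g. three feature channels
rel = np.array([channel_reliability(r) for r in responses])
rel /= rel.sum()                               # weights for fusing channel responses
```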

Mueller et al. [121] proposed the Context-Aware Correlation Filter tracking (CACF) framework, where global context information is integrated within the Scale Adaptive Multiple Feature (SAMF) tracker [103] as the baseline. The model is improved to compute high responses for the target while yielding responses close to zero for context information. SAMF uses KCF as its baseline and solves the scaling issue by constructing a pool containing the target at different scales; bilinear interpolation is employed to resize the samples in the pool to a fixed-size template. Kiani et al. [77] exploited background patches and proposed the Background-Aware Correlation Filter (BACF) tracker.

The Structuralist Cognitive model for Tracking (SCT) [29] divides the target into several cognitive units. During tracking, the search region is decomposed into a fixed-size grid map, and an individual Attentional Weight Map (AWM) is computed for each grid cell. The AWM is computed from the weighted sum of Attentional Weight Estimators (AWE). The AWE assigns higher weights to target grid cells and lower weights to background grid cells using a Partially Growing Decision Tree (PGDT) [28]. Each unit works as an individual KCF with an Attentional CF (AtCF), having different kernel types with distinct features and a corresponding AWM. The priority and reliability of each unit are computed based on the relative performance among AtCFs and its own performance, respectively. Integration of the response maps of the individual units gives the target position.

Choi et al. [27] proposed the Attentional CF Network (ACFN), which exploits target dynamics based on an attentional mechanism. An ACFN is composed of a CF Network (CFN) and an Attentional Network (AN). The CFN has several tracking modules that compute tracking validation scores as precision; KCF is used for each tracking module with AtCF and AWM. The AN selects tracking modules to learn target dynamics and properties, and is composed of two subnetworks: the prediction subnetwork (PSN) and the selection subnetwork (SSN). Validation scores for all modules are predicted by the PSN, and the SSN chooses the active tracking modules based on the current estimated scores. The target is estimated from the best response among the selected subset of tracking modules.

3.1.3 Siamese-Based Correlation Filter Trackers. A Siamese network joins two inputs and produces a single output. The objective is to determine whether identical objects exist, or not, in the two image patches that are input to the network. The network measures the similarity between the two inputs and has the capability to learn similarity and features jointly. The concept of the Siamese network was initially used for signature verification and fingerprint recognition. Later, Siamese networks were used in various applications, including face recognition and verification [137], stereo matching [180], optical flow [40], large-scale video classification [82], and patch matching [179]. We observe that Siamese architectures using CNNs find the similarity between two images using shared (convolutional and/or fully connected) layers. Siamese CNNs can be categorized into three main groups, based on late merge, intermediate merge, and early merge [90] (Fig. 5):

• Late merge: two image patches are evaluated separately, in parallel, using the same network and combined at the final layer [31].
• Intermediate merge: the network processes the input images separately and then merges the streams into a single stream well before the final layer [40].
• Early merge: the two images are stacked and the unified input is fed to a single CNN.

Trackers integrating CFTs with Siamese networks for visual tracking are classified as Siamese-based CFTs; such trackers have been used to handle various tracking challenges [13, 64, 74, 148, 158].

Siamese Fully Convolutional networks (SiameseFC) [13], shown in Fig. 6, solve the tracking problem using similarity learning that compares an exemplar (target) image with a same-size candidate image and yields a high score if the two images depict the same object. The SiameseFC network is fully convolutional; its output is a scalar-valued score map computed from an exemplar target and a search patch larger than the target predicted in the previous frame. The SiameseFC network utilizes a convolutional embedding function and a correlation layer to integrate the deep feature maps of the target and the search patch. The target position is estimated at the maximum value of the response map, which gives the frame-to-frame target displacement. Valmadre et al. [148] introduced the Correlation Filter Network (CFNet) for end-to-end learning of the underlying feature representations through gradient back-propagation. SiameseFC is used as the base tracker, and CFNet is employed in forward mode for online tracking. During online tracking, CFNet compares the target features with a larger search area in the new frame, centred on the previously estimated target location. A similarity map between the target template and the search patch is produced by calculating the cross-correlation.
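The scoring step reduces to a cross-correlation of the exemplar embedding over the search embedding. The NumPy sketch below uses random feature maps as stand-ins for the shared CNN outputs:

```python
import numpy as np

def xcorr_score(search_feat, exemplar_feat):
    # search_feat: (C, Hs, Ws); exemplar_feat: (C, He, We); 'valid' correlation
    C, Hs, Ws = search_feat.shape
    _, He, We = exemplar_feat.shape
    score = np.zeros((Hs - He + 1, Ws - We + 1))
    for i in range(score.shape[0]):
        for j in range(score.shape[1]):
            window = search_feat[:, i:i + He, j:j + We]
            score[i, j] = np.sum(window * exemplar_feat)
    return score

score = xcorr_score(np.random.rand(64, 22, 22), np.random.rand(64, 6, 6))
dy, dx = np.unravel_index(np.argmax(score), score.shape)  # displacement offset
```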

The Discriminative Correlation Filters Network (DCFNet) [158] utilizes a lightweight CNN with correlation filters to perform tracking using offline training. DCFNet performs back-propagation to adapt the CF layer, utilizing a probability heat map of the target position.

Recently, Guo et al. [64] presented DSiam, which has the potential to reliably learn online temporal appearance variations. DSiam exploits CNN features for the target appearance and the search patch. Contrary to SiameseFC, DSiam learns target appearance and background suppression from the previous frame by introducing regularized linear regression. Target appearance variations are learned from the first frame to the current frame, while background suppression is performed by multiplying the search patch with a learned Gaussian weight map. DSiam performs element-wise deep feature fusion through circular convolution layers that multiply inputs with a weight map. Huang et al. presented the EArly-Stopping Tracker (EAST) [74], which learns policies using deep Reinforcement Learning (RL) to improve speed while maintaining accuracy. The tracking problem is cast as a Markov decision process, and an RL agent makes decisions over multiple scales with an early-stopping criterion.

3.1.4 Part-Based Correlation Filter Trackers. These trackers learn the target appearance in parts, whereas other CFTs learn the target template as a whole. Variations may appear in a sequence not just because of illumination and viewpoint changes, but also due to intra-class variability, background clutter, occlusion, and deformation. For example, an object may appear in front of the object being tracked, or a target may undergo non-rigid appearance variations. Part-based strategies are widely utilized in several applications, including object detection [50], pedestrian detection [129], and face recognition [81]. Several part-based trackers [21, 22, 32, 104, 107, 108, 111] have been developed to handle sequences where targets are occluded or deformed.

Real-time Part-based tracking with Adaptive CFs (RPAC) [108] adds a spatial constraint to each part of the object, as shown in Fig. 7. In RPAC, a KCF tracker is employed to track five parts of a target individually. A confidence score map for each part is computed by assigning adaptive weights during tracking for every new input frame. A joint map is constructed by assigning adaptive weights to the five confidence score maps, and a new target position is estimated using a particle filter method. During tracking, the adaptive weights, or confidence scores, for each part are calculated by computing the sharpness of the response map and the Smooth Constraint of Confidence Map (SCCM). Response sharpness is calculated using the Peak-to-Sidelobe Ratio (PSR), while SCCM is defined by the spatial shift of a part between two consecutive frames. Adaptive part trackers are updated for parts with weights higher than a threshold value. Bayesian inference is employed to compute the target position by calculating the Maximum A Posteriori (MAP) estimate over all parts.

Fig. 7. Each part tracker computes its response map independently. Response maps are combined into a joint confidence map as a weighted sum of the individual maps in the Bayesian framework. Purple rectangles represent the sample candidates, while solid purple denotes the maximum likelihood on the confidence map [108].
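The PSR used to measure response sharpness is straightforward to state in code; the size of the exclusion window around the peak below is a common choice, not necessarily RPAC's:

```python
import numpy as np

def psr(response, exclude=5):
    py, px = np.unravel_index(np.argmax(response), response.shape)
    peak = response[py, px]
    mask = np.ones_like(response, dtype=bool)  # sidelobe = all but a window at peak
    mask[max(py - exclude, 0):py + exclude + 1,
         max(px - exclude, 0):px + exclude + 1] = False
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-8)

r = np.random.rand(40, 40); r[20, 20] = 5.0
print(psr(r))                                  # sharp, isolated peaks score high
```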

The Reliable Patch Tracker (RPT) [104] is based on a particle filter framework that applies KCF as the base tracker for each particle, and exploits local context by tracking the target with reliable patches. During tracking, the weight of each reliable patch is calculated based on whether it is a trackable patch and whether it is a patch with target properties. The PSR score is used to identify trackable patches, while motion information is exploited to estimate the probability that a patch lies on the target. Foreground and background particles are tracked along with the relative trajectories of the particles. A patch is discarded if it is no longer reliable and re-sampled to estimate a new reliable patch. The new target position is estimated with a Hough-voting-like strategy over all the weighted, trackable, and reliable positive patches. Recurrently Target-attending Tracking (RTT) [32] learns the model by discovering and exploiting reliable spatio-temporal parts using DCF. Quad-directional RNNs are employed to identify reliable parts from different angles as long-range contextual cues. Confidence maps from the RNNs are used to weight the adaptive CFs during tracking to suppress the negative effects of background clutter. Patch-based KCF (PKCF) [22] is a particle filter framework that trains target patches using KCF as the base tracker. Adaptive weights, used as confidence measures for all parts, are computed from the PSR score. For every incoming frame, the response for each template patch and its PSR are computed, and the maximum-weighted particles are selected.

The Enhanced Structural Correlation (ESC) tracker [21] exploits holistic and part-based object information. The target is estimated from the weighted responses of the non-occluded parts. A colour-histogram model based on a Bayes classifier is used to suppress the background by assigning higher probability to the object. The background context is enhanced from four different directions and is considered in the histogram model of the object's surroundings. The enhanced image is broken down into patches (one holistic and four local), a CF is applied to all patches, and the final response is obtained as the weighted response of the filters. The weight, a confidence score for each part, is measured from the object likelihood map and the maximum response of the patch. Adaptive CFs are updated for those patches whose confidence scores exceed a threshold. The histogram model of the object is updated if the object's confidence score is greater than a threshold, while the background histogram model is updated on each frame. Zuo et al. [107] proposed the Structural CF (SCF) to exploit, in its dual form, the spatial structure among the parts of an object. The position of each part is estimated at the maximum of the filter response map, and the target is finally estimated from the weighted average of the translations of all parts.

3.1.5 Fusion-based Correlation Filter Trackers. In image fusion, complementary information is fused to improve performance in numerous applications, including medical imaging [14], face recognition [23], image segmentation [151], and image enhancement [47]. Image fusion may be performed as Pixel-Level Fusion (PLF), Feature-Level Fusion (FLF), or Decision-Level Fusion (DLF), as shown in Fig. 8. Numerous trackers using different types of fusion have been developed, including [10, 112, 117, 133, 152].

Fig. 8. Pixel-level, feature-level and decision-level fusion-based tracking framework [112].

Luo et al. [112] used SAMF as a baseline tracker and performed all three types of fusion: visible and infrared images were used for PLF, while HOG, grey-level, and colour-name features were used for FLF. Ma et al. [117] proposed an FLF tracker in which raw pixels, colour histograms, and Haar features were employed. The authors divided the region of interest into sub-parts and evaluated the features individually, fusing them using weighted entropy to exploit the complementary information. Target appearance is updated via a subspace for each kind of feature using new object samples. Wang et al. [152] introduced a deep feature fusion technique using two networks, a Local Detection Network (LDN) and a Global Detection Network (GDN). The LDN employs VGG-16 and fuses the deep features extracted from the conv4-3 and conv5-3 layers to generate a response map: the feature map from conv5-3 is upsampled using a deconvolution layer and added to the conv4-3 features. If the confidence score is less than a threshold, the target is considered lost. The GDN detects the target if the LDN fails to track; it employs a Region Proposal Network (RPN) after conv4-3. The LDN is updated to integrate target variations, while the GDN parameters remain fixed.

Rapuru et al. [133] proposed Correlation based Tracker level Fusion (CTF) by integrating two complementary trackers, Tracking-Learning-Detection (TLD) [80] and KCF. The TLD tracker has resumption ability and is composed of three basic units: tracking, to predict the new target position; detection, to localize the target in the current frame; and learning, to rectify detector errors by learning different target variations. The KCF tracker performs tracking-by-detection and uses non-linear regression kernels; it exploits the circulant structure of the training and testing data, which results in low computational cost. During tracking, if the output of the TLD tracker is valid, the output of the KCF tracker is also computed. The conservative correspondence Cc of each tracker is calculated as a confidence score indicating how well the first 50% of its positive patches resemble the sample patches. The current bounding box (cbb) is the output of the tracker with the maximum Cc score. TLD calculates clusters of positive patches as the TLD detector response (dbb). The final bounding box (fbb) is accepted if the overlap score between cbb and dbb is greater than a threshold and the relative similarity, which depicts the target confidence, is also greater than a threshold; otherwise, the target is considered lost. The fbb is computed as the weighted mean of dbb and cbb.
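The tracker-level decision fusion described above can be sketched as follows; the thresholds and the box-averaging rule are illustrative assumptions rather than the paper's exact logic:

```python
import numpy as np

def iou(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def fuse(box_a, conf_a, box_b, conf_b, overlap_thr=0.5, conf_thr=0.5):
    if conf_a < conf_thr and conf_b < conf_thr:
        return None                                 # target considered lost
    if iou(box_a, box_b) > overlap_thr:             # trackers agree: weighted mean
        wa = conf_a / (conf_a + conf_b)
        return tuple(wa * np.array(box_a) + (1 - wa) * np.array(box_b))
    return box_a if conf_a >= conf_b else box_b     # otherwise trust the stronger

fbb = fuse((10, 10, 50, 60), 0.8, (12, 11, 52, 63), 0.6)
```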

3.2 Non-Correlation Filter Trackers (NCFT)We categorize all trackers which do not employ correlation �lters as Non-Correlation Filter based Trackers(NCFTs). We categorize NCFTs into multiple categories including patch learning, sparsity, superpixel, graph,


multiple-instance-learning, part-based, and Siamese-based trackers. These trackers are discriminative, except sparsity-based trackers, which are generative.

3.2.1 Patch Learning Trackers. Patch learning trackers exploit both target and background patches. A tracker is trained on positive and negative samples. The trained tracker is then tested on a number of candidate samples, and the maximum response gives the target position. Several such trackers have been proposed, including [18, 46, 51–53, 65, 98, 123, 124, 146, 178, 183, 185, 186].

A Multi-Domain Network (MDNet) [123] consists of shared layers (three convolutional and two Fully-Connected (FC) layers) and one domain-specific FC layer, as shown in Fig. 9. The shared layers exploit generic target representations from all the sequences, while the domain-specific layer is responsible for identifying the target using binary classification for a specific sequence. During online tracking, the domain-specific layer is independently learned at the first frame. Samples are generated based on the previous target location, and the maximum positive score yields the new target position. The weights of the three convolutional layers are fixed while the weights of the fully connected layers are updated using short- and long-term updates. The long-term update is performed from positive samples after a fixed interval. The short-term update is employed whenever tracking fails, and the weight adaptation for the FC layers is performed using positive and negative samples from the current short-term interval. The target position is adjusted by employing a bounding box regression model [55] in the subsequent frames.
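The shared-plus-domain-specific split is easy to see in code. Below is a minimal sketch of such a multi-domain architecture in the spirit of MDNet; the exact layer sizes, the use of LazyLinear, and the class name are our own illustrative assumptions, not the authors' implementation.

```python
# Sketch of a multi-domain network: shared conv + FC layers, plus one
# binary-classification FC branch (fc6-k) per training sequence/domain.
import torch
import torch.nn as nn

class MultiDomainNet(nn.Module):
    def __init__(self, num_domains):
        super().__init__()
        # Shared layers: three conv layers followed by two FC layers (fc4, fc5).
        self.shared_conv = nn.Sequential(
            nn.Conv2d(3, 96, 7, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, 5, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 512, 3), nn.ReLU(),
        )
        self.shared_fc = nn.Sequential(
            nn.LazyLinear(512), nn.ReLU(),   # fc4 (input size inferred)
            nn.Linear(512, 512), nn.ReLU(),  # fc5
        )
        # One domain-specific branch per sequence.
        self.domain_fc = nn.ModuleList(
            [nn.Linear(512, 2) for _ in range(num_domains)]
        )

    def forward(self, x, domain):
        feat = self.shared_conv(x).flatten(1)
        feat = self.shared_fc(feat)
        return self.domain_fc[domain](feat)  # target-vs-background scores
```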

A Structure Aware Network (SANet) [46] exploits the target's structural information within a particle filter framework. CNN and RNN deep features are computed for the particles. The RNN encodes the structural information of the target using directed acyclic graphs. SANet is a modified version of MDNet, with the addition of RNN layers to enrich the object representation. Convolutional and recurrent features are fused using a skip concatenation strategy to encode the rich information. Han et al. [65] presented the Branch-Out algorithm, which uses MDNet as a base tracker. The Branch-Out architecture comprises three CNN layers and multiple FC layers as branches; some branches consist of one fully-connected layer, while others have two. During tracking, a random subset of branches is selected by a Bernoulli distribution to learn the target appearance.

Zhang et al. [185] proposed the Convolutional Networks without Training (CNT) tracker, which employs a particle filter framework and exploits the inner geometry and local structural information of the target. CNT is adaptive in the sense that appearance variations of the target are adapted during tracking. It employs a hierarchical architecture with two feed-forward convolutional layers to generate an effective target representation. In CNT, pre-processing is performed on each input image: the image is warped and normalized, and the normalized image is then densely sampled in the first frame as overlapping local image patches of fixed size, also known as filters. After pre-processing, a feature map is generated from a bank of filters selected with the k-means algorithm. Each filter is convolved with the normalized image patch, producing what is known as a simple cell feature map. In the second layer, called the complex cell feature map, a global representation of the target is formed by stacking the simple cell feature maps, which encodes local as well as geometric layout information.
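Since the filter bank is chosen without any training, the feature extraction is short enough to sketch. The patch size, the number of filters k, and the use of scikit-learn k-means and SciPy correlation below are illustrative assumptions, not CNT's exact implementation.

```python
# Sketch of CNT-style features: pick filters via k-means over densely sampled
# patches of the normalized first frame, then convolve to get simple-cell maps.
import numpy as np
from scipy.signal import correlate2d
from sklearn.cluster import KMeans

def cnt_feature_maps(image, patch=6, k=10):
    H, W = image.shape
    # Densely sample overlapping local patches of fixed size.
    patches = np.stack([image[i:i + patch, j:j + patch].ravel()
                        for i in range(H - patch) for j in range(W - patch)])
    # Select k filters as k-means cluster centers (the "filter bank").
    filters = KMeans(n_clusters=k, n_init=5).fit(patches).cluster_centers_
    # Simple-cell feature maps: correlate each filter with the image.
    maps = [correlate2d(image, f.reshape(patch, patch), mode='valid')
            for f in filters]
    # Complex-cell representation: stack the simple-cell maps.
    return np.stack(maps)
```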

Exemplar based Linear Discriminant Analysis (ELDA) [51] employs LDA to discriminate the target from the background. ELDA takes several negative samples from the background and a single positive sample at the current target position. It maintains object and background component models. The object model consists of long-term and short-term models: the target template at the first frame corresponds to the long-term model, while the short-term model corresponds to the target appearance over a short period. The background model comprises online and offline parts; the online part is built from negative samples around the target, while the offline background model is trained on a large number of negative samples from natural images. The ELDA tracker is thus comprised of short- and long-term detectors, and the target location is estimated from the sum of the long-term and the weighted sum of the short-term detection scores. ELDA has been enhanced by integration with a CNN, named the Enhanced CNN Tracker (ECT) [52].

The Biologically Inspired Tracker (BIT) [18] performs tracking like ventral stream processing. The BIT tracking framework consists of an appearance and a tracking model. The appearance model has two units, classical simple



Fig. 9. MDNet architecture [123].


Fig. 10. GOTURN tracking framework [69].

cells (S1) and cortical complex cells (C1). The S1 unit exploits colour and texture information, while the C1 unit pools and combines the color and texture features to form complex cells. A two-layer tracking model is composed of generative and discriminative models: a view-tuned learning (S2) unit and a task-dependent learning (C2) unit. The generative S2 unit computes a response map by convolving the input features with the target, and the response maps are fused via average pooling. The discriminative C2 unit then computes the new target position by applying a CNN classifier. An Action-Decision Network (ADNet) [178] controls sequential actions (translation, scale changes, and a stopping action) for tracking using deep RL. The ADNet is composed of three convolutional and three FC layers, and is defined as an agent whose objective is to find the target bounding box. The agent is pretrained to make decisions about the target's movement from a defined set of actions. During tracking, the target is tracked based on the action estimated by the network at the current tracker location; actions are repeatedly estimated by the agent until a reliable target position is obtained. Under RL, the agent is rewarded when it succeeds in tracking the target; otherwise, it is penalized.

The Oblique Random forest (Obli-Raf) [186] exploits the geometric structure of the target. During tracking, sample patches are drawn as particles, based on the estimated target position in the previous frame, and forwarded to an oblique random forest classifier. Obli-Raf generates hyperplanes from the data particles in a semi-supervised manner to recursively cluster the sample particles using a proximal support vector machine. Particles are classified as target or background based on the votes at each leaf node of the tree, and the particle with the maximum score is considered the newly-predicted target position. If the maximum number of votes is less than a predefined threshold, a new set of particle samples is produced from the estimated target location and the previous model is retained; otherwise, the model is updated.

The Dual Linear Structured Support Vector Machine (DLSSVM) [124] is motivated by Struck [67]. Usually, a classifier is trained to discriminate the target from the background, but Struck employs the kernelized structured output of an SVM for adaptive tracking. Compared to Struck, DLSSVM uses a dual SSVM with linear kernels as a discriminative classifier over explicit high-dimensional features. One difference between the two trackers is the optimization scheme used to update the dual coefficients: DLSSVM employs the Dual Coordinate Descent (DCD) [132] optimization method to compute a closed-form solution. Another difference is that Struck selects and optimizes a pair of dual variables, while DLSSVM selects only one. Scale DLSSVM (SDLSSVM) improves the tracker by incorporating multi-scale estimation.

3.2.2 Multiple-Instance-Learning Based Trackers. Multiple-Instance-Learning (MIL) was introduced by Dietterich and is widely used in many computer vision tasks, for example object detection [182], face detection [61], and action recognition [6]. Various researchers have employed MIL to track targets [1, 9, 139, 161, 166, 168]. In MIL-based tracking, training samples are placed in bags instead of being considered as individual patches, and labels are given at the bag level. A positive label is assigned to a bag if it has at least one positive sample in it, whereas a negative bag contains only negative samples; a positive bag may contain both positive and negative instances. During training in MIL, the labels of instances are unknown but the bag labels are


known. In the MIL tracking framework, instances are used to construct weak classifiers, and a few instances are selected and combined to form a strong classifier.

Babenko et al. [9] designed MILTrack to handle the label ambiguity of instances using Haar features. MILBoost, which employs a gradient boosting algorithm to maximize the log likelihood of bags, is utilized as the baseline tracker. A strong classifier is trained to detect the target by choosing weak classifiers; each weak classifier is computed using the log odds ratio under a Gaussian distribution. Bag probabilities are computed using a Noisy-OR model. An online boosting algorithm is employed to obtain the new target position from the weighted sum of the weak classifiers.
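The Noisy-OR bag model at the core of this formulation is simple to state in code. The following is a minimal sketch, assuming per-instance probabilities are already available from the weak classifiers; the function names are ours.

```python
# Noisy-OR: a bag is positive if at least one instance is positive, so the
# bag probability is one minus the product of per-instance negative probs.
import numpy as np

def bag_probability(instance_probs):
    """instance_probs: p(y=1|x) for each instance in the bag."""
    return 1.0 - np.prod(1.0 - np.asarray(instance_probs))

def bag_log_likelihood(bags, labels):
    """Log likelihood maximized by boosting over positive/negative bags."""
    ll = 0.0
    for probs, y in zip(bags, labels):
        p = bag_probability(probs)
        ll += y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)
    return ll
```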

Xu et al. [166] proposed an MIL framework that uses Fisher information within MILTrack (FMIL) to select weak classifiers. Uncertainty is measured from unlabeled samples using the Fisher information criterion instead of the log likelihood, and an online boosting method is employed for feature selection to maximize the Fisher information of the bag. Abdechiri et al. [1] proposed Chaotic theory in MIL (CMIL). The chaotic representation exploits complex local and global target information. HOG and Distribution Field (DF) features with optimal dimensions are used for target representation, and chaotic approximation is employed to enhance the discriminative ability of the classifier. The significance of an instance is calculated using the position distance and the fractal dimensions of the state space simultaneously. The appearance model, known as the chaotic model, is learned to adapt to the target dynamics through a chaotic map that maximizes the likelihood of the bags. To encode the chaotic information, the state space is reconstructed by converting an image into a vector and normalizing it to zero mean and unit variance. Takens' embedding theorem generates a multi-dimensional space map from a one-dimensional space. The minimal time delay and the embedding dimension are estimated by false nearest neighbours to reduce the dimensionality of the state space reconstruction. Finally, a GMM is imposed to model the state space.

Wang et al. [161] presented Patch based MIL (P-MIL), which decomposes the target into several blocks. MIL is applied to each block, and P-MIL generates strong classifiers for the target blocks. The average of the per-block classification scores is used to detect the whole target. Sharma and Mahapatra [139] proposed an MIL tracker based on maximizing the Classifier ScoRe (CSR) for feature selection. The tracker computes Haar features for the target with a kernel trick, half target space, and a scaling strategy.

Yang et al. [168] used the Mahalanobis distance to compute the significance of instances for the bag probability in an MIL framework, and employed gradient boosting to train the classifiers. Instances are computed using a coarse-to-fine search technique during tracking. The Mahalanobis distance describes the importance relationship between instances and bags. Discriminative weak classifiers are selected based on the maximum margin between negative and positive bags by exploiting an average gradient and average classifier strategy.

3.2.3 Siamese Network Based NCFT Trackers. Siamese network based NCFTs perform tracking using a matching mechanism. The learning process exploits general target appearance variations. Siamese network-based trackers match target templates with candidate samples to yield similarities between patches. Various Siamese-based trackers have been developed, including [20, 30, 69, 145, 160].

Generic Object Tracking Using Regression Networks (GOTURN), proposed by Held et al. [69], exploits object appearance and motion relationships. During tracking, template and search regions are cropped from the previous and current frames respectively, and those crops are padded with context information, as shown in Fig. 10. The target template and search region are each fed to five individual convolutional layers. The deep features from the two separate flows are fused in three shared sequential fully-connected layers. GOTURN is a feed-forward offline tracker that does not require fine-tuning and directly regresses the target location.
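The regression-based design can be summarized in a short sketch. The tower depths, layer widths, and output parameterization below are illustrative assumptions rather than GOTURN's exact configuration.

```python
# Two-branch regression sketch in the spirit of GOTURN: template and search
# crops pass through separate conv towers, features are concatenated, and
# shared FC layers directly regress the bounding box.
import torch
import torch.nn as nn

class GoturnLike(nn.Module):
    def __init__(self):
        super().__init__()
        def tower():
            return nn.Sequential(
                nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
                nn.Conv2d(64, 128, 5), nn.ReLU(), nn.MaxPool2d(3, 2),
                nn.Conv2d(128, 256, 3), nn.ReLU(),
            )
        self.template_tower, self.search_tower = tower(), tower()
        self.regressor = nn.Sequential(      # three shared FC layers
            nn.LazyLinear(1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 4),              # box in search-crop coordinates
        )

    def forward(self, template, search):
        f = torch.cat([self.template_tower(template).flatten(1),
                       self.search_tower(search).flatten(1)], dim=1)
        return self.regressor(f)
```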

Siamese INstance Search (SINT) [145] performs tracking using an offline-learned matching function, finding the best-matched patch between the target template and candidate patches in new frames without updating the matching function. The SINT architecture has two streams: a query stream and a search stream. Each stream consists of five convolutional layers, three region-of-interest pooling layers, one FC layer, and a contrastive loss function layer, which is responsible for discriminating the target from the background when fusing features. During tracking, the target template as



Fig. 11. Superpixel classification.


Fig. 12. BKG flowchart [95].

a query from the initial frame is matched with candidate samples from each frame. The output bounding box is refined using four ridge bounding box regressors trained over the bounding box from the initial frame. Chen and Tao [20] proposed a two-flow CNN tracker called YCNN that is learned end-to-end to calculate a similarity map between the search region and the target patch using shallow and deep features. The YCNN architecture has two flows: an object flow and a search flow. The deep features obtained from the object and search flows, each having three convolutional layers, are concatenated and forwarded to two FC layers to yield a prediction map. The maximum score in the prediction map infers the new target location.

The Reinforced Decision Making (RDM) [30] model is composed of a matching network and a policy network. Prediction heatmaps are generated by the matching network, while the policy network is responsible for producing normalized scores from the prediction heatmaps. During tracking, a cropped search patch along with N target templates is forwarded to two separate convolutional layers, and these deep features are fused using shared FC layers in the matching network to produce prediction maps. Using the prediction maps, the policy network computes normalized scores, and the prediction map with the maximum score gives the estimated target. The policy network contains two convolutional and two FC layers that make decisions about a reliable state using RL.

3.2.4 Superpixel Based Trackers. Superpixels represent groups of pixels having similar pixel values [2]. The region of interest is segmented into superpixels, and classification is performed over the superpixels for discrimination, as shown in Fig. 11. Superpixel-based representations have received much attention in the computer vision community for object recognition [17], human detection [120], activity recognition [163], and image segmentation [2]. Numerous tracking algorithms have been developed using superpixels [75, 95, 154, 155, 167].

Li et al. [95] used BacKGround (BKG) cues in a particle framework for tracking (Fig. 12). The background, excluding the target area, is segmented into superpixels over the previous k frames; these superpixels represent the background. Superpixels are also computed in the current frame and compared with the background superpixels using Euclidean distance over color histograms. The proposed scheme computes a confidence map based on the difference between the target and the background: current-frame superpixels dissimilar to the background superpixels are considered target superpixels.
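The background-cue idea reduces to comparing superpixel color histograms against a background model. A minimal sketch follows, assuming SLIC superpixels and a Euclidean histogram distance; the bin counts and function names are our own, not values from [95].

```python
# Score each superpixel of the current frame by its color-histogram distance
# to background superpixel histograms collected from previous frames.
import numpy as np
from skimage.segmentation import slic

def color_hist(pixels, bins=8):
    h, _ = np.histogramdd(pixels.reshape(-1, 3), bins=(bins,) * 3,
                          range=((0, 256),) * 3)
    return h.ravel() / max(h.sum(), 1)

def background_confidence(frame, bg_hists, n_segments=200):
    labels = slic(frame, n_segments=n_segments)
    conf = np.zeros(frame.shape[:2])
    for sp in np.unique(labels):
        mask = labels == sp
        h = color_hist(frame[mask])
        # Distance to the closest background histogram:
        # a high distance suggests the superpixel belongs to the target.
        conf[mask] = min(np.linalg.norm(h - hb) for hb in bg_hists)
    return conf
```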

Yang et al. [167] also proposed a SuperPixel based Tracker (SPT). Mean shift clustering is performed on superpixels to model the target and background appearance. The similarity of superpixels in the current frame to the target and background models is computed to find the target position. The Constrained Superpixel Tracking (CST) [155] algorithm employs graph labeling using superpixels as nodes and enforces spatial smoothness, temporal smoothness, and appearance fitness constraints. Spatial smoothness is enforced by exploiting the latent manifold structure using unlabeled and labeled superpixels. Optical flow is used for temporal smoothness to impose short-term target appearance, while appearance fitness serves as a long-term appearance model to enforce objectness. Wang et al. [154] presented a Bayesian tracking method with coarse-level and fine-level superpixel appearance models. The coarse-level appearance model computes few superpixels, such that there is only one superpixel in the target bounding box, and a confidence measure defines whether that superpixel corresponds to



Fig. 13. Visualization of (a) target state estimation and (b) model update in Tree Structure CNN [122]. CNN weights are shown using the width of the green arrows for state estimation. Affinity among CNNs is shown by the width of the orange edges for model update, while the reliability of a CNN is indicated by the width of its box.


Fig. 14. ALMM tracking approach [181]. (a) Target detection. (b) Target patches extracted from the input image. (c) Target patches follow a GMM based on the distance of the patches from the center of gravity. (d) ALMM prunes out patches to make a robust estimation.

background or target. The fine-level appearance model calculates more superpixels in the target region based on the target location in the previous frame, and the Jaccard distance is used to find fine superpixels belonging to the target in the current frame. The Structural Superpixel Descriptor (SSD) [75] exploits structural information via superpixels and preserves the intrinsic target properties. It decomposes a target into a hierarchy of different-sized superpixels and assigns greater weights to superpixels closer to the target center. A particle filter framework is used, and background information is alleviated through adaptive patch weighting.

3.2.5 Graph Based Trackers. A graph has vertices (which may be pixels, superpixels, or object parts) and edges (correspondences among vertices). Graphs are used to predict the labels of unlabeled vertices. Graph based algorithms have been successfully used for object detection [49], human activity recognition [100], and face recognition [118]. Generally, graph-based trackers use superpixels as nodes to represent the object appearance, while edges represent the inner geometric structure. Another strategy is to construct graphs among the parts of objects in different frames. Many trackers have been developed using graphs [41, 42, 122, 159, 173].

The Tree structure CNN (TCNN) [122] tracker employs CNNs to model the target appearance in a tree structure. Multiple hierarchical CNN-based target appearance models are used to build a tree where vertices are CNNs and edges are relations among CNNs (Fig. 13). Each path maintains a separate history of the target appearance over an interval. During tracking, candidate samples are cropped at the target location estimated in the last frame. Weighted average scores computed from the multiple CNNs are used to calculate the objectness of each sample, where the reliability of the patches along a CNN's path defines the weight of that CNN in the tree. The maximum score over the multiple CNNs is used to estimate the target location, and a bounding box regression methodology is utilized to refine the estimated target position in the subsequent frames.

Du et al. [41] proposed the Structure Aware Tracker (SAT), which constructs hypergraphs in the temporal domain to exploit higher-order dependencies. SAT gathers candidate parts into a frame buffer by computing superpixels in each frame. A graph cut algorithm is employed to minimize the energy in order to produce the candidate parts. A structure-aware hypergraph is constructed using candidate parts as nodes, while hyperedges denote relationships among parts. Object parts across multiple frames contribute to a subgraph by grouping superpixels considering both appearance and motion consistency. Finally, the target location and boundary are estimated by combining all the target parts using a coarse-to-fine strategy. The Graph tracker (Gracker) [159] uses undirected graphs to model planar objects and exploits the relationships between local parts. The search region is divided into grids, and a graph is constructed where vertices represent cells with maximum SIFT response and edges are constructed using Delaunay triangulation. During tracking, geometric graph matching is performed to explore the optimal correspondence between the graph model and the candidate graph.

A Geometric hyperGraph Tracker (GGT) [42] constructs geometric hypergraphs by exploiting higher-order geometric relationships among target parts. Target parts in the previous frame are matched with candidate parts in


the new frame. The relationship between target and candidate parts is represented by correspondence hypotheses. A geometric hypergraph is constructed from the superpixels, where vertices are correspondence hypotheses and edges constitute the geometric relationships within the hypotheses. Reliable parts are computed from correspondence hypotheses learned from the matched target and candidate part sets; during tracking, reliable parts are extracted with high confidence to predict the target location. An Absorbing Markov Chain Tracker (AMCT) [173] recursively propagates the predicted segmented target to subsequent frames. An absorbing Markov chain has two kinds of states: absorbing and transient. Any state can transition into an absorbing state but, once entered, cannot leave it, while all other states are transient. A graph is constructed between two consecutive frames based on superpixels, where background superpixels represent absorbing states and target superpixels represent transient states. Edge weights are learned with an SVM to distinguish foreground and background superpixels, and motion information is imposed as spatial proximity using inter-frame edges. The target is estimated from the superpixel components after the vertices have been evaluated against an absorption time threshold.

3.2.6 Part-Based NCFT Trackers. Part-based modeling has been actively used in NCFTs to handle deformable parts. Various part-based techniques have been employed for object detection [125], action recognition [43], and face recognition [184]. Object local parts are likewise utilized to model trackers [54, 96, 153, 172, 181].

Adaptive Local Movement Modeling (ALMM) [181] exploits the intrinsic local movements of object parts for tracking. Image parts are estimated using a base tracker such as Struck [66], and a GMM is used to prune out drifted parts. The GMM models the parts' movement based on the displacement of their centers from the global object center (Fig. 14). A weight is assigned to each part based on motion and appearance for better estimation. The target position is estimated by a strong tracker that combines all part trackers in a boosting framework.
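The GMM-based pruning step can be sketched briefly. The number of mixture components and the likelihood cut-off below are illustrative assumptions, not values from [181].

```python
# Fit a GMM to the displacements of part centers from the global object
# center and drop low-likelihood (drifted) parts.
import numpy as np
from sklearn.mixture import GaussianMixture

def prune_drifted_parts(part_centers, object_center, keep_quantile=0.2):
    disp = np.asarray(part_centers) - np.asarray(object_center)  # local movements
    gmm = GaussianMixture(n_components=2).fit(disp)
    scores = gmm.score_samples(disp)            # per-part log-likelihood
    threshold = np.quantile(scores, keep_quantile)
    return scores >= threshold                  # mask of reliable parts
```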

Yao et al. [172] proposed a Part based Tracker (PT) where the object is decomposed into parts and an adaptive weight is assigned to each part. A structural spatial constraint is applied to the parts using a minimum spanning tree, where vertices represent parts and edges define consistent connections. A weight is assigned to each edge corresponding to the Euclidean distance between two parts. Online structured learning using an SVM is performed to distinguish the target and its parts from the background. During tracking, the maximum classification score of the target and its parts is used to estimate the new target position.

Li et al. [96] used local covariance descriptors for the target appearance and exploited the relationships among parts. The target is divided into non-overlapping parts, and a pyramid is constructed with multiple local covariance descriptors that are fused using max pooling to depict the target appearance. Parts are modeled using a star graph whose central node represents the central part of the target. During tracking, target parts are selected from the candidate part pool and template parts by solving a linear programming problem. The target is estimated from the selected parts using a weighted voting mechanism based on the relationship between the center part and the surrounding parts. The Part-based Multi-Graph Ranking Tracker (PMGRT) [153] constructs graphs to rank target parts. During tracking, the target is divided into parts and different features are extracted for each target part. Multiple graphs are constructed over both target parts and feature types, with one graph per feature type. An affinity weight matrix is formed where rows represent the graphs of different features and columns denote the graphs of various parts. An augmented Lagrangian formulation is optimized to select parts associated with high confidence.

3.2.7 Sparsity Based Trackers. All the algorithms studied so far are discriminative tracking methods. Generative methods, on the other hand, learn a target representation and search for the target in each frame with minimal reconstruction error [89]. Sparse representation is a good example of a generative model. Sparse representations are widely used in the computer vision, signal processing, and image processing communities for numerous applications such as face recognition [109], object detection [128], and image classification [135]. The objective is to discover an optimal representation of the target which is sufficiently sparse and minimizes the reconstruction error. Mostly, sparse coding is performed by first learning a dictionary. Assume $X = [x_1, \dots, x_N] \in \mathbb{R}^{m \times N}$ represents grayscale images $x_i \in \mathbb{R}^m$. A dictionary $D = [d_1, \dots, d_k] \in \mathbb{R}^{m \times k}$ is learned on $X$ such that each image in $X$ can be



Fig. 15. (a) Global sparse representation [99], (b) local sparse representation model [76], (c) joint sparse representation model [190], and (d) structural sparse representation model [194].

sparsely represented by a linear combination of the items in $D$: $x_i = D\alpha_i$, where $\alpha_i = [\alpha_1, \dots, \alpha_k]^\top \in \mathbb{R}^k$ denotes the sparse coefficients. When $k > r$, where $r$ is the rank of $X$, the dictionary $D$ is overcomplete. For a known $D$, a constrained minimization using the $\ell_1$-norm is often applied to find a sufficiently sparse solution $\alpha_i$:

$$\alpha_i^* \equiv \arg\min_{\alpha_i} \frac{1}{2} \| x_i - D\alpha_i \|_2^2 + \lambda \| \alpha_i \|_1, \qquad (3)$$

where $\lambda$ gives relative weight to the sparsity versus the reconstruction error. The dictionary $D$ is learned in such a way that all images in $X$ can be sparsely represented with a small error, by solving the following optimization problem:

$$\{\alpha^*, D^*\} \equiv \operatorname*{minimize}_{D,\alpha} \sum_{i=1}^{N} \| x_i - D\alpha_i \|_2^2 + \lambda \| \alpha_i \|_1. \qquad (4)$$

Dictionary learning alternates between two phases. In the initial phase, $D$ is assumed fixed and the coefficients $\alpha$ are calculated; in the second phase, the dictionary $D$ is updated while $\alpha$ is held fixed. In visual object tracking, the objective of dictionary learning is to discriminate between target and background patches by sparsely encoding target and background coefficients. Various sparsity based trackers have been proposed, including [63, 174, 191, 193, 194, 196].
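A minimal sketch of this alternation, under the notation of Eqs. (3) and (4), is given below. The use of ISTA for the sparse coding phase and a least-squares dictionary update are standard choices we assume for illustration; step sizes and iteration counts are likewise illustrative.

```python
# Alternating dictionary learning: sparse coding with D fixed (Eq. 3),
# then a dictionary update with the codes fixed (Eq. 4).
import numpy as np

def ista(D, x, lam, iters=100):
    """Phase 1: solve Eq. (3) for one sample with D fixed."""
    alpha = np.zeros(D.shape[1])
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    for _ in range(iters):
        grad = D.T @ (D @ alpha - x)
        z = alpha - grad / L
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0)  # soft threshold
    return alpha

def learn_dictionary(X, k, lam=0.1, epochs=10):
    m, N = X.shape
    D = np.random.randn(m, k)
    D /= np.linalg.norm(D, axis=0)
    for _ in range(epochs):
        A = np.stack([ista(D, X[:, i], lam) for i in range(N)], axis=1)
        # Phase 2: least-squares dictionary update with the codes fixed.
        D = X @ A.T @ np.linalg.pinv(A @ A.T)
        D /= np.linalg.norm(D, axis=0) + 1e-12
    return D, A
```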

Structural Sparse Tracking (SST) [194] is based on a particle filter framework and exploits the intrinsic relationship between local target patches and the global target to jointly learn the sparse representation. Fig. 15 presents different types of sparse target representations. Fig. 15(a) shows the global sparse representation, which exploits a holistic representation of the target and is optimized using $\ell_1$ minimization [99]. Local sparse representation models represent the target patches sparsely in a local patch dictionary, as shown in Fig. 15(b) [76]. The joint sparse representation exploits the intrinsic correspondence among particles to jointly learn the dictionary [190] (Fig. 15(c)); joint sparsity enforces the particles to be jointly sparse and to share the same dictionary templates. SST exploits all three sparse representations in a unified form, as shown in Fig. 15(d). SST estimates the target from the target dictionary templates and the corresponding patches having the maximum similarity score over all the particles, while preserving the spatial layout structure. The model is constructed on a number of particles representing the target; each target representation is decomposed into patches, and a dictionary is learned. The patch coefficients are learned such that they minimize the patch reconstruction error. SST assumes that the same local patches from all the particles are similar, but this is not always the case in tracking because outliers usually exist. Another problem in SST is that some local patches can select different target templates due to noise or occlusion. Zhang et al. [196] improved SST and proposed Robust Structural Sparse Tracking (RSST), which exploits the correspondence shared by the local patches and also models the outliers caused by noise and occlusion.


Table 1. CFTs and NCFTs, and their characteristics.

Category | Subcategory | Tracker | Features | Baseline tracker | Scale estimation | Offline training | Online learning | FPS | Benchmarks | Publication year
CFTs | Basic | KCF | HOG, raw pixels | - | No | No | Yes | 172 | OTB2013 | TPAMI 2015
CFTs | Basic | CF2 | CNN | KCF | No | No | Yes | 10.4 | OTB2013, OTB2015 | ICCV 2015
CFTs | Basic | HDT | CNN | KCF | No | No | Yes | 10 | OTB2013, OTB2015 | CVPR 2016
CFTs | Basic | LCT | HOG | KCF | Yes | No | Yes | 27.4 | OTB2013 | CVPR 2015
CFTs | Basic | fDSST | HOG | DCF | Yes | No | Yes | 54.3 | OTB2013, VOT2014 | TPAMI 2015
CFTs | Basic | Staple | HOG, color histogram | KCF | Yes | No | Yes | 80 | OTB2013, VOT2014, VOT2015 | CVPR 2016
CFTs | R-CFTs | SRDCF | HOG, CN | - | No | No | Yes | 5 | OTB2013, OTB2015, ALOV++, VOT2014 | ICCV 2015
CFTs | R-CFTs | CCOT | CNN | - | Yes | No | Yes | 0.3 | OTB2015, Temple-Color, VOT2015 | ECCV 2016
CFTs | R-CFTs | ECO | CNN, HOG, CN | CCOT | Yes | No | Yes | 8 | VOT2016, UAV123, OTB2015, TempleColor | CVPR 2017
CFTs | R-CFTs | CSRDCF | HOG, colornames, color histogram | - | Yes | No | Yes | 13 | OTB2015, VOT2015, VOT2016 | CVPR 2017
CFTs | R-CFTs | CACF | HOG | SAMF | Yes | No | Yes | 13 | OTB2013, OTB2015 | CVPR 2017
CFTs | R-CFTs | SCT | HOG, RGB, colornames | KCF | No | No | Yes | 40 | OTB2013 | CVPR 2016
CFTs | Siamese-based | SiameseFC | CNN | - | Yes | Yes | No | 58 | OTB2013, VOT2014, VOT2015, VOT2016 | ECCV 2016
CFTs | Siamese-based | CFNet | CNN | SiameseFC | Yes | Yes | Yes | 43 | OTB2013, OTB2015, VOT2014, VOT2016, TempleColor | CVPR 2017
CFTs | Siamese-based | Dsiam | CNN | - | Yes | Yes | Yes | 45 | OTB2013, VOT2015 | ICCV 2017
CFTs | Siamese-based | EAST | HOG, raw pixels, CNN | - | Yes | Yes | Yes | 23 | OTB2013, OTB2015, VOT2014, VOT2015 | ICCV 2017
CFTs | Part-based | RPAC | HOG | KCF | Yes | No | Yes | 30 | Selected sequences | CVPR 2015
CFTs | Part-based | RPT | HOG | KCF | No | No | Yes | 4 | OTB2013 and 10 selected sequences | CVPR 2015
CFTs | Part-based | RTT | HOG, RNN | - | Yes | Yes | Yes | 3.5 | OTB2013 | CVPR 2016
CFTs | Part-based | PKCF | HOG, CN | KCF | Yes | No | Yes | 0.5 | OTB2013 | Neurocomputing 2016
CFTs | Fusion-based | Ma's | Pixels, color histogram, Haar features | - | Yes | No | Yes | 0.4 | OTB2013 | ICCV 2015
CFTs | Fusion-based | Wang's | CNN | - | Yes | Yes | Yes | 12 | Selected sequences | ICASSP 2017
CFTs | Fusion-based | CTF | HOG, binary features | KCF, TLD | Yes | No | Yes | 25 | ALOV300++, OTB2013, VOT2015 | TIP 2017
Non-CFTs | Patch learning | MDNet | CNN | - | Yes | Yes | Yes | 1 | OTB2013, OTB2015, VOT2014 | CVPR 2016
Non-CFTs | Patch learning | SANet | CNN, RNN | MDNet | Yes | Yes | Yes | 1 | OTB2015, TempleColor, VOT2015 | CVPR 2017
Non-CFTs | Patch learning | CNT | Raw pixels | - | No | No | Yes | 1.5 | OTB2013 | TIP 2016
Non-CFTs | Patch learning | ADNet | CNN | - | Yes | Yes | Yes | 3 | OTB2013, OTB2015, VOT2013, VOT2014, VOT2015, ALOV300++ | CVPR 2017
Non-CFTs | Patch learning | DRLT | CNN, RNN | - | Yes | Yes | No | 45 | OTB2015 | arXiv 2017
Non-CFTs | Patch learning | Obli-Raf | CNN, HOG | - | Yes | No | Yes | 2 | OTB2013, OTB2015 | CVPR 2017
Non-CFTs | Patch learning | SDLSSVM | LRT (Low Rank Transform) | Struck | Yes | No | Yes | 5.4 | OTB2013, OTB2015 | CVPR 2016
Non-CFTs | Patch learning | DeepTrack | CNN | - | Yes | Yes | Yes | 2.5 | OTB2013, VOT2013 | TIP 2016
Non-CFTs | MIL | MILTrack | Haar | MILBoost | Yes | No | Yes | 25 | Selected sequences | TPAMI 2011
Non-CFTs | MIL | FMIL | Haar | - | Yes | No | Yes | 50 | 12 selected sequences | Pattern Recognition 2015
Non-CFTs | MIL | CMIL | HOG, Distribution Field | - | No | No | Yes | 22 | 20 selected sequences | Neurocomputing 2017
Non-CFTs | Sparsity-based | SST | Gray scale | - | Yes | No | Yes | 0.45 | 20 selected sequences | CVPR 2015
Non-CFTs | Sparsity-based | RSST | Gray scale, HOG, CNN | SST | Yes | No | Yes | 1.8 | 40 selected sequences, OTB2013, OTB2015, VOT2014 | TPAMI 2018
Non-CFTs | Sparsity-based | CEST | Pixel color values | - | Yes | No | Yes | 4.77 | 15 selected sequences | T-CYBERNETICS 2016
Non-CFTs | Siamese-based | GOTURN | CNN | - | Yes | Yes | No | 100 | VOT2014 | ECCV 2016
Non-CFTs | Siamese-based | SINT | CNN | - | Yes | Yes | No | - | OTB2013 | CVPR 2016
Non-CFTs | Siamese-based | YCNN | CNN | - | Yes | Yes | No | 45 | OTB2015, VOT2014 | T-CSVT 2017
Non-CFTs | Superpixel | BKG | HSI color histogram | - | Yes | No | Yes | 0.77 | 12 selected sequences | T-CSVT 2014
Non-CFTs | Superpixel | SPT | HSI color histogram | - | Yes | No | Yes | 5 | 12 selected sequences | TIP 2014
Non-CFTs | Graph | TCNN | CNN | - | Yes | Yes | Yes | 1.5 | OTB2013, OTB2015, VOT2015 | arXiv 2016
Non-CFTs | Graph | SAT | HSV color histogram | - | Yes | No | Yes | 0.5 | Deform-SOT | TIP 2016
Non-CFTs | Part-based | ALMM | Haar, raw pixels, histogram features | Struck | No | No | Yes | 40 | OTB2013, 57 selected sequences | T-CSVT 2017
Non-CFTs | Part-based | PT | Haar features | - | No | No | Yes | 2.5 | 13 selected sequences | TPAMI 2017

The Context aware Exclusive Sparse Tracker (CEST) [191] exploits context information utilizing a particle filter framework. CEST represents particles as linear combinations of dictionary elements, where the dictionary is modeled as groups containing target, occlusion and noise, and context templates. In the particle framework, the new target is estimated as the best particle from the target template dictionary. Guo et al. [63] computed weight maps to represent the target and background structure. A reliable structural constraint is imposed using the weight maps by penalizing occluded target pixels. Using a Bayesian filtering framework, the target is estimated by maximum likelihood over the estimated object states of all particles. Yi et al. [174] proposed the Hierarchical Sparse Tracker (HST) to integrate discriminative and generative models. The proposed appearance model is comprised of a Local Histogram Model (LHM), a Sparsity based Discriminant Model (SDM), and Weighted Alignment Pooling (WAP). The LHM encodes the spatial information among target parts, while WAP assigns weights to local patches based on similarities between the target and the candidates. The sparse representation of the target template is computed in the SDM. Finally, the candidate with the maximum combined score from the LHM, WAP, and SDM determines the new target position.

In this section we have studied CFTs and NCFTs, elaborated different tracking frameworks, and classified trackers into different subcategories. Table 1 summarizes the characteristics of important, well-cited trackers from each subcategory.

4 EXPERIMENTS AND ANALYSIS

We have performed exhaustive experiments on three publicly available visual object tracking benchmarks: OTB2013 [164], OTB2015 [165], and TC-128 [105]. We also evaluated the trackers on our newly introduced benchmark, Object Tracking and Temple Color (OTTC), explained in Section 4.1. First, we give a brief introduction of the selected benchmarks, the evaluation protocols, and the trackers selected for comparison. Then, we report a detailed analysis of the experimental study performed over the selected benchmarks and provide our insights and findings. A project page containing benchmark videos and results is available at http://bit.ly/2TV46Ka.

4.1 Benchmarks

OTB2013 [164] contains 50 sequences which are divided into 11 different challenges including Motion Blur (MB), Occlusion (OCC), Deformation (DEF), In-Plane Rotation (IPR), Fast Motion (FM), Low Resolution (LR),


Table 2. Details of different benchmarks.

Benchmarks | OTB2013 | OTB2015 | TC-128 | OTTC | VOT2017
Sequences | 50 | 100 | 128 | 186 | 60
Min frames | 71 | 71 | 71 | 71 | 41
Max frames | 3872 | 3872 | 3872 | 3872 | 1500
Total frames | 29491 | 59040 | 55346 | 92023 | 21356

Table 3. Detailed information of selected trackers including category, features, FPS computed over the OTTC benchmark, implementation details, and resource link. Abbreviations are as follows: Basic (B), Regularized (R), Siamese (S), Part Based (PB), Patch Learning (PL), Intensity Adjusted Histogram (IAH), Pixel Intensity Histogram (PIH), Color Names (CN), Color Histogram (CH), Matlab (M), and MatConvNet (m) [149].

Trackers | Category | Feature | FPS | Implementation | GPU | Resource Link
CSRDCF | R-CFT | HOG, CN, Gray | 8.17 | M | No | https://github.com/alanlukezic/csr-dcf
ECO | R-CFT | CNN, HOG, CN | 6.72 | M+m | Yes | https://github.com/martin-danelljan/ECO
CCOT | R-CFT | CNN | 0.41 | M+m | Yes | https://github.com/martin-danelljan/Continuous-ConvOp
STRCF | R-CFT | HOG, CN, Gray | 19.03 | M | No | https://github.com/lifeng9472/STRCF
MCPF | B-CFT | CNN | 0.15 | M+m | Yes | http://nlpr-web.ia.ac.cn/mmc/homepage/tzzhang/mcpf.html
SRDCF | R-CFT | HOG, CN | 3.78 | M | No | https://www.cvl.isy.liu.se/research/objrec/visualtracking/regvistrack/
DCFNet | S-CFT | CNN | 1.72 | M+m | Yes | https://github.com/foolwood/DCFNet#dcfnet-discriminant-correlation-filters-network-for-visual-tracking
deepSRDCF | R-CFT | CNN | 1.62 | M+m | Yes | https://www.cvl.isy.liu.se/research/objrec/visualtracking/regvistrack/
BACF | R-CFT | HOG | 18.49 | M | No | http://www.hamedkiani.com/bacf.html
SRDCFdecon | R-CFT | HOG | 1.48 | M | No | https://www.cvl.isy.liu.se/research/objrec/visualtracking/decontrack/index.html
CF2 | B-CFT | CNN | 7.01 | M+m | Yes | https://sites.google.com/site/jbhuang0604/publications/cf2
SDLSSVM | PL | IAH, RGB image | 5.92 | M | No | http://www4.comp.polyu.edu.hk/~cslzhang/DLSSVM/DLSSVM.htm
HDT | B-CFT | CNN | 5.68 | M+m | Yes | https://sites.google.com/site/yuankiqi/hdt/
RPT | PB-CFT | HOG | 6.27 | M | No | https://github.com/ihpdep/rpt
ECT | PL | CNN | 0.4 | M+m | Yes | https://sites.google.com/site/changxingao/ecnn
ILCT | B-CFT | HOG, PIH, IAH | 19.29 | M | No | https://sites.google.com/site/chaoma99/cf-lstm
SiameseFC | S-CFT | CNN | 24.8 | M+m | Yes | http://www.robots.ox.ac.uk/~luca/siamese-fc.html
CFNet | S-CFT | CNN | 13.64 | M+m | Yes | http://www.robots.ox.ac.uk/~luca/cfnet.html
STAPLE | B-CFT | HOG, CH | 6.35 | M | No | https://github.com/bertinetto/staple
fDSST | B-CFT | HOG | 65.8 | M | No | http://www.cvl.isy.liu.se/en/research/objrec/visualtracking/scalvistrack/index.html
Obli-Raf | PL | CNN | 1.73 | M+m | Yes | https://sites.google.com/site/zhangleuestc/incremental-oblique-random-forest
KCF | B-CFT | HOG | 80.85 | M | No | http://www.robots.ox.ac.uk/~joao/circulant/
CNT | PL | Image pixels | 0.46 | M | No | http://faculty.ucmerced.edu/mhyang/project/cnt/
BIT | PL | Gabor, CN | 37.02 | M | No | http://caibolun.github.io/BIT/index.html

Scale Variation (SV), Background Clutter (BC), Out-of-View (OV), Illumination Variation (IV), and Out-of-Plane Rotation (OPR). OTB2015 [165] is an extended version of OTB2013 consisting of 100 sequences and covering the same 11 challenges. TC-128 [105] is another benchmark comprising 128 sequences distributed over the same 11 challenges. We combined all the sequences from OTB2015 and TC-128 to form a more challenging benchmark, named Object Tracking and Temple Color (OTTC). The new benchmark is the union of the unique sequences from OTB2015 and TC-128, avoiding repetitions, and contains 186 sequences distributed over the 11 challenges. Tracking performance varies depending upon the number of sequences and the length of each sequence; since OTB2015 and TC-128 share 42 common sequences, a benchmark containing each sequence only once was needed to evaluate tracking performance in a more comprehensive way. Table 2 shows details of these benchmarks. We also report results over the Visual Object Tracking (VOT2017) [88] benchmark, which covers only five challenges: size change, occlusion, motion change, illumination change, and camera motion. Note that these challenges are covered more elaborately by the OTTC benchmark, though under slightly different challenge names. Nevertheless, VOT is an important benchmark due to its inclusion of very small target tracking and IR sequences.

4.2 Evaluation Protocols

We employed three metrics for comparison: precision, success, and speed. We evaluated the robustness of the tracking algorithms using the traditional One Pass Evaluation (OPE) technique. Precision and success plots are drawn to examine the performance of the trackers. For precision, the Euclidean distance is computed between the estimated center and the ground-truth center as $\delta_{gp} = \sqrt{(x_g - x_p)^2 + (y_g - y_p)^2}$, where $(x_g, y_g)$ represents the ground-truth center location and $(x_p, y_p)$ is the predicted target center position in a frame. If $\delta_{gp}$ is less than a threshold, that frame is considered successful. In the precision plot, the threshold on


$\delta_{gp}$ is fixed at 20 pixels. Precision does not give a clear picture of the estimated target size and shape because the center position error only quantifies the pixel difference; therefore, a more robust measure known as success is often used. For success, an Overlap Score (OS) between the ground-truth and estimated bounding boxes is calculated. Let $r_g$ be the ground-truth bounding box and $r_t$ be the target bounding box. The overlap score is defined as $os = |r_t \cap r_g| / |r_t \cup r_g|$, where $\cap$ and $\cup$ denote the intersection and union of the two regions respectively, and $|\cdot|$ represents the number of pixels. The OS is used to determine whether a tracking algorithm has successfully tracked a target in a frame: frames with an $os$ score greater than a threshold $t_0$ are referred to as successful frames. In the success plot, the threshold $t_0$ is varied between 0 and 1, producing the resultant curves; for evaluation, the OS threshold $t_0$ is set to 0.5. In this work we report average success by computing the average OS over all the frames in a benchmark, and similarly report average precision over all the frames in a benchmark. We report the precision and success curves as Area Under the Curve (AUC). We also report the speed of the trackers in Frames Per Second (FPS), averaged over all the sequences in a benchmark.
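These two measures are straightforward to compute per frame. The sketch below assumes axis-aligned boxes in (x, y, w, h) form; it illustrates the definitions above rather than reproducing any benchmark toolkit's code.

```python
# Per-frame center error and overlap score, plus OPE-style aggregates.
import numpy as np

def center_error(gt, pred):
    gx, gy = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
    px, py = pred[0] + pred[2] / 2, pred[1] + pred[3] / 2
    return np.hypot(gx - px, gy - py)           # delta_gp

def overlap_score(gt, pred):
    x1, y1 = max(gt[0], pred[0]), max(gt[1], pred[1])
    x2 = min(gt[0] + gt[2], pred[0] + pred[2])
    y2 = min(gt[1] + gt[3], pred[1] + pred[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = gt[2] * gt[3] + pred[2] * pred[3] - inter
    return inter / union if union > 0 else 0.0  # os

def ope_metrics(gts, preds, dist_th=20, os_th=0.5):
    errs = np.array([center_error(g, p) for g, p in zip(gts, preds)])
    ious = np.array([overlap_score(g, p) for g, p in zip(gts, preds)])
    precision = np.mean(errs <= dist_th)   # fraction within 20 pixels
    success = np.mean(ious >= os_th)       # fraction above OS threshold
    return precision, success
```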

4.3 Tracking Algorithms

We have selected 24 trackers proposed over the last 4 years that have error-free, easily executable code (Table 3). The selected trackers are CSRDCF [110], STRCF [97], deepSRDCF [35], CF2 [114], DCFNet [158], BACF [77], ECO [33], CCOT [38], CFNet [148], CNT [185], KCF [71], HDT [130], ECT [52], Obli-Raf [186], MCPF [195], SiameseFC [13], SRDCF [36], SRDCFdecon [37], STAPLE [12], fDSST [34], DLSSVM [124], ILCT [115], RPT [104], and BIT [18]. Moreover, the selected trackers are popular among the research community and are often compared on publicly available benchmarks. Most of the trackers do not require any pre-training and are easy to execute without technical difficulty; SiameseFC, CFNet, and DCFNet require pre-training but were tested without further offline training. All codes were executed using the default parameters suggested by the original authors. We present both quantitative and qualitative analyses of all the compared trackers. Most of the compared trackers are from the CFT category; in the future, if we find executable codes, especially for Non-CFTs, we will upload results to our project page.

4.4 Quantitative Evaluation

4.4.1 Precision and Success. The performance of the trackers is reported in terms of average precision and success

(Fig. 16) on four benchmarks: OTB2013, OTB2015, TC-128, and OTTC. The performance of all the compared trackers in terms of average precision is shown in Fig. 16 (a)-(d). On OTB2013, six trackers, CSRDCF, ECO, CCOT, STRCF, MCPF, and SRDCF, obtained an average precision of more than 80%, while the performance of the remaining trackers degraded (Fig. 16 (a)). Similarly, over OTB2015, the same six trackers obtained a precision score of 85% or above; these six trackers, except MCPF, are regularized CFTs, while MCPF is a basic CFT. Four trackers, BACF, SRDCFdecon, DCFNet, and deepSRDCF, obtained a precision score of 80% or above, whereas the remaining 14 trackers achieved less than 80% average precision. On the TC-128 benchmark, the majority of the trackers showed degraded performance (average precision less than 75%), while only four trackers, ECO, CCOT, MCPF, and SRDCF, achieved relatively better performance, {78.3%, 77.1%, 76.9%, and 75.7%} respectively (Fig. 16 (c)). The OTTC benchmark proved to be the most challenging, with only three trackers, ECO, CCOT, and MCPF, achieving more than 80% average precision (Fig. 16 (d)).

The success plots for the four benchmarks are shown in Fig. 16 (e)-(h). Only four trackers, ECO, CSRDCF, CCOT, and STRCF, obtained average OS values of {63.6%, 62.6%, 61.8%, and 60.5%} on OTB2013, while the remaining 20 trackers showed a degraded average OS of less than 60.0% (Fig. 16 (e)); the best performing trackers are regularized CFTs. On OTB2015 (Fig. 16 (f)), 10 trackers attained significant improvement in terms of average OS (more than 60.0%) while the remaining trackers showed degraded performance. In contrast, none of the compared trackers achieved an average OS of more than 60.0% on the TC-128 benchmark (Fig. 16 (g)). In the case of



Fig. 16. Precision plots for the 24 selected trackers over (a) OTB2013, (b) OTB2015, (c) TC-128, and (d) OTTC benchmarks. Success plots over (e) OTB2013, (f) OTB2015, (g) TC-128, and (h) OTTC benchmarks are also shown.

the OTTC benchmark, only ECO and CCOT performed better (OS 61.5% and 60.1%), while the other compared trackers achieved less than 60% OS (Fig. 16 (h)). Overall, ECO obtained the first rank on all four benchmarks in both precision and success. CNT obtained the lowest rank on both OTB2013 and OTB2015, while SiameseFC attained the lowest rank on the TC-128 and OTTC benchmarks (Figs. 16 (e)-(h)). ECO and CCOT showed the best performance because of the contextual information preserved by using multi-resolution deep features. Most of the best performing trackers are regularized CFTs.

We observe that a better performing tracker on one benchmark may not maintain its ranking on other benchmarks. For example, CSRDCF and CCOT change their rankings across benchmarks: CSRDCF performed better than CCOT on OTB2013, while CCOT performed better on the other three benchmarks (Fig. 16). The number of sequences and the distribution of challenges in a benchmark may change the ranking of a tracker; as OTB2015, TC-128, and OTTC have more sequences, rankings may change significantly. Tracking performance is also affected by the number of frames in a sequence. Trackers with a poor target update can only perform well over shorter sequences; if a sequence has more frames, performance on that sequence may degrade due to noisy updates of the tracker.

4.4.2 Features based Comparison. VOT requires a rich representation of the target appearance based on different types of features. We classify the aforementioned 24 trackers into two categories. The first category comprises trackers based on HandCrafted (HC) features, such as CSRDCF [110], STRCF [97], SRDCF [36], SRDCFdecon [37], BACF [77], DLSSVM [124], STAPLE [12], ILCT [115], RPT [104], BIT [18], fDSST [34], KCF [71], and CNT [185]. The second category consists of deep feature based trackers, including



Fig. 17. Precision and success plots for handcrafted (HC) and deep features over OTB2015 and TC-128 benchmarks.

Table 4. Precision for HC and deep trackers on different challenges over the OTTC benchmark (columns STRCF through CNT are handcrafted trackers; ECO through CFNet are deep trackers).

Challenges | STRCF | SRDCF | CSRDCF | SDLSSVM | SRDCFdecon | BACF | STAPLE | fDSST | BIT | ILCT | RPT | KCF | CNT | ECO | CCOT | MCPF | deepSRDCF | DCFNet | CF2 | HDT | ECT | Obli-Raf | SiameseFC | CFNet
FM | 71.3 | 69.9 | 68.4 | 65.9 | 62.1 | 64.2 | 62.6 | 58.3 | 58.0 | 58.2 | 58.2 | 53.3 | 41.0 | 74.6 | 75.2 | 72.5 | 70.0 | 69.0 | 66.0 | 61.8 | 60.7 | 55.0 | 40.1 | 38.9
BC | 79.0 | 83.0 | 80.3 | 73.3 | 70.9 | 73.2 | 70.9 | 70.5 | 69.0 | 66.9 | 70.2 | 63.3 | 58.7 | 85.7 | 82.6 | 81.5 | 74.4 | 74.4 | 69.7 | 68.4 | 69.5 | 65.0 | 40.4 | 52.0
MB | 74.2 | 73.9 | 72.4 | 67.4 | 67.4 | 62.4 | 62.7 | 56.2 | 56.7 | 58.9 | 58.7 | 52.8 | 39.7 | 78.0 | 80.6 | 72.5 | 73.7 | 68.5 | 66.9 | 61.0 | 60.4 | 50.2 | 42.2 | 42.8
DEF | 83.7 | 82.2 | 83.0 | 72.2 | 70.0 | 74.9 | 73.6 | 57.5 | 63.5 | 67.4 | 67.9 | 60.5 | 53.7 | 84.2 | 83.6 | 80.1 | 75.1 | 75.1 | 70.7 | 69.7 | 71.6 | 58.4 | 49.1 | 53.2
IV | 77.3 | 75.3 | 79.0 | 68.3 | 72.7 | 74.0 | 72.4 | 65.3 | 65.3 | 69.4 | 71.4 | 64.4 | 51.1 | 82.0 | 81.4 | 80.4 | 74.7 | 73.5 | 72.2 | 69.6 | 68.9 | 69.8 | 52.3 | 55.6
IPR | 70.8 | 73.1 | 71.8 | 66.5 | 65.3 | 66.0 | 65.3 | 57.5 | 61.5 | 61.0 | 62.7 | 57.5 | 47.3 | 76.8 | 75.1 | 79.3 | 69.9 | 70.0 | 67.1 | 62.4 | 65.7 | 59.4 | 44.8 | 47.2
LR | 75.9 | 77.3 | 77.7 | 72.4 | 60.1 | 63.4 | 57.0 | 50.8 | 51.9 | 46.1 | 54.7 | 48.0 | 62.2 | 82.1 | 83.1 | 75.0 | 83.1 | 66.4 | 58.6 | 51.3 | 66.3 | 65.5 | 42.9 | 45.1
OCC | 70.3 | 71.9 | 71.0 | 65.1 | 63.6 | 60.5 | 60.9 | 49.1 | 57.0 | 56.4 | 56.1 | 49.3 | 49.7 | 77.3 | 74.5 | 77.1 | 68.1 | 67.5 | 61.5 | 59.0 | 61.5 | 51.1 | 41.3 | 42.6
OPR | 75.9 | 74.7 | 74.5 | 71.0 | 66.6 | 65.7 | 66.2 | 54.5 | 63.4 | 61.9 | 62.7 | 56.4 | 51.4 | 80.0 | 78.0 | 77.8 | 68.5 | 70.7 | 68.0 | 62.2 | 66.8 | 59.3 | 47.1 | 48.6
OV | 71.6 | 67.6 | 75.2 | 58.9 | 60.7 | 61.6 | 58.7 | 47.2 | 46.9 | 51.3 | 49.9 | 41.3 | 42.7 | 77.9 | 81.8 | 68.2 | 66.3 | 69.6 | 51.9 | 49.5 | 57.0 | 47.6 | 46.1 | 39.0
SC | 79.0 | 76.9 | 77.1 | 69.8 | 72.2 | 70.0 | 68.3 | 58.8 | 60.4 | 60.8 | 64.4 | 55.8 | 54.7 | 81.0 | 81.4 | 81.7 | 73.1 | 72.7 | 66.2 | 64.7 | 67.1 | 63.5 | 47.6 | 51.1
Overall | 79.1 | 79.6 | 77.6 | 73.2 | 70.8 | 70.9 | 70.4 | 61.7 | 64.8 | 64.8 | 66.6 | 59.9 | 56.7 | 82.0 | 81.1 | 80.3 | 75.4 | 74.0 | 69.3 | 67.2 | 68.7 | 63.0 | 43.5 | 48.1

deepSRDCF [35], CF2 [114], DCFNet [158], ECO [33], CCOT [38], CFNet [148], HDT [130], ECT [52], Obli-Raf [186], MCPF [195], and SiameseFC [13]. We used the default feature settings proposed by the original authors.

Figs. 17 (a)-(d) present the performance of the HC and deep feature based trackers using average precision and success over the OTB2015 and TC-128 benchmarks. On the OTB2015 benchmark, CSRDCF and STRCF outperformed the other HC trackers in terms of precision {89.4% and 86.4%} and success {65.4% and 65.0%}, as shown in Fig. 17 (a). Among the deep trackers, ECO and CCOT attained the best performance with precision {89.9% and 89.6%} and success {68.0% and 66.7%} (Fig. 17 (b)). Likewise, over the TC-128 benchmark, SRDCF and STRCF performed well in precision and success among the HC trackers (Fig. 17 (c)), while ECO and CCOT maintained the highest performance among the deep trackers (Fig. 17 (d)).

4.5 Challenges-based Analysis

Most trackers cannot exhibit excellent performance on all the tracking challenges. The challenges included in this study are Fast Motion (FM), Motion Blur (MB), Occlusion (OCC), Deformation (DEF), Illumination Variation (IV), Low Resolution (LR), In-Plane Rotation (IPR), Out-of-Plane Rotation (OPR), Out-of-View (OV), Background Clutter (BC), and Scale Variation (SV). A challenge-based evaluation of tracker performance has been performed in terms of precision (Table 4) and success (Figs. 18, 19) over the OTTC benchmark.

In the fast motion and motion blur challenges, the target appearance is blurred by target or camera motion. In these challenges, STRCF and SRDCF performed the best among the HC trackers, while CCOT and ECO handled these challenges most successfully among the deep trackers; indeed, CCOT and ECO achieved the best performance on these challenges among all the trackers (Table 4 and Figs. 18, 19). In fast motion sequences, the target position


Fig. 18. Success plots of HC trackers for eleven object tracking challenges over OTTC benchmark.

Fig. 19. Success plots of deep trackers for eleven object tracking challenges over OTTC benchmark.


changes rapidly, and the target is tracked by exploring a large search space. For motion blur, where the target suffers significant appearance variations due to target or camera motion, tracking is performed by exploiting useful features that assist in localizing the target. An overview of all the trackers included in this study reveals that multi-resolution feature maps are computed only by ECO and CCOT, which may be considered a significant contributor to the performance of these trackers on these challenges. The other trackers were unable to efficiently handle the fast movement of the target or camera in the video.

In the deformation challenge, the target exhibits variations in shape and orientation, while in the occlusion challenge the target hides partially or fully behind other objects in the scene. Tracking performance is highly influenced by these challenges. Usually, trackers absorb background information as noise when the target undergoes appearance variation, which leads to the drift problem: the tracked bounding box gradually moves away from the actual target bounding box and starts tracking unwanted objects. In these challenges, the performance of the trackers may be improved by handling the drift problem efficiently. SRDCF, CSRDCF, and STRCF handled these challenges best among the HC trackers, while ECO, CCOT, and MCPF showed better performance among the deep trackers. The best performing ECO tracker handled these challenges efficiently by computing a Gaussian Mixture Model for each target appearance. ECO also uses a reduced number of features compared to CCOT, dropping some of the weak features which may correspond to undesired target regions under occlusion, and thereby obtaining better performance.

In some sequences of the OTTC benchmark, the illumination variation and low resolution challenges appear simultaneously, posing difficulties for most of the trackers. To address the illumination challenge, a tracker needs to maintain a target model as well as a local background model. In low resolution videos, the target appearance is also low resolution; this problem can be handled by utilizing stronger target features. CSRDCF, STRCF, and SRDCF handled these challenges more appropriately than the other HC trackers. Among the deep trackers, ECO, deepSRDCF, and CCOT showed improved performance over both challenges. It is noted that ECO is best in terms of success, while CCOT and deepSRDCF performed best in terms of precision compared to both the HC and the deep trackers. For low resolution targets, efficient feature extraction is difficult, thus deep features help to improve performance. Both ECO and CCOT also take advantage of multi-resolution deep features for improved performance.

The Out-of-View (OV) challenge presented one of the most severe difficulties for the majority of the trackers; it can be handled by maintaining useful target samples to re-detect the target after failure. Among HC trackers, CSRDCF, STRCF, and SRDCF showed better performance over OV sequences in terms of precision, while STRCF showed significant improvement in terms of success. On the other hand, the deep trackers CCOT and ECO best handled the OV challenge with respect to both precision and success, assisted by multiple convolution operators that learn multi-resolution features. Other challenges such as In-Plane Rotation (IPR) and Out-of-Plane Rotation (OPR) are handled by rotation-invariant features and by keeping multiple target representations. Over the group of IPR sequences, SRDCF, CSRDCF, and STRCF exhibited the best performance among HC trackers, while among deep trackers ECO, MCPF, and CCOT obtained better performance. Similarly, STRCF and SRDCF tackled the OPR challenge efficiently in the HC category, while ECO, CCOT, and MCPF performed well among deep trackers. We observe that, overall, ECO and CCOT were significantly effective over the IPR and OPR challenges.

The background clutter and scale variation challenges are also difficult to handle in visual object tracking. In background clutter videos, the target appearance matches the background texture, while under scale variation the target exhibits significant changes in size. In the HC category, STRCF, SRDCF, and CSRDCF handled both background clutter and scale variation well and hence obtained the best precision and success. ECO, CCOT, and MCPF best handled these challenges among the deep trackers. Overall, ECO is the best performing deep tracker on these challenges and showed improved performance by taking advantage of factorized convolution operators and by updating its GMM components.


Fig. 20. Qualitative analysis of HC trackers on challenging sequences (from top to bottom: Bolt, CarScale, FaceOcc1, Jogging-1, Matrix, and Skiing). Compared trackers: Ground Truth, CSRDCF, fDSST, ILCT, RPT, CNT, BACF, SRDCFdecon, STAPLE, SRDCF, BIT, STRCF, KCF, and SDLSSVM.

Fig. 21. Qualitative analysis of deep trackers on challenging sequences (from left to right: Bolt, CarScale, FaceOcc1, Jogging-1, Matrix, and Skiing). Compared trackers: Ground Truth, ECO, CCOT, HDT, SiameseFC, CFNet, DCFNet, ECT, MCPF, CF2, DeepSRDCF, and Obli-Raf.

It is worth noting that the best performing trackers in each challenge belong to the Regularized-CFTs category of the proposed hierarchy.

4.6 Qualitative Evaluation

For the qualitative comparison, we selected six sequences, Bolt, CarScale, FaceOcc1, Jogging-1, Matrix, and Skiing, which together cover all the tracking challenges.

Figure 20 demonstrates the qualitative comparison for HC trackers. For the Bolt sequence, most of the trackers, such as CSRDCF, STRCF, and SRDCF, tracked the player successfully, while only four trackers (SRDCFdecon, RPT, fDSST, and CNT) were not able to track the player throughout the sequence. In the CarScale sequence, the majority of the trackers were not able to completely track the car on the road because of the scale variation. FaceOcc1 presents the challenge of partial occlusion, where a woman rotates a book around her face. In the Jogging-1 sequence, occlusion by the pole presented a major challenge for the STAPLE, fDSST, KCF, RPT, ILCT, and CNT trackers, while the remaining trackers successfully tracked the person occluded by the pole. The lighting variation and fast motion challenges in the Matrix and Skiing sequences degraded the performance of most of the trackers, excluding CSRDCF and SDLSSVM in Skiing and CSRDCF, fDSST, and SDLSSVM in Matrix.

A qualitative study of deep trackers has also been performed to analyze visual tracking performance. In the Bolt sequence, Obli-Raf, SiameseFC, and deepSRDCF missed the player and placed their bounding boxes far from the runner, while ECO, CCOT, HDT, CF2, CFNet, DCFNet, MCPF, and ECT turned out to be successful. For the scale variation sequence CarScale, none of the trackers handled scaling efficiently; however, ECO, DCFNet, Obli-Raf, and SiameseFC tracked the major parts of the vehicle. Over the partial occlusion sequence FaceOcc1, all trackers succeeded in tracking the face except CFNet, which achieved some success but whose performance eventually fell off. For the complete occlusion sequence Jogging-1, HDT, DCFNet, and Obli-Raf presented degraded performance. In the challenging Matrix sequence, where it is raining, HDT, DCFNet, CF2, MCPF, ECT, Obli-Raf, SiameseFC, and CFNet exhibited degraded performance and only ECO, CCOT, and deepSRDCF succeeded. For a challenging sequence like Skiing, ECO, CCOT, MCPF, SiameseFC, and DCFNet accomplished the tracking task successfully compared to the other deep trackers.


Table 5. Comparison of HC and deep trackers using Robustness (R), Accuracy (A), and Expected Average Overlap (EAO) measures for baseline and real-time experiments over VOT2017.

Tracker        |       Baseline        |       Realtime
               | EAO     A       R     | EAO     A       R
HC trackers
CSRDCF         | 0.256   0.491   0.356 | 0.099   0.477   1.054
STAPLE         | 0.169   0.530   0.688 | 0.170   0.530   0.688
KCF            | 0.135   0.447   0.773 | 0.134   0.445   0.782
SRDCF          | 0.119   0.490   0.974 | 0.058   0.377   1.999
DSST           | 0.079   0.395   1.452 | 0.077   0.396   1.480
Deep trackers
CF2            | 0.286   0.509   0.281 | 0.059   0.339   1.723
ECO            | 0.280   0.483   0.276 | 0.078   0.449   1.466
CCOT           | 0.267   0.494   0.318 | 0.058   0.326   1.461
SiameseFC      | 0.188   0.502   0.585 | 0.182   0.502   0.504
MCPF           | 0.248   0.510   0.427 | 0.060   0.325   1.489

Fig. 22. The AR plot (left), the EAO curves (middle), and the EAO plot (right) for the VOT2017 baseline experiment.

In short, the qualitative study revealed CSRDCF and SRDCF as the best HC trackers, while ECO and CCOT showed efficient performance among the deep trackers. Additionally, it is noted that these tracking algorithms employ spatial regularization within discriminative correlation filters.

4.7 Comparison over VOT2017

We have also reported results over the VOT2017 [88] benchmark, comparing the performance of various trackers. In the VOT toolkit, if a tracker loses the target, it is re-initialized using the ground-truth bounding box; this procedure is referred to as the reset property of the VOT toolkit. It should be noted that no reset is performed in the OTB toolkit. In addition to the reset property, the VOT toolkit requires multiple runs of the same tracker on the same video to evaluate its performance, while the OTB toolkit performs one-pass evaluation. For performance comparison, VOT uses three primary measures: Accuracy (A), Robustness (R), and Expected Average Overlap (EAO). Accuracy is the average overlap score between ground-truth and predicted bounding boxes during successful tracking intervals. Robustness is reported as the number of times a tracker failed and required a reset. Stochastic tracking algorithms are executed multiple times for each sequence. The VOT toolkit reduces the potential bias introduced by the reset property in the accuracy measure by discarding 10 frames after each re-initialization. Averaging robustness and accuracy over multiple runs on a single sequence gives the per-sequence robustness and accuracy. EAO measures the expected overlap score computed for typical short-term sequence lengths over an interval, by averaging the scores of the expected average overlap curve (see [86] for details). The results shown in Table 5 are drawn from [88]. Tracking results are shown for baseline and real-time experiments, separately for handcrafted and deep feature based trackers. Fig. 22 shows the results of the baseline experiment. In the real-time experiments of the VOT toolkit, a tracker is initialized at the first frame and waits for the output bounding box to be computed and the next frame to be read from memory.


Fig. 23. Speed analysis of HC and deep trackers in frames per second.

If the next frame becomes available before the computation of the previous bounding box completes, that bounding box is also output for the next frame. In the baseline experiments, a bounding box is computed for each frame. In our analysis, we have included the trackers whose results are reported in [88] and which overlap with the trackers shown in Table 3. Among the compared trackers, CSRDCF performed best among HC trackers on the EAO and R measures, while STAPLE performed best on the A measure in the baseline experiments. On the other hand, for the real-time experiments, STAPLE showed significantly better performance among HC trackers on all measures. For deep trackers in the baseline experiments, CF2 exhibited the best results with EAO = 0.286, MCPF was best with A = 0.510, and ECO was best with R = 0.276. For the real-time experiments, SiameseFC showed the best performance in terms of all measures.
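A hedged sketch of the per-sequence accuracy and robustness computation described above is given below; `overlaps` and `fail_frames` are hypothetical inputs produced by a reset-based evaluation loop, and the 10-frame burn-in matches the bias-reduction rule mentioned earlier.

```python
# Per-sequence VOT-style Accuracy (A) and Robustness (R), as described above.
import numpy as np

def vot_accuracy_robustness(overlaps, fail_frames, burn_in=10):
    """overlaps: per-frame IoU with ground truth; fail_frames: frame indices
    where the tracker failed and was re-initialized from ground truth."""
    overlaps = np.asarray(overlaps, dtype=float)
    valid = np.ones(len(overlaps), dtype=bool)
    for f in fail_frames:
        valid[f:f + burn_in] = False      # discard burn-in frames after a reset
    accuracy = float(np.mean(overlaps[valid]))
    robustness = len(fail_frames)         # raw failure count for this sequence
    return accuracy, robustness
```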

4.8 Speed Comparison

An efficient tracker must track the target accurately and robustly at low computational cost. For real-world applications, computational complexity plays a vital role in the selection of a specific tracker. Tracking model complexity and target update frequency are the fundamental factors affecting the speed of a tracker. We report the average speed of the HC and deep feature based trackers over the OTB2013, OTB2015, TC-128, and OTTC benchmarks (Fig. 23). For all the compared trackers, we used the default parameters given by the original authors. For a fair speed comparison, all experiments were performed on the same computer with an Intel Core i5 CPU at 3.40 GHz and 8 GB RAM; a GeForce GTX 650 GPU was used for the deep trackers. Overall, HC trackers require less computation than deep trackers, which often spend considerable time on feature extraction in each frame. The online model update at each frame in deep trackers also incurs a high computational cost. KCF is the fastest of all the compared trackers. The speed comparison demonstrates that the HC trackers KCF, fDSST, and BIT process more frames per second than the fastest deep tracker, SiameseFC. Some HC trackers, including SRDCF, SRDCFdecon, and CNT, have a high computational cost and therefore run at less than 5 FPS. Similarly, the deep trackers CCOT, MCPF, deepSRDCF, DCFNet, ECT, and Obli-Raf are also computationally complex, with speeds below 5 FPS, as shown in Fig. 23.
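The FPS figures above boil down to frames processed per unit of wall-clock time. The following sketch shows one way to measure this; `tracker.update` is a hypothetical per-frame interface standing in for any tracker's API.

```python
# Measure average tracking speed in frames per second (FPS).
import time

def measure_fps(tracker, frames):
    start = time.perf_counter()
    for frame in frames:
        tracker.update(frame)             # hypothetical per-frame tracking call
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```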

4.9 Comparison of Tracking Performance over Different Features

We evaluated the performance of ECO over OTB2015 to get a better understanding of HC and deep features, as shown in Table 6. We compared HOG and Color Names (CN) as HC features with deep features obtained from VGGNet, and also evaluated different feature combinations. HC features are obtained as predefined image properties using different algorithms; for example, HOG is a feature extractor designed to count the occurrences of gradient orientations in a specific region of an image. Deep features, in contrast, are able to learn the

ACM Computing Surveys, Vol. 0, No. 0, Article 0. Publication date: 0000.

Page 30: Kyungpook National University arXiv:1812.07368v2 …Mustansar Fiaz, Arif Mahmood, Sajid Javed, and Soon Ki Jung. 0000. Handcra›ed and Deep Trackers: Recent Visual Object Tracking

0:30 • Fiaz et al.

Table 6. Tracking performance comparison of the ECO algorithm using different types of features.

           HOG    CN     HOG+CN Conv1  Conv2  Conv3  Conv4  Conv5  Conv12 Conv123 Conv1234 Conv12345 HOG+Conv15
Precision  78.5   58.3   80.6   69.8   77.6   62.1   69.5   73.0   81.5   74.4    73.6     73.4      90.0
Success    59.9   43.4   61.1   51.6   54.6   42.5   46.5   47.1   61.5   55.7    55.4     55.0      68.2
FPS        51.50  14.11  23.77  21.33  20.76  17.54  17.13  16.80  11.88  6.84    5.76     4.25      6.58

global as well as local information from images that is most suitable for the required task. The information is learned automatically by the network while optimizing a given objective function. Multiple convolution operations are performed by sliding customized kernels that are learned over the training data. Each filter computes a different image attribute relevant to the tracking task, and each convolution layer computes features at a different level, providing multiple levels of abstraction: earlier layers preserve low-level spatial information while later layers retain high-level semantic information [114]. Deep features computed by the earlier layers may therefore be considered analogous to HC features, provided the HC feature is well suited to the challenge at hand. In our experiment, deep features are extracted from each VGG layer, represented as Convi, where i is the layer index. Deep features are also extracted from more than one VGG layer; for example, Conv123 features are extracted from layers 1, 2, and 3. We observe that Conv12 obtained the best performance compared to HC-only and single-layer deep features. However, the combination of HOG and deep features, i.e., HOG+Conv15, exhibited the overall best performance. The increase in accuracy shows that HC features carry some information that is not encoded by the deep features; therefore, the fusion of both types of features was able to obtain improved accuracy. For real-time applications, computation time is very important. In general, deep features take more execution time than HC features. Thus, there is a trade-off between accuracy and computational complexity: more accurate feature combinations may take more computation time.
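As a rough illustration of the feature combinations in Table 6, the sketch below extracts HOG with scikit-image and spatially pooled activations from two layers of a pre-trained VGG-16 via torchvision, then concatenates them in the spirit of HOG+Conv15. The layer indices, patch size, and pooling are illustrative assumptions, not the exact ECO configuration (which, among other things, keeps the spatial feature maps rather than pooling them).

```python
# Illustrative HOG + deep feature fusion for a single target patch.
import numpy as np
import torch
from torchvision import models
from skimage.feature import hog
from skimage.transform import resize

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()

def deep_features(patch, layers=(2, 29)):        # roughly Conv1- and Conv5-level
    x = torch.from_numpy(patch).permute(2, 0, 1)[None].float() / 255.0
    feats = []
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in layers:                      # spatially pooled activations
                feats.append(x.squeeze(0).mean(dim=(1, 2)).numpy())
    return np.concatenate(feats)

def fused_features(patch):
    hc = hog(resize(patch, (64, 64)), channel_axis=-1)    # handcrafted part
    return np.concatenate([hc, deep_features(patch)])     # HOG + two conv levels
```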

4.10 Findings and Recommendations

In this work we have explored the tracking problem through Correlation Filter based Trackers (CFTs) and Non-CFTs proposed over the past few years, and we believe that it remains an open problem. We also performed an extensive analysis of state-of-the-art trackers with respect to handcrafted and deep features. Qualitative and quantitative evaluations reveal that, at a broader level, CFTs have shown better performance than Non-CFTs. CFTs use both handcrafted and deep features. At a finer level, we observe that regularized CFTs exhibit excellent performance within this category; we therefore recommend the use of Regularized-CFTs because of their potential for further improvement. Spatial and temporal regularization has been used in Discriminative Correlation Filters (DCF) to improve tracking performance.

Our study concludes that it is necessary to learn efficient discriminative features preserving geometric, structural, and spatial target information. Structural information encodes appearance variations, geometric information captures the shape, and spatial information encodes the location of different parts. Deep convolutional features encode low-level spatial and high-level semantic information, which is vital for precise target positioning, while HC features encode less semantic information. Thus, an efficient fusion of HC and deep features spanning low-level and high-level information captures invariant complex features from the geometry and structure of targets and enhances tracking performance.

The performance of trackers can be improved by including temporal information along with spatial information. Recurrent Neural Network (RNN) models capture the temporal relationship among sequential images. Although trackers using RNN models, including RTT [32] and SANet [46], have been proposed to integrate temporal information, the improvement in results is not significant. RNNs have not yet been explored much by the tracking community, so it remains unclear how much performance improvement can be obtained by employing RNN-like architectures; this may be a possible future research direction, sketched below.
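The sketch below shows the bare idea: per-frame appearance features pass through an LSTM whose hidden state carries the target's history into the current-frame box estimate. All sizes are illustrative; this is not the RTT or SANet architecture.

```python
# Toy temporal head: an LSTM over per-frame features predicting a box.
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 4)            # predict (x, y, w, h)

    def forward(self, feats):                     # feats: (batch, time, feat_dim)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])                # box estimate for the last frame

box = TemporalHead()(torch.randn(1, 10, 256))     # a 10-frame feature sequence
```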

Learning based tracking algorithms suffer from the lack of available training data: usually the object bounding box is available only in the first frame. Recently, zero-shot and one-shot learning have been studied to alleviate the data


limitation problem, and this is also a new direction yet to be explored in the tracking domain. In the tracking-by-detection framework, online updating on the current frame with limited positive samples can lead to over-fitting; for example, if the target is partially occluded at the detection stage, a tracker can learn a noisy target shape. For long-term tracking, positive target patches capturing all target shape variations are required to learn the complete target appearance. Generative Adversarial Networks (GAN) [58] have the ability to produce realistic images; recently, Song et al. [143] employed a GAN in their VIsual Tracking via Adversarial Learning (VITAL) algorithm. The inclusion of GANs and reinforcement learning in tracking frameworks can also effectively improve tracking performance and is a promising future direction. In short, a tracker's ability to efficiently learn the target's shape, appearance, and geometry, as well as its temporal variations, is vital for efficient tracking. Our experimental study reveals that more accurate trackers often have a high computational cost; therefore, it is necessary to obtain high tracking speeds without losing accuracy.

5 CONCLUSIONS

In this study, a survey of recent visual object tracking algorithms is performed; most of these algorithms were published during the last four years. The trackers are classified into two main groups, CFTs and Non-CFTs, and each category is further organized into different classes based on methodology and framework. This paper provides interested readers with an organized overview of the diverse tracking area as well as its trends and proposed frameworks. It will help readers find research gaps and provides insights for developing better tracking algorithms. This study also enables readers to select appropriate trackers for specific real-world applications, keeping in view the performance of the trackers over different challenges. Four benchmarks, OTB50, OTB100, TC-128, and the newly proposed OTTC, are used for the performance comparison of 24 algorithms using the precision and success measures and the execution time of each tracker. This study concludes that regularized CFTs have yielded better performance than the others. Spatial and temporal regularizations that emphasize object information and suppress the background in DCFs further enhance tracking performance. Deep features have the ability to encode both low-level and high-level information compared to handcrafted features. Efficient transfer learning that improves accuracy and robustness while addressing the limitation of training data will be the next step forward.

6 ACKNOWLEDGMENTS

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2016R1A2B1015101).

REFERENCES

[1] Marjan Abdechiri, Karim Faez, and Hamidreza Amindavar. 2017. Visual object tracking with online weighted chaotic multiple instance learning. Neurocomputing 247 (2017), 16–30.
[2] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Susstrunk. 2012. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE T-PAMI 34, 11 (2012), 2274–2282.
[3] Amit Adam, Ehud Rivlin, and Ilan Shimshoni. 2006. Robust fragments-based tracking using the integral histogram. In CVPR. IEEE.
[4] Jake K Aggarwal and Lu Xia. 2014. Human activity recognition from 3d data: A review. PRL (2014).
[5] Ahmad Ali, Abdul Jalil, Jianwei Niu, Xiaoke Zhao, Saima Rathore, Javed Ahmed, and Muhammad Aksam Iftikhar. 2016. Visual object tracking: classical and contemporary approaches. FCS 10, 1 (2016), 167–188.
[6] Saad Ali and Mubarak Shah. 2010. Human action recognition in videos using kinematic features and multiple instance learning. IEEE T-PAMI 32, 2 (2010), 288–303.
[7] M Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. 2002. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE T-SP 50, 2 (2002), 174–188.
[8] Boris Babenko, M.H Yang, and Serge Belongie. 2009. Visual tracking with online multiple instance learning. In CVPR. IEEE, 983–990.
[9] Boris Babenko, M.H Yang, and Serge Belongie. 2011. Robust object tracking with online multiple instance learning. IEEE T-PAMI 33, 8 (2011), 1619–1632.
[10] Bing Bai, Bineng Zhong, Gu Ouyang, Pengfei Wang, Xin Liu, Ziyi Chen, and Cheng Wang. 2018. Kernel correlation filters for visual tracking with adaptive fusion of heterogeneous cues. Neurocomputing 286 (2018), 109–120.
[11] Jerome Berclaz, Francois Fleuret, Engin Turetken, and Pascal Fua. 2011. Multiple object tracking using k-shortest paths optimization. IEEE T-PAMI 33, 9 (2011), 1806–1819.
[12] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. Torr. 2016. Staple: Complementary learners for real-time tracking. In CVPR. IEEE.
[13] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. 2016. Fully-convolutional siamese networks for object tracking. In ECCVW. Springer, 850–865.
[14] RS Blum and Z. Liu. 2006. Multi-Sensor Image Fusion and Its Applications. Special series on Sig. Proc. and Comm.
[15] D. S Bolme, J R. Beveridge, B. A Draper, and Y. M Lui. 2010. Visual object tracking using adaptive correlation filters. In CVPR. IEEE.
[16] David S Bolme, Bruce A Draper, and J Ross Beveridge. 2009. Average of synthetic exact filters. In CVPR. IEEE, 2105–2112.
[17] Brekhna Brekhna, Arif Mahmood, Yuanfeng Zhou, and Caiming Zhang. 2017. Robustness analysis of superpixel algorithms to image blur, additive Gaussian noise, and impulse noise. SPIE EI 26 (2017), 61604.
[18] B. Cai, X. Xu, X. Xing, K. Jia, J. Miao, and D. Tao. 2016. BIT: Biologically inspired tracker. IEEE T-IP 25, 3 (2016), 1327–1339.
[19] Kevin Cannons. 2008. A review of visual tracking. Dept. Comput. Sci. Eng., York Univ., Toronto, Canada, Tech. Rep. CSE-2008-07 (2008).
[20] Kai Chen and Wenbing Tao. 2017. Once for all: a two-flow convolutional neural network for visual tracking. IEEE T-CSVT (2017).
[21] K. Chen, W. Tao, and S. Han. 2017. Visual object tracking via enhanced structural correlation filter. Elsevier Inf. Sci. 394 (2017), 232–245.
[22] Wei Chen, Kaihua Zhang, and Qingshan Liu. 2016. Robust visual tracking via patch based kernel correlation filters with adaptive multiple feature ensemble. Neurocomputing 214 (2016), 607–617.
[23] X. Chen, P.J Flynn, and K.W Bowyer. 2006. Fusion of infrared and range data: Multi-modal face images. In ICB. Springer, 55–63.
[24] Z. Chen, Z. Hong, and D. Tao. 2015. An experimental survey on correlation filter-based tracking. CoRR abs/1509.05520 (2015).
[25] G. Cheron, I. Laptev, and C. Schmid. 2015. P-CNN: Pose-based CNN features for action recognition. In ICCV. IEEE, 3218–3226.
[26] Zhizhen Chi, Hongyang Li, Huchuan Lu, and M.H Yang. 2017. Dual deep network for visual tracking. IEEE T-IP 26, 4 (2017), 2005–2015.
[27] J Choi, HJ Chang, S Yun, T Fischer, Y Demiris, and JY Choi. 2017. Attentional correlation filter network for adaptive visual tracking. In CVPR. IEEE, 4828–4837.
[28] Jongwon Choi and Jin Young Choi. 2015. User interactive segmentation with partially growing random forest. In ICIP. IEEE.
[29] Jongwon Choi, Hyung Jin Chang, Jiyeoup Jeong, Yiannis Demiris, and Jin Young Choi. 2016. Visual tracking using attention-modulated disintegration and integration. In CVPR. IEEE, 4321–4330.
[30] J. Choi, J. Kwon, and K.M Lee. 2017. Visual tracking by reinforced decision making. arXiv preprint arXiv:1702.06291 (2017).
[31] S. Chopra, R. Hadsell, and Y. LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. In CVPR. IEEE.
[32] Zhen Cui, Shengtao Xiao, Jiashi Feng, and Shuicheng Yan. 2016. Recurrently target-attending tracking. In CVPR. IEEE, 1449–1458.
[33] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. 2017. ECO: Efficient Convolution Operators for Tracking. In CVPR. IEEE, 6931–6939.
[34] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. 2017. Discriminative scale space tracking. IEEE T-PAMI 39 (2017), 1561–1575.
[35] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. 2015. Convolutional features for correlation filter based tracking. In ICCVW. IEEE.
[36] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. 2015. Learning spatially regularized correlation filters for tracking. In ICCV. IEEE.
[37] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. 2016. Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking. In CVPR. IEEE, 1430–1438.
[38] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. 2016. Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking. In ECCV. Springer.
[39] M. Danelljan, F. S. Khan, M. Felsberg, and J. V. de Weijer. 2014. Adaptive color attributes for real-time visual tracking. In CVPR. IEEE.
[40] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. v. d. Smagt, D. Cremers, and T. Brox. 2015. Flownet: Learning optical flow with convolutional networks. In ICCV. IEEE, 2758–2766.
[41] Dawei Du, Honggang Qi, Wenbo Li, Longyin Wen, Qingming Huang, and Siwei Lyu. 2016. Online deformable object tracking based on structure-aware hyper-graph. T-IP 25, 8 (2016), 3572–3584.
[42] Dawei Du, Honggang Qi, Longyin Wen, Qi Tian, Qingming Huang, and Siwei Lyu. 2017. Geometric hypergraph learning for visual tracking. IEEE T-C 47, 12 (2017), 4182–4195.
[43] Yong Du, Yun Fu, and Liang Wang. 2016. Representation learning of temporal dynamics for skeleton-based action recognition. IEEE T-IP 25, 7 (2016), 3010–3022.
[44] Leila Essannouni, Elhassane Ibn-Elhaj, and Driss Aboutajdine. 2006. Fast cross-spectral image registration using new robust correlation. Springer JRTIP 1, 2 (2006), 123–129.
[45] H. Fan and H. Ling. 2017. Parallel Tracking and Verifying: A Framework for Real-Time and High Accuracy Visual Tracking. (2017).
[46] Heng Fan and Haibin Ling. 2017. SANet: Structure-Aware Network for Visual Tracking. CVPRW (2017), 2217–2224.
[47] M Farid, A Mahmood, and S AlMaadeed. 2019. Multi-focus image fusion using Content Adaptive Blurring. Inf Fusion 45 (2019), 96–112.
[48] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. 2010. Object detection with discriminatively trained part-based models. IEEE T-PAMI 32, 9 (2010), 1627–1645.
[49] Idir Filali, Mohand Said Allili, and Nadjia Benblidia. 2016. Multi-scale salient object detection using graph ranking and global-local saliency refinement. Elsevier Signal Process Image 47 (2016), 380–401.
[50] David Forsyth. 2014. Object detection with discriminatively trained part-based models. Computer 47, 2 (2014), 6–7.
[51] C. Gao, F. Chen, JG Yu, R. Huang, and N. Sang. 2017. Robust tracking using exemplar-based detectors. IEEE T-CSVT 27, 2 (2017), 300–312.
[52] C. Gao, H. Shi, JG Yu, and N. Sang. 2016. Enhancement of ELDA Based on CNN and Adaptive Model Update. Sensors 16, 4 (2016), 545.
[53] Junyu Gao, Tianzhu Zhang, Xiaoshan Yang, and Changsheng Xu. 2017. Deep relative tracking. IEEE T-IP 26, 4 (2017), 1845–1858.
[54] J. Gao, T. Zhang, X. Yang, and C. Xu. 2018. P2t: Part-to-target tracking via deep regression learning. IEEE TIP 27, 6 (2018), 3074–3086.
[55] R. Girshick, J. Donahue, T. Darrell, and J. Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR. IEEE, 580–587.
[56] S. Gladh, M. Danelljan, F. S. Khan, and M. Felsberg. 2016. Deep motion features for visual tracking. In ICPR. IEEE, 1243–1248.
[57] H. Gong, J. Sim, M. Likhachev, and J. Shi. 2011. Multi-hypothesis motion planning for visual object tracking. In ICCV. IEEE, 619–626.
[58] I. Goodfellow, J. P. Abadie, M. Mirza, B. Xu, D. W. Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. Generative adversarial nets. In NIPS.
[59] A. Graves, AR Mohamed, and G Hinton. 2013. Speech recognition with deep recurrent neural networks. In ICASSP. IEEE, 6645–6649.
[60] S. Gu, Y. Zheng, and C. Tomasi. 2010. Efficient visual object tracking with online nearest neighbor classifier. In ACCV. Springer.
[61] M. Guillaumin, J. J. Verbeek, and C. Schmid. 2010. Multiple Instance Metric Learning from Automatically Labeled Bags of Faces. In ECCV. Springer, 634–647.
[62] E. Gundogdu, A. Koc, B. Solmaz, R. I Hammoud, and A. A Alatan. 2016. Evaluation of feature channels for correlation-filter-based visual object tracking in infrared spectrum. In CVPRW. IEEE, 290–298.
[63] Jie Guo, Tingfa Xu, Ziyi Shen, and Guokai Shi. 2017. Visual Tracking Via Sparse Representation With Reliable Structure Constraint. IEEE SPL 24, 2 (2017), 146–150.
[64] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang. 2017. Learning dynamic Siamese network for visual tracking. In ICCV. IEEE.
[65] B. Han, J. Sim, and H. Adam. 2017. Branchout: regularization for online ensemble tracking with CNN. In CVPR. IEEE, 521–530.
[66] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M. M Cheng, S. L Hicks, and Philip HS Torr. 2016. Struck: Structured output tracking with kernels. IEEE T-PAMI 38 (2016), 2096–2109.
[67] Sam Hare, Amir Saffari, and Philip HS Torr. 2011. Struck: Structured output tracking with kernels. In ICCV.
[68] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In CVPR. IEEE, 770–778.
[69] D. Held, S. Thrun, and S. Savarese. 2016. Learning to track at 100 fps with deep regression networks. In ECCV. Springer, 749–765.
[70] J. F Henriques, R. Caseiro, P. Martins, and J. Batista. 2012. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV. Springer, 702–715.
[71] Joao F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. 2015. High-speed tracking with kernelized correlation filters. IEEE T-PAMI 37, 3 (2015), 583–596.
[72] Zhibin Hong, Zhe Chen, Chaohui Wang, Xue Mei, Danil Prokhorov, and Dacheng Tao. 2015. Multi-store tracker (muster): A cognitive psychology inspired approach to object tracking. In CVPR. IEEE, 749–758.
[73] Hongwei Hu, Bo Ma, Jianbing Shen, and Ling Shao. 2017. Manifold Regularized Correlation Object Tracking. IEEE T-NNLS (2017).
[74] C. Huang, S. Lucey, and D. Ramanan. 2017. Learning Policies for Adaptive Tracking with Deep Feature Cascades. In ICCV. IEEE.
[75] W. Huang, R. Hu, C. Liang, W. Ruan, and B. Luo. 2017. Structural superpixel descriptor for visual tracking. In IJCNN. IEEE, 3146–3152.
[76] X. Jia, H. Lu, and M.H Yang. 2012. Visual tracking via adaptive structural local sparse appearance model. In CVPR. IEEE, 1822–1829.
[77] H. K. Galoogahi, A. Fagg, and S. Lucey. 2017. Learning Background-Aware Correlation Filters for Visual Tracking. In ICCV. IEEE.
[78] H. K. Galoogahi, T. Sim, and S. Lucey. 2015. Correlation filters with limited boundaries. In CVPR. IEEE, 4630–4638.
[79] Z. Kalal, J. Matas, and K. Mikolajczyk. 2010. Pn learning: Bootstrapping binary classifiers by structural constraints. In CVPR. IEEE.
[80] Zdenek Kalal, Krystian Mikolajczyk, Jiri Matas, et al. 2012. Tracking-learning-detection. IEEE T-PAMI 34, 7 (2012), 1409.
[81] P. Karczmarek, A. Kiersztyn, W. Pedrycz, and M. Dolecki. 2017. An application of chain code-based local descriptor and its extension to face recognition. PR 65 (2017), 26–34.
[82] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. F. Fei. 2014. Large-scale video classification with convolutional neural networks. In CVPR. IEEE, 1725–1732.
[83] Zulfiqar Hasan Khan, Irene Yu-Hua Gu, and Andrew G Backhouse. 2011. Robust visual object tracking using multi-mode anisotropic mean shift and particle filters. IEEE T-CSVT 21, 1 (2011), 74–87.
[84] Matej Kristan, Roman Pflugfelder, Ales Leonardis, Jiri Matas, Fatih Porikli, Luka Cehovin, Georg Nebehay, Gustavo Fernandez, Tomas Vojir, Adam Gatt, et al. 2013. The Visual Object Tracking VOT2013 challenge results. In ICCVW. IEEE, 98–111.
[85] Matej Kristan, Roman Pflugfelder, Ales Leonardis, Jiri Matas, Fatih Porikli, Luka Cehovin, Georg Nebehay, Gustavo Fernandez, Tomas Vojir, Adam Gatt, et al. 2014. The Visual Object Tracking VOT2014 challenge results. (2014), 191–217.
[86] Matej Kristan, Roman Pflugfelder, Ales Leonardis, Jiri Matas, Fatih Porikli, Luka Cehovin, Georg Nebehay, Gustavo Fernandez, Tomas Vojir, Adam Gatt, et al. 2015. The Visual Object Tracking VOT2015 challenge results. (2015), 564–586.
[87] Matej Kristan, Roman Pflugfelder, Ales Leonardis, Jiri Matas, Fatih Porikli, Luka Cehovin, Georg Nebehay, Gustavo Fernandez, Tomas Vojir, Adam Gatt, et al. 2016. The Visual Object Tracking VOT2016 challenge results. (2016).
[88] Matej Kristan, Roman Pflugfelder, Ales Leonardis, Jiri Matas, Fatih Porikli, Luka Cehovin, Georg Nebehay, Gustavo Fernandez, Tomas Vojir, Adam Gatt, et al. 2017. The Visual Object Tracking VOT2017 challenge results. (2017), 1949–1972.
[89] Junseok Kwon and Kyoung Mu Lee. 2010. Visual tracking decomposition. In CVPR. IEEE, 1269–1276.
[90] L. Leal-Taixe, C. C Ferrer, and K. Schindler. 2016. Learning by tracking: Siamese cnn for robust target association. In CVPRW. IEEE, 33–40.
[91] L. Leal-Taixe, A. Milan, K. Schindler, D. Cremers, Ian Reid, and S. Roth. 2017. Tracking the Trackers: An Analysis of the State of the Art in Multiple Object Tracking. CoRR abs/1704.02781 (2017).
[92] V. A Laurense, J. Y Goh, and J C. Gerdes. 2017. Path-tracking for autonomous vehicles at the limit of friction. In ACC. IEEE, 5586–5591.
[93] I. Leang, S. Herbin, B. Girard, and J. Droulez. 2018. On-line fusion of trackers for single-object tracking. PR 74 (2018), 459–473.
[94] J. Lee, B. K Iwana, S. Ide, and S. Uchida. 2016. Globally Optimal Object Tracking with Fully Convolutional Networks. CoRR (2016).
[95] Annan Li and Shuicheng Yan. 2014. Object tracking with only background cues. IEEE T-CSVT 24, 11 (2014), 1911–1919.
[96] Fu Li, Xu Jia, Cheng Xiang, and Huchuan Lu. 2017. Visual tracking with structured patch-based model. IVC 60 (2017), 124–133.
[97] F. Li, C. Tian, W. Zuo, L. Zhang, and MH Yang. 2018. Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking. In CVPR. IEEE.
[98] Hanxi Li, Yi Li, and Fatih Porikli. 2016. Deeptrack: Learning discriminative feature representations online for robust visual tracking. IEEE T-IP 25, 4 (2016), 1834–1848.
[99] Hanxi Li, Chunhua Shen, and Qinfeng Shi. 2011. Real-time visual tracking using compressive sensing. In CVPR. IEEE, 1305–1312.
[100] Meng Li and Howard Leung. 2016. Multiview skeletal interaction recognition using active joint interaction graph. IEEE T-M 18, 11 (2016), 2293–2302.
[101] P. Li, D. Wang, L. Wang, and H. Lu. 2018. Deep visual tracking: Review and experimental comparison. PR 76 (2018), 323–338.
[102] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, and A. V Den Hengel. 2013. A survey of appearance models in visual object tracking. ACM T-IST 4, 4 (2013), 58.
[103] Y. Li and J. Zhu. 2014. A Scale Adaptive Kernel Correlation Filter Tracker with Feature Integration. In ECCVW. Springer, 254–265.
[104] Y. Li, J. Zhu, and S. CH Hoi. 2015. Reliable patch trackers: Robust visual tracking by exploiting reliable patches. In CVPR. IEEE, 353–361.
[105] Pengpeng Liang, Erik Blasch, and Haibin Ling. 2015. Encoding color information for visual tracking: Algorithms and benchmark. IEEE T-IP 24, 12 (2015), 5630–5644.
[106] Q. Liu, X. Zhao, and Z. Hou. 2014. Survey of single-target visual tracking methods based on online learning. IET-CV (2014).
[107] S. Liu, T. Zhang, X. Cao, and C. Xu. 2016. Structural correlation filter for robust visual tracking. In CVPR. IEEE, 4312–4320.
[108] T. Liu, G. Wang, and Q. Yang. 2015. Real-time part-based visual tracking via adaptive correlation filters. In CVPR. IEEE, 4902–4912.
[109] Songjiang Lou, Xiaoming Zhao, Yuelong Chuang, Haitao Yu, and Shiqing Zhang. 2016. Graph regularized sparsity discriminant analysis for face recognition. Neurocomputing 173 (2016), 290–297.
[110] A. Lukezic, T. Vojir, L. Cehovin Zajc, J. Matas, and M. Kristan. 2017. Discriminative Correlation Filter with Channel and Spatial Reliability. In CVPR. IEEE.
[111] A. Lukezic, L. Zajc, and M. Kristan. 2018. Deformable parts correlation filters for robust visual tracking. IEEE T-C 48, 6 (2018), 1849–1861.
[112] Chengwei Luo, Bin Sun, Qiao Deng, Zihao Wang, and Dengwei Wang. 2018. Comparison of Different Level Fusion Schemes for Infrared-Visible Object Tracking: An Experimental Survey. In ICRAS. IEEE, 1–5.
[113] C. Ma, JB Huang, X. Yang, and MH Yang. 2018. Robust Visual Tracking via Hierarchical Convolutional Features. IEEE T-PAMI (2018).
[114] C. Ma, J. B Huang, X. Yang, and M.H Yang. 2015. Hierarchical convolutional features for visual tracking. In ICCV. IEEE, 3074–3082.
[115] Chao Ma, Jia-Bin Huang, Xiaokang Yang, and Ming-Hsuan Yang. 2018. Adaptive Correlation Filters with Long-Term and Short-Term Memory for Object Tracking. IJCV (2018), 1–26.
[116] Chao Ma, Xiaokang Yang, Chongyang Zhang, and M.H Yang. 2015. Long-term correlation tracking. In CVPR. IEEE, 5388–5396.
[117] L. Ma, J. Lu, J. Feng, and J. Zhou. 2015. Multiple feature fusion via weighted entropy for visual tracking. In ICCV. 3128–3136.
[118] HK Meena, KK Sharma, and SD Joshi. 2017. Improved facial expression recognition using graph sig. pro. IET EL 53, 11 (2017), 718–720.
[119] Xue Mei and Haibin Ling. 2009. Robust visual tracking using ℓ1 minimization. In ICCV. IEEE, 1436–1443.
[120] G. Mori, X. Ren, A Efros, and J. Malik. 2004. Recovering human body configurations: Combining segmentation and recognition. In CVPR. IEEE.
[121] Matthias Mueller, Neil Smith, and Bernard Ghanem. 2017. Context-Aware Correlation Filter Tracking. In CVPR. IEEE, 1387–1395.
[122] H. Nam, M. Baek, and B. Han. 2016. Modeling and propagating cnns in a tree structure for visual tracking. CoRR abs/1608.07242 (2016).
[123] H. Nam and B. Han. 2016. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. In CVPR. IEEE, 4293–4302.
[124] Jifeng Ning, Jimei Yang, Shaojie Jiang, Lei Zhang, and M.H Yang. 2016. Object tracking via dual linear structured SVM and explicit feature map. In CVPR. IEEE, 4266–4274.
[125] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, K. Wang, J. Yan, C. C. Loy, and X. Tang. 2017. DeepID-Net: Object Detection with Deformable Part Based Convolutional Neural Networks. IEEE T-PAMI 39, 7 (2017), 1320–1334.
[126] Mustafa Ozuysal, Pascal Fua, and Vincent Lepetit. 2007. Fast keypoint recognition in ten lines of code. In CVPR. IEEE, 1–8.
[127] Jiyan Pan and Bo Hu. 2007. Robust occlusion handling in object tracking. In CVPR. IEEE, 1–8.
[128] Houwen Peng, Bing Li, Haibin Ling, Weiming Hu, Weihua Xiong, and Stephen J Maybank. 2017. Salient object detection via structured matrix decomposition. IEEE T-PAMI 39, 4 (2017), 818–832.
[129] A. Prioletti, A. Møgelmose, P. Grisleri, M. M Trivedi, A. Broggi, and T. B Moeslund. 2013. Part-based pedestrian detection and feature-based tracking for driver assistance: real-time, robust algorithms, and evaluation. IEEE T-ITS 14, 3 (2013), 1346–1359.
[130] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.H Yang. 2016. Hedged deep tracking. In CVPR. IEEE, 4303–4311.
[131] Lei Qin, Hichem Snoussi, and Fahed Abdallah. 2014. Object Tracking Using Adaptive Covariance Descriptor and Clustering-Based Model Updating for Visual Surveillance. Sensors (2014).
[132] Deva Ramanan. 2013. Dual coordinate solvers for large-scale structural svms. arXiv preprint arXiv:1312.1743 (2013).
[133] Madan Kumar Rapuru, Sumithra Kakanuru, Pallavi M Venugopal, Deepak Mishra, and GRKS Subrahmanyam. 2017. Correlation-Based Tracker-Level Fusion for Robust Visual Tracking. IEEE T-IP 26, 10 (2017), 4832–4842.
[134] Mikel D Rodriguez, Javed Ahmed, and Mubarak Shah. 2008. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In CVPR. IEEE, 1–8.
[135] Adriana Romero, Carlo Gatta, and Gustau Camps-Valls. 2016. Unsupervised deep feature extraction for remote sensing image classification. IEEE T-GRS 54, 3 (2016), 1349–1362.
[136] Marios Savvides, BVK Vijaya Kumar, and Pradeep Khosla. 2002. Face verification using correlation filters. AutoID (2002), 56–61.
[137] F. Schroff, D. Kalenichenko, and J. Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In CVPR. IEEE.
[138] Joan Severson. 2017. Human-digital media interaction tracking. US Patent 9,713,444.
[139] V. Sharma and K. Mahapatra. 2017. MIL based visual tracking with kernel and scale adaptation. Sig. Pro.: Img. Comm. 53 (2017), 51–64.
[140] K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014).
[141] Arnold WM Smeulders, Dung M Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah. 2014. Visual tracking: An experimental survey. IEEE T-PAMI 36, 7 (2014), 1442–1468.
[142] Y. Song, C. Ma, L. Gong, J. Zhang, R. Lau, and MH Yang. 2017. CREST: Convolutional Residual Learning for Tracking. In ICCV. IEEE.
[143] Yibing Song, Chao Ma, Xiaohe Wu, Lijun Gong, Linchao Bao, Wangmeng Zuo, Chunhua Shen, Rynson Lau, and Ming-Hsuan Yang. 2018. VITAL: VIsual Tracking via Adversarial Learning. In CVPR. IEEE.
[144] Yao Sui, Ziming Zhang, Guanghui Wang, Yafei Tang, and Li Zhang. 2016. Real-time visual tracking: Promoting the robustness of correlation filter learning. In ECCV. Springer.
[145] Ran Tao, Efstratios Gavves, and Arnold WM Smeulders. 2016. Siamese instance search for tracking. In CVPR. IEEE, 1420–1429.
[146] Zhu Teng, Junliang Xing, Qiang Wang, Congyan Lang, Songhe Feng, and Yi Jin. 2017. Robust Object Tracking Based on Temporal and Spatial Deep Networks. In ICCV. IEEE, 1153–1162.
[147] B. Tian, Q. Yao, Y. Gu, K. Wang, and Y. Li. 2011. Video processing techniques for traffic flow monitoring: A survey. In ITSC. IEEE.
[148] Jack Valmadre, Luca Bertinetto, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. 2017. End-to-end representation learning for Correlation Filter based tracking. CVPR (2017), 5000–5008.
[149] Andrea Vedaldi and Karel Lenc. 2015. Matconvnet: Convolutional neural networks for matlab. In ACMMM. ACM.
[150] Sean Walker, Christopher Sewell, June Park, Prabu Ravindran, Aditya Koolwal, Dave Camarillo, and Federico Barbagli. 2017. Systems and methods for localizing, tracking and/or controlling medical instruments. US Patent App. 15/466,565.
[151] Fan Wang, Yan Wu, Peng Zhang, Qingjun Zhang, and Ming Li. 2017. Unsupervised SAR image segmentation using ambiguity label information fusion in triplet Markov fields model. IEEE Geoscience and Remote Sensing Letters 14, 9 (2017), 1479–1483.
[152] G. Wang, J. Wang, W. Tang, and N. Yu. 2017. Robust visual tracking with deep feature fusion. In ICASSP. IEEE, 1917–1921.
[153] Jingjing Wang, Chi Fei, Liansheng Zhuang, and Nenghai Yu. 2016. Part-based multi-graph ranking for visual tracking. In ICIP. IEEE.
[154] Jun Wang, Weibin Liu, Weiwei Xing, and Shunli Zhang. 2017. Two-level superpixel and feedback based visual object tracking. Neurocomputing 267 (2017), 581–596.
[155] Lijun Wang, Huchuan Lu, and M.H Yang. 2018. Constrained Superpixel Tracking. IEEE T-C (2018), 1030–1041.
[156] L. Wang, W. Ouyang, X. Wang, and H. Lu. 2016. Stct: Sequentially training convolutional networks for visual tracking. In CVPR. IEEE.
[157] M. Wang, Y. Liu, and Z. Huang. 2017. Large Margin Object Tracking with Circulant Feature Maps. (2017), 4800–4808.
[158] Q. Wang, J. Gao, J. Xing, M. Zhang, and W. Hu. 2017. DCFNet: Discriminant Correlation Filters Network for Visual Tracking. CoRR abs/1704.04057 (2017).
[159] Tao Wang and Haibin Ling. 2018. Gracker: A graph-based planar object tracker. T-PAMI 40, 6 (2018), 1494–1501.
[160] X. Wang, C. Li, B. Luo, and J. Tang. 2018. SINT++: Robust Visual Tracking via Adversarial Positive Instance Generation. In CVPR. IEEE.
[161] Zhenjie Wang, Lijia Wang, and Hua Zhang. 2017. Patch Based Multiple Instance Learning Algorithm for Object Tracking. Comp. Int. and Neurosc. (2017).
[162] Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. MIT Press NC 1, 2 (1989), 270–280.
[163] Jianxin Wu, Adebola Osuntogun, Tanzeem Choudhury, Matthai Philipose, and James M Rehg. 2007. A scalable approach to activity recognition based on object use. In ICCV. IEEE, 1–8.
[164] Yi Wu, Jongwoo Lim, and M.H Yang. 2013. Online object tracking: A benchmark. In CVPR. IEEE, 2411–2418.
[165] Yi Wu, Jongwoo Lim, and M.H Yang. 2015. Object tracking benchmark. IEEE T-PAMI (2015), 1834–1848.
[166] Chao Xu, Wenyuan Tao, Zhaopeng Meng, and Zhiyong Feng. 2015. Robust visual tracking via online multiple instance learning with Fisher information. Elsevier PR 48, 12 (2015), 3917–3926.
[167] Fan Yang, Huchuan Lu, and M.H Yang. 2014. Robust superpixel tracking. IEEE T-IP 23, 4 (2014), 1639–1651.
[168] Honghong Yang, Shiru Qu, and Zunxin Zheng. 2017. Visual tracking via online discriminative multiple instance metric learning. Springer MTA (2017), 1–19.
[169] Hanxuan Yang, Ling Shao, Feng Zheng, Liang Wang, and Zhan Song. 2011. Recent advances and trends in visual tracking: A review. Neurocomputing 74, 18 (2011), 3823–3831.
[170] Ming Yang, Ying Wu, and Gang Hua. 2009. Context-aware visual tracking. IEEE T-PAMI 31, 7 (2009), 1195–1209.
[171] M. Yang, Y. Wu, and S. Lao. 2006. Intelligent collaborative tracking by mining auxiliary objects. In CVPR, Vol. 1. IEEE, 697–704.
[172] Rui Yao, Qinfeng Shi, Chunhua Shen, Yanning Zhang, and Anton van den Hengel. 2017. Part-based robust tracking using online latent structured learning. IEEE T-CSVT 27, 6 (2017), 1235–1248.
[173] D. Yeo, J. Son, B Han, and J. H Han. 2017. Superpixel-Based Tracking-By-Segmentation Using Markov Chains. In CVPR. IEEE, 511–520.
[174] Yang Yi, Yang Cheng, and Chuping Xu. 2017. Visual tracking based on hierarchical framework and sparse representation. Springer MTA (2017), 1–23.
[175] Alper Yilmaz, Omar Javed, and Mubarak Shah. 2006. Object tracking: A survey. ACM Comput. Surv. 38 (2006), 13.
[176] Alper Yilmaz, Xin Li, and Mubarak Shah. 2004. Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE T-PAMI 26, 11 (2004), 1531–1536.
[177] Qian Yu, Thang Ba Dinh, and Gerard Medioni. 2008. Online tracking and reacquisition using co-trained generative and discriminative trackers. In ECCV. Springer, 678–691.
[178] S. Yun, J. Choi, Y. Yoo, K. Yun, and JY Choi. 2017. Action-Decision Networks for Tracking with Deep Reinforcement Learning. In CVPR. IEEE.
[179] Sergey Zagoruyko and Nikos Komodakis. 2015. Learning to compare image patches via convolutional neural networks. In CVPR. IEEE.
[180] J. Zbontar and Y. LeCun. 2015. Computing the stereo matching cost with a convolutional neural network. In CVPR. IEEE, 1592–1599.
[181] Baochang Zhang, Zhigang Li, Alessandro Perina, Alessio Del Bue, Vittorio Murino, and Jianzhuang Liu. 2017. Adaptive local movement modeling for robust object tracking. IEEE T-CSVT 27, 7 (2017), 1515–1526.
[182] Cha Zhang, John C Platt, and Paul A Viola. 2006. Multiple instance boosting for object detection. In Adv NIPS. 1417–1424.
[183] Da Zhang, Hamid Maei, Xin Wang, and Yuan-Fang Wang. 2017. Deep Reinforcement Learning for Visual Object Tracking in Videos. CoRR abs/1701.08936 (2017).
[184] Jiaqi Zhang, Yao Deng, Zhenhua Guo, and Youbin Chen. 2016. Face recognition using part-based dense sampling local features. Neurocomputing 184 (2016), 176–187.
[185] Kaihua Zhang, Qingshan Liu, Yi Wu, and M.H Yang. 2016. Robust visual tracking via convolutional networks without training. IEEE TIP 25, 4 (2016), 1779–1792.
[186] L. Zhang, J. Varadarajan, P Suganthan, N. Ahuja, and P. Moulin. 2017. Robust Tracking Using Oblique Random Forests. In CVPR. IEEE.
[187] M. Zhang, J. Xing, J. Gao, and W. Hu. 2015. Robust visual tracking using joint scale-spatial correlation filters. In ICIP. IEEE, 1468–1472.
[188] Mengdan Zhang, Junliang Xing, Jin Gao, Xinchu Shi, Qiang Wang, and Weiming Hu. 2015. Joint scale-spatial correlation tracking with adaptive rotation estimation. In ICCVW. IEEE, 32–40.
[189] S. Zhang, H. Yao, X. Sun, and X. Lu. 2013. Sparse coding based visual tracking: Review and experimental comparison. PR (2013).
[190] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja. 2012. Robust visual tracking via multi-task sparse learning. In CVPR. IEEE, 2042–2049.
[191] T. Zhang, B. Ghanem, S. Liu, C. Xu, and N. Ahuja. 2016. Robust tracking via exclusive context modeling. IEEE T-C 46, 1 (2016), 51–63.
[192] T. Zhang, K. Jia, C. Xu, Y. Ma, and N. Ahuja. 2014. Partial occlusion handling for tracking via robust part matching. In CVPR. IEEE.
[193] T. Zhang, S. Liu, N. Ahuja, MH Yang, and B. Ghanem. 2015. Robust Tracking Via Consistent Low-Rank Sparse Learning. IJCV (2015).
[194] T. Zhang, S. Liu, C. Xu, S. Yan, B. Ghanem, N. Ahuja, and M.H Yang. 2015. Structural sparse tracking. In CVPR. IEEE, 150–158.
[195] T. Zhang, C. Xu, and M.H Yang. 2017. Multi-Task Correlation Particle Filter for Robust Object Tracking. In CVPR. IEEE, 4819–4827.
[196] Tianzhu Zhang, Changsheng Xu, and Ming-Hsuan Yang. 2018. Robust structural sparse tracking. IEEE T-PAMI (2018).
[197] Wei Zhong, Huchuan Lu, and M.H Yang. 2012. Robust object tracking via sparsity-based collaborative model. In CVPR. IEEE.
[198] B. Zhuang, L. Wang, and H. Lu. 2016. Visual tracking via shallow and deep collaborative model. Neurocomputing 218 (2016), 61–71.

Received XXXXXXX 0000; revised XXXXXX 0000; accepted 22 January 2019
