Joint Laboratory for Astronomical Information Technology (天文信息技术联合实验室)
New Progress On Astronomical Cross-Match Research
Zhao Qing
Contents
• Our Previous Function
• New Improvements and Attempts
– Discussion of Adaptability on HTM-Indexed Data
– New function based on Boundary Growing Model
– Cross-match in distributed environment based on MapReduce model
• Plan & Discussion
Our Previous Function
• PHIXmatch: Paralleled Healpix-Indexing Xmatch
• Test Dataset: SDSS (100 million) × 2MASS (470 million)
• Function: Spatial Join
• Results:
SDSS_ID Twomass_ID Distance
587731512617271364 02595905+0000200 5.243e-05
587731512617271365 02595905+0000200 6.55e-05
587731513154076828 02593768+0012219 3.2e-05
587731513154077269 02593768+0012219 0.0025043169
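To make the spatial join concrete, here is a minimal sketch of what such a join computes: every pair of records from two catalogs whose angular separation is within a search radius. It is a naive all-pairs version, not the PHIXmatch implementation. The function names, the (id, RA, Dec) record layout, and the use of degrees are assumptions for illustration; the slides do not state the unit of the Distance column.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form: numerically stable for the tiny separations typical of cross-matching.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the valid asin domain.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def spatial_join(table_a, table_b, radius_deg):
    """Naive O(N*M) join: return (id_a, id_b, distance) for pairs within radius_deg.

    Records are assumed to be (id, ra_deg, dec_deg) tuples (hypothetical layout).
    """
    matches = []
    for id_a, ra_a, dec_a in table_a:
        for id_b, ra_b, dec_b in table_b:
            d = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if d <= radius_deg:
                matches.append((id_a, id_b, d))
    return matches
```

The all-pairs loop is what makes indexing necessary: for catalogs of this size, comparing every record against every other is infeasible, so records are first grouped by sky pixel and only nearby groups are compared.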
HEALPix Index Function
• HEALPix: Hierarchical Equal Area isoLatitude Pixelization of a sphere.
• Quadtree pixel numbering
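The quadtree numbering can be illustrated with a few lines of integer arithmetic. This sketch assumes HEALPix's NESTED numbering, in which each pixel splits into four children whose indices share the parent's index as a prefix; the helper names are made up for illustration.

```python
def children(pixel):
    """The four child pixels at the next finer level (NESTED numbering)."""
    return [4 * pixel + k for k in range(4)]

def parent(pixel):
    """The pixel one level coarser: drop the two lowest bits."""
    return pixel >> 2

def ancestor(pixel, levels_up):
    """The enclosing pixel `levels_up` levels coarser: drop two bits per level."""
    return pixel >> (2 * levels_up)

def npix(order):
    """Total pixel count at a given order: 12 base pixels, each split 4**order ways."""
    return 12 * 4 ** order
```

Because moving between levels is just a two-bit shift, finding which coarse block a fine pixel belongs to needs no table lookup or trigonometry.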
What we have resolved
• Resolved the border-block problem: a fast bitwise algorithm deduces the index numbers of the neighboring blocks
• Realized parallel cross-match computation in a multi-core environment
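One way bitwise operations can yield neighbor indices is sketched below. It is not the algorithm from the paper, only an illustration under a simplifying assumption: it works on the index of a pixel within a single HEALPix base face (NESTED scheme), where the index is the bit-interleaving of integer coordinates x and y, and it ignores neighbors that lie across a face boundary. All function names are hypothetical.

```python
def spread_bits(v, order):
    """Place the low `order` bits of v into the even bit positions."""
    out = 0
    for i in range(order):
        out |= ((v >> i) & 1) << (2 * i)
    return out

def compact_bits(v, order):
    """Inverse of spread_bits: gather the even bit positions of v."""
    out = 0
    for i in range(order):
        out |= ((v >> (2 * i)) & 1) << i
    return out

def xy_to_index(x, y, order):
    """Interleave x (even bits) and y (odd bits) into an in-face pixel index."""
    return spread_bits(x, order) | (spread_bits(y, order) << 1)

def index_to_xy(idx, order):
    """Split an in-face pixel index back into its (x, y) coordinates."""
    return compact_bits(idx, order), compact_bits(idx >> 1, order)

def neighbors_in_face(idx, order):
    """Indices of the up-to-8 neighbors that lie inside the same base face.

    Simplification: neighbors across a face boundary are skipped here; a full
    solution must also map them into the adjacent face.
    """
    side = 1 << order
    x, y = index_to_xy(idx, order)
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nx, ny = x + dx, y + dy
            if 0 <= nx < side and 0 <= ny < side:
                result.append(xy_to_index(nx, ny, order))
    return sorted(result)
```

Objects near a block edge may match objects in an adjacent block, so each block's worker needs these neighbor indices to know which extra data to compare against.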
Results & Performance Analysis
• Results
Function             Table A          Data amount of A   Table B          Data amount of B   Time      Finished amount/sec
PHIXmatch function   SDSS             100,106,811        2MASS            470,992,970        25 min    52,139
GaoDan’s Function    Part of GSC2.3   295,832            Part of GSC2.3   295,832            5.6 min   880
• Conclusion
PHIXmatch shows a marked performance advantage over the previous function and is applicable to large-scale cross-matching on multi-core systems.
• Paper: Qing Zhao, Jizhou Sun, Ce Yu, Chenzhou Cui, Liqiang Lv, and Jian Xiao, "A Paralleled Large-Scale Astronomical Cross-Matching Function," International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP) 2009, LNCS 5574, pp. 604-614
Adaptability Research on HTM-Indexed Data
• HTM—Hierarchical Triangular Mesh
• Resolve the border-data problem in HTM
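For readers unfamiliar with HTM indexing, the sketch below converts a trixel name to an integer ID. It assumes the commonly described HTM convention (eight base triangles named S0 to S3 and N0 to N3, two bits appended per subdivision level); the slides do not spell out the encoding, so treat the details and the function name as assumptions.

```python
def htm_name_to_id(name):
    """Convert a trixel name such as 'N012' to an integer HTM ID.

    Assumed encoding: binary prefix 10 for 'S' or 11 for 'N', followed by
    two bits for each digit (0-3) of the name.
    """
    if len(name) < 2 or name[0] not in "NS":
        raise ValueError("name must start with 'N' or 'S' followed by digits 0-3")
    htm_id = 3 if name[0] == "N" else 2
    for ch in name[1:]:
        digit = int(ch)
        if not 0 <= digit <= 3:
            raise ValueError("each digit must be in 0-3")
        htm_id = (htm_id << 2) | digit
    return htm_id
```

As with HEALPix, each level adds two bits, so parent and child IDs are related by a shift; the difference lies in the triangular cell shape, not the numbering.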
Results of HTM version Xmatch
• 42 min
• Why is the result poor compared with the HEALPix version? Answer: the triangular shape of the cells!
New function based on Boundary Growing Model
• Database read operations are too time-consuming, especially for the border data!
MapReduce
• A software framework introduced by Google to support distributed computing on large data sets on clusters of computers.
– Huge datasets
– Distributable application
– Data stored either in a filesystem (unstructured) or within a database (structured)
• Map step & Reduce step
– Map: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
– Reduce: The master node then takes the answers to all the sub-problems and combines them in a way to get the output: the answer to the problem it was originally trying to solve.
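The map and reduce steps described above can be sketched for cross-matching in a single process. This is only an illustration of the programming model, not Hadoop code and not the lab's implementation. Assumptions: records are (id, RA, Dec) tuples in degrees, a plain RA/Dec grid (`cell_of`) stands in for a HEALPix or HTM index, distances use a flat-sky approximation, and pairs that straddle a cell border are ignored.

```python
import math
from collections import defaultdict
from itertools import product

def cell_of(ra, dec, cell_deg=1.0):
    """Hypothetical stand-in for a sky index: a plain RA/Dec grid cell."""
    return (int(ra // cell_deg), int((dec + 90.0) // cell_deg))

def map_step(catalog, records):
    """Map: tag each record with its catalog and emit it under its cell key."""
    for rec_id, ra, dec in records:
        yield cell_of(ra, dec), (catalog, rec_id, ra, dec)

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(values, radius_deg):
    """Reduce: within one cell, pair catalog A records with catalog B records."""
    a = [v for v in values if v[0] == "A"]
    b = [v for v in values if v[0] == "B"]
    for (_, id_a, ra_a, dec_a), (_, id_b, ra_b, dec_b) in product(a, b):
        # Flat-sky approximation, adequate only for very small separations.
        d_ra = (ra_a - ra_b) * math.cos(math.radians((dec_a + dec_b) / 2))
        if math.hypot(d_ra, dec_a - dec_b) <= radius_deg:
            yield id_a, id_b

def cross_match(table_a, table_b, radius_deg):
    """Run map, shuffle, and reduce in sequence; a cluster would distribute each stage."""
    pairs = list(map_step("A", table_a)) + list(map_step("B", table_b))
    results = []
    for values in shuffle(pairs).values():
        results.extend(reduce_step(values, radius_deg))
    return sorted(results)
```

Each cell's reduce call is independent of the others, which is what lets a framework spread the work across many worker nodes.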
Map Step & Reduce Step
[Diagram: Input → Chop/replicate → Map (×6) → Shuffle/Sort → Reduce (×2) → Result (×2)]
Apache Hadoop
• A Java software framework inspired by Google’s MapReduce and Google File System papers.
• What functions does it perform? Easy programming, automatic scheduling, error detection & correction
• Who uses Hadoop?
– Yahoo!: web search; advertising businesses
– Amazon: S3, EC2
– IBM & Google: computation platform for universities
– Institute of Computing Technology, Chinese Academy of Sciences: PBminer
• Page links: 1 T; output: over 300 TB, compressed! Number of cores in a job: over 10,000; disk in the cluster: over 5 P
Hadoop Architecture
Why Use MapReduce for Xmatch
• Near-linear speedup, compared with an MPI cluster
• Suitable for data-intensive, compute-intensive applications; low cost!
• Has been used in many data mining applications; may be useful for more complex cross-match functions.
Plan & Discussion
• Service for larger data sets (TB) and various catalogs such as…
– Interfaces for more kinds of catalogs
– Additional measures to deal with TB-level data
– Parallelizing other cross-match functions
Thank you!
We need your help!