Joint Laboratory of Astronomical Information Technology (天文信息技术联合实验室)

New Progress On Astronomical Cross-Match Research

Zhao Qing

Contents

• Our Previous Function

• New Improvements and Attempts
– Discussion of Adaptability on HTM-Indexed Data
– New function based on Boundary Growing Model
– Cross-match in distributed environment based on MapReduce model

• Plan & Discussion

Our Previous Function

• PHIXmatch: Paralleled HEALPix-Indexing Xmatch

• Test Dataset: SDSS (100 million) × 2MASS (470 million)

• Function: Spatial Join

• Results:

SDSS_ID             Twomass_ID        Distance
587731512617271364  02595905+0000200  5.243e-05
587731512617271365  02595905+0000200  6.55e-05
587731513154076828  02593768+0012219  3.2e-05
587731513154077269  02593768+0012219  0.0025043169
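
A spatial join of this kind pairs every object in one catalog with the objects of the other catalog that lie within a matching radius. The following is a minimal brute-force sketch, not the PHIXmatch implementation: it assumes positions are given as (RA, Dec) in degrees, uses the haversine formula for the separation, and uses hypothetical names (`spatial_join`, `radius_deg`). It has no indexing or parallelism, so it only illustrates what the result rows mean.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, computed with the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # min() guards against rounding pushing sqrt(h) slightly above 1
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def spatial_join(catalog_a, catalog_b, radius_deg):
    """Return (id_a, id_b, distance) for every pair closer than radius_deg.

    Each catalog is a list of (object_id, ra_deg, dec_deg) tuples
    (a hypothetical layout chosen for this sketch).
    """
    matches = []
    for id_a, ra_a, dec_a in catalog_a:
        for id_b, ra_b, dec_b in catalog_b:
            d = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if d <= radius_deg:
                matches.append((id_a, id_b, d))
    return matches
```

The nested loop costs one distance evaluation per pair of objects, which is why catalogs of this size need a spatial index to restrict comparisons to nearby objects.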

HEALPix Index Function

• HEALPix: Hierarchical Equal Area isoLatitude Pixelization of a sphere.

• Quadtree pixel numbering
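
The quadtree numbering can be illustrated with a few lines of integer arithmetic. This is a sketch assuming the HEALPix NESTED scheme, where the sphere starts from 12 base pixels and each pixel splits into four children at the next order; the function names are hypothetical and no HEALPix library is used.

```python
def npix(order):
    """Total number of pixels at a given order (12 base pixels, each split 4**order times)."""
    return 12 * 4 ** order

def children(pixel):
    """The four child pixel indices at the next finer order (NESTED numbering)."""
    return [(pixel << 2) | k for k in range(4)]

def parent(pixel):
    """The parent pixel index at the next coarser order."""
    return pixel >> 2

def ancestor_at_order(pixel, order, target_order):
    """Index of the coarser pixel at target_order that contains `pixel` at `order`."""
    if target_order > order:
        raise ValueError("target_order must not exceed order")
    return pixel >> (2 * (order - target_order))
```

Because moving between levels is a two-bit shift, objects indexed at a fine order can be grouped into coarser blocks without any trigonometry.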

What we have resolved

• Resolved the border-block problem: a fast bitwise algorithm deduces the index numbers of the neighbor blocks

• Realized parallel cross-match computation in a multi-core environment
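
One way such a bitwise neighbor lookup can work is sketched below. It is an illustration under stated assumptions, not the algorithm from the paper: it assumes that, inside one base face, the nested index is the bit-interleaving (Morton code) of the in-face coordinates x and y, and it only returns neighbors that stay inside the same face. Neighbors across face edges need extra handling that is omitted here. All function names are hypothetical.

```python
def spread_bits(v):
    """Insert a zero bit between each of the low 16 bits of v."""
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v

def compress_bits(v):
    """Inverse of spread_bits: keep every second bit and pack them together."""
    v &= 0x55555555
    v = (v | (v >> 1)) & 0x33333333
    v = (v | (v >> 2)) & 0x0F0F0F0F
    v = (v | (v >> 4)) & 0x00FF00FF
    v = (v | (v >> 8)) & 0x0000FFFF
    return v

def xy_to_index(x, y):
    """Interleave x (even bits) and y (odd bits) into an in-face index."""
    return spread_bits(x) | (spread_bits(y) << 1)

def index_to_xy(index):
    """Recover the in-face coordinates from an interleaved index."""
    return compress_bits(index), compress_bits(index >> 1)

def in_face_neighbors(index, order):
    """Indices of the up to 8 neighbors that lie inside the same face."""
    side = 1 << order  # the face is a side x side grid at this order
    x, y = index_to_xy(index)
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nx, ny = x + dx, y + dy
            if 0 <= nx < side and 0 <= ny < side:
                result.append(xy_to_index(nx, ny))
    return sorted(result)
```

For example, `in_face_neighbors(3, 2)` returns the eight blocks around the block at (x, y) = (1, 1) of a 4 × 4 face, while a corner block such as index 0 only has three in-face neighbors.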

Results & Performance Analysis

• Results

Function            Table A         Data Amount of A  Table B         Data Amount of B  Time     Finished amounts/sec
PHIXmatch function  SDSS            100,106,811       2MASS           470,992,970       25 min   52,139
GaoDan's Function   Part of GSC2.3  295,832           Part of GSC2.3  295,832           5.6 min  880

• Conclusion

PHIXmatch shows a marked performance advantage over previous functions and is applicable to large-scale cross-match on multi-core systems.

• Paper: Qing Zhao, Jizhou Sun, Ce Yu, Chenzhou Cui, Liqiang Lv, and Jian Xiao, "A Paralleled Large-Scale Astronomical Cross-Matching Function," International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP) 2009, LNCS 5574, pp. 604-614


Adaptability Research on HTM-Indexed Data

• HTM: Hierarchical Triangular Mesh

• Resolve the border-data problem in HTM

Results of HTM version Xmatch

• 42 min

• Why is the result poor compared with the HEALPix version? Answer: the triangle shape!


New function based on Boundary Growing Model

• Database read operations are too time-consuming, especially for the border data!


MapReduce

• A software framework introduced by Google to support distributed computing on large data sets on clusters of computers.
– Huge datasets
– Distributable application
– Data stored either in a filesystem (unstructured) or within a database (structured)

• Map step & Reduce step
– Map: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem and passes the answer back to its master node.
– Reduce: The master node then takes the answers to all the sub-problems and combines them to produce the output, the answer to the problem it was originally trying to solve.
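
The Map and Reduce steps described above can be imitated in a single process to show the data flow. This is a toy sketch, not Hadoop or Google's implementation: the `map_reduce` helper, the mapper and reducer signatures, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper over all records, group the emitted pairs by key, then reduce each group."""
    # Map: every input record may emit any number of (key, value) pairs
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle/sort: collect all values that share a key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: combine the values of each key into one result
    return {key: reducer(key, groups[key]) for key in sorted(groups)}

def count_mapper(record):
    """Emit (catalog_name, 1) for each (catalog_name, object_id) record."""
    catalog, _object_id = record
    yield catalog, 1

def count_reducer(_key, values):
    """Sum the counts emitted for one catalog."""
    return sum(values)

if __name__ == "__main__":
    sample = [("SDSS", 1), ("2MASS", 2), ("SDSS", 3)]
    print(map_reduce(sample, count_mapper, count_reducer))
```

In a real cluster the map calls and the reduce calls run on different worker nodes, and the grouping step moves data across the network; only the three-phase structure carries over from this sketch.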

Map Step & Reduce Step

[Figure: Input -> Chop/replicate -> Map (×6) -> Shuffle/Sort -> Reduce (×2) -> Result (×2)]

Apache Hadoop

• A Java software framework inspired by Google’s MapReduce and Google File System papers.

• What does it provide? Easy programming, automatic scheduling, error detection and correction

• Who uses Hadoop?
– Yahoo!: web search; advertising businesses
– Amazon: S3, EC2
– IBM & Google: computation platform for universities
– Institute of Computing Technology, Chinese Academy of Sciences: PBminer

• Scale figures
– Page links: 1 T
– Output: over 300 TB, compressed!
– Number of cores in a job: over 10,000
– Disk in the cluster: over 5 P

Hadoop Architecture

Why Use MapReduce for Xmatch

• Near-linear speedup, compared with an MPI cluster

• Suitable for data-intensive and compute-intensive applications, at low cost

• Has been used in many data mining applications and may be useful for more complex cross-match functions.
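
To make the fit between cross-match and MapReduce concrete, here is a single-process sketch of how a cross-match could be phrased as a map step and a reduce step. It is an assumption-laden illustration, not the function developed in this work: it keys objects by a simple rectangular grid cell instead of a HEALPix block, uses a flat small-angle distance that ignores RA wrap-around and breaks down near the poles, and replicates catalog "B" objects into the eight surrounding cells so that pairs straddling a cell border are still found. `CELL_DEG`, the record layout, and all function names are hypothetical.

```python
import math
from collections import defaultdict

CELL_DEG = 1.0  # hypothetical cell size; must be at least the matching radius

def cell_of(ra, dec):
    """Grid cell (integer pair) containing a position given in degrees."""
    return (math.floor(ra / CELL_DEG), math.floor(dec / CELL_DEG))

def mapper(record):
    """Emit (cell, record). Catalog "A" goes to its own cell only;
    other catalogs are replicated to the 3 x 3 block of cells around them."""
    catalog, _obj_id, ra, dec = record
    cx, cy = cell_of(ra, dec)
    if catalog == "A":
        yield (cx, cy), record
    else:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                yield (cx + dx, cy + dy), record

def reducer(values, radius_deg):
    """Pair the A and B objects that met in one cell and lie within radius_deg."""
    a_objs = [v for v in values if v[0] == "A"]
    b_objs = [v for v in values if v[0] != "A"]
    out = []
    for _, id_a, ra_a, dec_a in a_objs:
        for _, id_b, ra_b, dec_b in b_objs:
            # Flat small-angle approximation of the separation, in degrees
            d = math.hypot((ra_a - ra_b) * math.cos(math.radians(dec_a)),
                           dec_a - dec_b)
            if d <= radius_deg:
                out.append((id_a, id_b, d))
    return out

def cross_match(records, radius_deg):
    """Map, group by cell (shuffle), then reduce each cell independently."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    matches = []
    for key in sorted(groups):
        matches.extend(reducer(groups[key], radius_deg))
    return sorted(matches)
```

Because each catalog "A" object lives in exactly one cell, every matching pair is reported once, and each cell can be reduced on a different node without communication.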

Plan & Discussion

• Service for larger data sets (TB) and various catalogs such as…
– Interfaces for more kinds of catalogs
– Additional measures to deal with TB-level data
– Parallelizing other cross-match functions

Thank you!

We need your help!
