
Proceedings
2009 International Conference on Computer Engineering and Technology

ICCET 2009

Volume I


January 22 - 24, 2009 Singapore

Edited by Jianhong Zhou and Xiaoxiao Zhou

Sponsored by

International Association of Computer Science & Information Technology

Los Alamitos, California • Washington • Tokyo

Copyright 2009 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 133, Piscataway, NJ 08855-1331.

The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.

IEEE Computer Society Order Number P3521
BMS Part Number CFP0967F
ISBN 978-0-7695-3521-0
Library of Congress Number 2008909477

Additional copies may be ordered from:

IEEE Computer Society Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1314
Tel: +1 800 272 6657
Fax: +1 714 821 4641
http://computer.org/cspress

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1 732 981 0060
Fax: +1 732 981 9667
http://shop.ieee.org/store/

IEEE Computer Society Asia/Pacific Office
Watanabe Bldg., 1-4-2 Minami-Aoyama
Minato-ku, Tokyo 107-0062 JAPAN
Tel: +81 3 3408 3118
Fax: +81 3 3408 3553

Individual paper REPRINTS may be ordered at:

Editorial production by Lisa O'Conner
Cover art production by Joe Daigle/Studio Productions
Printed in the United States of America by The Printing House

IEEE Computer Society

Conference Publishing Services (CPS)
http://www.computer.org/cps


Table of Contents
Volume 1

Preface - Volume 1
ICCET 2009 Committee Members - Volume 1
ICCET 2009 Organizing Committees - Volume 1

Session 1

Overlapping Non-dedicated Clusters Architecture
Martin Šťava and Pavel Tvrdík

To Determine the Weight in a Weighted Sum Method for Domain-Specific Keyword Extraction
Wenshuo Liu and Wenxin Li

Flow-based Description of Conceptual and Design Levels
Sabah Al-Fedaghi

A Method of Query over Encrypted Data in Database
Lianzhong Liu and Jingfen Gai

Traversing Model Design Based on Strong-Association Rule for Web Application Vulnerability Detection
Zhenyu Qi, Jing Xu, Dawei Gong, and He Tian

Attribute-Based Relative Ranking of Robot for Task Assignment
B.B. Choudhury, B.B. Biswal, and R.N. Mahapatra

A Subjective Trust Model Based on Two-Dimensional Measurement
Chaowen Chang, Chen Liu, and Yuqiao Wang

A Genetic Algorithm Approach for Optimum Operator Assignment in CMS
Ali Azadeh, Hamrah Kor, and Seyed-Morteza Hatefi

Dynamic Adaption in Composite Web Services Using Expiration Times
Xiaohao Yu, Xueshan Luo, Honghui Chen, and Dan Hu

An Emotional Intelligent E-learning System Based on Mobile Agent Technology
Zhiliang Wang, Xiangjie Qiao, and Yinggang Xie

Audio Watermarking for DRM Based on Chaotic Map
B. Lei and I.Y. Soon

Walking Modeling Based on Motion Functions
Hao Zhang and Zhijing Liu

Preprocessing and Feature Preparation in Chinese Web Page Classification
Weitong Huang, Luxiong Xu, and Yanmin Liu

High Performance Grid Computing and Security through Load Balancing
V. Sugavanan and V. Prasanna Venkatesh

Research of the Synthesis Control of Force and Position in Electro-Hydraulic Servo System
Yadong Meng, Changchun Li, Hao Yan, and Xiaodong Liu

Session 2

Features Selection Using Fuzzy ESVDF for Data Dimensionality Reduction
Safaa Zaman and Fakhri Karray

PDC: Propagation Delay Control Strategy for Restricted Floating Sensor Networks
Xiaodong Liu

Fast and High Quality Temporal Transcoding Architecture in the DCT Domain for Adaptive Video Content Delivery
Vinay Chander, Aravind Reddy, Shriprakash Gaurav, Nishant Khanwalkar, Manish Kakhani, and Shashikala Tapaswi

Electricity Demand Forecasting Based on Feedforward Neural Network Training by a Novel Hybrid Evolutionary Algorithm
Wenyu Zhang, Yuanyuan Wang, Jianzhou Wang, and Jinzhao Liang

Investigation on the Behaviour of New Type Airbag
Hu Lin, Liu Ping, and Huang Jing

Performance Evaluation of PNtMS: A Portable Network Traffic Monitoring System on Embedded Linux Platform
Mostafijur Rahman, Zahereel Ishwar Abdul Khalib, and R.B. Ahmad

PB-GPCT: A Platform-Based Configuration Tool
Huiqiang Yan, Runhua Tan, Kangyun Shi, and Fei Lu

A Feasibility Study on Hyperblock-based Aggressive Speculative Execution Model
Ming Cong, Hong An, Yongqing Ren, Canming Zhao, and Jun Zhang

Parallel Method for Discovering Frequent Itemsets Using Weighted Tree Approach
Preetham Kumar and Ananthanarayana V S

Optimized Design and Implementation of Three-Phase PLL Based on FPGA
Yuan Huimei, Sun Hao, and Song Yu

Research on the Data Storage and Access Model in Distributed Environment
Wuling Ren and Pan Zhou

An Effective Classification Model for Cancer Diagnosis Using Micro Array Gene Expression Data
V. Saravanan and R. Mallika

Study and Experiment of Blast Furnace Measurement and Control System Based on Virtual Instrument
Shufen Li and Zhihua Liu

A New Optimization Scheme for Resource Allocation in OFDMA Based WiMAX Systems
Arijit Ukil, Jaydip Sen, and Debasish Bera

An Integration of CoTraining and Affinity Propagation for PU Text Classification
Na Luo, Fuyu Yuan, and Wanli Zuo

Session 3

Ergonomic Evaluation of Small-screen Leading Displays on the Visual Performance of Chinese Users
Yu-Hung Chien and Chien-Cheng Yen

Curvature-Based Feature Extraction Method for 3D Model Retrieval
Yujie Liu, Xiaolan Yao, and Zongmin Li

A New Method for Vertical Handoff between WLANs and UMTS in Boundary Conditions
Majid Fouladian, Faramarz Hendessi, Alireza Shafieinejad, Morteza Rahimi, and Mahdi M. Bayat

Research on Secure Key Techniques of Trustworthy Distributed System
Ming He, Aiqun Hu, and Hangping Qiu

WebELS: A Multimedia E-learning Platform for Non-broadband Users
Zheng He, Jingxia Yue, and Haruki Ueno

Implementation and Improvement Based on Shear-Warp Volume Rendering Algorithm
Li Guo and Xie Mei

Conferencing, Paging, Voice Mailing via Asterisk EPBX
Ale Imran and Mohammed A. Qadeer

A New Mind Evolutionary Algorithm Based on Information Entropy
Yuxia Qiu and Keming Xie

An Encapsulation Structure and Description Specification for Application Level Software Components
Jin Guojie and Yin Baolin

Fault Detection and Diagnosis of Continuous Process Based on Multiblock Principal Component Analysis
Libo Bie and Xiangdong Wang

Strong Thread Migration in Heterogeneous Environment
Khandakar Entenam Unayes Ahmed, Md. Al-mamun Shohag, Tamim Shahriar, Md. Khalad Hasan, and Md. Mashud Rana

A DSP-based Active Power Filter for Three-phase Power Distribution Systems
Ping Wei, Zhixiong Zhan, and Houquan Chen

Access Control Scheme for Workflow
Lijun Gao, Lu Zhang, and Lei Xu

A Mathematical Model of Interference between RFID and Bluetooth in Fading Channel
Junjie Chen, Jianqiu Zeng, and Yuchen Zhou

Optimization Strategy for SSVEP-Based BCI in Spelling Program Application
Indar Sugiarto, Brendan Allison, and Axel Gräser

Session 4

A Novel Method for the Web Page Segmentation and Identification
Jing Wang and Zhijing Liu

Disturbance Observer-Based Variable Structure Control on the Working Attitude Balance Mechanism of Underwater Robot
Min Li and Heping Liu

Adaptive OFDM Vs Single Carrier Modulation with Frequency Domain Equalization
Inderjeet Kaur, Kamal Thakur, M. Kulkarni, Daya Gupta, and Prabhjyot Arora

A Bivariate C1 Cubic Spline Space on Wang's Refinement
Huan-Wen Liu and Wei-Ping Lu

Fast Shape Matching Using a Hybrid Model
Gang Xu and Wenxian Yang

A Multi-objective Genetic Algorithm for Optimization of Cellular Manufacturing System
H. Kor, H. Iranmanesh, H. Haleh, and S.M. Hatefi

A Formal Mapping between Program Slicing and Z Specifications
Fangjun Wu

Modified Class-Incremental Generalized Discriminant Analysis
Yunhui He

Controlling Free Riders in Peer to Peer Networks by Intelligent Mining
Ganesh Kumar. M, Arun Ram. K, and Ananya. A.R

Servo System Modeling and DSP Code Autogeneration Technology for Open-CNC
Shukun Cao, Heng Zhang, Li Song, Changsheng Ai, and Xiangbo Ze

Extending Matching Model for Semantic Web Services
Alireza Zohali, Kamran Zamanifar, and Naser Nematbakhsh

Sound Absorption Measurement of Acoustical Material and Structure Using the Echo-Pulse Method
Liang Sun, Hong Hou, Liying Dong, and Fangrong Wan

Parallel Design of Cross Search Algorithm in Motion Estimation
Fan Zhang

Influences of DSS Environments and Models on Current Business Decision and Knowledge Management
Md. Fazle Munim and Fatima Binte Zia

A Method for Transforming Workflow Processes to CSS
Jing Xiao, Guo-qing Wu, and Shu Chen

Session 5

An Empirical Approach of Delta Hedging in GARCH Model
Qian Chen and Chengzhe Bai

Multi-objective Parameter Optimization Technology for Business Process Based on Genetic Algorithm
Bo Wang, Li Zhang, and Yawei Tian

Analysis and Design of an Access Control Model Based on Credibility
Chaowen Chang, Yuqiao Wang, and Chen Liu

Modeling of Rainfall Prediction over Myanmar Using Polynomial Regression
Wint Thida Zaw and Thinn Thu Naing

New Similarity Measure for Restricted Floating Sensor Networks
Yuan Feng, Xiaodong Liu, and Xiangqian Ding

3D Mesh Skeleton Extraction Based on Feature Points
Faming Gong and Cui Kang

Pairings Based Designated Verifier Signature Scheme for Three-Party Communication Environment
Han-Yu Lin and Tzong-Sun Wu

A Novel Shared Path Protection Scheme for Reliability Guaranteed Connection
Jijun Zhao, Weiwei Bian, Lirong Wang, and Sujian Wang

Generalized Program Slicing Applied to Z Specifications
Fangjun Wu

PC Based Weight Scale System with Load Cell for Product Inspection
Anton Satria Prabuwono, Habibullah Akbar, and Wendi Usino

Short-Term Electricity Price Forecast Based on Improved Fractal Theory
Herui Cui and Li Yang

BBS Sentiment Classification Based on Word Polarity
Shen Jie, Fan Xin, Shen Wen, and Ding Quan-Xun

Applying eMM in a 3D Approach to e-Learning Quality Improvement
Kattiya Tawsopar and Kittima Mekhabunchakij

Research on Automobile Driving State Real-Time Monitoring System Based on ARM
Hongjiang He and Yamin Zhang

Information Security Risk Assessment and Pointed Reporting: Scalable Approach
D.S. Bhilare, A.K. Ramani, and Sanjay Tanwani

An Extended Algorithm to Enhance the Performance of the Gridbus Broker with Data Restoring Technique
Abu Awal Md. Shoeb, Altaf Hussain, Md. Abu Naser Bikas, and Md. Khalad Hasan

Session 6

Prediction of Ship Pitching Based on Support Vector Machines
Li-hong Sun and Ji-hong Shen

The Methods of Improving the Manufacturing Resource Planning (MRP II) in ERP
Wenchao Jiang and Jingti Han

A New Model for Classifying Inputs and Outputs and Evaluating the DMUs Efficiency in DEA Based on Cobb-Douglas Production Function
S.M. Hatefi, F. Jolai, H. Kor, and H. Iranmanesh

The Analysis and Improvement of the Price Forecast Model Based on Fractal Theory
Herui Cui and Li Yang

A Flash-Based Mobile Learning System for Learning English as Second Language
Firouz B. Anaraki

Recognition of Trade Barrier Based on General RBF Neural Network
Yu Zhao, Miaomiao Yang, and Chunjie Qi

An Object-Oriented Product Data Management
Fan Wang and Li Zhou

Study of 802.11 Network Performance and Wireless Multicasting
Biju Issac

A Novel Approach for Face Recognition Based on Supervised Locality Preserving Projection and Maximum Margin Criterion
Jun Kong, Shuyan Wang, Jianzhong Wang, Lintian Ma, Baowei Fu, and Yinghua Lu

Association Rules Mining Based on Simulated Annealing Immune Programming Algorithm
Yongqiang Zhang and Shuyang Bu

Processing Power Estimation of Simple Wireless Sensor Network Nodes by Power Macro-modeling
M. Rafiee, M.B. Ghaznavi-Ghoushchi, and B. Seyfe

A Fault-Tolerant Strategy for Multicasting in MPLS Networks
Weili Huang and Hongyan Guo

A Novel Content-based Information Hiding Scheme
Jun Kong, Hongru Jia, Xiaolu Li, and Zhi Qi

Ambi Graph: Modeling Ambient Intelligent System
K. Chandrasekaran, I.R. Ramya, and R. Syama

Session 7

Research on Grid-based Short-term Traffic Flow Forecast Technology
Wang Xinying, Juan Zhicai, Liu Xin, and Mei Fang

A Nios II Based English Speech Training System for Hearing-Impaired Children
Ningfeng Huang, Haining Wu, and Yinchen Song

A New DEA Model for Classification Intermediate Measures and Evaluating Supply Chain and its Members
S.M. Hatefi, F. Jolai, H. Iranmanesh, and H. Kor

A Novel Binary Code Based Projector-Camera System Registration Method
Jiang Duan and Jack Tumblin

Non-temporal Multiple Silhouettes in Hidden Markov Model for View Independent Posture Recognition
Yunli Lee and Keechul Jung

Classification of Quaternary [21s+1,3] Optimal Self-orthogonal Codes
Xuejun Zhao, Ruihu Li, and Yingjie Lei

Performance Analysis of Large Receive Offload in a Xen Virtualized System
Hitoshi Oi and Fumio Nakajima

An Improved Genetic Algorithm Based on Fixed Point Theory for Function Optimization
Jingjun Zhang, Yuzhen Dong, Ruizhen Gao, and Yanmin Shang

Example-Based Regularization Deployed to Face Hallucination
Hong Zhao, Yao Lu, Zhengang Zhai, and Gang Yang

An Ensemble Approach for Semantic Assessment of Summary Writings
Yulan He, Siu Cheung Hui, and Tho Thanh Quan

A Fast Reassembly Methodology for Polygon Fragment
Gang Xu and Yi Xian

A Data Mining Approach to Modeling the Behaviors of Telecom Clients
Xiaodong Liu

Simulating Fuzzy Manufacturing System: Case Study
A. Azadeh, S.M. Hatefi, and H. Kor

Research of INS Simulation Technique Based on UnderWater Vehicle Motion Model
Jian-hua Cheng, Yu-shen Li, and Jun-yu Shi

Modeling and Simulation of Wireless Sensor Network (WSN) with SpecC and SystemC
M. Rafiee, M.B. Ghaznavi-Ghoushchi, S. Kheiri, and B. Seyfe

Session 8

Sub-micron Parameter Scaling for Analog Design Using Neural Networks
A.A. Bagheri-Soulla and M.B. Ghaznavi-Ghoushchi

An Improved Genetic Algorithm Based on Fixed Point Theory for Function Optimization
Jingjun Zhang, Yuzhen Dong, Ruizhen Gao, and Yanmin Shang

P2DHMM: A Novel Web Object Information Extraction Model
Jing Wang and Zhijing Liu

An Efficient Multi-Patterns Parameterized String Matching Algorithm with Super Alphabet
Rajesh Prasad and Suneeta Agarwal

Research on Modeling Method of Virtual Enterprise in Uncertain Environments
Jihai Zhang

Design of Intrusion Detection System Based on a New Pattern Matching Algorithm
Hu Zhang

To Construct Implicit Link Structure by Using Frequent Sequence Miner (FS-Miner)
May Thu Aung and Khin Nwe Ni Tun

Recognition of Eye States in Real Time Video
Lei Yunqi, Yuan Meiling, Song Xiaobing, Liu Xiuxia, and Ouyang Jiangfan

Performance Analysis of Postprocessing Algorithm and Implementation on ARM7TDMI
Manoj Gupta, B.K. Kaushik, and Laxmi Chand

NURBS Interpolation Method with Feedrate Correction in 3-axis CNC System
Liangji Chen and Huiying Li

Implementation Technique of Unrestricted LL Action Grammar
Jing Zhang and Ying Jin

Improving BER Using RD Code for Spectral Amplitude Coding Optical CDMA Network
Hilal Adnan Fadhil, S.A. Aljunid, and R. Badlishah Ahmad

USS-TDMA: Self-stabilizing TDMA Algorithm for Underwater Wireless Sensor Network
Zhongwen Guo, Zhengbao Li, and Feng Hong

Mathematical Document Retrieval for Problem Solving
Sidath Harshanath Samarasinghe and Siu Cheung Hui

Lossless Data Hiding Scheme Based on Adjacent Pixel Difference
Zhuo Li, Xiaoping Chen, Xuezeng Pan, and Xianting Zeng

Author Index - Volume 1

Preface

Dear Distinguished Delegates and Guests,

The Organizing Committee warmly welcomes our distinguished delegates and guests to the International Conference on Computer Engineering and Technology 2009 (ICCET 2009), held on January 22 - 24, 2009 in Singapore.

ICCET 2009, ICACC 2009 and ICECS 2008 are sponsored by the International Association of Computer Science and Information Technology (IACSIT) and the Singapore Institute of Electronics (SIE), and the accepted papers of ICECS 2008 have been included in the ICCET proceedings as a special session. If you have attended a conference sponsored by IACSIT before, you are aware that the conferences together report the results of research efforts in a broad range of computer science. These conferences are aimed at discussing with all of you the wide range of problems encountered in present and future high technologies. ICCET 2009, ICACC 2009 and ICECS 2008 are organized to gather members of our international community of computer and control scientists so that researchers from around the world can present their leading-edge work, expanding our community's knowledge and insight into the significant challenges currently being addressed in that research. The conference Program Committee is itself quite diverse and truly international, with membership from the Americas, Europe, Asia, Africa and Oceania.

These proceedings record the fully refereed papers presented at the conference. The main conference themes and tracks are Computer Engineering and Technology. The conference aims to bring together researchers, scientists, engineers, and practitioners to exchange and share their experiences, new ideas, and research results about all aspects of the main conference themes and tracks, and to discuss the practical challenges encountered and the solutions adopted.

The main goal of these events is to provide international scientific forums where researchers in a number of interacting fields can exchange new ideas through in-depth discussions with their peers from around the world. Both inward research (core areas of computer control) and outward research (multi-disciplinary, inter-disciplinary, and applications) will be covered during these events.

The conference has solicited and gathered technical research submissions related to all aspects of the major conference themes and tracks. All papers in the proceedings have been peer reviewed by reviewers drawn from the scientific committee, external reviewers and the editorial board, depending on the subject matter of the paper. Reviewing and initial selection were undertaken electronically. After the rigorous peer-review process, the submitted papers were selected on the basis of originality, significance, and clarity for the purpose of the conference. The selected papers and additional late-breaking contributions to be presented as lectures will make an exciting technical program. The conference program is extremely rich, featuring high-impact presentations.


The high quality of the program, guaranteed by the presence of an unparalleled number of internationally recognized top experts, can be assessed when reading the contents of the program. The conference will therefore be a unique event, where attendees will be able to appreciate the latest results in their field of expertise and to acquire additional knowledge in other fields. The program has been structured to favor interactions among attendees coming from many diverse horizons, scientifically, geographically, from academia and from industry. Included in this effort to favor interactions are social events at prestigious sites.

We would like to thank the program chairs, organization staff, and the members of the program committees for their work. Thanks also go to Ms. Lisa O'Conner, CPS Production Editor, Conference Publishing Services (CPS), IEEE Computer Society, for her wonderful editorial service to these proceedings. We are grateful to all those who have contributed to the success of ICCET 2009. We hope that all participants and other interested readers benefit scientifically from the proceedings and also find them stimulating in the process. Finally, we would like to wish you success in your technical presentations and social networking. We hope you have a unique, rewarding and enjoyable week at ICCET 2009, ICACC 2009 and ICECS 2008 in Singapore.

With our warmest regards,

Yi Xie
January 22 - 24, 2009
Singapore


ICCET 2009 Committee Members

V. Saravanan, Karunya University, India
Gunter Glenda A., University of Central Florida, USA
Wen-Tsao Pan, Jinwen University of Science and Technology, China (Taiwan)
Gopalakrishnan Kasthurirangan, Iowa State University, USA
Anupam Shukla, Indian Institute of Information Technology, India
Wei Guo, Tianjin University, China
Mahanti Prabhat Kumar, University of New Brunswick, Canada
Hrudaya Ku Tripathy, Institute of Advanced Computer and Research, India
Narasimhan V. Lakshmi, University of Newcastle, Australia
Jinlong Wang, Qingdao Technological University, China
Amrita Saha, West Bengal University of Technology, India
Yi Xie, Cagayan State University, Philippines
Sevaux Marc, University of South-Brittany, France
Amir Masoud Rahmani, Islamic Azad University, Iran
Yang Laurence T., St. Francis Xavier University, Canada
Lau Bee Theng, Swinburne University of Technology Sarawak, Malaysia
Poramate Manoonpong, University of Gottingen, Germany
Tahseen A. Jilani, University of Karachi, Pakistan
Qian Chen, Columbia University, USA
Zhihong Xiao, Zhejiang Wanli University, China


ICCET 2009 Organizing Committees

Honor Chairs
R. C. Eberhart, Purdue University, USA
A. Kandel, University of South Florida, USA
J.D. Pinter, Dalhousie University, Canada

Conference Chairs
S.R. Bhadra Chaudhuri, Bengal Engineering and Science University, India
Jianhong Zhou, Sichuan University
Xiaoxiao Zhou, Nanyang Technological University, Singapore

Conference Steering Committee
Yi Xie, Cagayan State University, Philippines
Hoang Huu Hanh, Hue University, Vietnam
Kamaruzaman Jusoff, Yale University, USA

Program Committee Chairs
S.M. Aqil Burney, University of Karachi, Pakistan
Nazir Ahmad Zafar, University of Central Punjab, Pakistan
Nashat Mansour, Lebanese American University, Lebanon

Publicity Chairs
Basim Alhadidi, Al Balqa Applied University, Jordan
M. Aqeel Iqbal, Foundation University, Pakistan
Brian B.C. Shinn, Chungbuk National University, Korea


International Conference on Computer Engineering and Technology

Session 1


Overlapping Non-Dedicated Clusters Architecture

Martin Šťava and Pavel Tvrdík
Department of Computer Science and Engineering
Czech Technical University in Prague
Prague, Czech Republic
{stavam2,tvrdik}@fel.cvut.cz

978-0-7695-3521-0/09 $25.00 © 2009 IEEE. DOI 10.1109/ICCET.2009.66

Abstract

Non-dedicated computer clusters promise more efficient resource utilization than conventional dedicated clusters. Existing non-dedicated clustering solutions either expect trust among participating users, or they do not take into account the possibility of running multiple independent clusters on the same set of computers. In this paper, we argue that the ability to run multiple independent clusters without requiring trust among participating users can be capitalized on to improve user experience and thus attract more users to participate in the cluster. A generic extension of non-dedicated clusters that satisfies these requirements is defined, and the feasibility of one particular extension is demonstrated on our implementation.

I. INTRODUCTION

Clusters built from commodity computers are a popular computational platform. They are used as a cost-effective alternative to expensive supercomputers [1], [2], as a scalable, highly available solution for commercial applications [3], as well as load-leveling clusters for ordinary day-to-day use [4], [5]. The concept of traditional dedicated clusters was extended by several existing projects [5]–[8] to support utilization of non-dedicated computers, usually standard workstations. These projects rely on users offering their idle computers to participate in a cluster.

Methods of attracting users to participate vary. Some projects offer only a good feeling from the participation. In other environments, like university laboratories, the participation may be enforced by a system administrator. The most interesting method used, however, seems to be a reciprocal offer of cluster computing power to volunteering users. In this case, the users are granted computing power of the cluster proportional to the power they have given to the cluster.

The reciprocal computing power trading, however, needs a well-suited cluster architecture. If there is just a single instance of a non-dedicated cluster running, the volunteers cannot use the earned cluster computing power directly from their machine; instead, they first need to log in to the cluster and perform their resource-demanding computations there.

Such a scenario is clearly not well suited for the volunteering users. For example, a user who is about to perform a parallel compilation on his machine would like to use his granted cluster time, but he cannot, since he can perform the cluster operations only from the cluster itself, not from his machine. On the other hand, if there is support for coexistence of multiple clusters, the machines of volunteering users can form their own clusters and the users can use the granted cluster time transparently from their machines.

The second important aspect for attracting users to participate in non-dedicated clusters is the trust relationship. Users offering their computers as non-dedicated computing resources should not be required to fully trust the cluster administrators, and neither should the administrators be required to trust the users. Any such trust requirement complicates the forming and expansion of the cluster.

In this paper, we first briefly review the most important existing architectures. Then we present a relaxation of existing architecture concepts and argue for its advantages over existing systems. The feasibility of the architecture is demonstrated for one particular case on our research clustering solution called Clondike [8].

II. SCOPE

We primarily focus on clusters attempting to provide a single system image (SSI) illusion at the operating system level. This is a reasonable limitation, since the motivation for developing SSI clusters is, as in our architecture, to improve user experience.

III. EXISTING ARCHITECTURES

The most common form of clustering is the dedicated cluster. In these clusters, all machines are fully dedicated to the cluster; they all share the user account space, process space, and file system. Kerrighed [9] and OpenSSI [10] are well-known examples of such clusters.

An extension of dedicated clusters are non-dedicated clusters [5]–[8]. These clusters consist of one or more dedicated machines, forming a core of the cluster, and any number of non-dedicated machines. The non-dedicated machines can join or leave the cluster at any time, but they do not fully belong to the cluster even while they are joined. This separation is often achieved by running the cluster code inside virtual machines running on the non-dedicated machines.

An interesting alternative to standard architectures is represented by the openMosix [4] and Mosix2 [11] projects. Cluster machines do not share a common file system, but they are still assumed to share the user account space. Users of a cluster are not provided standard SSI features, but rather a different SSI depending on the machine they are logged in to.

Another significant architecture is the multi-cluster. This is not really an architecture of a single cluster; the term refers to clusters interconnected together. Multi-clusters have been gaining popularity in recent years as a next logical step towards an idealized grid solution. Examples of projects supporting the multi-cluster architecture are Mosix2 [11] and LSF [12].

IV. PROPOSED ARCHITECTURE

Figure 1. Dedicated cluster. All nodes belong to a single cluster.

Figure 2. Non-dedicated cluster. The upper left node forms the core of a non-dedicated cluster and uses two other nodes as non-dedicated nodes.

In this section, we define our envisioned architecture. A basic building block of the proposed cluster can be either a single machine or a dedicated SSI cluster. There can be mixed environments, where some blocks are single machines and some are clusters. We will refer to these blocks uniformly as nodes in both cases, distinguishing explicitly between clusters and single machines where required.

The nodes can be connected in an arbitrary way, but the key factor is that each node forms the core of its own SSI cluster, using the other nodes as non-dedicated blocks. By forming a core we mean that it defines its own account space, file system, and process space. Every node can possibly form its own administrative domain. In addition, there should be no requirement for trust among participating nodes. These two attributes imply a need for a strong security model in the architecture implementation.

As implied by the definition, every node can use the others as its non-dedicated nodes. When some node is used as a non-dedicated block, its own view of SSI still exists. Moreover, its SSI view is fully separated from the SSI view of the node that is using it as a non-dedicated node. Nothing prevents two nodes from interacting with each other, each using the other as its non-dedicated node. A node can be used as a non-dedicated node by more than one node. In that case, all SSI views in which the node participates, including the local view, should be separated from each other.

Because of the nature of the proposed solution, we will refer to it in the paper as an overlapping non-dedicated cluster (ONDC).

V. COMPARISON WITH OTHER ARCHITECTURES

Non-dedicated clusters are an extension of dedicated clusters. The ONDC is an extension of the non-dedicated clusters, thus it is a superset of both. Figures 1, 2, and 3 illustrate schematically the difference among the three types of clusters.

Mosix with its architecture is very close to the ONDC with all nodes as single machines, but its architecture seems to be driven more by technical aspects than by an intentional design. The biggest problem with Mosix is that it requires full trust among all participating nodes (and it assumes either a shared user account space or a consensus on the mapping of user ids).

Multi-clusters are also a special case of ONDC. As with Mosix, current projects require trust among the participating clusters, which may be a limiting factor in real deployments.

The existing grid solutions are close to ONDC, especially when ONDC is used for a large-scale deployment in a distributed area. The main difference is that the grid solutions primarily target large-scale deployments only. In contrast, ONDC can also be useful for local resource sharing. A user having a few machines at home would likely not use any grid solution to interconnect them, but an ONDC cluster may be a good candidate for that.

The concept of overlapping clusters is similar to virtual organization (VO) mechanisms [13], [14]. Modern grid solutions, like XtreemOS [15], are often based on VOs. The main difference between the VO and ONDC concepts is that virtual organizations are designed for a mutual cooperation agreement and some degree of trust, while our solution does not require any relation among cluster users. ONDC is, indeed, based on the assumption that the users do not know each other and have very limited trust among themselves.

Figure 3. ONDC. All nodes are using the other nodes as non-dedicated nodes. All clusters are isolated from each other.

VI. ADVANTAGES OF THE ONDC OVER OTHER ARCHITECTURES

The key advantage of the ONDC architecture is a unique combination of a system without trust requirements and the ability to form a separate cluster from each participating node. This combination of features has a high potential for attracting users to participate in the cluster.

By relaxing the trust requirement, users can easily join the cluster. This idea was already leveraged by successful projects like BOINC [1]. If there are trust requirements, users generally have to undergo some registration and possibly a (mutual) trust review process, which itself may be sufficient to deter users from joining.

By allowing the coexistence of multiple independent clusters we enable a natural user rewarding mechanism, where the users can get back the resources offered to the cluster by using the other nodes.

The ability of each node to form its own cluster (and hence export its file system) is another factor contributing to easy expansion of a cluster. Any user needs just the common ONDC code and does not need to install or configure anything specific for the other clusters. The cluster nodes can immediately use his machine as a non-dedicated node, and the user's node can likewise immediately use all other machines.

Another advantage of allowing each node to form its own cluster is the natural option of coexistence of different installations of clusters (even with conflicting versions of software installed on them) on the same physical hardware.

An important architectural advantage of the ONDC architecture is better fault tolerance compared with standard non-dedicated clusters. Fault tolerance in standard non-dedicated clusters relies on the fault-tolerance mechanisms of the dedicated core. When the core fails, the whole cluster fails. In ONDC, when some core fails, the cluster formed by this core stops working, but all other clusters remain functional (they may only lose some processes running on the crashed node).
Clearly, this does not increase the fault tolerance of any individual cluster in the ONDC. But non-dedicated clusters are generally based on the idea of utilizing idle machines, and the ONDC allows continuous utilization of those idle machines even in the presence of some cluster failures.

VII. USE CASES

To better illustrate the architecture, we describe a few possible use cases of ONDC in this section. These are just illustrative examples. All of the mentioned examples can coexist and cooperate as a single instance of ONDC.
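Before turning to the individual use cases, it may help to restate the ONDC definition and its fault-tolerance property as a small data model. The following Python sketch is purely illustrative: the class and method names (`Node`, `Ondc`, `join`, `cluster_of`, `fail`) are assumptions made for this example and are not part of Clondike or any other implementation. It only captures two statements from the text: every node is the core of its own cluster, and a failed core stops only the cluster it forms.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One ONDC building block: a single machine or a dedicated SSI cluster."""
    name: str
    alive: bool = True
    # Names of the cores whose clusters this node serves as a
    # non-dedicated node (hypothetical bookkeeping for this sketch).
    serves: set = field(default_factory=set)


class Ondc:
    """Toy model in which every node is the core of its own cluster."""

    def __init__(self, names):
        self.nodes = {n: Node(n) for n in names}

    def join(self, core, helper):
        """Let `core` use `helper` as a non-dedicated node."""
        if core == helper:
            raise ValueError("a core does not borrow itself")
        self.nodes[helper].serves.add(core)

    def cluster_of(self, core):
        """Nodes currently usable by the cluster whose core is `core`."""
        if not self.nodes[core].alive:
            return set()  # a cluster stops when its core fails
        helpers = {n.name for n in self.nodes.values()
                   if n.alive and core in n.serves}
        return {core} | helpers

    def fail(self, name):
        """Crash a node: its own cluster stops, others only lose a helper."""
        self.nodes[name].alive = False


ondc = Ondc(["Alfa", "Beta", "Gamma"])
for core in ondc.nodes:
    for helper in ondc.nodes:
        if core != helper:
            ondc.join(core, helper)

ondc.fail("Alfa")
print(sorted(ondc.cluster_of("Alfa")))  # [] : the cluster formed by Alfa is gone
print(sorted(ondc.cluster_of("Beta")))  # ['Beta', 'Gamma'] : Beta's cluster survives
```

The model deliberately ignores SSI view separation and security; it only shows that overlapping membership is a per-core relation rather than a property of the whole network.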

A small-scale example is a university computer laboratory. Currently, if clustering is to be used in this environment, there is either a single cluster shared by all users (enforcing the same environment for them), or there is some non-transparent job scheduling system to which the users can send their jobs to be processed. With ONDC, each computer can form its own cluster, using the resources of any other computer that is not being used at the moment. In addition, if any user brings his own laptop, he can simply plug it into the network and start using the other computers as non-dedicated nodes of a cluster based on his laptop (and of course, the computers in the laboratory can use his laptop if it is idle). He does not need any administrator privileges for the computers in the network.

Another use case of ONDC is multi-clusters. There is a clear demand for such computing platforms, underlined by the existence of commercial solutions like Mosix2 or LSF. The existing solutions could benefit especially from the security research of the ONDC architecture, as this is directly applicable to them.

At the largest scale, the ONDC architecture can be used as a world-wide cluster, similar to the SETI [16] or BOINC [1] projects. For such a project, a standard non-dedicated cluster can be sufficient. ONDC-based solutions could possibly attract a larger user base, since users can be rewarded for their offered time with proportional cluster computing power. In addition, with ONDC not only single volunteer computers can be connected, but whole clusters and multi-clusters can be connected and offer their spare computing power. As long as they already make use of a (compatible) ONDC infrastructure, it would be a simple configuration task to let them participate in the wider cluster built on ONDC.

VIII. RELATIONSHIP WITH MULTI-CORE COMPUTING

Any project being developed should plan for the future as well. The ONDC architecture targets commodity computers. It is always hard to predict future hardware evolution, but a commonly held expectation in the industry is that commodity computers are going to have more and more cores, while the cores themselves will not become much faster. This is, indeed, a perfect match for ONDC architectures. There are two main factors.

First, software developers will have to focus more on the parallelization of common programs. The CPU utilization patterns we can expect are as follows. Most of the cores would be utilized for a limited time when CPU-intensive parallelized tasks are running. For the rest of the time, most of the cores would be idle. Such patterns are already seen with current machines, for example in IT companies with frequent application compilations or in graphics studios where rendering represents the most CPU-intensive operation. ONDC can contribute a lot to such environments. Properly parallelized applications can use other machines' computing power at times of high CPU demand. Assuming that the limited periods of high CPU demand do not always overlap for all machines, there can often be some spare resources available in the network (of course, the biggest problem here can be support for distributed shared memory and I/O-bound tasks, but this is a more technical question and is out of the scope of this paper).

The second factor, contributing to non-dedicated clusters generally, is the observation that some programs are not easily parallelizable (either due to the nature of the algorithms or due to the prohibitive complexity of such a parallelization). Users who work mostly with such programs would have some of their cores idle most of the time. Machines of these users are good candidates to participate as non-dedicated nodes in any non-dedicated clustering solution.

Clearly, more research is needed in this area, especially to measure the impact of such sharing, since the usage of another core is not free (cache conflicts, bus contention, power consumption, etc.) and the CPU is not the only resource consumed by a running application (memory usage, network, disk I/O, etc.).

IX. IMPLEMENTATION

The proposed ONDC architecture is quite generic and there may be many implementations. The existing projects closest to the ONDC architecture are multi-clusters and Mosix. They fail, however, to address the requirement of cooperation among nodes that do not trust each other. To verify our ideas, we have started development of our own implementation of the architecture. In this section we briefly describe our system and its technical background, the changes required to meet the ONDC requirements, and how we address two key topics: scheduling and security/trust handling.

A. Clondike

The original idea of Clondike [8], [17] was to implement the first non-dedicated clustering solution based on the Linux operating system. Clondike is still in an experimental research phase, but it already supports most of the requirements on such systems.
Clusters based on Clondike consist of one dedicated node, called the core node, and a number of non-dedicated nodes, called detached nodes. This is a typical setup of any non-dedicated cluster, although some may contain a whole dedicated cluster as their core. A key feature of Clondike is support for both preemptive and non-preemptive process migration based on process checkpointing. With the migration support, it is possible to utilize detached nodes that would otherwise sit idle. The usage of detached nodes is continuously monitored by the core node, and if there is an opportunity to migrate a local process of the core node to an idle detached node, the migration mechanisms are used.

B. Technical background

The implementation consists of three parts. The lowest level is a kernel patch, which is kept as minimal as possible so that upgrades to new kernels are not unduly complicated. The patch consists mostly of simple hooks for the second part of the system, a set of kernel modules. These modules implement most of the low-level functionality required for process migration, as well as the process migration support itself. This is technically the most complicated part, and a description of its implementation is out of the scope of this paper. The implementation details can be found in [17].

Finally, there is a userspace part of the system. It makes use of the kernel part of the system, interacting with it via a special control file system (exporting cluster-specific data about processes), system signals (for requesting migration), and standard Linux kernel netlink sockets when bidirectional interaction is required (for example, for non-preemptive scheduling decisions on process execution). The userspace part performs all tasks that do not need to be directly in the kernel, like scheduling, monitoring, or information distribution. From a practical point of view (coding and debugging), it is a big advantage to put as much functionality as possible into userspace.

C. Changes required to support ONDC

Clondike was designed from the beginning to allow cooperation of untrusted parties, so this functionality did not require any modification to match ONDC requirements.

In order to allow overlapping clusters, an extension to the standard Clondike system was required. The system needs to allow coexistence of a core node and a detached node on a single physical machine. This way, the machine can act as the core of its own cluster while still offering its resources to another cluster.

The second, related extension is the ability to act as a detached node of multiple independent clusters.
This is a natural requirement: if we have, for example, a cluster of three nodes (Alfa, Beta, and Gamma), each of them forming its own cluster, we would like each of them to use all other nodes as detached nodes. Therefore Alfa will use Beta and Gamma as detached nodes, while Beta will use Alfa and Gamma as detached nodes. Gamma is then acting as a detached node for two independent clusters (and as the core node of its own cluster).

D. Scheduling

As outlined in previous sections, the main advantage of overlapping clusters support is the possibility of scheduling based on reciprocal computing power exchange. Economy-inspired market-based schedulers [18], [19], studied in a standard non-dedicated Clondike environment [20], seem to be a promising candidate for our goals. The Mosix multi-cluster solution has some support for overlapping clusters, and market-based schedulers are one of the scheduling options used [21].
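As a rough illustration of what reciprocal computing power exchange means in practice, the following Python sketch keeps a per-owner balance of donated and consumed CPU time. It is not a market-based scheduler in the sense of [18], [19] (there are no prices or bids), and it is not taken from Clondike; the `CreditLedger` class, its method names, and the CPU-seconds unit are all assumptions chosen for this example.

```python
from collections import defaultdict


class CreditLedger:
    """Hypothetical ledger of CPU-seconds donated to and consumed from others."""

    def __init__(self):
        # Positive balance: the owner has given more than he has used.
        self.balance = defaultdict(float)

    def record_remote_run(self, consumer, provider, cpu_seconds):
        """`consumer` ran a task for `cpu_seconds` on `provider`'s machine."""
        if cpu_seconds < 0:
            raise ValueError("cpu_seconds must be non-negative")
        self.balance[provider] += cpu_seconds
        self.balance[consumer] -= cpu_seconds

    def may_borrow(self, consumer, cpu_seconds, overdraft=0.0):
        """True if `consumer` has earned enough credit for the request."""
        return self.balance[consumer] + overdraft >= cpu_seconds


ledger = CreditLedger()
# A calculation started on Alfa used 120 CPU-seconds on Beta.
ledger.record_remote_run(consumer="Alfa", provider="Beta", cpu_seconds=120.0)

print(ledger.may_borrow("Beta", 100.0))  # True: Beta earned credit
print(ledger.may_borrow("Alfa", 10.0))   # False: Alfa has only consumed so far
```

In an ONDC the owner of Beta could spend this credit from his own machine, because Beta is also the core of its own cluster; where such a ledger would live and how it would be protected against untrusted nodes is left open here.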

Market-based schedulers seem to be a better candidate for ONDC than for standard non-dedicated clusters. In standard non-dedicated clusters with market-based schedulers, the situation is a bit cumbersome. The user offers his machine to the cluster and gets some credit for that. He can then use this credit for execution of his tasks, but not directly on his machine, since his machine acts as a non-dedicated node only. (In the non-overlapping cluster case, the tasks started locally on detached nodes are not clustered; they are in fact isolated from the cluster environment as much as possible.) He must log in to the cluster core and he can use his credit only there. In ONDC, he can use his credit transparently from his machine, since that machine forms a cluster of its own.

To illustrate the difference between a non-dedicated cluster and ONDC, we will use an example of a two-node cluster with nodes Alfa and Beta. Let Alfa be the core node of a non-dedicated cluster and Beta a detached node. When somebody runs a calculation on Alfa that uses resources on Beta, the owner of Beta gets credit to run a comparably expensive calculation on the cluster formed by Alfa. To use his credit, he needs to log in to Alfa and execute some process there (since Beta is only a detached node). In contrast, in the ONDC case both machines can be core nodes. So when the owner of Beta has credit to execute something on Alfa, he can execute a local process, which can be transparently migrated to Alfa and use the credit there.

Despite the apparent suitability, market-based scheduling strategies have not yet been tested in the ONDC version of the Clondike system, and doing so is a topic for future research. Since the scheduler in Clondike resides completely in userspace, implementing a new scheduler is a much easier task than in other similar systems.

A simpler and more straightforward scheduling strategy was used in the existing prototype.
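The prototype strategy, whose rules are listed in the remainder of this subsection (a local threshold on the core, a per-node accepting threshold, and priority for the owner's local tasks), can be sketched in Python as follows. This is a minimal reading of those rules, not the Clondike scheduler itself: the names `RemoteNode` and `place_task`, the field names, and the representation of load as a single number are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RemoteNode:
    name: str
    load: float            # load as tracked by the core's scheduler
    remote_tasks: int      # cluster tasks already hosted by this node
    accept_threshold: int  # hypothetical limit on hosted cluster tasks
    has_local_tasks: bool  # the owner's own work takes priority

    def accepts(self) -> bool:
        """A node refuses cluster jobs while busy locally or already full."""
        return (not self.has_local_tasks
                and self.remote_tasks < self.accept_threshold)


def place_task(local_cluster_tasks: int, local_threshold: int,
               remotes: List[RemoteNode]) -> Optional[str]:
    """Decide at process start: return a node name, or None to stay local."""
    # Step 1: the core is below its threshold, do not emigrate.
    if local_cluster_tasks < local_threshold:
        return None
    # Step 2: pick the least loaded remote node that still accepts tasks.
    candidates = [n for n in remotes if n.accepts()]
    if candidates:
        return min(candidates, key=lambda n: n.load).name
    # Step 3: no remote node is available, keep the task local.
    return None


remotes = [
    RemoteNode("Beta", load=0.2, remote_tasks=0, accept_threshold=2,
               has_local_tasks=True),   # idle-looking, but its owner is working
    RemoteNode("Gamma", load=0.9, remote_tasks=1, accept_threshold=2,
               has_local_tasks=False),
    RemoteNode("Delta", load=0.5, remote_tasks=2, accept_threshold=2,
               has_local_tasks=False),  # already at its accepting threshold
]

print(place_task(1, 2, remotes))  # None: the core still has capacity
print(place_task(2, 2, remotes))  # Gamma: the only node that accepts
```

Because the decision is taken only when a process starts, the sketch has no notion of moving a running task later, which matches the non-preemptive use of migration in the prototype.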
Each cluster in the ONDC version of Clondike has its own scheduler running on the cluster core node. This scheduler tracks the load and the count of cluster tasks running on the associated detached nodes and tries to level the load of these nodes (including the core node itself, which is preferred for some tasks). The scheduling decision is performed only at process start time, so only non-preemptive migration is used.

To honor the priority of local users over cluster users, a node with active local tasks does not accept any cluster jobs, even if some scheduler requests such a migration (so, if machine Alfa is running a local user's job and a scheduler on machine Beta requests a job migration to Alfa, the request is refused as long as the local tasks are running). Similarly, if a machine has too many remote jobs running on it, it does not accept any new migration requests. The scheduling algorithm performed by each node is as follows:

1) If the count of locally running cluster tasks is lower than the threshold, do not emigrate the task.
2) Otherwise, find the least loaded remote node that is below its accepting threshold and try to emigrate the task there.
3) If no such remote node is found, the task is kept running locally.

This strategy is very simple and has many problems in real life; nonetheless, it served well for system testing, as will be demonstrated in the Performance section. It makes use of the overlapping nature of the clusters and allows reciprocal computing power sharing, but unlike the market strategies, it does not have any fairness guarantees. Obviously, it is usable only at the smallest scale and only for scheduling of sufficiently independent processes (i.e., not for a collection of closely cooperating, highly dependent processes such as an MPI application; such processes may need co-scheduling [22] techniques to perform well).

E. Security and trust

The security functionality and trust management of the original Clondike system is directly applicable in the ONDC environment. In this section we briefly review the mechanisms used; details can be found in [23], [24].

Since non-dedicated clusters span administrative domain boundaries and are potentially used in an untrusted network environment, line security must be ensured. Clondike currently relies on establishing IPSEC channels between nodes to provide transparent link-level security.

By the ONDC definition, there should be no trust requirement between node owners. From the security point of view, this implies two main possible classes of attacks. First, the owner of a machine forming a cluster core node can try to send malicious code to a remote machine to get access to that node. Second, the owner of a machine acting as a detached node can read anything from the memory of cluster processes running on his machine. Moreover, he can alter the memory and code of those processes.

The first attack is technically easy to prevent.
The processes of cluster users run with the least possible privileges on the detached nodes, and thus the detached node is protected against these processes.

There is no reliable way to prevent the second type of attack, where the detached node owner reads or modifies cluster processes running on his machine. The owner of the detached node has superuser privileges, so he can do basically everything. In some special cases the results of remote execution can be quickly verified on the core node, but this is not always the case.

Our approach to this issue is based on deferring the security-critical decisions to the cluster users. Each user can specify what programs can run on what machine. In addition, the user can specify the files from his file system that can be accessed from other nodes. All nodes for which no specific rules are specified are considered to be of the same trust level; we refer to them as anonymous nodes. The anonymous nodes can also participate in the cluster; the users can use them for executing processes whose results can be easily verified, or for processes that operate on non-sensitive data.

To illustrate the decisions the user can make, we will use an example with three nodes: Alfa, Beta, and Gamma. A user on Alfa can trust the owner of Beta, but trust less (or perhaps not even know about) machine Gamma. The user would specify that any process can be migrated to Beta and that Beta can access any file on his file system. The user on Alfa also has some process that performs resource-demanding but easily verifiable operations (NP-complete calculations are one possible example). He can then specify that this process can be executed on any node. The process may need to write its result somewhere, so the user may need to give any node write access to some restricted part of the file system, so that the result can be written.

The user-defined restrictions have to be obeyed by the scheduler. The scheduler, however, cannot always know which files are going to be accessed by a process being migrated to a detached node. So the processes on the detached nodes must be monitored for file system access requests, and when a violating access is detected, an action must be taken. The action can be either a rollback to the previous process checkpoint, taken before migration to the detached node, or process termination. Migration back to the core node is not an option, since the process may already have been altered by the owner of the detached node.

The last problem in the enforcement of user-defined restrictions is a direct result of preemptive migration support. Thanks to the migration capabilities, a process can visit many nodes during its lifetime.
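The user-defined rules from the Alfa, Beta, Gamma example can be written down as a small policy table. The Python sketch below is only an illustration of the idea: the rule format, the `Policy` and `NodeRules` names, the prefix-based path matching, and the example paths are assumptions, not the Clondike rule syntax described in [23], [24]. The access check takes a list of visited nodes rather than a single node, since a migrated process may have passed through several owners' machines.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set

ANONYMOUS = "anonymous"  # rules for every node without explicit rules
ANY_PROGRAM = "*"        # wildcard: any program may be migrated


@dataclass
class NodeRules:
    programs: Set[str] = field(default_factory=set)  # programs allowed to run
    readable: Set[str] = field(default_factory=set)  # readable path prefixes
    writable: Set[str] = field(default_factory=set)  # writable path prefixes


class Policy:
    """Hypothetical per-user policy kept on the core node."""

    def __init__(self, rules: Dict[str, NodeRules]):
        self.rules = rules

    def _for(self, node: str) -> NodeRules:
        return self.rules.get(node, self.rules.get(ANONYMOUS, NodeRules()))

    def may_run(self, program: str, node: str) -> bool:
        allowed = self._for(node).programs
        return ANY_PROGRAM in allowed or program in allowed

    def may_access(self, path: str, write: bool,
                   visited: Iterable[str]) -> bool:
        """Every visited node must hold the privilege (no path normalization)."""
        for node in visited:
            rules = self._for(node)
            prefixes = rules.writable if write else rules.readable
            if not any(path.startswith(p) for p in prefixes):
                return False
        return True


# The user on Alfa trusts Beta fully; all other nodes are anonymous.
policy = Policy({
    "Beta": NodeRules(programs={ANY_PROGRAM}, readable={"/"}, writable={"/"}),
    ANONYMOUS: NodeRules(programs={"solver"},
                         writable={"/home/user/results/"}),
})

print(policy.may_run("make", "Gamma"))    # False: only "solver" may go anywhere
print(policy.may_run("solver", "Gamma"))  # True
print(policy.may_access("/home/user/thesis.tex", write=True,
                        visited=["Beta"]))           # True
print(policy.may_access("/home/user/thesis.tex", write=True,
                        visited=["Beta", "Gamma"]))  # False: Gamma was visited
```

A real implementation would have to enforce such checks in the kernel-level file system access path and decide between rollback and termination on a violation; the sketch only models the decision itself.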
The owner of any of the visited detached nodes can alter the process being executed. This means that the file system access violation checks must consider not only the node from which they are being performed, but all nodes that were visited during the process execution.

The problem can be illustrated again on our three-node cluster example. When a non-sensitive process is migrated to the untrusted node Gamma, the owner can modify the process. For example, he can change it to delete all accessible files. If the process gets to execute these commands on Gamma, it will eventually be terminated due to an access violation. If, however, the process migrates back to Alfa before executing the deletion code and the process visit history is not consulted, it would be allowed to execute and all accessible files would be deleted.

A solution to this (and related) problems is discussed in [24], where a stigmata mechanism is proposed. Each process is marked by a stigma of every node visited, and any sensitive operation is checked against the user-defined rules and all stigmata. If any of the visited nodes does not have the required privilege, the process is terminated.

X. PERFORMANCE

Table 1. Parameters of the test machines. Memory is in gigabytes, build time is in the format minutes:seconds, and the sequential time is in seconds.

Name    Cores   Mem.   Build time   Seq. time
Alfa    2       4      2:13         6
Beta    2       2      3:28         7
Gamma   2       2      3:46         8
Delta   1       1      6:43         11

Figure 4. Graph with times when a single compilation was started from the machine Delta

Figure 5. Graph with times when a single compilation was started from the machine Alfa

There is a vast number of cases that could be measured. We have decided to demonstrate one common use case: parallel compilation. This is a good representative of a harder class of problems, as there is a relatively big overhead due to a high communication-to-computation ratio. Moreover, the applications used were not designed to run in a cluster environment.

In our demonstration, we use standard unmodified GNU tools [25] such as make and gcc. The application being compiled is the Linux kernel, which is sufficiently large to benefit from cluster parallelization. Clondike does not have all security mechanisms implemented yet, but the most performance-demanding part of the security, the IPSec [26] based channel security, was used, and therefore the performance figures are representative.

Tests were performed on a realistic platform, using 4 heterogeneous computers. All test machines have the x86 64-bit architecture and are interconnected with standard 100 Mbps Ethernet. Table 1 lists the key characteristics of the machines used for testing. It is generally hard to compare machine performance, so rather than meaningless frequency numbers, the table captures the time it takes to build the kernel on each of the machines. Another important value is the sequential portion of the build time, which consists mainly of the final linking. Each test was performed 10 times and the presented values represent the minimum time achieved. It is not the purpose of this paper to do a detailed analysis of variance (the time variance was in the range of a few seconds), worst-case scenarios, etc. In cases when multiple concurrent compilations were measured, the chosen set of results was the one with the shortest compilation time on the slowest machine.

The first set of performed tests demonstrates the standard non-dedicated cluster functionality of Clondike. Figure 4 shows compilation times of a single compilation started from the slowest machine (Delta). Each group of bars shows compilation times for a different cluster configuration. In each bar group, there are 3 running times: the best time achieved in the cluster, the best time achieved in a cluster with IPSEC, and a theoretical minimum time. The first 2 times are clear. The theoretical minimum time is calculated as follows:

$$ S_t + \frac{T_s - S_t}{T_s} \cdot \frac{1}{\sum_{i \in \text{set of participating nodes}} 1/T_i} \qquad (1) $$

where $S_t$ denotes the sequential part of the compilation, $T_s$ denotes the time of compilation on the node that started it, and $T_i$ denotes the sequential compilation time on node $i$. The set of participating nodes includes the node that started the compilation. The theoretical minimum time accounts for the sequential time of the calculation, in order to isolate inherent parallelization limitations (due to Amdahl's law) from inefficiencies of the system itself. The ratio of achieved times to theoretical minimum times calculated this way reflects the overhead of the system (there is still some inherent limitation due to network transfers, but this cannot be separated out so easily).

Figure 6. Graph showing overheads of runs corresponding to Figures 4 and 5, both with and without IPSEC. The overheads are expressed in percent.

Figure 7. Graph with times when each machine simultaneously started one compilation

There are a few noteworthy observations regarding the graph in Figure 4. First, the measured times are very good, even though the overhead increases with the number of nodes, as can be seen in Figure 6. The increasing overhead is due to inefficiencies of the current experimental scheduler, which cannot use all machines effectively, especially at the end of the calculation. The second important observation is that the overhead due to security (represented by IPSEC) is apparent, but still quite small (15% in the worst case). The security overhead is low because IPSEC was configured to use AES encryption, which is very fast on 64-bit platforms.

In contrast to Figure 4, Figure 5 captures the results of a single compilation started from the fastest machine (Alfa). The important observation here is that the compilation has lower overhead both with and without IPSEC. The key factor for the lower overhead is that a smaller percentage of the work is sent to other machines. In addition, based on runtime observation of the system behavior while compiling, it seems that the scheduler can use other machines more effectively while running on Alfa. This is, however, an area that requires more research in the future.
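As an illustration of equation (1), the following minimal Python sketch evaluates the theoretical minimum time from the Table 1 values. It rests on assumptions that the text does not state explicitly: the "Build time" column is used as the per-node time T_i (and as T_s for the starting node), and the "Seq. time" of the starting node is used as S_t. The names `theoretical_min_time`, `BUILD_SECONDS`, `SEQ_SECONDS` and `minimum_from` are hypothetical helpers introduced only for this example.

```python
from typing import Dict, Iterable


def theoretical_min_time(seq_time: float, start_build_time: float,
                         build_times: Iterable[float]) -> float:
    """Equation (1): S_t + ((T_s - S_t) / T_s) * 1 / sum_i (1 / T_i)."""
    # Fraction of the build on the starting node that can run in parallel.
    parallel_fraction = (start_build_time - seq_time) / start_build_time
    # Combined "builds per second" of all participating nodes.
    combined_rate = sum(1.0 / t for t in build_times)
    return seq_time + parallel_fraction / combined_rate


# Table 1 values converted to seconds (assumed mapping: Build time -> T_i, Seq. time -> S_t).
BUILD_SECONDS: Dict[str, float] = {"Alfa": 133, "Beta": 208, "Gamma": 226, "Delta": 403}
SEQ_SECONDS: Dict[str, float] = {"Alfa": 6, "Beta": 7, "Gamma": 8, "Delta": 11}


def minimum_from(start: str, participants: Iterable[str]) -> float:
    """Theoretical minimum for a build started on `start`; participants must include it."""
    return theoretical_min_time(
        SEQ_SECONDS[start],
        BUILD_SECONDS[start],
        [BUILD_SECONDS[name] for name in participants],
    )


if __name__ == "__main__":
    all_nodes = ["Alfa", "Beta", "Gamma", "Delta"]
    for start in ("Delta", "Alfa"):
        print(start, round(minimum_from(start, all_nodes), 1), "seconds")
```

With a single participating node the expression reduces to T_s, which is a quick sanity check of the reconstruction. Dividing a measured time by the value returned here gives the overhead ratio discussed above.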

The second set of tests captures an ONDC-specific use case in which all machines start a compilation at the same time. This use case cannot be performed on any standard non-dedicated cluster, since only core machines can use the others in that case. In ONDC, each machine should first do only its own compilation and offer its resources only after it is done with that. Results of this test can be seen in Figure 7: each machine completes its work in about the same or better time than in the non-clustered case (the slightly worse times in some cases are only due to random time variations). Probably the most important number is the time of the compilation on the slowest machine (Delta), since it shows the total time spent from the start of the compilations until the end of the last one. If we divide this number by 4, we get the average build time of each of the kernels in this case. In numbers, it would be around 58 seconds for a non-secured compilation. The best non-secured time measured when starting from Alfa was 1 minute and 2 seconds, so this case utilizes resources even more efficiently.

There can be many other test combinations even for this simple use case, but we believe the presented results demonstrate clearly enough that parallelization of ordinary programs can be achieved with acceptable overheads.

XI. CONCLUSIONS

In this paper, we have defined an extension of existing clustering concepts. We have discussed its relationship with other architectures and its advantages. We have argued why we believe that the architecture is going to be even more interesting in the future. To verify our ideas, we have demonstrated the feasibility of one possible use case of the architecture.

ACKNOWLEDGEMENTS

This research has been supported by the Czech Grant Agency GACR under grant No. 102/06/0943 and by the research program MSMT 6840770014.

REFERENCES

[1] D. P. Anderson, "Boinc: a system for public-resource computing and storage," in Grid Computing, 2004. Proceedings. Fifth IEEE/ACM International Workshop on, 2004, pp. 4-10. [Online]. Available: http://dx.doi.org/10.1109/GRID.2004.14
[2] D. Ridge, D. Becker, P. Merkey, and T. Sterling, "Beowulf: Harnessing the power of parallelism in a pile-of-pcs," in Proceedings, IEEE Aerospace, 1997, pp. 79-91.
[3] L. A. Barroso, J. Dean, and U. Holzle, "Web search for a planet: The google cluster architecture," Micro, IEEE, vol. 23, no. 2, pp. 22-28, 2003. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1196112
[4] "openmosix," http://www.openmosix.org/.
[5] K. Kaneda, Y. Oyama, and A. Yonezawa, "A virtual machine monitor for utilizing non-dedicated clusters," in SOSP '05: Proceedings of the twentieth ACM symposium on Operating systems principles. New York, NY, USA: ACM, 2005, pp. 1-11.
[6] C. Kauhaus and A. Schafer, "Harpy: A virtual machine based approach to high-throughput cluster computing," http://www2.informatik.uni-jena.de/ckauhaus/2005/harpy.pdf.
[7] R. C. Novaes, P. Roisenberg, R. Scheer, C. Northfleet, J. H. Jornada, and W. Cirne, "Non-dedicated distributed environment: A solution for safe and continuous exploitation of idle cycles," in Proceedings of the Workshop on Adaptive Grid Middleware, 2003, pp. 107-115.
[8] M. Kacer, D. Langr, and P. Tvrdik, "Clondike: Linux cluster of non-dedicated workstations," in CCGRID '05: Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05) - Volume 1. Washington, DC, USA: IEEE Computer Society, 2005, pp. 574-581.
[9] "Kerrighed," http://www.kerrighed.org/.
[10] "Openssi," http://www.openssi.org/.
[11] "Mosix," http://www.mosix.org/.
[12] "Lsf," http://www.platform.com/.
[13] M. Coppola, Y. Jégou, B. Matthews, C. Morin, L. P. Prieto, Óscar David Sánchez, E. Y. Yang, and H. Yu, "Virtual organization support within a grid-wide operating system," IEEE Internet Computing, vol. 12, no. 2, pp. 20-28, 2008.
[14] L. J. Winton, "A simple virtual organisation model and practical implementation," in ACSW Frontiers '05: Proceedings of the 2005 Australasian workshop on Grid computing and e-research. Darlinghurst, Australia, Australia: Australian Computer Society, Inc., 2005, pp. 57-65.
[15] "Xtreemos," http://www.xtreemos.eu/.
[16] "Seti," http://setiathome.berkeley.edu/.
[17] J. Capek, "Preemptive process migration in a cluster of non-dedicated workstations," Master's thesis, Czech Technical University, June 2005.
[18] K. Lai, L. Rasmusson, E. Adar, L. Zhang, and B. A. Huberman, "Tycoon: An implementation of a distributed, market-based resource allocation system," Multiagent Grid Syst., vol. 1, no. 3, pp. 169-182, 2005.
[19] R. Buyya, D. Abramson, and S. Venugopal, "The grid economy," in Proceedings of the IEEE, 2005, pp. 698-714.
[20] M. Košťál and P. Tvrdík, "Evaluation of heterogeneous nodes in a non-dedicated cluster," in Parallel and Distributed Computing and Systems, 2006.
[21] L. Amar, J. Stosser, A. Barak, and D. Neumann, "Economically enhanced mosix for market-based scheduling in grid os," in 8th IEEE/ACM Int. Conf. on Grid Computing, 2007.
[22] M. Hanzich, F. Gine, P. Hernandez, F. Solsona, and E. Luque, "Coscheduling and multiprogramming level in a non-dedicated cluster," in Recent advances in parallel virtual machine and message passing interface, 2005, pp. 327-336.
[23] M. Stava and P. Tvrdik, "File system security in the environment of non-dedicated computer clusters," in PDCAT '07: Proceedings of the Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies. Washington, DC, USA: IEEE Computer Society, 2007, pp. 445-452.
[24] M. Kacer and P. Tvrdik, "Protecting non-dedicated cluster environments by marking processes with stigmata," in Advanced Computing and Communications, 2006. ADCOM 2006. International Conference on, 2006, pp. 107-112.
[25] "Gnu," http://www.gnu.org/.
[26] S. Kent and K. Seo, "Security Architecture for the Internet Protocol," RFC 4301 (Proposed Standard), Dec. 2005, http://tools.ietf.org/html/rfc4301.


To Determine the Weight in a Weighted Sum Method for Domain-Specific Keyword Extraction

Wenshuo Liu, Wenxin Li
Key Laboratory of Machine Perception, Peking University
Beijing Supertool Internet Technology Co. Ltd
Beijing, China
{lwshuo, lwx}@pku.edu.cn

978-0-7695-3521-0/09 $25.00 © 2009 Crown Copyright. DOI 10.1109/ICCET.2009.136

Abstract: Keyword extraction has long been a traditional topic in Natural Language Processing. However, most methods have been too complicated and slow to be applied in real applications, for example in web-based systems. This paper proposes an approach that completes some preparatory work focused on exploring the linguistic characteristics of a specific domain. This part can be completed once and for all, and thus reduces the burden on the real extraction process. It is a weighted sum method, and the preparatory work focuses on finding the weights. Once we have the weights, the extraction can be completed by addition, multiplication and sorting, which are quite simple for a modern computer. Experimental results show the effectiveness of the proposed approach.

I. INTRODUCTION

Keyword extraction is the process of extracting a few salient words from a text and using these words to summarize its content. This task has been widely studied for a long time in the natural language processing community, because it is important for many text applications, such as document retrieval. Domain-specific keyword extraction came into sight when researchers found that fully exploiting domain-specific information can greatly improve the performance of this task.

Traditional methods focused on efficient algorithms to improve the performance of keyword extraction. This paper presents a method that emphasizes doing sufficient and effective preparatory work in order to simplify the real extraction process. The difference is illustrated in Figure 1. While in traditional methods every step must be performed for every document, the weight extraction part in my work is finished once and for all.

In my work, keyword extraction involves assigning a score to each candidate word based on various features, sorting the candidates according to the score, and choosing the top few. Four different features are used: TFIDF, the part-of-speech (PoS) tag, the relative position of first occurrence, and the chi-square statistic. Some experiments show that the weighted sum of the feature vector can be a good choice for the score, as long as we have a proper weight vector. Different domains have different characteristics in their usage of words, so a weight vector that works well in one domain might be totally ineffective in another. The weight vector is therefore the main piece of domain-specific information we need to explore. This article is primarily about constructing a model to learn the weight vector. The model is very much like a perceptron.
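The scoring step described in the introduction (a weighted sum over the four features, followed by sorting and taking the top few candidates) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the feature order, the function names `score` and `extract_keywords`, the candidate words, and all numeric values below are hypothetical and are not taken from the paper's data.

```python
from typing import Dict, List, Sequence

# Assumed feature order: (TFIDF, PoS, relative position of first occurrence, chi-square).


def score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of one candidate's feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def extract_keywords(candidates: Dict[str, Sequence[float]],
                     weights: Sequence[float], k: int) -> List[str]:
    """Rank candidates by weighted-sum score and return the k best terms."""
    ranked = sorted(candidates,
                    key=lambda term: score(candidates[term], weights),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical weight vector and feature values, for illustration only.
    weights = (0.5, 0.3, 0.1, 0.1)
    candidates = {
        "kernel": (0.9, 0.8, 0.1, 0.6),
        "compile": (0.4, 0.2, 0.5, 0.3),
        "scheduler": (0.7, 0.8, 0.3, 0.5),
    }
    print(extract_keywords(candidates, weights, k=2))
```

Because the weights are fixed ahead of time, the per-document cost is one multiplication and addition per feature plus a sort of the candidates, which is the simplification the approach aims for.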

II. RELATED WORK

Probabilistic methods and machine learning have been widely used in the task of keyword extraction. Peter D. Turney (1999) developed the GenEx system, which exploited a genetic algorithm and used to be the state of the art. Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and Craig G. Nevill-Manning (1999) described a simple procedure, called KEA, which was based on Naive Bayes. KEA proved to be as effective as GenEx, and even outperformed it when fully exploiting domain-specific information. Hulth (2004a) and Hulth (2004c) presented approaches using supervised machine learning, which constructed prediction models from texts with manually assigned keywords. Graph-based algorithms have also been explored. Xiaojun Wan, Jianwu Yang and Jianguo Xiao (2007) proposed an iterative reinforcement approach that performs keyword extraction and document summarization simultaneously. Their approach fully exploited the sentence-to-sentence, word-to-word, and sentence-to-word relationships.

III. DATA REPRESENTATION

The input document is first split up into separate terms. However, the terms themselves are useless; it is their attributes that matter. In this article, I choose four attributes, as mentioned above, to form a feature vector for each candidate keyword. Web page articles, which are classified into suitable domains, are used for training and testing. These articles all have manually assigned keywords for the model to learn from. For any document, each candidate word is represented as a four-dimensional feature vector. Words that are so common that they have no differentiating ability have been stored in a stop list and are removed during pre-processing.

A. TFIDF

TFIDF combines term frequency (TF) and inverse document frequency (IDF). It is designed to measure how specific a term T is to a certain document D:

$$ \mathrm{TFIDF}(T, D) = P[\text{term in } D \text{ is } T] \times \left(-\log P[T \text{ in a document}]\right) \qquad (1) $$

The TF is measured by counting the number of times that term T occurs in document D, and the IDF by counting the number of documents in the corpus of a specific domain.

B. PoS

When inspecting manually assigned keywords, the vast majority turn out to be nouns. But there are still differences between domains. For example, in entertainment news, keywords might almost always be people's names, which are nouns, whereas in the sports field, verbs are also quite important. For each kind of PoS tag, we count its occurrences among the manually assigned keywords in the whole corpus and then divide by the total number of keywords. For example, when we consider nouns:

$$ \mathrm{PoS}(\text{noun}) = \frac{\text{number of manually assigned keywords which are nouns}}{\text{number of manually assigned keywords}} \qquad (2) $$

The results are numbers between 0 and 1, and they indicate which kinds of words are more likely to be keywords in the target domain.

C. Relative Position of First Occurrence

Not only the occurrence but also the location of a term is important. Text at certain positions, for example headlines and the first sentences of paragraphs, has been shown to contain more relevant terms. This feature is calculated as the number of words that precede the term's first appearance, divided by the document's length. For example, considering term T in document D:

$$ \mathrm{RPFO}(T, D) = \frac{\text{the position of first appearance}}{\text{the length of the document}} \qquad (3) $$

The result is a number between 0 and 1 and indicates the proportion of the document preceding the term's first appearance.

D. Chi-Square Statistic

For term T and domain D, the chi-square statistic is defined as:

$$ \mathrm{CHI}(T, D) = \frac{(n_{11} n_{22} - n_{12} n_{21})^2 \, (n_{11} + n_{12} + n_{21} + n_{22})}{(n_{11} + n_{12})(n_{21} + n_{22})(n_{11} + n_{21})(n_{12} + n_{22})} \qquad (4) $$

In the equation, $n_{11}$ is the number of times that T occurs in domain D, $n_{21}$ is the number of times that terms other than T occur in domain D, $n_{22}$ is the number of times that terms other than T occur in domains other than D, and $n_{12}$ is the number of times that T occurs in domains other than D. The chi-square statistic is used to test the dependence between a term and a domain. The higher this value is, the more dependent term T is on domain D. If $n_{11} n_{22} - n_{12} n_{21} > 0$, then term T is positively relevant to domain D, and if $n_{11} n_{22} - n_{12} n_{21} < 0$, then term T is negatively relevant to domain D.

IV. TRAINING MODEL

As the title suggests, this is a weighted sum method, and in the real extraction task all that has to be done is multiplication, addition and sorting. So far we have talked about how candidate keywords are generated and represented. In order to get the weighted sum of the four-dimensional feature vector, we still need the weight vector. We need weights because the four features have different discriminating abilities: the better a feature can discriminate between keywords and non-keywords, the higher the weight it should be assigned. The weights can be found manually, and that was actually the inspiration for this work. But doing it manually is too time-consuming if we want to determine weight vectors for many domains.

We use some learning ideas from the perceptron. All the articles in the training corpus have manually assigned keywords. The weight vector is initially set to all zeros, namely (0, 0, 0, 0). For every article, on each dimension we first compute the average value over keywords and over non-keywords respectively, calculate the difference between them, and divide the result by the sum over all candidate keywords. Then we use the result to update the weight by adding the result to it. For example, when we consider the feature TFIDF:

$$ \Delta_{\mathrm{TFIDF}} = \frac{E(\text{keywords' TFIDF})}{\sum \mathrm{TFIDF}} - \frac{E(\text{non-keywords' TFIDF})}{\sum \mathrm{TFIDF}} \qquad (5) $$

After similar calculations for the other features we have a vector $\Delta = (\Delta_{\mathrm{TFIDF}}, \Delta_{\mathrm{PoS}}, \Delta_{\mathrm{FirstOccurrence}}, \Delta_{\mathrm{Chi}})$ for the update. So after the nth article, we update the weight vector by:

$$ W_{n+1} = W_n + \Delta \qquad (6) $$

Thus with every article, we examine the difference between keywords and non-keywords, and update the weight vector according to the difference. Finally, we have a weight vector.

V. EXPERIMENTS AND EVALUATION

We scraped 2563 web pages from http://tech.mop.com/. All texts are about information technology (IT). 1563 of them were used for training and 1000 for testing. The whole text and the keywords manually assigned in the meta keywords tag were extracted. We used the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) to perform word segmentation and PoS tagging. This system is developed by the Chinese Academy of Sciences.

The weight vector we get for this IT domain is (66.59, 22.76, 8.250, 4.202). On average, 6.8 keywords are manually assigned per text. In the following experiments we choose the 7 words with the highest scores as the keywords. The extracted 7 words are compared with the manually assigned keywords. The results are presented in Table I, which shows precision (P), recall (R), and F-measure (F). P refers to the proportion of automatically selected keywords which are also manually assigned. R refers to the proportion of manually assigned keywords selected by this method. If we denote:

ASMA = the number of terms both automatically selected and manually assigned
AS = the number of terms automatically selected
MA = the number of terms manually assigned

then P and R are defined as:

$$ P = \frac{ASMA}{AS} \qquad (7) $$

$$ R = \frac{ASMA}{MA} \qquad (8) $$

F-measure combines precision and recall. It is commonly used as a standard information retrieval metric:

$$ F = \frac{2 P R}{P + R} \qquad (9) $$

After inspecting the data, we found that the manually assigned keywords are not all reliable, so we manually refined 200 documents. 100 of them are used for training and the others for testing. Table I compares the performance of our method on the two data sets. The second experiment shows a significant increase. This method relies heavily on the data: as long as the data is reliable, the method can perform quite well.

TABLE I. THE PERFORMANCE

              P      R      F
raw data      0.449  0.554  0.496
refined data  0.644  0.732  0.685

VI. CONCLUSION AND FUTURE WORKS

In this article, we explored a new method for domain-specific keyword extraction. This method focuses on doing sufficient and effective preparatory work to explore the linguistic characteristics of a specific domain, and thus simplifies the real extraction task. Experiments show that it did lead to better performance. However, there is still much to improve. We still cannot be sure that the weight vector we get is the optimum solution. Moreover, the weighted sums, which are used to rank the candidate keywords, are not fully exploited. If used properly, they might well benefit text categorization, document retrieval, and other natural language processing tasks.

ACKNOWLEDGMENT

I'd like to thank Minghui Wu, Xuan Zhao, Hao Xu, Songtao Chi and Chaoxu Zhang from Beijing Supertool Internet Technology Co. Ltd for all the help and suggestions they have provided. I also want to thank the Chun-Tsung Endowment Fund for giving me a chance to take part in real research. I would especially like to take the opportunity to thank Professor Von-Wun Soo, who has been so kind and given me a lot of valuable instruction while I was at National Tsing Hua University.

REFERENCES

[1] Xiaojun Wan, Jianwu Yang and Jianguo Xiao. 2007. Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL'07), pp. 552-559. Prague, June 2007.
[2] Anette Hulth and Beáta B. Megyesi. 2006. A study on automatically extracted keywords in text categorization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association of Computational Linguistics, pp. 537-544. Sydney, July 2006.
[3] Anette Hulth. 2004a. Enhancing linguistically oriented automatic keyword extraction. In Proceedings of the Human Language Technology conference/North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT/NAACL 2004). Boston, May 2004.
[4] Anette Hulth. 2004b. Reducing false positives by expert combination in automatic keyword indexing. In: Nicolov, N., Botcheva, K., Angelova, G., and Mitkov, R., (eds.), Recent Advances in Natural Language Proces
