
Upload: lily-stanley

Post on 11-Jan-2016


Page 1: Big Data 981919 黃于庭 991604 林右千 991616 李嘉芸 991619 鍾佳琳 991632 游智鈞 991634 陳鈺玟 991635 陸雨新 991637 杜韋霆 991648 何冠儀 991660 魏松毅 991664

Big Data

981919 黃于庭 991604 林右千 991616 李嘉芸 991619 鍾佳琳 991632 游智鈞 991634 陳鈺玟 991635 陸雨新 991637 杜韋霆 991648 何冠儀 991660 魏松毅 991664 梅耀文

3A G3


Question a: Describe its possible definitions

991637 杜韋霆


What is big data?

With advances in science and technology, we automatically create large amounts of data every day. This data comes from many sources, such as:

• sensors used to gather climate information
• posts to social media sites
• digital pictures and videos
• purchase transaction records
• cell phone GPS signals

We call this kind of data “big data”.

Ref: Speed of Business --- IBM http://www-01.ibm.com/software/data/bigdata/


Definitions

• Wiki

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.

Ref: http://en.wikipedia.org/wiki/Big_data

• Gartner

“Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

Ref: http://www.gartner.com/it-glossary/big-data/

• Doug Laney

Data sets where the three Vs—volume, velocity and variety—present specific challenges in managing these data sets.

Ref: http://www.isaca.org/Knowledge-Center/Blog/Lists/Posts/Post.aspx?ID=299

• Webopedia

Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.

Ref: http://www.webopedia.com/TERM/B/big_data.html


Definitions (continued)

• Andrew Brust

We can safely say that Big Data is about the technologies and practice of handling data sets so large that conventional database management systems cannot handle them efficiently, and sometimes cannot handle them at all.

• John Rauser

Any amount of data that's too big to be handled by one computer.

• Techopedia.com

Big data refers to a process that is used when traditional data mining and handling techniques cannot uncover the insights and meaning of the underlying data. Data that is unstructured or time sensitive or simply very large cannot be processed by relational database engines. This type of data requires a different processing approach called big data, which uses massive parallelism on readily-available hardware.

And more! Ref: http://www.opentracker.net/article/25-definitions-big-data


Definitions (conclusion)

There are many different definitions of big data, but most of them make the same points:

1. The data sets are very large.
2. They are hard to handle with commonly used software tools.
3. Processing time matters.
4. The data comes in many types.


Why is it so important?

Big data issues are important because:

1. The data sets that companies gather are more than just words; they also include video and images.
2. Data is generated by different methods than in the past.
3. Manual or on-hand tools are not efficient enough.
4. Companies require fast reaction and accuracy.
5. Data is useless until it has been processed.

Ref: http://www.arthurtoday.com/2012/01/big-data.html


What are its characteristics?

991632 游智鈞

• Many definitions of big data are in use, but the widely cited one traces back to a 2001 paper by Doug Laney of META Group, which characterizes big data as data sets with three Vs: Volume, Velocity and Variety.

• Most people talk about the 3Vs, but some talk about 4Vs, adding a fourth V, “Veracity”.

• IBM proposed the concept of 3I: Instrumented, Interconnected and Intelligent.


[2]

Batch: Batch processing is not continuous processing of data; it is used for very large files. The files to be transmitted are gathered over a period and then sent together as a batch.

Reference:
[2] http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data
[3] http://contest.trendmicro.com/2013/tw/train.htm
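The batch idea on this slide (gather records over a period, then send them together) can be sketched in a few lines of Python. This is only an illustration: `BatchSender`, the batch size of 3 and the `record-N` strings are hypothetical names and values, not part of the original slides.

```python
from typing import Callable, List


class BatchSender:
    """Collects records and hands them off together once a batch is full."""

    def __init__(self, batch_size: int, send: Callable[[List[str]], None]) -> None:
        self.batch_size = batch_size
        self.send = send
        self.pending: List[str] = []

    def add(self, record: str) -> None:
        # Records are only buffered here; nothing is sent one by one.
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Send everything gathered so far as a single batch.
        if self.pending:
            self.send(self.pending)
            self.pending = []


sent_batches: List[List[str]] = []
sender = BatchSender(batch_size=3, send=sent_batches.append)
for i in range(7):
    sender.add(f"record-{i}")
sender.flush()  # send whatever is left at the end of the period
```

Seven records with a batch size of three leave the sender with two full batches and one final partial batch.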


4V (Volume, Velocity, Variety, Veracity)

• Volume: Many factors contribute to the increase in data volume: past transaction records, daily data collected from sensors, data created by social media, etc.

• Velocity: How fast data is created and how fast it must be processed. The sooner a business can process its data, the more benefit it can gain from it.

• Variety: Today's data comes in a variety of formats.

1. Structured data: database data (trial balance, financial report, general information)

2. Semi-structured data: email, blog posts

3. Unstructured data: text, video, photo, audio [4]

Reference: [4] 雲端時代的殺手級應用 - 海量資料分析, by 胡世忠


• Veracity: Because data can come from anywhere, you cannot guarantee that it is correct or that any given piece of it benefits the enterprise. So it is important for enterprises to obtain useful data and analyze it.

[5]

Reference: [5] http://skyfollow.com/big-data-velocity-comparisons-incoming-rate-chart/


3I (Instrumented, Interconnected, Intelligent)

• Instrumented: A huge change in data sources. We place sensors on many things so that people can perceive the physical world more sensitively and more comprehensively. E.g.: smart meters

• Interconnected: A huge change in the way data is transmitted. We use sensors, RFID and other communication technologies so that objects can communicate with each other.

• Intelligent: A huge change in the way data is used. E.g.: the supercomputer Watson


Reference: [6] http://www.ibm.com/smarterplanet/ie/en/overview/ideas/


Conclusion

Big data does not just mean a large volume of data; it also signals that life has entered another level.

It brings more convenient, more intelligent choices to life.


Question b: What are the possible challenges and opportunities of big data?

991604 林右千 991616 李嘉芸 991619 鍾佳琳


Challenges - Understand & Use

• The challenge is how we can understand and use big data when it comes in an unstructured format, such as text or video.

• Unstructured data is a generic label for describing any corporate information that is not in a database. Unstructured data can be textual or non-textual.

– Textual unstructured data is generated in media like email messages, PowerPoint presentations, Word documents, collaboration software and instant messages.

– Non-textual unstructured data is generated in media like JPEG images, MP3 audio files and Flash video files.

References from:
http://spotfire.tibco.com/blog/?p=6793
http://searchbusinessanalytics.techtarget.com/definition/unstructured-data

991604 林右千


Challenges - Understand & Use

Direct quote from: http://www.slideshare.net/Hadoop_Summit/hadoops-opportunity-to-power-nextgeneration-architectures


Challenges - Understand & Use

• For example, as social media applications like Twitter and Facebook go mainstream, the growth of unstructured data is expected to far outpace the growth of structured data.

• In customer-facing businesses, the information contained in unstructured data can be analyzed to improve customer relationship management and relationship marketing.

References from: http://searchbusinessanalytics.techtarget.com/definition/unstructured-data


Opportunities - Government

• Opportunities for government fall into three areas, each with an example:

1. Improve administrative efficiency - ACSSA
2. Combat and prevent crime - Memphis PD
3. Improve traffic problems - Stockholm

References from: 胡世忠 . 雲端時代的殺手級應用:Big Data 海量資料分析 . 臺北市 : 天下雜誌股份有限公司 . 2013: 9789862416730

Direct quote from:http://www.memphispolice.org/

Direct quote from:http://www.alamedasocialservices.org/public/index.cfm


Improve administrative efficiency - ACSSA

• Alameda County is the seventh largest county in California. The Alameda County Social Services Agency (ACSSA) provides social services to as many as 140,000 people living below the poverty line, with 19,000 actively managed cases.

• The antiquated systems it was using could not keep up with the need for information, which meant that the agency’s understanding of what was happening out in the community lagged weeks or even months behind actual events.

• ACSSA teamed with IBM to deploy an information management system that combined analytics with business intelligence to give workers an agency-wide, comprehensive view of individual cases.

References from: IBM The SmarterCities Leadership Series. Smarter Government Services.


Improve administrative efficiency - ACSSA

• Outcome: 1) ACSSA now has an average annual savings of nearly $25M.

2) Real-time understanding of case and program status enable them to find the best assistance programs for each situation.

3) Real time tracking reveals relationships between benefit recipients and programs, helping to eliminate waste, fraud and redundancy.

4) Reports are generated in minutes instead of weeks or months.

5) The system has increased the productivity and win rates of agency lawyers who defend the agency when a claimant appeals their discontinuation of benefits, which saves the agency $900,000 annually.

References from: IBM The SmarterCities Leadership Series. Smarter Government Services.


Combat and prevent crime - Memphis PD

• Memphis PD uses Blue CRUSH (Criminal Reduction Utilizing Statistical History) to reduce the crime rate.

• At the heart of Blue CRUSH is a predictive model that incorporates fresh crime data from sources that range from the MPD’s records management system to video cameras monitoring events on the street.

• Blue CRUSH lays bare underlying crime trends in a way that promotes an effective, fast response, as well as a deeper understanding of the longer-term factors (like abandoned housing) that affect crime trends.

References from: IBM Smarter Planet Leadership Series. Memphis PD: Keeping ahead of criminals by finding the “hot spots”. 2011


Combat and prevent crime - Memphis PD

• It happens at the precinct level. Looking at multilayer maps that show crime hot spots, commanders can see not only current activity levels, but also any shifts in such activities that may have resulted from previous changes in policing deployment and tactics. At each weekly meeting, commanders go over these results with their officers to judge what worked, what didn’t and how to adjust tactics in the coming week.

References from:IBM Smarter Planet Leadership Series. Memphis PD: Keeping ahead of criminals by finding the “hot spots”. 2011
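To make the "hot spot" maps described above concrete, here is a minimal sketch that bins incident coordinates into grid cells and ranks the busiest cells. It is a simple counting stand-in, not Blue CRUSH's actual predictive model, and the coordinates, `cell_size` and function name are invented for illustration.

```python
from collections import Counter
from typing import List, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def hot_spots(incidents: List[Point], cell_size: float, top_n: int) -> List[Tuple[Cell, int]]:
    """Count incidents per square grid cell and return the busiest cells."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in incidents
    )
    return counts.most_common(top_n)


# Hypothetical incident locations on an arbitrary map grid.
incidents = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (3.1, 3.4), (3.6, 3.2), (7.0, 1.0)]
top = hot_spots(incidents, cell_size=1.0, top_n=2)
```

Comparing the ranked cells from one week to the next is one way commanders could see whether activity shifted after a change in deployment.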


Combat and prevent crime - Memphis PD

• Outcome:

1) 30% reduction in serious crime overall, including a 36.8% reduction in crime in one targeted area
2) 15% reduction in violent crime
3) 4x increase in the share of cases solved in the MPD’s Felony Assault Unit (FAU), from 16 percent to nearly 70 percent
4) Overall improvement in the ability to allocate police resources in a budget-constrained fiscal environment

References from: IBM Smarter Planet Leadership Series. Memphis PD: Keeping ahead of criminals by finding the “hot spots”. 2011


Improve traffic problems - Stockholm

• The Swedish National Road Administration(SNRA) and the Stockholm City Council announced a trial Congestion Tax.

• The goal was not only to reduce congestion, but encourage ancillary benefits, such as improving public transport and alleviating environmental damage. The government’s plan is to devote revenue from the tax to completing a ring road around the city.

• With help from IBM, the solution they came up with was an innovative, high-tech traffic charging system that directly charges drivers who use city center roads during peak business hours.

References from: Driving Change in Stockholm. 2008


Improve traffic problems - Stockholm

• The way it works, drivers can install simple transponder tags that communicate with receivers at the control points and trigger automatic payment of road use fees. Once a vehicle passes a roadside control point during designated congestion hours, it is recognized by the transponder that is read by sensors.

• In addition, cars passing through these control points are photographed, and the license plate numbers are used to identify those vehicles without tags and to provide evidence to support the enforcement of non-payers. The information is sent to a computer system that matches the vehicle with its registration data, and a fee is charged to the owner. All of the above steps can be completed within milliseconds.

References from: Driving Change in Stockholm. 2008
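The charging flow described on this slide (read the tag, fall back to the photographed plate, match the vehicle to its registration, charge the owner) can be sketched as follows. The congestion hours, the fee and all identifiers below are hypothetical placeholders, not the real Stockholm tariff or system.

```python
from dataclasses import dataclass
from datetime import time
from typing import Dict, Optional, Tuple

# Hypothetical values for illustration only.
CONGESTION_START = time(7, 0)
CONGESTION_END = time(18, 0)
FEE = 10


@dataclass
class Passage:
    passed_at: time
    plate_from_photo: str          # every passing car is photographed
    tag_id: Optional[str] = None   # present only if a transponder was read


def charge_for(
    passage: Passage,
    tag_registry: Dict[str, str],
    vehicle_registry: Dict[str, str],
) -> Optional[Tuple[str, int]]:
    """Return (owner, fee) for a chargeable passage, or None."""
    if not (CONGESTION_START <= passage.passed_at <= CONGESTION_END):
        return None  # outside the designated congestion hours
    # Prefer the transponder tag; fall back to the license plate photo.
    plate = tag_registry.get(passage.tag_id) if passage.tag_id else None
    if plate is None:
        plate = passage.plate_from_photo
    owner = vehicle_registry.get(plate)
    if owner is None:
        return None  # unknown vehicle; a real system would flag it for follow-up
    return owner, FEE


tag_registry = {"TAG-1": "ABC123"}
vehicle_registry = {"ABC123": "owner-1", "XYZ789": "owner-2"}
```

A tagged car and an untagged car passing at 08:15 are both charged; the same untagged car passing at 22:00 is not.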


Improve traffic problems - Stockholm

• Outcome:

1) Traffic was down nearly 25 percent.
2) Public transport schedules had to be redesigned because of the increase in speed from reduced congestion.
3) 40,000 more travelers used Stockholm Transport on an ordinary weekday than the year before, an increase of six percent.
4) The reduction in traffic has led to a drop in emissions from road traffic by eight to 14 percent in the inner-city.
5) Greenhouse gases such as carbon dioxide have fallen by 40 percent in the inner-city.

References from: Driving Change in Stockholm. 2008


Opportunities - Manufacturing

• The opportunities for manufacturing.

Direct quote from: McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity. June 2011: 78


Opportunities - Manufacturing

Direct quote from: McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity. June 2011: 78


Opportunities - Manufacturing

• Example: Haitai Confectionery & Food Co., Ltd. is a South Korean company whose main business is retail and instant foods, especially confectionery, beverages and ice cream. Haitai uses a business intelligence and analysis platform to analyze historical data, track changes in supply and demand, and forecast demand. This lets it grasp market demand quickly and reduce days in inventory.

References from: 胡世忠 . 雲端時代的殺手級應用: Big Data 海量資料分析 . 臺北市 : 天下雜誌股份有限公司 . 2013: 201-202. 9789862416730
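As a rough illustration of forecasting demand from historical data, the sketch below uses a plain moving average. The slides do not say which method Haitai's platform uses, so this is an assumed stand-in technique, and the weekly sales figures are made up.

```python
from typing import List


def moving_average_forecast(history: List[float], window: int) -> float:
    """Forecast the next period as the mean of the last `window` periods."""
    if window <= 0 or len(history) < window:
        raise ValueError("window must be positive and no longer than the history")
    recent = history[-window:]
    return sum(recent) / window


# Hypothetical weekly unit sales for one product.
weekly_sales = [120, 130, 125, 140, 150, 145]
next_week = moving_average_forecast(weekly_sales, window=3)
```

A short window reacts quickly to changes in demand; a longer one smooths out noise. Either way, stock can be ordered against the forecast rather than held as a large safety buffer.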


Challenge ─ Privacy and Security

• In the big data era, the Internet constantly releases huge amounts of data. Society benefits from the use of big data, but privacy has nowhere to hide.

• As more data is produced, stored and analyzed, information ranging from business sales to personal spending habits and identity is stored in various forms.

SOURCE: R[1], R[2]

991616 李嘉芸


Challenge ─ Privacy and Security

• Large amounts of data conceal substantial economic and political interests, which are exposed particularly through data integration, analysis and mining.

• The technological innovations of the big data era have also created strong demand for personal privacy protection across all sectors of society.

SOURCE: R[1], R[2]


In the big data era, personal privacy can be invaded in the following ways:

• During data storage: users cannot know exactly where their data is stored, and they lose control over how their personal data is collected, stored, used and shared.

• During data transmission: because transmission is more open and diverse, it carries a risk of data leakage or eavesdropping.

• During data destruction: the data may already have been backed up, so it may not be destroyed completely.

SOURCE: R[1], R[2]


How to strengthen the protection of personal privacy:

1. Make personal information protection part of national strategy and planning.

2. Build a complete legal framework for personal privacy protection: create a personal privacy protection law and basic rules, and actively promote related laws and regulations to reduce violations of personal privacy.

3. Strengthen the technical protection of personal privacy: encourage the development of privacy protection technologies that prevent personal data from being processed in unnecessary or undesirable ways and that let users know where their data is stored and how it is processed.

SOURCE: R[1], R[2]


Opportunities - Energy

• High oil and electricity prices keep sustainable energy on the agenda and make big data analysis increasingly important in the energy industry.

SOURCE: R[3]


Opportunities - Energy

• Big data analysis directly affects profit, so many in the industry have already installed intelligent monitoring equipment. It collects large amounts of data immediately for simulation and analysis, which is used to increase productivity and reduce costs.

SOURCE: R[3]


Opportunities - Energy

• The energy industry uses big data analysis in the following ways:

1. Mobile data integration: A power company can analyze consumers' activity patterns and comments on its website, and then develop services that better fit users' needs.

SOURCE: R[3]


Opportunities - Energy

2. Data-linked thermostats: A thermostat can record and transmit the electricity consumed by temperature adjustments in a user's home. Each thermostat generates tens of thousands of records a month. Put to use, this data helps power companies regulate electricity supply and encourages users to change their consumption habits.

SOURCE: R[3]


Opportunities - Energy

3. Study the charging habits of electric vehicle owners: By tracking and analyzing owners' charging habits, the power company can learn in which periods people use the most electricity and encourage users to charge during off-peak hours.

SOURCE: R[3]
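A minimal sketch of the charging-habit analysis might count charging sessions per hour of day, then report the busiest hour and the hours with no sessions at all. The session start hours below are invented sample data, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def peak_charging_hour(start_hours: List[int]) -> int:
    """Return the hour of day (0-23) in which the most charging sessions start."""
    return Counter(start_hours).most_common(1)[0][0]


def quiet_hours(start_hours: List[int]) -> List[int]:
    """Return the hours of day in which no charging session started."""
    seen = set(start_hours)
    return [hour for hour in range(24) if hour not in seen]


# Hypothetical session start hours collected from charging points.
sessions = [18, 19, 19, 20, 19, 23, 2, 19, 18]
peak = peak_charging_hour(sessions)
idle = quiet_hours(sessions)
```

With this sample the peak falls at 19:00, and a utility could steer owners toward the idle hours with off-peak pricing.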


References

• R[1], 大數據時代個人隱私保護刻不容緩 http://big5.ce.cn/gate/big5/www.ce.cn/xwzx/gnsz/gdxw/201212/20/t20121220_23958532.shtml

• R[2], 大數據時代﹕數據開放更注重個人隱私保護 http://big5.gmw.cn/g2b/IT.gmw.cn/2013-04/12/content_7292096.htm

• R[3], 雲端時代的殺手級應用 -Big Data 海量資料分析 , 胡世忠 , 天下雜誌股份有限公司 ,2013/03/08


Challenges – Storage

• "Big data" refers to data sets that are too large to be captured, handled, analyzed or stored in an appropriate timeframe using traditional infrastructures.

Bit 1 or 0

Byte 8 bits

Kilobyte 1,000 bytes

Megabyte 1,000 KB

Gigabyte 1,000 MB

Terabyte 1,000 GB

Petabyte 1,000 TB

Exabyte 1,000 PB

Zettabyte 1,000 EB

Source from: R[1],R[2]

1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes
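The decimal unit ladder above (each step is a factor of 1,000) can be checked with a small helper. The function names are illustrative; the factors come straight from the table.

```python
# Decimal storage units from the table above, smallest to largest.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def to_bytes(value: int, unit: str) -> int:
    """Convert a value in the given unit to bytes (1 KB = 1,000 bytes)."""
    return value * 1000 ** UNITS.index(unit)


def humanize(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    index = 0
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {UNITS[index]}"


one_petabyte = to_bytes(1, "PB")
```

`to_bytes(1, "PB")` gives 10**15 bytes, which is exactly 1,000 times `to_bytes(1, "TB")`.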

991619 鍾佳琳


Challenges – Storage

• Storage is especially challenging because there are many different kinds of data that need to be stored.

• Persistence

– Many big data applications involve regulatory compliance that dictates data be saved for years or decades.

– Medical information is often saved for the life of the patient. Financial information is typically saved for seven years.

– Big data users are also saving data longer because it’s part of an historical record or used for time-based analysis. This requirement for longevity means storage manufacturers need to include on-going integrity checks and other long-term reliability features, as well as address the need for data-in-place upgrades.

Source from: R[2]


Challenges – Storage

• Storage must evolve: big data has outgrown its own infrastructure, and it’s driving the development of storage, networking and computer systems designed to handle its specific needs.

Source From: R[2]


Challenges – Storage

Types of data:

– Structured Data
• Data that resides in fixed fields within a record or file.

– Semi-structured Data
• XML, E-mail, Blog

– Unstructured Data
• pictures, digital audio, video, Word, pdf

Source from: R[7]


Structured Data (Traditional)

[Figure: users of a traditional data warehouse (Data Warehouse Administrator, Business Analyst, Business User). Direct quote from: R[3]]


Structured Data (Traditional)

• Traditionally, data processing for analytic purposes followed a fairly static blueprint. Namely, through the regular course of business enterprises create modest amounts of structured data with stable data models via enterprise applications like CRM, ERP and financial systems.

Source From: R[3]


Structured Data (Traditional)

• Data integration tools are used to extract, transform and load the data from enterprise applications and transactional databases to a staging area where data quality and data normalization occur and the data is modeled into neat rows and tables.

Source From: R[3]


Structured Data (Traditional)

• The modeled, cleansed data is then loaded into an enterprise data warehouse. This routine usually occurs on a scheduled basis – usually daily or weekly, sometimes more frequently.

Source From: R[3]
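The extract, transform and load routine described on these slides can be sketched with Python's built-in `sqlite3` module standing in for the enterprise data warehouse. The `raw_orders` rows, the `orders` table and its columns are hypothetical; a real pipeline would extract from CRM, ERP or financial systems.

```python
import sqlite3
from typing import Dict, List, Tuple

# Extract: hypothetical rows as they might arrive from an enterprise application.
raw_orders: List[Dict[str, str]] = [
    {"id": "1", "customer": " alice ", "amount": "19.90"},
    {"id": "2", "customer": "BOB", "amount": "5.00"},
    {"id": "3", "customer": "", "amount": "7.50"},  # fails the quality check
]


def transform(rows: List[Dict[str, str]]) -> List[Tuple[int, str, float]]:
    """Staging step: apply a data quality check and normalize values into typed rows."""
    clean = []
    for row in rows:
        name = row["customer"].strip().title()
        if not name:
            continue  # drop records without a customer
        clean.append((int(row["id"]), name, float(row["amount"])))
    return clean


def load(rows: List[Tuple[int, str, float]], conn: sqlite3.Connection) -> None:
    """Load the modeled, cleansed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse
load(transform(raw_orders), warehouse)
row_count = warehouse.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

In practice a job like this would run on a schedule, daily or weekly, as the slide describes.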


Structured Data (Traditional)

User: How to use Traditional Data Warehouse

Data Warehouse Administrator: Create and schedule regular reports to run against normalized data stored in the warehouse, which are distributed to the business. They also create dashboards and other limited visualization tools for executives and management.

Business Analyst: Use data analytics tools/engines to run advanced analytics against the warehouse, or more often against sample data migrated to a local data mart due to size limitations.

Business User: Perform basic data visualization and limited analytics against the data warehouse via front-end business intelligence tools from vendors like SAP BusinessObjects and IBM Cognos.

Source From: R[3]


Semi-structured Data & Unstructured Data

• Hadoop is an open source framework for processing, storing and analyzing massive amounts of distributed, unstructured data.

• It was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel.

• Fundamental concept

– Hadoop breaks up Big Data into multiple parts so each part can be processed and analyzed at the same time.

Source from: R[3]
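The "break the data into parts and process them at the same time" idea can be illustrated with a word count in plain Python. This is not the Hadoop API: local threads stand in for cluster nodes, and the function names and sample lines are invented for the sketch.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split(lines: List[str], parts: int) -> List[List[str]]:
    """Break the input into `parts` chunks (round-robin)."""
    return [lines[i::parts] for i in range(parts)]


def map_count(chunk: List[str]) -> Counter:
    """Map step: count words within one chunk, independently of the others."""
    counts: Counter = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def word_count(lines: List[str], parts: int = 3) -> Counter:
    """Process all chunks concurrently, then merge the partial counts (reduce step)."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(map_count, split(lines, parts)))
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["big data is big", "data moves fast", "big data varies"]
counts = word_count(lines)
```

Because each chunk is counted independently, the same structure scales out when the chunks live on different machines, which is the situation Hadoop is designed for.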


Opportunities - Healthcare

Healthcare

Source from: R[4],R[7]

• Healthcare opportunities can be divided into five broad categories:


Opportunities - Healthcare

Source from: R[4],R[7]

Categories How to apply it

Clinical operations – Clinical decision support systems

The current generation of such systems analyzes physician entries and compares them against medical guidelines to alert for potential errors such as adverse drug reactions or events. By deploying these systems, providers can reduce adverse reactions and lower treatment error rates and liability claims, especially those arising from clinical mistakes.

Payment/pricing – Health Economics and Outcomes Research and performance-based pricing plans

Patients would obtain improved health outcomes with a value-based formulary and gain access to innovative drugs at reasonable costs.
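As a toy illustration of the clinical decision support idea above (compare a physician's entry against guidelines and raise an alert), the sketch below checks a new prescription against a table of interacting drug pairs. The drug names and the interaction entry are placeholders with no medical meaning, and real systems are far more elaborate.

```python
from typing import Dict, FrozenSet, List

# Hypothetical guideline table: pairs of drugs that should trigger an alert.
INTERACTIONS: Dict[FrozenSet[str], str] = {
    frozenset({"drug_a", "drug_b"}): "documented interaction (placeholder)",
}


def check_order(current_meds: List[str], new_drug: str) -> List[str]:
    """Return one alert per current medication that interacts with the new drug."""
    alerts = []
    for med in current_meds:
        reason = INTERACTIONS.get(frozenset({med, new_drug}))
        if reason:
            alerts.append(f"{new_drug} + {med}: {reason}")
    return alerts


alerts = check_order(["drug_b", "drug_c"], "drug_a")
```

An empty list means the entry passed the check; anything else would be shown to the physician before the order is confirmed.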


Opportunities - Healthcare

Source from: R[4],R[7]

Categories How to apply it

R&D – Personalized medicine

The objective of this lever is to examine the relationships among genetic variation, predisposition for specific diseases, and specific drug responses and then to account for the genetic variability of individuals in the drug development process.

New business models – Online platforms and communities

Examples of this business model in practice include Web sites such as PatientsLikeMe.com, where individuals can share their experience as patients in the system.

Public health – Be better prepared for emerging diseases and outbreaks

This lever offers numerous benefits, including a smaller number of claims and payouts, thanks to a timely public health response that would result in a lower incidence of infection.


Opportunities - Healthcare

• Example

One type of big data healthcare company focuses on “Increasing Awareness”. A mobile app called Asthmapolis is an example of this type. A mobile sensor device is attached to an asthma inhaler and monitors where and when asthma attacks happen. The device wirelessly synchronizes with an iOS/Android app, allowing users to track their triggers and symptoms.

Source from: R[5],R[6]


References

• R[1], The Wall Street Journal, January 21, 2013
http://online.wsj.com/article/SB10001424127887323468604578245540627666664.html

• R[2], Storage for big data, page2 & page5, April 2, 2012
http://searchstorage.techtarget.com/magazineContent/Storage-for-big-data?pageNo=1

• R[3], Big Data: Hadoop, Business Analytics and Beyond, April 16, 2013
http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond

• R[4], McKinsey Global Institute, May 2011
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

• R[5], How Big Data Is Improving Healthcare, October 2, 2012
http://readwrite.com/2012/10/02/how-big-data-is-improving-healthcare

• R[6], ASTHMAPOLIS
http://asthmapolis.com/

• R[7], 雲端時代的殺手級應用 -Big Data 海量資料分析 , 胡世忠 , 天下雜誌股份有限公司 ,2013/03/08

Question c: Explain how a corporation can deal with the problems associated with big data, and explain possible solutions

Problem: How to analyze and apply Big Data

Pattern recognition, Classification, Anomaly detection

991648 何冠儀

Pattern recognition

• What is a pattern? [1]
– A pattern is a picture, a string of characters, a set of symbols, a sequence of signals, etc.
• What is pattern recognition?
– The act of taking in raw data and taking an action based on the “category” of the pattern. [1]
– Pattern recognition is the science of making "decisions". [2]

Pattern recognition

• Process [2]
– Feature: a characteristic of a sample
– Training samples: the samples used to build the system
– Test samples: the samples used to test the accuracy of the system

Pattern recognition

• Applications [1][2]
– Biometric authentication: fingerprints, voice prints
– Voice recognition: analyzing the content of a speaker's speech
– Medical image analysis: X-rays, nuclear medicine imaging
– Wireless telecommunication analysis: determining how many wireless networks are present in a space
– Satellite image analysis: determining which areas are grassland, river, sand, buildings, etc.
– Handwriting recognition: identifying handwritten text

Classification

• What is classification? [3][4]
– It is used to group items based on certain key characteristics.
– It constructs a model from a training set and the values (class labels) of a classifying attribute, then uses that model to classify new data.
• Purposes [5]
– Analyze the factors that affect how data is classified
– Predict the category (class label) of data

Classification
• The process of classification [5][6]
1. Establish the model:
– Use the existing data to find classification models.
– Examples: decision trees, classification rules
2. Assess the model:
– The existing data is divided into two groups: training samples and test samples.
– First phase: use the training samples to build the model
– Second phase: use the test samples to evaluate the accuracy of the model
3. Use the model:
– Find out the reasons behind the data classification
– Predict the class of new data
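The three phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set is synthetic, the feature names (income, debt) and labels are hypothetical, and a nearest-centroid classifier stands in for the decision trees and classification rules named on the slide because it fits in a short, dependency-free example.

```python
import random

def train_nearest_centroid(samples):
    """Build the model: one mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, features):
    """Use the model: return the label whose centroid is closest."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(model, f) == label for f, label in samples)
    return hits / len(samples)

# Hypothetical, synthetic data: (income, debt) -> credit risk label.
rng = random.Random(0)
data = [([rng.gauss(70, 5), rng.gauss(10, 3)], "low risk") for _ in range(50)]
data += [([rng.gauss(30, 5), rng.gauss(40, 3)], "high risk") for _ in range(50)]
rng.shuffle(data)

train, test = data[:70], data[70:]      # assessment: split the existing data
model = train_nearest_centroid(train)   # first phase: build the model
test_accuracy = accuracy(model, test)   # second phase: evaluate accuracy
new_label = predict(model, [65, 12])    # use the model on new data
```

Keeping the test samples out of the training step is the point of the split: accuracy measured on data the model has already seen would overstate how well it predicts new data.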

Classification
• Algorithms [7]
– Support vector machines: supervised learning models with associated learning algorithms that analyze data and recognize patterns. [8]
– Neural networks: consist of an interconnected group of artificial neurons and process information using a connectionist approach to computation. [9]
– Kernel estimation: a data smoothing technique in which inferences about the population are made from a finite data sample. [10]
– Decision trees: create a model that predicts the value of a target variable based on several input variables. [11]

Classification

• Applications [7]
– Speech recognition: the translation of spoken words into text. [12]
– Biological classification: a method of scientific taxonomy used to group and categorize organisms into groups such as genus or species. [13]
– Credit scoring: a credit score is a numerical expression, based on a statistical analysis of a person's credit files, that represents the creditworthiness of that person. [14]

Anomaly detection

• What is anomaly detection?
– Anomalies: the set of data points that are considerably different from the remainder of the data [15]
– Anomaly detection usually produces a large number of false alarms. [16]
– Anomalies are also referred to as exceptions or deviations. [17]

Anomaly detection

• Common causes of anomalies [19]
– Data from different classes: objects differ because they are of a different type or class
– Natural variation: datasets are modeled by statistical distributions, which admit variation in the data
– Data measurement and collection errors: errors in data collection or during the measurement process

Anomaly detection

• Categories [17]
1. Unsupervised:
– No labels assumed
– Based on the assumption that anomalies are very rare compared to normal data
2. Supervised:
– Labels available for both normal data and anomalies
– Similar to rare class mining
3. Semi-supervised:
– Labels available only for normal data

Anomaly detection

• Techniques [18]
1. Model-based:
– Build a model of the data.
– Anomalies are objects that do not fit the model.
2. Proximity-based:
– Define a proximity measure between objects
– Anomalies are objects that are distant from most of the other objects
3. Density-based:
– Estimate the density of objects
– Anomalies are objects that are in regions of low density
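As an illustration of the proximity-based technique, the sketch below scores each object by its distance to its k-th nearest neighbor and flags objects with unusually large scores. The data points, the choice of k = 2, and the threshold (three times the median score) are assumptions made for this example, not values from the cited sources.

```python
import math

def knn_distance_scores(points, k=2):
    """Proximity-based score: distance from each point to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D measurements; the last point is deliberately far from the rest.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (1.1, 1.2), (0.8, 0.9), (8.0, 8.5)]
scores = knn_distance_scores(points, k=2)

# Flag points whose score exceeds an assumed threshold (3x the median score).
median = sorted(scores)[len(scores) // 2]
anomalies = [p for p, s in zip(points, scores) if s > 3 * median]
```

The threshold controls the trade-off mentioned earlier: a lower threshold catches more anomalies but also raises more false alarms.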

Anomaly detection

• Applications [18]
– Intrusion detection: monitoring systems and networks for unusual behavior
– Fraud detection: looking for buying patterns that differ from typical behavior
– System health monitoring: using unusual symptoms or test results to indicate potential health problems
– Detecting ecosystem disturbances: trying to predict events like hurricanes and floods
– Public health: using medical statistics reports for diagnosis

References

[1] http://www.ie.ksu.edu.tw/ie1/100ie/web/files/download/2011.11.15.%E6%9C%B1%E5%AE%B6%E5%BE%B7.pdf
[2] http://nthur.lib.nthu.edu.tw/dspace/handle/987654321/4878
[3] http://zh.scribd.com/doc/137177757/Statistical-Pattern-Recognition-2nd-Ed
[4] http://www.wisegeek.com/what-is-a-data-mining-classification.htm
[5] http://sls.weco.net/node/10936
[6] http://faculty.stust.edu.tw/~jehuang/DMCourse/ch5-3.html
[7] http://en.wikipedia.org/wiki/Statistical_classification
[8] http://en.wikipedia.org/wiki/Support_vector_machine

[9] http://en.wikipedia.org/wiki/Artificial_neural_networks
[10] http://en.wikipedia.org/wiki/Kernel_density_estimation
[11] http://en.wikipedia.org/wiki/Decision_tree_learning
[12] http://en.wikipedia.org/wiki/Speech_recognition
[13] http://en.wikipedia.org/wiki/Biological_classification
[14] http://en.wikipedia.org/wiki/Credit_scoring
[15] http://www.slideshare.net/guest76d673/chap10-anomaly-detection
[16] 陳培德 (Ph.D. candidate, Department of Electrical Engineering, National Cheng Kung University), "Introduction to Intrusion Detection Systems", National Information and Communication Security Technology Service Program, Ministry of Economic Affairs technology project, ROC fiscal year 90
[17] www.siam.org/meetings/sdm08/TS2.ppt
[18] http://www.cli.di.unipi.it/~tamberi/old/docs/tdm/anomaly-detection.pdf

Association rule learning & Predictive modeling

991634 陳鈺玟

Question c: Explain how a corporation can deal with the problems associated with big data, and explain possible solutions

Association rule learning

• What is association rule learning?
– Association rules describe frequent co-occurrences in sets.
– Association rule learning was first used by major supermarket chains to discover interesting relations between products.
– It is a set of techniques for discovering interesting relationships among variables in data.

Ref[1]: http://dataminingintelligence.com/?p=60
Ref[2]: http://www.ke.tu-darmstadt.de/lehre/archiv/ws0405/mldm/association-rules.pdf
Ref[3]: http://www.firmex.com/blog/7-big-data-techniques-that-create-business-value/

Association rule learning
• Basic measures of association rule learning:
– Suppose a supermarket has 100,000 transactions, of which 2,000 include both butter and bread, and 800 of these 2,000 transactions also include milk. For the rule {butter, bread} → {milk}:
• Support: How many times does this rule cover?
⇒ 800 times in 100,000 transactions
⇒ alternatively, 800/100,000 = 0.8%
• Confidence: How strong is the implication of the rule?
⇒ 800 times in 2,000 transactions
⇒ 800/2,000 = 40%

Ref[4]: http://akashrajak.webs.com/%20New%20Folder/Association%20Rule%20MiningApplications%20in%20Various%20Areas.pdf
Ref[5]: http://www.ke.tu-darmstadt.de/lehre/archiv/ws0405/mldm/association-rules.pdf
Ref[6]: http://en.wikipedia.org/wiki/Association_rule_learning
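The support and confidence figures for the butter, bread, and milk example can be checked with a short calculation. The function and variable names below are chosen for this sketch only; the counts are the ones given on the slide.

```python
def support(rule_count, total_transactions):
    """Fraction of all transactions that contain every item in the rule."""
    return rule_count / total_transactions

def confidence(rule_count, antecedent_count):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    return rule_count / antecedent_count

total = 100_000           # all transactions
butter_and_bread = 2_000  # transactions containing the antecedent {butter, bread}
with_milk = 800           # of those, transactions that also contain milk

rule_support = support(with_milk, total)                    # 800 / 100,000
rule_confidence = confidence(with_milk, butter_and_bread)   # 800 / 2,000
print(f"support = {rule_support:.1%}, confidence = {rule_confidence:.0%}")
# prints "support = 0.8%, confidence = 40%"
```

Support is measured against all transactions, while confidence is measured only against the transactions that already contain the left-hand side of the rule, which is why the same 800 transactions give two very different percentages.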

Association rule learning

• Examples:
– Which products are frequently bought together by customers?
• DataTable = Receipts x Products
• onions and potatoes → hamburger
– Which courses tend to be attended together?
• DataTable = Students x Courses
• Programming → Computer Science, Algorithm

Ref[7]: http://www.ke.tu-darmstadt.de/lehre/archiv/ws0405/mldm/association-rules.pdf

Association rule learning
• Applications:
– Market basket analysis:
• 1. Encourage more purchases: learn whether certain groups of items are consistently purchased together, in order to adjust store layouts
• 2. Improve efficiency: identify which merchandising efforts are ineffective and which products are not selling
• 3. Enhance inventory management: eliminate slow-moving items and increase the supply of fast-moving merchandise
• 4. Extract information about website visitors from logs: a merchant can analyze data on visitor browsing patterns, login counts, past purchase behavior, and responses to promotions, in order to eliminate what isn't working and focus on what does

Ref[8]: http://www.practicalecommerce.com/articles/3945-4-Ways-Big-Data-Can-Help-Ecommerce-Merchants-
Ref[9]: http://www.firmex.com/blog/7-big-data-techniques-that-create-business-value/

Association rule learning
• Applications:
– Protein sequences:
• For health care: proteins are important constituents of the cellular machinery of any organism, and they are sequences made up of 20 types of amino acids. Association rule learning can enhance our understanding of protein composition and has the potential to give clues regarding the global interactions
– Census data:
• For the general public and government: census data provide a huge variety of general statistical information on society. Information related to population and economic censuses can be used for forecasting when planning public services such as education, health, transport, and funds

Ref[10]: http://www.firmex.com/blog/7-big-data-techniques-that-create-business-value/
Ref[11]: http://www.practicalecommerce.com/articles/3945-4-Ways-Big-Data-Can-Help-Ecommerce-Merchants-
Ref[12]: http://akashrajak.webs.com/-%20New%20Folder/Association%20Rule%20Mining-Applications%20in%20Various%20Areas.pdf

Predictive modeling

• What is predictive modeling?
– Short definition: using data to make decisions
– Long definition: using data to take actions and make decisions using models that are statistically valid and empirically derived
– A process by which a model is created or chosen to try to best predict the probability of an outcome given a set amount of input data.

Ref[13]: http://en.wikipedia.org/wiki/Predictive_modelling
Ref[14]: http://cdn.oreillystatic.com/en/assets/1/event/85/Best%20Practices%20for%20Building%20and%20Deploying%20Predictive%20Models%20over%20Big%20Data%20Presentation.pdf

Predictive modeling

• Basic approach to predictive modeling:
– The birth of a predictive model:
• A predictive model is the result of combining data and mathematics.
• To put it formally, data + modeling technique = predictive model

Ref[15]: http://www.ibm.com/developerworks/library/ba-predictive-analytics2/
Ref[16]: http://www.ibm.com/developerworks/library/ba-predictive-analytics2/fig01.gif

Predictive modeling

• Common categories of models:
– Predictive models: estimate how likely an event is
• For example, how likely a credit card transaction is to be fraudulent, how likely a visitor to a web site is to click on an ad, or how likely a company is to go bankrupt.
– Summary models: summarize data
• For example, divide credit card transactions or airline passengers into different groups depending on their characteristics.

Ref[17]: http://opendatagroup.com/predictive-analytics-faq/

Predictive modeling

• Applications:
– Financial:
• 1. Optimize availability, allocation, and yield of assets
• 2. Improve business outcomes, make better decisions, increase competitiveness
– Operational:
• 1. Exceed service level commitments by increasing speed and reducing risk of failure
• 2. Optimize maintenance schedules around conditions

Ref[18]: http://www.forrester.com/pimages/rws/reprints/document/85601/oid/1-KWYFVB
Ref[19]: http://www-01.ibm.com/software/data/bigdata/industry-retail.html

Predictive modeling

• Applications:
– Customer relationship management: analyze and understand the products in demand, and predict customers' buying habits in order to target promotions
– Product or economy-level prediction: predicting store-level demand for inventory management purposes, predicting the unemployment rate for the next year
– Clinical decision support systems: experts use predictive modeling in health care primarily to predict which patients are at risk of developing certain conditions

Ref[20]: http://en.wikipedia.org/wiki/Predictive_modelling
Ref[21]: http://en.wikipedia.org/wiki/Predictive_analytics

Cluster analysis, Neural networks, Sentiment analysis

991635 陸雨新

Question c: Explain how a corporation can deal with the problems associated with big data, and explain possible solutions

Cluster analysis
▲ What is cluster analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms

Ref[1][5]

K-Means Clustering on Big Data

▲ What is k-means?

• Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when there are no more new assignments

Ref[1][5]

The K-Means Clustering Method
• Example (K=2)

[Figure: a sequence of scatter plots, each with both axes running from 0 to 10, showing the following steps]
1. Arbitrarily choose K objects as the initial cluster centers
2. Assign each object to the most similar center
3. Update the cluster means
4. Reassign objects
5. Update the cluster means and reassign again

Ref[1][5]
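The example above can be sketched as a small Python function. This is a minimal illustration, not a production implementation: it follows the figure's variant (start from k arbitrarily chosen objects rather than from an initial partition), it assumes the first k points serve as the arbitrary initial centers, and the sample points are hypothetical.

```python
import math

def mean_point(cluster):
    """Centroid: the coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def k_means(points, k, max_iter=100):
    # Arbitrarily choose k objects as the initial cluster centers
    # (assumption of this sketch: simply the first k points).
    centers = list(points[:k])
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the nearest (most similar) center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:   # stop: no more new assignments
            break
        assignment = new_assignment
        # Update the cluster means.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep the old center if a cluster empties
                centers[c] = mean_point(members)
    return centers, assignment

# Hypothetical 2-D points on a 0-10 grid, like the axes in the figure.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
centers, labels = k_means(points, k=2)
```

On this data the first assignment puts (2, 1) in the wrong group because it is itself an initial center; the update and reassign steps then correct it, which is the behavior the figure illustrates.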

What are neural networks?

• Connectionism refers to a computer modeling approach to computation that is loosely based upon the architecture of the brain.

• Many different models, but all include:
– Multiple, individual “nodes” or “units” that operate at the same time (in parallel)
– A network that connects the nodes together
– Learning that can occur through gradual changes in connection strength

Ref[2][3][5]

Feed-forward nets

• Information flow is unidirectional
– Data is presented to the input layer
– Passed on to the hidden layer
– Passed on to the output layer
• Information is distributed
• Information processing is parallel
• Internal representation (interpretation) of data

Ref[2][3][5]

Feed-forward nets

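[Figure: a fully connected feed-forward network with three layers. Input layer: Node 1, Node 2, Node 3 (input values 1.0, 0.4, and 0.7, respectively). Hidden layer: Node i, Node j. Output layer: Node k. Each connection carries one of the weights listed below.]

W1j    W1i    W2j    W2i     W3j     W3i    Wjk    Wik
0.20   0.10   0.30   –0.10   –0.10   0.20   0.10   0.50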

Ref[2][3][5]

Neural Network Input Format

\[
\text{newValue} = \frac{\text{originalValue} - \text{minimumValue}}{\text{maximumValue} - \text{minimumValue}}
\]

where

newValue: the computed value, falling in the interval range [0,1]

originalValue: the value to be converted

minimumValue: the smallest possible value for the attribute

maximumValue: the largest possible attribute value

Ref[2][3][5]
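The conversion of an attribute value into the [0,1] range can be sketched as follows. The function name and the age attribute are hypothetical and used only for illustration; the guard against equal minimum and maximum values is an addition of this sketch, since the formula is undefined in that case.

```python
def min_max_normalize(original_value, minimum_value, maximum_value):
    """Map original_value from [minimum_value, maximum_value] onto [0, 1]."""
    if maximum_value == minimum_value:
        raise ValueError("maximum_value must differ from minimum_value")
    return (original_value - minimum_value) / (maximum_value - minimum_value)

# Hypothetical attribute: ages known to range from 20 to 70.
ages = [20, 45, 70]
normalized = [min_max_normalize(a, 20, 70) for a in ages]   # [0.0, 0.5, 1.0]
```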

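The Sigmoid Function (Output)

\[
f(x) = \frac{1}{1 + e^{-x}}
\]

where e is the base of natural logarithms, approximated by 2.718282.

\[
\text{input to node } j = (\text{node}_1)(W_{1j}) + (\text{node}_2)(W_{2j}) + (\text{node}_3)(W_{3j}) = (1)(0.2) + (0.4)(0.3) + (0.7)(-0.1) = 0.25
\]

\[
\text{output of node } j = f(0.25)
\]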

Ref[2][3][5]
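The forward pass through the example network can be sketched in Python using the input values and weights from the feed-forward slides. The dictionary layout is chosen for this sketch only, and applying the sigmoid at the output node k as well as at the hidden nodes is an assumption; the slides only show the computation for node j.

```python
import math

def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))

# Inputs and weights as given on the feed-forward slides.
inputs = {"1": 1.0, "2": 0.4, "3": 0.7}
w_hidden = {
    "j": {"1": 0.20, "2": 0.30, "3": -0.10},   # W1j, W2j, W3j
    "i": {"1": 0.10, "2": -0.10, "3": 0.20},   # W1i, W2i, W3i
}
w_output = {"j": 0.10, "i": 0.50}               # Wjk, Wik

# Hidden layer: weighted sum of the inputs, then the sigmoid.
hidden_in = {h: sum(inputs[n] * w for n, w in ws.items()) for h, ws in w_hidden.items()}
hidden_out = {h: sigmoid(x) for h, x in hidden_in.items()}

# Output node k: weighted sum of the hidden outputs, then the sigmoid.
# (Applying the sigmoid at node k is an assumption of this sketch.)
k_in = sum(hidden_out[h] * w_output[h] for h in w_output)
k_out = sigmoid(k_in)

print(round(hidden_in["j"], 2))   # 0.25, matching the slide
```

Because the sigmoid maps any real input into the interval (0, 1), the network's outputs stay in the same range as the normalized inputs.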

Sentiment Analysis

• Sentiment
– A thought, view, or attitude, especially one based mainly on emotion instead of reason
• Sentiment analysis
– Also known as opinion mining
– The use of natural language processing (NLP) and computational techniques to automate the extraction or classification of sentiment from typically unstructured text

Ref[4]

Motivation
• Consumer information
– Product reviews
– (Is this customer email satisfied or dissatisfied?)
• Marketing
– Consumer attitudes
– Trends
– (Based on a sample of tweets, how are people responding to this ad campaign/product release/news item?)
• Politics
– Politicians want to know voters’ views
– Voters want to know politicians’ stances and who else supports them
– (How have bloggers' attitudes about the president changed since the election?)
• Social
– Find like-minded individuals or communities

Ref[4][5]

Challenges

• People express opinions in complex ways
• In opinion texts, lexical content alone can be misleading
• Intra-textual and sub-sentential reversals, negation, and topic change are common
• Rhetorical devices/modes such as sarcasm, irony, implication, etc.

Ref[5]
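A small sketch shows why lexical content alone can be misleading. The word lists below are tiny, hypothetical stand-ins for a real sentiment lexicon, and the negation rule (flip the polarity of the word directly after a negation) is a deliberately simple assumption; it does nothing for sarcasm, irony, or topic change.

```python
# Tiny, hypothetical sentiment lexicon; real systems use far larger resources.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "terrible": -1, "hate": -1}
NEGATIONS = {"not", "never", "no"}

def lexical_score(text):
    """Naive score: sum word polarities, ignoring context."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

def negation_aware_score(text):
    """Flip the polarity of a word that directly follows a negation."""
    score, negate = 0, False
    for word in text.lower().split():
        if word in NEGATIONS:
            negate = True
            continue
        polarity = LEXICON.get(word, 0)
        score += -polarity if negate else polarity
        negate = False
    return score

review = "the battery is not good"
naive = lexical_score(review)            # counts "good" as positive
aware = negation_aware_score(review)     # treats "not good" as negative
```

The two functions disagree on the same sentence, which is the sub-sentential reversal problem listed above in its simplest form.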

References
[1] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CEMQFjAC&url=http%3A%2F%2Fwww.gersteinlab.org%2Fcourses%2F545%2F07-spr%2Fslides%2FDM_clustering.ppt&ei=GOi2Ue7MDImakAX62YGQDg&usg=AFQjCNFHk7vRJAci6AD_PrgNBFytWJCnSA&sig2=wRKWMuMHDeaSLhox04LI4g
[2] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CFMQFjAD&url=http%3A%2F%2Fweb.cecs.pdx.edu%2F~mperkows%2FCAPSTONES%2F2005%2FL005.Neural_Networks.ppt&ei=OOq2UYm-BoWulQXOjYGoCw&usg=AFQjCNGlQtvoUuAoYEuWHcnGFnlzG55yYA&sig2=wv7yax6HyWtxPJdCogjUAg
[3] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CEkQFjAC&url=http%3A%2F%2Fwww.math.uaa.alaska.edu%2F~afkjm%2Fcs405%2Fhandouts%2FNN.ppt&ei=OOq2UYm-BoWulQXOjYGoCw&usg=AFQjCNEoRa7VHnmdee2HkxsqhK_lCOIjrg&sig2=iNux5x-l1vzL2Du79jm3ow
[4] https://www.google.com.tw/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&ved=0CFgQFjAE&url=http%3A%2F%2Fwww.public.asu.edu%2F~huanliu%2Fdmml_presentation%2F2008%2FSentiment%2BAnalysis.ppt&ei=cuu2UfPlJcXXkgXVoYHIAg&usg=AFQjCNGRQmOjDghJPcXGTrOCqLvjgRyDNw&sig2=T578hj5uiIb5ubsHovZ9rw
[5] http://www.lct-master.org/files/MullenSentimentCourseSlides.pdf
[6] Richard J. Roiger, Michael W. Geatz, “Data Mining: A Tutorial-Based Primer” (2003)

Question d:

Assume that you are a team of IT staff and your team is assigned to provide a cost and benefit evaluation for the big data solutions.

Evaluation: Association rule learning (991660)

• Retail
– Better understanding of the correlations between products or information
– Effectively increase income
– Easier inventory management
– Better understanding of customer spending patterns (market basket analysis)

References: http://en.wikipedia.org/wiki/Association_rule_learning
雲端時代的殺手級應用: 海量資料分析 (胡世忠)

Evaluation: Classification

• Customer classification
– Find more potential customers to increase income
• Banking
– Define the risk of loan customers at each level
– Long-term credit ratings (Standard & Poor's)

References: 雲端時代的殺手級應用: 海量資料分析 (胡世忠)
http://zh.wikipedia.org/wiki/%E6%95%B0%E6%8D%AE%E6%8C%96%E6%8E%98

Classification: Standard & Poor's

• Long-term credit ratings
– The company rates borrowers on a scale from AAA to D. Intermediate ratings are offered at each level between AA and CCC (e.g., BBB+, BBB and BBB-). For some borrowers, the company may also offer guidance (termed a "credit watch") as to whether the rating is likely to be upgraded (positive), downgraded (negative) or uncertain (neutral).

References: http://en.wikipedia.org/wiki/Standard_%26_Poor's
http://countryeconomy.com/ratings/taiwan

Evaluation: Cluster analysis

• Find common traits in consumer groups
– Increase marketing effectiveness
– Better understanding of customer behavior
– More detailed classification of customer characteristics

References : 雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 ) http://zh.wikipedia.org/wiki/%E8%81%9A%E7%B1%BB%E5%88%86%E6%9E%90

Evaluation: Neural networks (991664)

• Enhance information processing efficiency
• Assist in establishing cost-effective IT modules
– Help to create predictive modules
• Stock market prediction module
• Sales volume forecast module
• Weather forecasting module

References : 雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 ) http://zh.wikipedia.org/wiki/%E4%BA%BA%E5%B7%A5%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C

Neural networks: CompStat
• In 1994, Police Commissioner William Bratton introduced a data-driven management model in the New York City Police Department called CompStat.
• CompStat has diffused quickly across the United States and has become a widely embraced management model focused on crime reduction.
• Reduced the crime rate by 27%.

References: http://www.compstat.umd.edu/what_is_cs.php

雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 )

Evaluation: Sentiment analysis

• Better understanding of customers' feelings toward the company
– Increase marketing effectiveness
– Strengthen corporate image
– Better understanding of customer behavior

References : 雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 ) http://en.wikipedia.org/wiki/Sentiment_analysis

YouTube with sentiment analysis

• Collect users' past viewing records to analyze their preferences.
• Recommend videos related to a user's preferences to increase the time the user spends on YouTube.

References : 雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 )

WiseWindow: Mass Opinion Business Intelligence

• WiseWindow, Inc. provides mass opinion business intelligence solutions. The company offers Mass Opinion Business Intelligence, a solution that translates mass opinions expressed on the Web into actionable data for business.

• It interprets blogs, news reports, online forums, and social networking sites to track the market's instant reactions and opinion trends for a particular product, service, person, or news topic (sentiment analysis).

References: 雲端時代的殺手級應用: 海量資料分析 (胡世忠)
http://www.inside.com.tw/2011/03/03/emotion-robot
http://en.wikipedia.org/wiki/WiseWindow

Pattern recognition (981919)

• depends on a number of different factors

• In many applications, misclassification costs, such as monetary costs, time, and other more subjective costs, are hard to quantify.

References: http://zh.scribd.com/doc/137177757/Statistical-Pattern-Recognition-2nd-Ed
雲端時代的殺手級應用: 海量資料分析 (胡世忠)

Pattern recognition

• Increasing information types
– Transform more types of information for analysis
• Voice recognition [2]
• Image recognition
• Handwriting recognition
• Face recognition
• Medical diagnosis problems [1]

References
[1] http://zh.scribd.com/doc/137177757/Statistical-Pattern-Recognition-2nd-Ed
[2] http://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/ccd=HA4PGp/search#result
雲端時代的殺手級應用: 海量資料分析 (胡世忠)

Pattern recognition

• The cost of choosing pattern recognition
– It may be very difficult to assign costs
– The costs may be the subjective opinion of an expert

References
[2] http://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/ccd=HA4PGp/search#result
雲端時代的殺手級應用: 海量資料分析 (胡世忠)

Pattern recognition

• The benefit of choosing pattern recognition
– Favors simpler models
– The Bayesian approach facilitates a seamless intermixing

References
[3] http://en.wikipedia.org/wiki/Pattern_recognition

雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 )

Optical character recognition (OCR)

• OCR is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. It is widely used as a form of data entry from some sort of original paper data source, whether documents, sales receipts, mail, or any number of printed records.

References: http://en.wikipedia.org/wiki/Optical_character_recognition
雲端時代的殺手級應用: 海量資料分析 (胡世忠)

Orc: SinoPac Securities (永豐金控)

• SinoPac Securities selected Orc's trading and connectivity technology solutions to strengthen its Asian market trading capabilities.

• SinoPac uses Orc Trading to enhance its electronic trading tools for trading, pricing, and risk management.

References : http://www.orc-group.com/Global/Additional%20languages/Chinese%20Traditional/SinoPac_Orc_191109_Final_Chinese.pdf

雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 )

Anomaly detection
• Reduce product defect rate
– Reduce cost
• Data flow anomaly detection
– Strengthen information security
– Used to detect whether the system has been hacked

References
[5] http://blog.udn.com/chungchia/3460421#ixzz2VpmDuTBS
雲端時代的殺手級應用: 海量資料分析 (胡世忠)

Anomaly detection

• The cost of choosing anomaly detection: any behavior that falls outside the scope of normal behavior is treated as abnormal, which often leads to false positives that refuse legitimate network connections.
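The false-positive cost described above can be illustrated with a minimal sketch. It assumes a simple z-score rule against a baseline of normal traffic; the function name, the threshold of 3.0, and the connection counts are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

def flag_anomalies(baseline, observations, threshold=3.0):
    """Return observations whose z-score against the baseline exceeds the threshold."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [x for x in observations if abs(x - mu) / sigma > threshold]

# Hypothetical connections-per-minute counts seen during normal operation.
baseline = [100, 104, 98, 101, 97, 103, 99, 102]

# New traffic: assume 180 is a legitimate burst and 400 is an attack.
observations = [101, 180, 99, 400]

flagged = flag_anomalies(baseline, observations)
print(flagged)
```

Both 180 and 400 are flagged, because the rule only knows what "normal" looked like in the baseline. The legitimate burst is refused along with the attack, which is the misjudgment the slide refers to.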

References: [4] http://avp.toko.edu.tw/docs/class/3/%E5%85%A5%E4%BE%B5%E5%81%B5%E6%B8%AC%E8%88%87%E9%A0%90%E9%98%B2%E7%B3%BB%E7%B5%B1%E7%B0%A1%E4%BB%8B%E8%88%87%E6%87%89%E7%94%A8.pdf

雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 )


Anomaly detection

• Benefits of choosing anomaly detection
- In the context of abuse and network intrusion detection, the interesting objects are often unexpected bursts of activity rather than rare objects
- This pattern does not adhere to the common statistical definition of an outlier as a rare object
- A cluster analysis algorithm may be able to detect the micro clusters formed by these patterns
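As a rough illustration of detecting micro clusters, the sketch below groups event timestamps with a simple gap-based rule (single-linkage clustering in one dimension) and keeps only the dense groups. The function name, the `max_gap` and `min_size` parameters, and the event times are hypothetical; a real intrusion detection system would use richer features.

```python
def find_bursts(timestamps, max_gap=1.0, min_size=4):
    """Split sorted timestamps wherever the gap exceeds max_gap,
    then keep only clusters with at least min_size events."""
    clusters = []
    current = []
    for t in sorted(timestamps):
        if current and t - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(t)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_size]

# Hypothetical login-attempt times in seconds: sparse background events
# plus a tight burst around t = 50.
events = [3.0, 11.0, 19.5, 28.0, 37.0, 50.0, 50.2, 50.4, 50.5, 50.9, 63.0, 74.0]

bursts = find_bursts(events)
print(bursts)
```

No single event in the burst is rare on its own, so a per-point outlier test would likely miss it; grouping the events exposes the burst as one small, dense cluster.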

References: http://en.wikipedia.org/wiki/Anomaly_detection
雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 )


Predictive modeling

• Market price forecast
– Crude oil price, price of gold
– Stock market investing
• Stock market prediction module
• Sales volume forecast module
• The cost of choosing predictive modeling
- History cannot always predict the future
- The issue of unknown unknowns
- Self-defeat of an algorithm

References: http://en.wikipedia.org/wiki/Predictive_modelling
http://www.forrester.com/pimages/rws/reprints/document/85601/oid/1-KWYFVB
雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 )
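A sales volume forecast of the kind listed on this slide can be sketched with a least-squares trend line. This is only a minimal example under the assumption of a linear trend; the function names and the monthly sales figures are hypothetical and not taken from the cited sources.

```python
def fit_line(ys):
    """Least-squares fit of y = a + b*t for t = 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    slope_num = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope_den = sum((t - t_mean) ** 2 for t in ts)
    b = slope_num / slope_den
    a = y_mean - b * t_mean
    return a, b

def forecast(ys, steps_ahead=1):
    """Extrapolate the fitted trend steps_ahead periods past the last observation."""
    a, b = fit_line(ys)
    return a + b * (len(ys) - 1 + steps_ahead)

# Hypothetical monthly sales volumes (thousands of units).
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

next_month = forecast(sales)
print(next_month)
```

The forecast simply extends past behavior, so a sudden market change would not be anticipated. That limitation is the "history cannot always predict the future" cost listed on the slide.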


Public Sentiment with Stock Price

• Derwent Capital Markets is known as an early pioneer in the use of social media sentiment analysis to trade financial derivatives.

• Prediction accuracy rate: 87.6% (in 2011)

References: http://en.wikipedia.org/wiki/Derwent_Capital_Markets
http://www.forbes.com/sites/tomiogeron/2012/02/28/datasift-launches-historical-twitter-search-for-businesses/
雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 )
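The slide does not say how the accuracy rate was computed. One common reading is directional accuracy: the share of days on which the predicted direction of the price move matched the actual direction. The sketch below assumes that definition; the function name and all data values are made up for illustration.

```python
def directional_accuracy(predicted_up, actual_changes):
    """Fraction of days on which the predicted direction matched the actual move."""
    hits = sum(
        1
        for predicted, change in zip(predicted_up, actual_changes)
        if predicted == (change > 0)
    )
    return hits / len(actual_changes)

# Hypothetical sentiment-based predictions (True = price expected to rise).
predicted_up = [True, True, False, True, False, False, True, True]

# Hypothetical actual daily price changes in percent.
actual_changes = [0.8, -0.3, -1.1, 0.4, -0.2, 0.6, 1.5, 0.9]

accuracy = directional_accuracy(predicted_up, actual_changes)
print(accuracy)
```

With this made-up data, six of the eight predictions match the actual direction, giving an accuracy of 0.75.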


Predictive modeling

• Benefits of choosing predictive modeling
- Moore's law increases computing capability and drives down its cost
- An exponentially increasing amount of scientific data is produced each year
- Supports customer retention programmes

References: http://opendatagroup.com/predictive-analytics-faq/
http://en.wikipedia.org/wiki/Predictive_modelling
雲端時代的殺手級應用 : 海量資料分析 ( 胡世忠 )