����� 4 ����� ������ ��. � . ก �� ������ ���� ��
�������� ������ � � !��������� � "�#��ก ��$������%�
CSC662 Data Mining, Data Warehouse and Visualization
©๒๕๕๐ กรุง สินอภิรมยสราญ 2
� ����!&'��((ก���ก� (OLAP and Data cube)
� !���&�����
� (ก���ก����
� ก� �)��$�*��&'��(ก%����
� �%��)�����ก� ���+�'+����
� �,�-��ก �����
��*.�$�
�,�-��ก �&����/�
��������
� ���$���,0�ก !���ก� ���� �!$���� ��'��� �!�)� �"&'��("�ก��1!�����2*���'�$� (���$ *������%�2%�3���� �ก4+������%�+��%�&'��(
[Figure labels: gender, Date, Product, Region]
Region  Product            Date
East    800 MHz Computers  Mar-98
East    800 MHz Computers  Mar-98
North   CD Players         Mar-98
West    13" Televisions    Mar-98
East    13" Televisions    Mar-98
South   13" Televisions    Mar-98
North   13" Televisions    Mar-98
West    800 MHz Computers  Feb-98
South   800 MHz Computers  Feb-98
West    CD Players         Feb-98
East    CD Players         Feb-98
North   CD Players         Feb-98
West    13" Televisions    Feb-98
North   800 MHz Computers  Jan-98
West    CD Players         Jan-98
South   CD Players         Jan-98
South   13" Televisions    Jan-98
North   13" Televisions    Jan-98
Types of servers used with OLAP software
• Servers with a single processing unit
• Servers with several identical processing units (Symmetric multiprocessor - SMP)
• Multiple servers working in parallel (Massively parallel processor - MPP)
Options for using OLAP servers
• ROLAP (Relational On-Line Analytical Processing)
• MOLAP (Multidimensional On-Line Analytical Processing)
• HOLAP (Hybrid On-Line Analytical Processing)
• ROLAP stands for Relational OLAP
• Uses a relational database management system to store and manage the warehouse data, with processing carried out through query processing
• Data access usually goes through an efficient database management system, using SQL (Structured Query Language) to search for and retrieve data
• Can be used with very large data sets
• Database management systems are already in widespread use
Relational servers (ROLAP)
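A minimal sketch of the ROLAP idea above: the data stay in relational tables and an aggregate is answered by an SQL query. Python's built-in sqlite3 module stands in for the relational engine here; the sales table, its columns, and the unit counts are hypothetical and do not come from the slides.

```python
import sqlite3

# In-memory relational store standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, month TEXT, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("East", "Computers", "Mar-98", 5),      # hypothetical unit counts
        ("North", "CD Players", "Mar-98", 3),
        ("West", "Televisions", "Feb-98", 2),
        ("East", "Televisions", "Mar-98", 4),
    ],
)

ALLOWED_DIMENSIONS = {"region", "product", "month"}


def units_by(column):
    """Total units grouped by one dimension column, computed by the SQL engine."""
    if column not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {column}")
    query = f"SELECT {column}, SUM(units) FROM sales GROUP BY {column}"
    return dict(conn.execute(query).fetchall())


units_by_region = units_by("region")
units_by_month = units_by("month")
```

The aggregation itself is done by the database engine, which is why this style can lean on existing relational technology for large tables.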
• MOLAP = Multidimensional OLAP
• Stores data in the form of a multidimensional cube
• Uses sparse-data storage techniques, because many of the data values are zero
• Data are recorded as a multidimensional array and referenced by numeric indices
• Direct referencing by index makes the system fast
• Requires memory in proportion to the size of the cube (for the data that are not zero)
Multidimensional servers (MOLAP)
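A small sketch of the two storage ideas named on this slide, assuming three hypothetical dimensions (region, product, month): a dense multidimensional array addressed by numeric indices, and a sparse store that keeps only the non-zero cells. The names and values are illustrative only.

```python
REGIONS = ["East", "North", "South", "West"]
PRODUCTS = ["Computers", "CD Players", "Televisions"]
MONTHS = ["Jan-98", "Feb-98", "Mar-98"]

# Map each dimension member to its numeric index along that axis.
R = {name: i for i, name in enumerate(REGIONS)}
P = {name: i for i, name in enumerate(PRODUCTS)}
M = {name: i for i, name in enumerate(MONTHS)}

# Dense cube: one slot per (region, product, month) combination.
dense = [[[0 for _ in MONTHS] for _ in PRODUCTS] for _ in REGIONS]
# Sparse cube: only non-zero cells are stored, keyed by their index triple.
sparse = {}


def store(region, product, month, units):
    """Write one cell into both representations."""
    key = (R[region], P[product], M[month])
    dense[key[0]][key[1]][key[2]] = units
    if units != 0:
        sparse[key] = units


def lookup_dense(region, product, month):
    """Direct index arithmetic: no searching is needed."""
    return dense[R[region]][P[product]][M[month]]


def lookup_sparse(region, product, month):
    """Missing keys are treated as zero cells."""
    return sparse.get((R[region], P[product], M[month]), 0)


store("East", "Computers", "Mar-98", 5)
store("West", "Televisions", "Feb-98", 2)

dense_cells = len(REGIONS) * len(PRODUCTS) * len(MONTHS)
stored_sparse_cells = len(sparse)
```

With only two non-zero cells, the dense array still allocates every slot, while the sparse store holds two entries.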
• HOLAP = Hybrid OLAP
• A hybrid of ROLAP (used for the lowest level) and MOLAP (used for the higher levels)
� ��ก� +�'��*.����$�1������")���� %������/��� MDDB
• Highly flexible in use
• Specialized SQL servers
• SQL servers that support the use of star and snowflake schemas
Hybrid and other servers
OLAP – Online Analytical Processing
� �����:
� &'��(�%��%���(1+�ก1��(ก���ก�
� ����2��������������� ,+�ก� ���� �!$���ก�$�*�"�ก SQL
� $%กก� &�����+�'ก� 2�"� �"�ก��1!���� �2*���(������� ก������ก��&0.�
(ก���ก���� =
• age = Adult, product type = TV, date = 1/12/48, count = 10, value = $30000, cost = $5500
[Figure: cube with axes Age, Product type, Date]
(ก���ก���� ๒
[Figure: cube with axes Age, Product type, Date]
Age    Date     Product type  Count  Value   Cost
young  1/12/48  TV            6      $30000  $5500
young  1/12/48  R             10     $15000  $400
young  1/12/48  S             145    $50000  $40000
(ก���ก���� ๓
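• Star schema
[Figure: star schema with a central Facts table linked to the Product, Time, Region, and Channel dimension tables. Labels in the figure: Facts, Week, Product, Year, Region, Time, Channel, Revenue, Expenses, Units, Model, Type, Color, Nation, District, Dealer]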
• Displaying a 3-dimensional cube uses rows, columns, and pages
� ก� ����&'��($ก���� ���������!�1��&��(ก���ก�
(ก���ก���� ๔
[Figure: three-dimensional view. Page: Region = North. Columns: Sales of Red blob, Blue blob, Total. Rows: Year = 1996, 1997, Total]
Dimension         Example
Brand             Mt. Airy
Store             Atlanta
Customer segment  Business
Product group     Desks
Period            January
Variable          Units sold
OLAP operations: Pivot
OLAP operations: Drill down
Comparison of data mining and OLAP
• A data warehouse adopts the multidimensional data model for use with OLAP
• A data cube consists of:
• Dimensions, e.g. item(item_name, brand, type), time(day, week, month, quarter, year)
• Measures, e.g. dollars_sold
• A cuboid that has all n dimensions is called the base cuboid, the cuboid at the highest level, with 0 dimensions, is called the apex cuboid, and the collection of all these cuboids is called a data cube
Tables and cubes
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
Cubes in the form of a lattice (Lattice)
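The lattice above can be enumerated mechanically: every subset of the dimension set is one cuboid, so n dimensions give 2^n cuboids when no concept hierarchies are considered. A minimal sketch, where the function name cuboid_lattice is a hypothetical helper:

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every subset of the dimensions by its size (0-D up to n-D)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])

# Level 0 holds the apex cuboid (the empty grouping, i.e. "all");
# level n holds the base cuboid with every dimension present.
apex = lattice[0]
base = lattice[4]
total_cuboids = sum(len(level) for level in lattice.values())
```

For the four dimensions on the slide this yields 1 + 4 + 6 + 4 + 1 cuboids.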
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product, date, country
Example of a lattice
• Sales as a function of product, month, and region
[Figure: cube with axes Product, Region, Month]
Dimensions: Product, Location, Time
The levels of each hierarchy are as follows:
Product   Location  Time
Industry  Region    Year
Category  Country   Quarter
Product   City      Month, Week
          Office    Day
Multidimensional data
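Example of a data cube
[Figure: sales data cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum). One marked cell holds the total annual sales of TV in U.S.A.; the corner cell holds the grand total]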
�%���1����'� 1�����&�����ก&��
• Visualization tools
• Can use OLAP functions
• Interacts with the user immediately
Browsing a data cube visually
• Roll up: summarizes data by moving from a lower level to a higher level of the hierarchy; similar to reducing the number of dimensions
• Drill down: drills into the details, moving from a summary level to a more detailed level; similar to adding dimensions to the cube
• Slice and dice: selects part of the cube for display, namely by fixing a value for a dimension and by selecting only the parts of interest
OLAP operations
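The operations above can be sketched on a tiny fact list. The rows, the city-to-country hierarchy, and the helper names (aggregate, slice_cube, dice_cube) are hypothetical; they only illustrate the definitions and are not the API of any OLAP product.

```python
from collections import defaultdict

FACTS = [  # hypothetical detailed facts at the city level
    {"product": "TV", "city": "Chicago", "country": "U.S.A", "quarter": "1Qtr", "sales": 10},
    {"product": "TV", "city": "New York", "country": "U.S.A", "quarter": "1Qtr", "sales": 20},
    {"product": "PC", "city": "Toronto", "country": "Canada", "quarter": "1Qtr", "sales": 5},
    {"product": "TV", "city": "Toronto", "country": "Canada", "quarter": "2Qtr", "sales": 7},
]


def aggregate(facts, dims):
    """Sum the sales measure for every combination of the given dimensions."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row["sales"]
    return dict(totals)


def slice_cube(facts, dim, value):
    """Slice: fix a single value on one dimension."""
    return [row for row in facts if row[dim] == value]


def dice_cube(facts, conditions):
    """Dice: keep rows whose values fall in the allowed set on each dimension."""
    return [
        row for row in facts
        if all(row[d] in allowed for d, allowed in conditions.items())
    ]


by_city = aggregate(FACTS, ["product", "city"])        # detailed level
by_country = aggregate(FACTS, ["product", "country"])  # roll up city -> country
by_product = aggregate(FACTS, ["product"])             # roll up by dropping a dimension
first_quarter = slice_cube(FACTS, "quarter", "1Qtr")
tv_in_canada = dice_cube(FACTS, {"product": {"TV"}, "country": {"Canada"}})
```

Drill down is the reverse direction: going from by_country back to by_city requires the detailed rows, which is why the sketch keeps FACTS at the lowest level.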
• Pivot (rotate): changes the viewing orientation of the cube, for example by using two-dimensional tables to display an n-dimensional cube
• Other operations
• Drill across: drilling across to another MDDB
• Drill through: drilling below the lowest level of the cube, down to the data tables in the database
OLAP operations
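A minimal sketch of pivoting, assuming a two-dimensional cross-tabulation as the view: rotating the view swaps which dimension is laid out on the rows and which on the columns, while the cell values stay the same. The facts and the crosstab helper are hypothetical.

```python
FACTS = [  # hypothetical sales facts
    {"product": "TV", "quarter": "1Qtr", "sales": 30},
    {"product": "TV", "quarter": "2Qtr", "sales": 7},
    {"product": "PC", "quarter": "1Qtr", "sales": 5},
]


def crosstab(facts, row_dim, col_dim, measure="sales"):
    """Build a nested dict grid[row][col] holding the summed measure."""
    rows = sorted({f[row_dim] for f in facts})
    cols = sorted({f[col_dim] for f in facts})
    grid = {r: {c: 0 for c in cols} for r in rows}
    for f in facts:
        grid[f[row_dim]][f[col_dim]] += f[measure]
    return grid


product_by_quarter = crosstab(FACTS, "product", "quarter")
# Pivot: the same data viewed with the two axes exchanged.
quarter_by_product = crosstab(FACTS, "quarter", "product")
```

Cells with no matching fact are shown as 0 in this sketch; a real tool might show them as empty instead.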
[Figure: star-net query model. Each radial line is a dimension, listed here with its levels]
Shipping Method: AIR-EXPRESS, TRUCK
Customer Orders: ORDER, CONTRACTS
Customer
Product: PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM
Organization: SALES PERSON, DISTRICT, DIVISION
Promotion
Location: CITY, COUNTRY, REGION
Time: DAILY, QTRLY, ANNUALLY
Each circle is called a footprint
Star-net query model (Star-Net query)
� ��������7��: �('�'�� ���"�กก� �%.������7�� �'��*�ก�%��)�����ก� ���
�2*���0�&'��(������
� &'����*� �('�%.������7���%ก�;��('� �$� ����� !��ก� � �)�+$'ก� 2�"� ��(&'��(�;�//�'��1���� !���3���2�! ��� 8�
� &'������*� �('+�'��"*������7�������1��$ *�!�����"!2�"� ��%�������%ก9 ! �)�+$'/�1�$8�������� ก����� �ก��&0.� ��1,'��('+�'�'���%.������7������;�//�'�%.�$����"+�'�������2 �! ��� �����7������;�//�'����ก
������ก� �)� �" MDDB
� +�'���2����� ��!ก� �%����+":
� �('�'�ก)�$���1�2� ������ �����1���ก%ก9 !�����ก�1��+���ก !�%�
� ���2����� �+�'�ก @��%�ก1���%����&'��(+�ก1���1��;�ก� �'�2� ������ก�1�� (Exception) $ *�/�1
� &'��(+�ก1�������&'��(��ก�1�����2� ������ ����ก)�$��"!,(ก����+�%ก9 !�����ก�1��"�กก1���*�� ��1�+�'2*.��������ก�1��
� �1�����1���.������� ก�� ��1� SelfExp, InExp, PathExp
������ก� �)� �" MDDB
Example of exploration in a data cube
• Information processing
• Used for querying, basic statistical analysis, and reports with tables, charts, and graphs
• Analytical processing
• Multidimensional analysis of the data in the data warehouse
• Uses OLAP operations such as slice-dice, drilling, and pivoting
• Data mining
• Discovers knowledge hidden in the data
• Presents the results visually
Using the data warehouse
• On-Line Analytical Mining (OLAM)
• Uses the data warehouse as its base and works together with OLAP
• Uses the available information-processing facilities, such as ODBC, OLEDB, Web accessing, service facilities, reporting, and OLAP operations
• Data analysis that uses OLAP as its base
• Data mining functions can be selected easily
• Integration of mining functions, algorithms, and data mining tasks
���� OLAM
[Figure: four-layer OLAM architecture. Labels in the figure:]
Layer 4, User Interface: User GUI API; Mining query, Mining result
Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine; Data Cube API
Layer 2, MDDB: MDDB, Meta Data; Database API; Filtering & Integration
Layer 1, Data Repository: Databases, Data Warehouse; Data cleaning, Data integration, Filtering
OLAM architecture
• A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process
• Data warehouse design uses star schema, snowflake schema, and fact constellations
• Dimensions and measures
• OLAP operations: drilling, rolling, slicing, dicing, and pivoting
• OLAP servers: ROLAP, MOLAP, HOLAP
• Using the data warehouse and the MDDB
• Through exploration or OLAM (on-line analytical mining)