Intelligent Data Systems Laboratory

Multi-variate Outliers in Data Cubes

2012-03-05 JongHeum Yeon

Contents

Sentiment Analysis and Opinion Mining
• Materials from AAAI-2011 Tutorial

Multi-variate Outliers in Data Cubes
• Motivation
• Technologies
• Issues

SENTIMENT ANALYSIS AND OPINION MINING

Sentiment Analysis and Opinion Mining

Opinion mining or sentiment analysis
• Computational study of opinions, sentiments, subjectivity, evaluations, attitudes, appraisal, affects, views, emotions, etc., expressed in text.
• Opinion mining ~= sentiment analysis

Sources: Global Scale
• Word-of-mouth on the Web
  • Personal experiences and opinions about anything in reviews, forums, blogs, Twitter, micro-blogs, etc.
  • Comments about articles, issues, topics, reviews, etc.
  • Postings at social networking sites, e.g., Facebook.
• Organization internal data
• News and reports

Applications
• Businesses and organizations
• Individuals
• Ads placements
• Opinion retrieval

Problem Statement

(1) Opinion Definition
• Id: Abc123 on 5-1-2008 “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, …”
• Document level, i.e., is this review + or -?
• Sentence level, i.e., is each sentence + or -?
• Entity and feature/aspect level
• Components
  • Opinion targets: entities and their features/aspects
  • Sentiments: positive and negative
  • Opinion holders: persons who hold the opinions
  • Time: when opinions are expressed

(2) Opinion Summarization

OPINION DEFINITION

Two main types of opinions

Regular opinions: Sentiment/opinion expressions on some target entities
• Direct opinions:
  • “The touch screen is really cool.”
• Indirect opinions:
  • “After taking the drug, my pain has gone.”

Comparative opinions: Comparisons of more than one entity.
• e.g., “iPhone is better than Blackberry.”

Opinion (a restricted definition)
• An opinion (or regular opinion) is simply a positive or negative sentiment, view, attitude, emotion, or appraisal about an entity or an aspect of the entity (Hu and Liu 2004; Liu 2006) from an opinion holder (Bethard et al 2004; Kim and Hovy 2004; Wiebe et al 2005).
• Sentiment orientation of an opinion
  • Positive, negative, or neutral (no opinion)
  • Also called opinion orientation, semantic orientation, sentiment polarity.

Entity and Aspect

Definition (entity)
• An entity e is a product, person, event, organization, or topic. e is represented as
  • a hierarchy of components, sub-components, and so on.
  • Each node represents a component and is associated with a set of attributes of the component.

An opinion is a quintuple $(e_j, a_{jk}, so_{ijkl}, h_i, t_l)$: target entity, aspect of the entity, sentiment, opinion holder, and time.

Goal of Opinion Mining

Id: Abc123 on 5-1-2008 “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, …”

Quintuples
• (iPhone, GENERAL, +, Abc123, 5-1-2008)
• (iPhone, touch_screen, +, Abc123, 5-1-2008)

Goal: Given an opinionated document,
• Discover all quintuples $(e_j, a_{jk}, so_{ijkl}, h_i, t_l)$,
• Or, solve some simpler forms of the problem (sentiment classification at the document or sentence level)
• Unstructured Text → Structured Data
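The step from unstructured text to structured data can be made concrete with a small record type. This is only a sketch: the field names and the `sentiment_counts` helper are illustrative assumptions, and the two records are the quintuples listed on the slide.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Opinion(NamedTuple):
    """One opinion quintuple; field names are illustrative, not from the slides."""
    entity: str     # e_j: target entity
    aspect: str     # a_jk: aspect of the entity (GENERAL = the entity as a whole)
    sentiment: str  # so_ijkl: sentiment orientation, "+" or "-"
    holder: str     # h_i: opinion holder
    time: str       # t_l: when the opinion was expressed


# The two quintuples given on the slide for review Abc123.
OPINIONS = [
    Opinion("iPhone", "GENERAL", "+", "Abc123", "5-1-2008"),
    Opinion("iPhone", "touch_screen", "+", "Abc123", "5-1-2008"),
]


def sentiment_counts(opinions):
    """Count sentiments per (entity, aspect) pair."""
    counts = defaultdict(Counter)
    for op in opinions:
        counts[(op.entity, op.aspect)][op.sentiment] += 1
    return counts
```

Once opinions are stored this way, they can be grouped and counted like any other relational data, which is what the later summarization and data cube slides rely on.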

Sentiment, subjectivity, and emotion

Sentiment ≠ Subjective ≠ Emotion
• Sentence subjectivity: An objective sentence presents some factual information, while a subjective sentence expresses some personal feelings, views, emotions, or beliefs.
• Emotion: Emotions are people’s subjective feelings and thoughts.

Most opinionated sentences are subjective, but objective sentences can imply opinions too.

Emotion
• Rational evaluation: Many evaluation/opinion sentences express no emotion
  • e.g., “The voice of this phone is clear”
• Emotional evaluation
  • e.g., “I love this phone”
  • “The voice of this phone is crystal clear”

Sentiment ⊄ Subjectivity, Emotion ⊂ Subjectivity, Sentiment ⊄ Emotion

OPINION SUMMARIZATION

Opinion Summarization

With a lot of opinions, a summary is necessary.
• A multi-document summarization task

For factual texts, summarization is to select the most important facts and present them in a sensible order while avoiding repetition
• 1 fact = any number of the same fact

But for opinion documents, it is different because opinions have a quantitative side & have targets
• 1 opinion ≠ a number of opinions

Aspect-based summary is more suitable
• Quintuples form the basis for opinion summarization

Aspect-based Opinion Summary

Id: Abc123 on 5-1-2008 “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, …”

Feature Based Summary of iPhone

Opinion Observer

Aspect-based Opinion Summary

Opinion Observer

Aspect-based Opinion Summary

Bing

Google Product Search

Aspect-based Opinion Summary

OpinionEQ

Detail opinion sentences

OpinionEQ

% of +ve opinion and # of opinions

Aggregate opinion trend

Live tracking of two movies (Twitter)

OPINION MINING PROBLEM

Opinion Mining Problem

$(e_j, a_{jk}, so_{ijkl}, h_i, t_l)$
• $e_j$ is a target entity: Named Entity Extraction (more)
• $a_{jk}$ is an aspect of $e_j$: Information Extraction
• $so_{ijkl}$ is the sentiment: Sentiment Identification
• $h_i$ is an opinion holder: Information/Data Extraction
• $t_l$ is the time: Information/Data Extraction
• The 5 pieces of information must match
  • Coreference resolution
  • Synonym match (voice = sound quality)
  • …

Opinion Mining Problem

Tweets from Twitter are the easiest
• short and thus usually straight to the point

Reviews are next
• entities are given (almost) and there is little noise

Discussions, comments, and blogs are hard.
• Multiple entities, comparisons, noise, sarcasm, etc.

Determining sentiments seems to be easier.
Extracting entities and aspects is harder.
Combining them is even harder.

Opinion Mining Problem in the Real World

Source the data, e.g., reviews, blogs, etc.
• (1) Crawl all data, store and search them, or
• (2) Crawl only the target data

Extract the right entities & aspects
• Group entity and aspect expressions,
  • Moto = Motorola, photo = picture, etc …

Aspect-based opinion mining (sentiment analysis)
• Discover all quintuples
• (Store the quintuples in a database)

Aspect-based opinion summary

Problems

• Document sentiment classification
• Sentence subjectivity & sentiment classification
• Aspect-based sentiment analysis
• Aspect-based opinion summarization
• Opinion lexicon generation
• Mining comparative opinions
• Some other problems
• Opinion spam detection
• Utility or helpfulness of reviews

APPROACHES

Approaches

Knowledge-based approach
• Uses background knowledge of linguistics to identify the sentiment polarity of a text
• Background knowledge is generally represented as dictionaries capturing the sentiments of lexical items

Learning-based approach
• Based on supervised machine learning techniques
• Formulates sentiment identification as a text classification problem, using the bag-of-words model

Document Sentiment Classification

Classify a whole opinion document (e.g., a review) based on the overall sentiment of the opinion holder

A text classification task
• It is basically a text classification problem

Assumption: The doc is written by a single person and expresses opinion/sentiment on a single entity.

Goal: discover (_, _, so, _, _), where e, a, h, and t are ignored

Reviews usually satisfy the assumption.
• Almost all papers use reviews
• Positive: 4 or 5 stars, negative: 1 or 2 stars

Document Unsupervised Classification

Data: reviews from epinions.com on automobiles, banks, movies, and travel destinations.

Three steps

Step 1
• Part-of-speech (POS) tagging
• Extracting two consecutive words (two-word phrases) from reviews if their tags conform to some given patterns, e.g., (1) JJ, (2) NN.

Step 2: Estimate the sentiment orientation (SO) of the extracted phrases
• Pointwise mutual information
• Semantic orientation (SO)

Step 3: Compute the average SO of all phrases
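The formulas for Steps 2 and 3 did not survive in the transcript. The sketch below assumes the standard PMI definition and the reference words “excellent” and “poor” from Turney's (2002) PMI-IR method, which these three steps appear to summarize. The function names, the smoothing constant, and all hit counts are illustrative assumptions.

```python
import math


def pmi(joint, count_a, count_b, total):
    """Pointwise mutual information: log2( P(a, b) / (P(a) * P(b)) )."""
    return math.log2((joint / total) / ((count_a / total) * (count_b / total)))


def semantic_orientation(near_pos, near_neg, hits_pos, hits_neg, smoothing=0.01):
    """SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor").

    The phrase's own count and the corpus size cancel, leaving a ratio of
    co-occurrence hits. `smoothing` (an assumed value) avoids division by zero.
    """
    return math.log2(
        ((near_pos + smoothing) * hits_neg) / ((near_neg + smoothing) * hits_pos)
    )


def classify_review(phrase_scores):
    """Step 3: average the SO of all extracted phrases and take its sign."""
    if not phrase_scores:
        return "neutral", 0.0
    avg = sum(phrase_scores) / len(phrase_scores)
    return ("positive" if avg > 0 else "negative"), avg
```

A phrase that co-occurs four times as often with the positive reference word as with the negative one (with equally frequent reference words) gets an SO of about 2 bits.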

Document Supervised Learning

Directly apply supervised learning techniques to classify reviews into positive and negative

Three classification techniques were tried
• Naïve Bayes
• Maximum entropy
• Support vector machines

Pre-processing
• Features: negation tag, unigram (single words), bigram, POS tag, position.

Training and test data
• Movie reviews with star ratings
  • 4-5 stars as positive
  • 1-2 stars as negative

Neutral is ignored.

SVM gives the best classification accuracy based on balanced training data
• 83%
• Features: unigrams (bag of individual words)
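As an illustration of the supervised setting, here is a minimal multinomial Naïve Bayes classifier over unigram features, written in plain Python. It is a sketch of one of the three techniques named above, not the experimental setup behind the 83% figure. The class name, whitespace tokenizer, add-one smoothing, and toy training sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-cased whitespace tokens (unigram bag of words)."""
    return text.lower().split()


class NaiveBayes:
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)           # log prior
            for w in tokenize(doc):
                if w in self.vocab:                         # skip unseen words
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented); real experiments used star-rated movie reviews.
TRAIN_DOCS = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "boring plot awful acting",
]
TRAIN_LABELS = ["pos", "pos", "neg", "neg"]
MODEL = NaiveBayes().fit(TRAIN_DOCS, TRAIN_LABELS)
```

Calling `MODEL.predict("great acting loved")` picks the class whose word distribution best explains the tokens. Swapping in bigram or POS features only changes `tokenize`.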

Aspect-based Sentiment Analysis

$(e_j, a_{jk}, so_{ijkl}, h_i, t_l)$

Aspect extraction
• Goal: Given an opinion corpus, extract all aspects
• A frequency-based approach (Hu and Liu, 2004)
  • Nouns (NN) that are frequently talked about are likely to be true aspects (called frequent aspects)
• Infrequent aspect extraction
  • To improve recall, which suffers from the loss of infrequent aspects, opinion words are used to extract them
  • Key idea: opinions have targets, i.e., opinion words are used to modify aspects and entities.
    • “The pictures are absolutely amazing.”
    • “This is an amazing software.”
  • The modifying relation was approximated with the nearest noun to the opinion word.
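The two extraction ideas can be sketched as follows, assuming the sentences have already been POS-tagged by some upstream tagger (the tags below are written by hand). The function names, the `min_support` threshold, and the third example sentence are illustrative assumptions. The sketch does no stemming, so “picture” and “pictures” would count separately.

```python
from collections import Counter


def frequent_aspects(tagged_sentences, min_support=2):
    """Nouns that occur in at least `min_support` sentences."""
    counts = Counter()
    for sent in tagged_sentences:
        nouns = {w.lower() for w, tag in sent if tag.startswith("NN")}
        counts.update(nouns)
    return {noun for noun, c in counts.items() if c >= min_support}


def nearest_noun_aspects(tagged_sentences, opinion_words):
    """For each opinion word, take the nearest noun in the same sentence."""
    found = set()
    for sent in tagged_sentences:
        noun_positions = [i for i, (_, tag) in enumerate(sent) if tag.startswith("NN")]
        if not noun_positions:
            continue
        for i, (word, _) in enumerate(sent):
            if word.lower() in opinion_words:
                j = min(noun_positions, key=lambda p: abs(p - i))
                found.add(sent[j][0].lower())
    return found


# Hand-tagged examples; the first two sentences are the ones on the slide.
SENTENCES = [
    [("The", "DT"), ("pictures", "NNS"), ("are", "VBP"), ("absolutely", "RB"), ("amazing", "JJ")],
    [("This", "DT"), ("is", "VBZ"), ("an", "DT"), ("amazing", "JJ"), ("software", "NN")],
    [("The", "DT"), ("pictures", "NNS"), ("look", "VBP"), ("sharp", "JJ")],
]

FREQUENT = frequent_aspects(SENTENCES)
INFREQUENT = nearest_noun_aspects(SENTENCES, {"amazing"}) - FREQUENT
```

Here “pictures” is found by frequency, while “software” appears only once and is recovered through the opinion word “amazing”.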

Aspect-based Sentiment Analysis

Using part-of relationship and the Web
• Improved (Hu and Liu, 2004) by removing those frequent noun phrases that may not be aspects: better precision (a small drop in recall).
• It identifies part-of relationship
  • Each noun phrase is given a pointwise mutual information score between the phrase and part discriminators associated with the product class, e.g., a scanner class.
  • e.g., “of scanner”, “scanner has”, etc., which are used to find parts of scanners by searching on the Web.

Extract aspects using DP (Qiu et al. 2009; 2011)
• A double propagation (DP) approach proposed
• Based on the definition earlier, an opinion should have a target, entity or aspect.
• Use dependency of opinions & aspects to extract both aspects & opinion words.
  • Knowing one helps find the other.
  • E.g., “The rooms are spacious”
• It extracts both aspects and opinion words.
• A domain independent method.

Aspect-based Sentiment Analysis

DP is a bootstrapping method
• Input: a set of seed opinion words,
• no aspect seeds needed

Based on dependency grammar (Tesniere 1959).
• “This phone has good screen”

Aspect-based Sentiment Analysis

Figure (iKnow): dependency graph with the nodes “Delivery”, “Quite”, “Fast”, “Keeping”, and “Easy”, and edges labeled “modify” and “subject”.

Aspect Sentiment Classification

For each aspect, identify the sentiment or opinion expressed on it.

Almost all approaches make use of opinion words and phrases. But notice:
• Some opinion words have context independent orientations, e.g., “good” and “bad” (almost)
• Some other words have context dependent orientations, e.g., “small” and “sucks” (+ve for vacuum cleaner)

Supervised learning
• Sentence level classification can be used, but …
• Need to consider target and thus to segment a sentence (e.g., Jiang et al. 2011)

Lexicon-based approach (Ding, Liu and Yu, 2008)
• Need parsing to deal with: Simple sentences, compound sentences, comparative sentences, conditional sentences, questions; different verb tenses, etc.
• Negation (not), contrary (but), comparisons, etc.
• A large opinion lexicon, context dependency, etc.
• Easy: “Apple is doing well in this bad economy.”

Aspect Sentiment Classification

A lexicon-based method (Ding, Liu and Yu 2008)
• Input: A set of opinion words and phrases. A pair (a, s), where a is an aspect and s is a sentence that contains a.
• Output: whether the opinion on a in s is +ve, -ve, or neutral.
• Two steps
  • Step 1: split the sentence if needed based on BUT words (but, except that, etc).
  • Step 2: work on the segment sf containing a. Let the set of opinion words in sf be w1, .., wn. Sum up their orientations (1, -1, 0), and assign the orientation to (a, s) accordingly:

    $score(a, s) = \sum_{i=1}^{n} \frac{w_i.o}{d(w_i, a)}$

  • where wi.o is the opinion orientation of wi.
  • d(wi, a) is the distance from a to wi.
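The two steps can be sketched as below, using the distance-weighted sum of Ding, Liu and Yu (2008), score(a, s) = Σ wi.o / d(wi, a), where distance is measured in tokens. The tiny lexicon, the BUT-word pattern, the regex tokenizer, and the restriction to single-word aspects are simplifying assumptions; the published method also handles negation and context-dependent words.

```python
import re

# Assumed BUT words; the slide lists "but, except that, etc".
BUT_PATTERN = re.compile(r"\b(?:but|except that)\b", flags=re.IGNORECASE)

# Toy opinion lexicon (assumption): word -> orientation.
LEXICON = {"great": 1, "cool": 1, "clear": 1, "terrible": -1, "poor": -1}


def split_on_but(sentence):
    """Step 1: split the sentence into segments around BUT words."""
    return [seg.strip() for seg in BUT_PATTERN.split(sentence) if seg.strip()]


def aspect_orientation(aspect, sentence, lexicon=LEXICON):
    """Step 2: distance-weighted sum of orientations in the aspect's segment.

    Returns 1 (+ve), -1 (-ve), or 0 (neutral, or aspect not found).
    """
    for segment in split_on_but(sentence):
        tokens = re.findall(r"[a-z']+", segment.lower())
        if aspect not in tokens:
            continue
        a_pos = tokens.index(aspect)
        score = sum(
            lexicon[tok] / abs(i - a_pos)
            for i, tok in enumerate(tokens)
            if tok in lexicon and i != a_pos
        )
        return (score > 0) - (score < 0)
    return 0
```

For the sentence “The screen is great but the battery is terrible”, the aspect “screen” is scored only against “great” and “battery” only against “terrible”, because the BUT split keeps the two segments apart.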

MULTI-VARIATE OUTLIERS IN DATA CUBES

Previous Work

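Jongheum Yeon, Dongjoo Lee, Junho Shim, and Sang-goo Lee, Modeling of Product Review Data and Sentiment Analytical Processing (in Korean), 한국전자거래학회지 (The Journal of Society for e-Business Studies), 2011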

Jongheum Yeon, Dongjoo Lee, Jaehui Park and Sang-goo Lee, A Framework for Sentiment Analysis on Smartphone Application Stores, AITS, 2012

On-Line Sentiment Analytical Processing

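As the volume of opinion data grows, so does the demand to analyze it from multiple angles and use it for decision support, as is done with OLAP (On-Line Analytical Processing).

However, existing opinion mining techniques produce results in a fixed form, which makes it hard to analyze the data from multiple angles.
• Summarizing reviews as feature-level scores for prospective buyers
• Judging the opinion orientation associated with a specific keyword

OLSAP: On-Line Sentiment Analytical Processing
• Stores opinion data in a data warehouse for decision support
• A processing technique that dynamically analyzes and integrates opinion data online

We propose a way of modeling opinion data for OLSAP.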

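Opinion Data Model

OLSAP models opinion data in the following form:

$(o_i, f_j, e_k, m_l, ve_{jk}, vm_{kl}, u_m, t_n, p_o)$

• $o_i$ is the target on which the opinion is expressed, such as “iPhone”
• $f_j$ is a detailed feature of $o_i$, such as “LCD”
• $e_k$ is the opinion word used for each feature, such as “good”
• $m_l$ is a word expressing the strength of the opinion, such as “quite”
• $ve_{jk}$ and $vm_{kl}$ are real values for the feature and the opinion strength, respectively
  • Negative for a negative opinion, positive for a positive one
• $u_m$ is the user who gave the opinion
• $t_n$ is the time at which the opinion was written
• $p_o$ is the location where the opinion was written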

OLSAP Modeling

Figures: OLSAP database schema; opinion information association tables

Related Work

Integration of Opinion into Customer Analysis Model, Eighth IEEE International Conference on e-Business Engineering, 2011

Motivation

Opinion Mining on top of Data Cubes

OnLine Analysis of Data to provide “Clients” with the “right” reviews

Interaction is the key
• between the Analysis of Review Data and Clients

Let the client decide how to view the result of analyzing the reviews
• 1. No opinion mining technique can be perfect.
• 2. The mined data itself can contain “malicious” outliers.

Data Warehousing
• Data Cubes, Multidimensional Aggregation
• A ‘real-systematic’ platform that gave birth to data mining.

Focus:
• A more system-like approach, aimed at integrated algorithms and data structures and their performance, in order to integrate OLAP with opinion mining.
• In other words, we are not concerned with traditional opinion mining issues such as natural language processing and polarity classification.

Motivation

Avg_GroupBy(User=Anti1, Product=Samsung, …) is the (average) value grouped by (user, product, …), where user and product are fixed to the given literals, here Anti1 and Samsung, respectively.

ALL represents “don’t care”: the dimension is not restricted.

Motivation

To decide whether Anti1’s reviews should be considered or excluded, we are interested in the following values:
• 1. Avg_GroupBy(User=Anti1, Product=Samsung)
• 2. Avg_GroupBy(User=Anti1, Product=~Samsung),
  • where Product=~Samsung means U-{Samsung}.
• 3. Avg_GroupBy(User=Anti1, Product=ALL)
  • = Avg_GroupBy(User=Anti1)
• 4. Avg_GroupBy(User=~Anti1, Product=Samsung)
• 5. Avg_GroupBy(User=ALL, Product=Samsung)
  • = Avg_GroupBy(Product=Samsung)
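The five values can be computed with a small helper that supports literals, the `~` complement, and ALL. This is a sketch under stated assumptions: the ratings table from the slides is not in the transcript, so the rows below are hypothetical and do not reproduce the slide's numbers. `ALL`, `Not`, and `avg_group_by` are illustrative names.

```python
ALL = object()  # "don't care": matches every value of a dimension


class Not:
    """Complement of a literal, e.g. Not("Samsung") stands for ~Samsung."""
    def __init__(self, value):
        self.value = value


def matches(actual, wanted):
    if wanted is ALL:
        return True
    if isinstance(wanted, Not):
        return actual != wanted.value
    return actual == wanted


def avg_group_by(rows, user=ALL, product=ALL):
    """Average score over rows matching the conditions; None plays the role of NULL."""
    scores = [
        score for (u, p, score) in rows
        if matches(u, user) and matches(p, product)
    ]
    return sum(scores) / len(scores) if scores else None


# Hypothetical (user, product, score) ratings, invented for illustration.
ROWS = [
    ("Anti1", "Samsung", 1), ("Anti1", "Samsung", 1),
    ("Anti2", "Samsung", 1), ("Anti2", "Apple", 5),
    ("User3", "Samsung", 4),
    ("User4", "Samsung", 3), ("User4", "Apple", 4),
]

v1 = avg_group_by(ROWS, user="Anti1", product="Samsung")
v2 = avg_group_by(ROWS, user="Anti1", product=Not("Samsung"))
v3 = avg_group_by(ROWS, user="Anti1")
v4 = avg_group_by(ROWS, user=Not("Anti1"), product="Samsung")
v5 = avg_group_by(ROWS, product="Samsung")
```

With these rows, `v2` is `None` because Anti1 rated nothing but Samsung, and `v1` equals `v3` for the same reason.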

Motivation

Look into Behavior of Anti1 & Anti2
• Anti1 provides values only to Samsung, while Anti2 does to others as well.
• 1) Avg_GroupBy(User=Anti1, Product=ALL) = Avg_GroupBy(User=Anti1, Product=Samsung)
  • i.e., Avg_GroupBy(User=Anti1, Product=~Samsung) = NULL
• &&
• 2) Avg_GroupBy(User=Anti1, Product=Samsung) = 1 turns out to be an outlier, compared with Avg_GroupBy(User=~Anti1, Product=Samsung) = 2.85

In this case, should only Anti1 be excluded, should Anti2 be excluded as well, or should we use the average that includes them all, i.e., Avg_GroupBy(User=ALL, Product=Samsung) = 2.18?

User-3 also rated only Samsung, so why is User-3 not an outlier? The example is incomplete, but we assume that User-3's average is not extreme enough to count as an outlier.

Look into Behavior of Anti1 & Anti2
• Anti2 provides values not only to Samsung, but to others as well.
• 1) Avg_GroupBy(User=Anti2, Product=ALL) != Avg_GroupBy(User=Anti2, Product=Samsung)
  • i.e., Avg_GroupBy(User=Anti2, Product=~Samsung) is NOT NULL
• &&
• 2) Avg_GroupBy(User=Anti2, Product=Samsung) and Avg_GroupBy(User=Anti2, Product=~Samsung) differ too much
• &&
• 3) Avg_GroupBy(User=Anti2, Product=Samsung) turns out to be an outlier, compared with Avg_GroupBy(User=ALL, Product=Samsung)

User-2 also rated both Samsung and other products, so why is User-2 not an outlier? Condition 2) above does not hold: User-2's scores are low across the board, i.e., the score distribution is not biased.

Motivation

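Summary
1) The user reviews only a specific product group, and the average of those reviews differs greatly from the other users' average for that group.
2) The user reviews both a specific product group and other groups, the user's averages differ greatly between the groups, and the user's average for the specific group differs greatly from the other users' average for that group.
• If the between-group difference in 2) (underlined on the slide) does not hold, the user is simply a harsh rater.

User3 should be okay!
• Why? User3 reviewed only one group, but the average does not differ much from the other users' average for that group.

User2 should be okay!
• Simply a harsh rater.

User4 should be okay!
• User4 reviewed several groups, and each group's average does not differ much from the other users' average for that group.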
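The two summary conditions can be written down directly. This is a sketch only: the fixed threshold `gap` stands in for “differs greatly”, which the slides leave open, and the ratings are hypothetical, chosen to show the Anti1, Anti2, User2, User3, and User4 behaviors described above.

```python
from statistics import mean


def is_suspicious(ratings, user, group, gap=1.5):
    """Apply summary conditions 1) and 2) to `user` for product `group`.

    `ratings` is a list of (user, product, score) tuples; `gap` is an assumed
    threshold for "differs greatly".
    """
    own_in = [s for u, p, s in ratings if u == user and p == group]
    own_out = [s for u, p, s in ratings if u == user and p != group]
    others_in = [s for u, p, s in ratings if u != user and p == group]
    if not own_in or not others_in:
        return False
    differs_from_others = abs(mean(own_in) - mean(others_in)) > gap
    if not own_out:
        # Condition 1): reviews only this group and disagrees with other users.
        return differs_from_others
    # Condition 2): reviews several groups, treats this group very differently,
    # and also disagrees with other users about it.
    differs_between_groups = abs(mean(own_in) - mean(own_out)) > gap
    return differs_between_groups and differs_from_others


# Hypothetical ratings (invented for illustration).
RATINGS = [
    ("Anti1", "Samsung", 1),
    ("Anti2", "Samsung", 1), ("Anti2", "Apple", 5),
    ("User2", "Samsung", 2), ("User2", "Apple", 2),  # harsh rater
    ("User3", "Samsung", 4),
    ("User4", "Samsung", 4), ("User4", "Apple", 4),
    ("User5", "Samsung", 4), ("User5", "Apple", 4),
    ("User6", "Samsung", 5), ("User6", "Apple", 4),
]

FLAGGED = {u for u, _, _ in RATINGS if is_suspicious(RATINGS, u, "Samsung")}
```

User2 is not flagged because the between-group difference is zero, which matches the harsh-rater case in the summary.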

Research Perspectives -1

Outlier Conditions
• Most likely, we must consider some heuristics to suit the domain; here, opinion (review) data.
• Condition1
• Condition2
• …
• Condition_n, in forms such as the following
  • Multi-variate Outlier Detection
  • Avg_groupby(X1=x1, X2=x2, …, Xn=xn) is an outlier only if, for Xi_c = X – {Xi}, its value exceeds Chisq[Avg_groupby(X1=x1, X2=x2, …, Xi-1=xi-1, Xi+1=xi+1, …, Xn=xn)] * Skew.
  • Sort of …

Can these conditions be entered interactively by the user? (Rule-based approach)

For users who are unlikely to explore the interactive outlier-detection features, can a default rule be applied to give the user some hints about potential outliers?

Research Perspectives -2

Outlier-conscious Aggregation – Aggregation Construction Algorithm (& Data Structures)
• Data cubes are constructed to contain Avg_groupby(X1=x1, X2=x2, …, Xn=xn) for each dimension X1, …, Xn.
• However, after either an interactive (manual) or batch (heuristic, automatic) process of eliminating outliers, the cube also needs to be “effectively or efficiently” constructed to contain Avg_groupby without those outliers.
• Most likely, cubes need to maintain not only the Avg_groupby value but also the count, sum, max, and min values.
• While
  1. Multi-variate Outlier Detection: Avg_groupby(X1=x1, X2=x2, …, Xn=xn) is an outlier only if, for Xi_c = X – {Xi}, its value exceeds Chisq[Avg_groupby(X1=x1, X2=x2, …, Xi-1=xi-1, Xi+1=xi+1, …, Xn=xn)] * Skew.
  2. Check whether a lower-variate group-by also causes the outlier: |Xn-1|. In other words, ant1 can enter outlier values for every individual Samsung product. Then avg_groupby(ant1, samsung) will be an outlier while avg_groupby(ant1, samsung, samsung_prod1) is an outlier as well.

=> So find the lowest-dimension outlier, and remove all the outlier elements it contains.
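One way to see why a cube should keep count and sum rather than only the average: with both stored, an outlier user's contribution can be subtracted from the roll-up cells without rescanning the raw reviews. The two-dimensional `Cube` class below is an illustrative assumption, not a structure from the slides. Max and min are left out because they cannot be updated by subtraction and would need recomputation.

```python
from collections import defaultdict

ALL = "ALL"  # roll-up marker for a dimension (assumes no user or product is named "ALL")


class Cube:
    """Tiny (user, product) cube storing [count, sum] per cell."""

    def __init__(self):
        self.cells = defaultdict(lambda: [0, 0.0])

    def add(self, user, product, score):
        # Update the base cell and every roll-up cell that contains it.
        for key in ((user, product), (user, ALL), (ALL, product), (ALL, ALL)):
            cell = self.cells[key]
            cell[0] += 1
            cell[1] += score

    def avg(self, user=ALL, product=ALL):
        count, total = self.cells.get((user, product), (0, 0.0))
        return total / count if count else None

    def drop_user(self, user):
        """Remove an outlier user using only the stored counts and sums."""
        for key in [k for k in self.cells if k[0] == user]:
            count, total = self.cells.pop(key)
            rollup = self.cells[(ALL, key[1])]  # (ALL, product) or (ALL, ALL)
            rollup[0] -= count
            rollup[1] -= total


# Hypothetical ratings, for illustration only.
CUBE = Cube()
for row in [("Anti1", "Samsung", 1), ("User3", "Samsung", 4),
            ("User4", "Samsung", 3), ("User4", "Apple", 4)]:
    CUBE.add(*row)

before = CUBE.avg(product="Samsung")
CUBE.drop_user("Anti1")
after = CUBE.avg(product="Samsung")
```

Here the Samsung average moves from 8/3 to 3.5 after Anti1 is dropped, at a cost proportional to the number of cells Anti1 touched rather than to the number of reviews.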

Research Perspectives -3

Outlier-conscious Aggregation – Visualization of aggregates, possible outliers, and their effects.
• Instead of showing only the averages
• Not only Average
  • Median or
  • Distribution

Research Perspectives -3

Containing possible outliers

Showing the distribution

Showing Not only Mean but Median (and Mode)
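A short sketch of why the mean alone can mislead when a cell contains possible outliers, using Python's standard `statistics` module. The scores are hypothetical and the `summarize` helper is an illustrative name.

```python
from collections import Counter
from statistics import mean, median, multimode


def summarize(scores):
    """Mean, median, mode(s), and the full score distribution of one cube cell."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "mode": multimode(scores),
        "distribution": dict(sorted(Counter(scores).items())),
    }


# Hypothetical star ratings: mostly 4-5 stars plus two 1-star reviews.
SUMMARY = summarize([1, 1, 4, 4, 4, 5, 5])
```

The two 1-star reviews pull the mean down to about 3.43, while the median and mode both stay at 4. The distribution shows the low ratings as a separate cluster.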

Research Perspectives -3

Or combining them together (example: Autos.yahoo.com)

Research Perspectives -4

Outlier-conscious Aggregation – Aggregation Construction (RP-2) & Visualization (RP-3)
• After either an interactive (manual) or batch (heuristic, automatic) process of eliminating outliers, the cube also needs to be “effectively or efficiently” constructed to contain Avg_groupby without those outliers.

This process should, hopefully, be done interactively, i.e., ONLINE.

References to Start with

Cube Data Structures for outliers
• R* Trees, Efficient Online Aggregates in Dense-Region-Based Data Cube Representation, K. Haddadin and T. Lauer, Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science Vol 5691, p177-188, 2009.
• TP-Trees, Pushing Theoretically-Founded Probabilistic Guarantees in Highly-Efficient OLAP Engines, A. Cuzzocrea and W. Wang, New Trends in Data Warehousing and Data Analysis, Annals of Information Systems Vol 3, p1-30, 2009.

R
• The R Project for Statistical Computing, http://www.r-project.org/
• “Introduction to Data Mining.pdf”, Technical Document; explains outliers and R well. “pdf included”

Etc
• Outlier-based Data Association: Combining OLAP and Data Mining, Technical Report, University of Virginia, Song Lin & Donald E. Brown, 2002. “pdf included”
• Selected Topics in Graphical Analytic Techniques, http://www.statsoft.com/textbook/graphical-analytic-techniques/

Applications

Election (Malicious SNS)

Biased Product Reviews

Business Perspectives
• Quick Testbed Environment