what are the advantages of parallelism? parallel execution...

Page 1: What are the Advantages of Parallelism? Parallel execution ...api.ning.com/files/pYyRsX8*pkl12bMLeOXtWjL89iKNLp4...We have to write a query to access employee table and set the value

What are the Advantages of Parallelism?

Parallel execution improves processing for:

• Large table scans and joins

• Creation of large indexes

• Partitioned index scans

• Bulk inserts, updates, and deletes

• Aggregations and copying

What is Parallelism? Parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do part of the work at the same time.

Do you think it will create the problem of non-standardized attributes, if one source uses 0/1 and second source uses 1/0 to store male/female attribute respectively? Give a reason to support your answer. 2 marks

Yes, it will create the problem of non-standardized attributes, because the same attribute is encoded inconsistently in the two sources: a value that means male in one source means female in the other, so the column cannot be combined until both are mapped to one common encoding.
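As a small illustration of fixing this during transformation, the sketch below maps both encodings to one warehouse-wide code. The source names "A" and "B", the target codes "M"/"F", and the function name are assumptions made up for the example, not taken from the handout.

```python
# Hypothetical encodings: source A stores 0 = male, 1 = female;
# source B stores 1 = male, 0 = female.
SOURCE_MAPS = {
    "A": {0: "M", 1: "F"},
    "B": {1: "M", 0: "F"},
}

def standardize_gender(source, raw_value):
    """Translate a source-specific code into one warehouse-wide code."""
    return SOURCE_MAPS[source][raw_value]

# (source, raw value) pairs as they might arrive from the two systems.
rows = [("A", 0), ("A", 1), ("B", 0), ("B", 1)]
standardized = [standardize_gender(src, val) for src, val in rows]
print(standardized)
```

Without the per-source map, the raw value 0 would be read as male for both sources, which is wrong for source B.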

Two ways in which parallelism can reduce a system’s performance. 2 marks

Parallelism can reduce system performance on over-utilized systems or on systems with small I/O bandwidth.

There are two primary techniques for gathering requirements i.e. interviews or facilitated sessions. Kimball prefers using which one? 2 marks.

Both have advantages and disadvantages. Interviews encourage a lot of individual participation. They are also easier to schedule. Facilitated sessions may reduce the time elapsed to gather requirements, although they require more time commitment from each participant. Kimball prefers using a hybrid approach, with interviews to gather the gory details and then facilitation to bring the group to consensus.

Why is the Analytic Track considered the “fun part”? 2 marks.

Write any three complete warehouse deliverables. 3 marks

Data

• Analytic applications

• Data access tools

• Education tools

Dimensions in context with Web data warehouse. 3 marks

Page key

Page source

Page function

Page template

Item type

Graphic type

Animation type

Sound type

Page file name

Name three DWH development methodologies. 3 marks

Development methodologies

•Waterfall model

•Spiral model

•RAD Model

•Structured Methodology

•Data Driven

•Goal Driven

•User Driven

This query was given: SELECT * FROM R WHERE A = 5, and we have to tell which technique is appropriate from dense, sparse, B-tree and hash indexing. 5 marks

B-tree indexes are the most common index type used in typical OLTP applications and provide excellent levels of functionality and performance. Used in both OLTP and data warehouse applications, they speed access to table data when users execute queries with varying criteria, such as equality conditions and range conditions. B-tree indexes improve the performance of queries that select a small percentage of rows from a table.
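To see an equality query like this one pick up an index, here is a minimal sketch using Python's built-in sqlite3 module, whose ordinary indexes are B-trees. The second column B, the sample rows, and the index name idx_r_a are assumptions added for the demonstration; only the query itself comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B TEXT)")
# 1000 sample rows; A cycles through 0..99, so A = 5 selects 1% of the table.
conn.executemany(
    "INSERT INTO R VALUES (?, ?)",
    [(i % 100, f"row{i}") for i in range(1000)],
)
# In SQLite, CREATE INDEX builds a B-tree on the listed column.
conn.execute("CREATE INDEX idx_r_a ON R (A)")

# Ask the planner how it would run the query from the question.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM R WHERE A = 5").fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
matches = conn.execute("SELECT * FROM R WHERE A = 5").fetchall()

print(plan_text)     # the plan detail should mention idx_r_a
print(len(matches))  # number of rows with A = 5
```

The exact wording of the plan differs between SQLite versions, so the sketch only looks for the index name in it.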

Why should companies entertain students to visit their company's place? 5 marks

Two tables were given: an Employee table and an Exception table.

Employee table

EmpID  EmpName  Age
1      Ali      28
2      Faisal   32
3      Waseem   389
4      Arham    398

Exception table

EmpID  IsAgeValid
1      1
2      1
3      1
4      1

We have to write a query to access the employee table and set the value of IsAgeValid = 0 where age is greater than or equal to 25 and less than or equal to 75. 5 marks
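One possible way to write such a query is sketched below with Python's sqlite3 module. It follows the condition exactly as the question words it (age from 25 to 75 inclusive gets IsAgeValid = 0); if the intent were to flag ages outside that range, the WHERE condition would be negated. The table name EmpException is an assumption chosen to avoid any clash with SQL keywords.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (EmpID INTEGER PRIMARY KEY, EmpName TEXT, Age INTEGER);
CREATE TABLE EmpException (EmpID INTEGER PRIMARY KEY, IsAgeValid INTEGER);
INSERT INTO Employee VALUES (1, 'Ali', 28), (2, 'Faisal', 32),
                            (3, 'Waseem', 389), (4, 'Arham', 398);
INSERT INTO EmpException VALUES (1, 1), (2, 1), (3, 1), (4, 1);
""")

# Set IsAgeValid = 0 for every employee whose age lies in [25, 75],
# exactly as the question states the condition.
conn.execute("""
UPDATE EmpException
SET IsAgeValid = 0
WHERE EmpID IN (SELECT EmpID FROM Employee WHERE Age >= 25 AND Age <= 75)
""")

flags = conn.execute(
    "SELECT EmpID, IsAgeValid FROM EmpException ORDER BY EmpID"
).fetchall()
print(flags)
```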

Pest scouting

Pest scouting is a systematic field sampling process that provides field specific information on pest pressure and crop injury.

Conventional indexes

• Basic Types:

• Sparse

• Dense

• Multi-level (or B-Tree)
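The difference between the dense and sparse types can be sketched in a few lines. This is only an illustration under assumed data: the keys, the block size of 3, and the function names are invented for the example. A dense index holds one entry per key, while a sparse index holds one entry per block and then scans inside the block.

```python
import bisect

# A sorted "data file" split into blocks of 3 records (block size is arbitrary).
records = [(k, f"rec{k}") for k in range(10, 100, 10)]  # keys 10, 20, ..., 90
BLOCK = 3
blocks = [records[i:i + BLOCK] for i in range(0, len(records), BLOCK)]

# Dense index: one entry for every key -> (block number, offset in block).
dense = {key: (b, o)
         for b, blk in enumerate(blocks)
         for o, (key, _) in enumerate(blk)}

# Sparse index: one entry per block, holding the first key of that block.
sparse_keys = [blk[0][0] for blk in blocks]

def lookup_dense(key):
    pos = dense.get(key)
    if pos is None:
        return None
    b, o = pos
    return blocks[b][o][1]

def lookup_sparse(key):
    # Find the last block whose first key is <= key, then scan that block.
    b = bisect.bisect_right(sparse_keys, key) - 1
    if b < 0:
        return None
    for k, value in blocks[b]:
        if k == key:
            return value
    return None

print(len(dense), len(sparse_keys))
print(lookup_dense(50), lookup_sparse(50))
```

The sparse index is smaller but only works because the data file is kept sorted on the key.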

Correct statement (2)

A table was given and we had to write its SQL query (5)

4 types of partitioning

Hash partitioning

Key range partitioning

List partitioning

Round robin
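The four partitioning schemes listed above can be sketched as simple routing functions. The number of partitions, the range boundaries, and the city-to-partition map are all assumptions chosen for the example.

```python
import bisect

N = 3  # number of partitions (arbitrary for this sketch)

def hash_partition(key):
    # Hash partitioning: a hash of the key decides the partition.
    return key % N  # simple hash for integer keys

RANGE_BOUNDS = [100, 200]  # partition 0: < 100, 1: 100..199, 2: >= 200

def range_partition(key):
    # Key range partitioning: the interval containing the key decides.
    return bisect.bisect_right(RANGE_BOUNDS, key)

LIST_MAP = {"Lahore": 0, "Karachi": 1, "Multan": 2}

def list_partition(city):
    # List partitioning: an explicit list of values per partition.
    return LIST_MAP[city]

def round_robin_partition(row_number):
    # Round robin: rows are dealt out in turn, regardless of content.
    return row_number % N

print(hash_partition(7), range_partition(150), list_partition("Multan"))
print([round_robin_partition(i) for i in range(5)])
```

Round robin spreads rows evenly but gives no help in locating a given key, whereas the other three let a query go straight to the relevant partition.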

Five attributes related to the web warehouse (5)

We had to write three steps of Kimball's process (3) (Business process-->Grains-->Facts-->dimension).

1. It is called a _____________ violation, if we have null values for attributes where NOT NULL constraint exists

Load

Transform

Constraint page 161

Extraction

1. UAT stands for

User acceptance testing page 193

1. Implementing a DWH requires ____________ integrated activities.

Tightly page 289

Loosely

Slackly

Lethargically

1. The application development quality assurance activities cannot be completed until the data is _____________

Stabilized page 308

Identified

Finalized

Computerized

1. Normalization is a process of efficiently organizing data in a data base by _________ a relational table into smaller table by projection.

Composing

Decomposing page 41

Joining/merging

Combining

1. “Dirty data” class of anomalies include

1. Lexical errors

2. Integrity constraints violation

3. Business rule contradiction

4. Irregularities

5. Duplication

i and iii and iv

i and ii and v

i and ii

Syntactically dirty data: lexical errors, irregularities

Semantically dirty data: integrity constraint violation, business rule contradiction, duplication

Coverage anomalies: missing attributes, missing records

1. Quantity sold is stored as ____ fact.

Additive

Non-additive

Association

Non-association

1. In Kimball's approach, the product selection phase falls in the

Lifecycle Technology Track page 290

Lifecycle Data Track

Lifecycle Analytic Applications Track

None of the given

1. Giving the least time to ____ can prove a suicidal attempt for a DWH project

OLAP

De-normalization

ETL page 313

None of the given

1. Multan division is the cotton hub

1. Which is not an issue of “Click stream data”.

Identifying the Visitor Origin

Identifying the Session

Identifying the Visitor

Another option was given, which was the one that is not an issue of click stream data.

1. Which statement about HTTP is true?

Is stateless page 364

Non world wide web protocol

Used to maintain session

Message routing protocol

1. SMP stands for Symmetric Multi-Processing

1. K-clustering is equal to sequence of n

K much greater than n

K much smaller than n

K is equal to square of n

None of the given

1. The ith bit is set to 1 if the ith row of the base table has the value for the indexed column. The statement refers to

Inverted

Bitmap page 233

Dense

Sparse index

1. __________ is a systematic sampling process that provides field specific information on pest pressure and crop injury.

Pest scouting page 333

Soil survey

Seed survey

Water survey

1. In the context of a web data warehouse, which is NOT one of the ways to identify a session?

Using asynchronous session tracking protocol

Using Time-contiguous Log Entries

Using Transient Cookies

Using HTTP's secure sockets layer (SSL)

Using session ID Ping-pong

Using Persistent Cookies

Some mcqs from my midterm paper. 2 underlined MCQs are also included in my final paper

1. The telecommunication data warehouse is dominated by the sheer volume of data generated at the call level _________ area.

Subject page 35

Object

Aggregate

Details

1. 4NF has an additional requirement, which is

Data is in 3NF and no null key dependency

Data is in 2NF and no Multi value dependency page 48

Data is in 3NF and no multi value dependency

Data is in 3NF and no foreign key table

1. 3NF removes even more data redundancy than 2NF, but it is at the cost of

Simplicity and performance page 48

Complexity

No of table

Relations

1. In full extraction, data is extracted completely from the source. There is no need to keep track of changes to the _________

Data source page 133

DWH

Data mart

Data destination

1. Which is not a characteristic of a DWH

Ad-hoc access

Complete repository

Historical data

Volatile page 27

1. Experience showed that for a single pass of magnetic tape that scanned 100% of the records, only ________ of the records

5% page 12

30%

50%

80%

1. HOLAP provides a combination of relational database access and “cube” data structures within a single framework. The goal is to get the best of both MOLAP and ROLAP:

scalability and high performance page 78

1. ____________ are created out of the data warehouse to service the needs of different departments such as marketing, sales etc.

MIS

OLAPs

Data mart page 31

None of the given

1. Write two types of unsupervised learning. page no. 270

Answer:

one way clustering

two way clustering

1. Bitmap index: there was a question on run length encoding; an input was given and we had to find the output. Page no. 234

Answer: If we apply Run length Encoding on the input “11001100”, the output will be

12#02#12#02
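The encoding used in this answer (each run written as the bit followed by its run length, with runs separated by "#") can be sketched as a short function. The function name is an assumption; the format is inferred from the worked answers in these notes.

```python
from itertools import groupby

def run_length_encode(bits):
    """Encode each run as <bit><run length>, with runs joined by '#'."""
    return "#".join(f"{bit}{len(list(run))}" for bit, run in groupby(bits))

print(run_length_encode("11001100"))
```

The same function can be used to check the other run length encoding answers in these notes, for example the inputs 1111111110000111 and 00001111000000.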

1. From the B-tree vs. hash indexes topic, this query was given

SELECT * FROM R WHERE A = 5 page no.228

We had to tell which technique would be used for it among dense index, sparse index, B-tree index and bitmap index, and explain it.

1. We had to identify whether this statement is correct or incorrect and give the reason

“Bayesian modeling is an example of unsupervised learning” page no 270

Answer: Incorrect. Bayesian modeling is an example of supervised learning.

Forward Proxy (2)

Answer: Ch#40 Page no: 369

The type of proxy we are referring to in this discussion is called a forward proxy. It is outside of our control because it belongs to a networking company or an ISP.

Drawbacks of waterfall model for DWH (3)

First and foremost, the project is likely to occur over an extended period of time, during which the users may not have had an opportunity to review what will be delivered.

Second, in today's demanding competitive environment there is a need to produce results in a much shorter timeframe.

In which scenario can we use waterfall? (2)

The model is a linear sequence of activities like requirements definition, system design, detailed design, integration and testing, and finally operations and maintenance. The model is used when the system requirements and objectives are known and clearly specified.

How is the gender guide used?

If gender is missing for a very large number of records, it becomes impossible to manually check each and every individual’s name and identify the gender. In such cases we can formulate a mechanism to correct gender. We can either use a standard gender guide or create a new table Gender_guide. Gender_guide contains only two columns, name and gender. Populate the Gender_guide table with a query that selects all distinct first names from the student table, then manually enter the gender for each name.

This table can serve as a guide by telling us what the gender is for a particular name. For example, if we have a hundred students in our database with first name equal to ‘Muhammad’, then in our Gender_guide table we will have just one entry ‘Muhammad’, and we will manually set the gender as ‘Male’ against ‘Muhammad’. Now, to fill the missing genders in the exception table, we just do an inner join on the exception table and the Gender_guide table.
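The gender guide steps can be sketched with Python's sqlite3 module. To keep the example short it works directly on a small Student table with NULL genders rather than a separate exception table; the column names StID and FirstName and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (StID INTEGER PRIMARY KEY, FirstName TEXT, Gender TEXT);
INSERT INTO Student VALUES (1, 'Muhammad', NULL), (2, 'Ayesha', 'F'),
                           (3, 'Muhammad', 'M'), (4, 'Fatima', NULL);

-- Step 1: one guide row per distinct first name.
CREATE TABLE Gender_guide (Name TEXT PRIMARY KEY, Gender TEXT);
INSERT INTO Gender_guide (Name) SELECT DISTINCT FirstName FROM Student;
""")

# Step 2: the manual step, entering a gender against each distinct name.
conn.executemany(
    "UPDATE Gender_guide SET Gender = ? WHERE Name = ?",
    [("M", "Muhammad"), ("F", "Ayesha"), ("F", "Fatima")],
)

# Step 3: inner join the records with missing gender against the guide.
fixes = conn.execute("""
    SELECT s.StID, g.Gender
    FROM Student AS s
    INNER JOIN Gender_guide AS g ON g.Name = s.FirstName
    WHERE s.Gender IS NULL
    ORDER BY s.StID
""").fetchall()
conn.executemany(
    "UPDATE Student SET Gender = ? WHERE StID = ?",
    [(gender, st_id) for st_id, gender in fixes],
)

genders = conn.execute("SELECT StID, Gender FROM Student ORDER BY StID").fetchall()
print(fixes)
print(genders)
```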

Run length encoding was to be applied to these inputs and we had to give the output.

Run length used in bitmap indexing

Output 1 may be

15#02#18# (meaning 1 comes 5 times, 0 comes 2 times and 1 comes 8 times: 111110011111111)

Output 2 may be

11#01#11#

Output 3 may be

112#012#

Steps of Kimball's approach for the data warehouse lifecycle.

Kimball Process. Four step approach. (Business process-->Grains-->Facts-->dimension). He defines a business process as a major operational process in the organization that is supported by some kind of legacy system (or systems). (Read "Business Dimensional Lifecycle", see page 290)

Drawbacks of traditional web search. Ch: 39 page 351

1. Limited to keyword based matching.

2. Cannot distinguish between the contexts in which a link is used.

3. Coupling of files has to be done manually.

Describe two ways of identifying a session on the World Wide Web.

Identifying the Session

Web-centric data warehouse applications require every visitor session (visit) to have its own unique identity.

The basic protocol for the World Wide Web, HTTP, is stateless, so session identity must be established in some other way.

There are several ways to do this:

Using Time-contiguous Log Entries

Using Transient Cookies

Using HTTP's secure sockets layer (SSL)

Using session ID Ping-pong

Using Persistent Cookies

MCQs

Execution will be terminated abnormally.... (Quiz 4 file- 2 MCQs)

Kimball’s approach ......driven (quiz 4 file-5 mcqs)

Pipeline per increase through..... (Quiz 4 file- 1 mcq)

Selectivity of query in olap... (Queries must be executed in a small number of seconds.)

star schema simplify ...

Majority of data ...fail if (Majority of projects fail due to the complexity of the development process.)

er is .......design (constituted to optimize OLTP performance)

Survival of fittest is.....algorithm (Genetic Algorithms: These are based on the principle of survival of the fittest. In these techniques, a model is formed to solve problems having multiple options and many values. Briefly, these techniques are used to select the optimal solution out of a number of possible solutions. However, they are not very robust, as they cannot perform well in the presence of noise.)

Shipyard in Kobe developed....... (In 1972 the Mitsubishi Shipyards in Kobe developed a technique in which customer wants were linked to product specifications via a matrix format. The technique is known today as The House of Quality and is one of many techniques of Quality Function Deployment, which can briefly be defined as “a system for translating customer requirements into appropriate company requirements”. The purpose of the technique is to reduce two types of risk. First, the risk that the product specification does not comply with the wants of the predetermined target group of customers. Secondly, the risk that the final product does not comply with the product specification.)

Q: 1 Briefly explain any two types of precedence constraints that we can use in DTS.

Answer: page 395

Precedence constraints sequentially link tasks in a package. In DTS, you can use three types of precedence constraints, which can be accessed either through DTS Designer or programmatically:

Unconditional: If you want Task 2 to wait until Task 1 completes, regardless of the outcome, link Task 1 to Task 2 with an unconditional precedence constraint.

On Success: If you want Task 2 to wait until Task 1 has successfully completed, link Task 1 to Task 2 with an On Success precedence constraint.

On Failure: If you want Task 2 to begin execution only if Task 1 fails to execute successfully, link Task 1 to Task 2 with an On Failure precedence constraint. If you want to run an alternative branch of the workflow when an error is encountered, use this constraint.

Q: 2 The time complexity of the K-means algorithm is O(tkn). What do t, k, and n represent here?

Page 281

Answer: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.

Normally, k, t << n.
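Where the t, k and n come from can be seen by counting distance computations in a bare-bones K-means loop. The sketch below is an assumption-laden toy: it uses 1-D points, fixed starting centers, and a fixed number of iterations with no convergence check, purely so that the count is easy to verify.

```python
def kmeans_1d(points, centers, iterations):
    """Plain K-means on 1-D points.

    Returns the final centers and the number of distance computations made.
    """
    distance_ops = 0
    for _ in range(iterations):                    # t iterations
        clusters = [[] for _ in centers]
        for p in points:                           # n objects
            dists = []
            for c in centers:                      # k clusters
                dists.append(abs(p - c))
                distance_ops += 1
            clusters[dists.index(min(dists))].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [sum(c) / len(c) if c else old
                   for c, old in zip(clusters, centers)]
    return centers, distance_ops

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]        # n = 6
centers, ops = kmeans_1d(points, [0.0, 5.0], iterations=4)  # k = 2, t = 4
print(centers, ops)
```

Each iteration compares every one of the n objects with every one of the k centers, so the total is t × k × n distance computations.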

Q: 3 what are the problems you will face if low priority is given to cube construction?

Answer: page 313

Low priority for OLAP Cube Construction: Make sure your OLAP cube-building or pre-calculation process is optimized and given the right priority. It is common for the data warehouse to be at the bottom of the nightly batch loads, and after loading the DWH, usually there isn't much time left for the OLAP cube to be refreshed. As a result, it is worthwhile to experiment with the OLAP cube generation paths to ensure optimal performance.

Q: 4 List down any two parallel software Architectures?

Answer: Shared Memory, Shared Disk and Shared Nothing

Q: 5 what is unsupervised learning in Data mining?

Answer: page 27

Unsupervised learning is where you don’t know the number of clusters and obviously have no idea about their attributes either. In other words, you are not guiding the DM process in any way: no guidance and no input. Unsupervised learning is closer to the exploratory spirit of Data Mining, as stressed in the definitions given above. In unsupervised learning situations all variables are treated in the same way; there is no distinction between explanatory and dependent variables. However, in contrast to the name undirected data mining, there is still some target to achieve. This target might be as general as data reduction or more specific like clustering. For unsupervised learning, typically either the target variable is unknown or it has only been recorded for too small a number of cases.

Q: 6 Which scripting languages are used to perform complex transformations in Data packages?

Answer: Microsoft SQL Server provides graphical tools to build DTS packages. These tools provide good support for transformations. Complex transformations are achieved through VB Script or Java Script that is loaded in the DTS package. A package can also be programmed by using the DTS object model instead of the graphical tools, but DTS programming is rather complicated.

Q: 7 "Dense index consists of a number of bit vectors". Justify it.

Answer: Dense Index: Every key in the data file is represented in the index file. Bitmap index record (Value, Bit Vector): the Bit Vector has one bit for every record in the file, and the ith bit of the Bit Vector is set if record i has Value in the given column. Bit vectors are typically compressed and are converted to sets of rids during query evaluation.
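The (Value, Bit Vector) structure described in this answer can be sketched as follows. The column contents and the function name are assumptions; compression of the bit vectors is left out.

```python
def build_bitmap_index(column):
    """Return {value: bit vector}, with one bit per record in the file."""
    index = {}
    for i, value in enumerate(column):
        # Create an all-zero vector the first time a value is seen,
        # then set the ith bit because record i has this value.
        index.setdefault(value, [0] * len(column))[i] = 1
    return index

gender = ["M", "F", "F", "M", "F"]   # indexed column, one entry per record
bitmaps = build_bitmap_index(gender)

# During query evaluation a bit vector is converted to a set of row ids.
rids_f = [i for i, bit in enumerate(bitmaps["F"]) if bit]
print(bitmaps)
print(rids_f)
```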

Q: 8 It is essential to have a subject-matter expert as part of the data modeling team. What will be the implication if such an expert is not present in the organization?

Answer: It is essential to have a subject-matter expert as part of the data modeling team. This person can be an outside consultant or can be someone in-house with extensive industry experience. Without this person, it becomes difficult to get a definitive answer on many of the questions, and the entire project gets dragged out, as the end users may not always be available.

Suppose there is a large enterprise which uses the same server for the development and production environments. What problems can arise if it uses single server for both purposes? 5m

To save capital, often data warehousing teams will decide to use only a single database and a single server for the different environments i.e. development and production. Environment separation is achieved by either a directory structure or setting up distinct instances of the database.

This is awkward for the following reasons:

• Sometimes the server needs to be rebooted for the development environment. Having a separate development environment will prevent the production environment from being affected by this.

• There may be interference when different database environments share a single server. For example, multiple long queries running in the development environment could affect the performance of production, as both run on the same server.

Write down any two drawbacks if “Date” is stored in text format rather than using proper date format like “dd-MMM-yy” etc. 5m

In context of Web data warehousing, consider the “web page” dimension, list at least five possible attributes of this dimension. 5m

Page key

Page source

Page function

Page template

Item type

Graphic type

Animation type

Sound type

Page file name

There are different data mining techniques e.g. “clustering”, “description” etc. Each of the following statements corresponds to some data mining technique. For each statement name the technique the statement corresponds to. 5m

a) Assigning customers to predefined customer segments (i.e. good vs. bad): classification

b) Assigning credit applicants to predefined classes (i.e. low, medium, or high risk): classification

c) Guessing how much customers will spend during next 6 months: prediction

d) Building a model and assigning a value from 0 to 1 to each member of the set, then classifying the members into categories based on a threshold value: estimation

e) Guessing how many students will score more than 65% grades in midterm: prediction

Specify at least one implication, if you don’t provide proper documentation as part of data warehouse development. 3m

Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation is left for those who are handling production maintenance. There is nothing more frustrating than staring at something another person did, yet unable to figure it out due to the lack of proper documentation.

Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of the data warehouse planned, start on that as soon as possible.

In context of nested loop join, mention two guidelines for selecting a table as inner table. 3m

For a Nested-Loop join inner and outer tables are determined as follows: page 242

The outer table is usually the one that has:

• The smallest number of qualifying rows, and/or

• The largest numbers of I/Os required to locate the rows.

The inner table usually has:

• The largest number of qualifying rows, and/or

• The smallest number of reads required to locate rows
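Why the table with fewer qualifying rows goes on the outside can be seen in a minimal nested-loop join sketch. The tables, column names and the scan counter are assumptions added for illustration; a real optimizer also weighs the I/O cost mentioned in the guidelines.

```python
def nested_loop_join(outer, inner, key):
    """Join two lists of dicts on `key`, counting passes over the inner table."""
    result = []
    inner_scans = 0
    for o in outer:              # one pass over the inner table per outer row
        inner_scans += 1
        for i in inner:
            if o[key] == i[key]:
                result.append({**o, **i})
    return result, inner_scans

depts = [{"dept_id": 1, "dept": "Sales"}, {"dept_id": 2, "dept": "HR"}]
emps = [
    {"emp": "Ali", "dept_id": 1},
    {"emp": "Faisal", "dept_id": 2},
    {"emp": "Waseem", "dept_id": 1},
]

# Smaller table outside: the inner table is scanned only twice.
joined, scans = nested_loop_join(depts, emps, "dept_id")
# Swapped: same join result size, but three passes over the inner table.
joined_swapped, scans_swapped = nested_loop_join(emps, depts, "dept_id")
print(len(joined), scans, scans_swapped)
```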

We can identify the Session in the World Wide Web by using “Time-contiguous Log Entries”; however, there are some limitations of this technique. Briefly explain any two limitations. 3m

Answer: In many cases, the individual hits comprising a session can be consolidated by collating time-contiguous log entries from the same host (Internet Protocol, or IP, address). If the log contains a number of entries with the same host ID in a short period of time (for example, one hour), one can reasonably assume that the entries are for the same session.

Limitations:

• This method breaks down for visitors from large ISPs because different visitors may reuse dynamically assigned IP addresses over a brief time period.

• Different IP addresses may be used within the same session for the same visitor.

• This approach also presents problems when dealing with browsers that are behind some firewalls.
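The time-contiguous approach can be sketched as grouping hits per host and starting a new session whenever the gap between consecutive hits is too long. Treating the one-hour figure as the maximum gap between hits is an assumption, as are the log format and the sample entries.

```python
from collections import defaultdict

SESSION_GAP = 3600  # seconds; the one-hour example from the text, used as cut-off

def sessionize(log_entries, gap=SESSION_GAP):
    """Group (host, timestamp) hits into per-host sessions by time contiguity."""
    by_host = defaultdict(list)
    for host, ts in log_entries:
        by_host[host].append(ts)

    sessions = []
    for host, stamps in by_host.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] <= gap:
                current.append(ts)        # still the same session
            else:
                sessions.append((host, current))
                current = [ts]            # gap too long: start a new session
        sessions.append((host, current))
    return sessions

log = [("10.0.0.1", 0), ("10.0.0.1", 120), ("10.0.0.2", 60), ("10.0.0.1", 9000)]
sessions = sessionize(log)
print(sessions)
```

The sketch also makes the limitations visible: it keys on the IP address alone, so two visitors sharing or reusing an address are merged, and one visitor whose address changes is split.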

Identify the given statement as correct or incorrect and justify your answer in either case. "The problem of Referential Integrity always occurs in traditional OLTP system as well as in DWH". 3m

Answer: Incorrect. While doing total quality measurement, you measure RI every week (or month) and hopefully the number of orphan records will be going down, as you will be fine tuning the processes to get rid of the RI problems. Remember, the RI problem is peculiar to a DWH; this will not happen in a traditional OLTP system.

There are two primary techniques for gathering requirements i.e. interviews or facilitated sessions. Which technique is preferred by Ralph Kimball? 2m

Both have their advantages and disadvantages. Interviews encourage lots of individual participation. They are also easier to schedule. Facilitated sessions may reduce the elapsed time to gather requirements, although they require more time commitment from each participant. Kimball prefers using a hybrid approach with interviews to gather the gory details and then facilitation to bring the group to consensus.

List down any two Parallel Software Architectures? 2m

Brief Intro to Parallel Processing:

Parallel Hardware Architectures

Symmetric Multi-Processing (SMP)

Distributed Memory or Massively Parallel Processing (MPP)

Non-uniform Memory Access (NUMA)

Parallel Software Architectures

Shared Memory

Shared Disk

Shared Nothing

Types of parallelism

Data Parallelism

Spatial Parallelism

List down any four Static Attributes recorded by the scouts in Agriculture Data Warehouse Case Study. 2m

Static attributes    Dynamic attributes
Farmer name          Date of visit
Farmer address       Pest population
Field acreage        CLCV
Variety sown         Predator population
Sowing date          Pesticide spray dates
Sowing method        Pesticides used

List down any four issues of Click stream Data. 2m

Issues of Click stream Data: (Page#341)

Click stream data has many issues:

Identifying the Visitor Origin

Identifying the Session

Identifying the Visitor

Proxy Servers

Browser Caches

Subjective: 1. what is Web Data Warehouse? (2 marks)

Answer: Page no: 350 Chapter: 39

Web Warehousing can be used to mine the huge web content for searching information of interest. It’s like searching for the golden needle in the haystack. The second reason for Web warehousing is to analyze the huge web traffic. This can be of interest to Web Site owners, for e-commerce, for e-advertisement and so on. The last but not least reason for Web warehousing is to archive the huge web content because of its dynamic nature.

3. Write first two phases of Kimball's Approach of business dimensional lifecycle. (2 marks)

Answer: Kimball also proposes a four-step approach where he starts to choose a business process, takes the grain of the process, and chooses dimensions and facts. He defines a business process as a major operational process in the organization that is supported by some kind of legacy system (or systems).

4. There are four categories of data quality improvement. Write any two. (2 marks)

Ans. The four categories of Data Quality Improvement

• Process

• System

• Policy & Procedure

• Data Design

1. Data profiling is a process which involves gathering of information. What are the purposes that it must fulfill? (3 marks)

Answer: Data profiling is a process which involves gathering information about a column through execution of certain queries, with the intention of identifying erroneous records. In this process we identify the following:

• Total number of values in a column

• Number of distinct values in a column

• Domain of a column

• Values out of domain of a column

• Validation of business rules

We run different SQL queries to get the answers to the above questions. During this process we can identify the erroneous records. Whenever we come across an erroneous record, we just copy it into the error or exception table and set the dirty bit of the record in the actual student table. Then we correct the exception table. After this profiling process we transform the records and load them into a new table, Student_Info.

Ref: Handout Page No. 354
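The profiling questions listed above map onto simple SQL queries. The sketch below runs them through Python's sqlite3 module on an assumed Student table; the columns, the allowed gender domain ('M', 'F'), the age rule of 15 to 60, and the sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (StID INTEGER PRIMARY KEY, Gender TEXT, Age INTEGER);
INSERT INTO Student VALUES (1, 'M', 20), (2, 'F', 22), (3, 'X', 21),
                           (4, 'M', 250), (5, NULL, 19);
""")

# Total number of (non-NULL) values in the column.
total_values = conn.execute("SELECT COUNT(Gender) FROM Student").fetchone()[0]

# Number of distinct values in the column.
distinct_values = conn.execute(
    "SELECT COUNT(DISTINCT Gender) FROM Student").fetchone()[0]

# Observed domain of the column.
domain = [r[0] for r in conn.execute(
    "SELECT DISTINCT Gender FROM Student "
    "WHERE Gender IS NOT NULL ORDER BY Gender")]

# Values outside the expected domain (assumed here to be 'M' and 'F').
out_of_domain = conn.execute(
    "SELECT StID FROM Student WHERE Gender NOT IN ('M', 'F') ORDER BY StID"
).fetchall()

# Validation of a business rule (assumed: age must lie between 15 and 60).
rule_violations = conn.execute(
    "SELECT StID FROM Student WHERE Age NOT BETWEEN 15 AND 60 ORDER BY StID"
).fetchall()

print(total_values, distinct_values, domain, out_of_domain, rule_violations)
```

Rows returned by the last two queries are the candidates to copy into the exception table.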

7. Apply Run length encoding on the given code and write output. (3 marks)

Case-I: 1111111110000111

Answer: 19#04#13

Case-II: 00001111000000

Answer: 04#14#06

8. Identify the given statement as correct or incorrect and justify your answer in either case. (3 marks)

"One-way clustering is used to get local view and Two-way clustering is used to get global view."

Answer: Incorrect

One-way clustering gives global view and bi-clustering gives local view

9. A pilot project strategy is highly recommended in data warehouse. What are the reasons for its recommendation? (5 marks)

Answer: A pilot project strategy is highly recommended in data warehouse construction, as a full blown data warehouse construction requires significant capital investment, effort and resources. Therefore, the same must be attempted only after a thorough analysis and a valid proof of concept. A small scale project in this regard serves many purposes, such as (i) show users the value of DSS information, (ii) establish blue print processes for the later full-blown project, (iii) identify problem areas and (iv) reveal true data demographics. Hence doing a pilot project on a small scale seems to be the best strategy.

10. Data acquisition and cleansing. (5 marks)

• The pest scouting sheets are larger than A4 size (8.5” x 11”), hence the right end was cropped when scanned on a flat-bed A4 size scanner.

• The right part of the scouting sheet is also the most troublesome, because of pesticide names for a single record typed on multiple lines i.e. for multiple farmers.

• As a first step, OCR (Optical Character Recognition) based image-to-text transformation of the pest scouting sheets was attempted. But it did not work, even for relatively clean sheets scanned at very high resolutions.

• Subsequently, DEOs (Data Entry Operators) were employed to digitize the scouting sheets by typing.

Data cleansing and standardization is probably the largest part in an ETL exercise. For Agri-DWH major issues of data cleansing had arisen due to data processing and handling at four levels by different groups of people i.e.

(i) Hand recordings by the scouts at the field level

(ii) typing hand recordings into data sheets at the DPWQCP office

(iii) photocopying of the scouting sheets by DPWQCP personnel and finally

(iv) Data entry or digitization by hired data entry operators.

12. A table was given containing Name, Items, Time and Gender, along with the following statement. (5 marks)

IF Time/Items >= 6 THEN Gender = ‘F’ ELSE Gender = ‘M’

a) Find the accuracy % of given data.


b) If Name: Ali, Items: 2, time: 14 then find the gender of Ali.

Answer: page 278

The model in our case is a rule: if the minutes spent per item by a customer are greater than or equal to 6, the customer is female, otherwise male. The rule is based on the common notion that females spend more time shopping than male customers. Exceptions can exist and are treated as outliers.

For the first record the ratio is greater than 6, so our model assigns it to the female class, but that may be an exception or noise. The second and the third records are as per the rule. Thus, the accuracy of our model is 2/3, i.e. about 0.66 or 66%. In other words, the confidence level of our classification model is 66%. The accuracy may change as we add more data. Now unseen data is brought into the picture. Suppose there is a record with name Firdous, time spent 15 minutes and 1 item purchased. We predict the gender using our classification model, and as per our model the customer is assigned ‘F’ (15/1 = 15, which is greater than 6).
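The rule and the accuracy calculation can be sketched as follows. The rule is read as minutes per item (time divided by items), as in the explanation above. The three sample records are hypothetical stand-ins, since the exam table is not reproduced here; they are chosen only so that one of three records disagrees with the rule.

```python
def predict_gender(items: int, minutes: float) -> str:
    """Apply the rule: minutes per item >= 6 means 'F', otherwise 'M'."""
    return "F" if minutes / items >= 6 else "M"


def accuracy(records) -> float:
    """Fraction of (items, minutes, gender) records the rule classifies correctly."""
    correct = sum(
        1 for items, minutes, gender in records
        if predict_gender(items, minutes) == gender
    )
    return correct / len(records)


# Hypothetical records: the first one contradicts the rule (possible noise).
sample = [(1, 10, "M"), (5, 10, "M"), (1, 8, "F")]

print(round(accuracy(sample), 2))   # 0.67
print(predict_gender(2, 14))        # Ali: 14/2 = 7, so F
print(predict_gender(1, 15))        # Firdous: 15/1 = 15, so F
```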

Subjective:

1. Write four partitioning types of shared nothing in Parallel Software Architecture.

Answer

Shared nothing RDBMS architecture requires a static partitioning of each table in the database.

How do you perform the partitioning?

Hash partitioning

Key range partitioning.

List partitioning.

Round-Robin

Combinations (Range-Hash & Range-List)
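A minimal sketch of the four basic partitioning schemes listed above, applied to rows held in plain Python lists. The function names, the sample rows and the use of a simple modulo as the hash function are assumptions made for illustration; a real shared nothing RDBMS distributes rows across nodes, not across lists.

```python
import bisect

# Hypothetical sample rows.
rows = [
    {"id": i, "region": r}
    for i, r in enumerate(["N", "S", "E", "W", "N", "S"], start=1)
]


def hash_partition(rows, key, n):
    """Assign each row by a hash of its key (here simply key modulo n)."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts


def range_partition(rows, key, upper_bounds):
    """Assign each row to the first range whose inclusive upper bound covers it."""
    parts = [[] for _ in range(len(upper_bounds) + 1)]
    for row in rows:
        parts[bisect.bisect_left(upper_bounds, row[key])].append(row)
    return parts


def list_partition(rows, key, value_to_part, n):
    """Assign each row by an explicit list of values per partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[value_to_part[row[key]]].append(row)
    return parts


def round_robin_partition(rows, n):
    """Deal rows out to the partitions in turn, ignoring their content."""
    parts = [[] for _ in range(n)]
    for position, row in enumerate(rows):
        parts[position % n].append(row)
    return parts


def ids(parts):
    return [[row["id"] for row in part] for part in parts]


print(ids(hash_partition(rows, "id", 3)))
print(ids(range_partition(rows, "id", [2, 4])))
print(ids(list_partition(rows, "region", {"N": 0, "S": 0, "E": 1, "W": 1}, 2)))
print(ids(round_robin_partition(rows, 3)))
```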

2. What is a Web data warehouse? (Answer in current solution file)

3. Variants of nested-loop join?

Answer:

Nested-Loop Join: Variants

1. Naive nested-loop join

2. Index nested-loop join


3. Temporary index nested-loop join
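The three variants can be sketched on in-memory lists of dictionaries. This is an illustrative sketch only: the function names and sample tables are hypothetical, and a Python dictionary stands in for a database index.

```python
from collections import defaultdict


def naive_nested_loop_join(outer, inner, key):
    """Scan the whole inner table once for every outer row."""
    return [(o, i) for o in outer for i in inner if o[key] == i[key]]


def build_index(table, key):
    """Map each key value to the rows that carry it."""
    index = defaultdict(list)
    for row in table:
        index[row[key]].append(row)
    return index


def index_nested_loop_join(outer, inner_index, key):
    """Probe an existing index on the inner table for every outer row."""
    return [(o, i) for o in outer for i in inner_index.get(o[key], [])]


def temporary_index_nested_loop_join(outer, inner, key):
    """Build a throwaway index on the inner table, then probe it."""
    return index_nested_loop_join(outer, build_index(inner, key), key)


# Hypothetical tables.
customers = [{"cid": 1, "name": "Ali"}, {"cid": 2, "name": "Faisal"}]
orders = [
    {"cid": 1, "item": "pen"},
    {"cid": 2, "item": "book"},
    {"cid": 1, "item": "ink"},
]

naive = naive_nested_loop_join(customers, orders, "cid")
indexed = temporary_index_nested_loop_join(customers, orders, "cid")
print(len(naive), naive == indexed)
```

All three variants return the same rows; they differ only in how much of the inner table is read per outer row.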

4. Is there any strategy to standardize a column?

Answer: page 480

There are no fixed strategies to standardize the columns.

5. Dynamic attributes of the agri data warehouse (answer in current solution file)

6. Write two limitations of persistent cookies.

Answer: Limitations

• It's possible that the visitor will have his or her browser set to refuse cookies or may clean out his or her cookie file manually, so there is no absolute guarantee that even a persistent cookie will survive.

• Although any given cookie can be read only by the Web site that caused it to be created, certain groups of Web sites can agree to store a common ID tag that would let these sites combine their separate notions of a visitor session into a super session.

7. As the number of processors increases, the speedup should also increase. Thus theoretically there should be a linear speedup; however, this is not the case in reality. List at least two barriers to linear speedup.

Answer:

Amdahl’s Law

Startup

Interference

Skew
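Of the four barriers, Amdahl's Law is the one that can be stated as a formula: if a fraction f of the work is inherently sequential, the speedup on n processors is 1 / (f + (1 - f) / n). The sketch below only illustrates this formula; the function and parameter names are hypothetical, and startup, interference and skew are not modelled.

```python
def amdahl_speedup(sequential_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's Law for a given sequential fraction."""
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / processors)


# With 10% sequential work, adding processors gives diminishing returns
# and the speedup can never exceed 1 / 0.1 = 10.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.1, n), 2))
```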

8. In the context of nested-loop join, mention two guidelines for the outer table. (Answer in current solution file)

9. Before sitting down with the business community to gather information, it is suggested to set yourself up for a productive session. Write three activities of the requirements preplanning phase.

Answer:

Requirements preplanning: This phase consists of activities like choosing the forum, identifying and preparing the requirements team and finally selecting, scheduling and preparing the business representatives.

Do you think it will create the problem of non-standardized attributes, if one source uses 0/1 and second source uses 1/0 to store male/female attribute respectively? Give a reason to support your answer. 2 marks


Two ways in which parallelism can reduce a system’s performance. 2 marks

There are two primary techniques for gathering requirements, i.e. interviews or facilitated sessions. Which one does Kimball prefer? 2 marks

Kimball prefers using a hybrid approach, with interviews to gather the gory details and then facilitation to bring the group to consensus.

Why is the Analytic Track considered the “fun part”? 2 marks

Write any three complete warehouse deliverables. 3 marks

Dimensions in context with Web data warehouse. 3 marks

Page key

Page source

Page function

Page template

Item type

Graphic type

Animation type

Sound type

Page file name

Name three DWH development methodologies. 3 marks


The query SELECT * FROM R WHERE A = 5 was given, and we had to tell which technique is appropriate among dense, sparse, B-tree and hash indexing. 5 marks

Why should companies entertain students to visit their company's place? 5 marks

Two tables were given employee and exception table.

Employee table

EmpID  EmpName  Age
1      Ali      28
2      Faisal   32
3      Waseem   389
4      Arham    398

Exception table

EmpID  IsAgeValid
1      1
2      1
3      1
4      1

We have to write a query that accesses the employee table and sets IsAgeValid = 0 where the age is greater than or equal to 25 and less than or equal to 75. 5 marks
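One possible answer can be sketched with Python's sqlite3 module. Two assumptions are made loudly here. First, the flattened rows above are read as (1, Ali, 28), (2, Faisal, 32), (3, Waseem, 389) and (4, Arham, 398), with every IsAgeValid initially 1. Second, the question is taken to mean that ages outside the range 25 to 75 are the invalid ones, so those rows get IsAgeValid = 0; for the literal wording, drop the NOT. The table name ExceptionTable is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employee (EmpID INTEGER PRIMARY KEY, EmpName TEXT, Age INTEGER);
    CREATE TABLE ExceptionTable (EmpID INTEGER PRIMARY KEY, IsAgeValid INTEGER);
    INSERT INTO Employee VALUES
        (1, 'Ali', 28), (2, 'Faisal', 32), (3, 'Waseem', 389), (4, 'Arham', 398);
    INSERT INTO ExceptionTable VALUES (1, 1), (2, 1), (3, 1), (4, 1);
    """
)

# Assumption: an age outside 25..75 is invalid, so its flag is cleared.
conn.execute(
    """
    UPDATE ExceptionTable
    SET IsAgeValid = 0
    WHERE EmpID IN (
        SELECT EmpID FROM Employee WHERE Age NOT BETWEEN 25 AND 75
    )
    """
)

flags = conn.execute(
    "SELECT EmpID, IsAgeValid FROM ExceptionTable ORDER BY EmpID"
).fetchall()
print(flags)  # [(1, 1), (2, 1), (3, 0), (4, 0)]
```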

CS614 today's paper: 15 MCQs were from past papers. In the subjective part: unsupervised learning (2)

Answer: page 27

Unsupervised learning is where you don’t know the number of clusters and obviously have no idea about their attributes either. In other words, you are not guiding the DM process in any way: no guidance and no input. Unsupervised learning is closer to the exploratory spirit of Data Mining, as stressed in the definitions given above. In unsupervised learning situations all variables are treated in the same way; there is no distinction between explanatory and dependent variables. However, in contrast to the name undirected data mining, there is still some target to achieve. This target might be as general as data reduction or more specific, like clustering. For unsupervised learning, typically either the target variable is unknown or it has only been recorded for too small a number of cases.

Pest scouting (2)

Identify the correct statement (2)

A table was given and an index table had to be built from it (5)

A table was given and an SQL query had to be written for it (5)

There were four types of partitioning (3)

Hash partitioning

Key range partitioning

List partitioning

Round-Robin

Five attributes from the Web warehouse topic (5)

Subject Oriented

Integrated

Nonvolatile

Time Variant

Three process steps from Kimball's approach had to be written (3)

(Business process --> Grain --> Facts --> Dimensions)

There was also one question of Output of run length encoding (2 or 3 marks)

MCQs were mostly from past papers, almost 35 out of 40. See:

cs614FinaltermSolvedMCQsWithReferencesbyMoaaz.pdf

CS614FinaltermSolvedMCQsWithReferencesUpdate.pdf

Go through all the quizzes as well.

Paper was easy

List any two parallel hardware architectures. (2)

Which scripting languages are used to perform complete transformations in DTS packages? (2)

A data warehouse project is more like scientific research than anything in traditional information systems. Do you agree or not? Justify in either case. (2)

Four static attributes recognized by scouts in agriculture. (2)

In a distributed memory machine a processor can write a value into a shared memory and all processes can read this value. (Correct or incorrect statement) (3)

Three drawbacks of traditional web searches? (3)

In the context of nested-loop join, mention two guidelines for selecting a table as the inner table. (3)


What purpose must data profiling fulfill? (3)

Differentiate between statistics and data mining w.r.t. number of parameters (dimensions) and type of data. (5)

Three activities that you will consider as part of the requirements preplanning phase. (5)

Web page dimension: five possible attributes. (5)

How is the gender guide used when gender is missing? (5)

4 Questions of 2 marks, 3 marks and 5 marks

Total marks: 80

Q. Do you agree that a single technology/tool is sufficient to fulfill all needs of users? (2 marks)

Q. How would you determine the outer table in a nested-loop join? (2 marks)

Answer: The outer table is the one with the smallest number of qualifying rows, and/or the largest number of I/Os required to locate the rows.

Q. Write the names of the first two steps of the Kimball DWH lifecycle. (3 marks)

Answer:

1. Project Planning

2. Business Requirements Definition

Q. Write the drawbacks of traditional web searches. (3 marks)

Answer:

1. Limited to keyword based matching.

2. Cannot distinguish between the contexts in which a link is used.

3. Coupling of files has to be done manually.

Q. Write at least three names of partitioning types in shared nothing RDBMS architecture. (3 marks)

Answer:

Hash partitioning

Key range partitioning.

List partitioning.

Q. Why is the RAD methodology successful? Write at least two reasons. (5 marks)

Answer: Rapid Application Development (RAD) is an iterative model consisting of stages like scope, analyze, design, construct, test, implement, and review. It is much better suited to the development of a data warehouse because of its iterative nature and fast iterations.

Q. Make a bitmap index (a table was given). (5 marks)

Q. One question was from Agri-DWH (pest scouting). (5 marks)

What is pest scouting?

Pest scouting is a systematic field sampling process that provides field-specific information on pest pressure and crop injury. The pest scouting data has been constantly recorded by the Directorate of Pest Warning and Quality Control of Pesticides (DPWQCP), Punjab since 1984. However, despite pest scouting, yield losses have been occurring. The most recent was the Boll Worm attack on the cotton crop during 2003-04, resulting in a loss of nearly 0.5 million bales. This loss cannot be attributed to weather alone, but points to a multitude of factors, requiring efficient and effective data analysis for better decision making.


What is a reverse proxy?

Another type of proxy server, called a reverse proxy, can be placed in front of our enterprise's Web servers to help them offload requests for frequently accessed content. This kind of proxy is entirely within our control and usually presents no impediment to Web warehouse data collection. It should be able to supply the same kind of log information as that produced by a Web server.

List three steps that are performed in the requirements definition phase of Kimball's approach to data warehouse development.

print page 294-299

Which common measurements can be used to measure the success of a specific email, advertisement or marketing campaign?

The success of an email or other marketing campaign can be measured by integrating with other operational systems. Common measurements are:

• Number of visitors

• Number of sessions

• Most requested pages

• Robot activity, etc.

Being a part of the training team, specify three guidelines that you consider part of an effective user education program.

Some options are:

• Invest in just-in-time training (provided by data warehousing tool vendors)

• Use pilot projects as seeds for new technology training

• Develop reward systems that encourage experimentation

• Use outside system integrators and individual consultants