

Two Can Keep a Secret: A Distributed Architecture for Secure Database Services

G. Aggarwal, M. Bawa, P. Ganesan, H. Garcia-Molina, K. Kenthapadi, R. Motwani, U. Srivastava, D. Thomas, Y. Xu

Stanford University
{gagan, bawa, prasannag, hector, kngk, rajeev, usriv, dilys, xuying}@cs.stanford.edu

    Abstract

Recent trends towards database outsourcing, as well as concerns and laws governing data privacy, have led to great interest in enabling secure database services. Previous approaches to enabling such a service have been based on data encryption, causing a large overhead in query processing. We propose a new, distributed architecture that allows an organization to outsource its data management to two untrusted servers while preserving data privacy. We show how the presence of two servers enables efficient partitioning of data so that the contents at any one server are guaranteed not to breach data privacy. We show how to optimize and execute queries in this architecture, and discuss new challenges that emerge in designing the database schema.

1 Introduction

The database community is witnessing the emergence of two recent trends set on a collision course. On the one hand, the outsourcing of data management has become increasingly attractive for many organizations [HIM02]; the use of an external database service promises reliable data storage at a low cost, eliminating the need for expensive in-house data-management infrastructure, e.g., [CW02]. On the other hand, escalating concerns about data privacy, recent governmental legislation [SB02], as well as high-profile instances of database theft [WP04], have sparked keen interest

in enabling secure data storage.

This work was supported in part by NSF Grant ITR-0331640. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 2005 CIDR Conference.

The two trends described above are in direct conflict with each other. A client using a database service needs to trust the service provider with potentially sensitive data, leaving the door open for damaging leaks of private information. Consequently, there has been much recent interest in a so-called Secure Database Service: a DBMS that provides reliable storage and efficient query execution, while not knowing the contents of the database [KC04]. Such a service also helps the service provider by limiting its liability in case of break-ins into its system: if the service provider does not know the contents of the database, neither will a hacker who breaks into the system.

Existing proposals for secure database services have typically been founded on encryption [HILM02, HIM04, AKSX04]. Data is encrypted on the (trusted) client side before being stored in the (untrusted) external database. Observe that there is always a trivial way to answer all database queries: the client can fetch the entire database from the server, decrypt it, and execute the query on this decrypted database. Of course, such an approach is far too expensive to be practical. Instead, the hope is that queries can be transformed by the client to execute directly on the encrypted data; the results of such transformed queries could be post-processed by the client to obtain the final results.

Unfortunately, such hopes are often dashed by the privacy-efficiency trade-off of encryption. Weak encryption functions that allow efficient queries leak far too much information and thus do not preserve data privacy [KC04]. On the other hand, stronger encryption functions often necessitate resorting to "Plan A" for queries: fetching the entire database from the server, which is simply too expensive. Moreover, despite increasing processor speeds, encryption and decryption are not exactly cheap, especially when performed over data at fine granularity.

We propose a new approach to enabling a secure database service. The key idea is to allow the client to partition its data across two (and, more generally, any number of) logically independent database systems that cannot communicate with each other. The


server is honest, but potentially curious: the server may actively monitor all the data that it stores, as well as the queries it receives from the client, in the hope of breaching privacy; it does not, however, act maliciously by providing erroneous service to the client or by altering the stored data.

The client maintains separate, permanent channels of communication to each server. We do not require communication to be encrypted; however, we assume that no eavesdropper is capable of listening in on both communication channels. The two servers are assumed to be unable to communicate directly with each other (depicted by the wall between them in Figure 1) and, in fact, need not even be aware of each other's existence.

Note that the client side is assumed to be completely trusted and secure. There would not be much point in developing a secure database service if a hacker can simply penetrate the client side and transparently access the database. Preventing client-side breaches is a traditional security problem unrelated to privacy-preserving data storage, and we do not concern ourselves with this problem here.

    2.1 Relation Decomposition

We now consider different techniques to partition data across the two servers in the distributed architecture described above. Say the client needs to support queries over the universal relation R(A1, A2, ..., An). Since the client itself possesses no permanent storage, the contents of relation R need to be decomposed and stored across the two servers. We require that the decomposition be both lossless and privacy preserving.

A lossless decomposition is simply one in which it is possible to reconstruct the original relation R using only the contents in the two servers S1 and S2. The exact manner in which such a reconstruction is performed is flexible, and may involve not only traditional relational operators such as joins and unions, but also other user-defined functions, as we discuss shortly. We also require the decomposition to be privacy preserving: the contents stored at server S1 or S2 must not, in themselves, reveal any private information about R. We postpone our discussion of what constitutes private information to the next section.

Traditional relation decomposition in distributed databases is of two types:

Horizontal Fragmentation Each tuple of the relation R is stored at S1 or S2. Thus, server S1 contains a relation R1, and S2 contains a relation R2, such that R = R1 ∪ R2.

Vertical Fragmentation The attributes of relation R are partitioned across S1 and S2. The key attributes are stored at both sites to ensure lossless decomposition. Optionally, other attributes may also be replicated at both sites in order to improve query performance. If the relations at S1 and S2 are R1 and R2 respectively, then R = R1 ⋈ R2, where ⋈ refers to the natural join on all common attributes.

We believe that horizontal fragmentation is of limited use in enabling privacy-preserving decomposition. For example, a company might potentially store its American sales records as R1 and its European records as R2 to prevent an adversary from gathering statistics about overall sales, thus providing a crude form of privacy. In this paper, we will focus on vertical fragmentation, which appears to hold much more promise.

We now discuss a variety of extensions to vertical fragmentation which all aid in making the decomposition privacy preserving.

Unique Tuple IDs Vertical partitioning requires a key to be present in both databases in order to ensure lossless decomposition. Since key attributes may themselves be private (and can therefore not be stored in the clear), we may introduce a unique tuple ID for each tuple and replicate this tuple ID alone across the two sites. (This concept is not new; tuple IDs have been considered as an alternative to key replication to lower update costs in distributed databases [OV99].)

There are a variety of ways to generate unique tuple IDs when inserting new tuples. Tuple IDs could simply be sequence numbers generated independently at the two servers with each new tuple insertion; so long as the client ensures that tuple insertions are atomic, both servers will automatically generate the same sequence number for corresponding tuples. Alternatively, the client could generate random numbers as tuple IDs, potentially performing a query to a server to make sure that the tuple ID does not already exist.
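The random-ID alternative can be sketched in a few lines. This is a hypothetical client-side helper of our own (the paper specifies no code); in-memory dictionaries stand in for the two servers, and the existence check is a single lookup because IDs are replicated at both sites:

```python
import secrets

class TwoServerInserter:
    """Client-side sketch: generate a random tuple ID,
    retrying if a server already holds it, then write one
    fragment of the tuple to each server."""

    def __init__(self):
        self.s1 = {}  # tuple_id -> fragment stored at server S1
        self.s2 = {}  # tuple_id -> fragment stored at server S2

    def fresh_id(self):
        while True:
            tid = secrets.randbits(64)
            if tid not in self.s1:  # one lookup suffices: IDs are replicated
                return tid

    def insert(self, frag1, frag2):
        tid = self.fresh_id()
        # The two writes must appear atomic to readers; a real client
        # would need a recovery protocol for partial failures.
        self.s1[tid] = frag1
        self.s2[tid] = frag2
        return tid
```

With 64-bit random IDs, retries are vanishingly rare in practice; the sequence-number scheme avoids the existence check entirely but requires strictly serialized insertions.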

Semantic Attribute Decomposition It may be useful to split an attribute A into two separate, but related, attributes A1 and A2, in order to exploit privacy constraints that may apply to A1 but not to A2. To illustrate, consider an attribute representing people's telephone numbers. The area code of a person's number may be sufficiently harmless to be considered public information, but the phone number, in its entirety, is information subject to misuse and should be kept private. In such a case, we may decompose the phone-number attribute into two: a private attribute A1 representing the last seven digits of the phone number, and a public attribute A2 representing the first three digits.

We can immediately see the benefits of such attribute decomposition. Selection queries based on phone numbers, or queries that perform aggregation when grouping by area code, could benefit greatly from the availability of attribute A2. In contrast, in the absence of A2, and if the phone numbers were completely hidden (e.g., by encryption), query processing becomes more expensive.


Attribute Encoding It may be necessary to encode an attribute value across both databases so that neither database can deduce the value. For example, consider an attribute that needs to be kept private, say the employee salary. We may encode the salary attribute A as two separate attributes A1 and A2, to be stored in the two databases. The encoding of a salary value a as two separate values a1 and a2 may be performed in different fashions, three of which we outline here:

1. One-time Pad: a1 = a ⊕ r, a2 = r, where r is a random value;

2. Deterministic Encryption: a1 = E(a, k), a2 = k, where E is a deterministic encryption function such as AES or RSA;

3. Random Addition: a1 = a + r, a2 = r, where r is a random number drawn from a domain much larger than that of a.

In all the above cases, observe that we may reconstruct the original value of a using the values a1 and a2. The three different encoding schemes above offer different trade-offs between privacy and efficiency.
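The three encodings and their reconstructions can be sketched as follows. This is a toy illustration of ours, not the paper's implementation: values are assumed to be 64-bit integers, and the "deterministic encryption" is a keyed XOR stream derived from SHA-256, a stand-in for a real cipher such as AES (it is not secure):

```python
import hashlib
import secrets

DOMAIN_BITS = 64  # assume fixed-length 64-bit attribute values

def otp_encode(a):
    """One-time pad: a1 = a XOR r, a2 = r."""
    r = secrets.randbits(DOMAIN_BITS)
    return a ^ r, r

def otp_decode(a1, a2):
    return a1 ^ a2

def det_encrypt(a, k):
    """Toy deterministic 'encryption': XOR with a key-derived stream.
    a1 = E(a, k); the key k itself plays the role of a2 at server S2."""
    stream = int.from_bytes(hashlib.sha256(k).digest()[:8], "big")
    return a ^ stream

def det_decrypt(a1, k):
    return det_encrypt(a1, k)  # the XOR stream is its own inverse

def add_encode(a):
    """Random addition: a1 = a + r, a2 = r, with r from a much larger domain."""
    r = secrets.randbelow(2 ** (DOMAIN_BITS + 32))
    return a + r, r

def add_decode(a1, a2):
    return a1 - a2
```

For any of the three, `decode(*encode(a))` returns the original a; the schemes differ only in what each server can learn and which operations can be pushed to it, as discussed next.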

The first scheme offers true information-theoretic privacy, since both a1 and a2 consist of random bits that reveal no information about a.¹ It also offers fast reconstruction of a from a1 and a2, since only an XOR is required. However, such an encoding rules out any hope of pushing down a selection condition on attribute A; such conditions can be evaluated only on the client side, after fetching all corresponding a1 and a2 values from the servers and reconstructing value a.

The second scheme offers no benefits over the first if the key k is chosen to be an independent random value for each tuple. However, one could use the same key k for all tuples, i.e., all values for attribute A2 are equal to the same key k. In such a case, we may be able to execute selection conditions on A more efficiently by pushing down a condition on A1. Consider the selection condition A = v. We may evaluate this condition efficiently as follows, assuming key k is stored at S2:

1. Fetch key k from database S2.

2. Send selection condition A1 = E(v, k) to database S1.

3. Obtain matching tuples from S1.

However, a drawback of such an encoding scheme is a loss in privacy. For example, if two different tuples have the same value in attribute A, they will also possess the same encrypted value in attribute A1, thus allowing database S1 to deduce that the two tuples correspond to individuals with identical salaries. A second drawback of the scheme is that it requires encryption and decryption of attribute values at the client side, which may be computationally expensive. Finally, such an encryption scheme does not help if the selection condition is a range predicate on the attribute rather than an equality predicate.

¹ We assume that attribute A is of fixed length. Variable-length attributes may leak information about the length of the value unless encoded as fixed-length.

The third scheme outlined above is useful when handling queries that perform aggregation on attribute A. For example, a query that requires the average salary of employees may be answered by obtaining the average of attribute A1 from database S1, and subtracting out the average of attribute A2 from database S2. The price paid for this efficiency is once again a compromise of true privacy: in theory, it is possible for database S1 to guess whether the salary value in a particular tuple is higher than that in another tuple.

Adding Noise Another technique for improving privacy is the addition of noise tuples to both databases S1 and S2. Recall that the actual relation R is constructed by the natural join of R1 and R2. We may thus add dangling tuples to both R1 and R2 without compromising the lossless-decomposition property. The addition of such noise may help provide privacy, say by guaranteeing that the probability of any set of attribute values being part of a real tuple is less than a desired constant.
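The aggregation trick for the random-addition scheme can be traced on toy data (hypothetical salary values of ours): each server sums its own column locally, and the client recovers the true aggregate by subtraction.

```python
import random

salaries = [60_000, 75_000, 90_000]  # hypothetical plaintext values

# Random Addition: store a1 = a + r at S1 and a2 = r at S2,
# with r drawn from a much larger domain than the salaries.
encoded = []
for a in salaries:
    r = random.randrange(2**48)
    encoded.append((a + r, r))

col_s1 = [a1 for a1, _ in encoded]  # attribute A1 at server S1
col_s2 = [a2 for _, a2 in encoded]  # attribute A2 at server S2

# Each server computes a local SUM; the client takes the difference.
total = sum(col_s1) - sum(col_s2)
average = total / len(salaries)
```

Neither server's column reveals a salary on its own, though, as noted above, the scheme is not information-theoretically private: large a1 values are weak evidence of large salaries.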

3 Defining the Privacy Requirements and Achieving Them

So far, we have not stated the exact requirements of data privacy. There are many different formulations of data privacy, some of which are harder to achieve than others, e.g., [AS00, Swe02]. We introduce one particular definition that we believe is appropriate for the database-service context. We refer the reader to Appendix A to see how our definition captures privacy requirements imposed in legislation such as California bill SB1386 [SB02].

Our privacy requirements are specified as a set of privacy constraints P, expressed on the schema of relation R.² Each privacy constraint is represented by a subset, say P, of the attributes of R, and informally means the following:

Let R be decomposed into R1 and R2, and let an adversary have access to the entire contents of either R1 or R2. For every tuple in R, the value of at least one of the attributes in P must be completely opaque to the adversary, i.e., the adversary should be unable to infer anything about the value of that attribute. Note that it is permissible for the values of some attributes in P to be open, so long as there is at least one attribute completely hidden from the adversary.

² Other notions of privacy, such as k-anonymity [Swe02], are defined on the actual relation instances and may be harder to enforce in an efficient fashion.

We illustrate this definition by an example. Consider a company desiring to store a relation R consisting of the following attributes of employees: Name, Date of Birth (DoB), Gender, Zipcode, Position, Salary, Email, Telephone. The company may have the following considerations about privacy:

1. Telephone and Email are sensitive information subject to misuse, even on their own. Therefore both these attributes form singleton privacy constraints and cannot be stored in the clear under any circumstances.

2. Salary, Position and DoB are considered private details of individuals, and so cannot be stored together with an individual's name in the clear. Therefore, the sets {Name, Salary}, {Name, Position} and {Name, DoB} are all privacy constraints.

3. The set of attributes {DoB, Gender, Zipcode} can help identify a person in conjunction with other publicly available data. Since we already stated that {Name, DoB} is a privacy constraint, we also need to add {DoB, Gender, Zipcode} as a privacy constraint.

4. We may also want to prevent an adversary from learning sensitive association rules, for example, between position and salary, or between age and salary. Therefore, we may add two privacy constraints: {Position, Salary} and {Salary, DoB}.

What does it mean to not be able to infer the value of an attribute A? We have left this definition intentionally vague to accommodate the requirements of different applications. On one end, we may require true information-theoretic privacy: the adversary must be unable to gain any information about the value of the attribute from examining the contents of either R1 or R2. We may also settle for weaker forms of privacy, such as the one provided by encoding an attribute using encryption or random addition. Neither of these schemes provides true information-theoretic privacy as above, but may be sufficient in practice. In this paper, we will restrict ourselves to the stricter notion of privacy, noting the advantages of the weaker forms where appropriate.

    3.1 Obtaining Privacy via Decompositions

Let us consider how we might decompose data, using the methodologies outlined in Section 2, into two relations R1 and R2 so as to obey a given set of privacy constraints. We will restrict ourselves to the case where R1 and R2 are obtained by vertical fragmentation of R, fragmented by unique tuple IDs, with some of the attributes possibly being encoded. (We ignore semantic attribute decomposition, as well as the addition of noise tuples. The former may be assumed to have been applied beforehand, while the latter is not useful for obeying our privacy constraints.)

We abuse notation by allowing R to refer to the set of attributes in the relation. We may then denote a decomposition of R as D(R) = ⟨R1, R2, E⟩, where R1 and R2 are the sets of attributes in the two fragments, and E refers to the set of attributes that are encoded (using one of the schemes outlined in Section 2.1). Note that R1 ∪ R2 = R, E ⊆ R1, and E ⊆ R2, since encoded attributes are stored in both fragments. We denote the privacy constraints P as a set of subsets of R, i.e., P ⊆ 2^R.

From our definition of privacy constraints, we may state the following requirement for a privacy-preserving decomposition:

The decomposition D(R) is said to obey the privacy constraints P if, for every P ∈ P, P ⊄ (R1 − E) and P ⊄ (R2 − E).
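This condition is directly checkable in code. The sketch below uses a helper name of our own (`obeys`) and tests it against the constraints of the example in Section 3 and the decomposition that the example later in this section arrives at:

```python
def obeys(R1, R2, E, constraints):
    """Privacy condition: for every constraint P, P must not be
    contained in the clear (unencoded) attributes of either fragment,
    i.e. not P <= R1 - E and not P <= R2 - E."""
    clear1, clear2 = R1 - E, R2 - E
    return all(not (P <= clear1) and not (P <= clear2) for P in constraints)

# Attribute sets from the running example in the text.
R1 = {"ID", "Name", "Gender", "Zipcode", "Salary", "Email", "Telephone"}
R2 = {"ID", "Position", "DoB", "Salary", "Email", "Telephone"}
E = {"Salary", "Email", "Telephone"}
P = [
    {"Telephone"}, {"Email"},                          # item (1)
    {"Name", "Salary"}, {"Name", "Position"},          # item (2)
    {"Name", "DoB"},
    {"DoB", "Gender", "Zipcode"},                      # item (3)
    {"Position", "Salary"}, {"Salary", "DoB"},         # item (4)
]
assert obeys(R1, R2, E, P)
# Replicating DoB into R1 would expose {DoB, Gender, Zipcode} in the clear:
assert not obeys(R1 | {"DoB"}, R2, E, P)
```

Note that singleton constraints such as {Telephone} can only pass the check when the attribute is in E, matching the observation below that encoding is the only way to handle them.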

To understand how to obtain such a decomposition, we observe that each privacy constraint P may be obeyed in two ways:

1. Ensure that P is not contained in either R1 or R2, using vertical fragmentation. For example, the privacy constraint {Name, Salary} may be obeyed by placing Name in R1 and Salary in R2.

2. Encode at least one of the attributes in P. For example, a different way to obey the privacy constraint {Name, Salary} would be to use, say, a one-time pad to encode Salary across R1 and R2. Observe that such encoding is the only way to obey a privacy constraint on a singleton set.

Example Let us return to our earlier example and see how we may find a decomposition satisfying all the privacy constraints. We observe that both Email and Telephone are singleton privacy constraints; the only way to tackle them is to encode both these attributes. The constraints specified in items (2) and (3) may be tackled by vertical fragmentation of the attributes, e.g., R1(ID, Name, Gender, Zipcode) and R2(ID, Position, Salary, DoB), with Email and Telephone being stored in both R1 and R2.

Such a partitioning satisfies the privacy constraints outlined in item (2), since Name is in R1 while Salary, Position and DoB are in R2. It also satisfies the constraint in item (3), since DoB is separated from Gender and Zipcode. However, we are still stuck with the constraints in item (4), which dictate that Salary cannot be stored with either Position or DoB. We cannot fix the problem by moving Salary to R1, since that would violate the constraint of item (2) by placing Name and Salary together.


The solution is to resort to encoding Salary across both databases. Thus, the resulting decomposition is R1 = {ID, Name, Gender, Zipcode, Salary, Email, Telephone}, R2 = {ID, Position, DoB, Salary, Email, Telephone} and E = {Salary, Email, Telephone}. Such a decomposition obeys all the stated privacy constraints.

Identifying the Best Decomposition It is clear by now that it is always possible to obtain a decomposition of attributes that obeys all the privacy constraints: in the worst case, we could encode all the attributes to obey all possible privacy constraints. A key question that remains is: What is the best decomposition to use, where "best" refers to the decomposition that minimizes the cost of the workload being executed against the database?

An answer to the above question requires us to understand two issues. First, we need to know how an arbitrary query on the original relation R is transformed, optimized and executed using the two fragments R1 and R2. We will address this issue in the next section. We will then consider how to exploit this knowledge in formulating an optimization problem to find the best decomposition in Section 5.

4 Query Reformulation, Optimization and Execution

In this section, we discuss how a SQL query on relation R is reformulated and optimized by the client as sub-queries on R1 and R2, and how the results are combined to produce the answer to the original query. For the most part, it turns out that simple generalizations of standard database optimization techniques suffice to solve our problems. In Sections 4.1 and 4.2, we explain how to repurpose the well-understood distributed-database optimization techniques [OV99] for use in our context. We discuss the privacy implications of query execution in Section 4.3 and present some open issues in query optimization in Section 4.4.

    4.1 Query Reformulation

Query reformulation is straightforward and identical to that in distributed databases. Consider a typical conjunctive query that applies a conjunction of selection conditions C to R, groups the results using a set of attributes G, and applies an aggregation function to a set of attributes A. We may translate the logical query plan of this query into a query on R1 and R2 by the following simple expedient: replace R by R1 ⋈ R2 (with the understanding that the ⋈ operation also replaces encoded pairs of attribute values with the unencoded value). When the query involves a self-join on R, we simply replace each occurrence of R by the above-mentioned join.³

³ Note that since R is considered to be the universal relation, all joins are self-joins.

Figure 2 shows how a typical query involving selections and projections on R (part (a) of the figure) is reformulated as a join query on R1 and R2 (part (b) of the figure).

    4.2 Query Optimization

The trivial query plan for answering a query reformulated in the above fashion is as follows: fetch R1 from S1, fetch R2 from S2, and execute all plan operators locally at the client. Of course, such a plan is extremely expensive, as it requires reading and transmitting entire relations across the network.

Optimizing the Logical Query Plan The logical query plan is first improved, just as in traditional query optimization, by pushing down selection conditions, with minor modifications to account for attribute fragmentation. Consider a selection condition c:

If c is of the form Attr op value, and Attr has not undergone attribute fragmentation, condition c may be pushed down to R1 or R2, whichever contains Attr. (In case Attr is replicated on both relations, the condition may be pushed down to both.)

If c is of the form Attr1 op Attr2, and both Attr1 and Attr2 are unfragmented and present together in either R1 or R2, the condition may be pushed down to the appropriate relation.

Similarly, projections may also be pushed down to both relations, taking care not to project out tuple IDs necessary for the join. Group-by clauses and aggregates may also be pushed down, provided all attributes mentioned in the group-by and aggregate are unfragmented and present together in either R1 or R2. Self-joins of R with itself, translated into four-way joins involving two copies each of R1 and R2, may be rearranged to form a bushy join tree where the two R1 copies, and the two R2 copies, are joined first.

Figure 2(c) shows the pushing down of selections and projections to alter the logical query plan. We assume attributes are fragmented as in the example of Section 3: R1 contains Name, while DoB is in R2 and Salary is encoded across both relations. Thus, the condition on Name is pushed down to R1, while the condition on DoB is pushed down to R2. The selection on Salary cannot be pushed down, since Salary is encoded. In addition, we may push down projections as shown in the figure.

Choosing the Physical Query Plan Having optimized the logical query plan, the physical plan needs to be chosen, determining how the query execution is partitioned across the two servers and the client. The basic partitioning of the query plan is straightforward: all operators present above the top-most join have to


be executed on the client side; all operators underneath the join and above Ri are executed by a sub-query at server Si, for i = 1, 2 (shown by the dashed boxes in Figure 2).

[Figure 2: Example of Query Reformulation and Optimization. (a) The original logical plan on R: select Name LIKE 'Bob', DoB > 1970 and Salary > 80K, then project Name. (b) The reformulated plan, with R replaced by R1 ⋈ R2. (c) The optimized plan: the Name condition and the projection onto ID, Name, Salary are pushed down to R1 at Site 1; the DoB condition and the projection onto ID, Salary are pushed down to R2 at Site 2; the join, the Salary condition and the final projection of Name run at the client.]

In the ideal case, it may be possible to push all operators to one of R1 or R2, eliminating the need for a join. Otherwise, we are left with a choice in deciding how to perform the join.

The first option is to send sub-queries to both S1 and S2 in parallel, and join the results at the client. The second option is to send only one of the two sub-queries, say to server S1; the tuple IDs of the results obtained from S1 are then used to perform a semi-join with the content on server S2, in addition to applying the S2 sub-query to filter R2.

To illustrate with the example of Figure 2(c), consider the following two sub-queries:

Q1: SELECT Name, ID, Salary FROM R1 WHERE (Name LIKE 'Bob')
Q2: SELECT ID, Salary FROM R2 WHERE (DoB > 1970)

There are then three options available to the client for executing the query:

1. Send sub-query Q1 to S1, send Q2 to S2, join the results on ID at the client, and apply the selection on Salary.

2. Send sub-query Q1 to S1. Apply π_ID to the results of Q1; call the resulting set of tuple IDs T. Send S2 the query Q3: SELECT ID, Position, Salary FROM R2 WHERE (ID IN T) AND (DoB > 1970). Join the results of Q3 with the results of Q1.

3. Send sub-query Q2 to S2. Apply π_ID to the results of Q2, and rewrite Q1 in an analogous fashion to the previous case.

The first plan may be expensive, since it requires a lot of data to be transmitted from site S2. The second plan is potentially a lot more efficient, since only tuples that match the condition Name LIKE 'Bob' are ever transmitted from both S1 and S2. However, there may be a greater delay in obtaining query answers; the results from S1 need to be obtained before a query is sent to S2. In our example, the third plan is unlikely to be efficient unless the company consists only of old people.
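The second (semi-join) plan can be traced end to end on toy in-memory tables. The rows below are hypothetical data of ours; Salary is one-time-pad encoded, stored as a1 = a ⊕ r at S1 and a2 = r at S2, and list comprehensions stand in for the servers' query processors:

```python
s1_rows = [  # R1(ID, Name, Salary1)
    {"ID": 1, "Name": "Bob A", "Salary1": 85_000 ^ 555},
    {"ID": 2, "Name": "Carol", "Salary1": 90_000 ^ 777},
    {"ID": 3, "Name": "Bobby", "Salary1": 70_000 ^ 999},
]
s2_rows = [  # R2(ID, DoB, Salary2)
    {"ID": 1, "DoB": 1980, "Salary2": 555},
    {"ID": 2, "DoB": 1960, "Salary2": 777},
    {"ID": 3, "DoB": 1975, "Salary2": 999},
]

# Step 1: run Q1 at S1 ("Bob" substring approximates Name LIKE 'Bob').
q1 = [r for r in s1_rows if "Bob" in r["Name"]]
ids = {r["ID"] for r in q1}  # only tuple IDs ever cross to S2

# Step 2: run Q3 at S2, filtered by the shipped IDs and the DoB condition.
q3 = [r for r in s2_rows if r["ID"] in ids and r["DoB"] > 1970]

# Step 3: client joins on ID, decodes Salary (XOR), applies the
# residual condition Salary > 80K, and projects Name.
by_id = {r["ID"]: r for r in q3}
result = [
    r1["Name"]
    for r1 in q1
    if r1["ID"] in by_id
    and (r1["Salary1"] ^ by_id[r1["ID"]]["Salary2"]) > 80_000
]
```

Only the IDs {1, 3} are shipped to S2, and only the decoded tuple for "Bob A" survives the residual Salary condition at the client.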

We illustrate query optimization with two more examples.

Example 2 Consider the query:

SELECT SUM(Salary) FROM R

Say Salary is encoded using Random Addition instead of by a one-time pad. In this case, the client may simultaneously issue two sub-queries:

Q1: SELECT SUM(Salary) FROM R1
Q2: SELECT SUM(Salary) FROM R2

The client then computes the difference between the results of Q1 and Q2 as the answer.

Example 3 Consider the query:

SELECT Name FROM R WHERE DoB > 1970 AND Gender = 'M'

The client first issues the following sub-query to S2:

Q2: SELECT ID FROM R2 WHERE DoB > 1970

After it gets the list T of IDs from Q2, it sends S1 the query: SELECT Name FROM R1 WHERE Gender = 'M' AND ID IN T. Note that the alternative plan of sending sub-queries in parallel to both S1 and S2 may be more efficient if the condition DoB > 1970 is highly unselective.


    4.3 Query Execution and Data Privacy

One question that may arise from our discussion of query execution is: Is it possible for an adversary monitoring activity at either S1 or S2 to breach privacy by viewing the queries received at either of the two databases?

We claim that the answer is no. Observe that when using simple joins as the query plan, the sub-queries sent to S1 and S2 are dependent only on the original query and not on the data stored at either location; thus, the sub-queries do not form a covert channel by which information about the content at S1 could potentially be transmitted to S2, or vice versa.

However, when semi-joins are used, we observe that the query sent to S2 is influenced by the results of the sub-query sent to S1. Therefore, a legitimate concern might be that the sub-query sent to S2 leaks information about the content stored at S1. We avoid privacy breaches in this case by ensuring that only tuple IDs are carried over from the results of S1 into the query sent to S2. Since knowledge of some tuple IDs being present in S1 does not help an adversary at S2 in inferring anything about other attribute values, such a query execution plan continues to preserve privacy.

    4.4 Discussion

The above discussion does not by any means exhaust all query-optimization issues in the two-server context. We now list some areas for future work, with preliminary observations about potential solutions.

Maintaining Statistics for Optimization Our discussion of the space of query plans implicitly assumed that the client has sufficient database statistics at hand, and a sufficiently good cost model, for it to choose the best possible plan for the query. More work is required to validate both of the above assumptions.

For example, one question is to understand where the client obtains its statistics from. Statistics on the individual relations R1 and R2 could be obtained directly from the servers S1 and S2. The statistics may be cached on the client side in order to avoid having to fetch them from the servers for each optimization. There may potentially be statistics, e.g., multi-dimensional histograms, that require knowledge of both relations R1 and R2 in order to be maintained. If necessary, such statistics could conceivably be maintained on the client side, and may be constructed by means of appropriate SQL sub-queries sent to the two servers.

Supporting Other Decomposition Techniques Our discussion of query optimization so far has only covered the case of vertical fragmentation with attribute encoding using one-time pads or random addition. It is possible to optimize the query plans further when performing attribute encoding by deterministic encryption, or when using semantic attribute decomposition.

For example, when an attribute is encrypted by a deterministic encryption function, it is possible to push down selection conditions of the form attr = const, by obtaining the encryption key from one database and encrypting const with this key before pushing down the condition.

When an attribute is decomposed by semantic decomposition, the resulting functional dependencies across the decomposed attributes may potentially be used to push down additional selection conditions. To illustrate, consider a PhoneNumber (PN) attribute which is decomposed into AreaCode (AC) and LocalNumber (LN). A selection condition of the form PN = 5551234567 could still be pushed down partially as AC = 555 and LN = 1234567. (Note that the original condition cannot be eliminated, and still needs to be applied as a filter at the end.) Such rewriting depends on the nature of the semantic decomposition; the automatic application of such rewriting therefore requires support for specifying the relationship between attributes in a simple fashion.
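The phone-number example can be sketched as follows (the split position is an assumption for illustration; a real rewrite would consult the declared decomposition rule):

```python
def rewrite_phone_predicate(pn: str):
    """Rewrite PN = <const> as the pushable conjunction
    AC = <const[:3]> AND LN = <const[3:]>, assuming a 10-digit number
    splits after the first three digits (a hypothetical rule)."""
    return pn[:3], pn[3:]

rows = [  # toy fragment holding the decomposed attributes
    {"AC": "555", "LN": "1234567"},
    {"AC": "555", "LN": "7654321"},
    {"AC": "650", "LN": "1234567"},
]

ac, ln = rewrite_phone_predicate("5551234567")
# Pushed-down conjunction; the client still re-applies the original
# PN = const condition as a final filter.
matches = [r for r in rows if r["AC"] == ac and r["LN"] == ln]
```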

5 Identifying the Optimal Decomposition

Having seen how queries may be executed over a decomposed database, our next task at hand is to identify the best decomposition, i.e., the one that minimizes query costs. Say a workload W consisting of the actual queries to be executed on R is available. We may then consider the following brute-force approach:

For each possible decomposition of R that obeys the privacy constraints P:

Optimize each query in W for that decomposition of R, and

Estimate the total cost of executing all queries in W using the optimized query plans.

We may then select the decomposition that offers the lowest overall query cost. Observe that such an approach could be prohibitively expensive, since there may be an extremely large number of legitimate decompositions to consider, against each of which we need to evaluate the cost of executing all queries in the workload.
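The brute-force search can be sketched as follows. The cost function here is a stand-in for per-decomposition query optimization, and the privacy check treats a constraint as satisfied when no single fragment exposes all of its attributes unencoded; attribute names are hypothetical.

```python
from itertools import product

def decompositions(attrs):
    """Enumerate all ways to place each attribute: in fragment 1, in
    fragment 2, or encoded across both fragments (E)."""
    for assign in product((1, 2, "E"), repeat=len(attrs)):
        R1 = {a for a, s in zip(attrs, assign) if s in (1, "E")}
        R2 = {a for a, s in zip(attrs, assign) if s in (2, "E")}
        E = {a for a, s in zip(attrs, assign) if s == "E"}
        yield R1, R2, E

def obeys(privacy, R1, R2, E):
    # A constraint is violated only if one fragment holds all of its
    # attributes unencoded.
    return all(not P <= (R1 - E) and not P <= (R2 - E) for P in privacy)

def best_decomposition(attrs, privacy, workload_cost):
    return min((d for d in decompositions(attrs) if obeys(privacy, *d)),
               key=lambda d: workload_cost(*d))

attrs = ["name", "ssn", "zip"]
privacy = [{"name", "ssn"}]   # these two must not co-reside unencoded
best = best_decomposition(attrs, privacy, lambda R1, R2, E: 10 * len(E))
```

The 3^n enumeration makes the exponential blow-up explicit, motivating the structured cost model developed next.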

To work around this difficulty, we attempt to capture the effects of different decompositions on query costs in a more structured fashion, so that we may efficiently prune the space of all decompositions without actually having to evaluate each decomposition independently. A standard framework to capture the costs of different decompositions, for a given workload W, is the notion of the affinity matrix [OV99] M, which we adopt and generalize as follows:


1. The entry M_ij represents the cost of placing the unencoded attributes i and j in different fragments.

2. The entry M_ii represents the cost of encoding attribute i across both fragments.

We assume that the cost of a decomposition may be expressed simply by a linear combination of entries in the affinity matrix. Let R = {A1, A2, ..., An} represent the original set of n attributes, and consider a decomposition D(R) = ⟨R1, R2, E⟩. Then, we assume that the cost of this decomposition is

C(D) = Σ_{i ∈ (R1 − E), j ∈ (R2 − E)} M_ij + Σ_{i ∈ E} M_ii.

(For simplicity, we do not consider replicating any unencoded attribute, other than the tuple ID, at both sites.)

In other words, we add up all matrix entries corresponding to pairs of attributes that are separated by fragmentation, as well as diagonal entries corresponding to encoded attributes, and consider this sum to be the cost of the decomposition.
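A direct transcription of this cost function, applied to a hypothetical 3-attribute affinity matrix:

```python
def decomposition_cost(M, R1, R2, E):
    """C(D): off-diagonal entries for unencoded pairs split across the two
    fragments, plus diagonal entries for encoded attributes."""
    return (sum(M[i][j] for i in R1 - E for j in R2 - E)
            + sum(M[i][i] for i in E))

M = {  # toy symmetric affinity matrix; diagonal entries = encoding costs
    "a": {"a": 5, "b": 2, "c": 1},
    "b": {"a": 2, "b": 4, "c": 3},
    "c": {"a": 1, "b": 3, "c": 6},
}

split_cost = decomposition_cost(M, {"a"}, {"b", "c"}, set())        # M[a][b] + M[a][c]
encode_cost = decomposition_cost(M, {"a", "b"}, {"b", "c"}, {"b"})  # M[a][c] + M[b][b]
```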

Given this simple model of the cost of decompositions, we may now define an optimization problem to identify the best decomposition:

Given a set of privacy constraints P ⊆ 2^R and an affinity matrix M, find a decomposition D(R) = ⟨R1, R2, E⟩ such that

(a) D obeys all privacy constraints in P, and
(b) Σ_{i ∈ (R1 − E), j ∈ (R2 − E)} M_ij + Σ_{i ∈ E} M_ii is minimized.

We are left with two questions:

How is the affinity matrix M generated from knowledge of the query workload?

    How can we solve the optimization problem?

We address the first question in Appendix B, where we present heuristics for generating the affinity matrix. We discuss the second question next.

    5.1 Solving the Optimization Problem

We may define our optimization problem as the following hypergraph-coloring problem:

We are given a complete graph G(R), with both vertex and edge weights defined by the affinity matrix M. (Diagonal entries stand for vertex weights.) We are also given a set of privacy constraints P ⊆ 2^R, representing a hypergraph H(R, P) on the same vertices. We require a 2-coloring of the vertices in R such that (a) no hypergraph edge in H is monochromatic, and (b) the weight of bichromatic graph edges in G is minimized. The twist is that we are allowed to delete any vertex in R (and all hyperedges in P that contain the vertex) by paying a price equal to the vertex weight.

Observe that coloring a vertex is equivalent to placing it in one of the two partitions. Deleting the vertex is equivalent to encoding the attribute; so, all privacy constraints associated with that attribute are satisfied by the vertex deletion. Also observe that vertex deletion may be necessary, since it is not always possible to 2-color a hypergraph.
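This correspondence can be made concrete in a small feasibility-and-objective checker (a sketch; M is assumed symmetric, with diagonal entries as vertex weights, and all names are hypothetical):

```python
def coloring_objective(M, color, deleted, privacy):
    """color maps each surviving attribute to 0 or 1 (its fragment);
    deleted attributes are 'encoded'. Returns None if some privacy
    hyperedge survives monochromatically, else the objective: total
    weight of bichromatic edges plus weight of deleted vertices."""
    for P in privacy:
        live = P - deleted
        if live == P and len({color[v] for v in live}) == 1:
            return None  # monochromatic hyperedge: constraint violated
    verts = sorted(color)
    obj = sum(M[v][v] for v in deleted)
    obj += sum(M[u][v] for a, u in enumerate(verts) for v in verts[a + 1:]
               if color[u] != color[v])
    return obj

M = {"a": {"a": 5, "b": 2, "c": 1},
     "b": {"a": 2, "b": 4, "c": 3},
     "c": {"a": 1, "b": 3, "c": 6}}
privacy = [{"a", "b"}]
bad = coloring_objective(M, {"a": 0, "b": 0, "c": 1}, set(), privacy)  # violated
ok = coloring_objective(M, {"a": 0, "b": 1, "c": 1}, set(), privacy)   # cut a|bc
enc = coloring_objective(M, {"c": 1}, {"a", "b"}, privacy)             # delete a, b
```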

The above problem is very hard to solve even if we remove the feature of vertex deletion (allowing encoding), say by guaranteeing that the hypergraph is 2-colorable. In fact, much more restrictive special cases are NP-hard, even to approximate, as the following result shows:

It is NP-hard to color a 2-colorable, 4-uniform hypergraph using only c colors for any constant c [GHS00].

In other words, even if all privacy constraints were 4-tuples of attributes, and it is known that there exists a partitioning of attributes into two sets that satisfies all constraints, it is NP-hard to partition the attributes into any fixed number of sets, let alone two, to satisfy all the constraints!

Given the hardness of the hypergraph-coloring

problem, we consider three different heuristics to solve our optimization problem. All our heuristics utilize the following two solution components.

Approximate Min-Cuts If we were to ignore the privacy constraints for a moment, observe that the resulting problem is to two-color the vertices to minimize the weight of bichromatic edges in G(R); this is equivalent to finding the min-cut in G (assuming that at least one vertex needs to be of each of the two colors). This problem can be solved optimally in polynomial time, but we will be interested in a slightly more general version: we will require all cuts of the graph that have a weight within a specified constant factor of the min-cut.

Intuitively, we want to produce many cuts that are near-optimal in terms of their quality, and we will then choose among these cuts one that helps satisfy the most privacy constraints. This approximate min-cut problem can still be solved efficiently in polynomial time using an algorithm based on edge contraction [KS96]. (Note that this also implies that the number of cuts produced by the algorithm is only polynomially large.)

Approximate Weighted Set Cover
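The contraction idea can be sketched as follows (a single randomized run in the spirit of [KS96]; repeating runs with different random seeds yields a collection of near-minimum cuts to choose among; the example graph is hypothetical):

```python
import random

def contraction_cut(edges, n, rng):
    """Contract random edges until two super-vertices remain; the edges
    whose endpoints lie in different super-vertices form a cut."""
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    remaining = n
    while remaining > 2:
        u, v = rng.choice(edges)
        ru, rv = find(u), find(v)
        if ru != rv:          # skip self-loops of already-merged vertices
            parent[ru] = rv
            remaining -= 1
    return [(u, v) for u, v in edges if find(u) != find(v)]

# Two triangles joined by a single edge: the unique min-cut is {(2, 3)}.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
best = min((contraction_cut(edges, 6, random.Random(s)) for s in range(200)),
           key=len)
```

Each run returns some cut; with enough runs, the minimum and near-minimum cuts appear with high probability.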

Our second component uses a well-known tool in order to tackle the satisfaction of the privacy constraints via vertex deletion. Let us ignore the coloring problem and consider the following problem instead: find the minimum-weight choice of vertices to delete so that all hypergraph edges are removed, i.e., each set P ∈ P loses at least one vertex.

This is the minimum weighted set-cover problem, and the best known solution is to use the following greedy heuristic:
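Presumably this refers to the classical greedy heuristic for weighted set cover [Chv79, Joh73]: repeatedly delete the vertex with the smallest weight per newly covered constraint. A sketch (attribute names and weights below are hypothetical):

```python
def greedy_deletion(weights, privacy):
    """Greedy weighted set cover: pick the vertex minimizing
    weight / (number of still-uncovered constraints it hits), until every
    privacy constraint has lost at least one vertex."""
    uncovered = [set(P) for P in privacy]
    deleted = set()
    while uncovered:
        def ratio(v):
            hits = sum(1 for P in uncovered if v in P)
            return weights[v] / hits if hits else float("inf")
        v = min(weights, key=ratio)
        deleted.add(v)
        uncovered = [P for P in uncovered if v not in P]
    return deleted

weights = {"name": 1, "ssn": 5, "zip": 1}   # diagonal (encoding) costs
privacy = [{"name", "ssn"}, {"ssn", "zip"}]
to_encode = greedy_deletion(weights, privacy)
```

Here the cheap attributes "name" and "zip" are deleted (encoded) rather than the expensive "ssn", even though "ssn" alone would cover both constraints.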


For example, we could allow attributes to be replicated across partitions, trying to exploit such replication to lower query costs. In the terminology of the optimization problem, vertices are allowed to take on both colors. Edges in G emanating from a vertex v with two colors will not be considered bichromatic; however, hyperedges involving v will need to be bichromatic even when ignoring v.

Another extension is to deal with constraints imposed by functional dependencies, normal forms, and multiple relations. For example, we may want our decomposition to be dependency-preserving, which dictates that functional dependencies should not be destroyed by data partitioning. Different partitioning schemes may have different impacts on the cost of checking various constraints. Factoring these issues into the optimization problem is a subject for future work.

Finally, expanding the definition of the optimization problem to accommodate the space of different encoding schemes for each attribute is also an area as yet unexplored.

    6 Related Work

Secure Database Services As discussed in the introduction, the outsourcing of data management has motivated a model in which a DBMS provides reliable storage and efficient query execution, while not knowing the contents of the database [HIM02]. Schemes proposed so far for this model encrypt data on the client side and then store the encrypted database on the server side [HILM02, HIM04, AKSX04]. However, in order to achieve efficient query processing, all the above schemes provide only very weak notions of data privacy. In fact, a server that is secure under formal cryptographic notions can be proved to be hopelessly inefficient for data processing [KC04]. Our architecture of using multiple servers helps to achieve both efficiency and provable privacy together.

Trusted Computing With trusted computing [TCG03], a tamper-proof secure co-processor could be installed on the server side, which allows executing a function while hiding the function from the server. Using trusted tamper-proof hardware for enabling secure database services has been proposed in [KC04]. However, such a scheme could involve significant computational overhead due to repeated encryption and decryption at the tuple level. Understanding the role of tamper-proof hardware in our architecture remains a subject of future work.

Secure Multi-party Computation Secure multi-party computation [Yao86, GMW87] discusses how to compute the output of a function whose inputs are stored at different parties, such that each party learns only the function output and nothing about the inputs of the other parties. In our context, there are two parties, the server and the client, with the server's input being encrypted data, the client's input being the encryption key, and the function being the desired query. In principle, the client and the server could then engage in a one-sided secure computation protocol to compute the function output that is revealed only to the client. However, "in principle" is the operative phrase, as the excessive communication overhead involved makes this approach even more inefficient than the trivial scheme in which the client fetches the entire database from the server. More efficient specialized secure multi-party computation techniques have been studied recently [LP00, AMP04, FNP04]. However, all of this work is intended to enable different organizations to securely analyze their combined data, rather than the client-server model we are interested in.

Privacy-preserving Data Mining Different approaches for privacy-preserving data mining studied recently include: (1) perturbation techniques [AS00, AA01, EGS03, DN03, DN04]; (2) query restriction/auditing [CO82, DJL79, KPR00]; (3) k-anonymity [Swe02, MW04, AFK+04]. However, research here is motivated by the need to ensure individual privacy while at the same time allowing the inference of higher-granularity patterns from the data. Our problem is rather different in nature, and the above techniques are not directly relevant in our context.

Access Control Access control is used to control which parts of the data can be accessed by different users. Several models have been proposed for specifying and enforcing access control in databases [CFMS95]. Access control does not solve the problem of maintaining an untrusted storage server, as even an administrator or insider having complete control over the data at the server is not trusted by the client in our model.

    7 Conclusions

We have introduced a new distributed architecture for enabling privacy-preserving outsourced storage of data. We demonstrated different techniques that could be used to decompose data, and explained how queries may be optimized and executed in this distributed system. We introduced a definition of privacy based on hiding sets of attribute values, demonstrated how our decomposition techniques help in achieving privacy, and considered the problem of identifying the best privacy-preserving decomposition. Given the increasing instances of database outsourcing, as well as the increasing prominence of privacy concerns and regulations, we expect that our architecture will prove useful both in ensuring compliance with laws and in reducing the risk of privacy breaches.

A key element of future work is to test the viability of our architecture through a real-world case study. Other future work includes identifying improved algorithms for decomposition, expanding the scope of techniques available for decomposition, e.g., supporting replication, and incorporating these techniques into the query optimization framework.

References

[AA01] D. Agrawal and C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proc. PODS, 2001.

[AFK+04] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. Anonymizing tables. Technical report, Stanford University, 2004.

[AKSX04] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Order-preserving encryption for numeric data. In Proc. SIGMOD, 2004.

[AMP04] G. Aggarwal, N. Mishra, and B. Pinkas. Secure computation of the k-th ranked element. In Proc. EUROCRYPT, 2004.

[AS00] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. SIGMOD, 2000.

[CFMS95] S. Castano, M. Fugini, G. Martella, and P. Samarati. Database Security. Addison-Wesley, 1995.

[Chv79] V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233-235, 1979.

[CO82] F. Chin and G. Ozsoyoglu. Auditing and inference control in statistical databases. IEEE TSE, 8(6), 1982.

[CW02] J.P. Morgan signs outsourcing deal with IBM. ComputerWorld, Dec 30, 2002.

[DJL79] D. Dobkin, A. Jones, and R. Lipton. Secure databases: Protection against user influence. ACM TODS, 4(1), 1979.

[DN03] I. Dinur and K. Nissim. Revealing information while preserving privacy. In Proc. PODS, 2003.

[DN04] C. Dwork and K. Nissim. Privacy-preserving datamining on vertically partitioned databases. In Proc. CRYPTO, 2004.

[EGS03] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proc. PODS, 2003.

[FNP04] M. Freedman, K. Nissim, and B. Pinkas. Efficient private matching and set intersection. In Proc. EUROCRYPT, 2004.

[GHS00] V. Guruswami, J. Hastad, and M. Sudan. Hardness of approximate hypergraph coloring. In Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS), 2000.

[GMW87] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game: a completeness theorem for protocols with honest majority. In Proc. STOC, 1987.

[HILM02] H. Hacigumus, B. Iyer, C. Li, and S. Mehrotra. Executing SQL over encrypted data in the database-service-provider model. In Proc. SIGMOD, 2002.

[HIM02] H. Hacigumus, B. Iyer, and S. Mehrotra. Providing database as a service. In Proc. ICDE, 2002.

[HIM04] H. Hacigumus, B. Iyer, and S. Mehrotra. Efficient execution of aggregation queries over encrypted relational databases. In Proc. DASFAA, 2004.

[Joh73] D. S. Johnson. Approximation algorithms for combinatorial problems. In Proc. 5th Annual ACM Symposium on Theory of Computing (STOC), 1973.

[KC04] M. Kantarcioglu and C. Clifton. Security issues in querying encrypted data. Technical Report TR-04-013, Purdue University, 2004.

[KPR00] J. M. Kleinberg, C. H. Papadimitriou, and P. Raghavan. Auditing boolean attributes. In Proc. PODS, 2000.

[KS96] D. Karger and C. Stein. A new approach to the minimum cut problem. Journal of the ACM, 43(4):601-640, July 1996.

[LP00] Y. Lindell and B. Pinkas. Privacy-preserving data mining. In Proc. CRYPTO, 2000.

[MW04] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proc. PODS, 2004.

[OV99] M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall, 2nd edition, 1999.

[SAC+79] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proc. SIGMOD, pages 23-34, 1979.

[SB02] California senate bill SB 1386. http://info.sen.ca.gov/pub/01-02/bill/sen/sb_1351-1400/sb_1386_bill_20020926_chaptered.html, Sept. 2002.

[Swe02] L. Sweeney. k-Anonymity: A model for preserving privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 2002.

[TCG03] TCG TPM specification version 1.2. https://www.trustedcomputinggroup.org, Nov 2003.

[WP04] Advertiser charged in massive database theft. The Washington Post, July 22, 2004.

[Yao86] A. Yao. How to generate and exchange secrets. In Proc. FOCS, 1986.


    A Extract from California SB 1386

The California Senate Bill SB 1386, which went into effect on July 1, 2003, defines what constitutes personal information of individuals, and mandates various procedures to be followed by state agencies and businesses in California in case of a breach of data security in that organization. We present below its definition of personal information, observing how it is captured by our definition of privacy constraints (the italics are ours):

For purposes of this section, "personal information" means an individual's first name or first initial and last name in combination with any one or more of the following data elements, when either the name or the data elements are not encrypted:
(1) Social security number.
(2) Driver's license number or California Identification Card number.
(3) Account number, credit or debit card number, in combination with any required security code, access code, or password that would permit access to an individual's financial account.
For purposes of this section, "personal information" does not include publicly available information that is lawfully made available to the general public from federal, state, or local government records. [SB02]

    B Computing the Affinity Matrix

Let us revisit the definition of the affinity matrix to examine its semantics: the entry M_ij is required to represent the cost of placing attributes i and j in different partitions of a decomposition, while the entry M_ii represents the cost of encoding attribute i, with the overall cost of a decomposition being expressed as a linear combination of these entries.

Note that it is likely impossible to obtain a matrix that accurately captures the costs of all decompositions. The costs of partitioning different pairs of attributes are unlikely to be independent; placing attributes i and j in different partitions may have different effects on query costs, depending on how the other attributes are placed and encoded. Our objective is to come up with simple heuristics to obtain a matrix that does a reasonable job of capturing the relative costs of different decompositions.

Similar matrices are used to capture costs in other contexts too, e.g., the allocation problem in distributed databases [OV99]. Our problem is complicated somewhat by the fact that we need to account for the effects of attribute encoding on query costs, as well as the interactions between relation fragmentation and encoding.

A First Cut As a first cut, we may consider the following simple way to populate the affinity matrix from a given query workload, along the lines of earlier approaches [OV99]:

M_ij is set to be the number of queries that reference both attributes i and j.

M_ii is set to be the number of queries involving attribute i.

Of course, this simple heuristic ignores many issues: different queries in the workload may have different costs and should not be weighted equally; the effect of partitioning attributes i and j may depend on how i and j are used in the query, e.g., in a selection condition, in the projection list, etc.; and the cost of encoding an attribute may be very different from that of partitioning two attributes, so that counting both on the same scale may be a poor approximation.

In order to improve on this first cut, we dig deeper to understand the effects of fragmentation and encoding on query costs.

B.1 The Effects of Fragmentation

Let us consider a query that involves attributes i and j, and evaluate the effect on the query of a fragmentation that separates i and j. We may make the following observations:

If i and j are the only attributes referenced in the query, the fragmentation forces the query to touch both databases, and increases the communication cost for the query; the extra communication cost is proportional to the number of tuples satisfying the most selective conditions on one of the two attributes.

If attributes other than i and j are involved in the query, it is possible that the query may have to touch both databases even if i and j were held together, since the separation of other attributes may be the culprit. Therefore, the query cost that may be attributed to M_ij should be only a fraction of the query overhead caused by fragmentation.

If i or j is part of a GROUP BY clause, fragmentation makes it impossible to apply the GROUP BY, making the query overhead very high.

Using the above observations, we may devise a scheme to populate the matrix entries M_ij for i ≠ j. Each entry M_ij is computed as a sum of contributions from each query that references both i and j. The contribution of a query Q to M_ij, for any pair i and j referenced in Q, is a measure of the fraction of extra tuple fetches from disk, and transmissions across the network, that are induced by the partitioning of i and j. We define this contribution as follows. (Let s_i be the selectivity of Q ignoring all conditions involving i, and s_j be its selectivity ignoring all conditions involving j.)

If Q involves either i or j in a GROUP BY or a selection condition, the contribution of Q to M_ij is set to min(s_i, s_j).

If Q involves i and j only in the projection list, the contribution of Q to M_ij is set to min(s_i, s_j)/n, where n is the number of attributes referenced in Q's projection list.
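These rules can be transcribed as a per-query contribution function. The query representation below, with a "cond" set for selection/GROUP BY attributes and a "proj" set for the projection list, is a hypothetical encoding of the workload:

```python
def mij_contribution(query, i, j, sel):
    """Contribution of query Q to M_ij (i != j). sel[a] is the
    selectivity of Q ignoring all conditions on attribute a."""
    referenced = query["cond"] | query["proj"]
    if i not in referenced or j not in referenced:
        return 0.0
    if i in query["cond"] or j in query["cond"]:
        return min(sel[i], sel[j])
    return min(sel[i], sel[j]) / len(query["proj"])

q = {"cond": {"a"}, "proj": {"a", "b", "c"}}
sel = {"a": 0.1, "b": 0.5, "c": 0.2}
c_ab = mij_contribution(q, "a", "b", sel)  # selection attr involved: min rule
c_bc = mij_contribution(q, "b", "c", sel)  # projection only: min rule / n
```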

Note that the approach above requires the estimation of query selectivities; this may be performed using standard database techniques, i.e., using a combination of selectivity estimates from histograms, independence assumptions, and ad hoc guesses about the selectivity of predicates [SAC+79].

    B.2 The Effects of Encoding

Let us now consider the effects of encoding attributes on query costs. We make the following observations about the effects of encoding attribute i on a query Q. (We will assume that encoding is performed using one-time pads or random addition.)

If Q contains a selection condition involving i, the condition cannot be pushed down; the overhead due to this is proportional to the selectivity of the query ignoring the conditions on i.

If Q involves i only in the projection list, there may be additional overhead equal to the cost of fetching i from both sides.

If Q involves i in the GROUP BY clause, grouping cannot be pushed down, and may cause additional overhead.

If Q involves i only as an attribute to be aggregated, the use of random addition for encoding ensures that the overhead of encoding is low.

From these observations, we use the following rules to determine the contributions of Q to M_ii. (Again, we let s_i be the selectivity of the query ignoring predicates involving i.)

If i is in a selection condition or a GROUP BY clause, the contribution to M_ii is set to s_i.

Otherwise, if i is in the projection list, the contribution to M_ii is set to 1/n, where n is the total number of attributes referenced by Q.
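Analogously, the diagonal contributions can be sketched with a hypothetical query encoding that uses "cond" for selection/GROUP BY attributes, "proj" for the projection list, and "agg" for aggregate-only attributes:

```python
def mii_contribution(query, i, sel):
    """Contribution of query Q to M_ii, the cost of encoding attribute i.
    sel[a] is the selectivity of Q ignoring predicates on attribute a."""
    if i in query["cond"]:                # selection condition or GROUP BY
        return sel[i]
    n = len(query["cond"] | query["proj"] | query["agg"])
    if i in query["proj"]:                # projection list only
        return 1.0 / n
    return 0.0                            # aggregate-only: random addition is cheap

q = {"cond": {"a"}, "proj": {"b"}, "agg": {"c"}}
sel = {"a": 0.1, "b": 0.5, "c": 0.2}
contribs = [mii_contribution(q, x, sel) for x in ("a", "b", "c")]
```

Summing such contributions over the workload fills the diagonal of M, completing the matrix used by the optimization of Section 5.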