
Yang B, Pang XQ, Du JQ et al. Effective error-tolerant keyword search for secure cloud computing. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 29(1): 81–89 Jan. 2014. DOI 10.1007/s11390-013-1413-6

Effective Error-Tolerant Keyword Search for Secure Cloud Computing

Bo Yang1 (杨波), Xiao-Qiong Pang2 (庞晓琼), Jun-Qiang Du1 (杜军强), and Dan Xie1 (解丹)

1School of Computer Science, Shaanxi Normal University, Xi’an 710062, China

2College of Computer and Control Engineering, North University of China, Taiyuan 030051, China

E-mail: [email protected]; [email protected]; [email protected]; [email protected]

Received January 15, 2013; revised April 28, 2013.

Abstract The existing solutions to keyword search in the cloud can be divided into two categories: searching on exact keywords and searching on error-tolerant keywords. An error-tolerant keyword search scheme permits searches on encrypted data with only an approximation of some keyword. Such a scheme suits the case where users’ search input might not exactly match the pre-set keywords. In this paper, we first present a general framework for searching on error-tolerant keywords. Then we propose a concrete scheme, based on a fuzzy extractor, which is proved secure against an adaptive adversary under a well-defined security definition. The scheme is suitable for all similarity metrics including Hamming distance, edit distance, and set difference. It does not require the user to construct or store anything in advance, other than the key used to calculate the trapdoor of keywords and the key to encrypt data documents. Thus, our scheme greatly eases the users’ burden. Moreover, our scheme transforms the server’s search for error-tolerant keywords on ciphertexts into a search for exact keywords on plaintexts. The server can use any existing exact-keyword search approach to search plaintexts on an index table.

Keywords cloud computing, searchable encryption, error-tolerant, keyword search, fuzzy extractor

1 Introduction

Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction[1]. The cloud computing model offers the promise of massive cost savings combined with increased IT agility. However, security is cited as one of the major barriers to broader adoption[2].

When users store data in the cloud, lack of control over that data is their major worry. One approach to retaining control is to encrypt the cloud data. However, encryption limits the user’s operations on the data. In particular, searching and indexing the data become problematic. If data is stored in clear-text, one can efficiently search for a document by specifying a keyword. This is impossible with traditional, randomized encryption schemes. State-of-the-art cryptography may offer new tools to solve these problems. The research community has invented versatile encryption schemes that allow operations and computations on ciphertexts. For example, the fully homomorphic encryption scheme presented in [3] allows one to compute arbitrary functions over encrypted data without the decryption key. The scheme can be applied to private queries to a search engine: the user submits an encrypted query and the search engine computes a succinct encrypted answer without ever looking at the query in the clear. It also enables searching on encrypted data: a user stores encrypted files on a remote file server and later the server can retrieve only files which (when decrypted) satisfy some boolean constraint, even though the server cannot decrypt the files on its own. However, the search is performed on encrypted files instead of encrypted keywords. This makes the scheme highly inefficient when it is applied to cloud computing scenarios, because Gentry’s scheme[3] currently requires a huge amount of computation. Homomorphic encryption, especially fully homomorphic encryption, is a breakthrough in cryptographic computation because it dramatically improves the efficiency of secure computations, especially secure multi-party computations. However, it emphasizes “computation on ciphertexts” rather than “searching on ciphertexts”. One of the most popular ways to securely search over encrypted data is to selectively retrieve files through keyword-based search. Searchable encryption schemes (also referred to as predicate encryption schemes), e.g., [4-18], have been developed in recent years. These schemes allow the data owner to generate a search query according to keywords and encode the search query to form a trapdoor. The cloud server can use this trapdoor to decide which documents match the search query, without learning any additional information.

Regular Paper

This work is supported by the National Natural Science Foundation of China under Grant Nos. 61272436, 61003232 and 61272404, and the Natural Science Foundation of Guangdong Province of China under Grant No. 10351806001000000.

©2014 Springer Science+Business Media, LLC & Science Press, China

An error-tolerant keyword search scheme, also known as a fuzzy keyword search scheme[11] or a noisy keyword search scheme[18], permits searches on encrypted data with only an approximation of some keyword. It suits the situation where users’ search input might not exactly match the pre-set keywords due to possible typos, representation inconsistencies, and/or lack of exact knowledge about the data. In particular, in documents about citizens’ health records, financial records, or criminal dossiers, the keywords are pre-set as users’ biometrics. Biometric traits, which are captured by a sensor, are the deciding factors for recognizing a person’s identity from special features such as face, fingerprints, iris, voice, DNA, and so on. However, biometric data are noisy: even two readings of the same biometric source are rarely identical. Therefore, exact-match search over biometric data does not work. How to tolerate a (limited) number of errors in the searched keywords while keeping the search secure and correct is an important and interesting issue because of its broad applications in practice.

1.1 Related Work

A number of approaches have been proposed for searching on encrypted data. The existing solutions can be divided into two categories according to the models of searching on encrypted data with privacy.

Searching on Exact Keywords. In this setting, the user itself encrypts the database using a private-key encryption scheme or a public-key encryption scheme, and uploads the encrypted data to the server so that only somebody holding the private key can obtain the records it retrieves. The solution by a private-key encryption scheme can be realized with optimal security by using the oblivious RAM due to Goldreich and Ostrovsky[19]. Because of the heavy cost of the oblivious RAM protocol (see the analysis in [14]), it has always been quoted as a theoretical solution that is clearly infeasible in practice. Compared with the method of [19], Song et al.[16] presented a solution with one round of interaction. However, their construction is not secure against statistical analysis. In [8], Chang and Mitzenmacher proposed a scheme based on an index. In their scheme, the overhead required for a query is proportional to the number of files. In [9], Curtmola et al. presented two solutions, in which the server’s search time is optimal but updating the index is inefficient. The authors also considered adaptive security of their schemes. In the solution by a public-key encryption scheme, the party who searches over the data can be different from the party that generates it. In other words, anyone with access to a party’s public key can add encrypted data to the database, but only the party holding the decryption key can decrypt and retrieve the data. Boneh et al. proposed the first public-key encryption with keyword search scheme in [4]; however, the scheme reveals the user’s access pattern. They also presented a solution in [5] that guarantees the complete privacy of queries by sacrificing efficiency.

Searching on Error-Tolerant Keywords. Bringer et al.[7] described an error-tolerant searchable encryption scheme based on two utilities, namely, a locality-sensitive hashing function and a Bloom filter with storage. A locality-sensitive hashing function is used to reduce, with high probability, the difference between similar data, whereas different data remain significantly remote. A Bloom filter with storage is an extension of the Bloom filter. It is a data structure used for answering set membership queries and giving the index associated with the queried member. Since the Bloom filter may return incorrect results, it is necessary to increase the number of hash functions to reduce the probability of an error event, which makes the scheme inefficient. Li et al.[11] provided a solution for privacy-preserving fuzzy search on encrypted data. In their scheme, the edit distance is used to quantify keyword similarity. A user must construct and store a wildcard-based fuzzy keyword set for each keyword before outsourcing data, which incurs a large extra storage cost. In addition, wildcards are not well-adapted to most applications, as a wildcard catches an error only if we know where the error is located. This constraint is too strict.

In [18], we developed a noisy-keyword-based searchable private-key encryption scheme that works in a fault-tolerant manner, where Hamming distance is used to quantify keyword similarity. That scheme utilizes the fuzzy extractor, which can extract a uniformly random string from each noisy keyword in a noise-tolerant way. The extracted string is used as the key to encrypt the documents which are associated with this keyword. A drawback of the scheme is that the user has to spend an extra round-trip time to search the keywords.

1.2 Our Contributions

In this paper, we propose a novel scheme for error-tolerant keyword search based on a fuzzy extractor. The main contributions include the following:

1) The scheme is suitable for all similarity metrics: Hamming distance, edit distance, and set difference.

2) The scheme does not require the user to construct or store anything in advance, other than the key used to calculate the trapdoor of keywords and the key to encrypt data documents. Thus, our scheme greatly eases the users’ burden.

3) The index table consists of the random strings extracted from a fuzzy extractor. So, by the fuzzy extractor, we transform the server’s search for error-tolerant keywords on ciphertexts into a search for exact keywords on plaintexts. The server can use any existing exact-keyword search approach to search the index table.

2 General Framework and Security Definitions

2.1 General Framework

We begin by defining a general framework for searching on encrypted data by keywords in a fault-tolerant manner. A user owns a private document set D = {D1, . . . , DN}, and each document is associated with corresponding keywords. We consider an honest-but-curious server in the sense that it correctly follows the protocol specification while it attempts to derive as much information as possible from the user’s queries and access patterns.

In the following definitions, we use dis(wi, wj) to denote the distance between wi and wj, η to denote the predetermined threshold value of the distance, δ(D) to denote the set of all keywords in D, D(w) to denote the set of identifiers of the documents containing the keyword w, and id(Di,j) to denote the j-th identifier in D(wi). Let SKE1 and SKE2 be PCPA-secure (pseudo-randomness against chosen-plaintext attacks) symmetric encryption schemes, which are used to encrypt keywords and data files respectively. PCPA security is a stronger notion than CPA (chosen-plaintext attacks) security: it guarantees that ciphertexts are computationally indistinguishable from random values.

Definition 1. An error-tolerant keyword search scheme ETKS = (KeyGen, Document-Storage, Search, Decrypt) consists of the following four phases:

1) KeyGen(1^k): taking a security parameter k as input, the algorithm outputs a secret key K = (K1, K2, K3) for a user.

2) Document-Storage(K, D): given a document set D = {D1, . . . , DN} and a secret key K as inputs, the user outputs a secure index I, and stores a series of ciphertexts c = (c1, . . . , cN) to the server as follows:

• Init(D): scans D and generates the set δ(D); for all wi ∈ δ(D), outputs D(wi);

• BuildIndex((K1, K2), δ(D), {D(wi) | wi ∈ δ(D)}): given the secret keys K1 and K2, δ(D) and {D(wi) | wi ∈ δ(D)} as inputs, the user calculates the trapdoor for each wi ∈ δ(D) as ti = f(K1, wi), where f is a one-way function, and uses K2 to construct a secure index table I from which the wanted documents can be retrieved efficiently afterwards;

• Data-Storage(D, K3): the user encrypts the document set D = {D1, . . . , DN} using the key K3 to obtain the ciphertexts c, and outsources c and the index table I to the cloud server for storage.

3) Search(I, (K1, K2, w)):

• for any query keyword w, the user computes the trapdoor t′ = f(K1, w), and sends it as a query request to the server;

• the server searches the index table I, and returns the result I_{wi′} (for some wi′ ∈ δ(D)) to the user;

• the user decrypts I_{wi′} and obtains the file identifiers id(Di′,j) (1 ≤ j ≤ |D(wi′)|).

4) Decrypt(K3, id(Di′,j) (1 ≤ j ≤ |D(wi′)|)):

• the user sends id(Di′,j) (1 ≤ j ≤ |D(wi′)|) to the server, and the server returns the associated ci′,j to the user;

• the user decrypts ci′,j and obtains the files Di′,j (1 ≤ j ≤ |D(wi′)|) of its interest.

Definition 2. For all k ∈ N, let K, I, c be the outputs of KeyGen(1^k), BuildIndex, Document-Storage, respectively. For any w, if there exists a wi ∈ δ(D) satisfying dis(w, wi) ≤ η such that Search(I, (K1, K2, w)) and Decrypt(K3, id(D)) return the set of files correctly, that is,

$$\mathrm{id}(D_{i,j})\ (1 \le j \le |D(w_i)|) \leftarrow \mathrm{Search}(I, (K_1, K_2, w)) \;\wedge\; \mathrm{Decrypt}(K_3, \mathrm{id}(D_{i,j})) = D_{i,j}\ (1 \le j \le |D(w_i)|),$$

then we say the ETKS scheme is correct.

2.2 Security Definitions

We follow the security definitions presented in [9]. The security requirement for searchable encryption is typically characterized as one that nothing should be leaked except the result of a search, which is referred to as an access pattern. However, except for oblivious RAMs, there exists no practical construction that satisfies this requirement. All existing exact-match schemes also disclose whether queries are for the same keyword or not, which is referred to as a search pattern. Curtmola et al.[9] analyzed two existing security definitions that had been used for searching on private-key encrypted data: IND2-CKA (Indistinguishability Against Chosen-Keyword Attacks) in [10] and a simulation-based definition in [8]. They pointed out that IND2-CKA was not strong enough to ensure that an index could be safely employed to construct a searchable private-key encryption scheme. For the simulation-based definition[8], they pointed out that even an insecure scheme would satisfy this definition and that the definition was inherently non-adaptive. Curtmola et al.[9] proposed more accurate security definitions for the searchable private-key encryption scheme under non-adaptive and adaptive adversarial models①, respectively.

Next we introduce some auxiliary notions that we will use in the security definitions, which are based on [9].

The first notion is a history, which is an interaction instantiation between the user and the server. Formally, we give the following definition.

Definition 3 (History). A q-query history over D = {D1, . . . , DN} is a tuple Hq = (D, w), where D is a set of N documents, and w = (w1, . . . , wq) is a vector of q keywords searched by the user.

Through an interaction instantiation, the server learns the identifiers of the documents which contain the queried keywords, and whether searches were for the same word or not. These two notions are referred to as the access pattern and the search pattern, respectively. Formally,

Definition 4 (Access Pattern). The access pattern induced by a q-query history Hq = (D, w) is a tuple α(Hq) = (D(w1), . . . , D(wq)).

Definition 5 (Search Pattern). The search pattern induced by a q-query history Hq = (D, w) is a symmetric binary matrix σ(Hq) such that for 1 ≤ i, j ≤ q, the element in the i-th row and j-th column is 1 if wi = wj, and 0 otherwise.

The fourth notion is the trace of a history, which is the precise information leaked about the history Hq, including the access pattern and the search pattern induced by the history Hq = (D, w), and the sizes of the encrypted documents in D.

Definition 6 (Trace). The trace induced by a q-query history Hq = (D, w) is

$$\tau(H_q) = (\alpha(H_q), \sigma(H_q), |D_1|, \ldots, |D_N|).$$
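The three leakage notions above can be made concrete with a small sketch. The document identifiers, keywords, and sizes below are made-up illustration data, and the function names are hypothetical; the code only mirrors Definitions 4 to 6 and is not part of the scheme itself.

```python
from typing import Dict, List, Set, Tuple

# Toy document set: identifier -> (keywords, size). All values are illustrative.
Documents = Dict[str, Tuple[Set[str], int]]

def access_pattern(docs: Documents, queries: List[str]) -> Tuple[frozenset, ...]:
    """alpha(H_q): for each queried keyword, the identifiers of matching documents."""
    return tuple(
        frozenset(doc_id for doc_id, (kws, _) in docs.items() if w in kws)
        for w in queries
    )

def search_pattern(queries: List[str]) -> List[List[int]]:
    """sigma(H_q): q x q symmetric 0/1 matrix with entry 1 iff w_i == w_j."""
    return [[1 if wi == wj else 0 for wj in queries] for wi in queries]

def trace(docs: Documents, queries: List[str]):
    """tau(H_q): access pattern, search pattern, and the document sizes."""
    sizes = tuple(size for _, size in docs.values())
    return access_pattern(docs, queries), search_pattern(queries), sizes

docs = {"D1": ({"cloud", "search"}, 120), "D2": ({"cloud"}, 80)}
queries = ["cloud", "search", "cloud"]
alpha, sigma, sizes = trace(docs, queries)
print(alpha, sigma, sizes)
```

The repeated query for "cloud" shows up as off-diagonal ones in the search pattern, which is exactly the information the security definition allows the server to learn.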

The final notion is the view of the server, denoted as Vq(Hq), which is what the cloud server actually sees during the interaction of a given history Hq under some secret key K. This includes the index I of D, the ciphertexts c = (c1, . . . , cN) and the trapdoors t = (t1, . . . , tq) of the queried keywords, that is:

Definition 7 (View). The view induced by a q-query history Hq = (D, w) is Vq(Hq) = (I, c, t) = (I, (c1, . . . , cN), (t1, . . . , tq)).

We now present the adaptive simulation-based security definition, which requires that the view of an adversary (including the index, the trapdoors and the ciphertexts) generated from an adversarially and adaptively chosen history be simulatable given only the trace.

Definition 8 (Adaptive Semantic Security). Let ETKS = (KeyGen, Document-Storage, Search, Decrypt) be an error-tolerant keyword search scheme, k ∈ N be the security parameter, A = (A0, . . . , Aq) be an adversary, S = (S0, . . . , Sq) be a simulator (q ∈ N), and consider the probabilistic experiments Real_{ETKS,A}(k) and Sim_{ETKS,A,S}(k) described in Table 1.

We say that ETKS is adaptively semantically secure if for all polynomial-size adversaries A = (A0, . . . , Aq) (q = poly(k)), there exists a non-uniform polynomial-size simulator S = (S0, . . . , Sq), such that for all polynomial-size distinguishers $\mathcal{D}$:

$$\bigl|\Pr[\mathcal{D}(v, st_A) = 1 : (v, st_A) \leftarrow \mathbf{Real}_{ETKS,A}(k)] - \Pr[\mathcal{D}(v, st_S) = 1 : (v, st_S) \leftarrow \mathbf{Sim}_{ETKS,A,S}(k)]\bigr| \le \mathrm{negl}(k),$$

where stA and stS are two strings which capture A’s state and S’s state, respectively, and the probabilities are taken over the random coins of KeyGen and Document-Storage.

Table 1. Two Probabilistic Experiments Real_{ETKS,A}(k) and Sim_{ETKS,A,S}(k)

Real_{ETKS,A}(k)                                 | Sim_{ETKS,A,S}(k)
-------------------------------------------------+-------------------------------------------------
K ← KeyGen(1^k)                                  | (D, stA) ← A0(1^k)
(D, stA) ← A0(1^k)                               | (I, c, stS) ← S0(τ(D))
(I, c) ← Document-Storage(K, D)                  | (w1, stA) ← A1(stA, I, c)
(w1, stA) ← A1(stA, I, c)                        | (t1, stS) ← S1(stS, τ(D, w1))
Generate trapdoor t1 from K and w1               | For 2 ≤ i ≤ q
For 2 ≤ i ≤ q                                    |   (wi, stA) ← Ai(stA, I, c, t1, . . . , ti−1)
  (wi, stA) ← Ai(stA, I, c, t1, . . . , ti−1)    |   (ti, stS) ← Si(stS, τ(D, w1, . . . , wi))
  Generate trapdoor ti from K and wi             | Let t = (t1, . . . , tq)
Let t = (t1, . . . , tq)                         | Output v = (I, c, t) and stA
Output v = (I, c, t) and stA                     |

①Non-adaptive adversaries make queries without considering previous trapdoors and search outcomes, while adaptive ones make queries according to previous trapdoors and search results.


3 Proposal

3.1 Tools

The basic tools used are the secure sketch and the fuzzy extractor. The notions were introduced by Dodis et al.[20], who gave constructions for three different similarity metrics, namely, Hamming distance, set difference, and edit distance. Under their framework, reliable and almost uniformly distributed data can be extracted from noisy data by reconstructing the original data with a given sketch, and then applying a normal “strong extractor” (such as pairwise-independent hash functions) to the original data.

The following definitions are due to Dodis et al.[20]

Let X be a random variable with alphabet $\mathcal{X}$ and distribution PX. The min-entropy of X is

$$H_\infty(X) = -\log\bigl(\max_x P_X(x)\bigr),$$

and the conditional min-entropy of X given Y is

$$\tilde{H}_\infty(X \mid Y) = -\log\Bigl(\mathbb{E}_{y \leftarrow Y}\bigl[2^{-H_\infty(X \mid Y = y)}\bigr]\Bigr)$$

(all the logarithms in this paper are to the base 2). The statistical distance between two probability distributions PX, PY with the same alphabet $\mathcal{X}$ is defined as

$$SD(X, Y) = \frac{1}{2} \sum_{x \in \mathcal{X}} |P_X(x) - P_Y(x)|.$$
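The three quantities just defined can be computed directly for small finite distributions. The sketch below is only an illustration: the function names and the example distributions are assumptions chosen for this note, and distributions are represented as plain dictionaries mapping outcomes to probabilities.

```python
import math
from typing import Dict, Hashable, Tuple

def min_entropy(p: Dict[Hashable, float]) -> float:
    """H_inf(X) = -log2(max_x P_X(x))."""
    return -math.log2(max(p.values()))

def avg_min_entropy(joint: Dict[Tuple[Hashable, Hashable], float]) -> float:
    """Conditional min-entropy of X given Y from a joint distribution P(x, y).

    E_y[2^(-H_inf(X|Y=y))] equals sum_y max_x P(x, y), so only the best
    guess for X under each value y is needed.
    """
    best: Dict[Hashable, float] = {}
    for (x, y), pr in joint.items():
        best[y] = max(best.get(y, 0.0), pr)
    return -math.log2(sum(best.values()))

def statistical_distance(p: Dict[Hashable, float], q: Dict[Hashable, float]) -> float:
    """SD(X, Y) = 1/2 * sum_x |P_X(x) - P_Y(x)| over the union of supports."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# X uniform on {0, 1, 2, 3}; Y reveals the parity of X.
uniform4 = {x: 0.25 for x in range(4)}
joint = {(x, x % 2): 0.25 for x in range(4)}
print(min_entropy(uniform4), avg_min_entropy(joint))
print(statistical_distance({0: 0.5, 1: 0.5}, {0: 0.75, 1: 0.25}))
```

In the parity example, learning Y costs one bit of min-entropy, which is the kind of loss the security condition of a secure sketch bounds.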

A metric space is a set M with a distance function dis : M × M → R+ = [0, ∞), satisfying dis(x, y) = 0 if and only if x = y, symmetry dis(x, y) = dis(y, x), and the triangle inequality dis(x, z) ≤ dis(x, y) + dis(y, z). The three metrics considered are described as follows.

1) Hamming Metric. Here M = F^n for some alphabet F, and dis(w, w′) is the number of positions in which the strings w and w′ differ.

2) Set Difference Metric. Here M consists of all subsets of a universe U. For two sets w and w′, their symmetric difference is w∆w′ = {x ∈ w ∪ w′ | x ∉ w ∩ w′}. The distance between the two sets is |w∆w′|.

3) Edit Metric. Here M = F^*, and the distance between w and w′ is defined to be the smallest number of character insertions and deletions needed to transform w into w′.
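As a minimal sketch of the three metrics, the functions below compute each distance for short inputs. The function names are hypothetical. The edit metric follows the definition above (insertions and deletions only, no substitutions), which equals |w| + |w′| minus twice the length of a longest common subsequence.

```python
from typing import Sequence, Set

def hamming(w: Sequence, w2: Sequence) -> int:
    """Number of positions in which two equal-length strings differ."""
    if len(w) != len(w2):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(a != b for a, b in zip(w, w2))

def set_difference(w: Set, w2: Set) -> int:
    """Size of the symmetric difference of two sets."""
    return len(w ^ w2)

def edit_distance_indel(w: Sequence, w2: Sequence) -> int:
    """Smallest number of insertions and deletions turning w into w2."""
    prev = [0] * (len(w2) + 1)          # LCS lengths for the previous row
    for a in w:
        cur = [0]
        for j, b in enumerate(w2, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return len(w) + len(w2) - 2 * prev[-1]

print(hamming("10110", "10011"))
print(set_difference({1, 2, 3}, {2, 3, 4}))
print(edit_distance_indel("cloud", "clod"))
```

Note that under this edit metric a single substituted character counts as distance 2 (one deletion plus one insertion).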

Definition 9[20]. A function Ext : {0, 1}^n × {0, 1}^r → {0, 1}^l is called an (n, m, l, ε)-strong extractor if for all distributions W over {0, 1}^n with H∞(W) ≥ m, SD((Ext(W; X), X), (Ul, X)) ≤ ε, where Ul is uniform over {0, 1}^l and X is uniform over {0, 1}^r.

Strong extractors can extract at most l = m − 2 log(1/ε) + O(1) nearly random bits, and pairwise-independent hash functions already give the optimal l = m − 2 log(1/ε) + 2.

Definition 10[20]. An (M, m, m̃, η)-secure sketch is a pair of randomized procedures, “sketch” (SS) and “recover” (Rec), with the following properties:

1) The sketching procedure SS on input w ∈ M returns a bit string s ∈ {0, 1}^*. The recovery procedure Rec takes an element w′ ∈ M and a bit string s ∈ {0, 1}^* as inputs.

2) Correctness: if dis(w, w′) ≤ η, then Rec(w′, SS(w)) = w.

3) Security: for any distribution W over M, if H∞(W) ≥ m, then H̃∞(W | SS(W)) ≥ m̃.

Here is a typical sketch construction due to Dodis et al.[20] On input w, select a random codeword c (this is equivalent to choosing a random x ∈ F^k and computing C(x)), and set SS(w) to be the shift needed to get from c to w: SS(w) = w − c. Then Rec(w′, s) is computed by subtracting the shift s from w′ to get c′ = w′ − s, decoding c′ to get c, and computing w = c + s.

In the case of F = {0, 1}, addition and subtraction are the same, so SS(w) = w ⊕ C(x). In this case, to recover w given w′ and s = SS(w), compute c′ = w′ ⊕ s, decode c′ to get c, and compute w = c ⊕ s.
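The binary code-offset construction can be sketched in a few lines. As an assumption made purely to keep the example short, the error-correcting code here is a 5-fold repetition code (not the BCH code used later in the experiments), and all function names are hypothetical. With this toy code, recovery is guaranteed whenever at most 2 bits of w′ differ from w.

```python
import secrets
from typing import List

REP = 5  # toy repetition factor: minimum distance 5, so 2 bit flips are always corrected

def encode(x: List[int]) -> List[int]:
    """C(x): repeat every message bit REP times."""
    return [b for b in x for _ in range(REP)]

def decode(c: List[int]) -> List[int]:
    """Map a noisy word to a codeword by majority vote in each block of REP bits."""
    bits = [int(sum(c[i:i + REP]) > REP // 2) for i in range(0, len(c), REP)]
    return encode(bits)

def xor(a: List[int], b: List[int]) -> List[int]:
    return [u ^ v for u, v in zip(a, b)]

def sketch(w: List[int]) -> List[int]:
    """SS(w) = w XOR C(x) for a freshly sampled random message x."""
    if len(w) % REP:
        raise ValueError("length of w must be a multiple of REP")
    x = [secrets.randbelow(2) for _ in range(len(w) // REP)]
    return xor(w, encode(x))

def recover(w_prime: List[int], s: List[int]) -> List[int]:
    """Rec(w', s): c' = w' XOR s, decode c' to c, return c XOR s."""
    return xor(decode(xor(w_prime, s)), s)

w = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
s = sketch(w)
noisy = list(w)
noisy[1] ^= 1
noisy[3] ^= 1           # two bit errors
print(recover(noisy, s) == w)
```

The sketch s is published as helper data; the security condition of Definition 10 bounds how much min-entropy of w it gives away.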

Definition 11[20]. An (M, m, l, η, ε)-fuzzy extractor is a pair of randomized procedures, “generate” (Gen) and “reproduce” (Rep), with the following properties:

1) The generation procedure Gen on input w ∈ M outputs an extracted string R ∈ {0, 1}^l and a helper string P ∈ {0, 1}^*. The reproduction procedure Rep takes an element w′ ∈ M and a bit string P ∈ {0, 1}^* as inputs.

2) Correctness: if dis(w, w′) ≤ η and (R, P) ← Gen(w), then Rep(w′, P) = R.

3) Security: for any distribution W over M, if H∞(W) ≥ m and (R, P) ← Gen(W), then SD((R, P), (Ul, P)) ≤ ε.

Lemma 1[20] (Fuzzy Extractor from Sketch). Assume (SS, Rec) is an (M, m, m̃, η)-secure sketch, and Ext is an (n, m̃, l, ε)-strong extractor given by pairwise-independent hashing (in particular, l = m̃ − 2 log(1/ε) + 2). Then the following (Gen, Rep) is an (M, m, l, η, ε)-fuzzy extractor:

• Gen(w; r, x): set P = (SS(w; r), x), R = Ext(w; x), and output (R, P).

• Rep(w′, (s, x)): recover w = Rec(w′, s) and output R = Ext(w; x).

The construction is shown in Fig.1.

Fig.1. Fuzzy extractor from sketch.
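The sketch-then-extract construction of Lemma 1 can be illustrated as follows. Two loud assumptions: the secure sketch is the toy code-offset sketch over a 5-fold repetition code, and the extractor is SHA-256 applied to the seed and the input, truncated to 16 bytes. The latter merely stands in for pairwise-independent hashing and is not a proven strong extractor. All function names are hypothetical.

```python
import hashlib
import secrets
from typing import List, Tuple

REP = 5  # toy repetition code; tolerates 2 bit errors (assumption for illustration)

def _encode(x: List[int]) -> List[int]:
    return [b for b in x for _ in range(REP)]

def _decode(c: List[int]) -> List[int]:
    return _encode([int(sum(c[i:i + REP]) > REP // 2) for i in range(0, len(c), REP)])

def _xor(a: List[int], b: List[int]) -> List[int]:
    return [u ^ v for u, v in zip(a, b)]

def ss(w: List[int]) -> List[int]:
    """Code-offset sketch: w XOR a random codeword."""
    return _xor(w, _encode([secrets.randbelow(2) for _ in range(len(w) // REP)]))

def rec(w_prime: List[int], s: List[int]) -> List[int]:
    return _xor(_decode(_xor(w_prime, s)), s)

def ext(w: List[int], x: bytes) -> bytes:
    """Stand-in extractor: seeded SHA-256, truncated (NOT pairwise-independent hashing)."""
    return hashlib.sha256(x + bytes(w)).digest()[:16]

Helper = Tuple[List[int], bytes]

def gen(w: List[int]) -> Tuple[bytes, Helper]:
    """Gen(w): P = (SS(w), x), R = Ext(w; x)."""
    x = secrets.token_bytes(16)      # public extractor seed
    return ext(w, x), (ss(w), x)

def rep(w_prime: List[int], helper: Helper) -> bytes:
    """Rep(w', (s, x)): recover w with Rec, then re-extract with the same seed."""
    s, x = helper
    return ext(rec(w_prime, s), x)

w = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
R, P = gen(w)
close = list(w)
close[0] ^= 1
close[12] ^= 1
print(rep(close, P) == R)
```

Because Rep first recovers w exactly, the reproduced string is bit-for-bit identical to R whenever the noisy input is within the tolerance of the underlying sketch.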

3.2 Our Scheme

We assume that the keywords in D can be represented by at most d bits. The basic idea is that, for each keyword wi ∈ δ(D), the user computes the trapdoor by exclusive-ORing a key as ti = wi ⊕ K1, then computes the index by applying the fuzzy extractor to ti as Gen(ti) = (Ri, Pi). In the search stage, if a query w is similar to wi, then t′ = w ⊕ K1 is similar to ti, and therefore the string reproduced from t′ and the helper data Pi equals the string Ri extracted from ti, i.e., Rep(t′, Pi) = Ri.
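For the Hamming metric, the reason the masked trapdoors stay as close as the keywords is that the key cancels: t′ ⊕ ti = w ⊕ wi. The sketch below checks this on a toy example. The bit length `D_BITS`, the byte encoding of keywords, and the function names are assumptions for illustration only.

```python
import secrets

D_BITS = 64  # hypothetical maximum keyword length d, in bits

def pad_keyword(w: bytes, d_bits: int = D_BITS) -> int:
    """Read the keyword as a d-bit integer; shorter keywords are implicitly
    zero-padded in the most significant bits."""
    if len(w) * 8 > d_bits:
        raise ValueError("keyword does not fit in d bits")
    return int.from_bytes(w, "big")

def trapdoor(w: bytes, k1: int) -> int:
    """t = w XOR K1."""
    return pad_keyword(w) ^ k1

def hamming(a: int, b: int) -> int:
    """Hamming distance between two d-bit values."""
    return bin(a ^ b).count("1")

k1 = secrets.randbits(D_BITS)
w, w_typo = b"cloud", b"clouf"
print(hamming(pad_keyword(w), pad_keyword(w_typo)))
print(hamming(trapdoor(w, k1), trapdoor(w_typo, k1)))
```

Both printed distances agree for any choice of K1, since XOR with a fixed mask flips the same bit positions in both values.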

The scheme is as follows:

1) KeyGen(1^k): the user samples K1 ←R {0, 1}^d, generates K2 ← SKE1.Gen(1^k) and K3 ← SKE2.Gen(1^k), and outputs K = (K1, K2, K3).

2) Document-Storage(K, D): the user performs the following steps:

• Init(D): scan D and generate the set δ(D); for all wi ∈ δ(D), output D(wi);

• BuildIndex((K1, K2), δ(D), {D(wi) | wi ∈ δ(D)}): for each wi ∈ δ(D), compute ti = wi ⊕ K1 as the trapdoor of wi (if the length of wi is smaller than d, d − |wi| “0”s are padded into the most significant bits of wi), compute Gen(ti) = (Ri, Pi) = (Ri, (si, xi)), and build the index of wi as

$$I_{w_i} = \bigl\{(R_i, P_i),\ \mathrm{SKE}_1.\mathrm{Enc}_{K_2}\bigl(\mathrm{id}(D_{i,1}) \,\|\, \cdots \,\|\, \mathrm{id}(D_{i,|D(w_i)|})\bigr)\bigr\}.$$

The index table for all wi ∈ δ(D) is

$$I = \Bigl\{\bigl\{(R_i, P_i),\ \mathrm{SKE}_1.\mathrm{Enc}_{K_2}\bigl(\mathrm{id}(D_{i,1}) \,\|\, \cdots \,\|\, \mathrm{id}(D_{i,|D(w_i)|})\bigr)\bigr\}\Bigr\}_{i=1}^{|\delta(D)|};$$

• Data-Storage(D, K3): for each Di,j ∈ D(wi) (1 ≤ i ≤ |δ(D)|, 1 ≤ j ≤ |D(wi)|), compute and output the ciphertext ci,j = SKE2.Enc_{K3}(Di,j) under the private key K3.

3) Search(I, (K1, K2, w)):

• for any query w (if the length of w is smaller than d, the user pads w with d − |w| “0”s in the most significant bits of w), the user computes the trapdoor as t′ = w ⊕ K1 and sends it to the server;

• the server carries out the following search:

    for i = 1 to |δ(D)| do {
        R′i = Ext(Rec(t′, si), xi),
        if R′i = Ri, returns Iwi = {(Ri, Pi), SKE1.Enc_{K2}(id(Di,1) || · · · || id(Di,|D(wi)|))} to the user };
    outputs fail;

• after receiving Iwi = {(Ri, Pi), SKE1.Enc_{K2}(id(Di,1) || · · · || id(Di,|D(wi)|))}, the user checks whether Ri = Ext(Rec(t′, si), xi). If the check passes, the user decrypts SKE1.Enc_{K2}(id(Di,1) || · · · || id(Di,|D(wi)|)) and obtains the search result id(Di,1) || · · · || id(Di,|D(wi)|). The user sends id(Di,1) || · · · || id(Di,|D(wi)|) to the server to retrieve the documents he/she wishes to obtain;

• the server returns ci,j (1 ≤ i ≤ |δ(D)|, 1 ≤ j ≤ |D(wi)|) to the user;

• the user outputs SKE2.Dec_{K3}(ci,j) (1 ≤ i ≤ |δ(D)|, 1 ≤ j ≤ |D(wi)|).

From the construction, we see that the scheme is built on the fuzzy extractor, which is suitable for all similarity metrics; hence our scheme is also suitable for all similarity metrics. The server’s search is conducted on {(Ri, Pi)} (i = 1, . . . , |δ(D)|), so we transform the server’s search for error-tolerant keywords on ciphertexts into a search for exact keywords on plaintexts. The server can use any existing exact-keyword search approach to search the index table.
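The BuildIndex and Search steps can be put together in a compact end-to-end sketch for the Hamming metric. Several simplifications are assumptions of this note and not part of the scheme: the code is a 5-fold repetition code, the extractor is a seeded SHA-256 stand-in, keywords are 40-bit values, and the identifier lists are stored unencrypted (the scheme encrypts them with SKE1 under K2). All names are hypothetical.

```python
import hashlib
import secrets
from typing import Dict, List, Optional

REP = 5        # toy repetition code, corrects up to 2 bit flips (assumption)
D_BITS = 40    # hypothetical keyword length d in bits, a multiple of REP

def to_bits(w: str) -> List[int]:
    """Keyword as d bits, zero-padded in the most significant bits."""
    v = int.from_bytes(w.encode(), "big")
    if v >> D_BITS:
        raise ValueError("keyword longer than d bits")
    return [(v >> (D_BITS - 1 - i)) & 1 for i in range(D_BITS)]

def xor(a, b):
    return [u ^ v for u, v in zip(a, b)]

def encode(x):
    return [b for b in x for _ in range(REP)]

def decode(c):
    return encode([int(sum(c[i:i + REP]) > REP // 2) for i in range(0, len(c), REP)])

def ext(t, x: bytes) -> bytes:
    # Stand-in for the strong extractor; not pairwise-independent hashing.
    return hashlib.sha256(x + bytes(t)).digest()[:16]

def gen(t):
    """Gen(t) = (R, (s, x)) with a code-offset sketch s and extractor seed x."""
    s = xor(t, encode([secrets.randbelow(2) for _ in range(D_BITS // REP)]))
    x = secrets.token_bytes(16)
    return ext(t, x), (s, x)

def rep(t_prime, helper):
    """Ext(Rec(t', s), x), the value the server recomputes for each entry."""
    s, x = helper
    return ext(xor(decode(xor(t_prime, s)), s), x)

def trapdoor(w: str, k1: List[int]) -> List[int]:
    return xor(to_bits(w), k1)

def build_index(k1: List[int], keyword_to_ids: Dict[str, List[str]]):
    """User side: one (R_i, P_i, ids) entry per keyword (ids left in the clear here)."""
    index = []
    for w, ids in keyword_to_ids.items():
        r, p = gen(trapdoor(w, k1))
        index.append((r, p, ids))
    return index

def server_search(index, t_prime) -> Optional[List[str]]:
    """Server side: linear scan, exact comparison of reproduced strings."""
    for r, p, ids in index:
        if rep(t_prime, p) == r:
            return ids
    return None  # corresponds to "outputs fail"

k1 = [secrets.randbelow(2) for _ in range(D_BITS)]
index = build_index(k1, {"cloud": ["D1", "D2"], "index": ["D3"]})
print(server_search(index, trapdoor("clouf", k1)))   # query with a one-bit typo
```

The server only ever compares the reproduced string with the stored Ri for equality, which is why the lookup itself reduces to exact matching.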

Theorem 1. The proposed scheme is correct.

Proof. For any query w, if there exists a wi such that dis(w, wi) ≤ η, then from the property of exclusive-or, we get dis(t′, ti) ≤ η. If the user correctly computes Gen(ti) = (Ri, Pi) = (Ri, (si, xi)) in the BuildIndex step, then Rec(t′, si) = ti, and R′i = Ext(ti, xi), computed by the server in the Search step, must be equal to Ri by the correctness of the fuzzy extractor. If the check Ri = Ext(Rec(t′, si), xi) passes, the user is convinced that the index Iwi that he/she has received is correct. Therefore the user can correctly get all identifiers of the documents, and further obtain the documents he/she wishes to retrieve. □

3.3 Proof of Security

In this subsection, we analyze the security of our concrete scheme.

Theorem 2. The proposed scheme is adaptively secure (i.e., satisfies Definition 8) assuming that both private-key encryption schemes SKE1 and SKE2 are PCPA-secure.

Proof. To prove the semantic security of our scheme, we need to construct a simulator S = (S0, . . . , Sq) such that for the adversary A = (A0, . . . , Aq), the outputs of Real_{ETKS,A}(k) and Sim_{ETKS,A,S}(k) are computationally indistinguishable. We construct a simulator S = (S0, . . . , Sq) that adaptively produces a string v′ = (I′, c′, t′) = (I′, c′1, . . . , c′N, t′1, . . . , t′q) as follows:

1) S0(1^k, τ(D)): it constructs a simulated index I′ and random ciphertexts c′ as follows:

• initializes I′ as a null table, i = 0;

• if |I′| < |I|, selects uniformly at random a keyword w′i, calculates I′_{w′i} = Gen(w′i), inserts I′_{w′i} into I′, sets i = i + 1, and repeats this step;

• outputs I′ as a simulated index.

S0 selects c′i uniformly at random from {0, 1}^{|Di|} (i = 1, . . . , N), constructs c′ as (c′1, . . . , c′N), and outputs v′ = (I′, c′).

From the security of the fuzzy extractor, I′ is indistinguishable from I. Otherwise, an algorithm could be built to distinguish between at least one of the elements of I′ and I, which would break the security of Gen. And due to the PCPA-security of SKE2, c′ is indistinguishable from c.

2) S1(stS, τ(D, w1)): it samples K1 ←R {0, 1}^d, and calculates the trapdoor t′1 and the index I′_{w1} of w1 in the same way as the scheme does, except that SKE1.Enc_{K2}(id(D1,1) || · · · || id(D1,|D(w1)|)) is replaced by a random number c′ selected uniformly. It outputs v′ = (I′, c′, t′), where I′ = {I′_{w1}}, c′ = {c′}, t′ = {t′1}. (I′, c′) is indistinguishable from (I, c) for the same reason that the two are indistinguishable in S0(1^k, τ(D)). t′1 is indistinguishable from t1 because of the way that both t′1 and t1 are generated.

3) Si(stS, τ(D, w1, . . . , wi)) (2 ≤ i ≤ q): first Si checks whether wi has appeared before. This can be done by checking whether there exists a j (1 ≤ j ≤ i − 1) such that σ[i, j] = 1. If wi has not previously appeared, then Si generates the trapdoor t′i and the index I′_{wi} of wi in the same way as S1 does, and substitutes I′_{wi} for some I′_{wk} selected uniformly at random in I′ − {I_{w1}, . . . , I_{wi−1}}. On the other hand, if wi did previously appear, then Si retrieves the trapdoor previously used for wi and uses it as t′i, and inserts t′i into t′. Si outputs v′ = (I′, c′, t′). It follows that v′ is indistinguishable from v. This completes the proof. □

4 Performance

4.1 Exact Efficiency of the Proposed Scheme

For each keyword w, the user needs to perform one exclusive-or operation and one Gen of a fuzzy extractor (which requires a coding operation and a hash function operation) in the BuildIndex stage, so the user’s computing cost is O(|δ(D)|) in the BuildIndex stage. In the Search stage, however, only one exclusive-or operation is needed per query. The server’s storage cost and search cost are both O(|δ(D)|).

We conduct an experimental evaluation of the proposed scheme on a real dataset: the Digital Bibliography & Library Project (DBLP), which is developed and maintained by a team at the University of Trier, Germany. DBLP stores only the metadata of publications, such as titles, authors, and publication dates, and provides a search service for high-quality scientific literature in the computer field. Up to March 2012, DBLP included 1 912 813 publications. We generate 1 000 randomly selected keywords with an average length of 13.59 in our experiments and take the Hamming metric as an example, where the coding scheme is a [255, 175, 21] binary BCH code and the hash function in the fuzzy extractor is SHA-1. The experiment is conducted in the following environment: Windows XP Professional SP3, 32-bit (DirectX 9.0c), with an Intel® Core2 processor and 4GB of Kingston DDR2-800 (800MHz) memory. Fig.2 shows the time of the index construction and the time of the server’s searching, in that order. Because the user can precompute the codewords to be used in the fuzzy extractor, Fig.2 reports the index construction time in two cases: with precomputation and without precomputation. Both the index construction time and the server’s search time increase linearly with the number of keywords.
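As a worked note on the code parameters quoted above, and assuming bounded-distance decoding up to half the minimum distance (an assumption of this note; the decoder is not specified in the text), a code with minimum distance $d_{\min} = 21$ corrects up to

$$t = \left\lfloor \frac{d_{\min} - 1}{2} \right\rfloor = \left\lfloor \frac{21 - 1}{2} \right\rfloor = 10$$

bit errors per 255-bit block. Under the Hamming metric, a code-offset sketch built on such a code could therefore support a distance threshold of up to η = 10.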

Fig.2. (a) Time of index construction. (b) Time of searching a single keyword.

4.2 Comparisons

In this subsection, we compare our scheme with two existing solutions for searching on error-tolerant keywords in [7] and [11]. In the comparison of computing costs, we regard one computation of a hash function as the basic unit. An exclusive-or operation is also counted as one hash computation (although an exclusive-or operation is simpler than a hash function), which makes the schemes easy to compare.

In [7], for a keyword $w$, the user must calculate $h_c(w)$ for all $h_c \in H_c$ in the BuildIndex stage, and calculate $h_c(w')$ for all $h_c \in H_c$ in the Search stage for a query $w'$. Here $H_c$ is a set of $\nu \times \mu$ composite hash functions, where $\nu$ hash functions are chosen from an LSH (Locality-Sensitive Hashing) family and $\mu$ hash functions are dedicated to a Bloom filter. Therefore, the user's computing cost is $O(|\delta(D)||H_c|)$ in both the BuildIndex stage and the Search stage. Since the Bloom filter may return incorrect results, $|H_c|$ must be large enough to reduce the error probability, which renders the scheme inefficient. As for storage, storing the $|D|$ encrypted documents is indispensable in all schemes, so we do not consider it and consider only the size of the index table to be stored. The server needs to store $|H_c|$ hash function values for each keyword, so the size of the index table stored in the server is $O(|\delta(D)||H_c|)$. Regarding the search, for each query the server needs to compute $|H_c|$ hash functions and compare the results with the Bloom filter. So the search cost for the server is $O(|H_c|)$.
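To illustrate why a Bloom filter "may return incorrect results", the sketch below implements a generic Bloom filter and the standard approximation of its false-positive rate. It is not the construction of [7]: the class, the use of SHA-256 to derive bit positions, and all parameter values are assumptions chosen only to show the behaviour.

```python
import hashlib
import math


class BloomFilter:
    """Generic Bloom filter (illustrative; not the construction of [7])."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Never false for an added item; may be true for an item never added.
        return all(self.bits[pos] for pos in self._positions(item))


def expected_false_positive_rate(n_items: int, num_bits: int, num_hashes: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k of the false-positive rate."""
    return (1.0 - math.exp(-num_hashes * n_items / num_bits)) ** num_hashes


if __name__ == "__main__":
    bf = BloomFilter(num_bits=1024, num_hashes=4)
    for word in ("cloud", "search", "keyword"):
        bf.add(word)
    print(bf.might_contain("cloud"))
    print(expected_false_positive_rate(100, 1024, 4))
```

A filter never misses an inserted item, but unrelated items can match by chance, and lowering that chance costs more bits or more hash computations per keyword.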

In [11], the computing cost and search cost vary with the method by which the server searches the index table; there are the following two cases.

1) In the Listing Approach. The user first needs to construct the fuzzy keyword set $S_{w_i,d}$ for each keyword $w_i \in \delta(D)$ ($1 \le i \le |\delta(D)|$) with predefined edit distance $d$ ($0 \le d \le \eta$, where $\eta$ is the predetermined distance threshold). The construction uses one of two proposed techniques, namely wildcard-based fuzzy set construction and gram-based fuzzy set construction. In the gram-based construction, the size of the fuzzy set $S_{w_i,d}$ is $c_l^{l} + c_l^{l-1} + \cdots + c_l^{l-d}$ (where $l$ is the length of the keyword), which is smaller than that in the wildcard-based construction. Therefore, the user's computing cost is $O(\eta|\delta(D)|(c_l^{l} + c_l^{l-1} + \cdots + c_l^{l-d}))$. Regarding the storage, the index table stored in the server has size $O(\eta|\delta(D)|(c_l^{l} + c_l^{l-1} + \cdots + c_l^{l-d}))$.

To search for the files containing a query $w$, for each $d$ ($0 \le d \le \eta$), the user generates the fuzzy keyword set $S_{w,d}$, computes the trapdoor $T_{w'}$ for each $w' \in S_{w,d}$, and submits the trapdoor set $\{T_{w'}\}_{w' \in S_{w,d}}$ as the search query to the cloud server. Therefore, the user's computing cost is $O(\eta(c_l^{l} + c_l^{l-1} + \cdots + c_l^{l-d}))$ calculations of a one-way function. For each query, the search cost for the server is $O(\eta|\delta(D)|(c_l^{l} + c_l^{l-1} + \cdots + c_l^{l-d}))$.
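To get a feel for how quickly the fuzzy set grows, the sketch below evaluates the gram-based set size numerically. It assumes that the symbol written as c with subscript l and superscript k denotes the binomial coefficient "l choose k", and it evaluates the cost term at the worst case d = η; both readings, and the function names, are assumptions for illustration.

```python
from math import comb


def gram_fuzzy_set_size(l: int, d: int) -> int:
    """Sum of binomial coefficients C(l, l) + C(l, l-1) + ... + C(l, l-d).

    Assumes the paper's c_l^k is the binomial coefficient "l choose k".
    The sum is capped at d = l so that no negative index is requested.
    """
    return sum(comb(l, l - k) for k in range(min(d, l) + 1))


def per_keyword_cost(l: int, eta: int) -> int:
    """The factor eta * (set size), evaluated at d = eta (an assumption)."""
    return eta * gram_fuzzy_set_size(l, eta)


if __name__ == "__main__":
    # A 13-character keyword, close to the 13.59 average length used in
    # the experiments, with edit distance threshold 2.
    print(gram_fuzzy_set_size(13, 2))  # 1 + 13 + 78 = 92
    print(per_keyword_cost(13, 2))     # 2 * 92 = 184
```

Even for a small threshold, every keyword contributes on the order of a hundred index entries and trapdoor computations, which is the overhead the listing approach pays in the BuildIndex and Search stages.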

2) In a Symbol-Based Trie-Traverse Search Scheme. To enhance the server's search efficiency, the user needs to build a trie② of all possible keywords as follows. For each $w_{i,j} \in S_{w_i,d}$ ($0 \le j \le |S_{w_i,d}|$), the user calculates the trapdoor $T_{w_{i,j}}$, splits it as $\alpha_{i_1}, \ldots, \alpha_{i_{l/n}}$ (where $l$ is the length of $T_{w_{i,j}}$ and $n = \log|\delta(D)|$), and takes $\alpha_{i_1}, \ldots, \alpha_{i_{l/n}}$ as a path of the trie. Therefore, the cost for the user to build the trie is still $O(\eta|\delta(D)|(c_l^{l} + c_l^{l-1} + \cdots + c_l^{l-d}))$, and the user's total computing cost is $O(2\eta|\delta(D)|(c_l^{l} + c_l^{l-1} + \cdots + c_l^{l-d}))$. To store the trie, the server's storage cost is $O(\eta|\delta(D)|(c_l^{l} + c_l^{l-1} + \cdots + c_l^{l-d}))$ and its search cost is $O(l/n)$.
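The trie-traverse idea can be sketched as follows: each trapdoor is cut into fixed-size symbols, and the symbols form a root-to-leaf path, so a lookup visits one node per symbol regardless of how many keywords are indexed. This is a minimal sketch, not the scheme of [11]: the use of HMAC-SHA256 as the trapdoor function, the symbol size measured in hexadecimal characters, and all names are assumptions.

```python
from __future__ import annotations

import hashlib
import hmac

FILES = "$files"  # leaf marker; cannot collide with hexadecimal symbols


def trapdoor(key: bytes, keyword: str) -> str:
    """Hypothetical trapdoor: HMAC-SHA256 of the keyword, as a hex string."""
    return hmac.new(key, keyword.encode(), hashlib.sha256).hexdigest()


def split_symbols(value: str, symbol_len: int) -> list[str]:
    """Cut a trapdoor into consecutive symbols of symbol_len characters."""
    return [value[i:i + symbol_len] for i in range(0, len(value), symbol_len)]


class TrapdoorTrie:
    """Trie whose paths are the symbol sequences of trapdoors."""

    def __init__(self, symbol_len: int) -> None:
        self.symbol_len = symbol_len
        self.root: dict = {}

    def insert(self, td: str, file_ids: set[int]) -> None:
        node = self.root
        for symbol in split_symbols(td, self.symbol_len):
            node = node.setdefault(symbol, {})
        node.setdefault(FILES, set()).update(file_ids)

    def search(self, td: str) -> set[int]:
        # Visits one node per symbol, i.e. (trapdoor length / symbol length) steps.
        node = self.root
        for symbol in split_symbols(td, self.symbol_len):
            node = node.get(symbol)
            if node is None:
                return set()
        return set(node.get(FILES, ()))


if __name__ == "__main__":
    key = b"demo-key"
    trie = TrapdoorTrie(symbol_len=4)
    trie.insert(trapdoor(key, "castle"), {1, 2})
    print(trie.search(trapdoor(key, "castle")))
    print(trie.search(trapdoor(key, "cattle")))
```

In a fuzzy search, the user would insert one trapdoor per element of each fuzzy keyword set, which is why building the trie costs as much again as building the index.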

We use $A$, $B$, $C$ to denote $|\delta(D)|$, $|H_c|$, and $\eta(c_l^{l} + c_l^{l-1} + \cdots + c_l^{l-d})$, respectively. The comparison is given in Table 2, where [11] denotes the listing search approach in [11], and [11′] denotes the symbol-based trie-traverse search scheme in [11].

In Table 2, Cost1 and Cost2 denote the user's computing costs in the BuildIndex stage and the Search stage, respectively. IND2-CKA means that the scheme is secure against adaptive chosen-keyword attacks, and IND1-CKA means that the scheme is secure against non-adaptive chosen-keyword attacks. The former guarantees security even when the user's queries depend on the encrypted index and the results of previous queries. The latter guarantees security only if the user's queries are independent of the index and the previous results.

The comparison shows that our scheme is more efficient for users than the other two schemes. For users, all calculations in the BuildIndex stage of any scheme can be precomputed, so the calculations in the Search stage dominate the efficiency of any scheme. Therefore, improving the users' efficiency in the Search stage is our major pursuit.

Table 2. Comparison of Four Schemes

| Scheme | Security Model | Metric | Cost1 | Cost2 | Server's Storage | Search Cost |
|--------|----------------|--------|-------|-------|------------------|-------------|
| [7] | IND1-CKA | Hamming distance | O(AB) | O(AB) | O(AB) | O(B) |
| [11] | IND2-CKA | Edit distance | O(AC) | O(C) | O(AC) | O(AC) |
| [11′] | IND2-CKA | Edit distance | O(2AC) | O(C) | O(AC) | O(l/n) |
| Ours | IND2-CKA | All the metrics | O(A) | O(1) | O(A) | O(A) |

②A trie is a tree data structure that allows strings with similar character prefixes to use the same prefix data and store only the tails as separate data. One character of the string is stored at each level of the tree, with the first character of the string stored at the root.
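To make the asymptotic entries of Table 2 concrete, the sketch below plugs sample values into the expressions with constants kept exactly as printed in the table. Only A = 1000 comes from the experiments (the number of keywords); the values chosen for B, C, and l/n are hypothetical, and the function name is an assumption.

```python
def table2_costs(A: int, B: int, C: int, path_len: int) -> dict:
    """Evaluate the Table 2 expressions as operation counts.

    A = number of keywords, B = number of composite hash functions in [7],
    C = per-keyword fuzzy-set factor in [11], path_len = l/n for [11'].
    """
    return {
        "[7]":   {"cost1": A * B,     "cost2": A * B, "storage": A * B, "search": B},
        "[11]":  {"cost1": A * C,     "cost2": C,     "storage": A * C, "search": A * C},
        "[11']": {"cost1": 2 * A * C, "cost2": C,     "storage": A * C, "search": path_len},
        "Ours":  {"cost1": A,         "cost2": 1,     "storage": A,     "search": A},
    }


if __name__ == "__main__":
    # B, C and path_len are hypothetical sample values, not measured ones.
    costs = table2_costs(A=1000, B=20, C=184, path_len=16)
    for scheme, row in costs.items():
        print(scheme, row)
```

With these sample values, the user's Search-stage cost is a single operation in the proposed scheme, while the server's search cost remains linear in the number of keywords.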

5 Conclusions

In this paper, we presented a general framework for searching on error-tolerant keywords. We also gave a concrete scheme based on a fuzzy extractor, and proved the scheme secure against an adaptive adversary under well-defined security definitions. Our scheme is suitable for all similarity metrics, and can tremendously ease the users' burden.

References

[1] Mell P, Grance T. The NIST definition of cloud computing. National Institute of Standards and Technology Special Publication, SP 800-145, September 2011. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf

[2] Harauz J, Kaufman L M, Potter B. Data security in the world of cloud computing. IEEE Security & Privacy, 2009, 7(4): 61-64.

[3] Gentry C. Computing arbitrary functions of encrypted data. Communications of the ACM, 2010, 53(3): 97-105.

[4] Boneh D, Crescenzo G D, Ostrovsky R, Persiano G. Public key encryption with keyword search. In Proc. EUROCRYPT 2004, May 2004, pp.506-522.

[5] Boneh D, Kushilevitz E, Ostrovsky R, Skeith W. Public-key encryption that allows PIR queries. In Proc. the 27th CRYPTO, Aug. 2007, pp.50-67.

[6] Boneh D, Waters B. Conjunctive, subset, and range queries on encrypted data. In Proc. the 4th Theory of Cryptography Conference, Feb. 2007, pp.535-554.

[7] Bringer J, Chabanne H, Kindarji B. Error-tolerant searchable encryption. In Proc. the IEEE International Conference on Communications, June 2009.

[8] Chang Y, Mitzenmacher M. Privacy preserving keyword searches on remote encrypted data. In Proc. the 3rd Int. Conf. Applied Cryptography and Network Security, June 2005, pp.442-455.

[9] Curtmola R, Garay J, Kamara S, Ostrovsky R. Searchable symmetric encryption: Improved definitions and efficient constructions. J. Computer Security, 2011, 19(5): 895-934.

[10] Goh E. Secure indexes. IACR ePrint Cryptography Archive, 2003. http://eprint.iacr.org/2003/216, Dec. 2013.

[11] Li J, Wang Q, Wang C, Cao N, Ren K, Lou W. Fuzzy keyword search over encrypted data in cloud computing. In Proc. the 29th IEEE INFOCOM, March 2010, pp.441-445.

[12] Ma S, Yang B, Li K, Xia F. A privacy-preserving join on outsourced database. In Proc. the 14th Information Security Conference, Oct. 2011, pp.278-292.

[13] Park D, Kim K, Lee P. Public key encryption with conjunctive field keyword search. In Proc. the 5th Int. Workshop on Information Security Applications, Aug. 2004, pp.73-86.

[14] Pinkas B, Reinman T. Oblivious RAM revisited. In Proc. the 30th CRYPTO, Aug. 2010, pp.502-519.

[15] Shi E, Bethencourt J, Chan T et al. Multi-dimensional range query over encrypted data. In Proc. the 2007 IEEE Symp. Security and Privacy, May 2007, pp.350-364.

[16] Song D, Wagner D, Perrig A. Practical techniques for searches on encrypted data. In Proc. the 2000 IEEE Symposium on Security and Privacy, May 2000, pp.44-55.

[17] van Liesdonk P, Sedghi S, Doumen J et al. Computationally efficient searchable symmetric encryption. In Proc. the 7th VLDB Workshop on Secure Data Management, Sept. 2010, pp.87-100.

[18] Pang X, Yang B, Huang Q. Privacy-preserving noisy keyword search in cloud computing. In Proc. the 14th Int. Conf. Information and Communications Security, October 2012, pp.154-166.

[19] Goldreich O, Ostrovsky R. Software protection and simulation on oblivious RAMs. Journal of the ACM, 1996, 43(3): 431-473.

[20] Dodis Y, Ostrovsky R, Reyzin L, Smith A. Fuzzy extractors: How to generate strong keys from biometrics and other noisy data. SIAM Journal on Computing, 2008, 38(1): 97-139.

Bo Yang received the B.S. degree in applied mathematics from Peking University in 1986, and the M.S. degree in computer software and Ph.D. degree in cryptography from Xidian University, Xi'an, in 1993 and 1999, respectively. He is currently a professor and supervisor of Ph.D. candidates at the School of Computer Science, Shaanxi Normal University, Xi'an, and a special-term professor of Shaanxi Province. His research interests include information theory and cryptography.

Xiao-Qiong Pang received the B.S. degree in computer science from Shanxi Normal University, Linfen, in 2003, the M.S. degree in computer application technology from North University of China, Taiyuan, in 2006, and the Ph.D. degree in agricultural electrification and automation from the South China Agricultural University, Guangzhou, in 2013. She is a lecturer at the College of Computer and Control Engineering, North University of China. Her research interests focus on information security and cryptography.

Jun-Qiang Du received the B.E. degree in computer science from Northwestern Polytechnical University, Xi'an, in 2009. He is currently a graduate student at the School of Computer Science, Shaanxi Normal University. His research interests include information security and cryptography.

Dan Xie received the B.S. degree in software engineering from Harbin Normal University, China, in 2011. She is currently a graduate student at the School of Computer Science, Shaanxi Normal University. Her research interests include information security and cryptography.
