
The Circle Game: Scalable Private Membership Test Using Trusted Hardware

Sandeep Tamrakar∗, Jian Liu∗, Andrew Paverd∗, Jan-Erik Ekberg†, Benny Pinkas‡, and N. Asokan∗

∗Aalto University
†Trustonic
‡Bar Ilan University

Abstract: Malware checking is changing from being a local service to a cloud-assisted one where users’ devices query a cloud server, which hosts a dictionary of malware signatures, to check if particular applications are potentially malware. Whilst such an architecture gains all the benefits of cloud-based services, it opens up a major privacy concern since the cloud service can infer personal traits of the users based on the lists of applications queried by their devices. Private membership test (PMT) schemes can remove this privacy concern. However, known PMT schemes do not scale well to a large number of simultaneous users and high query arrival rates.

In this paper, we propose a simple PMT approach using a carousel: circling the entire dictionary through trusted hardware on the cloud server. Users communicate with the trusted hardware via secure channels. We show how the carousel approach, using different data structures to represent the dictionary, can be realized on two different commercial hardware security architectures (ARM TrustZone and Intel SGX). We highlight subtle aspects of securely implementing seemingly simple PMT schemes on these architectures. Through extensive experimental analysis, we show that for the malware checking scenario our carousel approach surprisingly outperforms Path ORAM on the same hardware by supporting a much higher query arrival rate while guaranteeing acceptable response latency for individual queries.

I. INTRODUCTION

Malware checking used to operate primarily as a local service: a locally-installed anti-malware tool periodically receives updated lists of known threats from its vendor, but all its checks are done locally. This paradigm has already started to change in the era of cloud computing. Nowadays, anti-malware tools are often thin clients and the bulk of the threat data is held by a cloud-based service. At the time of installing an application (or periodically), the anti-malware tool consults the cloud service to determine if the particular application is likely to be malware. Such a design pattern is attractive for an anti-malware vendor for a variety of reasons: it avoids unnecessary data transfers, ensures that all users have up-to-date threat information, and allows the anti-malware vendor to retain its full set of known malware signatures as a potential competitive advantage without having to disclose it in its entirety to every one of its customers. Consequently, protocols involving such remote lookup operations also occur in other scenarios, such as querying whether a document is known to contain a malicious payload, or checking if your password is present in a database of leaked passwords. Abstractly, such a lookup operation is a remote membership test: a user holding an item q wants to check if q is a member of a large set X, called the dictionary, held by a remote server.

Although services built on remote membership test have significant advantages, they suffer from a major privacy concern: the server learns all queries submitted by users, thereby gaining personal information about the users. For example, it has been demonstrated that the set of applications on a user’s device can be used to infer personal characteristics of the user including their gender, age, religion, and relationship status [50]. Service providers who want to benefit from incorporating remote membership test into their services therefore want to design their services to demonstrably preclude the ability to infer personal traits of their users or profile them [29]. The natural cryptographic primitive to build such a private membership test (PMT) is private set intersection (PSI) [45], [44]. However, PSI schemes have two major drawbacks. The first is their complexity: a PSI involving a dictionary of size n requires O(n) communication between the client and the server and requires the server to perform O(n) operations. The second is their poor scalability: PMT servers receive simultaneous lookup queries from large numbers of users, but PSI-based schemes are not amenable to aggregation of such queries.

Recently, trusted hardware has become widely available on commodity computing platforms. Trusted execution environments (TEEs) are already pervasive on mobile platforms [19], and newer TEEs such as Intel’s SGX [35], [27] are being deployed on PCs and servers. Several prior works [9], [26], [34] show how trusted hardware can be used to establish a “trust anchor” [49] in the cloud. The combination of such a trust anchor with Path ORAM [52], the recent breakthrough in oblivious random access memory (ORAM), can be used to solve the PMT problem. This solution has a constant communication overhead and only O(log n) computational overhead per query. However, like PSI, Path ORAM is not amenable to aggregation of simultaneous queries. Therefore, supporting simultaneous queries from m users will incur O(m log n) computational cost on the server.

Scenarios like malware checking have certain specific characteristics. First, in the case of Android malware, the size of the dictionary is currently in the order of millions [2]. Second, a typical user installs some tens or hundreds of applications. Therefore it is reasonable to admit a relatively high false positive rate for PMT when used for such scenarios: a false positive implies that the user may subsequently reveal the identifier of that particular application to the anti-malware vendor in order to learn more about the potential threat and possibly remove or quarantine the application. Since it is reasonable to assume that the anti-malware vendor is an honest-but-curious adversary (who is trusted to include only known malware identifiers in the dictionary), the loss of privacy from having to reveal the identifiers of a very small number of applications (possibly one or none) to the cloud server is not significant. In this paper, we exploit these characteristics of the scenario of interest to design an effective and efficient PMT scheme that can scale well by supporting a significantly larger number of simultaneous queries compared to known PMT schemes.

arXiv:1606.01655v3 [cs.CR] 24 Aug 2016

The time required to provide a response to a query (i.e. the query response latency) must also be within acceptable limits for the given scenario. For cloud-based malware checking, this should typically be within a couple of seconds. More importantly, the service must be able to guarantee a certain response latency to its users. If a PMT scheme can guarantee an upper bound on its query response latency for a given query arrival rate, we say the scheme is sustainable at that rate. For every scheme there is a maximum query arrival rate beyond which the scheme is not sustainable. We call this the breakdown point because if the query arrival rate exceeds this point, the query response latency will keep increasing over time. At this point, new hardware has to be added to the server in order to guarantee query response latency.
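The notion of a breakdown point can be illustrated with a minimal queueing sketch. It assumes a hypothetical server that completes at most a fixed number of queries per second; the names `arrival_rate`, `service_rate`, `backlog_over_time` and `is_sustainable` are introduced here for illustration only and do not correspond to any measurement in this paper.

```python
def backlog_over_time(arrival_rate, service_rate, seconds):
    """Return the number of queued queries at the end of each second.

    Hypothetical model: `arrival_rate` queries arrive every second and the
    server completes at most `service_rate` queries per second.
    """
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrival_rate                  # new queries join the queue
        backlog -= min(backlog, service_rate)    # server drains what it can
        history.append(backlog)
    return history


def is_sustainable(arrival_rate, service_rate):
    """In this model a rate is sustainable if the backlog cannot grow forever."""
    return arrival_rate <= service_rate
```

Below the breakdown point the backlog stays at zero, so latency is bounded; above it the backlog (and therefore the waiting time of each new query) grows by a fixed amount every second.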

Our contributions are as follows:

• We introduce a new carousel design pattern in which the dictionary (or a representation thereof) is continuously circled through the trusted hardware on the lookup server (Section V). This allows us to ensure query privacy (i.e. the lookup server does not learn the contents of the query) while guaranteeing low query response latency.

• We show how the system’s performance can be significantly improved by selecting efficient data structures to represent the dictionary. We evaluate several different data structures (Section VI), and describe how to construct and process each of these without leaking information about the queries or responses.

• Through a systematic and extensive experimental evaluation using two different commercial hardware security architectures (ARM TrustZone and Intel SGX), we show that for typical parameters in the malware checking scenario, our carousel-based PMT can support a large number of simultaneous queries while still guaranteeing sufficiently low query response latency for every query (Section VII). We also show experimentally that among the dictionary representations we studied, 4-ary Cuckoo hash [21] performs best.

• We also describe how to solve the PMT problem by using the Cuckoo hash representation with Path ORAM (Section VII-B). We experimentally compare this Cuckoo-on-ORAM (CoO) approach with our Cuckoo-on-a-Carousel (CoaC) approach. We show that although CoO can achieve very low average query response latency, it reaches its breakdown point quickly. In contrast, CoaC provides a more modest query response latency while sustaining much higher query arrival rates using the same hardware. The sustainability range for CoaC is 2.75 times higher on Intel SGX and nearly 10 times higher on ARM TrustZone (Section VII-D).

• We highlight the subtleties in privacy-preserving implementations on real-life hardware security architectures, even for seemingly simple concepts like the carousel pattern or Path ORAM (Sections VI and VII-B).

In this paper we describe and evaluate our carousel approach as a solution to the real-world challenge of cloud-based malware checking. However, we emphasize that our approach can also be applied to many other use cases.

II. PRELIMINARIES

A. Trusted Execution Environment

A Trusted Execution Environment (TEE) [3] is a system security primitive that isolates and protects security-critical logic from all other software on the platform. All software outside the TEE is said to be running in the Rich Execution Environment (REE) [3], which usually includes the majority of the platform’s software and the main operating system (OS). A piece of application logic running inside the TEE is referred to as a Trusted Application (TA), whilst an application running in the REE is a Client Application (CA) [3].

Fundamentally, a TEE protects the confidentiality and integrity of a TA’s data, and ensures that no REE software can interfere with the TA’s operation. A TEE usually provides some form of remote attestation, which allows remote users to ascertain the current configuration and behavior of a TA. The combination of these capabilities enables remote users to trust a TA. In modern systems, the capability to establish and enforce a TEE is often provided by the CPU itself. This leads to very strong hardware-enforced security guarantees, and also allows the TEE to run on the main CPU, which is important in terms of performance and deployability. However, in some cases this may allow malicious software in the REE to mount side-channel attacks against the TEE.

Our approach does not depend on any platform-specific features, and thus can be realized on any TEE that exhibits the above characteristics. We demonstrate this by implementing our system on the two most prevalent commercial TEE technologies, ARM TrustZone and Intel SGX, for which we provide overviews below.

1) ARM TrustZone: ARM TrustZone [7] is a contemporary TEE architecture that is widely deployed on smartphone hardware platforms and is now being deployed on AMD CPUs [1]. TrustZone provides a platform-wide TEE, called the secure world, which is fully isolated from the REE or normal world. All interaction between the secure world and the normal world is mediated by the CPU. In order to support multiple TAs, the secure world usually runs a trusted OS, such as Kinibi from Trustonic [5]. Due to the constraints of the platform, the trusted OS may limit TA’s internal memory (comprising code, stack and heap); e.g. Kinibi limits each TA to 1 MB. The platform can be configured to map TA’s internal memory to system-on-chip (SoC) RAM. With such a configuration, TA’s internal memory is secure memory since TrustZone protects the confidentiality and integrity of this memory against an adversary who controls the normal world. Furthermore, TA’s internal memory is private memory since the adversary cannot observe TA’s memory access pattern (i.e. the metadata about which memory addresses are being accessed, and in what order). In contrast, any accesses from TA to the normal world (main) memory are assumed to be observable by the adversary.

In Kinibi on ARM TrustZone (Kinibi-TZ), interaction between a normal world CA and a TA in the TEE follows a request-response pattern: CA can invoke a specific operation provided by TA. In addition to a small set of TA invocation parameters, CA can demarcate up to 1 MB of its memory to be shared with TA.¹ From a technical perspective, mapping external memory is no different from sharing memory between applications on any modern OS: the memory management unit (MMU) maps a physical memory page to the virtual address spaces of both CA and TA. This page can be read and written by both endpoints, and the processor’s mechanisms for cache coherency ensure that memory accesses are properly synchronized. This feature allows TA to access large portions of normal world memory through memory referencing.

2) Intel SGX: The recent Software Guard Extensions (SGX) technology [36] from Intel allows individual applications to establish their own TEEs, called enclaves. An enclave can contain application logic and secret data, and protects the confidentiality and integrity of its contents from all other software on the platform, including other enclaves, applications, or the (untrusted) OS. SGX includes remote attestation capabilities to provide remote parties with assurance about the code running in an enclave [6]. For consistency, we refer to the untrusted application that hosts the enclave as Client Application (CA), and the enclave itself as Trusted Application (TA). Although both SGX and TrustZone have similar objectives, the specific architectures of these two technologies give rise to several important differences.

Memory considerations. Unlike the platform-wide TrustZone TEE, SGX supports multiple enclaves: each TA runs in its own enclave. Each enclave is part of an application and runs in the same virtual address space as its host application. This means that the enclave can directly access the application’s memory, but attempts by the application or OS to access the enclave memory are blocked by the CPU. Whenever any of the enclave memory leaves the CPU (e.g. is written to DRAM), it is automatically encrypted and integrity-protected by the CPU. However, even though SGX provides secure memory (i.e. confidential and integrity-protected memory), the enclave’s memory access pattern may still be observable by untrusted software on the same platform. This lack of private memory gives rise to potential side-channel attacks against the enclave.

Deterministic side-channel attacks. Xu et al. [54] have shown how a malicious OS can manipulate the platform’s global memory page table, which includes the enclave memory pages, to cause page faults whenever the enclave reads from or writes to its memory. If the enclave’s memory access pattern depends on some secret data, their technique can be used to discover its value by observing the sequence of page faults. This side-channel attack is deterministic and thus can be effective even with only a single execution trace. However, the adversary can only observe memory accesses at page-level granularity (usually 4 kB). For example, the adversary can observe when a particular memory page is accessed and can distinguish between reads and writes, but cannot ascertain the specific addresses of these operations within the page. Xu et al. [54] further show how sequences of multiple memory accesses at page-level granularity can be used to infer more precise information about the target application. In some cases, this allows the adversary to infer the offset within an accessed page. However, these attacks were performed against existing software. For new software, a potential countermeasure is therefore to ensure that neither individual page-level memory accesses nor sequences of accesses depend on secret data.

¹ Some combinations of hardware platform and trusted OS may allow larger shared memory.

Probabilistic side-channel attacks. Liu et al. [33] have presented an even stronger cache side-channel attack, which could be used against SGX. They exploit the fact that the CPU’s level 3 (L3) cache is shared between all cores, and that the adversary may have control of the other cores while the enclave is executing. Through this type of attack, an adversary may be able to observe the enclave’s memory access pattern at cache line (CL) granularity (usually 64 B). However, since the adversary does not have direct control of the L3 cache, this is a probabilistic attack that requires the secret-dependent memory accesses to be repeated multiple times.

In this paper, we assume that SGX enclave memory can be considered private at page-level granularity. That is, different accesses within a page of enclave memory are indistinguishable to an adversary. Accesses to different enclave pages can be noticed by the adversary, even though the pages’ contents are encrypted and integrity-protected. Therefore, as explained in Section VII, we ensure that in all our SGX implementations the page-level memory access patterns do not depend on secret data. Furthermore, we assume that probabilistic cache side-channel attacks are infeasible if secret-dependent memory accesses are not repeated multiple times. We therefore ensure that none of our SGX implementations perform secret-dependent memory accesses more than once. If stronger resistance to probabilistic side-channel attacks is required, techniques such as those used in Sanctum [16] could be applied.

B. Oblivious RAM

Oblivious RAM (ORAM) is a cryptographic primitive originally proposed by Goldreich and Ostrovsky [24] to prevent information leakage through memory access patterns. In ORAM schemes, a secure processor (e.g. TEE) divides its data into blocks, which it encrypts and stores in randomized order in non-secure memory, such as the platform’s main memory. On each access, the processor reads the desired block and some dummy data, and then re-encrypts and reshuffles this data before writing it back to non-secure memory. The processor also needs to update some state in its private memory. Under ORAM, every access pattern is computationally indistinguishable from other access patterns of the same length.

Fig. 1. System model for cloud-based private membership test. [Diagram labels: User; Lookup Server, comprising the REE with the Client App. (CA) and the TEE with the Trusted App. (TA); Dictionary provider; Dictionary (X) = x1, x2, ..., xn; Queries (Q); Results; two secure channels carrying qi ∈ {0,1}^128 and ri ∈ {0,1}^1.]

The state-of-the-art ORAM techniques are tree-based constructions [51], [47], [52], [18], [39], [17], [12], [48], where the data blocks are stored in a tree structure. For example, in Path ORAM the processor stores a position map in its private memory to record the path in which each block resides. When the processor wants to access a block, it reads the block’s complete path from the root to the leaf. To store and access n blocks from insecure memory, tree-based ORAM has a bandwidth cost of O(log n) and uses O(log n) private memory (if recursively storing the position map).
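The path-reading step described above can be sketched as follows. This is a simplified illustration only: it assumes a complete binary tree stored in heap order, and it deliberately omits the encryption, stash, remapping and write-back steps that a real Path ORAM requires. The names `path_to_leaf` and `read_path` are hypothetical.

```python
def path_to_leaf(leaf, height):
    """Return heap-order indices of the buckets from the root to `leaf`.

    Assumption: buckets are numbered like a binary heap (root is 0, the
    children of node i are 2*i + 1 and 2*i + 2) and `leaf` is in
    range(2 ** height).
    """
    node = (2 ** height - 1) + leaf   # heap index of the leaf bucket
    path = [node]
    while node > 0:
        node = (node - 1) // 2        # move to the parent bucket
        path.append(node)
    return path[::-1]                 # root first, leaf last


def read_path(tree, position_map, block_id, height):
    """Read every bucket on the path that the position map assigns to a block."""
    leaf = position_map[block_id]
    return [tree[i] for i in path_to_leaf(leaf, height)]
```

The path always contains `height + 1` buckets regardless of which block is requested, which is where the O(log n) bandwidth cost per access comes from.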

III. PROBLEM SETTING

A. System Model

Figure 1 depicts a generalized system model for cloud-based private membership test (PMT) using trusted hardware. It consists of a dictionary provider, a lookup server, and users. We describe and evaluate our system in terms of the concrete use case of cloud-based malware checking, but we emphasize that our approach can be applied to many other use cases.

In the cloud-based malware checking scenario, the dictionary provider is the anti-malware vendor that constructs and maintains a malware dictionary X = {x1, ..., xn} containing n entries. Each entry xi in X is a (statistically) unique malware identifier. The lookup server is a remote server that provides the malware checking functionality to users. The lookup server could be operated by a third party, such as a content delivery network. The actual lookup functionality is provided by a TA running in the TEE. The lookup server also runs a CA in its REE, which facilitates interaction between users and TA, and makes X accessible to TA. In general, the dictionary cannot be stored inside the TEE since the dictionary may grow arbitrarily large and most real-world TEEs have limited memory.

Each user can authenticate and attest TA before establishing a secure communication channel with TA. The user then submits a query (q) representing an application that she intends to install. TA stores the received queries Q = {q1, ..., qm} in its secure memory, where m is the number of concurrent queries at any given time. The lookup server must return a response of 1 if q ∈ X and 0 otherwise. These responses are also kept in secure memory until they are ready to be returned to the users via their respective secure channels. The primary privacy requirement is that the adversary (with capabilities defined in Section III-C) must not learn any information about q. Note that the adversary is permitted to learn statistical information such as the number of queries submitted by a particular user (e.g. through traffic analysis) or the total number of queries currently being processed by the lookup server. Hiding the communication patterns between users and the lookup server is an orthogonal problem.

In some use cases, it is acceptable for the PMT protocol to exhibit a small but non-zero false-positive rate (FPR). For example, if false positives are permitted in a cloud-based malware checking scenario, a user who receives a positive result from the lookup server will resubmit the same query directly to the dictionary provider in order to ascertain the true result and receive guidance on how to deal with the potentially malicious application. If the FPR is sufficiently low, only a small fraction of apps will be revealed to the dictionary provider, thus effectively preventing it from profiling users. This tolerance of a non-zero FPR can provide significant performance benefits, as we show in Sections VI and VII. However, false negatives are never permissible.

B. Android Malware Use Case Parameters

In 2015, anti-malware vendors, such as G Data, reported approximately 2.3 million new Android malware samples, which was a 50% increase compared to the previous year [2]. Thus, we target a malware dictionary of 2^26 entries (i.e. ~67 million entries) as a reasonable estimate for the next ten years. A study by Yahoo [4] found that typical Android devices have around 95 installed apps. Thus the PMT protocol can afford a relatively high false positive rate (FPR) without adversely affecting user privacy. Based on the recommendation of a leading anti-malware vendor, we selected an FPR of 2^-10, which implies that the majority of users will encounter at most one false positive [29].
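The arithmetic behind these parameters can be checked with a short calculation. The sketch below assumes, purely for illustration, that each of a user's 95 benign apps triggers a false positive independently with probability 2^-10; the helper `prob_at_most` is a hypothetical name and the independence assumption is ours, not a claim of the cited sources.

```python
from math import comb

n_entries = 2 ** 26   # target dictionary size
fpr = 2 ** -10        # selected false positive rate
apps = 95             # typical number of installed apps per device

# Expected number of false positives for a device whose apps are all benign.
expected_fp = apps * fpr


def prob_at_most(k, trials, p):
    """Binomial probability of at most k successes in `trials` independent trials."""
    return sum(comb(trials, i) * p ** i * (1 - p) ** (trials - i)
               for i in range(k + 1))


# Probability that such a device sees zero or one false positive.
p_at_most_one = prob_at_most(1, apps, fpr)
```

Under this assumption the expected number of false positives per device is 95/1024, i.e. below 0.1, and the probability of seeing at most one false positive exceeds 0.99.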

C. Adversary Model

The primary adversary we consider is a malicious lookup server, which is assumed to have full control of the REE. Its objective is to learn information about the contents of users’ queries, which can be used to profile users. As usual, we assume that the adversary is computationally bounded and cannot subvert correctly implemented cryptographic primitives. Therefore the secure channels between TA and users prevent the lookup server from learning the content of messages exchanged via these channels. Furthermore, we assume the adversary will not perform hardware-level attacks due to the relatively high cost of such attacks compared to the value of the data. Therefore, the adversary cannot observe or modify the internal state of TA or TA’s interactions with its data structures in private memory. On the other hand, a lookup server can masquerade as a user and submit its own arbitrary queries. It can also schedule or remove incoming queries as it sees fit. It can observe and measure the duration of TA’s interactions with non-private memory, as well as individual query response latencies. It can utilize any existing software side channel that may reveal information about the internal working of TA. It may also attempt to modify or disclose the dictionary.


Our secondary adversary is the dictionary provider itself, which is assumed to be honest-but-curious. It is trusted to only add legitimate malware identifiers to the dictionary (X). Although the dictionary provider may be dishonest in this regard, this is an orthogonal problem (e.g. this could be addressed by considering the reputation of the dictionary provider). However, the dictionary provider may use any application identifiers revealed to it by users to profile these users. The dictionary provider authenticates X towards TA (e.g. via a message authentication code using a key it shares with TA) so that TA can detect any tampering of X by the lookup server. If the dictionary provider wants to keep the dictionary confidential from the lookup server, it is also possible to encrypt the dictionary such that it can only be decrypted by TA, as explained in Section VIII. We deem denial-of-service attacks to be out-of-scope.
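As a minimal sketch of how a message authentication code could let TA detect tampering with X, the following uses HMAC-SHA256 from the Python standard library. The choice of HMAC-SHA256, the per-chunk tagging, and the binding of each tag to a chunk index are assumptions made for this illustration; the text above only states that a MAC with a shared key is used. The names `tag_chunk` and `verify_chunk` are hypothetical.

```python
import hashlib
import hmac


def tag_chunk(key: bytes, index: int, chunk: bytes) -> bytes:
    """Dictionary provider side: compute a MAC over one dictionary chunk.

    Assumption: the chunk index is included so that the lookup server
    cannot reorder or substitute chunks without detection.
    """
    message = index.to_bytes(8, "big") + chunk
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_chunk(key: bytes, index: int, chunk: bytes, tag: bytes) -> bool:
    """TA side: recompute the MAC and compare it in constant time."""
    return hmac.compare_digest(tag_chunk(key, index, chunk), tag)
```

In this sketch the key would be shared between the dictionary provider and TA only, so the lookup server can forward chunks and tags but cannot forge a valid tag for modified content.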

IV. REQUIREMENTS

We define four requirements that the system must satisfy. The first is the main security requirement, and the latter three ensure the system’s performance and accuracy.

R1. Query Privacy: The lookup server and the dictionary provider must not be able to learn anything about the content of the users’ queries or the corresponding query responses. The dictionary provider may learn the content of queries for which TA gave a positive response (i.e. potentially malicious applications), if the user chooses to reveal these. This requirement can also be stated in the ideal/real model paradigm: If there were an inherently trusted entity in an ideal model, then it could have received the dictionary from the dictionary provider and the queries from users, and it could have sent responses to the users. In that case no other information is leaked to any entities. We require that a solution in the real world, which does not have an inherently trusted entity, does not disclose more information than in the ideal model.

R2. Response latency: The service must respond to every query within an acceptable time (e.g. in the order of seconds for the malware checking use case). This response latency must be sustainable.

R3. Server scalability: The service must be able to sustain a level of overall throughput (i.e. queries processed per second) that is sufficient for the intended use case (e.g. in the order of thousands of queries per second for the malware checking use case).

R4. Accuracy: The service must never respond with a false negative. The false positive rate must be within the acceptable limits for the intended use case.

V. THE CAROUSEL APPROACH

Fig. 2. Overview of the carousel approach. [Diagram labels: User; Lookup Server, comprising the REE with the Client App. (CA) and the TEE with the Trusted App. (TA); Dictionary provider; Dictionary (X) = x1, x2, ..., xn; Encode; Dictionary Representation (Y) = y1, y2, ..., yn'; Queries (Q); Query Representation (S); Results; two secure channels carrying qi ∈ {0,1}^128 and ri ∈ {0,1}^1.]

To meet the requirements defined in the preceding section, TA needs a mechanism for accessing the dictionary (X) without leaking any information about the users’ queries (Requirement R1). The naive approach of accessing specific elements of X in the REE violates this requirement because the adversary can observe which dictionary items are being checked. Canonically, this type of problem could be solved using ORAM where TA is the ORAM processor and REE stores the encrypted shuffled database. However, as an alternative approach, we propose a new carousel design pattern in which a representation of the dictionary is continuously circled through TA. As we demonstrate in Section VII, the fundamental advantage of our carousel approach is that it supports efficient processing of batches of queries in a single carousel cycle. Namely, whereas using ORAM to answer a batch of m queries requires accessing O(m log n) dictionary items, these m queries can be answered by a single cycle that reads n dictionary items. When the size of the batch increases, the latter approach becomes more efficient. Figure 2 gives an overview of our carousel approach, which consists of four phases: dictionary representation, query representation, carousel processing and response construction.
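The asymptotic comparison above can be made concrete with a rough count of dictionary items touched per batch. The sketch below ignores all constant factors (bucket sizes, encryption, write-back, per-entry processing cost), which in practice strongly affect where the two approaches cross over, so the numbers are illustrative only; the function names are hypothetical.

```python
from math import ceil, log2


def oram_items_accessed(m, n):
    """Items touched by m independent ORAM lookups, counting log2(n) per lookup."""
    return m * ceil(log2(n))


def carousel_items_accessed(m, n):
    """Items touched by one carousel cycle: n, independent of the batch size m."""
    return n


def crossover_batch_size(n):
    """Smallest m for which m * ceil(log2(n)) >= n under this simplified count."""
    per_query = ceil(log2(n))
    return -(-n // per_query)   # integer ceiling division
```

For a dictionary of 2^26 entries this simplified count gives 26 items per ORAM lookup, so a single carousel cycle touches fewer items once the batch holds a few million queries; the measured breakdown points reported in Section VII, not this count, are the relevant basis for comparison.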

Dictionary Representation. To avoid leaking information about a query’s position in the dictionary (X), TA cannot send a response until it has completed one full carousel cycle subsequent to the query’s arrival. Therefore, the two factors influencing response latency are the size of the dictionary and the efficiency with which it is processed. To minimize latency, the Dictionary Provider transforms the dictionary X into a more compact and/or more efficient data structure, which we call the dictionary representation Y = {y1, ..., yn′} and which is cycled through TA. The choice of data structure therefore has a significant impact on the performance of the system, and although several well-known data structures support efficient membership tests, it is not obvious which is best-suited for the carousel setting. In Section VI we discuss these different data structures and in Section VII we experimentally evaluate their performance.

Query representation. TA transforms queries Q into representations S = {s1, ..., sm′} in a similar manner to Y and maintains them in sorted order. TA stores Q in its secure memory along with the queries’ corresponding times of arrival and references to their representations in S. When more than one query maps to the same representation sk, TA only maintains a single sk in S, but adds dummy query representations in S and keeps track of the number of queries referencing sk. Keeping a single instance of sk in S irrespective of the number of queries that map to it allows the TA to only operate on the single sk. This is required to prevent information leakage, since the adversary is also permitted to submit queries.

Carousel processing. To process queries, TA cycles through Y and scans its contents in order to answer the received queries. CA divides Y into several chunks and invokes TA sequentially with each chunk as input along with waiting queries. We assume that queries arrive continuously and breaking Y into chunks allows queries to be passed to TA without having to wait for a full carousel cycle. Incoming queries are associated with the identifier of the chunk with which they arrived, which is defined as their time of arrival. At the beginning of each chunk, TA updates S based on the newly received queries. TA then compares each entry in the chunk with S and records the results. This process is repeated for each chunk.

Response construction. When a query has waited for a full cycle of Y, TA processes the accumulated results and computes the response. Responses are sent to the users at the end of each invocation of TA. Once a response has been sent, TA removes the query from Q and removes its representations from S if there are no other queries associated with those representations, otherwise it removes dummy representations from S.

Avoiding information leakage. As explained in Section III, the adversary ADV can observe memory access patterns for all non-private memory (including CA’s memory), and can measure the time taken to respond to each query. To provide the strongest possible security guarantees, we assume that ADV knows exactly which entry in Y is currently being processed by TA. If ADV could determine whether or not this entry is relevant to the current set of queries, which in the worst case could be a single query, this would leak information. Furthermore, for a given query, it is possible that TA could respond before completing a full carousel cycle (e.g. if the relevant information was found at the start of Y). However, since Y is not secret and ADV knows which chunk is currently being processed by TA, the time between query arrival and response might also leak information about both the query and response. Therefore, in the carousel approach, we can satisfy Requirement R1 by a) performing constant-time processing for every entry in Y, and b) ensuring that every query remains in TA for exactly one full carousel cycle. In other words, the number of operations TA performs per chunk must be independent of S, and the query response latency must be independent of the queried value.
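The two conditions above can be illustrated with a toy simulation of the carousel. This is a sketch under simplifying assumptions, not our implementation: it uses an unmodified dictionary split into chunks, compares every entry against every pending query instead of a sorted query representation S, and Python itself gives no real constant-time guarantees. The function name `run_carousel` and the `arrivals` format are hypothetical.

```python
def run_carousel(chunks, arrivals):
    """Cycle `chunks` until every query has seen exactly one full cycle.

    `arrivals` maps an invocation step to the queries that arrive with that
    invocation. Returns (query, is_member, answered_at_step) tuples.
    """
    k = len(chunks)
    pending = []                      # each item: [query, hit_flag, chunks_seen]
    responses = []
    last_step = max(arrivals) + k - 1
    for step in range(last_step + 1):
        for q in arrivals.get(step, []):
            pending.append([q, 0, 0])
        # Every entry is compared with every pending query, and the result
        # is accumulated with a bitwise OR rather than an early exit.
        for entry in chunks[step % k]:
            for item in pending:
                item[1] |= int(entry == item[0])
        still_pending = []
        for item in pending:
            item[2] += 1
            if item[2] == k:          # full cycle completed: respond now
                responses.append((item[0], bool(item[1]), step))
            else:
                still_pending.append(item)
        pending = still_pending
    return responses
```

For example, with three chunks a query arriving at step 0 is answered at step 2 whether its match lies in the first chunk, the last chunk, or nowhere, so the response time reveals nothing about the queried value.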

As shown in Table I, we experimentally measured the average time to cycle a 116 MB dictionary representation (Y) through TA in 1 MB chunks, for different memory access patterns. The first column shows the time required to perform 116 TA invocations without any memory access or computation. As confirmed by this column, a main strength of Intel SGX is that its enclave entries/exits add very little overhead. The second column shows the time for accessing one byte per 4 KB page in Y, in addition to TA invocations. The third column shows the total time for accessing the entirety of Y, also in addition to TA invocations. All read operations were performed using the maximum register size on each platform (i.e. 32 bit on Kinibi-TZ and 64 bit on Intel SGX). For Intel SGX, TA invocation overhead is negligible, so the overhead shown in the last column is almost entirely due to the read operations. We can see that more memory accesses result in longer carousel cycling time for both platforms. However, even if a dictionary representation allows otherwise, we always access the entirety of Y to ensure query privacy. Therefore, the last column represents a lower bound for carousel cycling time (and hence query response latency).

TABLE I. AVERAGE CAROUSEL CYCLE TIME OF A 116 MB DICTIONARY REPRESENTATION UNDER DIFFERENT ACCESS PATTERNS.

             Memory access patterns
Platform     No reads (TA invocations only)   One read per 4 KB page   Read every byte
Kinibi-TZ    55.84 ms (±6.04)                 159.06 ms (±34.52)       234.44 ms (±55.94)
Intel SGX    0.37 ms (±0.02)                  0.67 ms (±0.03)          10.32 ms (±0.59)

VI. DICTIONARY REPRESENTATION

As explained in the preceding section, the performance of the system can be significantly improved by choosing an efficient data structure with which to represent the dictionary. Although there are various data structures that support efficient membership tests in general (e.g. Bloom filter [11]), it is not obvious which of these is best-suited for use in the carousel approach. Since query latency depends on the length of the dictionary and the cost of processing each entry, the ideal dictionary representation would minimize both of these aspects (Requirement R2). Furthermore, the chosen data structure must support efficient batch processing (i.e. answering multiple queries in each carousel cycle), since this is the fundamental advantage of the carousel approach and also improves server scalability (Requirement R3). In this section we explore different data structures for representing the dictionary. We first discuss the naive approach of using an unmodified dictionary, but show that this is always less space-efficient than our new Sequence of Differences representation in which we encode the differences between successive dictionary entries. We then describe how to use two well-known data structures, Bloom filter and 4-ary Cuckoo hash, in the carousel setting. Finally, we compare the size and processing complexity of these different representations.

As explained in Section III-A, our motivating scenario of cloud-based malware checking can tolerate a low but non-zero false positive rate (FPR). We argue that this is also a reasonable assumption for other such applications of a PMT protocol. This is important because it enables us to use data structures with an inherently non-zero FPR (e.g. Bloom filter) or to reduce the size of the dictionary representation (e.g. using shorter hashes in the 4-ary Cuckoo hash representation). We denote the acceptable FPR as $2^{-\varepsilon}$ and explain how this is determined for each representation.

A. Naive Approach

The most naive approach is to cycle the unmodified dictionary entries through TA (i.e. Y = X) and compare these against the queries. This is suboptimal because the dictionary entries could be arbitrarily large, thus increasing the size of Y unnecessarily.


Given that it is acceptable to have an FPR of $2^{-\varepsilon}$, a slightly better naive method is to map each dictionary entry $x_i$ uniformly to a point in a domain of size $n \cdot 2^{\varepsilon}$. For this, the FPR can be calculated as follows:

$$\mathrm{FPR} = 1 - \left(1 - \frac{1}{n \cdot 2^{\varepsilon}}\right)^{n} \approx 1 - e^{-1/2^{\varepsilon}} \approx \frac{1}{2^{\varepsilon}} = 2^{-\varepsilon}.$$

Therefore $(\varepsilon + \log n)$ bits are needed in order to represent an item, and thus the length of Y is $n \cdot (\varepsilon + \log n)$ bits. The same mapping is applied to the queries such that the resulting query representations can be compared against Y. However, this approach always results in a larger Y compared to our new Sequence of Differences representation, as described in the following subsection. We therefore elide the naive approach from our comparisons and use the Sequence of Differences representation as our baseline.
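As a quick numerical check of the approximation above, the following Python sketch evaluates the exact expression for the parameters used later in the paper (n = 2^26, ε = 10). The function names are illustrative, and the logarithm is assumed to be base 2.

```python
import math


def naive_fpr(n, eps):
    """Exact FPR = 1 - (1 - 1/(n * 2^eps))^n, computed in a numerically stable way."""
    p = 1.0 / (n * 2 ** eps)
    return -math.expm1(n * math.log1p(-p))


def naive_size_bits(n, eps):
    """Length of Y in bits: n * (eps + log2 n)."""
    return n * (eps + math.ceil(math.log2(n)))


fpr = naive_fpr(2 ** 26, 10)
size = naive_size_bits(2 ** 26, 10)
print(f"FPR = {fpr:.6e}, target = {2 ** -10:.6e}, |Y| = {size} bits")
```

For these parameters each item needs 36 bits, which is the hash length used for the Sequence of Differences entries in Section VII.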

B. Sequence of Differences

Dictionary representation. Compared to the naive approach, we can reduce the size of Y by representing only the differences between successive items, with minimal additional processing cost. We first hash each entry $x_i$ to a value $h_i$ of length $(\varepsilon + \log n)$, and sort the resulting values: $h_0 < h_1 < \cdots < h_n$. Alternatively, $h_i$ can simply be a truncation of $x_i$, since the entries are already uniformly distributed in the malware checking case. Instead of storing the entries themselves in Y, we only store the differences between successive entries: $y_0 = h_0$, $y_1 = h_1 - h_0$, ..., $y_n = h_n - h_{n-1}$. If multiple entries result in the same $h_j$, we only keep one copy of $h_j$ in Y to avoid leaking information.

The advantage of this approach is that the length of the differences ($y_i$ values) is smaller than the length of items themselves ($h_i$ values). However, this approach requires us to choose a fixed number of bits to represent all differences.

We ran a simulation which showed that the probability of a difference being larger than $(2^{\varepsilon+2} - 1)$ is approximately 2%. Therefore, we chose to use $(\varepsilon + 2)$ bits to represent a single difference. In the vast majority of cases, the difference $y_i = h_i - h_{i-1}$ is less than $2^{\varepsilon+2}$, so we insert it directly into Y. Otherwise, $y_i = p \cdot (2^{\varepsilon+2} - 1) + b$, where $b < 2^{\varepsilon+2} - 1$. In this case we insert p entries of "zero" (each $\varepsilon + 2$ bits) into Y, followed by b (with $\varepsilon + 2$ bits as usual). Note that since the actual difference $y_i$ is always greater than 0, it is easy to recognize these dummy entries. We expect to add about 0.02n dummy entries, so the total size of Y remains approximately $1.02(\varepsilon + 2)n$.
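The encoding described above can be sketched in Python as follows. The function names are hypothetical. The sketch makes two assumptions that the text does not spell out: all hash values are positive (so that the first difference is non-zero), and when a large difference is an exact multiple of 2^(ε+2) − 1 (remainder b = 0), one fewer dummy entry is written, followed by the largest non-zero value, so that the final entry is never mistaken for a dummy.

```python
def encode_differences(hashes, eps):
    """Encode sorted, de-duplicated hashes as (eps + 2)-bit differences."""
    max_step = (1 << (eps + 2)) - 1          # 2^(eps+2) - 1
    y, prev = [], 0
    for h in sorted(set(hashes)):            # keep only one copy of repeated hashes
        diff = h - prev
        if diff <= 0:
            raise ValueError("hash values are assumed to be positive")
        p, b = divmod(diff, max_step)
        if b == 0:                           # assumption: avoid a zero final entry
            p, b = p - 1, max_step
        y.extend([0] * p)                    # p dummy "zero" entries
        y.append(b)                          # the non-zero remainder
        prev = h
    return y


def decode_differences(y, eps):
    """Recover the sorted hash values from the sequence of differences."""
    max_step = (1 << (eps + 2)) - 1
    out, h = [], 0
    for d in y:
        if d == 0:
            h += max_step                    # dummy entry: advance, emit nothing
        else:
            h += d
            out.append(h)
    return out
```

With ε = 2 (4-bit entries, maximum step 15), the hashes 3, 5, 20, 50, 65, 100 encode to 3, 2, 15, 0, 15, 15, 0, 0, 5, and decoding returns the original values.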

Query representation. TA maps each query to its representation in S by applying the same hashing operation as for $x_i$. TA maintains S as a sorted list with m unique items, each $(\varepsilon + \log n)$ bits in length.

Carousel processing. The algorithm in Figure 3 shows the carousel processing for a chunk of Y. TA first recovers the value of the current dictionary entry $h_i$ by adding the difference $y_i$ to the previous entry $h_{i-1}$. For each recovered entry $h_i$, it uses binary search to check and mark whether $h_i$ is in S. With binary search TA spends equal time processing every $h_i$, thus avoiding information leakage.² The time taken for this binary search must not depend on the values of the current queries. Overall, it takes $O(n \lceil \log m \rceil)$ operations to process all entries $h_i$. Whenever TA encounters $y_i = 0$, it identifies it as a dummy item and adds $(2^{\varepsilon+2} - 1)$ to $h_{i-1}$, but it continues without performing a binary search for this $h_i$, since Y is already known to the adversary. This algorithm ensures that TA spends equal time for non-zero entries in Y.

    Y: Dictionary representation
    S: Query representation
    h: Current entry
    i = 0
    while i is in the current chunk do
        if Y[i] equals 0 then
            h ← h + 2^(ε+2) − 1
        else
            h ← h + Y[i]
            binary search of h in S
        end if
        i++
    end while

Fig. 3. Membership test using Sequence of Differences

Response construction. When a query completes one carousel cycle, TA generates its response by checking if the corresponding item in S is marked as a match.
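The chunk processing and marking steps for the Sequence of Differences representation can be sketched as below. The helper names are illustrative, and the fixed iteration count is one possible way to make the number of search steps independent of the query values; Python offers no real constant-time guarantees, so this only illustrates the control flow.

```python
def fixed_step_search(sorted_queries, target):
    """Binary search that always runs the same number of iterations for a given m."""
    m = len(sorted_queries)
    if m == 0:
        return -1
    steps = m.bit_length()                   # equals ceil(log2(m + 1))
    lo, hi = 0, m                            # search window [lo, hi)
    for _ in range(steps):
        active = lo < hi                     # once the window is empty, do no-op steps
        mid = (lo + hi) // 2
        probe = sorted_queries[min(mid, m - 1)]
        if active and probe < target:
            lo = mid + 1
        elif active:
            hi = mid
    return lo if lo < m and sorted_queries[lo] == target else -1


def process_chunk(chunk, h, sorted_queries, marks, eps):
    """Process one chunk of differences; h carries over between chunks."""
    max_step = (1 << (eps + 2)) - 1
    for d in chunk:
        if d == 0:
            h += max_step                    # dummy entry, no search (Y is public)
        else:
            h += d
            idx = fixed_step_search(sorted_queries, h)
            if idx >= 0:
                marks[idx] = True            # consulted when the response is built
    return h
```

After a query has seen every chunk once, its response is simply the corresponding entry of `marks`.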

C. Bloom Filter

Dictionary representation. A Bloom filter is a data structure used for efficient membership testing. It is an N-bit array B initialized with 0s, together with l independent hash functions $H_i(\cdot)$ whose output is uniformly distributed over $[0, N - 1]$ [11]. To add an entry x to the filter, we compute l array positions: $h_i = H_i(x)$, $\forall i$, $1 \le i \le l$, and set each of these l positions in B to 1 ($B[h_i] = 1$). To test if an item is in the dictionary, l positions are calculated using the same set of hash functions. If any of these positions in B is set to 0, we can conclude that the item is not in the dictionary. Otherwise, the item is declared to be in the dictionary. The false positive rate is:

$$\mathrm{FPR}_{bf} = \left(1 - \left(1 - \frac{1}{N}\right)^{nl}\right)^{l} \approx \left(1 - e^{-nl/N}\right)^{l}.$$

For an FPR of $2^{-\varepsilon}$, an optimized Bloom filter needs $1.44\varepsilon n$ bits to store n items [41]. We represent the Bloom filter as a bit array Y, which is the dictionary representation.

Query representation. For each query, TA calculates l byte positions in the Bloom filter and adds the positions to S in sorted order.
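The sizing rule above can be tried out with a small Python sketch. It assumes l = ε hash functions (consistent with the 10 hash functions used for ε = 10 in Section VII) and uses salted SHA-256 as a stand-in for the hash functions; the paper's implementation uses lookup3 instead. All function names are illustrative.

```python
import hashlib
import math


def bloom_parameters(n, eps):
    """Filter size of about 1.44 * eps * n bits; l = eps hashes is an assumption."""
    return math.ceil(1.44 * eps * n), eps


def expected_fpr(n, num_bits, num_hashes):
    """Approximate FPR (1 - e^(-n*l/N))^l."""
    return (1.0 - math.exp(-n * num_hashes / num_bits)) ** num_hashes


def bit_positions(item, num_bits, num_hashes):
    """Stand-in hash functions: SHA-256 salted with the hash index."""
    return [int.from_bytes(hashlib.sha256(bytes([k]) + item).digest()[:8], "big") % num_bits
            for k in range(num_hashes)]


def build_filter(items, num_bits, num_hashes):
    bits = bytearray((num_bits + 7) // 8)
    for item in items:
        for p in bit_positions(item, num_bits, num_hashes):
            bits[p // 8] |= 1 << (p % 8)     # bit order within a byte is an assumption
    return bits


def contains(bits, item, num_bits, num_hashes):
    return all((bits[p // 8] >> (p % 8)) & 1
               for p in bit_positions(item, num_bits, num_hashes))
```

Because 1.44 is slightly below 1/ln 2, the expected FPR computed this way comes out marginally above 2^(-ε).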

² If TA was simply comparing $h_i$ to items in S until finding $s_j \le h_i$, an adversary ADV, who knows Y, could measure response latency to learn whether a certain query is in S. (Note that ADV can also insert false queries that could change response latency.)

Carousel processing. The algorithm in Figure 4 shows the carousel processing for a chunk of Y. The algorithm essentially copies from the carousel all bytes containing data required to decide whether the queries are in the dictionary (namely, the bytes to which the queries are mapped by the hash functions). R is a list of bytes for storing results, initialized to zeros. For each byte in the current chunk, TA checks whether the byte is needed, as indicated by S. If so, it copies the byte to R. Otherwise, it copies the byte to a dummy location dummy_byte. TA performs an equal number of operations for every byte in Y.

    Y: Dictionary representation
    S: Query representation
    R: A list of empty bytes
    dummy_byte: a byte used to do dummy operations
    dummy_int: an integer used to do dummy operations
    i = 0
    j = 0
    while i is in the current chunk do
        if S[j] equals i then
            R[j] ← Y[i]
            j++
        else
            dummy_byte ← Y[i]
            dummy_int++
        end if
        i++
    end while

Fig. 4. Membership test using Bloom Filter

Response construction. Once the carousel processing completes, TA goes through all the queries, links them back to the query representation, and inspects the corresponding values in R to check if all bit positions for a particular query are set.
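The scan of Figure 4 and the subsequent response construction can be sketched in Python for a single query as follows. The sketch assumes that the byte positions in S are sorted and free of duplicates, and that bits are numbered from the least significant bit of each byte; both are assumptions for illustration, as are the function names.

```python
def scan_chunk(chunk, base, positions, results, j):
    """Scan one chunk of Y; copy bytes at the wanted positions into results."""
    dummy_byte = 0
    dummy_int = 0
    for offset, value in enumerate(chunk):
        i = base + offset                    # absolute byte index in Y
        if j < len(positions) and positions[j] == i:
            results[j] = value
            j += 1
        else:
            dummy_byte = value               # dummy copy, mirrors the real branch
            dummy_int += 1
    return j


def all_bits_set(bit_positions, positions, results):
    """Response construction: check every queried bit in the copied bytes."""
    fetched = dict(zip(positions, results))
    return all((fetched[b // 8] >> (b % 8)) & 1 for b in bit_positions)


def membership(y, chunk_size, bit_positions):
    """Run one full carousel cycle over y for a single query."""
    positions = sorted({b // 8 for b in bit_positions})
    results = [0] * len(positions)
    j = 0
    for base in range(0, len(y), chunk_size):
        j = scan_chunk(y[base:base + chunk_size], base, positions, results, j)
    return all_bits_set(bit_positions, positions, results)
```

The same scan applies to the Cuckoo hash table of Figure 5, with table slots in place of bytes.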

D. 4-ary Cuckoo hash

Cuckoo hash is another data structure for efficient membership tests [42]. We use a variant, called d-ary Cuckoo hash, with four hash functions, since it utilizes approximately 97% of the hash table (compared to less than 50% utilization in standard Cuckoo hash with two hash functions) [20].

Dictionary representation. Four hash functions $H_1$ to $H_4$ are used to obtain four candidate positions for a given dictionary entry $x_i$ in Y. During insertion, $x_i$ is hashed to an $(\varepsilon+2)$-bit value $y_i$ (e.g. by truncating $x_i$). This value is inserted into the first available candidate position. If all 4 positions for a given $y_i$ are already occupied (say, by values $y_1, y_2, y_3, y_4$), $y_i$ is inserted by recursively relocating one of these values $y_j$ into one of its 3 other positions (since each $y_j$ has a choice of 4 positions in Y). In the worst case, this recursive strategy could take many relocations or get into an infinite loop. The standard solution for this problem is to perform a full rehash, but the probability for this event is shown to be very small. Kirsch et al. [30] introduce a very small constant-sized auxiliary stash for putting the current unplaced item when a failure occurs. They show, by both simulation and analysis, that this strategy can dramatically reduce the insertion failure probabilities. Note that this dictionary construction process affects neither the performance nor the privacy guarantees of our system, since it is performed by the Dictionary Provider and takes place before any queries arrive. The dictionary is always assumed to be known to the adversary.

    Y: Dictionary representation
    S: Query representation
    R: A list of (ε+2)-bit empty values
    dummy_value: a value used to do dummy operations
    dummy_int: an integer used to do dummy operations
    i = 0
    j = 0
    while i is in the current chunk do
        if S[j] equals i then
            R[j] ← Y[i]
            j++
        else
            dummy_value ← Y[i]
            dummy_int++
        end if
        i++
    end while

Fig. 5. Membership test using Cuckoo Hash

To test the existence of an element $x_i$, we need only calculate its four candidate positions in Y, and check if any of these contain $x_i$. We use a Cuckoo hash table of 1.03n slots, storing a hash of $(\varepsilon + 2)$ bits in each slot. The FPR is $4 \cdot 2^{-(\varepsilon+2)} = 2^{-\varepsilon}$.

Query representation. For each query, TA calculates 4 positions in Y and adds these positions into the sorted list S.

Carousel processing. The algorithm in Figure 5 shows how TA does carousel processing for the current chunk of the Cuckoo hash table. R is a list of $(\varepsilon+2)$-bit values, initialized to zeros, to store results. For each entry in Y, TA checks whether its position is contained in S. If so, TA copies the entry to R. Otherwise, it copies it to a dummy location dummy_value. It is clear that TA performs an equal number of operations for each entry in Y.

Response construction. TA links a query back to its query representation, and compares it with the four corresponding values in R. A match with any one of these values indicates that the query is most probably in the dictionary X.
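The construction and lookup steps for 4-ary Cuckoo hash can be sketched in Python as below. This is not the authors' implementation: salted SHA-256 stands in for the four hash functions, relocation picks a random victim (a random-walk variant), and `max_kicks`, `seed`, and all function names are illustrative assumptions. The table holds full items during construction because relocation needs them to recompute candidate positions; only the (ε+2)-bit fingerprints would be cycled through TA.

```python
import hashlib
import random


def _hash(item, salt):
    return int.from_bytes(hashlib.sha256(bytes([salt]) + item).digest()[:8], "big")


def positions(item, slots):
    """Four candidate positions for an item (salts 0..3)."""
    return [_hash(item, k) % slots for k in range(4)]


def fingerprint(item, eps):
    """(eps + 2)-bit value stored in the table (salt 4)."""
    return _hash(item, 4) % (1 << (eps + 2))


def build_table(items, slots, max_kicks=500, seed=1):
    rng = random.Random(seed)
    table = [None] * slots
    stash = []                               # small auxiliary stash for failed insertions
    for item in items:
        cur = item
        for _ in range(max_kicks):
            cand = positions(cur, slots)
            free = [p for p in cand if table[p] is None]
            if free:
                table[free[0]] = cur         # first available candidate position
                cur = None
                break
            victim = rng.choice(cand)        # relocate an existing occupant
            table[victim], cur = cur, table[victim]
        if cur is not None:
            stash.append(cur)
    return table, stash


def to_fingerprints(table, eps):
    """The dictionary representation Y: one fingerprint (or None) per slot."""
    return [None if x is None else fingerprint(x, eps) for x in table]


def lookup(query, y, eps):
    fp = fingerprint(query, eps)
    return any(y[p] == fp for p in positions(query, len(y)))
```

Every item that ends up in the table sits at one of its own four candidate positions, so `lookup` finds it; items left in the stash would need to be checked separately.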

E. Comparison

Table II shows a comparison of these three dictionary representations. For a dictionary of $n = 2^{26}$ entries, and FPR of $2^{-\varepsilon} = 2^{-10}$, the sequence of differences takes the least amount of space. When the number of queries m is much smaller than the dictionary size N, Cuckoo hash will be fastest (asymptotically) to process m queries.

In comparison, note that an ORAM based approach will require $O(m \log N)$ time for processing m queries. Asymptotically, this time will become higher than that used by the carousel approaches at about the point where $m > N / \log N$. We compare the run times of all methods, and of the ORAM based approach, in Section VII.


TABLE II. COMPARISON OF DICTIONARY REPRESENTATIONS.

Dictionary Representation   Dictionary size (N)   Size for ε = 10, n = 2^26   Time for processing m queries
Sequence of Differences     1.02(ε+2)n            97.74 MB                    O(N log m)
Bloom Filter                1.44εn                115.2 MB                    O(10m + N)
4-ary Cuckoo Hash           1.03(ε+2)n            98.88 MB                    O(4m + N)

VII. EXPERIMENTAL EVALUATION

To evaluate the performance of our Carousel approach, we implemented the full system (including multiple data structures) on the two most prominent hardware security architectures currently available: ARM TrustZone and Intel SGX. In order to compare our Carousel approach with ORAM, we also implemented the essential components of a functional Path ORAM prototype on both ARM TrustZone and Intel SGX. All performance measurements were obtained using real hardware.

A. Environment Setup

Kinibi-TZ. We used a Samsung Exynos 5250 development board from Arndale with a 1.7 GHz dual-core ARM Cortex-A15 processor to implement the lookup server.³ It runs Android OS (version 4.2.1) as the host OS and Kinibi OS as the TEE OS. Kinibi allows authorized trusted applications to execute inside the TEE. We used the ARM GCC compiler with Kinibi-specific libraries for compilation.

Since Kinibi limits TA private memory to a total of 1 MB, the memory available for heap and stack data structures is only about 900 KB. This limits the number of queries that can be processed concurrently. Further, Kinibi on the development board only allows CA to share 1 MB of additional memory with TA. We used this memory to transfer chunks of Y as well as to submit queries and retrieve responses. This placed an upper bound on the size of the chunks shared with TA at a given time. CA includes the metadata (e.g. queries per chunk, chunk identifiers, and number of items from Y in the chunk) in TA invocation parameters. To obtain timing measurements, we used the gettimeofday() function, a Linux system call, which provides µs resolution.⁴

Intel SGX. We used an SGX-enabled HP EliteDesk 800 G2 desktop PC with a 3.2 GHz Intel Core i5 6500 CPU and 8 GB of RAM.⁵ It runs Windows 7 (64 bit version) as the host OS, with a page size of 4 KB. We used the Microsoft C/C++ compiler and the Intel SGX SDK for Windows. Since we are practically unconstrained by code size, we configured the compiler to optimize execution speed (O2) and used the same compiler options for all experiments. To obtain timing measurements, we used the Windows QueryPerformanceCounter (QPC) API, which provides high resolution (<1 µs) time stamps suitable for time-interval measurement.⁶

For Intel SGX, we have to account for the fact that TA does not have private memory, and thus the adversary can observe TA's memory access pattern at page-level granularity (as discussed in Section II). To overcome this challenge, we designed each SGX TA such that its memory access pattern does not depend on any secret data. A central primitive in these designs is a page-sized data container, which we refer to as an oblivious page. Whenever a private data structure spans more than one oblivious page, we perform the same memory access operations on all pages. Since we assume the adversary can also measure the timing between these memory accesses, we ensure that this does not depend on any private information. These challenges were also recently identified by Gupta et al. [25], who used a similar approach of ensuring constant-time operations and performing uniform memory accesses to avoid leaking information. We do not attempt to defend against probabilistic cache-based side-channel attacks, but we argue that these would not be feasible against our implementation since we do not perform repeated operations using any piece of secret data (e.g. as required for the attacks by Liu et al. [33]). If necessary, our mitigation techniques could be adapted to these types of attacks, but this would have an equal impact on the performance of all data structures in both our carousel and ORAM experiments, so the overall comparisons and conclusions are likely to remain unchanged.

³ http://arndaleboard.org/
⁴ http://man7.org/linux/man-pages/man2/gettimeofday.2.html
⁵ SGX is not yet available in server-class platforms, which would further improve the performance of our system.
⁶ https://msdn.microsoft.com/en-us/library/windows/desktop/ms644904

B. Implementing PMT: Carousel Methods

For all hash table lookups, we used lookup3, a fast non-cryptographic hash function from the set of Jenkins hash functions [28]. For AES operations on Kinibi-TZ, we used the mbed TLS cryptography library, which is designed for embedded systems.⁷ For AES operations on SGX, we used the official Intel-supplied trusted cryptography library (sgx_tcrypto). We generated a dictionary of $n = 2^{26}$ items, each represented as 128 bits, drawn from a uniform random distribution.⁸ We used the data structures described in Section VI to generate Y. We assume that each user communicates with TA via a secure channel, which in our experiments was modelled as 128-bit AES encryption in CBC mode.

In all cases, we aimed to implement the dictionary representation data structures using an integer number of bytes so as to avoid additional bit-shift operations. However, for the sequence of differences and 4-ary Cuckoo hash in Kinibi-TZ, we represented items in Y as 12-bit structures (ε = 10) and operated on two items (3 bytes) at a time. Furthermore, we optimized our implementations to make use of the largest available registers on each platform (32-bit on Kinibi-TZ and 64-bit on Intel SGX) for read/write operations.

Sequence of Differences. Each dictionary entry was truncated to a 36-bit value ($h_i$) whilst maintaining the desired FPR (ε = 10). Entries in Y are 12-bit differences between two successive dictionary entries.

In Kinibi-TZ, Q is a linked list ordered by chunk identifier while S is maintained as a sorted array. Both Q and S are stored entirely in TA's private memory, which can accommodate a maximum of 12800 queries.

⁷ https://tls.mbed.org/
⁸ In a real deployment, this could be a hash of a mobile application package, which is customarily used by anti-malware vendors as a (statistically) unique package identifier.


In SGX, Q is stored as a sorted array spanning one or more oblivious pages. Given the size of a query and its associated metadata (i.e. query ID and result), a single oblivious page can accommodate up to 500 queries. If the number of concurrent queries exceeds 500, TA uses multiple oblivious pages but always performs the same number of operations on each page. This is achieved by including a dummy query on each oblivious page. The adversary is unable to distinguish these dummy operations from real operations since they take exactly the same amount of time and access the same oblivious page. Clearly, this results in many additional operations and thus has a significant impact on performance as the number of queries increases. However, one optimization, which arises from the requirement to perform the same operations on each page, is that we can process each page independently (i.e. each page can be processed as if it were the only page present). Although this does not negate the performance overhead, it is a significant improvement over a naive implementation.

4-ary Cuckoo Hash. We use Cuckoo hash with 4 hash functions to generate Y. We represent a query as a 12-bit value, and each of the four positions as 32-bit values. Each query representation therefore consists of a 32-bit position and a 16-bit buffer (R) to store the dictionary item corresponding to that position.

In Kinibi-TZ, we maintain S as a sorted linked list. The private memory can accommodate a maximum of 4500 queries.

In SGX, we again store S as a sorted array spanning one or more oblivious pages. In this scheme, we can only accommodate up to 170 queries on each oblivious page, since we must store four positions for each query. As in the previous scheme, if the number of queries exceeds this threshold, multiple oblivious pages are used, and must all be accessed uniformly. In addition to the previous optimization of treating these pages independently, we can further optimize by selecting hash functions that do not overlap with each other. Fotakis et al. [21] used this approach to simplify the algorithm, but in our case it can also provide a significant performance advantage. Using four non-overlapping hash functions essentially allows us to partition the dictionary representation into four regions, and consider only the query representations for one region at a time. We therefore allocate the four query representations to four different sets of oblivious pages, thus allowing up to 680 queries per set of four pages. When a particular region of the dictionary representation is being processed, we only operate on the pages corresponding to that region (if there are multiple such pages, the memory access must still be uniform for each of them).

Bloom filter. We use a Bloom filter with 10 hash functions, and thus represent each query as ten bit positions in Y. Each query representation consists of a 32-bit position value and an 8-bit buffer (R) to store the byte from Y which contains that position.

In Kinibi-TZ, both Q and S are maintained as linked lists in the TA's private memory. The private memory can accommodate a maximum of 1750 queries.

Bloom filter always requires more operations than 4-ary Cuckoo hash. Having confirmed this experimentally on Kinibi-TZ, we elide the repetition of this experiment on SGX. The implementation follows the same principles as that of the Cuckoo hash scheme.

C. Implementing PMT: Cuckoo-on-ORAM

Since ORAM itself is not specifically designed for PMT, we need to generate a suitable dictionary representation (Y) and store it in an ORAM database. We chose Cuckoo hash because it requires the fewest memory accesses. By comparison, each Bloom filter query requires 10 different accesses, and each binary search in the sequence of differences representation accesses at least 26 positions. TA is the ORAM processor whilst CA stores the encrypted shuffled database. When TA receives a query, it maps the query to four cuckoo positions in Y. It then accesses these four positions following an ORAM protocol to complete the PMT. Since ORAM was designed to hide the access patterns, the adversary ADV learns no information about which positions have been accessed.

We chose Path-ORAM as the baseline for comparison because of its simplicity and because the Goldreich-Ostrovsky lower bound of $O(m \log n)$ amortized lookups for m queries applies to all ORAM variants mentioned in Section II-B; for example, Ring-ORAM only has a 1.5x speedup over Path-ORAM in the secure-processor setting. Moreover, advanced parallel or asynchronous ORAM schemes require parallel computation, and are difficult to implement without leaking information; for example, TaORAM requires additional temporary data storage in TA, which in SGX must be made oblivious. A summary of our chosen parameters is shown in Table III.

TABLE III. PATH ORAM PARAMETRIZATION.

Block size   Node size   Tree size    Tree height   Stash size
4 KB         4 blocks    6329 nodes   13            6 KB

We set the block size to 4 KB and each node of the tree contains 4 blocks. Our 98.88 MB dictionary (Table II) therefore required 6329 nodes, which results in a tree of height 13. It required a 6 KB position map which can easily be stored in TA's private memory.

Although the Path ORAM algorithm is relatively simple, implementing it in full has been found to be quite complicated [10], and is not required for this comparison. Instead, we prototyped the main operations and in all cases chose options that favor the ORAM implementation. This partial prototype therefore represents a generous upper bound on the performance of any full implementation.

Kinibi-TZ. In Kinibi-TZ, we avoided maintaining the stash required by Path ORAM. Instead, while storing a path back, the nodes were re-encrypted and shuffled along the path and the position map updated accordingly. Again, this simplification favors ORAM in the comparison since maintaining a stash would increase the number of operations performed per query.
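The node count in Table III follows from the dictionary size and node size by simple arithmetic, as the following Python sketch shows. It assumes 1 KB = 1024 bytes and interprets the tree height as the number of levels of a complete binary tree; the variable names are illustrative.

```python
import math

eps, n = 10, 2 ** 26
dictionary_bytes = 1.03 * (eps + 2) * n / 8   # 4-ary Cuckoo hash table (Table II)
node_bytes = 4 * 4 * 1024                     # 4 blocks of 4 KB each

nodes = math.ceil(dictionary_bytes / node_bytes)
# Smallest number of levels L with 2^L - 1 >= nodes (complete binary tree).
levels = nodes.bit_length()

print(nodes, levels)
```

Under these assumptions the sketch reproduces the 6329 nodes and 13 levels listed in Table III.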

Intel SGX. Since Path ORAM assumes some amount of private memory, which is not available in SGX, we had to take additional steps to ensure that no information is leaked through the enclave's memory access pattern. As with previous schemes, we used the concept of an oblivious page. All private data structures are stored on oblivious pages, and whenever a data structure's size exceeds one page, we ensure that the same sequence of operations is performed on each page (e.g. by reading and writing dummy values).

Specifically, with the above parameters, the Path ORAM position map spanned four oblivious pages, and thus required four reads/writes for every read/write to the position map. Each node in the stash also takes up four pages. Reading a node into the stash does not require specific privacy protection (e.g. ADV may learn the location of a specific node in the stash without compromising privacy), and thus no additional operations are required. However, whenever a node is evicted from the stash, ADV must not be able to identify the evicted node. To achieve this, we allocate a stash output buffer, equal to the size of one node, within the enclave's secure memory. We then iterate over all nodes in the stash, copying the intended node into the output buffer and performing a constant-time dummy write to the output buffer for every other block. Since the stash output buffer is still in the enclave's secure memory, ADV cannot determine which node has been placed in this buffer. The contents of the output buffer are then encrypted and evicted as usual.⁹

We use the same optimization for Cuckoo hash as described earlier: the four hash functions are selected to have non-overlapping outputs. As above, this allows us to partition the dictionary representation into four different regions. In the case of ORAM, we construct four separate ORAM trees, such that each holds the values for a single region. This optimization improves performance in the Path ORAM case by reducing the size of each tree, and hence the path length and size of the position map. With this optimization, each tree's position map fits onto two oblivious pages.
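The stash eviction step can be illustrated with a masked-select copy, sketched below in Python. This is one possible way to touch every node identically while keeping only the intended one; it is a variant of, not necessarily identical to, the dummy-write approach described above, and Python provides no real constant-time guarantees. The function name `evict_node` is hypothetical.

```python
def evict_node(stash, target_index):
    """Copy stash[target_index] into an output buffer while touching every node."""
    node_len = len(stash[0])
    out = bytearray(node_len)                # stash output buffer, one node in size
    for idx, node in enumerate(stash):
        # mask is 0xFF for the intended node and 0x00 for all others
        mask = -(idx == target_index) & 0xFF
        for k in range(node_len):
            # Same read and write for every node; only the mask differs.
            out[k] = (out[k] & ~mask & 0xFF) | (node[k] & mask)
    return bytes(out)
```

For example, `evict_node([b"aaaa", b"bbbb", b"cccc"], 1)` returns `b"bbbb"` after reading all three nodes and writing the output buffer once per byte of each node.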

D. Performance evaluation

Batch Performance. Figures 6 and 7 show the total processing time for a single batch of queries using the different carousel schemes. Queries were sent in a batch at the beginning of each carousel cycle. To achieve the desired FPR, we used dictionary representations with ε = 10 on Kinibi-TZ, and ε = 14 on Intel SGX.¹⁰

Each point in the figures represents the average time for processing the batch over 1000 repetitions. The figures show that processing time increases with query load for all three carousel schemes. On both platforms, Difference-on-a-carousel has a longer processing time, because of having to do a binary search on S for every item in Y. On Intel SGX, the non-linear step-like behavior is caused by the use of multiple oblivious pages. Since the same number of operations must be performed on each page (to preserve privacy), each additional page causes a step increase in processing time. The width of each step corresponds to the number of queries that can be accommodated per page. In the Difference-on-a-carousel scheme, the steps take a logarithmic shape due to the binary search on the final (under-utilized) page, which eventually reaches full capacity.

⁹ The issue of preventing side-channels from leaking information about the ORAM queries made by the TA is similar to the issue of asynchronicity in ORAM queries discussed in [48]. That work considered a trusted proxy which coordinates ORAM queries from different users, and timing side-channels that leak information about the queries. We took a conservative approach of preventing such side-channels.

¹⁰ This is because it was more efficient to operate on byte-aligned data structures on the Intel SGX platform.

On Kinibi-TZ, under small query load (fewer than 500 queries), the batch processing time for Bloom-filter-on-a-carousel is lower than for the other carousel approaches. However, the processing time increases rapidly as the number of queries grows (beyond 1000 queries). The hardware was unable to support larger query batch sizes.

Cuckoo-on-a-carousel (CoaC) can handle more queries with smaller overhead than the other methods. Again, the non-linear performance characteristics in SGX are due to the use of multiple oblivious pages. Since this algorithm requires only pointer-based operations (i.e. no binary search), each step adds a constant number of additional operations, resulting in a flat step increase.

In contrast to the carousel schemes, Cuckoo-on-ORAM (CoO) provides a very fast response latency (9 ms) for a single query. However, queries are processed sequentially. For example, when 2,000 queries arrive at once, the latency of the final response will be 18 seconds on Kinibi-TZ, which is beyond the acceptable tolerance of a malware checking service. By comparison, on Kinibi-TZ, Cuckoo-on-a-Carousel (CoaC) takes only 1.83 seconds to process 2,000 queries. Results for Intel SGX show a similar pattern, although with significantly lower latencies (e.g. SGX takes 0.282 seconds to process 2,000 queries). The carousel schemes can therefore provide lower query response latencies when handling batches of queries.

Steady-state Performance. In addition to measuring batch query processing, we also compare the steady-state performance of CoaC and CoO, assuming a constant query arrival rate. Again we are primarily concerned with the average query response latency.

On Kinibi-TZ, CoO provides responses with a latency of 9 ms if the arrival rate is below 111 queries/second. On Intel SGX, this latency decreases to 1 ms for arrival rates below 1000 queries/second. Figure 8a and Figure 8b show the steady-state performance of Kinibi-TZ and Intel SGX for different query arrival rates (averaged over 1000 repetitions).

In order to identify the breakdown point where CoaC can no longer guarantee a bounded query response latency, we simulated the steady-state operation of CoaC with different query rates and calculated the average number of concurrent queries in TA (i.e. the occupancy) during each carousel cycle.
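The CoO figures quoted above follow from sequential processing at a fixed per-query latency. The short Python sketch below reproduces the arithmetic; the function names are illustrative.

```python
def sequential_batch_latency(num_queries, per_query_s):
    """Latency of the final response when queries are served one at a time."""
    return num_queries * per_query_s


def max_arrival_rate(per_query_s):
    """Highest arrival rate (queries/second) a sequential server can keep up with."""
    return 1.0 / per_query_s


print(sequential_batch_latency(2000, 0.009))   # about 18 seconds on Kinibi-TZ
print(int(max_arrival_rate(0.009)))            # about 111 queries/second on Kinibi-TZ
print(round(max_arrival_rate(0.001)))          # about 1000 queries/second on Intel SGX
```

Above these arrival rates the sequential queue grows without bound, which is why the comparison with CoaC focuses on the rate at which each scheme breaks down.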

Fig. 6. Kinibi-TZ: Total processing time for a batch of queries (average and variance over 1000 runs). [Plot of processing time (seconds) against number of queries for Bloom-Filter-on-a-Carousel, Cuckoo-on-a-Carousel, Differences-on-a-Carousel and Cuckoo-on-ORAM.]

Fig. 7. Intel SGX: Total processing time for a batch of queries (average and variance over 1000 runs). [Plot of processing time (seconds) against number of queries for Cuckoo-on-a-Carousel, Differences-on-a-Carousel and Cuckoo-on-ORAM.]

On Kinibi-TZ, we identify a query rate as sustainable when the average query occupancy remains stable at a level below the maximum number of concurrent queries the TA can handle (e.g. 4500 queries). We noticed that the carousel cycle time fluctuates due to OS scheduling on the platform, which occasionally causes the occupancy to reach the maximum capacity. Although occasional spikes can be tolerated, we consider the breakdown point to be the arrival rate at which the average occupancy consistently reaches this maximum capacity. Figure 9 and Figure 10 show the evolution of query occupancy in the Kinibi-TZ TA beyond 500 carousel cycles. At 1030 queries/second (Figure 10), the CoaC query response latency cannot be sustained. In contrast, at 1025 queries/second the linear regression on the occupancy (Figure 9) suggests that CoaC will provide a sustainable response latency, and we therefore conclude that the breakdown point is between 1025 and 1030 queries/second.
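The regression step can be illustrated with a short C sketch. It assumes that the occupancy has been sampled once per carousel cycle and that a non-positive least-squares slope is read as "sustainable"; both the function names and this decision rule are illustrative assumptions rather than a description of our measurement scripts.

```c
#include <assert.h>
#include <stddef.h>

/* Least-squares slope of occupancy[i] against the cycle index i = 0..n-1. */
double occupancy_slope(const double *occupancy, size_t n)
{
    assert(n >= 2); /* a line needs at least two samples */
    double sx = 0.0, sy = 0.0, sxy = 0.0, sxx = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = (double)i;
        sx += x;
        sy += occupancy[i];
        sxy += x * occupancy[i];
        sxx += x * x;
    }
    double dn = (double)n;
    return (dn * sxy - sx * sy) / (dn * sxx - sx * sx);
}

/* Assumed rule: occupancy that does not trend upwards is sustainable. */
int rate_is_sustainable(const double *occupancy, size_t n)
{
    return occupancy_slope(occupancy, n) <= 0.0;
}
```

Under this rule, a fit with a slightly negative slope (as reported for 1025 queries/second in Figure 9) would be classified as sustainable, and one with a positive slope (as reported for 1030 queries/second in Figure 10) would not.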

For Intel SGX, we noted that there is almost no variability in the batch performance results (i.e. the results do not change much over multiple runs), and leveraged this fact when ascertaining the steady-state breakdown point. We set the occupancy of the TA to a fixed value, and measured the time taken to process one carousel's worth of chunks. Dividing this fixed occupancy by the average carousel time gives the maximum query arrival rate which is sustainable at that occupancy level. Repeating this for multiple occupancy values yields the curve in Figure 8b. For Intel SGX, this is the best method for determining the breakdown point because of the non-linear behavior caused by the oblivious pages. Although each additional page allows more queries to be processed in a single carousel cycle, it also adds a performance penalty, which increases the carousel cycle time. Therefore, the maximum sustainable rate can only be achieved by fully utilizing every page. In practice, the system would employ an optimization algorithm to select the best number of pages to use in each situation. The steady-state query rates shown in Figure 8b would be the input parameters for this optimization algorithm.
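A minimal C sketch of this calculation is given below. The page_config structure, the function names and the simple "first configuration that covers the target rate" selection rule are assumptions made for illustration; the optimization algorithm itself is not specified here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int    pages;        /* number of oblivious pages in use */
    double occupancy;    /* queries held in the TA during one carousel cycle */
    double cycle_time_s; /* measured time for one carousel cycle, in seconds */
} page_config;

/* Maximum sustainable arrival rate (queries/second) at a fixed occupancy. */
double max_sustainable_rate(double occupancy, double cycle_time_s)
{
    assert(cycle_time_s > 0.0);
    return occupancy / cycle_time_s;
}

/* Returns the page count of the first configuration (in array order) whose
 * sustainable rate covers target_rate, or -1 if none does. */
int select_pages(const page_config *cfgs, size_t n, double target_rate)
{
    for (size_t i = 0; i < n; i++) {
        double rate = max_sustainable_rate(cfgs[i].occupancy,
                                           cfgs[i].cycle_time_s);
        if (rate >= target_rate)
            return cfgs[i].pages;
    }
    return -1;
}
```

If the configurations are listed in order of increasing page count, select_pages returns the smallest number of pages that can sustain the requested rate, which matches the observation that the algorithm should never use more pages than necessary.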

For Intel SGX, Figure 11 shows part of the same graph as Figure 8b with an additional set of data points. Each curve corresponds to a different number of oblivious pages. Note that the curves for multiple pages are not defined for lower query rates (i.e. towards the left boundary of the figure), since the algorithm will never use more pages than necessary. This figure shows that the maximum query rate is achieved when all the pages are fully utilized, but also that increasing the number of pages beyond a certain point does not increase the overall maximum query rate. When using more than two oblivious pages, the impact of the performance penalty for adding another page exceeds the benefit provided by that page, and ultimately results in a lower maximum rate. The same trend continues for larger numbers of oblivious pages, suggesting that two pages is the optimal configuration.

VIII. VARIATIONS AND EXTENSIONS

Query scalability. Query arrival rates that exceed the breakdown point can be supported by adding new hardware so that multiple TAs can run in parallel. The same dictionary representation can be replicated for each TA. Without loss of privacy, any incoming query can be routed to any TA (e.g. using any type of load balancing scheme) since each TA has its own dictionary representation.

Dictionary scalability. Our carousel approach is specifically designed around the parameters for the malware checking use case, including generous safety margins (e.g. a dictionary size of 2^26 entries). If larger dictionary sizes are required, the dictionary can be split into multiple subsets, each handled by a separate TA running on its own core or processor. To ensure query privacy, an adversary must not be able to identify which TA receives a given query. This requires a central dispatcher TA that multiplexes incoming requests to the worker TAs. The dispatcher may need to introduce additional decoy traffic to prevent the adversary from gaining information via traffic analysis.

Fig. 8. Steady-state processing time for uniform query arrival rates (average and variance over 1000 runs). Vertical lines indicate breakdown points. [Two log-scale plots of processing time (seconds) against query arrival rate (queries/second) for Cuckoo-on-a-Carousel and Cuckoo-on-ORAM: (a) Kinibi-TZ, annotated with 111 queries/second, 1025 queries/second and 1.24 seconds; (b) SGX, annotated with 1354 queries/second, 3720 queries/second and 0.36 seconds.]

Fig. 9. Kinibi-TZ: Evolution of occupancy at 1025 queries/second. [Plot of occupancy (concurrent queries in TA) against carousel cycles, with linear fit y = -2.99 · 10^-2 · x + 1,331.85.]

Compact representation vs. complexity of processing. More compact dictionary representations may lead to shorter carousel cycle times, but this may be offset by the complexity of processing the representation. Conditional clauses (if) in the carousel processing logic are particularly expensive. For example, we initially implemented the sequence of differences approach using Huffman encoding to represent the differences. This resulted in each difference being represented by ε + 1.35 bits on average, which is a significant reduction in dictionary size. In particular, as Huffman encoding is prefix-free, there was no need to add dummy entries (as explained in Section VI). However, the decoding process required processing variable-size suffixes, which resulted in an overall increase in the carousel cycling time.

Implementation optimization. By default, items in the dictionary representations are not necessarily aligned on byte boundaries (e.g. in the sequence of differences and Cuckoo hash methods, our desired FPR results in dictionary representations with 12-bit item length). Extracting such an item from a bit string requires multiple shift and add operations compared to byte-aligned representations. However, in Kinibi-TZ we still use 12-bit representations for these methods as we can represent two items with exactly 3 bytes. Similarly, we reduced the number of read operations by designing our algorithms to read data at the maximum register size of each platform.
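To make the 12-bit packing concrete, the following C sketch extracts one item from a buffer in which every 3 bytes hold two items. The bit layout (first item in the high-order bits of the group) and the function name get_item12 are assumptions chosen for illustration; the layout used in our implementation may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads the idx-th 12-bit item from buf. Assumed layout of each 3-byte group
 * (b0, b1, b2): item A = b0[7:0] b1[7:4], item B = b1[3:0] b2[7:0]. */
uint16_t get_item12(const uint8_t *buf, size_t buf_len, size_t idx)
{
    size_t base = (idx / 2) * 3; /* first byte of the 3-byte group */
    assert(base + 3 <= buf_len);
    if (idx % 2 == 0) {
        /* item A: all of b0 followed by the high nibble of b1 */
        return (uint16_t)((buf[base] << 4) | (buf[base + 1] >> 4));
    }
    /* item B: the low nibble of b1 followed by all of b2 */
    return (uint16_t)(((buf[base + 1] & 0x0F) << 8) | buf[base + 2]);
}
```

Each extraction costs one or two shifts and an OR, compared with a single load for a byte-aligned item. The branch on idx depends only on the item position, not on the item value or on any query.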

Other dictionary encodings. Data structures other than the ones we have discussed so far may be suitable for use with the carousel design pattern. A particularly attractive example is a construction by Carter et al. [14] that uses (ε + 2)n bits. We are exploring the use of this data structure in our current work.

Fig. 10. Kinibi-TZ: Evolution of occupancy at 1030 queries/second. [Plot of occupancy (concurrent queries in TA) against carousel cycles, with linear fit y = 0.34 · x + 651.61.]

Fig. 11. Intel SGX: Steady-state response latency for uniform query arrival rates (average and variance over 1000 runs). [Log-scale plot of processing time (seconds) for Cuckoo-on-a-Carousel with 1, 2, 3 and 4 pages.]

Adversary capabilities. We assumed that the adversary can observe the full memory access pattern for non-private memory (e.g. the CA's memory, from which the dictionary representation is accessed). This provides the strongest privacy guarantee on all hardware platforms. However, if certain platforms do not allow the adversary to make such detailed observations, our approach could be further optimized for these platforms without impacting privacy.

IX. SECURITY ANALYSIS

The TA is implemented to behave essentially as a trusted third party. Namely: (1) The communication channels between users and the TA are encrypted and authenticated. (2) Remote attestation guarantees to all parties that the TA runs the required program. (3) The Trusted Execution Environment isolates the TA's computation from the rest of the system. (4) The algorithms that are used (Algorithms 3, 4 and 5) were designed and carefully implemented to prevent side-channels (see footnote 11). The access patterns and the entire behavior of the TA, as can be viewed externally, are indistinguishable for different query sets of the same length. The implementation of the algorithms has the TA access every dictionary entry within a chunk and perform an equal number of operations per entry, regardless of whether a match is found. (An adversary that measures, for example, the time it takes to process a given chunk, will always get the same measurement, since this time depends on the number of queries but not on the contents of the queries.) Therefore, since the TA behaves as a trusted party, Requirement R1 is satisfied.

Figures 6 and 7 show that the carousel time for 1000 simultaneous queries is within about a second for both Kinibi-TZ and SGX, satisfying Requirement R2 (latency). When the number of simultaneous queries in the TA increases to 4000, the response latency is still reasonable (4 seconds for Kinibi-TZ and 2 seconds for SGX). Figures 8a and 8b show that the carousel approach can sustain a relatively high query arrival rate (1025 queries/second for Kinibi-TZ and 3720 queries/second for SGX) without breakdown. Use of multiple TEEs can support more queries or a larger dictionary, satisfying Requirement R3 (scalability). Finally, none of our schemes introduces any false negatives, and the false positive rate is within the 2^-10 limit identified (Requirement R4).

X. RELATED WORK

Private Information Retrieval (PIR) is a well-known cryptographic protocol that allows a user to retrieve an item from a known position in a server's database without the server learning which item was accessed. The first PIR scheme, as proposed by Chor et al. [15], works in the scenario where there are replicated databases held by independent servers. The first single-server scheme was introduced by Kushilevitz and Ostrovsky [31]. Subsequently, many schemes have been proposed [13], [32], [23], [37].

Footnote 11: That different code paths take equal processing time cannot be fully ensured at source code level only. In Appendix A we discuss how one can ensure equal processing time at instruction level.

It is not reasonable to assume that users know the indices of desired items. This motivates Private Keyword Search (PKS). In PKS, the server holds a database of n pairs {(x_1, p_1), ..., (x_n, p_n)}, where x_i is a keyword and p_i is a payload. A query is a search word x instead of an index. After the protocol, the user gets the result p_i if there is a value i for which x_i = x, and otherwise receives a special symbol ⊥. PKS can be constructed based on PIR, oblivious polynomial evaluation [22], re-routable encryption [46] or multiparty computation [43].

Private membership test can be viewed as a simplified version of PKS, where the user does not require the actual payload. The main limitation of the current PIR/PKS solutions is their efficiency, in terms of both computation and communication.

In addition to the purely cryptographic solutions, another option is to use trusted hardware combined with cryptography to solve the PIR/PKS problems. For example, [8], [26], [53] can achieve PIR with constant computation and communication, but have to periodically re-shuffle the dataset. Backes et al. [9] propose to use ORAM in combination with trusted hardware to achieve PKS, which ensures access privacy in online behavioral advertising. However, this approach has two drawbacks compared with our solution. First, it requires all elements in the database to be encrypted, so some subset must be decrypted to answer each query. Second, it is hard to achieve batched query processing, which limits scalability.

Another approach for implementing PMT is to have the server offload some data (retaining the same order as the dataset) to the user in the offline phase. This allows constant communication and computation for each query in the online phase [40], [38]. However, the drawback of this approach is that it prevents the dataset from being updated frequently, which is a critical requirement for a malware checking use case.

XI. CONCLUSION AND FUTURE WORK

Motivated by the problem of privacy-preserving cloud-based malware checking, we introduced a new carousel approach for private membership test. We evaluated several data structures for representing the dictionary and described how to adapt them to the carousel design pattern. We implemented these on both ARM TrustZone and Intel SGX and found that Cuckoo hash provides the lowest query response latency. We compared our carousel approach with ORAM, and found that the former can sustain significantly higher query arrival rates. Future work will investigate other data structures for representing the dictionary, compare newer ORAM schemes, and explore new ways of using trusted hardware to enhance these schemes.

REFERENCES

[1] “AMD Secure Processor.” [Online]. Available: http://www.amd.com/en-us/innovations/software-technologies/security

[2] “G data mobile malware report, threat report: Q4/2015.” [Online]. Available: https://www.gdatasoftware.com/securitylabs/news/article/g-data-releases-mobile-malware-report-for-the-fourth-quarter-of-2015

[3] “GlobalPlatform: Device specifications for trusted execution environment.” [Online]. Available: http://www.globalplatform.org/specificationsdevice.asp

[4] “How android users interact with their phones.” [Online]. Available: https://yahooaviate.tumblr.com/image/95795838933

[5] “Kinibi Trusted Execution Environment (TEE).” [Online]. Available: https://www.trustonic.com/products/kinibi

[6] I. Anati, S. Gueron, S. Johnson, and V. Scarlata, “Innovative Technology for CPU Based Attestation and Sealing,” in HASP, 2013. [Online]. Available: https://software.intel.com/en-us/articles/innovative-technology-for-cpu-based-attestation-and-sealing

[7] ARM, “ARM security technology – building a secure system using TrustZone technology,” April 2009. [Online]. Available: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.prd29-genc-009492c/index.html

[8] D. Asonov and J.-C. Freytag, “Almost optimal private information retrieval,” in PETS, 2002, pp. 209–223. [Online]. Available: http://dl.acm.org/citation.cfm?id=1765299.1765315

[9] M. Backes, A. Kate, M. Maffei, and K. Pecina, “Obliviad: Provably secure and practical online behavioral advertising,” in Oakland, 2012, pp. 257–271. [Online]. Available: http://www.ieee-security.org/TC/SP2012/papers/4681a257.pdf

[10] V. Bindschaedler, M. Naveed, X. Pan, X. Wang, and Y. Huang, “Practicing Oblivious Access on Cloud Storage,” in CCS, 2015. [Online]. Available: http://dl.acm.org/citation.cfm?id=2810103.2813649

[11] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Commun. ACM, vol. 13, no. 7, pp. 422–426, 1970. [Online]. Available: http://doi.acm.org/10.1145/362686.362692

[12] E. Boyle, K.-M. Chung, and R. Pass, “Oblivious parallel ram and applications,” in 13th International Conference on the Theory of Cryptography, 2016, pp. 175–204. [Online]. Available: http://dx.doi.org/10.1007/978-3-662-49099-0_7

[13] C. Cachin, S. Micali, and M. Stadler, “Computationally private information retrieval with polylogarithmic communication,” in EUROCRYPT, 1999, vol. 1592, pp. 402–414. [Online]. Available: http://dx.doi.org/10.1007/3-540-48910-X_28

[14] L. Carter, R. Floyd, J. Gill, G. Markowsky, and M. Wegman, “Exact and approximate membership testers,” in STOC, 1978, pp. 59–65. [Online]. Available: http://doi.acm.org/10.1145/800133.804332

[15] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan, “Private information retrieval,” J. ACM, vol. 45, no. 6, pp. 965–981, Nov. 1998. [Online]. Available: http://doi.acm.org/10.1145/293347.293350

[16] V. Costan, I. Lebedev, and S. Devadas, “Sanctum: Minimal hardware extensions for strong software isolation,” USENIX Security, 2016. [Online]. Available: https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/costan

[17] D. Dachman-Soled, C. Liu, C. Papamanthou, E. Shi, and U. Vishkin, “Oblivious network ram and leveraging parallelism to achieve obliviousness,” in ASIACRYPT, 2015, pp. 337–359. [Online]. Available: http://dx.doi.org/10.1007/978-3-662-48797-6_15

[18] S. Devadas, M. van Dijk, C. W. Fletcher, L. Ren, E. Shi, and D. Wichs, “Onion oram: A constant bandwidth blowup oblivious ram,” in 13th International Conference on the Theory of Cryptography, 2016, pp. 145–174. [Online]. Available: http://dx.doi.org/10.1007/978-3-662-49099-0_6

[19] J. Ekberg, K. Kostiainen, and N. Asokan, “The untapped potential of trusted execution environments on mobile devices,” IEEE Security & Privacy, vol. 12, no. 4, pp. 29–37, 2014. [Online]. Available: http://dx.doi.org/10.1109/MSP.2014.38

[20] U. Erlingsson, M. Manasse, and F. McSherry, “A cool and practical alternative to traditional hash tables,” in WDAS, 2006. [Online]. Available: http://www.ru.is/faculty/ulfar/CuckooHash.pdf

[21] D. Fotakis, R. Pagh, P. Sanders, and P. G. Spirakis, “Space efficient hash tables with worst case constant access time,” Theory Comput. Syst., vol. 38, no. 2, pp. 229–248, 2005. [Online]. Available: http://dx.doi.org/10.1007/s00224-004-1195-x


[22] M. Freedman, Y. Ishai, B. Pinkas, and O. Reingold, “Keyword search and oblivious pseudorandom functions,” in TCC, 2005, vol. 3378, pp. 303–324. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-30576-7_17

[23] C. Gentry and Z. Ramzan, “Single-database private information retrieval with constant communication rate,” in ICALP, 2005, vol. 3580, pp. 803–815. [Online]. Available: http://dx.doi.org/10.1007/11523468_65

[24] O. Goldreich and R. Ostrovsky, “Software protection and simulation on oblivious rams,” J. ACM, vol. 43, no. 3, pp. 431–473, 1996. [Online]. Available: http://doi.acm.org/10.1145/233551.233553

[25] D. Gupta, B. Mood, J. Feigenbaum, K. Butler, and P. Traynor, “Using Intel Software Guard Extensions for Efficient Two-Party Secure Function Evaluation,” in 4th Workshop on Encrypted Computing and Applied Homomorphic Cryptography - WAHC’16, 2016. [Online]. Available: http://www.cs.yale.edu/homes/jf/GMFBT-WAHC2016.pdf

[26] A. Iliev and S. W. Smith, “Protecting client privacy with trusted computing at the server,” IEEE Security & Privacy, vol. 3, no. 2, pp. 20–28, 2005. [Online]. Available: http://www.cs.dartmouth.edu/~sws/pubs/is05a.pdf

[27] Intel, “Software Guard Extensions Programming Reference,” 2013. [Online]. Available: https://software.intel.com/sites/default/files/329298-001.pdf

[28] B. Jenkins, “Function for producing 32bit hashes for hash table lookup,” 2006. [Online]. Available: http://www.burtleburtle.net/bob/c/lookup3.c

[29] A. Kirichenko, “Personal communication,” F-Secure, 2015.

[30] A. Kirsch, M. Mitzenmacher, and U. Wieder, “More robust hashing: Cuckoo hashing with a stash,” SIAM Journal on Computing, vol. 39, no. 4, pp. 1543–1561, 2010. [Online]. Available: http://dx.doi.org/10.1137/080728743

[31] E. Kushilevitz and R. Ostrovsky, “Replication is not needed: single database, computationally-private information retrieval,” FOCS, p. 364, 1997. [Online]. Available: http://dx.doi.org/10.1109/SFCS.1997.646125

[32] H. Lipmaa, “An oblivious transfer protocol with log-squared communication,” in Information Security, 2005, pp. 314–328. [Online]. Available: http://dx.doi.org/10.1007/11556992_23

[33] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-Level Cache Side-Channel Attacks are Practical,” in Oakland, 2015. [Online]. Available: http://palms.ee.princeton.edu/system/files/SP_vfinal.pdf

[34] J. R. Lorch, B. Parno, J. Mickens, M. Raykova, and J. Schiffman, “Shroud: Enabling private access to large-scale data in the data center,” in USENIX FAST, 2013. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=179714

[35] F. McKeen, I. Alexandrovich, A. Berenzon, C. V. Rozas, H. Shafi, V. Shanbhogue, and U. R. Savagaonkar, “Innovative instructions and software model for isolated execution,” in HASP, 2013, pp. 10:1–10:1. [Online]. Available: http://doi.acm.org/10.1145/2487726.2488368

[36] F. McKeen, I. Alexandrovich, A. Berenzon, C. V. Rozas, H. Shafi, V. Shanbhogue, and U. R. Savagaonkar, “Innovative instructions and software model for isolated execution,” in HASP, 2013. [Online]. Available: http://dl.acm.org/citation.cfm?id=2487726.2488368

[37] C. A. Melchor and P. Gaborit, “A lattice-based computationally-efficient private information retrieval protocol,” IACR Cryptology ePrint Archive, p. 446, 2007. [Online]. Available: http://eprint.iacr.org/2007/446

[38] T. Meskanen, J. Liu, S. Ramezanian, and V. Niemi, “Private membership test for bloom filters,” in Trustcom/BigDataSE/ISPA, vol. 1, 2015, pp. 515–522. [Online]. Available: http://dx.doi.org/10.1109/Trustcom.2015.414

[39] T. Moataz, T. Mayberry, and E.-O. Blass, “Constant communication oram with small blocksize,” in CCS, 2015, pp. 862–873. [Online]. Available: http://doi.acm.org/10.1145/2810103.2813701

[40] R. Nojima and Y. Kadobayashi, “Cryptographically secure bloom-filters,” Trans. Data Privacy, vol. 2, no. 2, pp. 131–139, 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1745475.1745477

[41] A. Pagh, R. Pagh, and S. S. Rao, “An optimal bloom filter replacement,” in SODA. Society for Industrial and Applied Mathematics, 2005, pp. 823–829. [Online]. Available: http://dl.acm.org/citation.cfm?id=1070432.1070548

[42] R. Pagh and F. F. Rodler, “Cuckoo hashing,” Journal of Algorithms, vol. 51, no. 2, pp. 122–144, 2004. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0196677403001925

[43] V. Pappas, F. Krell, B. Vo, V. Kolesnikov, T. Malkin, S. G. Choi, W. George, A. Keromytis, and S. Bellovin, “Blind Seer: A Scalable Private DBMS,” in Oakland, 2014, pp. 359–374. [Online]. Available: https://www.cs.columbia.edu/~angelos/Papers/2014/blind_seer.pdf

[44] B. Pinkas, T. Schneider, G. Segev, and M. Zohner, “Phasing: Private set intersection using permutation-based hashing,” in USENIX Security, 2015, pp. 515–530. [Online]. Available: https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/pinkas

[45] B. Pinkas, T. Schneider, and M. Zohner, “Faster private set intersection based on ot extension,” in USENIX Security, 2014, pp. 797–812. [Online]. Available: http://dl.acm.org/citation.cfm?id=2671225.2671276

[46] M. Raykova, B. Vo, S. M. Bellovin, and T. Malkin, “Secure anonymous database search,” in CCSW, 2009, pp. 115–126. [Online]. Available: http://doi.acm.org/10.1145/1655008.1655025

[47] L. Ren, C. Fletcher, A. Kwon, E. Stefanov, E. Shi, M. van Dijk, and S. Devadas, “Constants count: Practical improvements to oblivious ram,” in USENIX Security, Aug. 2015, pp. 415–430. [Online]. Available: https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/ren-ling

[48] C. Sahin, V. Zakhary, A. El Abbadi, H. R. Lin, and S. Tessaro, “TaoStore: Overcoming asynchronicity in oblivious data storage,” in Oakland, 2016. [Online]. Available: http://www.ieee-security.org/TC/SP2016/papers/0824a198.pdf

[49] J. Schiffman, T. Moyer, H. Vijayakumar, T. Jaeger, and P. McDaniel, “Seeding clouds with trust anchors,” in CCSW, 2010, pp. 43–46. [Online]. Available: http://doi.acm.org/10.1145/1866835.1866843

[50] S. Seneviratne, A. Seneviratne, P. Mohapatra, and A. Mahanti, “Predicting user traits from a snapshot of apps installed on a smartphone,” SIGMOBILE Mob. Comput. Commun. Rev., vol. 18, no. 2, pp. 1–8, 2014. [Online]. Available: http://doi.acm.org/10.1145/2636242.2636244

[51] E. Shi, T. H. H. Chan, E. Stefanov, and M. Li, “Oblivious ram with O((log N)^3) worst-case cost,” in ASIACRYPT, 2011, pp. 197–214. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-25385-0_11

[52] E. Stefanov, M. van Dijk, E. Shi, C. W. Fletcher, L. Ren, X. Yu, and S. Devadas, “Path ORAM: an extremely simple oblivious RAM protocol,” in CCS, 2013, pp. 299–310. [Online]. Available: http://doi.acm.org/10.1145/2508859.2516660

[53] P. Williams and R. Sion, “Usable PIR,” in NDSS, 2008. [Online]. Available: http://www.isoc.org/isoc/conferences/ndss/08/papers/09_usable_pir.pdf

[54] Y. Xu, W. Cui, and M. Peinado, “Controlled-Channel Attacks: Deterministic Side Channels for Untrusted Operating Systems,” in Oakland, 2015. [Online]. Available: https://www.cs.utexas.edu/~yxu/files/xu15oakland.pdf

APPENDIX

Implementing the algorithms from Section VI naively does not ensure that the TA performs an equal number of machine-level instructions on every item in Y. For example, in Algorithm 5, R can be an unsigned char array and dummy_byte an unsigned char variable. The compiler uses different sets of instructions to copy values of Y into them, causing an unequal number of machine-level instructions in the conditional clauses (if and else). Similarly, the compiler removes or optimizes away dummy operations (e.g. dummy_int++) if their results are not used elsewhere in the code. It also removes dummy conditional clauses that are unreachable or unnecessary.
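One common way to sidestep unbalanced if/else clauses is to remove the secret-dependent branch altogether and select the written value with a bit mask. The C sketch below illustrates that alternative; it is not the balanced-branch code we use in Figure 12. The names eq_mask and scan_chunk, and the convention that slots[] ends with a sentinel position that never occurs in Y, are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All-ones mask if a == b, all-zeros otherwise, computed without a branch. */
static uint32_t eq_mask(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;                      /* zero iff a == b */
    uint32_t nonzero = (x | (0u - x)) >> 31; /* 1 if x != 0, else 0 */
    return nonzero - 1u;
}

/* Scans one chunk of Y. slots[] holds sorted, distinct dictionary positions
 * followed by a sentinel; each matched position is overwritten by the value
 * found there. Returns the index of the next unmatched slot. */
size_t scan_chunk(const uint32_t *chunk, size_t chunk_len,
                  uint32_t first_pos, uint32_t *slots, size_t idx)
{
    assert(chunk != NULL && slots != NULL);
    for (size_t i = 0; i < chunk_len; i++) {
        uint32_t y_pos = first_pos + (uint32_t)i;
        uint32_t m = eq_mask(y_pos, slots[idx]);
        /* The same load, mask, store sequence runs on a match and a miss. */
        slots[idx] = (chunk[i] & m) | (slots[idx] & ~m);
        idx += (size_t)(m & 1u);
    }
    return idx;
}
```

Even with source code written this way, a compiler is free to reintroduce branches or drop operations, so inspecting the generated instructions remains necessary.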

We tailored our implementation to achieve a balanced set of instructions for the conditional clauses while processing the carousel. Figure 12 depicts a section of the carousel processing code for the Cuckoo hash method that produces an equal number of machine-level instructions for every item in Y. Figure 13 shows the disassembled machine-level instruction mnemonics for the same code segment. For simplicity, the code segment shown in the figure is for processing 16-bit (ε = 14) items in Y.

In Figure 12, ptr_query_rep represents the pointer to S. We use the same variable to represent the dictionary positions as well as to store the value of the corresponding position. We implemented the code to operate on 32-bit values. The variables ptr_query_rep, ptr_chunk and ptr_chunk_end are defined as unsigned int*. Similarly, dummy_pos is an array of type unsigned int.


    // ptr_chunk: pointer to the beginning of the Y chunk
    // ptr_chunk_end: pointer to the end of the Y chunk
    // y_pos: current position in Y
    // ptr_query_rep: pointer to S
    // dummy_pos: dummy array of size 255

    while (ptr_chunk < ptr_chunk_end) {
        if (y_pos == *ptr_query_rep) {
            // match: store the dictionary value and move to the next query
            *ptr_query_rep = *ptr_chunk;
            ptr_query_rep++;
        } else {
            // no match: perform an equivalent dummy store
            dummy_pos[(uint8_t)*ptr_chunk] = *ptr_chunk;
        }
        y_pos++;
        ptr_chunk = ptr_chunk + 1;
    }

Fig. 12. Kinibi TA code for Cuckoo-on-a-Carousel processing

    70e: 1b61       subs    r1, r4, r5
    710: 4439       add     r1, r7
    712: f5b1 1f40  cmp.w   r1, #3145728    ; 0x300000
    716: f1c5 0200  rsb     r2, r5, #0
    71a: d20a       bcs.n   732 <tlMain+0x1b4>
    71c: 6819       ldr     r1, [r3, #0]
    71e: 4422       add     r2, r4
    720: 5dd2       ldrb    r2, [r2, r7]
    722: 428f       cmp     r7, r1
    724: bf0c       ite     eq
    726: f843 2b04  streq.w r2, [r3], #4
    72a: f84a 2022  strne.w r2, [sl, r2, lsl #2]
    72e: 3701       adds    r7, #1
    730: e7ed       b.n     70e <tlMain+0x190>

Fig. 13. Disassembled machine instruction mnemonics for Cuckoo-on-a-Carousel processing
