
Wide Area Network-scale Discovery and Data Dissemination in Data-centric Publish/Subscribe Systems ∗

Kyoungho An† and Aniruddha Gokhale
EECS Dept, Vanderbilt Univ, Nashville, TN, USA
{kyoungho.an,a.gokhale}@vanderbilt.edu

Sumant Tambe
Real-Time Innovations, Sunnyvale, CA, USA
[email protected]

Takayuki Kuroda
Knowledge Discovery Research Lab, NEC Corp, Kawasaki, Kanagawa, Japan
[email protected]

ABSTRACT

Distributed systems found in application domains, such as smart transportation and smart grids, inherently require dissemination of large amounts of data over wide area networks (WAN). A large portion of this data is analyzed and used to manage the overall health and safety of these distributed systems. The data-centric, publish/subscribe (pub/sub) paradigm is an attractive choice to address these needs because it provides scalable and loosely coupled data communications. However, existing data-centric pub/sub mechanisms supporting quality of service (QoS) tend to operate effectively only within local area networks. Likewise, broker-based solutions that operate at WAN scale seldom provide mechanisms to coordinate among themselves for discovery and dissemination of information, and cannot handle both the heterogeneity of pub/sub endpoints and the significant churn in endpoints that is common in WAN-scale systems. To address these limitations, this paper presents PubSubCoord, which is a cloud-based coordination and discovery service for WAN-scale pub/sub systems. PubSubCoord realizes a WAN-scale, adaptive, and low-latency endpoint discovery and data dissemination architecture by (a) balancing the load using elastic cloud resources, (b) clustering brokers by topics for affinity, and (c) minimizing the number of data delivery hops in the pub/sub overlay. PubSubCoord builds on ZooKeeper's coordination primitives to support dynamic discovery of brokers and pub/sub endpoints located in isolated networks. Empirical results evaluating the performance of PubSubCoord are presented for (1) scalability of data dissemination and coordination, and (2) deadline-aware overlays employing configurable QoS to provide low-latency data delivery for topics demanding strict service requirements.

∗This work is supported in part by NSF CAREER CNS 0845789. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF.
†Author is currently with Real-Time Innovations

ISIS Technical Report ISIS-15-120, Nashville, TN, USA, 2015.


Categories and Subject Descriptors

C.2.4 [Computer Systems]: Distributed Systems—publish/subscribe; D.2.11 [Software Engineering]: Architectures—middleware

Keywords

Publish/subscribe, Distributed Systems, WAN

1. INTRODUCTION

Distributed systems found in application domains, such as transportation, advanced and agile manufacturing, smart energy grids, and other industrial use cases of the Internet of Things, are increasingly assuming wide area network (WAN) scale, e.g., the Industrial Internet Reference Architecture.1 These systems are characterized by a large collection of heterogeneous entities, some of which are sensors that sense various parameters across the span of the system and disseminate this information for processing so that appropriate control operations are actuated over different spatio-temporal scales to manage different properties of the system. The key to the success of these systems lies in how effectively the system collects and delivers data across a large number of heterogeneous entities at internet scale in a manner that satisfies different quality of service (QoS) properties (e.g., timeliness, reliability, and security).

The publish/subscribe (pub/sub) communication paradigm is a promising solution since it provides a scalable and decoupled data delivery mechanism between communicating peers. Many industrial and academic pub/sub solutions exist [4, 7, 9, 12, 18, 22, 27]. Some of these even support the desired QoS properties, such as availability [9, 22], configurable reliability [18], durability [12], and timeliness [6, 15]. However, these solutions tend to support only one QoS property at a time, and in most cases the support for configurability and dynamic adaptation is lacking. More importantly, dynamic discovery of heterogeneous endpoints, which is a key requirement, is often missing in these solutions.

One approach to supporting WAN-based pub/sub relies on broker-based solutions [1, 10, 14, 17] because this approach can solve practical issues when a system is deployed in network environments that use network address translation (NAT) or firewalls, or that do not support multicast. However, as industrial systems progressively integrate other subsystems located in multiple disparate networks, the number of deployed brokers becomes very large. Consequently, the effort required to manage these dispersed brokers becomes unwieldy. Moreover, forming an efficient overlay network of brokers that offers both scalability and low latency becomes even harder.

1 http://www.iiconsortium.org/IIRA.htm

To address these key requirements, we present PubSubCoord, which is a cloud-based coordination service for geographically distributed pub/sub brokers to transparently connect heterogeneous endpoints and realize internet-scale data-centric pub/sub systems. Specifically, this paper addresses the following challenges in the context of scalable, reliable, and dynamic pub/sub systems:

• Scalability and Availability: To address the scalability and availability needs of data dissemination across WAN-scale systems despite NAT/firewall issues and failures, PubSubCoord uses the separation of concerns principle to decouple local area-based brokers called edge brokers, which handle local pub/sub issues, from cloud-based brokers called routing brokers, which handle routing between edge brokers (see Section 2.2).

• Dynamic Discovery and Dissemination: To support dynamic discovery and data routing between brokers, PubSubCoord provides efficient coordination among the brokers by building pub/sub-specific event notifications using the basic primitives provided by the ZooKeeper coordination service [13]. This solution helps to synchronize the dissemination paths over the dynamic overlay network of brokers and heterogeneous endpoints (see Section 2.3).

• Overload and Fault Management: To manage topic and dissemination overload, PubSubCoord uses cloud-based elasticity to balance the load. Load balancing and broker failures are handled by an elected leader (see Section 2.4).

• Performance Optimizations: For those dissemination paths that need both low latency and reliability assurances, PubSubCoord trades off resource usage in favor of deadline-aware overlays that build multiple, redundant paths between brokers (see Section 2.4.3).

Our PubSubCoord design can easily be adopted by industrial systems because it is based on proven software engineering design patterns. We have favored maximal reuse of proven industrial-strength solutions wherever possible instead of reinventing the wheel. A key guiding principle for us was to ensure a non-invasive and extensible design that preserves the endpoint discovery and data dissemination model of the underlying pub/sub messaging system by tunneling discovery and dissemination messages across the hierarchy of brokers. Using this approach, it is possible to support multiple concrete pub/sub technologies without breaking their individual semantics. We present extensive empirical test data to validate our claims. The experimental results are demonstrated concretely in the context of endpoints that use the OMG Data Distribution Service (DDS) [20] as the underlying pub/sub messaging system.

The source code and experimental apparatus of PubSubCoord are made available in open source.2

The remainder of this paper is organized as follows: Section 2 describes the design and implementation of PubSubCoord; Section 3 shows experimental results validating our claims; Section 4 compares PubSubCoord with related work; and Section 5 presents concluding remarks and alludes to future work.

2. DESIGN OF PUBSUBCOORD

This section presents the PubSubCoord architecture and the rationale for the design decisions. For each solution we explain the context, the design patterns and frameworks used, and how the consequences of applying the patterns are resolved. We then provide details on the runtime interactions between the elements of the architecture, and finally allude to some implementation details.

2.1 PubSubCoord Architecture: The Big Picture

Figure 1 shows the layered PubSubCoord architecture depicting three layers: a coordination layer, a pub/sub broker overlay layer, and the physical network layer. The pub/sub broker overlay comprises a broker hierarchy based on a separation of concerns, representing the logical network of brokers and endpoints in a system. An edge broker is directly connected to individual endpoints in a local area network (LAN) (i.e., an isolated network) and serves as a bridge to endpoints placed in different networks. A routing broker serves as a mediator to route data between edge brokers according to assigned and matched topics that are present in the global data space. The coordination layer comprises an ensemble of ZooKeeper servers used for coordination between the brokers.

Data dissemination in PubSubCoord is explained using the example in Figure 1. Pi{T} denotes a publisher i that publishes topic T (similarly for a subscriber Sj{T}). In the example, since publisher P1 and subscriber S1 are the only endpoints interested in topic A, they communicate within their local network A using only the techniques provided by the underlying pub/sub technology. On the other hand, P2, P4, and S2 are interested in topic B but are deployed in different isolated networks, so their communications are routed through the routing broker that is responsible for topic B. The network transport protocol between brokers is configurable, but TCP is used as the default transport to ensure reliable communication over WANs. As seen from this example, at most 2 hops on the overlay network are incurred by data flowing from one isolated network to another (e.g., network A to B).

2.2 Addressing Scalability and Availability Requirements

Context: Traditional WAN-based pub/sub systems tend to form an overlay network of brokers to which endpoints can be connected. The brokers exchange the subscriptions they receive from subscribers, which are used to build routing paths from publishers. The main challenge in this approach stems from having to build routing state among brokers to route data efficiently to matching subscribers.

2 URL for download: www.dre.vanderbilt.edu/~kyoungho/pubsubcoord


Figure 1: Layering and Separation of Concerns in the PubSubCoord Architecture (coordination layer: an ensemble of ZooKeeper servers with an elected leader; pub/sub overlay layer: routing brokers in a cloud data center network and edge brokers bridging networks A, B, and C; physical network layer: endpoints such as P1{A}, S1{A}, P2{B}, S2{B}, P3{C}, S3{C}, and P4{B})

Secondly, in traditional broker-based pub/sub systems, if a broker fails, it halts not only the pub/sub service for endpoints connected to that broker but also the service for endpoints connected to other brokers that use the failed broker as an intermediate routing broker.

Design: To resolve these two challenges, PubSubCoord's broker overlay layer is structured as a two-tier architecture of brokers with strict separation of responsibilities: at the first tier are edge brokers that manage pub/sub issues within an isolated network, and at the second tier are routing brokers, which route traffic among different edge brokers. Our solution clusters edge brokers by matching topics and routes data through routing brokers. Each routing broker is responsible for handling only a certain number of topics so as to balance the load and minimize both the overall amount of data exchanged and the number of connections between edge brokers.

Consequences and resolution: The immediate consequence of this design decision is having to decide how many routing brokers to maintain, how many topics each routing broker should handle, and how to organize them in the system. Having only one routing broker would be problematic since it cannot scale to handle the substantial routing load stemming from the dissemination of different topic data among a large number of endpoints. Having multiple routing brokers and organizing them in multiple levels of hierarchy, similar to DNS, would not be acceptable either, since it would complicate topic management and recovery from failures because the routing state and topic management would be distributed across multiple levels. Secondly, multiple levels as in DNS introduce multiple routing hops, which would impact dissemination latency. For these reasons, we maintain a flat tier of routing brokers.

The number of routing brokers and the topics managed by each routing broker are determined by the end-to-end performance requirements of the pub/sub flows. Thus, a solution that can elastically scale the number of routing brokers and balance the number of topics handled by each broker is needed. For that reason, the routing broker tier is placed as a cluster in the cloud, where resources can be elastically scaled up and down depending on demand. This capability allows us to dynamically adapt to system load and scale to existing demand.

2.3 Addressing Dynamic Endpoint Discovery and Dissemination Requirements

Context: In the WAN-style systems of interest to us, matching publishers and subscribers are most likely distributed in separate isolated networks, and may join or leave dynamically. Thus, publishers and subscribers must be able to dynamically discover each other, and a dissemination route needs to be established between the communicating entities.

So far we have described the design rationale for a 2-tier broker architecture, which helps resolve issues stemming from having to maintain substantial pub/sub routing state, but we have not yet shown how endpoints in isolated networks are discovered and how routes are established dynamically based on endpoint discovery.

Design: To address this need, PubSubCoord provides a coordination layer (the top layer shown in Figure 1) comprising an ensemble of ZooKeeper [13] servers, which help brokers discover each other and build broker overlay networks using the PubSubCoord coordination logic. ZooKeeper is an industrial-strength solution that provides generic primitives for developing domain-specific coordination capabilities for distributed applications.

The ZooKeeper service consists of an ensemble of servers that use replication to achieve high availability with high performance and relaxed consistency. Many coordination recipes using ZooKeeper exist (e.g., leader election, group membership, and sharing configuration metadata) that are needed by distributed applications. PubSubCoord builds upon these recipes to develop coordination logic for broker interaction. ZooKeeper also provides the watch mechanism to notify a ZooKeeper client of a change to a znode (i.e., a ZooKeeper data object containing its path and data content). This feature is exploited to handle dynamic changes in the system.
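To make the watch mechanism concrete, the following minimal sketch uses the plain Apache ZooKeeper Java client to register a one-shot watch on a topic znode and react when the znode appears or its data changes; the connection string and znode path are illustrative placeholders, not values from the PubSubCoord code base:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class TopicWatchExample {
        public static void main(String[] args) throws Exception {
            // Connect to a (hypothetical) ZooKeeper server with a 30 s session timeout.
            ZooKeeper zk = new ZooKeeper("zk.example.org:2181", 30000, event -> { });

            // Set a watch on the znode for topic A; ZooKeeper fires it once on the next change.
            Watcher topicWatcher = (WatchedEvent event) -> {
                if (event.getType() == Watcher.Event.EventType.NodeCreated
                        || event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                    System.out.println("Topic znode changed: " + event.getPath());
                    // A broker would re-read the znode and re-register the watch here.
                }
            };
            zk.exists("/topics/topic_A", topicWatcher);
        }
    }

Because ZooKeeper watches are one-shot, a broker must re-register the watch after every notification, which is one reason our implementation uses Curator's cache recipes (Section 2.6) rather than raw watches.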

Figure 2: Tunneling used in PubSubCoord for ZNode Tree (the root znode has three children: /topics, /leader, and /broker; each topic znode under /topics, e.g., /topic_A and /topic_B, has /pub and /sub children holding endpoint znodes such as /dw1, /dw2, and /dr1; /broker holds one child znode per routing broker, e.g., /rb1, /rb2, /rb3)

Consequences and resolution: The data model of ZooKeeper is structured like a file system in the form of znodes. The original intent of this hierarchical namespace is to manage group membership. We repurposed it to manage pub/sub endpoints by applying the Tunnel pattern to introduce pub/sub semantics into the znodes. Figure 2 shows the znode data tree structure of PubSubCoord with pub/sub semantics. The root znode contains three child znodes: topics, leader, and broker. All unique topics defined in the pub/sub system are rooted under the topics znode. The child znodes of every unique topic represent the endpoints, i.e., publishers and subscribers, associated with it. The leader znode is used to elect a leader among the routing brokers. The broker znode has a child znode for each routing broker in which its location information (i.e., the IP address and port number of the routing broker) is stored. The leader uses this information to associate a selected routing broker's location with a topic znode after topic assignment.

Dynamic updates to different parts of this tree are achieved through broker interactions that exploit the watch mechanism. Specifically, dynamic changes in the system (e.g., broker or pub/sub endpoint join/leave, topic creation/deletion) are handled using the Observer pattern, where brokers connect to the ZooKeeper service as its clients and create, update, and delete znodes stored on the servers. They use the watch mechanism to set watches on interesting znodes to receive notifications. PubSubCoord exploits the ephemeral mode feature of ZooKeeper, where a specific znode in the tree is automatically deleted when the client session handling that znode is lost. Details on all the interactions that take place in this context are provided in Section 2.5, where we show how brokers discover each other and routes are established.
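As an illustration of how an edge broker might register an endpoint and how a routing broker might observe it, the sketch below uses Apache Curator, which our implementation builds on (see Section 2.6). The znode paths, the metadata string, and the EdgeBrokerRegistration class are hypothetical names chosen for this example, not the actual PubSubCoord code:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.cache.PathChildrenCache;
    import org.apache.curator.retry.ExponentialBackoffRetry;
    import org.apache.zookeeper.CreateMode;

    public class EdgeBrokerRegistration {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk.example.org:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Edge broker side: publish a DataWriter endpoint for topic A as an
            // EPHEMERAL znode, so it disappears automatically if the session is lost.
            byte[] metadata = "topic=A;locator=10.0.1.5:7400".getBytes();
            client.create().creatingParentsIfNeeded()
                  .withMode(CreateMode.EPHEMERAL)
                  .forPath("/topics/topic_A/pub/dw1", metadata);

            // Routing broker side (Observer pattern): watch the children of the
            // topic's publisher subtree and react to endpoint join/leave events.
            PathChildrenCache pubCache =
                    new PathChildrenCache(client, "/topics/topic_A/pub", true);
            pubCache.getListenable().addListener((c, event) -> {
                switch (event.getType()) {
                    case CHILD_ADDED:
                        System.out.println("Endpoint joined: " + event.getData().getPath());
                        break;
                    case CHILD_REMOVED:
                        System.out.println("Endpoint left: " + event.getData().getPath());
                        break;
                    default:
                        break;
                }
            });
            pubCache.start();
        }
    }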

2.4 Addressing Overload and Fault Tolerance Requirements

Context: Performance of the WAN-scale pub/sub system in our architecture can be impacted by at least two factors: load on a routing broker, and network congestion on the two-hop route between edge brokers over the broker overlay.3 Load on a routing broker depends on the number of topics it manages and, correspondingly, the number of edge brokers it interconnects. Addressing these two sources of performance bottlenecks is important for PubSubCoord. In the case of faults, although many kinds of faults are possible, failures of routing brokers and of the coordination logic are the most critical. Hence, we focus only on tolerating failures in routing brokers and the coordination logic.

2.4.1 Routing Broker Overload Management

Design: We address the routing broker overload problem by supporting load balancing within the routing broker tier. Load balancing is handled by a leader routing broker, which is elected among the routing brokers. To elect a leader in a consistent and safe manner, PubSubCoord uses ZooKeeper's leader znode: routing brokers attempt to write themselves to this znode in order to be elected as the leader (i.e., a voting process). The routing broker that writes first becomes the leader, since the znode is locked thereafter (i.e., no one can write to the znode unless the leader fails). The remaining routing brokers become workers. Worker routing brokers relay pub/sub data between edge brokers. The leader routing broker can also serve as a worker routing broker.

3 We have not addressed security issues in this paper.

Consequences and resolution: A leader routing broker must manage the cluster of routing brokers and assign topics to workers in a way that balances the load. It does this by selecting the least loaded worker, which currently is decided based on the number of topics adopted by that worker. However, the use of the Strategy pattern enables us to plug in other load balancing schemes (e.g., least loaded based on CPU utilization or the number of connections).
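A minimal sketch of what such a pluggable strategy could look like in Java is shown below; the names (WorkerSelectionStrategy, FewestTopicsStrategy, LowestCpuStrategy, RoutingBrokerInfo) are hypothetical and only illustrate the Strategy pattern described above, not the actual PubSubCoord interfaces:

    import java.util.Comparator;
    import java.util.List;

    // Hypothetical view of a worker routing broker as seen by the leader.
    record RoutingBrokerInfo(String brokerId, int adoptedTopics, double cpuUtilization) { }

    // Strategy interface: the leader delegates worker selection to a pluggable policy.
    interface WorkerSelectionStrategy {
        RoutingBrokerInfo selectWorker(List<RoutingBrokerInfo> workers);
    }

    // Default policy described above: pick the worker with the fewest adopted topics.
    class FewestTopicsStrategy implements WorkerSelectionStrategy {
        public RoutingBrokerInfo selectWorker(List<RoutingBrokerInfo> workers) {
            return workers.stream()
                          .min(Comparator.comparingInt(RoutingBrokerInfo::adoptedTopics))
                          .orElseThrow();
        }
    }

    // Alternative policy: pick the worker with the lowest CPU utilization.
    class LowestCpuStrategy implements WorkerSelectionStrategy {
        public RoutingBrokerInfo selectWorker(List<RoutingBrokerInfo> workers) {
            return workers.stream()
                          .min(Comparator.comparingDouble(RoutingBrokerInfo::cpuUtilization))
                          .orElseThrow();
        }
    }

The leader would invoke selectWorker whenever a new topic znode appears and then record the chosen worker's locator on that topic's znode.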

2.4.2 Broker Fault Tolerance

Design: PubSubCoord offers tolerance to routing broker failures, which can be of two kinds: worker failure and leader failure. When a worker fails, the leader reassigns the topics handled by that failed broker to another worker routing broker to avoid service cessation. If the load is too high, the cloud elastically scales up the number of routing brokers. If the leader fails, the routing brokers vote for another leader. On (re)assignment or failure of a routing broker, PubSubCoord leverages ZooKeeper's watch mechanism to notify the appropriate edge brokers to update their paths to the right routing broker.

Consequences and resolution: Since a ZooKeeper server itself may fail, to provide a scalable and fault-tolerant service at the coordination layer, multiple ZooKeeper servers can run as an ensemble; a leader of the ensemble synchronizes data between the distributed servers to provide consistent coordination events to clients (i.e., the brokers in our solution) and avoid single points of failure.

Note that we do not offer a solution to edge broker failure. We treat this failure as making an isolated network unreachable and hence no longer part of the pub/sub system. If and when it reappears, the edge broker follows the protocol for informing PubSubCoord of its existence as described next in the runtime interactions.

2.4.3 Deadline-aware Overlay Optimizations

A second cause of performance bottlenecks stems from a congested two-hop route connecting edge brokers via a routing broker. To overcome this problem, PubSubCoord also supports an optimization to improve both reliability and latency by providing an additional one-hop path over the overlay that directly connects communicating edge brokers. Note that doing this for every edge broker is infeasible due to the very large connection state that every edge broker would have to manage. Figure 3 illustrates the concept. These optimizations can be leveraged by pub/sub flows that require stringent assurances on reliable and deadline-driven data delivery.

Figure 3: Multi-path Deadline-aware Overlay Concept (publisher P at edge broker E1, subscriber S at edge broker E2, routing broker R, and links L1, L2, and L3; the optimization adds a direct path between E1 and E2 alongside the two-hop path via R)


To achieve this feature, PubSubCoord requires hints from the underlying pub/sub messaging system. As an example, OMG DDS uses the deadline QoS as a contract between pub/sub flows, which expresses the maximum duration within which a sample is expected to be updated. For those event streams requiring strict deadlines, multi-path overlay networks build an alternative, additional path directly between edge brokers, thereby reducing the number of hops to just one.

The patterns-based framework design of PubSubCoord enables strategizing it with the underlying technology-specific optimizations. This technology-specific logic should also provide the threshold that determines when to activate these optimizations.

2.5 Broker Interactions and Operation

So far we have presented the architectural elements of PubSubCoord. We now present details on the runtime interactions among these elements that realize the various capabilities of PubSubCoord. We describe how the brokers interact and the actual process of updating the internal states they use to route the streamed pub/sub data.

2.5.1 Routing Broker Responsibilities

Figure 4 presents the sequence diagram showing the runtime interactions of the routing brokers. The algorithm executed by the routing broker is captured in Algorithm 1. This algorithm is predominantly event-driven, i.e., it is made up of callback functions that are invoked when some condition is satisfied. These callback functions are invoked by ZooKeeper due to the different watch conditions. The Routing Service shown in the figure and used in the algorithm provides the capabilities at the edge broker that bridge the isolated network to the outside world.

Figure 4: Routing Broker Sequence Diagram (the ZooKeeper server, leader routing broker, worker routing broker, and Routing Service interact to elect a leader, register listeners for topic creation and assignment events, assign a newly created topic (e.g., Topic 'A') to a worker, update the topic znode with the assigned broker's locator, and create routes and paths to edge brokers when endpoint creation events arrive)

The algorithm and sequence of steps for the routing broker operate as follows. Each routing broker initially connects to the ZooKeeper leader as a client of the ZooKeeper service. The cluster of routing brokers subsequently elects a leader among themselves using the process described in Section 2.3 that uses the leader znode.

Algorithm 1 Routing Broker Callback Functions

    function broker_node_listener(broker_node_cache)
        topic_set = broker_node_cache.get_data()
        for topic : topic_set do
            if !topic_list.contains(topic) then
                ep_cache = create_children_cache(topic)
                set_listener(ep_cache)
                topic_list.add(topic)

    function endpoint_listener(ep_cache)
        ep = ep_cache.get_data()
        switch ep_cache.get_event_type() do
            case child_added:
                if !eb_peer_list.contains(ep.eb_locator) then
                    eb_peer_list.add(ep.eb_locator)
                    routing_service.add_peer(ep.eb_locator)
                if !topic_multi_set.contains(ep.topic) then
                    routing_service.create_topic_route(ep)
                topic_multi_set.add(ep.topic)
            case child_deleted:
                topic_multi_set.delete(ep.topic)
                if !topic_multi_set.contains(ep.topic) then
                    eb_peer_list.delete(ep.eb_locator)
                    routing_service.delete_topic_route(ep)

Once a leader is elected, it registers a listener (i.e., an event detector that is notified when the registered znode changes) on the topics znode (shown in Figure 2) to receive topic-relevant events (e.g., creation or deletion of topics).

The following callback functions are implemented by the routing brokers:

• broker_node_listener – This function is invoked when the znode for a worker routing broker is updated with an assigned topic by the leader routing broker. It is activated by a ZooKeeper client process.

• endpoint_listener – This function is invoked when child pub/sub endpoint znodes of an assigned topic's znode are created, deleted, or updated. It is activated by a ZooKeeper client process.

Every worker routing broker registers a listener on its own znode to receive topic assignment events written by the leader routing broker. In the broker_node_listener callback function, the znode for the routing broker stores a set of topics. When the topic set is updated by the leader (e.g., the leader assigns a new topic to the worker routing broker), the worker applies the changes by creating a cache for the assigned topic and a listener on it to receive events relevant to endpoints interested in that topic.

When an endpoint is created or deleted in an isolated network, its edge broker creates or deletes the znode for that endpoint, and these events trigger the endpoint_listener function in the routing brokers that are responsible for the topics involved. The metadata of the znode cache for an endpoint (ep in the endpoint_listener callback function) contains the locator of the edge broker where the endpoint is located, as well as the topic name, type, and QoS settings.
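For illustration, such endpoint metadata could be captured in a small value class that is serialized into the endpoint's znode; the EndpointInfo class and its fields below are hypothetical, chosen only to mirror the attributes listed above:

    import java.nio.charset.StandardCharsets;

    // Hypothetical metadata stored in an endpoint znode by an edge broker.
    record EndpointInfo(String topicName,          // e.g., "TopicA"
                        String typeName,           // registered data type of the topic
                        String qosProfile,         // name of the QoS profile in use
                        String edgeBrokerLocator)  // e.g., "10.0.1.5:7400"
    {
        // Serialize to a simple textual form for storage in the znode.
        byte[] toBytes() {
            String s = topicName + ";" + typeName + ";" + qosProfile + ";" + edgeBrokerLocator;
            return s.getBytes(StandardCharsets.UTF_8);
        }

        static EndpointInfo fromBytes(byte[] data) {
            String[] parts = new String(data, StandardCharsets.UTF_8).split(";");
            return new EndpointInfo(parts[0], parts[1], parts[2], parts[3]);
        }
    }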

Consider a concrete example using Figure 4. When TopicA is created, the leader routing broker assigns the topic to the least loaded worker, which currently is decided based on the number of topics adopted by that worker. However, other strategies can also be used in the load balancing decision (e.g., least loaded based on CPU utilization or the number of connections). Next, the leader writes the locator of the assigned worker broker on the corresponding znode created for TopicA, i.e., a child of the topics znode (see Figure 2). This locator information will then be used by edge brokers interested in TopicA.

A worker routing broker initially registers listeners on its own znode (i.e., a child of the broker znode) to receive topic assignment events, which occur when its assigned-topics znode is updated by the leader routing broker. When the worker routing broker is informed that it must handle a specific topic, such as TopicA, it then registers a listener on the pub/sub znodes for that assigned topic (e.g., the children of the topic_A znode) to receive endpoint discovery events, such as the creation of publisher or subscriber endpoints interested in TopicA. When an endpoint for TopicA is created and the worker routing broker is notified, it establishes data dissemination paths to the relevant edge brokers. For this data dissemination, PubSubCoord relies on the broker capabilities of the underlying pub/sub messaging system.

2.5.2 Edge Broker Responsibilities

Figure 5 shows the corresponding sequence diagram for edge brokers, and Algorithm 2 describes the logic of the callback functions implemented by the edge broker.

Figure 5: Edge Broker Sequence Diagram (an endpoint (pub or sub), the edge broker, its Routing Service, and the ZooKeeper server interact as follows: the edge broker discovers the endpoint via multicast, creates a znode for it with the edge broker's locator (IP and port) and a znode for its topic if one does not exist, registers a listener for routing broker assignment events, adds a path and route to the assigned routing broker, relays pub/sub data, and deletes the znode and route when the endpoint's liveliness times out)

Recall that an edge broker interfaces the underlying concrete pub/sub technology with the generic capabilities provided by PubSubCoord by tunneling technology-specific metadata through the generic data structures and communication mechanisms of PubSubCoord. To better explain the operation of the edge broker, we use the OMG DDS as our concrete pub/sub technology.4

The OMG DDS specification defines a distributed pub/sub communications standard [20].

4 Our empirical evaluations also use OMG DDS as the concrete pub/sub technology, as described in Section 3.

Algorithm 2 Edge Broker Callback Functions

    function endpoint_created(ep)
        create_znode(ep)
        if !topic_multi_set.contains(ep.topic) then
            ep_node_cache = create_node_cache(ep)
            set_listener(ep_node_cache)
            routing_service.create_topic_route(ep)
        topic_multi_set.add(ep.topic)

    function topic_node_listener(topic_node_cache)
        rb_locator = topic_node_cache.get_data()
        if !rb_peer_list.contains(rb_locator) then
            rb_peer_list.add(rb_locator)
            routing_service.add_peer(rb_locator)

    function endpoint_deleted(ep)
        delete_znode(ep)
        topic_multi_set.delete(ep.topic)
        if !topic_multi_set.contains(ep.topic) then
            delete_node_cache(ep)
            routing_service.delete_topic_route(ep)

At the core of DDS is a data-centric architecture (i.e., subscriptions are defined by topics, keyed data types, data contents, and QoS policies) for connecting anonymous data publishers with data subscribers in a logical global data space. A DDS data publisher produces typed data streams identified by names called topics. The coupling between a publisher and a subscriber is expressed in terms of the topic name, its data type schema, and the QoS attributes of publishers and subscribers. To ease management, each publisher is made up of one or more DataWriters and each subscriber is made up of one or more DataReaders. Each DataWriter and DataReader is associated with exactly one topic and performs the action of writing or reading, respectively.

The algorithm implements the following callback functions for edge brokers, which are invoked under the following conditions:

• endpoint_created: This function is invoked when an endpoint in an isolated network is created. It is activated by a built-in DDS DataReader.

• topic_node_listener: This function is invoked when a topic znode watched by an edge broker is updated with the locator of an assigned worker routing broker. It is activated by a ZooKeeper client process.

• endpoint_deleted: This function is invoked when an endpoint in a network is deleted. It is activated by a built-in DDS DataReader.

The edge broker operates as follows. Like routing brokers, edge brokers initially connect to the ZooKeeper servers as clients of the ZooKeeper service. In the context of OMG DDS, edge brokers make use of built-in entities (i.e., special pub/sub entities supported by OMG DDS for discovering peers and endpoints in a network) to discover endpoints in local networks. For example, when a pub or sub endpoint interested in TopicA is created within some isolated network, the built-in entities receive discovery events via multicast, and the edge broker then creates znodes for the created endpoints.

PubSubCoord does not need to know the semantics of these technology-specific mechanisms; all technology-specific information is masked within the generic znode data structures. When an endpoint is created with a new topic, an edge broker informs ZooKeeper of the new topic, which inserts it into its znode tree and informs the leader routing broker of the new topic. The routing broker leader then selects one of the existing worker routing brokers to handle that topic, as explained before. This selection is made based on the load on each worker routing broker.

Edge brokers register a listener on the topic znode (e.g., topic_A in Figure 2) in which the created endpoint is interested to obtain the locator of the routing broker that is in charge of that particular topic. Once the locator of a routing broker is obtained, the edge broker initiates a data dissemination path to the routing broker through the routing capabilities that are assumed to be collocated with the edge broker.

The endpoint_created callback function first creates a znode for the created endpoint (i.e., ep in Algorithm 2) that contains the topic name, type, and QoS settings. If the topic relevant to the created endpoint has not appeared at this edge broker before, a cache for the topic znode and a listener on it are created to receive the locator information of the assigned worker routing broker. When the znode for the topic is updated by the leader routing broker, it triggers the topic_node_listener callback described in Algorithm 2.

In the topic_node_listener callback function, the topic znode stores the locator of the worker routing broker that is responsible for the topic. This locator is added to the routing capability of the edge broker to establish a communication path between the edge broker and the worker routing broker.

The endpoint_deleted callback function deletes the znode for the departed endpoint and removes its topic from the multi-set of topics. Next, it checks whether the multi-set still contains the topic of the deleted endpoint. If the topic is still contained in the multi-set, other endpoints remain interested in the topic. If it is not, no other endpoint interested in the topic exists, and the cache and its listener need to be removed. A multi-set data structure is used for topics because there may still exist endpoints interested in the topics relevant to deleted endpoints.

To support mobility or termination of endpoints, PubSubCoord relies on some cooperation from the underlying pub/sub technology. For example, in the context of OMG DDS, if created endpoints move to different networks or are deleted, a timeout event occurs by virtue of the liveliness QoS policy (i.e., a DDS QoS policy used to detect disconnected endpoints, with configurable timeout values); accordingly, the znodes (which operate in the ephemeral mode) for those endpoints are deleted from the coordination servers and the route maintained at the edge broker is also terminated.

2.6 Implementation Details

Recall that PubSubCoord is meant to work with any underlying pub/sub technology, with the goal of making it WAN-enabled. Any concrete realization of PubSubCoord will require some concrete underlying pub/sub technology. We have demonstrated the feasibility of PubSubCoord in the context of the OMG Data Distribution Service (DDS).

5 http://curator.apache.org

We have used Curator,5 a high-level API that simplifies the use of ZooKeeper and provides useful recipes such as leader election and caches of znodes. We use the cache recipe to locally retain data objects that are accessed multiple times, enabling fast data access and reducing the load on the ZooKeeper servers.
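The following sketch shows how these two Curator recipes could be wired together by a routing broker; the paths and the broker identifier are illustrative placeholders rather than the actual values used by PubSubCoord:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.cache.NodeCache;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class RoutingBrokerCoordination {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk.example.org:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Leader election recipe: the first broker to acquire the latch on the
            // leader znode becomes the leader; the rest act as workers.
            LeaderLatch leaderLatch = new LeaderLatch(client, "/leader", "rb1");
            leaderLatch.start();

            // Cache recipe: keep a local copy of a topic znode so repeated reads of
            // the assigned routing broker's locator do not hit the ZooKeeper servers.
            NodeCache topicCache = new NodeCache(client, "/topics/topic_A");
            topicCache.getListenable().addListener(() -> {
                if (topicCache.getCurrentData() != null) {
                    String locator = new String(topicCache.getCurrentData().getData());
                    System.out.println("Topic A is now served by routing broker at " + locator);
                }
            });
            topicCache.start();

            if (leaderLatch.await(10, java.util.concurrent.TimeUnit.SECONDS)) {
                System.out.println("Elected leader: assigning topics to workers");
            }
        }
    }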

3. EXPERIMENTAL VALIDATION OF PUBSUBCOORD

We now present results of experiments we conducted to validate our claims. We focus on PubSubCoord performance and scalability, and the overhead of the coordination layer.

3.1 Overview of Testbed Configuration and Default Settings

Our testbed is a private cloud managed by OpenStack comprising 60 physical machines, each with 12 cores and 32 GB of memory. We emulated a WAN environment in our testbed using Neutron,6 an OpenStack project for networking as a service that allows users to create virtual networks using an Open vSwitch plugin.7 To that end, we created a total of 120 virtual networks. 380 virtual machines (VMs) are placed across these virtual networks. Each VM is configured with one virtual CPU and 2 GB of memory. We use RTI Connext 5.1 8 as the implementation of the OMG DDS standard, and its Routing Service is integrated with the brokers for our test applications.

We used all 380 VMs for the scalability experiments discussed below. Each broker executes inside its own VM, for which we used a total of 160 VMs: 120 VMs for hosting edge brokers (implying 120 isolated networks) and 40 VMs for routing brokers (implying a cluster of 40 routing brokers). Of the remaining 220 VMs, 20 VMs are used for hosting publishers and 200 VMs for subscribers. Each of these VMs runs either 25 publisher or 50 subscriber test applications. We locate 50 publishers or 100 subscribers within each isolated network (i.e., 2 VMs inside each network). This approach keeps the experimental apparatus simple. The total number of publishers and subscribers is 1,000 and 10,000, respectively. Subscribers in each network are interested in 100 topics out of the total 1,000 topics in the system. Publishers push data every 50 milliseconds, and the size of a data sample is 64 bytes; any other size is acceptable, but we did not test with different packet sizes simultaneously.

3.2 OMG DDS QoS Policy Settings Used

Since OMG DDS supports a number of different QoS policies that can be mixed and matched, we used those that are important for the systems of interest to us. Each QoS policy has offered and requested semantics (i.e., offered by publishers and requested by subscribers) and is used in conjunction with the data types of topics to match pairs of endpoints, i.e., DataReaders and DataWriters. We describe the purpose of only those policies that we have used in our experiments and show how they are used.

The reliability QoS controls the reliability of data flows between DataWriters and DataReaders at the transport level. It can be of two kinds: best effort and reliable. The durability QoS specifies whether or not the DDS middleware stores and delivers previously published data samples to endpoints that join the network later.

6 https://wiki.openstack.org/wiki/Neutron
7 http://www.openvswitch.org
8 https://www.rti.com/products/dds/


Figure 6: Results of Performance and Scalability Experiments: (a) mean and median end-to-end latency of pub/sub by different numbers of topics per network; (b) mean and median end-to-end latency of pub/sub with load balancing in routing brokers; (c) average latency of the coordination service by different numbers of joining subscribers; (d) 95th and 99th percentile end-to-end latency of pub/sub by different numbers of topics per network; (e) 95th and 99th percentile end-to-end latency of pub/sub with load balancing in routing brokers; (f) maximum latency of the coordination service by different numbers of joining subscribers; (g) CPU utilization by different numbers of topics per network; (h) CPU utilization with load balancing in routing brokers; (i) number of znodes and watches by different numbers of joining subscribers

The reliability and persistence of data can also be affected by the history QoS policy, which specifies how much data must be stored in the in-memory cache allocated by the middleware. The lifespan QoS helps control memory usage and the lifecycle of data by setting an expiration time for data on DataWriters, so that the middleware can delete expired data from its cache. The deadline QoS policy specifies the deadline between two successive updates of each data sample. The middleware notifies the application via callbacks if a DataReader or a DataWriter breaks the deadline contract. Note that DDS makes no effort to meet the deadline; it only notifies the application if the deadline is missed.

Our experiments use the reliability and durability DDS QoS policies for pub/sub communications to validate PubSubCoord for higher service quality in terms of reliability and persistence of data delivery, both of which are important for our systems of interest. Depending on a system's requirements, the QoS policies can be varied, and the performance results may change according to the different QoS settings. Specifically, we use the reliable reliability QoS to avoid data loss at the transport level through data retransmission. We use the keep-all history QoS to keep all historical data and the transient durability QoS to make it possible for late-joining subscribers to obtain previously published samples. The lifespan QoS is set to 60 seconds, so publishers guarantee persistence for 60 seconds.
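As an illustration of this QoS combination, the sketch below configures a DataWriter using the RTI Connext DDS Java API; the helper class and method are hypothetical, and an actual deployment could equivalently load an XML QoS profile:

    import com.rti.dds.infrastructure.DurabilityQosPolicyKind;
    import com.rti.dds.infrastructure.HistoryQosPolicyKind;
    import com.rti.dds.infrastructure.ReliabilityQosPolicyKind;
    import com.rti.dds.infrastructure.StatusKind;
    import com.rti.dds.publication.DataWriter;
    import com.rti.dds.publication.DataWriterQos;
    import com.rti.dds.publication.Publisher;
    import com.rti.dds.topic.Topic;

    public final class ExperimentQos {
        // Creates a DataWriter with the QoS settings described above.
        public static DataWriter createExperimentWriter(Publisher publisher, Topic topic) {
            DataWriterQos qos = new DataWriterQos();
            publisher.get_default_datawriter_qos(qos);

            qos.reliability.kind = ReliabilityQosPolicyKind.RELIABLE_RELIABILITY_QOS; // retransmit lost samples
            qos.durability.kind = DurabilityQosPolicyKind.TRANSIENT_DURABILITY_QOS;   // serve late joiners
            qos.history.kind = HistoryQosPolicyKind.KEEP_ALL_HISTORY_QOS;             // keep all samples
            qos.lifespan.duration.sec = 60;                                            // expire samples after 60 s
            qos.lifespan.duration.nanosec = 0;

            return publisher.create_datawriter(topic, qos, null /* no listener */,
                                               StatusKind.STATUS_MASK_NONE);
        }
    }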

3.3 Testing Methodology

To evaluate our solution, system performance is measured in terms of the end-to-end latency from publishers to subscribers, while the scalability of data dissemination is measured in terms of CPU usage on the brokers. CPU usage is shown alongside latency to understand how different configuration settings, i.e., the number of topics per network and the number of routing brokers, affect dissemination scalability.


Moreover, we measure the latency of coordination requests and the number of data objects and notifications on the ZooKeeper servers to show coordination scalability. To measure end-to-end latency from publishers to subscribers, we calculate time differences using timestamps of events on publishers and subscribers. For clock synchronization, we use the Precision Time Protocol (PTP) [5], which guarantees fine-grained time synchronization for distributed machines and achieves clock accuracy in the sub-microsecond range on a local network. Due to space limitations, we have not presented results on system behavior and performance under dynamic churn in the different aspects of the system.

3.4 Performance and Scalability Results

For the end-to-end latency measurements, we collect latency values for 5,000 samples in total for each subscriber and use only the values after the first 1,000 samples, since the latency values of the initial samples are not consistent due to coordination and discovery overhead until the system stabilizes (e.g., time for discovery of brokers and creation of routes). Figure 6 summarizes all the results explained below.

3.4.1 Evaluating the Broker Overlay Layer

Since the edge brokers are responsible for delivering data incoming from other edge brokers via worker routing brokers to subscribers in their local, isolated network, the computation overhead on edge brokers grows linearly as the number of adopted topics increases. Figures 6a, 6d, and 6g show results for different numbers of topics per edge broker, increasing the number of topics from 20 to 100 out of the 1,000 topics in the system. The CPU utilization increases linearly with the number of adopted topics, and the latency values grow as well. From these results, we can infer that if the number of incoming streams increases due to more topics per network, latency is adversely affected even though the CPUs of the edge brokers are not saturated.

Our solution supports load balancing across the group of routing brokers and makes it possible to flexibly scale systems with the number of topics. Figures 6b, 6e, and 6h present latency and CPU usage for different numbers of routing brokers. When the number of routing brokers is small, in this case 5, the CPUs of the routing brokers become saturated and latency values are adversely impacted. However, after increasing the number of routing brokers to 10, latency values improve. The results in Figure 6h also confirm that CPU usage decreases linearly as the number of routing brokers increases.

3.4.2 Evaluating the Coordination Layer

We evaluate the scalability of the ZooKeeper-based centralized coordination service by increasing the number of simultaneously joining subscribers. Figures 6c and 6f show latency, i.e., the amount of time it takes for the server to respond to a client request. Figure 6i presents the number of used znodes and watches. We use mntr, a ZooKeeper command for monitoring the service,9 to retrieve the experimental values presented in our results. We increase the number of subscribers from 2,000 to 10,000 in steps of 2,000.

9 http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html

The average latency increases from 10 milliseconds to 20 milliseconds, and the numbers of znodes and watches increase linearly by approximately 2,000 and 4,000, respectively, as the number of subscribers increases. The number of watches is twice the number of znodes because ZooKeeper needs to notify the brokers of both publishers and subscribers if they have matching pub/sub endpoints.

3.5 Evaluating Performance Optimizations for Deadline-aware Overlays

We also conducted experiments to validate our deadline-aware overlays, showing latency and overhead by comparing the performance of multi-path and single-path overlays. The topology used for these experiments is shown in Figure 3. To emulate the variable network delays and packet losses that are common in WANs, we use Dummynet [23]. These parameters vary depending on the geographic locations of brokers, which is a factor influencing the need for deadline-aware overlays.

To ensure realistic network delays and losses for the multi-path overlay experiments, we use delay and loss data provided by Verizon, which publishes latency and packet delivery statistics for communication between different countries across the globe.10 We categorize the delay and loss data into two groups (i.e., A with 30 ms delay and no packet loss, and B with 250 ms delay and 1% packet loss, as shown in Table 1) and experimented with the 8 possible combinations over the given links (i.e., L1, L2, and L3 as shown in Figure 3), yielding the test cases described in Table 1.

Table 1: Deadline-aware Overlays Experiment Cases

Test Case   L1   L2   L3
Case 1      A    A    A
Case 2      A    A    B
Case 3      A    B    A
Case 4      A    B    B
Case 5      B    A    A
Case 6      B    A    B
Case 7      B    B    A
Case 8      B    B    B

A = 30 ms delay, no packet loss
B = 250 ms delay, 1% packet loss

Figures 7a and 7b show the average and maximum latency of single-path overlays under different network delays and packet loss, and of multi-path overlays for the 8 test cases, respectively. From Case 1 to Case 5, the multi-path overlays perform better than any single-path case in terms of latency. All multi-path cases outperform the single-path case with 125 milliseconds delay and 1% packet loss. On the other hand, a multi-path overlay builds a duplicate path directly from an edge broker rather than through a routing broker, so it causes extra overhead compared to a single-path overlay due to additional computation and extra network transfer at the edge broker. We measure the network transfer overhead for 10,000 samples from a publisher to a subscriber to compare single-path and multi-path using tcpdump;11 the results are presented in Figure 7c. These results confirm that the deadline-aware overlay improves latency but incurs some overhead.

10 http://www.verizonenterprise.com/about/network/latency

11 http://www.tcpdump.org


Figure 7: Deadline-aware Experiments: (a) end-to-end latency of pub/sub with single-path overlays; (b) end-to-end latency of pub/sub with multi-path overlays; (c) overhead comparison of network traffic (E1 to R, R to E2, and E1 to E2) for single-path and multi-path overlays

3.6 Discussion and Key Insights

The following is a summary of the insights we gained from this research and the empirical evaluations.

• PubSubCoord disseminates data in a scalable manner for systems having many pub/sub endpoints and topics. The experimental results show that PubSubCoord can deliver topic data within 100 milliseconds for a system having 10,000 subscribers and 1,000 topics distributed across more than 100 networks. As the number of topics in a system increases, our solution uses elastic cloud resources and load balancing techniques to deliver data in a scalable way. However, if the number of adopted topics per edge broker increases, service quality degrades, as shown in the experimental results, because edge brokers need to handle more forwarding operations between routing brokers and pub/sub endpoints. If a system requires a higher publication frequency or more topics per network, edge brokers can become a bottleneck, requiring an elastic solution for edge brokers as well.

• Centralized coordination services like ZooKeeper can serve as a pub/sub control plane for large-scale systems. Our solution employs a centralized service for coordinating pub/sub brokers because of its consistency and simplicity, and our experimental results show that the average latency of the coordination service is 20 milliseconds for 10,000 subscribers joining simultaneously. Moreover, because the data tree is organized hierarchically, the number of data nodes and notifications increases only linearly. Our experiments use a standalone server for coordination, but multiple servers can be used as an ensemble for scalability and fault tolerance, and ZooKeeper guarantees consistency of data across the servers. The ensemble is more scalable for read operations, but not for write operations, which require synchronizing data between servers. In future work, we plan to carry out experiments with an increasing number of coordination servers to understand their scalability for pub/sub broker coordination. (A minimal sketch of broker registration and discovery via ZooKeeper appears after this list.)

• Configurable QoS supported by the pub/sub technology can be used for low-latency data delivery in WANs by building multi-path overlays. In our experiments and concrete realization of PubSubCoord, we used OMG DDS and its QoS policies. We use the configurable deadline QoS to deliver data at low latency by establishing selective multi-path overlays, and validate this approach with experimental results (see the deadline QoS sketch after this list). Since not every path can be a delay-sensitive path, we need some higher-level policy management (e.g., offered and requested QoS management between network domains) to decide what characterizes a delay-sensitive path. In addition, although this approach assures low-latency data delivery, it incurs extra overhead by duplicating data delivery over multiple paths. To reduce this cost, we can utilize the ownership QoS, which dynamically selects an owner of a data stream to suppress traffic from backups and switches ownership to a backup when the owner fails. Our deadline-aware overlay optimizations were easier to realize due to OMG DDS features; implementing similar optimizations for other messaging systems will require identifying similar opportunities in that pub/sub technology.

• End-to-end QoS management is required for efficiency. Although most of the QoS policies in our solution are supported by hop-by-hop enforcement between brokers, the QoS policies for persistence, reliability, and ordering used in our experiments assure end-to-end QoS. However, this may be inefficient in some cases. For example, the durability QoS ensures that previously published data is sent to late-joining subscribers. To support end-to-end data persistence with hop-by-hop QoS enforcement, each broker needs to keep history data in memory that cannot be freed until it is acknowledged (see the durability sketch after this list). This is beneficial for late-joining subscribers that require history data with low latency; however, keeping duplicate history data on each broker consumes memory resources. We suggest end-to-end acknowledgment mechanisms as a solution to address these inefficiencies.
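To make the coordination pattern concrete, the following sketch illustrates how an edge broker might register itself for a topic and watch for matching brokers using the standard Apache ZooKeeper Java client. The znode layout (/topics/<topic>/sub), the broker identifier, the endpoint address, and the connection string are illustrative assumptions for exposition and do not necessarily match PubSubCoord's actual data tree.

```java
import org.apache.zookeeper.*;
import java.util.List;

public class EdgeBrokerRegistration {
    public static void main(String[] args) throws Exception {
        // Hypothetical ensemble address; the real deployment configuration may differ.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> { });

        String topicPath = "/topics/temperature";
        // Create persistent parent znodes if they do not exist yet.
        for (String path : new String[]{"/topics", topicPath, topicPath + "/sub"}) {
            if (zk.exists(path, false) == null) {
                try {
                    zk.create(path, new byte[0],
                              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                } catch (KeeperException.NodeExistsException ignored) { }
            }
        }

        // Register this edge broker as a subscriber-side endpoint for the topic.
        // The ephemeral znode disappears automatically if the broker fails,
        // which gives the coordination layer churn notifications for free.
        zk.create(topicPath + "/sub/edge-broker-1",
                  "192.0.2.10:7400".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Watch the topic's subscriber list; the watcher above fires when
        // brokers join or leave, triggering overlay (re)configuration.
        List<String> subs = zk.getChildren(topicPath + "/sub", true);
        System.out.println("Brokers subscribed to " + topicPath + ": " + subs);
    }
}
```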
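Similarly, the deadline-aware behavior builds only on standard DDS QoS. The sketch below shows one way a subscriber-side endpoint or broker could request a 100 ms deadline and react when it is missed, written against the RTI Connext DDS Java API; the topic name, period, and reaction in the listener are illustrative assumptions rather than PubSubCoord's actual implementation.

```java
import com.rti.dds.domain.*;
import com.rti.dds.infrastructure.*;
import com.rti.dds.subscription.*;
import com.rti.dds.topic.Topic;
import com.rti.dds.type.builtin.StringTypeSupport;

public class DeadlineAwareReader {
    public static void main(String[] args) {
        DomainParticipant participant =
            DomainParticipantFactory.TheParticipantFactory.create_participant(
                0, DomainParticipantFactory.PARTICIPANT_QOS_DEFAULT,
                null, StatusKind.STATUS_MASK_NONE);

        // The built-in string type keeps the example self-contained; a real
        // deployment would use application-defined types generated from IDL.
        StringTypeSupport.register_type(participant, StringTypeSupport.get_type_name());
        Topic topic = participant.create_topic(
            "SensorReadings", StringTypeSupport.get_type_name(),
            DomainParticipant.TOPIC_QOS_DEFAULT, null, StatusKind.STATUS_MASK_NONE);

        Subscriber subscriber = participant.create_subscriber(
            DomainParticipant.SUBSCRIBER_QOS_DEFAULT, null, StatusKind.STATUS_MASK_NONE);

        // Request that a new sample arrive at least every 100 ms.
        DataReaderQos readerQos = new DataReaderQos();
        subscriber.get_default_datareader_qos(readerQos);
        readerQos.deadline.period.sec = 0;
        readerQos.deadline.period.nanosec = 100 * 1000 * 1000;

        DataReaderListener listener = new DataReaderAdapter() {
            @Override
            public void on_requested_deadline_missed(
                    DataReader reader, RequestedDeadlineMissedStatus status) {
                // A missed deadline is the trigger to fall back to (or build)
                // the duplicate path in a deadline-aware overlay.
                System.out.println("Deadline missed " + status.total_count + " time(s)");
            }
        };

        subscriber.create_datareader(
            topic, readerQos, listener, StatusKind.REQUESTED_DEADLINE_MISSED_STATUS);
    }
}
```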
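Finally, for the hop-by-hop persistence discussion, the following fragment sketches how a broker-side DataWriter could be configured with transient-local durability and a bounded history so that late-joining subscribers can catch up. It again assumes the RTI Connext DDS Java API and a Publisher and Topic created analogously to the previous sketch; the history depth is an illustrative parameter that trades per-broker memory for deeper catch-up.

```java
import com.rti.dds.infrastructure.*;
import com.rti.dds.publication.*;
import com.rti.dds.topic.Topic;

public class DurableWriterConfig {
    // Configure a broker-side writer so late-joining subscribers can fetch
    // the last 'depth' samples from it (hop-by-hop persistence).
    static DataWriter createDurableWriter(Publisher publisher, Topic topic, int depth) {
        DataWriterQos qos = new DataWriterQos();
        publisher.get_default_datawriter_qos(qos);

        // Retain previously published samples for late joiners and deliver reliably.
        qos.durability.kind = DurabilityQosPolicyKind.TRANSIENT_LOCAL_DURABILITY_QOS;
        qos.reliability.kind = ReliabilityQosPolicyKind.RELIABLE_RELIABILITY_QOS;

        // Bound the history kept in each broker's memory.
        qos.history.kind = HistoryQosPolicyKind.KEEP_LAST_HISTORY_QOS;
        qos.history.depth = depth;

        return publisher.create_datawriter(topic, qos, null, StatusKind.STATUS_MASK_NONE);
    }
}
```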

4. RELATED WORK

Prior research on pub/sub systems can be classified into topic-based, attribute-based, and content-based, depending on the subscription model. The topic-based model, exemplified by Scribe [24], TIB/RV [19], and SpiderCast [8], groups subscription events into topics. In the attribute-based model, events are defined by specific types, and therefore this model helps developers define data models robustly through type checking. The content-based model [7, 22] allows subscribers to express their interests by specifying conditions


on the data content of events, and the system filters and delivers events based on those conditions.

The Object Management Group (OMG)'s Data Distribution Service (DDS) [20] standard for data-centric pub/sub holds substantial promise for CPS because of its support for configurable QoS policies, dynamic discovery, and asynchronous and anonymous decoupling of data endpoints (i.e., publishers and subscribers) in time and space. However, DDS is limited in its support for WAN-based CPS. For instance, DDS uses multicast as the default transport to dynamically discover peers in a system. If endpoints are located in isolated networks that do not support multicast, they cannot discover each other. Second, even if these endpoints were discoverable, network firewalls and network address translation (NAT) may prevent peers from delivering messages to the destination endpoints. PubSubCoord addresses these limitations and makes it possible to use OMG DDS in WAN-scale CPS without any modifications to the OMG DDS semantics.

Pub/sub systems tend to form overlay networks to support application-level multicast rather than relying on IP multicast, because IP multicast is not supported in WANs and the limited number of IP multicast addresses cannot accommodate the potential number of logical channels for fine-grained subscription models [2]. Overlay architectures for pub/sub systems can be categorized into broker-based overlays [7, 22, 19], structured peer-to-peer [24], and unstructured peer-to-peer. GREEN [25] supports configurable overlay architectures for different network environments. PubSubCoord adopts a hybrid approach that constructs unstructured peer-to-peer overlays in LANs by dynamically discovering peers via multicast, and broker-based overlays in WANs.

The BlueDove [16] pub/sub system achieves scalability and elasticity by harnessing cloud resources. It uses a two-tier architecture to reduce the number of delivery hops and for simplicity. BlueDove is designed for enterprise systems deployed in the cloud and does not consider restrictions on the physical locations of pub/sub endpoints. Like BlueDove, PubSubCoord uses a hierarchical broker solution and exploits the cloud for scalability. However, in our system, pub/sub endpoints located in different networks dynamically discover each other with the help of edge brokers, and therefore we consider the physical restrictions of pub/sub endpoints. Moreover, our approach is decoupled from any specific pub/sub technology.

Bellavista et al. [3] study QoS-aware pub/sub systems over WANs and compare multiple existing pub/sub systems supporting QoS, including DDS. In [15], the authors evaluate a pub/sub system for wide-area networks named Harmony and techniques for responsive and highly available messaging. The Harmony system delivers messages through broker overlays placed in different physical networks, and pub/sub endpoints communicate via local brokers located in the same network. Although this effort describes a WAN-scale pub/sub solution with QoS support, it centers on selective routing strategies that balance responsiveness and resource usage over multi-hop broker networks. In contrast, PubSubCoord uses only a two-hop overlay architecture connecting the edge and routing brokers.

IndiQoS [6] also proposes a pub/sub system with QoS support to reduce end-to-end latency by exploiting network-level reservation mechanisms, where message brokers are structured using a distributed hash table (DHT). Similar to IndiQoS, we pursue low latency and high availability, but our solution can also support other QoS policies, such as configurable transport reliability, data persistence, ordering, and resource management, by controlling the depth of history data and the subscribing rate. In other words, we strive to provide the capabilities of the underlying pub/sub technology at the WAN level and additionally provide WAN-based optimizations. We do not use a DHT solution for brokers, so a comparison along these lines will require additional research, which is part of our future work.

Recent research, including ours [10, 17], has broadened the scope of OMG DDS to WANs by introducing routing engines that disseminate data from a local network to others. Our solution emphasizes reuse and hence leverages such routing engines, and additionally solves the discovery and coordination problem between routing engines that otherwise requires significant manual effort for large-scale systems. Finally, [28] suggests the separation of the control and data planes in next-generation pub/sub systems, motivated by software-defined networking (SDN). We have not explored the benefits of SDN for PubSubCoord; however, our other ongoing efforts have demonstrated preliminary ideas [11, 21], which form additional dimensions of future work.

Literature on stream processing is not reviewed since we feel the comparisons are not accurate due to the differences in the computation and communication semantics.

5. CONCLUDING REMARKS

Emerging WAN-scale distributed systems found in domains such as transportation and the smart grid must disseminate large volumes of data between a large number of geographically distributed, heterogeneous entities, and require a variety of QoS properties for data dissemination from the publishers of information to the subscribers. Many disparate solutions exist that handle individual aspects of the problem space, but seldom have these techniques been brought together holistically to solve the broader set of challenges. There is a need to systematically integrate these proven solutions while also providing new capabilities. This is challenging, particularly when the sum of the parts itself must be made reusable and applicable across a variety of pub/sub technologies.

To address some of these broader challenges, this paper presents the design, implementation, and evaluation of PubSubCoord, a cloud-based coordination and discovery service and middleware for geographically dispersed pub/sub applications. PubSubCoord supports scalability in terms of data dissemination as well as coordination, dynamic discovery, and configurable QoS properties. Our experimental results validate these claims.

Insights gained from this research and its current limitations have provided us directions for future work. For example, we have not fully explored the fault-tolerance and security dimensions in the current work. Similarly, the edge broker bottleneck remains to be resolved. Furthermore, our work uses overlay networks; thus, we do not have control over the QoS of the underlying network links. It is possible to use software-defined networking (SDN) to control the network QoS for pub/sub traffic and provide differentiated services to different pub/sub flows. A recent effort [26] has already shown how pub/sub and SDN can be combined to perform filtering at the level of SDN controllers. Since the systems of interest to us


will almost always comprise heterogeneous technologies, we need to investigate how the existing architecture can support WAN-scale pub/sub across distributed, isolated networks using heterogeneous pub/sub technologies (e.g., between DDS and MQTT). Our future work aims to address these limitations.

PubSubCoord is designed with reuse in mind and can easily be adopted in industrial settings. To that end, the middleware and test harness can be downloaded from: www.dre.vanderbilt.edu/~kyoungho/pubsubcoord.

6. REFERENCES

[1] R. Baldoni, R. Beraldi, L. Querzoni, and A. Virgillito. Efficient publish/subscribe through a self-organizing broker overlay and its application to Siena. The Computer Journal, 50(4):444–459, 2007.

[2] R. Baldoni, M. Contenti, and A. Virgillito. The evolution of publish/subscribe communication systems. In Future Directions in Distributed Computing, pages 137–141. Springer, 2003.

[3] P. Bellavista, A. Corradi, and A. Reale. Quality of service in wide scale publish-subscribe systems. IEEE Communications Surveys & Tutorials, 2014.

[4] F. Cao and J. P. Singh. Efficient event routing in content-based publish-subscribe service networks. In INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 2, pages 929–940. IEEE, 2004.

[5] L. Carroll. IEEE 1588 Precision Time Protocol (PTP). http://www.eecis.udel.edu/~mills/ptp.html, 2012.

[6] N. Carvalho, F. Araujo, and L. Rodrigues. Scalable QoS-based event routing in publish-subscribe systems. In Network Computing and Applications, Fourth IEEE International Symposium on, pages 101–108. IEEE, 2005.

[7] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf. Design and evaluation of a wide-area event notification service. ACM Transactions on Computer Systems (TOCS), 19(3):332–383, 2001.

[8] G. Chockler, R. Melamed, Y. Tock, and R. Vitenberg. SpiderCast: a scalable interest-aware overlay for topic-based pub/sub communication. In Proceedings of the 2007 Inaugural International Conference on Distributed Event-Based Systems, pages 14–25. ACM, 2007.

[9] C. Esposito, D. Cotroneo, and A. Gokhale. Reliable publish/subscribe middleware for time-sensitive internet-scale applications. In Proceedings of the Third ACM International Conference on Distributed Event-Based Systems, page 16. ACM, 2009.

[10] A. Hakiri, P. Berthou, A. Gokhale, D. Schmidt, and T. Gayraud. Supporting End-to-end Scalability and Real-time Event Dissemination in the OMG Data Distribution Service over Wide Area Networks. Elsevier Journal of Systems Software (JSS), 86(10):2574–2593, Oct. 2013.

[11] A. Hakiri, P. Berthou, P. Patil, and A. Gokhale. Towards a Publish/Subscribe-based Open Policy Framework for Proactive Overlay Software Defined Networking. Technical Report ISIS-15-115, Institute for Software Integrated Systems, Nashville, TN, USA, 2015.

[12] P. Hintjens. ZeroMQ: Messaging for Many Applications. O'Reilly Media, Inc., 2013.

[13] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference, volume 8, pages 11–11, 2010.

[14] M. A. Jaeger, H. Parzyjegla, G. Muhl, and K. Herrmann. Self-organizing broker topologies for publish/subscribe systems. In Proceedings of the 2007 ACM Symposium on Applied Computing, pages 543–550. ACM, 2007.

[15] M. Kim, K. Karenos, F. Ye, J. Reason, H. Lei, and K. Shagin. Efficacy of techniques for responsiveness in a wide-area publish/subscribe system. In Proceedings of the 11th International Middleware Conference Industrial Track, pages 40–45. ACM, 2010.

[16] M. Li, F. Ye, M. Kim, H. Chen, and H. Lei. A scalable and elastic publish/subscribe service. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 1254–1265. IEEE, 2011.

[17] J. M. Lopez-Vega, J. Povedano-Molina, G. Pardo-Castellote, and J. M. Lopez-Soler. A content-aware bridging service for publish/subscribe environments. Journal of Systems and Software, 86(1):108–124, 2013.

[18] OASIS. MQTT version 3.1.1. http://docs.oasis-open.org/mqtt/mqtt/v3.1.1/os/mqtt-v3.1.1-os.html, 2014.

[19] B. Oki, M. Pfluegl, A. Siegel, and D. Skeen. The information bus: an architecture for extensible distributed systems. In ACM SIGOPS Operating Systems Review, volume 27, pages 58–68. ACM, 1994.

[20] OMG. The data distribution service specification, v1.2. http://www.omg.org/spec/DDS/1.2, 2007.

[21] P. Patil, A. Hakiri, and A. Gokhale. Bootstrapping Software Defined Network for Flexible and Dynamic Control Plane Management. In 1st IEEE Conference on Network Softwarization, London, UK, Apr. 2015. IEEE.

[22] P. R. Pietzuch and J. M. Bacon. Hermes: A distributed event-based middleware architecture. In Distributed Computing Systems Workshops, 2002. Proceedings. 22nd International Conference on, pages 611–618. IEEE, 2002.

[23] L. Rizzo. Dummynet: a simple approach to the evaluation of network protocols. ACM SIGCOMM Computer Communication Review, 27(1):31–41, 1997.

[24] A. Rowstron, A.-M. Kermarrec, M. Castro, and P. Druschel. Scribe: The design of a large-scale event notification infrastructure. In Networked Group Communication, pages 30–43. Springer, 2001.

[25] T. Sivaharan, G. Blair, and G. Coulson. GREEN: A configurable and re-configurable publish-subscribe middleware for pervasive computing. In On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE, pages 732–749. Springer, 2005.

[26] M. A. Tariq, B. Koldehofe, S. Bhowmik, and K. Rothermel. PLEROMA: A SDN-based High Performance Publish/Subscribe Middleware. In Proceedings of the 15th International Middleware Conference, Middleware '14, pages 217–228, New York, NY, USA, 2014. ACM.

[27] S. Vinoski. Advanced message queuing protocol. IEEE Internet Computing, 10(6):87–89, 2006.

[28] K. Zhang and H.-A. Jacobsen. SDN-like: The next generation of pub/sub. arXiv preprint arXiv:1308.0056, 2013.