Experiences with 100Gbps Network Applications
Analyzing Data Movements and Identifying Techniques for Next-generation High-bandwidth Networks
Mehmet Balman, Computational Research Division, Lawrence Berkeley National Laboratory

UC Davis talk, January 28, 2013

TRANSCRIPT

Page 1: Analyzing Data Movements and Identifying Techniques for Next-generation Networks

Experiences with 100Gbps Network Applications
Analyzing Data Movements and Identifying Techniques for Next-generation High-bandwidth Networks
Mehmet Balman, Computational Research Division, Lawrence Berkeley National Laboratory

Page 2:

Outline
•  Climate Data as a typical science scenario
   •  100Gbps Climate100 demo
•  MemzNet: Memory-mapped zero-copy Network Channel
•  Advanced Networking Initiative (ANI) Testbed
   •  100Gbps Experiments
•  Future Research Plans

Page 3:

100Gbps networking has finally arrived!
Applications' perspective: increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.

1Gbps to 10Gbps transition (10 years ago)
Applications did not run 10 times faster just because more bandwidth was available.

Page 4:

The need for 100Gbps
Modern science is data driven and collaborative in nature.
•  The largest collaborations are most likely to depend on distributed architectures.
•  LHC (distributed architecture): data generation, distribution, and analysis.
•  The volume of data produced by genomic sequencers is rising exponentially.
•  In climate science, researchers must analyze observational and simulation data located at facilities around the world.

Page 5:

ANI 100Gbps Demo
•  100Gbps demo by ESnet and Internet2
•  Application design issues and host tuning strategies to scale to 100Gbps rates
•  Visualization of remotely located data (Cosmology)
•  Data movement of large datasets with many files (Climate analysis)
Late 2011 – early 2012

Page 6:

ESG (Earth System Grid)
•  Over 2,700 sites
•  25,000 users
•  IPCC Fifth Assessment Report (AR5): 2PB
•  IPCC Fourth Assessment Report (AR4): 35TB

Page 7:

Data distribution for climate science
How can scientific data movement and analysis between geographically disparate supercomputing facilities benefit from high-bandwidth networks?
•  Local copies: data files are copied into temporary storage at HPC centers for post-processing and further climate analysis.

Page 8:

Climate Data Analysis
Petabytes of data
•  Data is usually in a remote location.
•  How can we access it efficiently?

Page 9:

Tuning Network Transfers

Page 10:

Adaptive Tuning

Worked well with large files!

Page 11:

Climate Data over 100Gbps
•  Data volume in climate applications is increasing exponentially.
•  An important challenge in managing ever-increasing data sizes in climate science is the large variance in file sizes.
•  Climate simulation data consists of a mix of relatively small and large files, with an irregular file size distribution in each dataset.
•  Many small files

Page 12:

lots-of-small-files problem!
file-centric tools?
[Diagram: with file-centric tools (FTP, RPC), every file costs its own exchange: request a file / send file, or request data / send data, repeated per file]
•  Concurrent transfers
•  Parallel streams
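The per-file request/response pattern above can be put into a rough cost model: every file pays at least one round trip before its payload moves. A minimal sketch, with hypothetical numbers (not measurements from the talk):

```python
def transfer_time_s(n_files, total_bytes, rtt_s, bandwidth_bps):
    """Crude cost model: one request round trip per file, plus the
    serialized payload time at full link bandwidth."""
    return n_files * rtt_s + (total_bytes * 8) / bandwidth_bps

# Hypothetical: 100,000 small files totaling 100 GB over a 100Gbps
# path with 50 ms RTT. Per-file requests are dominated by round trips:
per_file = transfer_time_s(100_000, 100 * 10**9, 0.050, 100 * 10**9)  # ~5008 s
# Bundled into one stream, the same bytes pay only one round trip:
bundled = transfer_time_s(1, 100 * 10**9, 0.050, 100 * 10**9)         # ~8.05 s
```

The model ignores TCP dynamics entirely, but it shows why bundling many small files matters on high-latency, high-bandwidth paths.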

Page 13:

Moving climate files efficiently

Page 14:

MemzNet: memory-mapped zero-copy network channel
[Diagram: front-end threads (access to memory blocks) on each side; memory blocks streamed over the network]
Memory caches are logically mapped between client and server.

Page 15:

Advantages
•  Decoupling I/O and network operations
   •  front-end (I/O processing)
   •  back-end (networking layer)
•  Not limited by the characteristics of the file sizes
•  On-the-fly tar approach: bundling and sending many files together
•  Dynamic data channel management: can increase/decrease the parallelism level both in the network communication and in I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
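The front-end/back-end decoupling can be sketched as a producer/consumer pair sharing a bounded queue of blocks. This is an illustrative Python sketch of the idea only, not MemzNet's implementation; all names here are hypothetical:

```python
import queue
import threading

BLOCK_SIZE = 4 * 1024 * 1024  # 4MB, the block size used in the demo

def read_files(paths, blocks):
    """Front-end: fill fixed-size blocks from files, ignoring file
    boundaries; a None sentinel marks the end of the stream."""
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(BLOCK_SIZE):
                blocks.put(chunk)
    blocks.put(None)

def send_blocks(blocks, send_fn):
    """Back-end: drain blocks toward the network; knows nothing about files."""
    while (block := blocks.get()) is not None:
        send_fn(block)

def transfer(paths, send_fn):
    # Bounded queue: reader and sender proceed at their own pace,
    # and either side's parallelism can change without touching the other.
    blocks = queue.Queue(maxsize=16)
    reader = threading.Thread(target=read_files, args=(paths, blocks))
    reader.start()
    send_blocks(blocks, send_fn)
    reader.join()
```

Because the network side only ever sees blocks, many small files and a few large files cost the same per block, which is the point of not being file-centric.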

Page 16:

Advantages
The synchronization of the memory cache is accomplished based on the tag header.
•  Application processes interact with the memory blocks.
•  Enables out-of-order and asynchronous send/receive.
MemzNet is not file-centric. Bookkeeping information is embedded inside each block.
•  Can increase/decrease the number of parallel streams without closing and reopening the data channel.

Page 17:

100Gbps demo environment

RTT: Seattle – NERSC 16ms, NERSC – ANL 50ms, NERSC – ORNL 64ms

Page 18:

100Gbps Demo
•  CMIP3 data (35TB) from the GPFS filesystem at NERSC
•  Block size 4MB
•  Each block's data section was aligned according to the system page size.
•  1GB cache both at the client and the server
•  At NERSC, 8 front-end threads on each host for reading data files in parallel.
•  At ANL/ORNL, 4 front-end threads for processing received data blocks.
•  4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.
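Page-aligning each block's data section, as described above, amounts to rounding the header size up to a page boundary. A minimal sketch; the 64-byte header size is a hypothetical placeholder, not MemzNet's actual layout:

```python
import mmap

HEADER_SIZE = 64          # hypothetical tag-header size
PAGE = mmap.PAGESIZE      # system page size, typically 4096

def data_offset(header_size=HEADER_SIZE, page=PAGE):
    """Round the header size up to the next page boundary so the data
    section starts page-aligned (useful for zero-copy and direct I/O)."""
    return (header_size + page - 1) // page * page
```

With a 4096-byte page, any header between 1 and 4096 bytes pushes the data section to offset 4096.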

Page 19:

83Gbps throughput

Page 20:

Framework for the Memory-mapped Network Channel

memory caches are logically mapped between client and server

Page 21:

MemzNet's Features
Data files are aggregated and divided into simple blocks. Blocks are tagged and streamed over the network. Each data block's tag includes information about the content inside. MemzNet decouples disk and network I/O operations, so read/write threads can work independently. It implements a memory cache management system that is accessed in blocks. These memory blocks are logically mapped to the memory cache that resides at the remote site.
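The tagged-block idea can be sketched with a fixed binary header carrying the bookkeeping. The field layout below (file id, file offset, payload length) is an assumption for illustration, not MemzNet's actual wire format:

```python
import struct

# Hypothetical tag layout, network byte order: u64 file id,
# u64 offset within that file, u32 payload length.
TAG = struct.Struct("!QQI")

def pack_block(file_id, offset, payload):
    """Embed bookkeeping in the block itself, so blocks can travel
    out of order and still be written back at the right position."""
    return TAG.pack(file_id, offset, len(payload)) + payload

def unpack_block(blob):
    file_id, offset, length = TAG.unpack_from(blob)
    return file_id, offset, blob[TAG.size:TAG.size + length]
```

Since each block is self-describing, the receiver never needs per-file control messages, which is what enables asynchronous, out-of-order send/receive.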

Page 22:

Performance at SC11 demo
[Chart: MemzNet vs. GridFTP throughput; TCP buffer size set to 50MB]

Page 23:

Advanced Networking Initiative
ESnet's 100Gbps Testbed

Page 24:

Performance results in the 100Gbps ANI testbed

Page 25:

MemzNet's Performance
[Charts: MemzNet vs. GridFTP throughput, in the 100Gbps demo and in the ANI Testbed; TCP buffer size set to 50MB]

Page 26:

Challenge?
•  High bandwidth brings new challenges!
•  We need a substantial amount of processing power, and the involvement of multiple cores, to fill a 40Gbps or 100Gbps network.
•  Fine-tuning, both in the network and application layers, is required to take advantage of the higher network capacity.
•  Incremental improvement in current tools? We cannot expect every application to tune and improve every time we change the link technology or speed.

Page 27:

MemzNet
•  MemzNet: Memory-mapped Network Channel
•  High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer.
•  The main goal is to define a network channel so applications can use it directly, without the burden of managing/tuning the network communication.

Page 28:

Framework for the Memory-mapped Network Channel

memory caches are logically mapped between client and server

Page 29:

Advanced Networking Initiative
ESnet's 100Gbps Testbed

Page 30:

Initial Tests
•  Many TCP sockets oversubscribe the network and cause performance degradation.
•  Host system performance could easily be the bottleneck.
•  TCP/UDP buffer tuning, using jumbo frames, and interrupt coalescing.
•  Multi-core systems: IRQ binding is now essential for maximizing host performance.

Page 31:

Many Concurrent Streams

(a) Total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, for a total of 10 10Gbps NIC pairs, were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, were started simultaneously at the source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, 64 concurrent streams per job were initiated for 5min intervals (e.g. when the concurrency level is 4, there are 40 streams in total).

Page 32:

ANI testbed 100Gbps (10x10NICs, three hosts): Interrupts/CPU vs the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5min intervals], TCP buffer size is 50MB

Effects of many concurrent streams

Page 33:

Parallel Streams - 10Gbps

ANI testbed 10Gbps connection: Interface traffic vs the number of parallel streams [1, 2, 4, 8, 16, 32, 64 streams, 5min intervals], TCP buffer size is set to 50MB

Page 34:

Parallel Streams – 40Gbps

ANI testbed 40Gbps (4x10NICs, single host): Interface traffic vs the number of parallel streams [1, 2, 4, 8, 16, 32, 64 streams, 5min intervals], TCP buffer size is set to 50MB

Page 35:

Parallel streams - 40Gbps

ANI testbed 40Gbps (4x10NICs, single host): Throughput vs the number of parallel streams [1, 2, 4, 8, 16, 32, 64 streams, 5min intervals], TCP buffer size is set to 50MB

Page 36:

Host tuning
With proper tuning, we achieved 98Gbps using only 3 sending hosts, 3 receiving hosts, 10 10GE NICs, and 10 TCP flows.

Image source: “Experiences with 100Gbps network applications” In Proceedings of the fifth international workshop on Data-Intensive Distributed Computing, 2012

Page 37:

NIC/TCP Tuning
•  We are using Myricom 10G NICs (100Gbps testbed)
   •  Download the latest driver/firmware from the vendor site; the driver version shipped in RHEL/CentOS is fairly old.
   •  Enable MSI-X
•  Increase txqueuelen: /sbin/ifconfig eth2 txqueuelen 10000
•  Increase interrupt coalescence: /usr/sbin/ethtool -C eth2 rx-usecs 100
•  TCP tuning:
   net.core.rmem_max = 67108864
   net.core.wmem_max = 67108864
   net.core.netdev_max_backlog = 250000

From “Experiences with 100Gbps network applications” In Proceedings of the fifth international workshop on Data-Intensive Distributed Computing, 2012
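To survive a reboot, the kernel settings listed on this slide would typically be persisted in /etc/sysctl.conf (or a file under /etc/sysctl.d/) and loaded with `sysctl -p`; the values below are copied from the slide, the file placement is standard Linux practice rather than something the talk specifies:

```
# /etc/sysctl.conf fragment: host tuning for high-bandwidth TCP
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.netdev_max_backlog = 250000
```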

Page 38:

100Gbps = It's full of frames!
•  Problem:
   •  Interrupts are very expensive.
   •  Even with jumbo frames and driver optimization, there are still too many interrupts.
•  Solution:
   •  Turn off Linux irqbalance (chkconfig irqbalance off)
   •  Use /proc/interrupts to get the list of interrupts
   •  Dedicate an entire processor core to each 10G interface
   •  Use /proc/irq/<irq-number>/smp_affinity to bind rx/tx queues to a specific core.

From “Experiences with 100Gbps network applications” In Proceedings of the fifth international workshop on Data-Intensive Distributed Computing, 2012
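The smp_affinity file takes a hexadecimal CPU bitmask. A small helper to compute the mask for one dedicated core; this is an illustrative sketch, not a tool from the talk:

```python
def smp_affinity_mask(core):
    """Hex bitmask selecting a single CPU core, in the format accepted
    by /proc/irq/<irq-number>/smp_affinity.
    core 0 -> "1", core 4 -> "10", core 7 -> "80"."""
    return format(1 << core, "x")

# Binding would then be, as root:
#   echo <mask> > /proc/irq/<irq-number>/smp_affinity
# one mask per interrupt, so each 10G interface's rx/tx queues land
# on their own dedicated core.
```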

Page 39:

Tuning

Setting                      without tuning   with tuning   % improvement
Interrupt coalescing (TCP)   24.0 Gbps        36.8 Gbps     53.3%
Interrupt coalescing (UDP)   21.1 Gbps        38.8 Gbps     83.9%
IRQ Binding (TCP)            30.6 Gbps        36.8 Gbps     20.3%
IRQ Binding (UDP)            27.9 Gbps        38.5 Gbps     38.0%

[Bar chart: Host Tuning Results; throughput in Gbps, without tuning vs. with tuning, for each setting above]

Image source: “Experiences with 100Gbps network applications” In Proceedings of the fifth international workshop on Data-Intensive Distributed Computing, 2012

Page 40:

Initial Tests
•  Many TCP sockets oversubscribe the network and cause performance degradation.
•  Host system performance could easily be the bottleneck.
•  TCP/UDP buffer tuning, using jumbo frames, and interrupt coalescing.
•  Multi-core systems: IRQ binding is now essential for maximizing host performance.

Page 41:

End-to-end Data Movement • Tuning parameters

• Host performance

• Multiple streams on the host systems

• Multiple NICs and multiple cores

• Effect of the application design

Page 42:

MemzNet’s Architecture for data streaming

Page 43:

MemzNet
•  MemzNet: Memory-mapped Network Channel
•  High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer.
•  The main goal is to define a network channel so applications can use it directly, without the burden of managing/tuning the network communication.

Page 44:

Framework for the Memory-mapped Network Channel

memory caches are logically mapped between client and server

Page 45:

Related Work
•  Luigi Rizzo's netmap
   •  proposes a new API to send/receive data over the network
   •  http://info.iet.unipi.it/~luigi/netmap/
•  RDMA programming model
   •  MemzNet can be used for RDMA-enabled transfers
   •  memory-management middleware
•  GridFTP popen + reordering?

Page 46:

Future Work

•  Integrate MemzNet with scientific data formats for remote data analysis

• Provide a user library for filtering and subsetting of data.

Page 47:

Testbed Results
http://www.es.net/RandD/100g-testbed/
•  Proposal process
•  Testbed description, etc.
http://www.es.net/RandD/100g-testbed/results/
•  RoCE (RDMA over Converged Ethernet) tests
•  RDMA implementation (RFPT100)
•  100Gbps demo paper
•  Efficient Data Transfer Protocols
•  MemzNet (Memory-mapped Zero-copy Network Channel)

Page 48:

Acknowledgements
Collaborators: Eric Pouyoul, Yushu Yao, E. Wes Bethel, Burlen Loring, Prabhat, John Shalf, Alex Sim, Arie Shoshani, and Brian L. Tierney
Special Thanks: Peter Nugent, Zarija Lukic, Patrick Dorn, Evangelos Chaniotakis, John Christman, Chin Guok, Chris Tracy, Lauren Rotman, Jason Lee, Shane Canon, Tina Declerck, Cary Whitney, Ed Holohan, Adam Scovel, Linda Winkler, Jason Hill, Doug Fuller, Susan Hicks, Hank Childs, Mark Howison, Aaron Thomas, John Dugan, Gopal Vaswani

Page 49:

NDM 2013 Workshop (tentative)
The 3rd International Workshop on Network-aware Data Management, to be held in conjunction with the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC'13)
http://sdm.lbl.gov/ndm
Editing a book (CRC Press): NDM research frontiers (tentative)