Masaki Hirabaru, IPSJ Shikoku Chapter Lecture (情報処理学会四国支部講演会), December 17, 2004
TRANSCRIPT
2
Data Transfer: Then and Now

Packet transfer delay = size / speed + distance / speed of light

Packet size = 1500 bytes, distance = 10,000 km
1) Link speed = 9600 bps: delay = 156 ms + 33 ms
2) Link speed = 1 Gbps: delay = 15 µs + 33 ms

The network bandwidth has grown much wider, yet throughput does not follow!
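The two delay terms above can be checked with a quick calculation; a sketch for the 1 Gbps case (the speed of light is taken as 3×10⁸ m/s in vacuum; in fiber it is closer to 2×10⁸ m/s, which makes the propagation term larger):

```python
# Packet transfer delay = size/speed + distance/(speed of light).
# Checking the slide's 1 Gbps case.
size_bits = 1500 * 8            # 1500-byte packet
distance_m = 10_000 * 1000      # 10,000 km

serialization = size_bits / 1e9        # transmit time at 1 Gbps
propagation = distance_m / 3e8         # light travel time in vacuum

print(serialization * 1e6)  # ~12 us (the slide rounds to 15 us)
print(propagation * 1e3)    # ~33 ms
```

As the link gets faster, the serialization term shrinks toward zero while the propagation term stays fixed, which is why round-trip delay dominates on modern long-haul paths.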
3
Radio Telescopes
[Photos: NICT Kashima Space Center 34m; Onsala Space Observatory 20m and 18m; Urumqi 25m]
4
VLBI* System Transitions

1st Generation (1983~): Open-Reel Tape, Hardware Correlator, 64Mbps
  K3 Correlator (Center), K3 Recorder (Right)
2nd Generation (1990~): Cassette Tape, Hardware Correlator, e-VLBI over ATM, 256Mbps
  K4 Terminal, K4 Correlator
3rd Generation (2002~): PC-based System, Hard-disk Storage, Software Correlator, e-VLBI over Internet, 1~2Gbps
  K5 Data Acquisition Terminal

* Very Long Baseline Interferometry (超長基線電波干渉計)
5
VLBI (Very Long Baseline Interferometry)
[Figure: a radio signal from a star reaches antennas A and B with a geometric delay d; each site A/D-samples against a local clock and sends the data over the Internet to a correlator]
• e-VLBI: geographically distributed observation, interconnecting radio antennas over the world — a Large Bandwidth-Delay Product Network issue
• Gigabit / real-time VLBI: multi-gigabit rate sampling (~Gbps)
Applications: ASTRONOMY, GEODESY
6
Motivations
• MIT Haystack – CRL Kashima e-VLBI Experiment on August 27, 2003 to measure UT1-UTC in 24 hours
  – 41.54 GB CRL => MIT: 107 Mbps (~50 mins)
  – 41.54 GB MIT => CRL: 44.6 Mbps (~120 mins)
  – RTT ~220 ms; UDP throughput 300-400 Mbps, but TCP only ~6-8 Mbps (per session, tuned)
  – BBFTP with 5 x 10 TCP sessions to gain performance
  – Result: UT1-UTC = -32338.7280 +/- 23.90 usec
• HUT – CRL Kashima Gigabit VLBI Experiment
  – RTT ~325 ms; UDP throughput ~70 Mbps, but TCP ~2 Mbps (as is), ~10 Mbps (tuned)
  – Netants (5 TCP sessions with an FTP stream-restart extension)
These applications need high-speed, real-time, reliable, long-haul, high-performance transfer of huge data sets.
7
TCP Dynamic Behavior
[Figure: sending rate vs. time — exponential growth during slow-start, then a congestion-avoidance sawtooth oscillating below the available bandwidth]
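The slow-start and congestion-avoidance phases in the figure can be sketched with a toy simulation; all parameters below are illustrative assumptions, not values from the talk:

```python
# Toy model of TCP's rate evolution: exponential slow-start until the
# first loss, then additive-increase/multiplicative-decrease (AIMD)
# congestion avoidance, producing the sawtooth of the slide's figure.

def simulate(available_bw=100.0, rtts=200):
    """Return the per-RTT sending rate (arbitrary units)."""
    rates = []
    cwnd = 1.0          # congestion window, in "rate units"
    ssthresh = None     # unset until the first loss
    for _ in range(rtts):
        rates.append(min(cwnd, available_bw))
        if cwnd > available_bw:      # queue overflows -> packet loss
            ssthresh = cwnd / 2      # multiplicative decrease (halve)
            cwnd = ssthresh
        elif ssthresh is None:
            cwnd *= 2                # slow-start: double per RTT
        else:
            cwnd += 1                # congestion avoidance: +1 per RTT
    return rates

rates = simulate()
print(max(rates))  # the rate never exceeds the available bandwidth
```

The key point the figure makes is visible here: after every loss the rate is cut in half and then climbs back only one unit per RTT, so on a long-RTT path recovery is very slow.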
8
Example: From Tokyo to Boston
TCP on a fast long path with a bottleneck
[Figure: Tokyo sender (rate control) —50ms— Los Angeles —50ms— Boston receiver (loss detection), with a 100ms feedback path; link bw 1G, bottleneck bw 0.8G, 25MB buffer; the queue overflows and packets are lost]
It takes 150ms for the sender to learn of a loss (buffer overflow), and the queue keeps overflowing during that period. 150ms is very long for a high-speed network: at 1Gbps it corresponds to ~19MByte on the wire.
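The "~19MByte on the wire" figure is just rate times feedback delay; a quick check:

```python
# Data put on the wire while the sender is still unaware of a loss:
# 1 Gbps sustained for the 150 ms it takes loss feedback to arrive.
rate_bps = 1e9      # 1 Gbps sending rate
delay_s = 0.150     # time until the sender learns of the overflow

bytes_in_flight = rate_bps * delay_s / 8
print(bytes_in_flight / 1e6)  # ~18.75 MB, the slide's "~19 MByte"
```

Since the bottleneck buffer in the example is only 25MB, a single feedback delay's worth of overshoot nearly fills it, which is why the queue keeps overflowing until the rate cut takes effect.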
9
Conditions
• a single TCP stream (not multiple streams)
• memory to memory (no disk access)
• single bottleneck
• keep the end-to-end principle (no relays)
• packet-switched network (scalable)

Target: consume all the available bandwidth
10
Example: How much speed can we get?
[Figure: sender and receiver across a high-speed backbone, RTT 200ms, in two configurations — a-1) GbE hosts with a 100M bottleneck at a switch before the backbone; a-2) the same with an L2/L3 switch on each side]
11
Average TCP throughput: less than 20 Mbps,
even when we limit the sending rate to 100 Mbps.
This is TCP's fundamental behavior.
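One standard way to see why loss-limited TCP stays so far below the bottleneck rate is the Mathis et al. steady-state approximation, throughput ≈ (MSS/RTT)·C/√p. A sketch with the talk's ~200ms RTT; the loss rate p below is an illustrative assumption, not a measured value:

```python
import math

# Mathis et al. approximation of steady-state TCP Reno throughput:
#   throughput ~= (MSS / RTT) * C / sqrt(p),  C ~= 1.22 for periodic loss.
def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=1.22):
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate)

# 1460-byte MSS, RTT 200 ms as on the talk's Tokyo-US paths,
# and an assumed loss rate of 1e-4:
tput = mathis_throughput_bps(1460, 0.200, 1e-4)
print(tput / 1e6)  # ~7 Mbps — the same order as the measured 6-8 Mbps
```

The RTT appears in the denominator, so the same loss rate that is harmless on a LAN collapses throughput on a 200ms path; this is the "fundamental behavior" the slide refers to.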
12
Possible Bottlenecks
• CPU
• Memory
• IO (PCI, PCI-X)
• NIC
• Disk
• driver buffer, interrupt coalescing/delay, MTU, etc.

1st Step: Tuning a Host with UDP
Iperf theoretical UDP throughput: 957 Mbps (IPv4)
13
2nd Step: Tuning a Host with TCP
• Maximum socket buffer size (TCP window size)
  – net.core.wmem_max, net.core.rmem_max (64MB)
  – net.ipv4.tcp_wmem, net.ipv4.tcp_rmem (64MB)
• Driver descriptor length
  – e1000: TxDescriptors=1024, RxDescriptors=256 (default)
• Interface queue length
  – txqueuelen=100 (default)
  – net.core.netdev_max_backlog=300 (default)
• Interface queueing discipline
  – fifo (default)
• MTU
  – mtu=1500 (IP MTU)
• Linux 2.4.26 (RedHat 9) with Web100
• Web100 (incl. HighSpeed TCP)
  – net.ipv4.web100_no_metric_save=1 (do not store TCP metrics in the route cache)
  – net.ipv4.WAD_IFQ=1 (do not send a congestion signal on buffer full)
  – net.ipv4.web100_rbufmode=0, net.ipv4.web100_sbufmode=0 (disable auto tuning)
  – net.ipv4.WAD_FloydAIMD=1 (HighSpeed TCP)
  – net.ipv4.web100_default_wscale=7 (default)
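The socket-buffer settings above can be applied on a standard Linux host with sysctl; a sketch under the slide's 64MB figure (the web100_* and WAD_* keys exist only on a Web100-patched kernel, so they are omitted here; the min/default values in tcp_wmem/tcp_rmem are common illustrative choices, not from the talk):

```shell
# Enlarge socket buffers to 64MB so the TCP window can cover the
# bandwidth-delay product (e.g. 1 Gbps x 200 ms RTT ~= 25 MB).
# Run as root on Linux.
sysctl -w net.core.wmem_max=67108864
sysctl -w net.core.rmem_max=67108864
sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"
sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"

# Interface queues (the slide lists the defaults of 100 and 300;
# raising them is a common tuning step for 1 Gbps paths).
ip link set dev eth0 txqueuelen 1000
sysctl -w net.core.netdev_max_backlog=3000
```

Without the larger window, a single TCP stream on a 200ms path is capped at window/RTT regardless of link speed, which is exactly the symptom measured in the experiments above.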
14
Network Diagram for TransPAC/I2 Measurement (Oct. 2003)
[Figure: measurement topology — Kashima (0.1G) — Tokyo XP — TransPAC (1G x2, 9,000km) — Los Angeles — Abilene (10G) — Chicago — Indianapolis (I2 Venue, 1G) / New York — MIT Haystack; Korea via Fukuoka / QGPOP / Genkai XP (1Gx2) — APII/JGN — Seoul XP — KOREN (2.5G SONET: Busan, Taegu, Daejon, Kwangju); Europe via New York — GEANT (2.4G, 7,000km) — Nordunet (Stockholm) — Funet (0.1G, Helsinki) — HUT; Koganei, Kitakyushu on domestic links; distances 100km-9,000km]
Abilene Observatory: servers at each NOC; CMM: common measurement machines; server (general) and server (e-VLBI) at measurement points

Sender: Mark5, Linux 2.4.7 (RH 7.1), P3 1.3GHz, 256MB memory, GbE SK-9843
Receiver: PE1650, Linux 2.4.22 (RH 9), Xeon 1.4GHz, 1GB memory, GbE Intel Pro/1000 XT
Iperf UDP ~900Mbps (no loss)
15
TransPAC/I2 #1: Reno (Win 64MB)
16
Analyzing Advanced TCP Dynamic Behavior in a Real Network
(Example: From Tokyo to Indianapolis at 1Gbps with HighSpeed TCP)
The data was obtained during e-VLBI demonstration at Internet2 Member Meeting in October 2003.
17
Replaying in a Laboratory — Evaluation of Advanced TCPs
[Figure: Sender (Linux 2.4) —GbE— ENP2611 Network Processor Emulator —GbE— Receiver (Linux TCP); RTT 200ms (100ms one-way); only 800 Mbps available; photo of the ENP2611 board]
18
Test Result #1: queue size 100 packets
[Graphs: TCP NewReno (Linux); HighSpeed TCP (Web100)]
19
Example of Advanced TCPs with different bottleneck queue sizes
[Graphs: BIC TCP and FAST TCP, each with queue sizes of 100 and 1000 packets]
BIC TCP: http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/
FAST TCP: a delay-based TCP built on TCP Vegas — http://netlab.caltech.edu/FAST/
20
Measuring Bottleneck Queue Sizes

Typical Bottleneck Cases
[Figure: a) Router/Switch with 1Gbps in, 100Mbps out; b) Router/Switch with 1Gbps in and out]
[Figure: sender emits a packet train toward the receiver over a link of capacity C; lost and measured packets reveal the queue]

Queue Size = C x (Delay_max - Delay_min)

Switch/Router Queue Size Measurement Result

Device    | Queuing Delay (µs) | Capacity (Mbps) | Estimated Queue Size (1500B)
Switch A  |   6161             |  100*           |    50
Switch B  |  22168             |  100*           |   180
Switch C  |  20847             |  100*           |   169
Switch D  |    738             | 1000            |    60
Switch E  |   3662             | 1000            |   298
Router F  | 148463             | 1000            | 12081
Router G  | 188627             | 1000            | 15350
* set to 100M for measurement
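The table's estimates follow directly from the formula; a sketch converting queuing delay and capacity into a queue size in 1500-byte packets (results land close to the table's values; small differences presumably come from the Delay_min term, which the table does not list):

```python
# Bottleneck queue size from the slide's formula:
#   Queue Size = C x (Delay_max - Delay_min)
# expressed here in 1500-byte packets.

def queue_size_packets(capacity_mbps, queuing_delay_us, pkt_bytes=1500):
    bits = capacity_mbps * 1e6 * queuing_delay_us * 1e-6
    return bits / 8 / pkt_bytes

print(round(queue_size_packets(100, 6161)))    # Switch A: ~51 packets
print(round(queue_size_packets(1000, 148463))) # Router F: ~12400 packets
```

The practical point of the table: small switches queue only 50-300 packets, so on a 200ms path a TCP burst overruns them long before the window reaches the bandwidth-delay product, while big routers buffer four orders of magnitude more.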
21
Experiment for High-Performance Scientific Data Transfer
[Figure: topology — Kashima — 1G (10G) — Tokyo XP / JGN II I-NOC — JGN II (10G, 9,000km) — Los Angeles — Abilene (10G) — Chicago — Washington DC — MIT Haystack; Indianapolis and Pittsburgh on Abilene; U of Tokyo via domestic links; Korea via Fukuoka / Kitakyushu / Genkai XP — APII/JGNII — Seoul XP — KOREN (2.5G SONET: Busan, Taegu, Daejon, Kwangju); Europe via GEANT (2.4G, 7,000km) — SWITCH; TransPAC in parallel; bwctl, perf, and e-vlbi servers at measurement points; distances 100km-9,000km]
* Performance Measurement Point Directory: http://e2epi.internet2.edu/pipes/pmp/pmp-dir.html
A BWCTL account is available for CMM, including for Korean researchers.
International collaboration to support science applications.
22
VLBI Antenna Locations in North-East Asia

Japan:
  Shintotsukawa 3.8m
  Tomakomai 11m, FTTH (100M), 70km from Sapporo
  Mizusawa 10m & 20m, 118km from Sendai
  Tsukuba 32m, OC48/ATMx2 SuperSINET
  Kashima 34m, 1Gx2 JGN, OC48/ATM Galaxy
  Koganei 34m, 1Gx2 JGN, OC48/ATM Galaxy
  Yamaguchi 32m, 1G, 75M SINET
  Gifu 11m & 3m, OC48/ATMx2 SuperSINET
  Usuda 64m, OC48/ATM Galaxy
  Nobeyama 45m, OC48/ATM Galaxy
  Ishigaki 20m; Ogasawara 20m; Chichijima 10m; Iriki 20m; Kagoshima 6m; Aira 10m
China:
  Nanshan (Urumqi) 25m, 70km from Urumqi
  Miyun (Beijing) 50m, 50km from Beijing, 2Mbps
  Sheshan (Shanghai) 25m, 30km from Shanghai, 2Mbps
  Yunnan (Kunming) 3m (40m), 10km from Kunming
  (Observatory is on CSTNET at 100M)
Korea:
  Jeju 20m, Tamna U; Seoul 20m, Yonsei U; Ulsan 20m, U Ulsan; Daejon 14m, Taeduk

Legend: connected / not yet connected / antenna under construction
23
Speed Races
• Internet2 Land Speed Record– Single Regular TCP stream– Speed x Distance– 6.57 Gbps by Caltech (7.21 Gbps by U Tokyo)
• SC Bandwidth Challenge– More than 100 Gbps– AIST, U Tokyo, JAXA, …
24
Summary and Future Work

• High-performance scientific data transfer faces network issues that we need to work out.
• Big science applications such as e-VLBI and High-Energy Physics need cooperation with network researchers.
• Deployment of a performance-measurement infrastructure is ongoing on a worldwide basis.