The Challenges Tomcat Faces in High Throughput Production Systems
May 17, 2017 © 2017 Alibaba Middleware Group
Biography
• Huxing Zhang (张乎兴)
• Software Engineer @ Alibaba Middleware Group since 2014
• Previously worked for DeNA, Tokyo, Japan
• Apache Tomcat Committer since 2016
• Core Tomcat maintainer @ Alibaba Inc.
Outline
• How we use Tomcat at Alibaba
• High throughput web applications
• The challenges
• Case studies
• Summary
• Q & A
Tomcat in Alibaba
• Small trial since 2009
• Migration from JBoss to Tomcat
  • JBoss uses Tomcat as its servlet container
  • Officially started in 2013
• Became the dominant servlet container in 2015
  • Thousands of web applications
  • Hundreds of thousands of instances
• Status quo
  • 95% Tomcat, 5% other
  • Mainly 7.0.x (supports JDK 6+)
  • A few 8.0.x
  • Emerging 8.5.x because of microservices (e.g. spring-boot)
Deployment Standard
• One web application per Tomcat instance
• Startup script features
  • Per-startup log rotation for catalina.out
  • Smart auto-fix for incorrect JVM args
  • CATALINA_BASE preparation
  • Tomcat startup status detection
[Figure: Deployment Platform diagram. Labels: Nginx conf, Tomcat conf, JVM args, health check, startup hooks, startup script, AppName.war]
Monitoring, diagnostics and development tools
• Tomcat Monitor (RESTful service)
  • Metrics for OS, JVM, Tomcat, middleware, and application
  • Classloading: locate which jar file a class is loaded from
  • Thread: per-thread CPU usage, thread state, etc.
  • Connector stats: Tomcat thread pool statistics
• Diagnostics on production environment
  • https://github.com/oldmanpushcart/greys-anatomy
  • Much easier to use than BTrace
• Development tools
  • Package-free Tomcat Eclipse plugin
Other usages worth mentioning
• Cluster session management: TaobaoSession
  • Cookie + distributed key-value storage (Tair)
• Database connection pool: Druid
  • https://github.com/alibaba/druid
  • Performance, monitoring, extensibility
• Multi-tenancy
  • High density deployment for legacy web applications
  • Provides CPU/memory isolation
  • Based on a multi-tenant JVM
High throughput web application
• Taobao API Gateway for mobile devices
  • Interacts with hundreds of backend services
  • Handles millions of requests per second
[Figure: mobile devices call the API Gateway, which talks to backend services: Shopping Cart, Order, Inventory, Shipping, …]
Overall Architecture
• Nginx + Tomcat on the same machine
  • Security
  • Rate limiting
[Figure: Connection & Load Balance. Labels: PC, Mobile, Nginx, Nginx + Tomcat, Database]
The challenges
• Connector selection
• Configuration tweaking
• NIO connection & concurrency issues
  • Case study 1
• Security issues
  • Case study 2
[Figure: challenge areas around Tomcat: configuration tuning, security, concurrency, GC, asynchronization, connection, memory leak]
Architecture: BIO + Sync RPC
• One worker thread per request
• Cannot scale up to thousands of requests per second
• Low CPU utilization
[Figure: user request → Nginx → BIO connector → servlet → remote services]
Architecture: APR + Async RPC
• Blocking on reading request headers
• Non-blocking on waiting for the next request
• Cons:
  • Unexpected JVM crash
  • Maintenance cost: requires a separate tcnative library
[Figure: user request → Nginx → APR connector → async servlet → remote services]
Architecture: NIO + Async RPC
• Non-blocking on reading request headers
• Non-blocking on waiting for the next request
• Fully asynchronous
[Figure: user request → Nginx → NIO connector → async servlet → remote services]
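The "fully asynchronous" idea can be sketched with plain JDK classes. This is only an illustration under assumptions, not Tomcat or RPC framework code: `callRemote`, the service names and the thread pool are hypothetical stand-ins for an async RPC client whose results arrive on callback threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

final class AsyncFanOutSketch {
    // Hypothetical async RPC client: returns immediately, the result
    // is produced later on a thread of the given pool.
    static CompletableFuture<String> callRemote(String service, ExecutorService rpcPool) {
        return CompletableFuture.supplyAsync(() -> service + ":ok", rpcPool);
    }

    // Fans out to all backends without blocking the calling (request) thread.
    static CompletableFuture<String> handle(List<String> services, ExecutorService rpcPool) {
        List<CompletableFuture<String>> calls = services.stream()
                .map(s -> callRemote(s, rpcPool))
                .collect(Collectors.toList());
        return CompletableFuture.allOf(calls.toArray(new CompletableFuture[0]))
                .thenApply(done -> calls.stream()
                        .map(CompletableFuture::join) // every call has completed here
                        .collect(Collectors.joining(",")));
    }

    public static void main(String[] args) {
        ExecutorService rpcPool = Executors.newFixedThreadPool(4);
        try {
            // join() is used only so the demo can print; an async servlet would
            // write the response from the completion callback instead.
            System.out.println(handle(Arrays.asList("cart", "order", "inventory"), rpcPool).join());
        } finally {
            rpcPool.shutdown();
        }
    }
}
```

With a blocking connector and synchronous RPC, the request thread would instead wait on each remote call in turn, which is why one worker thread per request does not scale.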
Tweaking Tomcat
• Connector configuration
  • Do not change it unless you really encounter an issue
  • Read the documentation carefully
• Access log disabled
  • Available on the Nginx side

Configuration          Default  Recommended  Note
connectionTimeout      20s      Decrease
maxThreads             200      Increase
acceptCount (backlog)  100      Increase     min(acceptCount, /proc/sys/net/core/somaxconn)
processorCache         200      Increase     max(maxThreads, concurrent connections)
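The note on acceptCount says the backlog that actually applies is the smaller of the connector setting and the kernel limit. A minimal sketch of that check on Linux follows; the class name and the acceptCount value of 1024 are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class BacklogCheck {
    // The backlog in effect is the smaller of the two values.
    static int effectiveBacklog(int acceptCount, int somaxconn) {
        return Math.min(acceptCount, somaxconn);
    }

    // Reads the Linux limit; returns -1 when the file is missing or
    // unreadable (for example on a non-Linux host).
    static int readSomaxconn() {
        Path p = Paths.get("/proc/sys/net/core/somaxconn");
        try {
            return Integer.parseInt(new String(Files.readAllBytes(p)).trim());
        } catch (IOException | NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        int acceptCount = 1024; // hypothetical connector setting
        int somaxconn = readSomaxconn();
        if (somaxconn < 0) {
            System.out.println("somaxconn not readable on this host");
        } else {
            System.out.println("effective backlog = " + effectiveBacklog(acceptCount, somaxconn));
        }
    }
}
```

If the printed value is lower than the configured acceptCount, raising acceptCount alone has no effect until the kernel limit is raised as well.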
Tomcat NIO revisit
1. Acceptor accepts a socket
2. Acceptor retrieves a nioChannel object from the cache
3. Poller thread registers the nioChannel to its selector and selects for IO events
4. Poller assigns the nioChannel to one of the worker threads to process the request
5. SocketProcessor finishes processing the request and returns the nioChannel to the cache
[Figure: user request → Acceptor → Poller0 / Poller1 → SocketProcessor threads, with nioChannel objects recycled through the cache]
Common NIO Issues
• Acceptor stops accepting new requests
  • Connection leakage
  • Connection count leakage
  • nioChannel cache pollution due to concurrency issues
• NPE during request processing
  • Caused by sendfile, which is on by default
Case study 1
• Tomcat 7.0.x
• NIO connector
• Throughput: 1000+ requests/s
• Tomcat Poller thread throws ConcurrentModificationException
• After a while Tomcat stops accepting new requests
Troubleshooting
• Each Poller thread has its own selector
• Should not have concurrency issues
• Who is modifying the SelectionKeys?

protected void timeout(int keyCount, boolean hasEvents) {
    ...
    // timeout
    Set<SelectionKey> keys = selector.keys();
    ...
    for (Iterator<SelectionKey> iter = keys.iterator(); iter.hasNext();) {
        SelectionKey key = iter.next();
        ...
    }
    ...
}
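Why iterating the key set can throw is easy to reproduce in isolation. The sketch below is an assumption-laden stand-in, not Tomcat code: a plain HashSet plays the role of the selector's key set, and the add() inside the loop plays the role of a second thread registering a channel while the timeout loop is iterating.

```java
import java.util.ConcurrentModificationException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

final class KeySetCmeSketch {
    // Returns true if iterating the set while it is structurally modified
    // raises ConcurrentModificationException.
    static boolean iterationFailsOnModification() {
        Set<String> keys = new HashSet<>();
        keys.add("key-1");
        keys.add("key-2");
        try {
            for (Iterator<String> iter = keys.iterator(); iter.hasNext();) {
                iter.next();
                // Stands in for another poller registering a channel on the
                // same selector while the timeout loop is still running.
                keys.add("key-registered-by-other-poller");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + iterationFailsOnModification());
    }
}
```

The fail-fast iterator only detects the modification; it says nothing about who made it, which is why the investigation has to look at every place that adds or removes keys.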
Digging into the JVM
• The only entry point: sun.nio.ch.EPollSelectorImpl

protected void implRegister(SelectionKeyImpl ski) {
    ...
    keys.add(ski);
+   System.out.println("> " + Thread.currentThread().getName() + ", " + this);
}

protected void implDereg(SelectionKeyImpl ski) throws IOException {
    ...
    keys.remove(ski);
+   System.out.println("> " + Thread.currentThread().getName() + ", " + this);
    ...
}

• Issue NOT reproducible!
Minimize the overhead
• Two poller threads are modifying the same selector instance!

protected void implRegister(SelectionKeyImpl ski) {
    ...
    keys.add(ski);
+   System.out.println(Thread.currentThread().getName() + "> " + this);
}

protected void implDereg(SelectionKeyImpl ski) throws IOException {
    ...
    keys.remove(ski);
+   System.out.println(Thread.currentThread().getName() + "> " + this);
    ...
}

http-nio-7001-ClientPoller-1> sun.nio.ch.EPollSelectorImpl@cc98f2c
http-nio-7001-ClientPoller-1> sun.nio.ch.EPollSelectorImpl@5787ae61
http-nio-7001-ClientPoller-0> sun.nio.ch.EPollSelectorImpl@cc98f2c
JVM bug?
• Updating to JDK 1.7 did not make the issue go away
• Setting the poller thread number to 1 prevents the issue from being reproduced
• NOT a JVM bug!
• Where is the selection key being modified?

java.lang.Exception: Stack trace
    at sun.nio.ch.EPollSelectorImpl.implRegister(EPollSelectorImpl.java:171)
    at sun.nio.ch.SelectorImpl.register(SelectorImpl.java:133)
    at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:180)
    at org.apache.tomcat.util.net.NioEndpoint$PollerEvent.run(NioEndpoint.java:897)
    at org.apache.tomcat.util.net.NioEndpoint$Poller.events(NioEndpoint.java:1039)
    at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:1155)
    at java.lang.Thread.run(Thread.java:662)

@Override
public void run() {
    if (interestOps == OP_REGISTER) {
        try {
            socket.getIOChannel().register(socket.getPoller().getSelector(), SelectionKey.OP_READ, key);
        } catch (Exception x) {
            log.error("", x);
        }
    } else {
        // …
    }//end if
}//run

• nioChannel registered to the wrong selector?
• Can't be changed during runtime
socket.getPoller() != this?
• It is confirmed that the socket's poller is modified after being pushed into the event queue

if (socket.getPoller() != this) {
    System.out.println("t: " + Thread.currentThread().getName() + ", s: " + socket.getPoller().getName());
}
socket.getIOChannel().register(socket.getPoller().getSelector(), SelectionKey.OP_READ, key);
// …

t: http-nio-7001-ClientPoller-1, s: http-nio-7001-ClientPoller-0
t: http-nio-7001-ClientPoller-0, s: http-nio-7001-ClientPoller-1
t: http-nio-7001-ClientPoller-1, s: http-nio-7001-ClientPoller-0
[Figure: Acceptor pushing PollerEvent0 and PollerEvent1 into the event queues of Poller0 and Poller1]
Suspicious Point
• nioChannel's poller has been modified by another poller thread
• Duplicate nioChannel object?
[Figure: Acceptor, PollerEvent0 queue, Poller0 and Poller1]
Further suspicion
• Duplicate nioChannel objects in the nioChannels cache

protected boolean setSocketOptions(SocketChannel socket) {
    try {
        // …
        NioChannel channel = nioChannels.poll();
        if (channel == null) {
            // …
            channel = new NioChannel(socket, bufhandler);
        }
        // …
        getPoller0().register(channel);
    } catch (Throwable t) {
        // …
        return false;
    }
    return true;
}

>>> poll but still found in queue, org.apache.tomcat.util.net.NioChannel@28baf3bf:java.nio.channels.SocketChannel[closed], ts: 1435720813750
>>> offer but already found in queue, org.apache.tomcat.util.net.NioChannel@28baf3bf:java.nio.channels.SocketChannel[closed], ts: 1435720812885
Why duplicate entries?
• Two threads are concurrently offering the same nioChannel to the cache
  • Poller#timeout due to async timeout
  • Async complete by the async RPC callback thread

public boolean processSocket(NioChannel socket, SocketStatus status, boolean dispatch, int where) {
    ...
    SocketProcessor sc = processorCache.poll();
    if (sc == null) sc = new SocketProcessor(socket, status, where);
    else sc.reset(socket, status, where);
    if (dispatch && getExecutor() != null) getExecutor().execute(sc);
    else sc.run();
    ...
}

thread: Thread[http-nio-7001-ClientPoller-1,5,main], where: Poller#processKey
thread: Thread[http-nio-7001-ClientPoller-1,5,main], where: Poller#timeout
thread: Thread[HSF-CallBack-11-thread-2,10,main], where: Http11NioProcessor#actionInternal
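The effect of a double offer can be shown with a queue-backed cache in a few lines. This is a simplified sketch under assumptions, not the Tomcat implementation: `Channel` is a hypothetical stand-in for NioChannel, and its `poller` field stands in for the poller reference stored on the channel.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class DuplicateOfferSketch {
    // Hypothetical stand-in for NioChannel.
    static final class Channel {
        String poller;
    }

    // Offers the same instance twice, then hands it to two "connections".
    // Returns the poller seen by the first connection afterwards.
    static String pollerSeenByFirstConnection() {
        Queue<Channel> cache = new ConcurrentLinkedQueue<>();
        Channel ch = new Channel();
        cache.offer(ch); // recycled by the async-timeout path
        cache.offer(ch); // recycled again by the async-complete path

        Channel first = cache.poll();  // reused for connection A
        first.poller = "Poller0";
        Channel second = cache.poll(); // reused for connection B: same object
        second.poller = "Poller1";     // silently overwrites A's poller

        return first.poller;
    }

    public static void main(String[] args) {
        System.out.println(pollerSeenByFirstConnection()); // prints "Poller1"
    }
}
```

The queue itself is thread-safe; the damage comes from the same object being present twice, so two consecutive polls hand one channel to two different pollers.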
Recap
• Async Servlet timeout: 4s~15s
• Async RPC timeout: 3s~15s
• Two SocketProcessor threads are trying to offer the nioChannel object to the cache concurrently
  • Async timeout
  • Async complete
[Figure: user request → Acceptor → Poller0 / Poller1 → two SocketProcessors]
Recap (cont.)
• The same nioChannel object has been offered twice
• And it is highly probable that the object has been offered twice consecutively!
[Figure: Acceptor, Poller0, Poller1 and two SocketProcessors, each returning the same nioChannel to the cache]
Recap (cont.)
• The same nioChannel has been assigned to two different poller event queues
  • Poller thread 0 is processing a timeout on that object
  • Poller thread 1 is registering keys to that nioChannel
[Figure: Acceptor, Poller0, Poller1 and SocketProcessors; the same nioChannel appears in both poller event queues]
Recap (cont.)
• Poller thread 0 throws ConcurrentModificationException and dies
[Figure: Acceptor, Poller0, Poller1 and SocketProcessors; Poller0 is marked with an X]
Recap (cont.)
• Acceptor does not know that Poller thread 0 has already died
• Acceptor keeps assigning connections to Poller thread 0 until it reaches maxConnections (10000)
[Figure: Acceptor, Poller0, Poller1 and SocketProcessors; events pile up in the queue of the dead Poller0]
Recap (cont.)
• Acceptor stops accepting new requests
[Figure: Acceptor, Poller0, Poller1 and SocketProcessors; both Poller0 and the Acceptor are marked with an X]
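The last two recap steps amount to a connection counter that is only ever incremented. The sketch below models that with a Semaphore; this is an assumption for illustration and does not show Tomcat's own limiting mechanism. It also simplifies by counting only connections handed to the dead poller.

```java
import java.util.concurrent.Semaphore;

final class ConnectionLimitSketch {
    // Counts how many connections are accepted when nothing ever releases
    // a permit, which is the situation once the owning poller is dead.
    static int acceptedBeforeStall(int maxConnections, int incoming) {
        Semaphore limit = new Semaphore(maxConnections);
        int accepted = 0;
        for (int i = 0; i < incoming; i++) {
            // A real acceptor would block here; tryAcquire keeps the demo finite.
            if (!limit.tryAcquire()) {
                break;
            }
            accepted++; // handed to the dead poller: never processed, never released
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(acceptedBeforeStall(10000, 12000)); // prints 10000
    }
}
```

Once the limit is reached, every further client waits in the backlog, so from the outside the server looks alive but never answers.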
Solution
• Tomcat has fixed a similar issue in 7.0.58
  • But not a complete fix
• It may also happen if handshake == -1
  • It means the socket has been processed by another thread
• Final fix
  • Allowing the nioChannel to be offered exactly once
  • Using CAS to avoid lock overhead
• Available from 7.0.64/8.0.25 onwards
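The "offered exactly once, using CAS" idea can be sketched as follows. This is not the actual Tomcat patch: `Channel`, `markRecycled`, `reset` and `offerOnce` are hypothetical names used only to illustrate a compare-and-set guard in front of the cache.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

final class OfferOnceSketch {
    // Hypothetical stand-in for NioChannel with a recycle flag.
    static final class Channel {
        private final AtomicBoolean recycled = new AtomicBoolean(false);

        // Only the first caller wins the CAS; no lock is taken.
        boolean markRecycled() {
            return recycled.compareAndSet(false, true);
        }

        // To be called when the channel is taken from the cache for reuse.
        void reset() {
            recycled.set(false);
        }
    }

    // Puts the channel into the cache only if nobody has done so already.
    static boolean offerOnce(Queue<Channel> cache, Channel ch) {
        return ch.markRecycled() && cache.offer(ch);
    }

    // Lets several threads race to recycle one channel; returns the cache size.
    static int cacheSizeAfterRacingOffers(int threads) {
        Queue<Channel> cache = new ConcurrentLinkedQueue<>();
        Channel ch = new Channel();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> offerOnce(cache, ch));
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(cacheSizeAfterRacingOffers(8)); // prints 1
    }
}
```

The flag travels with the object, so the timeout path and the completion path do not need to know about each other; whichever arrives second simply loses the CAS.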
Security Issues
• Disable manager and host-manager
• Disable JMX remote access
• A reverse proxy can defend Tomcat against some attacks
  • e.g. slow HTTP body attack
  • Nginx collects the entire HTTP body before handing it off to Tomcat
• Special case for high throughput web applications
  • Cookie-based memory exhaustion
Case Study 2
• Symptom: the MessageBytes object array has 13 million instances and takes up ~2 GB of memory which cannot be garbage collected
• Who is holding the reference?
Who is holding the reference?
• Normally one request has only a few MessageBytes
• Why do 2k requests have 13 million MessageBytes?
ServerCookies
• One request has 2k server cookies
• What's inside the cookies?
ServerCookies inside
• Dynamically doubles the cookie array if the current size is exceeded
• Never gets shrunk, even after recycling
• If each cookie string is small enough, a request can produce a large number of server cookie objects

private ServerCookie addCookie() {
    if (cookieCount >= scookies.length) {
        ServerCookie scookiesTmp[] = new ServerCookie[2*cookieCount];
        System.arraycopy(scookies, 0, scookiesTmp, 0, cookieCount);
        scookies = scookiesTmp;
    }
    ServerCookie c = scookies[cookieCount];
    if (c == null) {
        c = new ServerCookie();
        scookies[cookieCount] = c;
    }
    cookieCount++;
    return c;
}
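The grow-only behaviour can be reproduced with a small stand-alone class. This is a sketch that mirrors the doubling logic above under assumptions: the class and method names are hypothetical, the initial capacity of 4 is assumed, and plain Object instances stand in for ServerCookie.

```java
final class GrowOnlyCookieArraySketch {
    private Object[] slots = new Object[4]; // assumed initial capacity
    private int count;

    // Same growth rule as addCookie(): double when full, reuse old objects.
    void add() {
        if (count >= slots.length) {
            Object[] tmp = new Object[2 * count];
            System.arraycopy(slots, 0, tmp, 0, count);
            slots = tmp;
        }
        if (slots[count] == null) {
            slots[count] = new Object();
        }
        count++;
    }

    // Recycling only resets the counter; the array and its objects stay.
    void recycle() {
        count = 0;
    }

    int capacity() {
        return slots.length;
    }

    // Capacity still held after one request with many cookies was recycled.
    static int capacityAfterOneLargeRequest(int cookies) {
        GrowOnlyCookieArraySketch holder = new GrowOnlyCookieArraySketch();
        for (int i = 0; i < cookies; i++) {
            holder.add();
        }
        holder.recycle();
        return holder.capacity();
    }

    public static void main(String[] args) {
        System.out.println(capacityAfterOneLargeRequest(4096)); // prints 4096
    }
}
```

A single oversized request is therefore enough to pin the enlarged array for as long as the holder object is pooled and reused.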
The malicious attack
• All cookies count as a single header
  • maxHeaderCount cannot defend against this
• Only maxHttpHeaderSize effectively limits the number of cookies
• RFC 6265 does not require cookie name and cookie value to be non-empty

Cookie string      maxHttpHeaderSize  Bytes per cookie  Server cookie objects
"a=b;a=b;a=b;…"    8k                 4                 2k
"a=;a=;a=;…"       8k                 3                 ~2.7k
"=;=;=;=;…"        8k                 2                 4k

4k cookies * 2k requests = 8 million cookies
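The numbers follow from dividing the header size limit by the bytes each cookie occupies. A back-of-the-envelope sketch, assuming an 8192-byte limit, 2000 pooled request objects, and ignoring the bytes used by the header name and other headers (class and method names are hypothetical):

```java
final class CookieCountSketch {
    // Upper bound on cookies that fit into the header budget when every
    // cookie, separator included, takes bytesPerCookie bytes.
    static int maxCookies(int headerBytes, int bytesPerCookie) {
        return headerBytes / bytesPerCookie;
    }

    // Cookie objects retained when every pooled request has seen such a header.
    static long retainedCookies(int cookiesPerRequest, int pooledRequests) {
        return (long) cookiesPerRequest * pooledRequests;
    }

    public static void main(String[] args) {
        int headerBytes = 8192; // assumed 8k header size limit
        System.out.println(maxCookies(headerBytes, 4)); // "a=b;" prints 2048
        System.out.println(maxCookies(headerBytes, 2)); // "=;"   prints 4096
        // 4096 cookies on each of 2000 request objects
        System.out.println(retainedCookies(4096, 2000)); // prints 8192000
    }
}
```

Each server cookie object holds several MessageBytes, which is consistent with the instance count in the heap dump being higher than the cookie count alone.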
Solution
• Add a maxCookieCount parameter to limit the maximum number of cookies allowed
• Shrink the server cookie object array if the limit is exceeded
• Fixed in
  • 9.0.x for 9.0.0.M10 onwards
  • 8.5.x for 8.5.5 onwards
  • 8.0.x for 8.0.37 onwards
  • 7.0.x for 7.0.71 onwards
  • 6.0.x for 6.0.46 onwards
Summary
• Make your web application asynchronous to improve throughput
• NIO is relatively STABLE for high throughput production use
  • Check your version for known issues!
• Troubleshooting concurrency issues can be very tricky
  • A good tool is really helpful
• Upgrade your Tomcat version to get the latest updates!
Q & A
• Contact:
  • [email protected]
  • [email protected]
  • Facebook: https://www.facebook.com/huxing.zhang
  • LinkedIn: https://www.linkedin.com/in/huxing-zhang
• Thanks!