最近发现一台机器堆内存使用率大约100%,Web server相应极慢,导致客户端多次报告Read Timeout.
跟踪Resin的jvm-default log,发现已经持续告警:
[15-11-13 09:48:57.864] {CoordinatorThread[]-3} AlarmClock slow alarm Alarm[alarm[WebAppExpandDeployGenerator[null]]] 27901ms coordinator-delta 31071ms
[15-11-13 09:48:57.864] {CoordinatorThread[]-3} AlarmClock slow alarm Alarm[alarm[WebAppExpandDeployGenerator[null]]] 26071ms coordinator-delta 31071ms
[15-11-13 09:49:02.865] {CoordinatorThread[]-3} AlarmClock slow alarm Alarm[alarm[HostExpandDeployGenerator[null]]] 17393ms coordinator-delta 22394ms
1.通过jmap -heap查看一下堆内存使用情况,相当震惊:新生代和老年代使用率都接近100%.
[work@lg-b2c-kfs-webchat-app02 ~]$ jmap -heap 27000
Attaching to process ID 27000, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.12-b01
using thread-local object allocation.
Parallel GC with 8 thread(s)
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 4294967296 (4096.0MB)
NewSize = 1310720 (1.25MB)
MaxNewSize = 17592186044415 MB
OldSize = 5439488 (5.1875MB)
NewRatio = 2
SurvivorRatio = 8
PermSize = 21757952 (20.75MB)
MaxPermSize = 268435456 (256.0MB)
Heap Usage:
PS Young Generation
Eden Space:
capacity = 1040121856 (991.9375MB)
used = 978730744 (933.3903732299805MB)
free = 61391112 (58.54712677001953MB)
94.09770002948578% used
From Space:
capacity = 26935296 (25.6875MB)
used = 0 (0.0MB)
free = 26935296 (25.6875MB)
0.0% used
To Space:
capacity = 22675456 (21.625MB)
used = 0 (0.0MB)
free = 22675456 (21.625MB)
0.0% used
PS Old Generation
capacity = 2863333376 (2730.6875MB)
used = 2863333288 (2730.68741607666MB)
free = 88 (8.392333984375E-5MB)
99.99999692665895% used
PS Perm Generation
capacity = 78249984 (74.625MB)
used = 78185600 (74.5635986328125MB)
free = 64384 (0.0614013671875MB)
99.91772011097153% used
2.通过jmap -histo查看一下堆中的对象信息,发现业务对象Terminal和HttpChatContext实例数已经达到两百多万,正常情况下应该维持在几百到几千。更奇怪的是居然有大量的java.lang.ThreadLocal$ThreadLocalMap实例对象。
[work@lg-b2c-kfs-webchat-app02 log]$ jmap -histo 27000|less
num #instances #bytes class name
----------------------------------------------
1: 11926272 920079152 [C
2: 22979344 735339008 java.lang.ThreadLocal$ThreadLocalMap$Entry
3: 22978102 551474448 java.util.concurrent.locks.ReentrantReadWriteLock$Sync$HoldCounter
4: 11795560 377457920 java.lang.String
5: 259 167801200 [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;
6: 4237670 135605440 java.util.concurrent.ConcurrentHashMap$HashEntry
7: 1611294 128911920 [Ljava.util.HashMap$Entry;
8: 2092579 100443792 com.xiaomi.kfs.chat.core.domain.Terminal
9: 2350789 94031560 java.util.LinkedHashMap$Entry
10: 2092583 83703320 java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync
11: 28486 59237952 [I
12: 1584956 50718592 java.util.HashMap$Entry
13: 2092580 50221920 java.util.concurrent.locks.ReentrantReadWriteLock
14: 2077571 49861704 com.xiaomi.kfs.chat.service.support.HttpChatContext
15: 783207 43859592 java.util.LinkedHashMap
16: 827778 39733344 java.util.HashMap
17: 13066 34390136 [Ljava.util.concurrent.ConcurrentHashMap$HashEntry;
18: 2092583 33481328 java.util.concurrent.locks.ReentrantReadWriteLock$Sync$ThreadLocalHoldCounter
19: 2092583 33481328 java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock
20: 2092580 33481280 java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock
21: 1564801 25036816 java.util.HashMap$EntrySet
22: 163768 22473320 [B
3.可以确认出现了内存泄露,Terminal和HttpChatContext存在未被回收的现象。使用工具来观察一下这两种类对象是通过哪些路径和GC Roots关联上的。我们通过dump快照和JProfiler来进行进一步分析:
jmap -dump:format=b,live,file=/home/work/tmp/terminalMemoryLeak.hprof 27000
1
4.通过JProfiler,我们可以非常清晰地看到各类对象的数量和内存占用情况。
5.选择com.xiaomi.kfs.chat.core.domain.Terminal类,双击创建一个新的Object Set,选择Cumulated incoming references,可以看到Terminal对象的引用链。
6.经过分析确认,ChatDomainManager中维护了Terminal的哈希表,而某些Terminal未被正确地从ConcurrentHashMap中删除,因此出现数量过大的情况。而Terminal对象中使用了ReentrantReadWriteLock,ReentrantReadWriteLock是通过ThreadLocal来记录当前线程所持有的读锁数量的。因此才有了上面诸多有关ReentrantReadWriteLock的对象实例。