CephFS (Jewel) client alert:

mds0: Client l10-190 failing to respond to capability release

2016-08-26 08:01:49.905823 mds.0 192.168.14.120:6800/171607 3 : cluster [WRN] 7 slow requests, 5 included below; oldest blocked for > 33.942552 secs
2016-08-26 08:01:49.905831 mds.0 192.168.14.120:6800/171607 4 : cluster [WRN] slow request 32.990829 seconds old, received at 2016-08-26 08:01:16.914880: client_request(client.1840252:164888 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:16.919088) currently failed to rdlock, waiting
2016-08-26 08:01:49.905840 mds.0 192.168.14.120:6800/171607 5 : cluster [WRN] slow request 32.945069 seconds old, received at 2016-08-26 08:01:16.960641: client_request(client.1840252:164889 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:16.965089) currently failed to rdlock, waiting
2016-08-26 08:01:49.905848 mds.0 192.168.14.120:6800/171607 6 : cluster [WRN] slow request 32.819194 seconds old, received at 2016-08-26 08:01:17.086515: client_request(client.1840252:164893 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:17.091092) currently failed to rdlock, waiting
2016-08-26 08:01:49.905852 mds.0 192.168.14.120:6800/171607 7 : cluster [WRN] slow request 33.942552 seconds old, received at 2016-08-26 08:01:15.963158: client_request(client.1840252:164835 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:15.967068) currently failed to rdlock, waiting
2016-08-26 08:01:49.905857 mds.0 192.168.14.120:6800/171607 8 : cluster [WRN] slow request 33.930154 seconds old, received at 2016-08-26 08:01:15.975555: client_request(client.1840252:164836 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:15.980068) currently failed to rdlock, waiting
2016-08-26 08:01:54.905862 mds.0 192.168.14.120:6800/171607 9 : cluster [WRN] 7 slow requests, 2 included below; oldest blocked for > 38.942642 secs
2016-08-26 08:01:54.905868 mds.0 192.168.14.120:6800/171607 10 : cluster [WRN] slow request 38.920220 seconds old, received at 2016-08-26 08:01:15.985579: client_request(client.1840252:164837 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:15.990068) currently failed to rdlock, waiting
2016-08-26 08:01:54.905871 mds.0 192.168.14.120:6800/171607 11 : cluster [WRN] slow request 38.894827 seconds old, received at 2016-08-26 08:01:16.010972: client_request(client.1840252:164838 getattr pAsLsXsFs #100
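When many of these warnings pile up, it can help to pull out which client and inode each blocked request refers to. The following is a minimal sketch, assuming the line layout shown in the log excerpt above; the pattern and the helper name `parse_slow_request` are illustrative and not part of Ceph.

```python
import re

# Pattern for one "slow request" line of the MDS cluster log shown above.
SLOW_REQUEST = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old, "
    r"received at (?P<received>\S+ \S+): "
    r"client_request\((?P<client>client\.\d+):(?P<tid>\d+) "
    r"(?P<op>\w+) .*?#(?P<inode>[0-9a-f]+)"
    r".*?\) currently (?P<state>.+)$"
)


def parse_slow_request(line):
    """Return a dict of fields for a slow-request line, or None if no match."""
    match = SLOW_REQUEST.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["age"] = float(fields["age"])  # seconds the request has been blocked
    return fields


sample = (
    "2016-08-26 08:01:49.905831 mds.0 192.168.14.120:6800/171607 4 : "
    "cluster [WRN] slow request 32.990829 seconds old, received at "
    "2016-08-26 08:01:16.914880: client_request(client.1840252:164888 "
    "getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:16.919088) "
    "currently failed to rdlock, waiting"
)
info = parse_slow_request(sample)
print(info["client"], info["op"], info["inode"], info["state"])
```

In the excerpt above every blocked request is a `getattr` from the same client on the same inode, waiting for a read lock.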
cluster 75f7dde4-d350-4853-9asda6b4ed2
     health HEALTH_WARN
            mds0: Client l10-190 failing to respond to capability release
            mds0: Client l10-191 failing to respond to capability release
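To see at a glance which clients are named in the warning, the health text can be scanned for the capability-release message. This is a small sketch that assumes the exact wording shown above; `stuck_clients` is a hypothetical helper, not a Ceph API.

```python
import re

# Matches the warning text exactly as it appears in the health output above.
CAP_RELEASE = re.compile(
    r"(?P<mds>mds\d+): Client (?P<client>\S+) "
    r"failing to respond to capability release"
)


def stuck_clients(health_text):
    """Return sorted (mds, client) pairs named in capability-release warnings."""
    return sorted({(m.group("mds"), m.group("client"))
                   for m in CAP_RELEASE.finditer(health_text)})


health = """\
health HEALTH_WARN
    mds0: Client l10-190 failing to respond to capability release
    mds0: Client l10-191 failing to respond to capability release
"""
for mds, client in stuck_clients(health):
    print(mds, client)
```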
Root cause:

    CDN image prefetching generated a large volume of back-to-origin requests, which brought the ngx process down.
Solution:

    Temporary workaround: stop the ngx service, then remount CephFS with mount -t ceph *:path
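The workaround can be written down as an ordered command list. This is only a sketch: the mount source and mount point are placeholders because the post does not give the real values, the explicit `umount` step is an assumption, and the nginx path is the one used by the script later in this post. Nothing is executed unless `dry_run=False` is passed.

```python
import subprocess

NGINX = "/usr/local/nginx/sbin/nginx"  # path taken from the script in this post


def remount_commands(source, mount_point):
    """Build the workaround as a list of argv lists (nothing is executed here).

    source and mount_point are placeholders supplied by the caller; the post
    only shows them as "*:path".
    """
    return [
        [NGINX, "-s", "stop"],                         # stop ngx first
        ["umount", mount_point],                       # assumption: unmount before remounting
        ["mount", "-t", "ceph", source, mount_point],  # remount CephFS
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.check_call(argv)


# Placeholder arguments; replace them with the real monitor address and paths.
run(remount_commands("MON_HOST:/PATH", "/MOUNT_POINT"))
```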
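Thoughts:

    1. As soon as video requests go back to origin, the ngx process crashes. Could the video back-to-origin behavior be changed by setting a slice size for video files and reducing concurrency?

    2. Can ngx be tuned, for example by increasing its cache parameters?

    3. Can CephFS be tuned?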
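To make the first idea concrete: fetching a video in fixed-size slices means each back-to-origin request covers only one byte range instead of the whole file. The sketch below just computes those ranges; the function name and the example sizes are illustrative assumptions, not settings from the affected system.

```python
def slice_ranges(file_size, slice_size):
    """Return inclusive (start, end) byte ranges covering file_size bytes."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [(start, min(start + slice_size, file_size) - 1)
            for start in range(0, file_size, slice_size)]


# Example sizes only: a 10 MiB file fetched in 4 MiB slices.
MIB = 1024 * 1024
for start, end in slice_ranges(10 * MIB, 4 * MIB):
    print("Range: bytes=%d-%d" % (start, end))
```

A smaller slice size means more but lighter requests to the origin; whether that helps here would still need to be measured.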
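After a long period of observation, the problem still occurred. The final solution, in outline:

    1. Check the state of the client's port

    2. If the port is unreachable or the check times out

    3. Restart the NGX service

    4. The code is as follows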
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@Item    :  Cephfs
@Author  :  Villiam Sheng
@Group   :  System Group
@Date    :  2016-08-11
@Mail    :  swq.499809608@hotmail.com
@Function:
           Check NGX Server
"""

import os
import socket
import sys
import time
import traceback

LOGFILE = '/tmp/ngx.log'
NGINX = '/usr/local/nginx/sbin/nginx'


def LOG(info):
    # Append mode creates the log file if it does not exist yet.
    with open(LOGFILE, 'a') as fopen:
        fopen.write("%s INFO  %s \n" % (time.ctime(), info))


def restart_nginx():
    # Stop nginx, give it time to exit, then start it again.
    os.system("%s -s stop" % NGINX)
    time.sleep(6)
    os.system(NGINX)
    LOG("Ngx service anomalies, restart the NGX services")


class TelnetHttp(object):
    def work(self, dest_addr, port):
        # Return True if a TCP connection succeeds within 6 seconds.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(6)
        try:
            sock.connect((dest_addr, int(port)))
            LOG("ping ngx port  %s:%s is OK" % (dest_addr, port))
            return True
        except socket.error:
            return False
        finally:
            sock.close()


if __name__ == '__main__':
    try:
        # Daemonize: fork, let the parent exit, detach from the terminal.
        pid = os.fork()
        if pid > 0:
            sys.exit(0)
        os.setsid()
        os.chdir('/')
        # Point stdin, stdout and stderr at /dev/null.
        devnull = os.open(os.devnull, os.O_RDWR)
        for fd in (0, 1, 2):
            os.dup2(devnull, fd)

        while True:
            try:
                if TelnetHttp().work("127.0.0.1", 80):
                    time.sleep(3)
                else:
                    restart_nginx()
            except Exception:
                # Any unexpected error is treated like a failed check.
                restart_nginx()
    except (IOError, OSError):
        LOG(traceback.format_exc())
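Before leaving a watchdog like this running, it is worth confirming that the port check really distinguishes an open port from a closed one. The sketch below repeats the same connect-with-timeout idea in a standalone function (`port_is_open` is a hypothetical name) and tries it against a throwaway local listener that stands in for nginx.

```python
import socket


def port_is_open(host, port, timeout=6):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, int(port)))
        return True
    except socket.error:
        return False
    finally:
        sock.close()


# Throwaway listener on an ephemeral port stands in for nginx.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

open_result = port_is_open("127.0.0.1", port, timeout=2)    # listener is up
listener.close()
closed_result = port_is_open("127.0.0.1", port, timeout=2)  # listener is gone
print(open_result, closed_result)
```

Note that a successful TCP connect only shows that something is listening on the port; it does not prove that nginx is serving requests correctly.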