CephFS (Jewel) client warning:
mds0: Client l10-190 failing to respond to capability release
    2016-08-26 08:01:49.905823 mds.0 192.168.14.120:6800/171607 3 : cluster [WRN] 7 slow requests, 5 included below; oldest blocked for > 33.942552 secs
    2016-08-26 08:01:49.905831 mds.0 192.168.14.120:6800/171607 4 : cluster [WRN] slow request 32.990829 seconds old, received at 2016-08-26 08:01:16.914880: client_request(client.1840252:164888 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:16.919088) currently failed to rdlock, waiting
    2016-08-26 08:01:49.905840 mds.0 192.168.14.120:6800/171607 5 : cluster [WRN] slow request 32.945069 seconds old, received at 2016-08-26 08:01:16.960641: client_request(client.1840252:164889 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:16.965089) currently failed to rdlock, waiting
    2016-08-26 08:01:49.905848 mds.0 192.168.14.120:6800/171607 6 : cluster [WRN] slow request 32.819194 seconds old, received at 2016-08-26 08:01:17.086515: client_request(client.1840252:164893 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:17.091092) currently failed to rdlock, waiting
    2016-08-26 08:01:49.905852 mds.0 192.168.14.120:6800/171607 7 : cluster [WRN] slow request 33.942552 seconds old, received at 2016-08-26 08:01:15.963158: client_request(client.1840252:164835 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:15.967068) currently failed to rdlock, waiting
    2016-08-26 08:01:49.905857 mds.0 192.168.14.120:6800/171607 8 : cluster [WRN] slow request 33.930154 seconds old, received at 2016-08-26 08:01:15.975555: client_request(client.1840252:164836 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:15.980068) currently failed to rdlock, waiting
    2016-08-26 08:01:54.905862 mds.0 192.168.14.120:6800/171607 9 : cluster [WRN] 7 slow requests, 2 included below; oldest blocked for > 38.942642 secs
    2016-08-26 08:01:54.905868 mds.0 192.168.14.120:6800/171607 10 : cluster [WRN] slow request 38.920220 seconds old, received at 2016-08-26 08:01:15.985579: client_request(client.1840252:164837 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:15.990068) currently failed to rdlock, waiting
    2016-08-26 08:01:54.905871 mds.0 192.168.14.120:6800/171607 11 : cluster [WRN] slow request 38.894827 seconds old, received at 2016-08-26 08:01:16.010972: client_request(client.1840252:164838 getattr pAsLsXsFs #100
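When many of these warnings pile up, it can help to extract the client ID, inode and request age from each line to see which client and which file are blocked. The following is a minimal sketch; the regular expression is inferred from the sample log lines above, not from any Ceph specification, and the names `SLOW_REQUEST` and `parse_slow_request` are hypothetical.

```python
import re

# Pattern for the "slow request" lines in the MDS log sample; the field
# layout is inferred from that sample only (an assumption).
SLOW_REQUEST = re.compile(
    r"slow request (?P<age>\d+\.\d+) seconds old, received at "
    r"(?P<received>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"client_request\(client\.(?P<client>\d+):(?P<tid>\d+) "
    r"(?P<op>\w+) (?P<caps>\S+) #(?P<inode>[0-9a-f]+)"
)


def parse_slow_request(line):
    """Return a dict of fields from one slow-request line, or None."""
    match = SLOW_REQUEST.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["age"] = float(fields["age"])
    return fields


sample = (
    "2016-08-26 08:01:49.905831 mds.0 192.168.14.120:6800/171607 4 : "
    "cluster [WRN] slow request 32.990829 seconds old, received at "
    "2016-08-26 08:01:16.914880: client_request(client.1840252:164888 "
    "getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:16.919088) "
    "currently failed to rdlock, waiting"
)
info = parse_slow_request(sample)
print(info["client"], info["inode"], info["age"])
```

In the log above, every blocked request comes from client.1840252 and targets the same inode (#1000000760d), which is the kind of pattern this parsing makes easy to spot.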
    cluster 75f7dde4-d350-4853-9asda6b4ed2
     health HEALTH_WARN
            mds0: Client l10-190 failing to respond to capability release
            mds0: Client l10-191 failing to respond to capability release
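For alerting, the affected clients can be pulled out of the cluster status text. This is a sketch only: it assumes the status text has already been captured (for example from `ceph -s`) and that the message wording matches the Jewel-era output shown above; `CAP_RELEASE` and `clients_failing_cap_release` are hypothetical names.

```python
import re

# Matches the MDS health message as it appears in the status output
# (wording assumed to match the Jewel-era sample).
CAP_RELEASE = re.compile(
    r"(?P<mds>mds\d+): Client (?P<client>\S+) "
    r"failing to respond to capability release"
)


def clients_failing_cap_release(health_text):
    """Return (mds, client) pairs named in capability-release warnings."""
    return [(m.group("mds"), m.group("client"))
            for m in CAP_RELEASE.finditer(health_text)]


status = """\
cluster 75f7dde4-d350-4853-9asda6b4ed2
 health HEALTH_WARN
        mds0: Client l10-190 failing to respond to capability release
        mds0: Client l10-191 failing to respond to capability release
"""
print(clients_failing_cap_release(status))
```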
Cause:
CDN image prefetching triggered a large number of back-to-origin requests, which brought the ngx (nginx) process down.
Solution:
Temporary workaround: stop the ngx service, then remount with mount -t ceph *:path
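The workaround can be written down as an explicit command sequence. The sketch below only builds and prints the commands (a dry run). The CephFS source and mount point are placeholders because the post does not spell them out, and the `umount` step and the final nginx start are assumptions about what "remount" involves in practice. The nginx path is the one used in the monitoring script in this post.

```python
def remount_commands(source, mount_point,
                     nginx_bin="/usr/local/nginx/sbin/nginx"):
    """Build the command sequence for the temporary workaround.

    source and mount_point are placeholders for the real CephFS source
    and local directory, which are not given in the post.
    """
    return [
        [nginx_bin, "-s", "stop"],                      # stop nginx first
        ["umount", mount_point],                        # assumed: drop the stale mount
        ["mount", "-t", "ceph", source, mount_point],   # remount CephFS
        [nginx_bin],                                    # assumed: start nginx again
    ]


# Dry run: print the commands instead of executing them.
for argv in remount_commands("MON_ADDR:/PATH", "/MOUNT_POINT"):
    print(" ".join(argv))
```

To actually run the sequence, each `argv` list could be passed to `subprocess.check_call`, stopping at the first failure.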
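Thoughts:
1. Whenever video files are fetched back from the origin, the ngx process goes down. Could the video origin-fetch behavior be changed, for example by setting a slice size for video files and reducing the concurrency?
2. Can ngx be tuned, for example by increasing its cache parameters?
3. Can CephFS be tuned?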
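The slicing idea in the first point amounts to replacing one large origin request with several smaller HTTP Range requests. Below is a minimal sketch of the arithmetic only; the file size and slice size are arbitrary illustration values, and how the slices are actually requested (for instance through nginx's slice module) is outside its scope.

```python
def slice_ranges(file_size, slice_size):
    """Split a file of file_size bytes into inclusive (start, end) byte ranges."""
    if file_size < 0 or slice_size <= 0:
        raise ValueError("file_size must be >= 0 and slice_size > 0")
    return [(start, min(start + slice_size, file_size) - 1)
            for start in range(0, file_size, slice_size)]


# Example: a 10 MiB file fetched in 4 MiB slices needs three requests.
MIB = 1024 * 1024
for start, end in slice_ranges(10 * MIB, 4 * MIB):
    print("Range: bytes=%d-%d" % (start, end))
```

Smaller slices mean each origin request holds file capabilities and buffers for a shorter time, at the cost of more requests; the right size would have to be found by testing.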
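After monitoring for a long time, the problem kept recurring. The final workaround, in outline:
1. Check the port status on the client.
2. Determine whether the port is unreachable or the connection times out.
3. If so, restart the NGX service.
4. The code is as follows: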
    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-
    """
    @Item    : Cephfs
    @Author  : Villiam Sheng
    @Group   : System Group
    @Date    : 2016-08-11
    @Mail    : swq.499809608@hotmail.com
    @Function: Check NGX Server
    """
    import os
    import socket
    import sys
    import time
    import traceback

    LOGFILE = '/tmp/ngx.log'
    NGINX = '/usr/local/nginx/sbin/nginx'


    def LOG(info):
        # Append mode creates the file if it does not exist yet.
        with open(LOGFILE, 'a') as fopen:
            fopen.write("%s INFO %s\n" % (time.ctime(), info))


    class TelnetHttp(object):
        def work(self, dest_addr, port):
            """Return True if a TCP connection succeeds within 6 seconds."""
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.settimeout(6)
            try:
                sock.connect((dest_addr, int(port)))
                LOG("ping ngx port %s:%s is OK" % (dest_addr, port))
                return True
            except socket.error:
                return False
            finally:
                sock.close()


    def restart_nginx():
        # Stop nginx, give it time to exit, then start it again.
        os.system("%s -s stop" % NGINX)
        time.sleep(6)
        os.system(NGINX)
        LOG("Ngx service anomalies, restart the NGX services")


    if __name__ == '__main__':
        try:
            # Detach from the terminal and run in the background.
            pid = os.fork()
            if pid > 0:
                sys.exit(0)
            os.setsid()
            os.chdir('/')
            # Point stdin, stdout and stderr at /dev/null.
            devnull = os.open(os.devnull, os.O_RDWR)
            for fd in (0, 1, 2):
                os.dup2(devnull, fd)

            checker = TelnetHttp()
            while True:
                try:
                    if checker.work("127.0.0.1", 80):
                        time.sleep(3)
                    else:
                        restart_nginx()
                except Exception:
                    # Any unexpected error is treated like a failed check.
                    LOG(traceback.format_exc())
                    restart_nginx()
        except (IOError, OSError):
            LOG(traceback.format_exc())
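The script restarts nginx on the first failed check. One possible variation, shown here as a sketch rather than as part of the original method, is to restart only after several consecutive failures so that a single transient timeout does not cause a restart. `port_is_open`, `FailureCounter` and the threshold of 3 are hypothetical choices for illustration; the demonstration uses a throwaway local listener instead of nginx.

```python
import socket


def port_is_open(host, port, timeout=6.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True


class FailureCounter(object):
    """Signal a restart only after `threshold` consecutive failed checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def record(self, ok):
        """Record one check result; return True when a restart is due."""
        if ok:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.failures = 0
            return True
        return False


# Demonstration against a throwaway local listener instead of nginx.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
print(port_is_open("127.0.0.1", port))                # while the listener is up
server.close()
print(port_is_open("127.0.0.1", port, timeout=1.0))   # after it is closed

counter = FailureCounter(threshold=3)
print([counter.record(ok) for ok in (False, False, True, False, False, False)])
```

In the monitoring loop, the result of `port_is_open("127.0.0.1", 80)` would be passed to `counter.record`, and nginx restarted only when it returns True. The trade-off is a slower reaction: with a 3-second check interval and a threshold of 3, an outage takes longer to trigger the restart.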