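Recently I noticed that the port of a Socket service in our company's test environment kept going down for no apparent reason, even though the service itself was still reported as running. It looked as if the service had hung.
It is only a test environment, but it could not be left like that, so I wrote a simple monitoring script overnight. The server runs Windows, so the script uses the wmi module. The logic is as follows: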
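1. Use the wmi module to collect the services that are currently in the Stopped state and build a dictionary from them.
2. Check whether the monitored service is in that dictionary. If it is, the service has stopped, so try to start it and send an alarm mail.
3. Send a connect request to the local Socket service port. If an exception is caught, try to restart the service and send an alarm mail.
4. On every run the script repeats the steps above up to three times, 10 seconds apart, to make sure the service is really healthy.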
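Steps 3 and 4 can be sketched in isolation as follows. This is only a minimal illustration, not the script itself: the names port_is_open and check_with_retries are made up for this example, and the recover callback stands in for whatever restart and alarm logic is used.

```python
import socket
import time


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.error:  # refused, unreachable or timed out
        return False
    conn.close()
    return True


def check_with_retries(check, recover, retries=3, interval=10):
    """Run check() up to `retries` times, calling recover() after each failure.

    Returns True as soon as a check passes, False if every attempt fails.
    `check` and `recover` are hypothetical callbacks supplied by the caller.
    """
    for attempt in range(retries):
        if check():
            return True
        recover()
        if attempt < retries - 1:
            time.sleep(interval)  # wait before the next attempt
    return False
```

A call such as check_with_retries(lambda: port_is_open('127.0.0.1', 20000), restart) would then cover one monitoring run, where restart is whatever recovery function you supply.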
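While running it I ran into one problem: Python is remarkably slow when it operates on Windows through the wmi module. I do not know whether there is an alternative; if anyone knows a better approach, please point it out.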
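One possible alternative, offered as an untested assumption rather than a measured improvement, is to ask the Windows sc tool about a single service instead of enumerating every service through WMI. The sketch below assumes the English-language output of `sc query`, where the state appears on a line such as "STATE : 4 RUNNING"; the function names are made up for this example. Note that sc expects the service's key name, which may differ from the display caption used in the script. It may also be worth trying a filtered WMI query instead of looping over all services.

```python
import re
import subprocess


def parse_sc_state(output):
    """Extract the state word (e.g. 'RUNNING', 'STOPPED') from `sc query` output.

    Assumes the English output format; returns None if no STATE line is found.
    """
    match = re.search(r'^\s*STATE\s*:\s*\d+\s+(\w+)', output, re.MULTILINE)
    return match.group(1) if match else None


def query_service_state(service_name):
    """Run `sc query <service_name>` and return its state, or None on failure."""
    proc = subprocess.Popen(['sc', 'query', service_name],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    output, _ = proc.communicate()
    if proc.returncode != 0:  # e.g. the service does not exist
        return None
    return parse_sc_state(output)
```

With this, a check like query_service_state(name) == 'STOPPED' could stand in for the dictionary lookup, at the cost of depending on the text format of sc.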
The source code is as follows:
#!/usr/bin/env python
import os
import wmi
import time
import socket
import base64
import smtplib
import logging
from email.mime.text import MIMEText
def get_stop_service(designation):
    """Collect the captions of all stopped services and check
    whether the 'designation' service is among them.
    :return: 'down' if the service is stopped, otherwise True
    """
    c = wmi.WMI()
    ret = dict()
    for service in c.Win32_Service():
        state, caption = service.State, service.Caption
        if state == 'Stopped':
            t = ret.get(state, [])
            t.append(caption)
            ret[state] = t
    # If the 'designation' service is among the stopped ones, report 'down'.
    # Default to an empty list so the check works when nothing is stopped.
    if designation in ret.get('Stopped', []):
        logging.error('Service [%s] is down, try to restart the service. \r\n' % designation)
        return 'down'
    return True
def monitor(sname):
    """Send a socket connection request to port 20000 on the local machine.
    If an exception is caught, return the string 'ex'.
    :return: 'ex' on failure, otherwise True
    """
    s = socket.socket()
    s.settimeout(3)  # timeout in seconds
    host = ('127.0.0.1', 20000)
    try:  # Try to connect to the host
        s.connect(host)
    except socket.error as e:
        logging.warning('[%s] service connection failed: %s \r\n' % (sname, e))
        return 'ex'
    finally:
        s.close()  # always release the socket
    return True
def restart_service(rstname, conn, run):
    """First check whether the service is stopped;
    if it is, start it directly.
    Then check whether it is hung (the connection failed);
    if so, stop the service and start it again.
    :return: flag or True
    """
    flag = False
    try:
        # 'run' is the return value of get_stop_service(),
        # 'conn' is the return value of monitor()
        if run == 'down':
            ret = os.system('sc start "%s"' % rstname)
            if ret != 0:
                raise Exception('[Errno %s]' % ret)
            flag = True
        elif conn == 'ex':
            retStop = os.system('sc stop "%s"' % rstname)
            retStart = os.system('sc start "%s"' % rstname)
            if retStart != 0:
                raise Exception('retStop [Status code %s] '
                                'retStart [Status code %s] ' % (retStop, retStart))
            flag = True
        else:
            logging.info('[%s] service running status to normal' % rstname)
            return True
    except Exception as e:
        logging.warning('[%s] service restart failed: %s \r\n' % (rstname, e))
    return flag
def send_mail(to_list, sub, contents):
    """Send alarm mail.
    :return: flag
    """
    mail_server = 'mail.stmp.com'  # SMTP server
    mail_user = 'YouAccount'  # Mail account
    mail_pass = base64.b64decode('Password')  # Base64-encoded (not encrypted) password
    mail_postfix = 'smtp.com'  # Domain name
    me = 'Monitor alarm<%s@%s>' % (mail_user, mail_postfix)
    message = MIMEText(contents, _subtype='html', _charset='utf-8')
    message['Subject'] = sub
    message['From'] = me
    message['To'] = ', '.join(to_list)  # header addresses are comma-separated
    flag = False  # Whether the mail was sent successfully
    try:
        s = smtplib.SMTP()
        s.connect(mail_server)
        s.login(mail_user, mail_pass)
        s.sendmail(me, to_list, message.as_string())
        s.close()
        flag = True
    except Exception as e:
        logging.warning('Send mail failed, exception: [%s]. \r\n' % e)
    return flag
def main(sname):
    """Take the name of the service to monitor as a parameter,
    run the functions defined above in turn and check their return values.
    Each run tests the service up to three times,
    10 seconds apart.
    :return: retValue
    """
    retry = 3
    count = 0
    retValue = False  # Holds the state of the socket check
    while count < retry:
        ret = monitor(sname)
        if ret != 'ex':  # If the socket connection is normal, return retValue
            retValue = ret
            return retValue
        isDown = get_stop_service(sname)
        restart_service(rstname=sname, conn=ret, run=isDown)
        host = socket.gethostname()
        address = socket.gethostbyname(host)
        mailto_list = ['mail@smtp.com', ]  # Alarm contacts
        send_mail(mailto_list,
                  'Alarm',
                  ' <h4>Level: <u>ERROR</u><br> Host name: %s<br>'
                  ' IP Address: %s<br>'
                  ' Service name:</h4> <h5>%s</h5>'
                  % (host, address, sname))
        count += 1
        time.sleep(10)
    else:
        # Reached only when all retries failed without an early return
        logging.error('[%s] service try to restart more than three times \r\n' % sname)
    return retValue
if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s',
                        datefmt='%Y/%m/%d %H:%M:%S',
                        filename='D:\\logs\\Monitor.log',
                        filemode='ab')
    name = 'Service Name'
    response = main(name)
    if response:
        logging.info('The [%s] service connection is normal \r\n' % name)
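The code above still has room for improvement: the names of several services could be written to a file, and the program could read that file and check each service in turn.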
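A minimal sketch of that idea might look like this. The helper name load_service_names and the file format (one service name per line, with '#' for comments) are assumptions made for illustration, not part of the original script.

```python
import io


def load_service_names(path):
    """Read service names from a text file, one per line.

    Blank lines and lines starting with '#' are skipped.
    The file format is an assumption made for this sketch.
    """
    names = []
    with io.open(path, encoding='utf-8') as handle:
        for line in handle:
            name = line.strip()
            if name and not name.startswith('#'):
                names.append(name)
    return names
```

Each name returned could then be passed to main() in a loop. Since monitor() hard-codes port 20000, checking several different services this way would also require storing a port for each of them.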
Permanent link to this article: http://www.linuxidc.com/Linux/2016-09/135620.htm