Comparing File Similarity in Python with Fuzzy Hashing: A Detailed Guide


Python offers several ways to measure how similar two files are: difflib.SequenceMatcher, ssdeep, python_mmdt, and tlsh. When many comparisons are needed and the files are large, efficiency matters, and a fuzzy hash such as ssdeep or python_mmdt is worth considering.

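Observations from testing:

With difflib, both files are read in full, after which a match ratio can be output.

With ssdeep, mmdt, and tlsh, the fuzzy hash of a file can be computed ahead of time. At verification time the file being checked is read only once, hashed, and compared against the stored hash, which reduces comparison time as well as memory and CPU consumption.

With tlsh, a smaller value means higher similarity. Its results when comparing small files were poor.

On small files, the three methods timed together (mmdt, ssdeep, and difflib) performed about the same. On large files (81 MB in this test case), difflib was unacceptably slow.

In a real environment, mmdt is the recommended choice: ssdeep scores varied widely when comparing binary files, to the point of losing their value as a reference. Which other file types are affected remains to be investigated.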
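
To make the difflib observation concrete, here is a minimal sketch using only the standard library. The byte strings are made-up sample data, and the helper names `similarity` and `cheap_upper_bound` are hypothetical. `ratio()` returns 2*M/T, where M is the number of matching elements and T is the combined length of both inputs. The Python documentation describes `SequenceMatcher` as quadratic time in the worst case, which is consistent with it becoming impractical on an 81 MB file.

```python
from difflib import SequenceMatcher


def similarity(a: bytes, b: bytes) -> float:
    """Return SequenceMatcher's ratio in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a, b).ratio()


def cheap_upper_bound(a: bytes, b: bytes) -> float:
    """Return quick_ratio(), a faster upper bound on ratio()."""
    return SequenceMatcher(None, a, b).quick_ratio()


if __name__ == "__main__":
    # The longest common run is b"bcd" (3 bytes), so ratio = 2*3/8.
    print(similarity(b"abcd", b"bcde"))
    print(cheap_upper_bound(b"abcd", b"bcde"))
```

Because `quick_ratio()` never underestimates `ratio()`, it can serve as a cheap pre-filter: if the upper bound is already below a chosen threshold, the full comparison can be skipped.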

Test environment:

OS: Ubuntu 20.04

Python: 3.8.10

py-tlsh==4.7.2

python-mmdt==0.3.1

ssdeep==3.4

# -*- coding: utf-8 -*-
"""Compare two files with mmdt, ssdeep, and difflib, timing each method."""

import time
from difflib import SequenceMatcher

import ssdeep
from python_mmdt.mmdt.mmdt import MMDT


def difflib_test(file1, file2):
    start_time = time.time()
    # SequenceMatcher needs both files fully loaded into memory.
    with open(file1, 'rb') as f:
        s1 = f.read()
    with open(file2, 'rb') as f:
        s2 = f.read()
    match_obj = SequenceMatcher(None, s1, s2)
    print("difflib match:", match_obj.ratio())
    end_time = time.time()
    print('difflib_test cost:', end_time - start_time)


def mmdt_test(file1, file2):
    start_time = time.time()
    mmdt = MMDT()
    # Compute one hash per file, then compare only the hashes.
    r1 = mmdt.mmdt_hash(file1)
    print(r1)
    r2 = mmdt.mmdt_hash_streaming(file2)
    print(r2)
    # Alternative: compare the two files directly by path.
    # sim1 = mmdt.mmdt_compare(file1, file2)
    # print("mmdt match:", sim1)
    sim2 = mmdt.mmdt_compare_hash(r1, r2)
    print("mmdt match:", sim2)
    end_time = time.time()
    print('mmdt_test cost:', end_time - start_time)


def ssdeep_test(file1, file2):
    start_time = time.time()
    sig1 = ssdeep.hash_from_file(file1)
    sig2 = ssdeep.hash_from_file(file2)
    print(sig1)
    print(sig2)
    # ssdeep.compare returns a score from 0 (no match) to 100.
    print("ssdeep match:", ssdeep.compare(sig1, sig2))
    end_time = time.time()
    print('ssdeep_test cost:', end_time - start_time)


if __name__ == '__main__':
    start_time = time.time()
    # Small-file case
    file1 = '/root/test/fstab'
    file2 = '/root/test/fstab2'
    # Large-file case
    # file1 = '/root/test/initrd.img-5.4.0-125-generic'
    # file2 = '/root/test/initrd.img-5.4.0-135-generic'
    mmdt_test(file1, file2)
    ssdeep_test(file1, file2)
    difflib_test(file1, file2)
    end_time = time.time()
    print('Total execution time:', end_time - start_time)
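
The script above hashes both files on every run. The benefit noted earlier, computing fuzzy hashes ahead of time so that only the new file is read at verification time, could be sketched as follows. This is an illustration under stated assumptions, not part of the original test: the function names `build_index`, `save_index`, `load_index`, and `compare_against_index` are hypothetical, signatures are assumed to be plain strings that can be stored as JSON, and the SHA-256 stand-ins exist only so the sketch runs without third-party packages (SHA-256 is an exact hash, not a fuzzy one).

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def build_index(paths: Iterable[str], hash_func: Callable[[str], str]) -> Dict[str, str]:
    """Hash every reference file once and return {path: signature}."""
    return {str(p): hash_func(str(p)) for p in paths}


def save_index(index: Dict[str, str], index_path: str) -> None:
    """Persist the signatures so later runs can skip re-hashing."""
    Path(index_path).write_text(json.dumps(index, indent=2), encoding="utf-8")


def load_index(index_path: str) -> Dict[str, str]:
    """Load signatures previously written by save_index."""
    return json.loads(Path(index_path).read_text(encoding="utf-8"))


def compare_against_index(candidate: str, index: Dict[str, str],
                          hash_func: Callable[[str], str],
                          compare_func: Callable[[str, str], float]) -> Dict[str, float]:
    """Hash the candidate file once, then compare signatures only."""
    candidate_sig = hash_func(candidate)
    return {ref: compare_func(ref_sig, candidate_sig)
            for ref, ref_sig in index.items()}


# Stand-ins (assumptions for this sketch only). SHA-256 can report
# "identical" or "different", never a degree of similarity.
def sha256_file(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def exact_match(sig1: str, sig2: str) -> int:
    return 100 if sig1 == sig2 else 0
```

With a real fuzzy hash, one would presumably pass `ssdeep.hash_from_file` and `ssdeep.compare`, the calls used in the script above, in place of the two stand-ins.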

[Figure: mmdt, ssdeep, and difflib results for the small-file and large-file comparisons]

Testing tlsh

import time

import tlsh


def tlsh_test(file1, file2):
    start_time = time.time()
    with open(file1, 'rb') as f:
        s1 = tlsh.hash(f.read())
    with open(file2, 'rb') as f:
        s2 = tlsh.hash(f.read())
    # tlsh.diff returns a distance: the smaller the value, the more similar.
    match_obj = tlsh.diff(s1, s2)
    print("tlsh match:", match_obj)
    end_time = time.time()
    print('tlsh_test cost:', end_time - start_time)


if __name__ == '__main__':
    start_time = time.time()
    # Small-file case
    # file1 = '/root/test/fstab'
    # file2 = '/root/test/fstab2'
    # Large-file case
    file1 = '/root/test/initrd.img-5.4.0-125-generic'
    file2 = '/root/test/initrd.img-5.4.0-135-generic'
    tlsh_test(file1, file2)
    end_time = time.time()
    print('Total execution time:', end_time - start_time)
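
Every test function in this article repeats the same start/stop timing lines. As an optional refactoring sketch (not part of the original test), a small decorator can take the label from the function name, so the printed label always matches the function being timed. The names `timed` and `demo_compare` are hypothetical, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short durations.

```python
import functools
import time


def timed(func):
    """Print how long each call to func takes, labelled with its name."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so the cost is always reported.
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} cost: {elapsed:.6f}s")
    return wrapper


@timed
def demo_compare(a: bytes, b: bytes) -> bool:
    """Placeholder workload standing in for a real comparison function."""
    return a == b


if __name__ == "__main__":
    demo_compare(b"abc", b"abc")
```

Applying `@timed` to `tlsh_test`, `mmdt_test`, and the others would replace the paired `start_time` and `end_time` lines in each function body.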

[Figure: tlsh results for the small-file and large-file comparisons]

That concludes this article on comparing file similarity in Python with fuzzy hashing.


Posted in: Linux tutorials

2024-07-24


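Reposting notice: unless otherwise stated, articles on this site are published under the CC-4.0 license. Please credit the source when reposting.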

