Finding Duplicate Files Using Python3 and xxHash
Having trouble finding duplicate files on your machine? This article shows a simple, fast, and efficient solution using Python3 and xxHash.
Why use xxHash?
One way to identify files is through hashing algorithms, which produce short strings that act like a signature for a file's contents.
There are many popular hashing algorithms, such as the SHA family; however, xxHash promises much faster speeds because it is designed for raw throughput rather than cryptographic strength.
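Exact numbers depend on your machine, but a minimal timing sketch (assuming the xxhash package is installed, e.g. pip install xxhash) gives a feel for the gap:

import time
import hashlib
import xxhash

data = b'x' * (64 * 1024 * 1024)  # 64 MiB of dummy input

for name, make_hash in (('sha256', hashlib.sha256), ('xxh64', xxhash.xxh64)):
    start = time.perf_counter()
    make_hash(data).hexdigest()
    print(name, round(time.perf_counter() - start, 3), 'seconds')

On most machines xxh64 should come out well ahead, which adds up when hashing thousands of files.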
Listing all files recursively
The first thing we need is to list every file in a directory, including the files in its subdirectories.
The simplest way is to use os.walk.
The generator below yields the absolute path of each unique file found across a list of directories.
import os


def list_all_files_recursively(*directories):
    """Yield the absolute path of every unique file under the given directories."""
    file_path_set = set()
    for directory in directories:
        for directory_path, directory_names, file_names in os.walk(directory):
            for file_name in file_names:
                file_path = os.path.join(directory_path, file_name)
                if os.path.isfile(file_path):
                    absolute_path = os.path.abspath(file_path)
                    # skip duplicates caused by overlapping input directories
                    if absolute_path not in file_path_set:
                        file_path_set.add(absolute_path)
                        yield absolute_path
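As a quick sanity check (a hypothetical usage, not part of the final script), the generator can be consumed like any other iterable:

# hypothetical usage: print every unique file under the current directory
for path in list_all_files_recursively('.'):
    print(path)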
Calculating file hash
The efficient way to calculate the hash of a file, particularly a large one, is to read the file in fixed-size chunks instead of loading the whole thing into memory.
from xxhash import xxh64


def calculate_hash(file_path):
    """Return the xxHash64 hex digest of the file at file_path."""
    hasher = xxh64()
    with open(file_path, 'rb') as f:
        while True:
            buf = f.read(4096)  # read 4 KiB at a time
            if not buf:  # end of file
                break
            hasher.update(buf)
    return hasher.hexdigest()
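Any readable path works; as a hypothetical quick test, the script can hash itself:

# hypothetical usage: print the xxHash64 digest of this very script
print(calculate_hash(__file__))

The 4096-byte chunk size keeps memory use constant regardless of file size; larger buffers (e.g. 64 KiB) may be slightly faster on some systems.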
One of the fastest ways to find duplicates in a list of objects (in our case, the hash strings) is to count occurrences with collections.defaultdict.
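Here is a toy illustration of the counting pattern, with made-up hash strings:

from collections import defaultdict

counts = defaultdict(int)  # missing keys start at 0
for h in ['a1', 'b2', 'a1', 'c3', 'a1']:
    counts[h] += 1

print([h for h, n in counts.items() if n > 1])  # prints ['a1']

The full script below applies the same pattern to real files: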
import sys
from collections import defaultdict

if __name__ == '__main__':
    # hash counter: number of files seen per hash
    hash_dict = defaultdict(int)
    # store one (hash, absolute path) tuple per file
    file_hash_list = []

    # populate both structures
    for file_path in list_all_files_recursively(*sys.argv[1:]):
        try:
            file_hash = calculate_hash(file_path)
            hash_dict[file_hash] += 1
            # append (hash, path) tuple to the list
            file_hash_list.append((file_hash, file_path))
        except PermissionError:
            # file not accessible
            pass

    # find duplicate hashes
    for k in (k for k, v in hash_dict.items() if v > 1):
        print(k)
        # print the files associated with hash `k`
        for p in (p for h, p in file_hash_list if h == k):
            print('', p, sep='\t')
Python script in action
The complete source code is available here. To execute the script, use the following command:
python3 check_duplicate.py <directory_1> <directory_2> ... <directory_N>
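For example, to scan two folders at once (the paths here are placeholders, pick your own):

python3 check_duplicate.py ~/Pictures ~/Downloads

Each duplicate hash is printed on its own line, followed by the tab-indented paths of the files that share it.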