Ronie Martinez

May 12, 2017

Finding Duplicate Files Using Python3 and xxHash

Having problems finding duplicate files on your machine? This article shows a simple, fast, and efficient solution using Python3 and xxHash.

Why use xxHash?

One way to identify a file is through a hashing algorithm, which produces a short string that acts like a signature for the file's contents. There are many popular hashing algorithms such as MD5 and SHA, but xxHash, a non-cryptographic hash, promises much higher speed.
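As a rough illustration (the buffer size and iteration count below are arbitrary, and the exact timings will vary by machine), the snippet hashes the same data with hashlib's MD5 and SHA-1 and with xxHash's xxh64:

import hashlib
from timeit import timeit

import xxhash

data = b'x' * (1 << 20)  # 1 MiB of sample data

# the digests act as signatures of the data
print('md5   ', hashlib.md5(data).hexdigest())
print('sha1  ', hashlib.sha1(data).hexdigest())
print('xxh64 ', xxhash.xxh64(data).hexdigest())

# rough timing comparison; xxh64 is typically the fastest of the three
print('md5   ', timeit(lambda: hashlib.md5(data).digest(), number=100))
print('sha1  ', timeit(lambda: hashlib.sha1(data).digest(), number=100))
print('xxh64 ', timeit(lambda: xxhash.xxh64(data).digest(), number=100))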

Listing all files recursively

The first thing we need is to list the files in a directory, including the files in its subdirectories. The simplest way is to use os.walk. The generator below yields the absolute path of every unique file found under a list of directories.

import os

def list_all_files_recursively(*directories):
    file_path_set = set()
    for directory in directories:
        for directory_path, directory_names, file_names in os.walk(directory):
            for file_name in file_names:
                file_path = os.path.join(directory_path, file_name)
                if os.path.isfile(file_path):
                    absolute_path = os.path.abspath(file_path)
                    if absolute_path not in file_path_set:
                        file_path_set.add(absolute_path)
                        yield absolute_path
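As a quick sanity check (the directory arguments below are just placeholder paths), the generator can be consumed like any other iterable:

# print every unique file under the current directory and /tmp (example paths)
for path in list_all_files_recursively('.', '/tmp'):
    print(path)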

Calculating file hash

An efficient way to calculate the hash of a file, particularly a large one, is to read the file in chunks so that it never has to be loaded into memory all at once.

from xxhash import xxh64

def calculate_hash(file_path):
    hasher = xxh64()
    with open(file_path, 'rb') as f:
        while True:
            buf = f.read(4096)
            if not buf:
                break
            hasher.update(buf)
    return hasher.hexdigest()
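Calling it on any readable file returns a 16-character hexadecimal digest; for example (the file name below is only an assumption):

# hash the script itself; two files with identical contents produce the same digest
print(calculate_hash('check_duplicate.py'))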

Finding duplicates

One of the simplest and fastest ways to find duplicates in a list of objects (in our case, the hash strings) is to count occurrences with collections.defaultdict.

import sys
from collections import defaultdict

if __name__ == '__main__':
    # count how many files produced each hash
    hash_dict = defaultdict(int)

    # (hash, absolute path) tuple for every file examined
    file_hash_list = []

    # populate
    for file_path in list_all_files_recursively(*sys.argv[1:]):
        try:
            file_hash = calculate_hash(file_path)
            hash_dict[file_hash] += 1
            # append (hash, path) tuple to list
            file_hash_list.append((file_hash, file_path))
        except PermissionError:  # file not accessible
            pass

    # find duplicate hashes
    for k in (k for k, v in hash_dict.items() if v > 1):
        print(k)
        # find files associated with hash `k`
        for p in (p for h, p in file_hash_list if h==k):
            print('', p, sep='\t')
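As a side note, an equivalent variation (not part of the original script) is to map each hash directly to the list of paths that produced it, which removes the second pass over file_hash_list:

import sys
from collections import defaultdict

if __name__ == '__main__':
    # map each hash to every path that produced it
    paths_by_hash = defaultdict(list)
    for file_path in list_all_files_recursively(*sys.argv[1:]):
        try:
            paths_by_hash[calculate_hash(file_path)].append(file_path)
        except PermissionError:  # file not accessible
            pass

    # any hash with more than one path is a set of duplicates
    for file_hash, paths in paths_by_hash.items():
        if len(paths) > 1:
            print(file_hash)
            for path in paths:
                print('', path, sep='\t')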

Python script in action

The complete source code is available here. To execute the script, use the following command:

python3 check_duplicate.py <directory_1> <directory_2> ... <directory_N>
