Duplicate Detection
Duplicate Detection
This project includes a lightweight duplicate-detection helper available at src/filoma/dedup.py.
- Exact duplicates:
compute_sha256(path)andfind_duplicates(paths)detect byte-for-byte duplicates. - Text near-duplicates: uses k-shingles and Jaccard similarity. Configure
text_kandtext_threshold. - Image near-duplicates: basic perceptual hashing (aHash/dHash) using Pillow. Configure
image_hashandimage_max_distance.
Examples:
from filoma import dedup
files = ["data/a.jpg", "data/b.jpg", "data/c.txt"]
res = dedup.find_duplicates(files, text_threshold=0.8, image_max_distance=6)
print(res["exact"]) # exact matches
print(res["text"]) # near-duplicate text groups
print(res["image"]) # near-duplicate image groups
Optional dependencies:
Pillow— recommended for image hashing.datasketch— optional for MinHash acceleration on large text datasets.