Data Quality
This page documents the duplicate-detection and near-duplicate matching utilities available in filoma. Use these helpers to evaluate dataset quality, find exact duplicates, and detect near-duplicates in text and images.
Duplicate Detection
filoma provides tools for identifying duplicate files by content hash or by metadata.
import filoma
# Find exact duplicates in a directory
df = filoma.probe_to_df(".")
duplicates = df.find_duplicates(by="sha256")
print(duplicates)
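To make the idea of content-hash matching concrete, here is a minimal standard-library sketch. It is not filoma's implementation; the helper names `sha256_of` and `group_exact_duplicates` are hypothetical and exist only for illustration. Files whose bytes produce the same SHA-256 digest are grouped together.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(root: str) -> list[list[Path]]:
    """Return groups of files under root whose contents hash identically."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    # Only digests shared by two or more files indicate duplicates.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `group_exact_duplicates(".")` returns one list of paths per set of byte-identical files. Hashing every file is the simplest approach; comparing file sizes first would avoid hashing files that cannot match.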
Near-Duplicate Detection
For images and text, filoma can help identify similar content that isn't an exact byte-for-byte match.
- Image Hashing: Detect visually similar images.
- Text Normalization: Compare text files after removing whitespace or other noise.
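The two approaches above can be sketched with the standard library alone. This is an illustration under stated assumptions, not filoma's API: the function names are hypothetical, text normalization here means lowercasing and collapsing whitespace, and the image is assumed to be already reduced to a small grid of grayscale values (a real pipeline would resize and convert it with an image library first).

```python
import hashlib
import re


def normalized_text_key(text: str) -> str:
    """Hash text after lowercasing and collapsing runs of whitespace."""
    collapsed = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def average_hash(pixels: list[list[int]]) -> int:
    """Pack one bit per pixel: 1 where the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count differing bits; a small distance suggests similar images."""
    return bin(a ^ b).count("1")
```

Two text files are treated as near-duplicates when their normalized keys match. Two images are treated as near-duplicates when the Hamming distance between their hashes falls under a threshold chosen for the dataset.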
Refer to the Cookbook for more examples.
Dataset Integrity
filoma provides a higher-level DatasetVerifier class to check for dataset-wide integrity issues.
from filoma.core.verifier import DatasetVerifier
# Perform all checks
verifier = DatasetVerifier("/path/to/dataset")
verifier.run_all()
verifier.print_summary()
This automates checking for:
- Corrupt and zero-byte files
- Dimension consistency
- Near-duplicates
- Label balance
- Split leakage
- Anomalous pixel statistics
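Two of the checks listed above, zero-byte files and split leakage, can be approximated with the standard library. The sketch below is not how DatasetVerifier works internally; the function names and the assumption that each split lives in its own directory are hypothetical. Leakage is detected only for byte-identical files.

```python
import hashlib
from pathlib import Path


def find_zero_byte_files(root: str) -> list[Path]:
    """List files under root that contain no data."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and p.stat().st_size == 0
    )


def content_hashes(split_dir: str) -> dict[str, Path]:
    """Map each file's SHA-256 digest to one path within a split directory."""
    hashes: dict[str, Path] = {}
    for path in sorted(Path(split_dir).rglob("*")):
        if path.is_file():
            hashes[hashlib.sha256(path.read_bytes()).hexdigest()] = path
    return hashes


def find_split_leakage(train_dir: str, test_dir: str) -> list[tuple[Path, Path]]:
    """Pair up files whose exact contents appear in both splits."""
    train, test = content_hashes(train_dir), content_hashes(test_dir)
    # Digests present in both mappings mark content shared across splits.
    return [(train[h], test[h]) for h in sorted(train.keys() & test.keys())]
```

An empty result from both functions means no zero-byte files and no exact-copy leakage were found; near-duplicate leakage would need a similarity measure rather than an exact hash.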