
Data Quality

This page documents the duplicate-detection and near-duplicate matching utilities available in filoma. Use these helpers to evaluate dataset quality, find exact duplicates, and detect near-duplicates in text and images.

Duplicate Detection

filoma can identify duplicate files by content hash or by metadata. The example below groups files by their SHA-256 content hash.

import filoma

# Find exact duplicates in a directory
df = filoma.probe_to_df(".")
duplicates = df.find_duplicates(by="sha256")
print(duplicates)
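
To illustrate what matching on a content hash means, here is a minimal standard-library sketch. It is not filoma's implementation; the helper names `sha256_of` and `group_duplicates` are hypothetical and exist only for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(root: str) -> dict[str, list[Path]]:
    """Return {sha256 hex digest: [paths]} for every digest shared by 2+ files."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    # Keep only digests that occur more than once (hypothetical helper, not filoma API).
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Files with identical bytes always share a digest, so this finds exact duplicates regardless of file name or location. It cannot find files that differ by even a single byte.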

Near-Duplicate Detection

For images and text, filoma can help identify similar content that isn't an exact byte-for-byte match.

  • Image Hashing: Detect visually similar images.
  • Text Normalization: Compare text files after removing whitespace or other noise.
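
The two ideas above can be sketched with the standard library alone. This is an illustration under stated assumptions, not filoma's API: the function names are hypothetical, text normalization is reduced to collapsing whitespace, and the image hash is a simple average hash over a grayscale pixel grid that is assumed to be already decoded and downscaled.

```python
import hashlib
import re


def normalized_text_digest(text: str) -> str:
    """Hash text after collapsing whitespace runs, so layout changes do not matter."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def average_hash(pixels: list[list[int]]) -> int:
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean.

    `pixels` is assumed to be a small grayscale grid (for example 8x8).
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; a small distance suggests visually similar images."""
    return bin(a ^ b).count("1")
```

Two texts match when their normalized digests are equal. Two images are treated as near-duplicates when the Hamming distance between their hashes falls below a threshold chosen for the dataset.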

Refer to the Cookbook for more examples.

Dataset Integrity

filoma provides a higher-level DatasetVerifier class to check for dataset-wide integrity issues.

from filoma.core.verifier import DatasetVerifier

# Perform all checks
verifier = DatasetVerifier("/path/to/dataset")
verifier.run_all()
verifier.print_summary()

This automates checking for:

  • Corrupt and zero-byte files
  • Dimension consistency
  • Near-duplicates
  • Label balance
  • Split leakage
  • Anomalous pixel statistics
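
As a rough sketch of what two of these checks involve, the following standard-library example computes label proportions and looks for content hashes that appear in more than one split. It does not reflect how DatasetVerifier works internally; the function names and input shapes are assumptions made for illustration.

```python
from collections import Counter


def label_balance(labels: list[str]) -> dict[str, float]:
    """Return the fraction of samples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def find_split_leakage(splits: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Find content hashes present in more than one split.

    `splits` maps a split name to {file name: content hash} (hypothetical shape).
    Returns {content hash: [split names]} for every leaked hash.
    """
    seen: dict[str, list[str]] = {}
    for split_name, files in splits.items():
        # Use a set so duplicates inside one split are not counted as leakage.
        for digest in set(files.values()):
            seen.setdefault(digest, []).append(split_name)
    return {d: names for d, names in seen.items() if len(names) > 1}
```

A strongly skewed result from `label_balance` signals class imbalance, and any entry returned by `find_split_leakage` means the same content is being used for both training and evaluation.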