Now in Active Development

Organize, Clean & Analyze Image Datasets

A comprehensive Python toolkit for dataset curation, duplicate detection, quality control, and exploratory data analysis with state-of-the-art ML models.


Powerful Features

Everything you need to organize and analyze your image datasets

Image Clustering

Automatically group similar images using KMeans, HDBSCAN, or GMM algorithms with state-of-the-art feature extractors.

Modern ML Models

Leverage DINOv2, CLIP, ViT, ResNet, EfficientNet, and ConvNeXt for powerful feature extraction.

Duplicate Detection

Find and remove duplicate or near-duplicate images from your datasets efficiently.
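
To make the idea of near-duplicate detection concrete, here is a minimal sketch that compares feature vectors by cosine similarity and reports pairs above a threshold. It is not ImageAtlas's implementation or API: the function names, the 0.95 threshold, and the toy vectors are all assumptions chosen for illustration, and the all-pairs loop only suits small collections.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def find_near_duplicates(features, threshold=0.95):
    """Return pairs of names whose cosine similarity is >= threshold.

    `features` maps an image name to its feature vector. Every pair is
    compared, so the cost grows quadratically with the number of images.
    """
    pairs = []
    items = sorted(features.items())  # sort by name for stable output
    for (name_a, vec_a), (name_b, vec_b) in combinations(items, 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.append((name_a, name_b))
    return pairs


# Toy vectors standing in for real model embeddings (hypothetical data).
toy_features = {
    "cat_1.jpg": [1.0, 0.0, 0.0],
    "cat_1_copy.jpg": [0.99, 0.01, 0.0],
    "dog_1.jpg": [0.0, 1.0, 0.0],
}
duplicate_pairs = find_near_duplicates(toy_features, threshold=0.95)
```

In practice the threshold would need tuning per model, since different feature extractors produce differently distributed similarities.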

Visual Grids

Generate beautiful visual grids of clustered images for easy inspection and analysis.

Dimensionality Reduction

Visualize high-dimensional features with PCA, UMAP, or t-SNE for exploratory analysis.

Auto Organization

Automatically organize images into cluster folders and export results to JSON.

Installation

Get started in seconds with pip

Basic Installation

Terminal
pip install imageatlas

Full Installation (with all models)

Terminal
pip install "imageatlas[full]"

Note on CLIP

If you wish to use the CLIP model, install it manually:

pip install git+https://github.com/openai/CLIP.git
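
Because CLIP is a separate install, a script may want to check for it before requesting the `clip` model. The sketch below uses only the standard library and assumes the OpenAI package is importable as a top-level module named `clip`. The helper names and the fallback to `dinov2` are hypothetical choices, not part of ImageAtlas.

```python
import importlib.util


def clip_is_installed():
    """Return True if a top-level module named `clip` can be found."""
    return importlib.util.find_spec("clip") is not None


def choose_model(preferred="clip", fallback="dinov2"):
    """Use `preferred` unless it is CLIP and the `clip` module is missing."""
    if preferred == "clip" and not clip_is_installed():
        return fallback
    return preferred


# Resolves to "clip" when the module is available, otherwise "dinov2".
model_name = choose_model("clip")
```

Note that this only confirms a module called `clip` exists on the import path. It does not verify that it is the OpenAI package.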

Quick Start

Cluster your images in just a few lines of code

example.py
from imageatlas import ImageClusterer

# Initialize clusterer with state-of-the-art features
clusterer = ImageClusterer(
    model='dinov2',           # or 'clip', 'vit', 'resnet', 'efficientnet', 'convnext'
    clustering_method='kmeans',
    n_clusters=10,
    device='cuda'             # use 'cpu' if no CUDA GPU is available
)

# Run clustering on your images
results = clusterer.fit("./path/to/images")

# Save results to JSON
results.to_json("./output/clustering_results.json")

# Create visual grids for each cluster
results.create_grids(
    image_dir="./path/to/images",
    output_dir="./output/grids"
)

# Organize images into cluster folders
results.create_cluster_folders(
    image_dir="./path/to/images",
    output_dir="./output/clusters"
)

That's it! Your images are now clustered, visualized, and organized. 🎉
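
For readers curious what organizing images into cluster folders and exporting to JSON amounts to, here is a standard-library sketch. The folder naming (`cluster_<label>`), the `clusters.json` file name, and the JSON layout are assumptions made for illustration. The layout ImageAtlas actually writes may differ.

```python
import json
import shutil
import tempfile
from pathlib import Path


def organize_by_cluster(assignments, image_dir, output_dir):
    """Copy each image into output_dir/cluster_<label>/ and write a JSON summary.

    `assignments` maps a file name (relative to image_dir) to an integer label.
    """
    image_dir, output_dir = Path(image_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    summary = {}
    for file_name, label in sorted(assignments.items()):
        cluster_dir = output_dir / f"cluster_{label}"
        cluster_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(image_dir / file_name, cluster_dir / file_name)
        summary.setdefault(str(label), []).append(file_name)
    (output_dir / "clusters.json").write_text(json.dumps(summary, indent=2))
    return summary


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "images"
    source.mkdir()
    for name in ("a.jpg", "b.jpg", "c.jpg"):
        (source / name).write_bytes(b"")
    target = Path(tmp) / "out"
    demo_summary = organize_by_cluster({"a.jpg": 0, "b.jpg": 1, "c.jpg": 0}, source, target)
    demo_folders = sorted(p.name for p in target.iterdir() if p.is_dir())
    demo_json = json.loads((target / "clusters.json").read_text())
```

Copying rather than moving keeps the original dataset intact, which is the safer default when experimenting with different cluster counts.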

Documentation

Explore available models and algorithms

Feature Extraction

State-of-the-art models

  • dinov2 - DINOv2 (ViT)
  • clip - OpenAI CLIP
  • vit - Vision Transformer
  • resnet - ResNet (18-152)
  • efficientnet - EfficientNet
  • convnext - ConvNeXt

Clustering

Grouping algorithms

  • kmeans - K-Means
  • hdbscan - HDBSCAN
  • gmm - Gaussian Mixture

Parameters: n_clusters, min_cluster_size, min_samples, n_components
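
As a rough illustration of what the `kmeans` option does with feature vectors, the following is a plain-Python sketch of Lloyd's algorithm. It is not the code ImageAtlas runs. The function names are hypothetical, and the number of initial centroids passed in plays the role of `n_clusters`.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iterations=100):
    """Assign points to the nearest centroid, recompute centroids as
    cluster means, and repeat until the assignments stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Two well-separated toy groups standing in for image embeddings.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
```

Unlike k-means, density-based methods such as HDBSCAN do not take a cluster count up front, which is why the parameter list above includes `min_cluster_size` and `min_samples` as well.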

Dimensionality Reduction

Visualization methods

  • pca - PCA
  • umap - UMAP
  • tsne - t-SNE

Parameters: n_components, n_neighbors, min_dist, perplexity
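
To show what reducing features to `n_components` dimensions looks like for the `pca` option, here is a small sketch using NumPy's SVD. It assumes NumPy is installed and is not ImageAtlas's own code. The function name and the random toy data are illustrative only.

```python
import numpy as np


def pca_project(features, n_components=2):
    """Project rows of `features` onto their top principal components."""
    X = np.asarray(features, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:n_components].T
    explained_variance = (singular_values ** 2) / (X.shape[0] - 1)
    return projected, explained_variance[:n_components]


rng = np.random.default_rng(0)
toy_features = rng.normal(size=(50, 8))  # stand-in for 50 image embeddings
coords, variance = pca_project(toy_features, n_components=2)
```

The resulting two-column `coords` array is what one would pass to a scatter plot. UMAP and t-SNE produce output of the same shape but preserve local neighborhoods rather than global variance.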

Citation

If you use ImageAtlas in your research, please cite it

BibTeX
@software{imageatlas,
  author = {Ahmad Javed},
  title = {ImageAtlas: A Toolkit for Organizing, Cleaning and Analyzing Image Datasets},
  year = {2024},
  url = {https://github.com/ahmadjaved97/ImageAtlas}
}