Claude project prompt
I need an AI system to identify and correct old pictures. The AI system must be able to:
- Repair cracks (see crack.png)
- Revive colors (hazycolors.jpg)
- Identify and tag faces so that a database of persons across all pictures is made
- Identify and tag objects in pictures
- Identify and tag types of scenes
- Identify the rotation (0,90,180,270 degrees) (rotated.jpg)
- Crop picture (tractionholes.jpg)
Use your MCP connection to Hugging Face to find models that are suited for these tasks. Make an assessment where tasks can be combined, for example face recognition can be used to detect the orientation of the picture; the same goes for scene detection, as we can assume that the sky is always at the top of a picture.
Provide a list of models specifying for each model where it can be used, what the license model is and if it is compatible with ONNX.
Claude research
Hugging Face Models for Old Photo Restoration: Complete System Design
Production-ready models exist for all seven required tasks, with multiple ONNX-compatible options and permissive licenses available. The landscape includes powerful multi-task restoration models alongside specialized single-purpose solutions, enabling flexible system architectures from lightweight pipelines to comprehensive restoration suites. Key finding: face detection models serve double duty for both person databases and rotation correction, while general restoration models like Real-ESRGAN and LaMa handle multiple degradation types simultaneously.
The multi-task advantage: comprehensive vs specialized approaches
Old photo restoration presents a fundamental architectural choice between comprehensive models that address multiple degradations simultaneously and specialized single-task models combined in pipelines. Bringing Old Photos Back to Life (hoangmnsd/Bringing-Old-Photos-Back-to-Life) handles scratches, cracks, noise, blur, and color fading in a single forward pass using its VAE-based triplet domain translation network. This Apache 2.0 licensed model separates structured defects (scratches) from unstructured ones (noise) through parallel processing branches that fuse in latent space, making it ideal for general restoration workflows.
However, face-centric photos benefit more from specialized face restoration models. GFPGAN (TencentARC/GFPGANv1) and CodeFormer (bluefoxcreation/Codeformer-ONNX) achieve superior results on portraits by leveraging generative facial priors. GFPGAN has multiple ONNX implementations including Neus/GFPGANv1.4 with AMD/CUDA/DirectML support, while CodeFormer offers controllable fidelity trade-offs through its transformer-based architecture. Both models restore facial details and enhance colors in single forward passes without expensive optimization.
The practical implication: process non-face regions with general restoration models, then apply specialized face models to detected faces, maximizing quality across the entire image. This hybrid approach leverages the strengths of both categories while avoiding the computational overhead of running heavyweight models on regions where they provide minimal benefit.
Task 1: Crack repair and image inpainting
Carve/LaMa-ONNX emerges as the definitive choice for production deployment, already exported to ONNX format with Apache 2.0 licensing and optimized for 512×512 resolution. LaMa (Resolution-robust Large Mask Inpainting with Fourier Convolutions) excels at both small scratch removal and large missing region reconstruction, requiring only mask inputs without text prompts. The model handles photo completion, crack repair, and traction hole removal with approximately 200MB footprint and fast inference suitable for desktop applications.
Alternative PyTorch implementations exist through smartywu/big-lama, also Apache 2.0 licensed but requiring manual ONNX conversion using the provided Jupyter notebook. For applications requiring text-guided creative restoration of large missing areas, diffusers/stable-diffusion-xl-1.0-inpainting-0.1 (openrail++ license, 12.1M downloads) generates plausible content at 1024×1024 resolution, though at significantly slower speeds with its 6.9GB model size and 20-30 inference steps.
Modern alternatives include FLUX.1-dev-Controlnet-Inpainting-Beta (alimama-creative) for professional-grade quality with high computational requirements, and Qualcomm's LaMa-Dilated variant optimized for mobile deployment with TFLite format. All inpainting models require mask inputs – either manual user annotation or automatic damage detection through classical computer vision or dedicated crack detection models. For automated workflows, couple LaMa with a damage segmentation model to generate masks programmatically.
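The automatic mask generation mentioned above can be sketched with classical computer vision alone. The snippet below is a minimal baseline, assuming cracks appear as dark pixels on a lighter photo; the threshold and padding values are illustrative tuning assumptions, and a real pipeline would use OpenCV morphology or a learned damage detector instead.

```python
import numpy as np

def crack_mask(gray: np.ndarray, threshold: int = 60, pad: int = 1) -> np.ndarray:
    """Mark pixels darker than `threshold` as damage, then grow the mask by
    `pad` pixels so the inpainter also covers crack edges.
    `gray` is an (H, W) uint8 grayscale image; returns a LaMa-style 0/255 mask."""
    mask = gray < threshold
    grown = mask.copy()
    for _ in range(pad):
        # Naive 4-neighborhood dilation: OR the mask with shifted copies.
        shifted = np.zeros_like(grown)
        shifted[1:, :] |= grown[:-1, :]
        shifted[:-1, :] |= grown[1:, :]
        shifted[:, 1:] |= grown[:, :-1]
        shifted[:, :-1] |= grown[:, 1:]
        grown |= shifted
    return grown.astype(np.uint8) * 255
```

The resulting array can be passed directly as the mask input of Carve/LaMa-ONNX alongside the image.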
Task 2: Color restoration and enhancement for faded photos
Faded scanned slides demand color enhancement rather than full recolorization to preserve original intent. Google's MAXIM dehazing models (google/maxim-s2-dehazing-sots-outdoor and indoor variants) provide Apache 2.0 licensed solutions that remove haze while preserving and enhancing existing colors. These Multi-Axis MLP architectures deliver fast, efficient dehazing specifically designed for clarity restoration, with separate indoor and outdoor specialized variants for optimal results.
For resolution enhancement with simultaneous quality improvement, ai-forever/Real-ESRGAN stands as the most battle-tested solution with 200 likes and usage across 100+ Spaces. This GAN-based enhanced super-resolution model handles degraded photos excellently while preserving and slightly enhancing existing colors. The qualcomm/Real-ESRGAN-x4plus variant offers mobile-optimized TFLite format with 2.6K downloads and 77 likes, though licensing requires review. For immediate ONNX deployment, xiongjie/lightweight-real-ESRGAN-anime provides a smaller footprint despite its anime-focused training.
Complementary enhancement models include keras-io/low-light-image-enhancement (Apache 2.0, 86 likes) for underexposed scans, and opencv/deblurring_nafnet (ONNX format) for motion blur correction. The suncongcong/AST_Dehazing model applies modern attention-based spatial transformers for advanced haze removal. Avoid full colorization models like camenduru/cv_ddcolor_image-colorization and neurallove/controlnet-sd21-colorization-diffusers unless processing true black-and-white photos, as they replace rather than enhance existing colors.
The recommended pipeline: MAXIM dehazing → Real-ESRGAN upscaling → face-specific models for portraits. This preserves photographic authenticity while maximizing quality improvements, with total processing time under 5 seconds per image on modern GPUs.
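The recommended ordering can be expressed as a simple stage chain. The stage functions below are hypothetical placeholders — each would wrap the corresponding model (MAXIM, Real-ESRGAN, a face restorer) in a real system — so only the orchestration pattern is meaningful here.

```python
import numpy as np

# Hypothetical stages; each would wrap a model forward pass in practice.
def dehaze(img):          # e.g. MAXIM dehazing
    return img

def upscale(img):         # e.g. Real-ESRGAN; stand-in 2x nearest-neighbor
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def restore_faces(img):   # e.g. GFPGAN applied to detected face regions
    return img

def enhance(img: np.ndarray) -> np.ndarray:
    """Apply the dehaze -> upscale -> face-restore order recommended above."""
    for stage in (dehaze, upscale, restore_faces):
        img = stage(img)
    return img
```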
Task 3: Face detection and recognition for person database
Face detection and recognition require separate specialized models in a two-stage pipeline. For detection, py-feat/retinaface (MIT license, 6.8K downloads) represents the industry standard for challenging degraded images, leveraging a ResNet-50 backbone to detect faces at various angles while outputting 5-point facial landmarks (eyes, nose, mouth corners) crucial for alignment. The amd/retinaface variant provides Apache 2.0 licensed ONNX format with a MobileNet backbone for lighter deployment.
Modern alternatives include AdamCodd/YOLOv11n-face-detection (Apache 2.0, 340 downloads, 16 likes) with ONNX export support and real-time capability, and arnabdhar/YOLOv8-Face-Detection (97 likes, 20+ demo spaces), though its AGPL-3.0 license imposes copyleft obligations that complicate commercial distribution. The deepghs/real_face_detection model explicitly targets real photographs with ONNX models available, distinguishing it from anime-focused alternatives.
For recognition and embedding generation, minchul/cvlface_arcface_ir101_webface4m delivers state-of-the-art 512-dimensional face embeddings through its Improved ResNet-101 architecture trained on the massive WebFace4M dataset (65.2M parameters, 858 downloads). These embeddings enable accurate clustering and matching with cosine similarity thresholds of 0.6-0.7 for same-person identification. The MIT-licensed py-feat/facenet provides 128 or 512-dimensional embeddings using the classic FaceNet architecture, while garavv/arcface-onnx offers pre-converted ONNX format for immediate deployment.
All-in-one solution: public-data/insightface (7 likes, 10+ spaces) combines SCRFD detection with ArcFace recognition in a complete pipeline including age, gender, and landmark detection, though licensing requires careful review for commercial use.
Database implementation strategy: Store embeddings in vector databases like FAISS or Pinecone, cluster using DBSCAN or hierarchical clustering to group identical persons, and maintain links between embeddings and source image bounding boxes. Properly aligned faces using landmark-based rotation correction significantly improve embedding quality and subsequent matching accuracy. Process each detected face separately, align using the 5-point landmarks, generate embeddings, then cluster globally across all photos.
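The cosine-similarity matching described above can be sketched without a vector database. The greedy clustering below is a minimal illustration using the 0.6–0.7 threshold from the text; DBSCAN or hierarchical clustering, as recommended, scales and behaves better on large collections.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_faces(embeddings: list, threshold: float = 0.65) -> list:
    """Greedy person clustering: each embedding joins the first cluster whose
    centroid matches above `threshold`, otherwise it starts a new cluster.
    Returns one cluster id per embedding."""
    centroids = []
    labels = []
    for e in embeddings:
        for cid, c in enumerate(centroids):
            if cosine_sim(e, c) >= threshold:
                labels.append(cid)
                centroids[cid] = (c + e) / 2  # cheap running-centroid update
                break
        else:
            labels.append(len(centroids))
            centroids.append(e.astype(float))
    return labels
```

In a production system the same loop would run against a FAISS index, with cluster ids linked back to the source image and bounding box of each face.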
Task 4 & 5: Object detection and scene classification
For object detection and tagging, facebook/detr-resnet-50 (Apache 2.0, 19.2M downloads) provides production-ready detection of 80 COCO object classes including people, vehicles, animals, furniture, and everyday objects. This Detection Transformer with 41.6M parameters balances speed and accuracy, with Xenova/detr-resnet-50 offering pre-converted ONNX format for browser and edge deployment. Higher accuracy demands facebook/detr-resnet-101 (60.7M parameters) or SenseTime/deformable-detr (improved small object detection), while resource-constrained environments benefit from hustvl/yolos-tiny at just 6.5M parameters with Apache 2.0 licensing and fast inference.
Scene classification achieves comprehensive coverage through corenet-community/places365-224x224-vit-base, trained on 365 scene categories spanning indoor spaces (bedroom, kitchen, living room, office, restaurant, classroom), outdoor environments (beach, mountain, forest, street, highway), and natural settings (sky, clouds, sunset, waterfall, canyon). This Vision Transformer base model surpasses limited indoor-only classifiers like keremberke/yolov8m-scene-classification, though the YOLOv8 variants offer faster inference with ONNX export capability for applications prioritizing speed over category breadth.
Key architectural insight: DETR models use transformers for end-to-end detection without hand-crafted components like non-maximum suppression, while YOLOS applies pure Vision Transformers to detection tasks. Both approaches support straightforward ONNX conversion through the transformers library's export utilities or dedicated conversion tools.
Combined deployment for maximum efficiency: Run object detection once per image to tag all present items, simultaneously classify the overall scene for contextual tagging, then merge results into comprehensive image metadata. This dual-model approach takes under 1 second per image on modern hardware while providing rich semantic understanding for searchability and organization.
Task 6: Image rotation detection
Specialized rotation detection models conclusively outperform repurposing scene classifiers, eliminating the need for complex heuristics. amaye15/Beit-Base-Image-Orientation-Fixer (Apache 2.0, 3.4K downloads, 86M parameters) directly classifies orientation as 0°, 90°, 180°, or 270° using BEiT (Bidirectional Encoder representation from Image Transformers) fine-tuned for orientation classification. This single-model, single-inference approach, with a demonstration Space available, provides reliable rotation correction without additional complexity.
ONNX-ready alternatives include Chuckame/deep-image-orientation-angle-detection (MIT license) and DuarteBarbosa/deep-image-orientation-detection (MIT, trained on COCO), both offering native ONNX format for immediate integration. These ViT-based models deliver fast inference suitable for batch processing thousands of scanned slides.
Face orientation provides powerful secondary validation: After primary rotation detection, RetinaFace's 5-point facial landmarks enable verification through geometric analysis of eye and mouth positions. If face landmarks suggest inconsistent orientation, flag for manual review. This hybrid approach catches edge cases where ambiguous content confuses rotation classifiers, particularly abstract compositions or unusual angles.
The most efficient workflow: Run rotation detection first, apply correction, then proceed with other processing tasks on properly oriented images. This prevents downstream models from wasting capacity on rotated inputs and ensures face detection, object detection, and scene classification operate at peak accuracy. Scene classifiers theoretically could provide rotation hints through sky/ground positioning, but dedicated models prove simpler, faster, and more reliable in practice.
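The landmark-based validation from the previous paragraphs reduces to simple geometry on the two eye points. The sketch below assumes image coordinates with y increasing downward and an upright face yielding a roughly horizontal, left-to-right inter-ocular vector; the point layout is an assumption, not a RetinaFace specification. A mismatch with the classifier's answer would flag the image for manual review.

```python
import math

def rotation_from_eyes(left_eye, right_eye):
    """Estimate apparent image rotation (0, 90, 180, or 270 degrees) from
    the vector between the subject's left and right eye landmarks."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))
    return round(angle / 90) * 90 % 360  # snap to the nearest quarter turn
```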
Task 7: Image cropping and border removal
Border removal and traction hole cleanup fall naturally within inpainting model capabilities. LaMa-ONNX (Carve/LaMa-ONNX) excels at edge reconstruction when provided masks covering damaged borders, slide frame artifacts, and perforation holes from projection slides. The same model recommended for crack repair handles border cleanup without requiring separate specialized solutions, consolidating the restoration pipeline.
For automated workflows, implement edge detection algorithms (Canny, Sobel) to identify irregular borders and traction holes, generate binary masks marking these regions, then apply LaMa inpainting to reconstruct clean edges. Alternatively, for simple border cropping without reconstruction, classical computer vision techniques using contour detection and perspective correction suffice – deep learning models offer minimal advantage for straightforward geometric operations.
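A minimal version of the automated border masking described above needs only NumPy. The band width and brightness cutoff below are tuning assumptions for near-white scanner borders and perforation glare; a production workflow would use Canny/contour detection in OpenCV as the text suggests.

```python
import numpy as np

def border_mask(gray: np.ndarray, band: int = 8, bright: int = 240) -> np.ndarray:
    """Mark near-white pixels (>= `bright`) within a `band`-pixel frame
    around the image edge as damage to inpaint. Returns a 0/255 mask."""
    h, w = gray.shape
    in_band = np.zeros((h, w), dtype=bool)
    in_band[:band, :] = in_band[-band:, :] = True
    in_band[:, :band] = in_band[:, -band:] = True
    return (in_band & (gray >= bright)).astype(np.uint8) * 255
```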
Advanced border reconstruction leveraging diffusion models like diffusers/stable-diffusion-xl-1.0-inpainting-0.1 enables creative extension of image content beyond original boundaries, useful when slides were cropped during scanning. This approach generates plausible surrounding context, though computational cost (20-30 inference steps, 6.9GB model) makes it impractical for batch processing unless quality demands justify the overhead.
Model combination strategies and task overlap analysis
Face detection serves triple duty: identifying individuals for person databases, enabling specialized face restoration processing, and providing rotation validation through facial landmark geometry. Deploy RetinaFace once per image, then route detected face regions to GFPGAN or CodeFormer while processing non-face areas with general restoration models. This selective processing reduces computational cost while maximizing per-region quality.
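The routing step above — cropping detected face regions before handing them to GFPGAN or CodeFormer — is a small amount of array slicing. The (x1, y1, x2, y2) box format and margin value are assumptions; detectors vary in their output conventions.

```python
import numpy as np

def crop_face(img: np.ndarray, box, margin: float = 0.2) -> np.ndarray:
    """Cut a face region out of an (H, W, C) image with a relative margin,
    clamped to image bounds, ready for a dedicated face restorer."""
    x1, y1, x2, y2 = box
    mw = int((x2 - x1) * margin)
    mh = int((y2 - y1) * margin)
    h, w = img.shape[:2]
    return img[max(0, y1 - mh):min(h, y2 + mh),
               max(0, x1 - mw):min(w, x2 + mw)]
```

After restoration, the enhanced crop is pasted back at the same coordinates, optionally with a feathered blend at the seam.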
LaMa consolidates multiple inpainting needs: crack repair, scratch removal, traction hole cleanup, and border reconstruction all utilize the same model with different masks, eliminating redundancy. Generate separate masks for different damage types through classical computer vision or manual annotation, then process in a single inference pass by combining masks or multiple targeted passes for granular control over each repair type.
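Merging the per-damage-type masks for a single LaMa pass, as suggested above, is a logical OR over 0/255 mask arrays:

```python
import numpy as np

def combine_masks(*masks: np.ndarray) -> np.ndarray:
    """OR together per-damage-type 0/255 masks (cracks, scratches,
    perforations) into one mask for a single inpainting pass."""
    out = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        out |= m > 0
    return out.astype(np.uint8) * 255
```

Keeping the individual masks around still allows targeted re-runs when one repair type needs different settings.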
Scene classification enhances object detection metadata: While Places365 provides high-level scene context (beach, forest, living room), DETR supplies specific object presence (person, dog, surfboard). Combining both creates rich hierarchical tagging – "beach scene with person and surfboard" rather than just object lists. This semantic richness dramatically improves photo organization and search capabilities across large slide collections.
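The hierarchical tag merge described above is straightforward string assembly over the two models' label outputs; the format here is illustrative, not a fixed schema.

```python
def describe(scene: str, objects: list) -> str:
    """Merge a Places365 scene label with DETR object labels into a
    hierarchical tag such as 'beach scene with person, surfboard'."""
    uniq = sorted(set(objects))  # deduplicate repeated detections
    if not uniq:
        return f"{scene} scene"
    return f"{scene} scene with " + ", ".join(uniq)
```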
Rotation detection must precede all other tasks: Running object detection, face detection, or scene classification on rotated images degrades accuracy significantly. Always correct orientation first, then proceed with content analysis. The computational overhead of rotation detection (single lightweight ViT inference) pays for itself many times over through improved downstream accuracy.
ONNX compatibility comprehensive summary
Native ONNX models ready for immediate deployment: Carve/LaMa-ONNX (inpainting), Xenova/detr-resnet-50 (object detection), Larvik/GFPGANx and Neus/GFPGANv1.4 (face restoration), bluefoxcreation/Codeformer-ONNX (face restoration), opencv/deblurring_nafnet (deblurring), rocca/swin-ir-onnx (super-resolution), xiongjie/lightweight-real-ESRGAN-anime (upscaling), Chuckame/deep-image-orientation-angle-detection and DuarteBarbosa/deep-image-orientation-detection (rotation), garavv/arcface-onnx and astaileyyoung/facenet-onnx (face recognition), deepghs/real_face_detection (face detection).
Easily convertible to ONNX through standard tools: All transformers-based models (DETR, YOLOS, ViT, BEiT) via Hugging Face Optimum library (optimum-cli export onnx), all PyTorch models including Bringing Old Photos Back to Life, Real-ESRGAN, and CodeFormer variants. YOLOv8 models include native ONNX export through ultralytics library.
Complex ONNX conversion requiring pipeline decomposition: Stable Diffusion models (SDXL inpainting, SD v2 inpainting) need separate export of UNet, VAE, and text encoder components, then custom inference pipeline construction. FLUX.1 models present significant challenges due to size and architecture complexity. For production desktop applications, avoid diffusion models unless quality demands justify the implementation effort.
The ONNX export process for most PyTorch models: pip install optimum[exporters] then optimum-cli export onnx --model <model-id> <output-directory>. Test exported models thoroughly as some architectures may require manual graph optimization or operator compatibility adjustments for specific ONNX runtimes.
Licensing summary for commercial deployment
Apache 2.0 licensed (most permissive): hoangmnsd/Bringing-Old-Photos-Back-to-Life, Carve/LaMa-ONNX, smartywu/big-lama, google/maxim-s2-dehazing (both variants), keras-io/low-light-image-enhancement, rocca/swin-ir-onnx, camenduru/cv_ddcolor_image-colorization, facebook/detr models, microsoft/conditional-detr, SenseTime/deformable-detr, hustvl/yolos variants, amd/retinaface, AdamCodd/YOLOv11n-face-detection, JustinLeee/FaceMind_ArcFace, hansin91/scene_classification, amaye15/Beit-Base-Image-Orientation-Fixer.
MIT licensed (fully permissive): leonelhs/gfpgan, dnnagy/RestoreFormerPlusPlus, ohayonguy/PMRF_blind_face_image_restoration, deepghs/image_restoration, saravakun9090/gan_dehazing_model, chuxiaojie/NAFNet, py-feat/retinaface, py-feat/facenet, astaileyyoung/facenet-onnx, Chuckame/deep-image-orientation-angle-detection, DuarteBarbosa/deep-image-orientation-detection.
OpenRAIL/OpenRAIL++ (permissive with responsible AI restrictions): diffusers/stable-diffusion-xl-1.0-inpainting-0.1, stabilityai/stable-diffusion-2-inpainting, neurallove/controlnet-sd21-colorization-diffusers, lllyasviel/control_v11p_sd15_inpaint. Read terms carefully but generally allow commercial use with ethical AI commitments.
Restrictive licenses requiring careful review: AGPL-3.0 (arnabdhar/YOLOv8-Face-Detection) imposes copyleft requiring source disclosure if distributed, Qualcomm models often have proprietary restrictions, models marked "other" need individual license file examination. When licensing unclear, contact model authors or choose alternatives with explicit permissive licenses.
Missing license information: Many models lack explicit license fields – TencentARC/GFPGANv1, ai-forever/Real-ESRGAN, wwy/codeformer, ziixzz/codeformer-v0.1.0.pth, and others. For production use, either obtain written permission from authors, find licensed alternatives, or restrict to internal non-distributed use only.
Recommended system architectures
Lightweight pipeline (fastest, smallest footprint)
Total model size: ~500MB; processing time: <2 seconds per image on CPU
- Rotation detection: DuarteBarbosa/deep-image-orientation-detection (ONNX, MIT)
- Face detection: deepghs/real_face_detection (ONNX)
- Face restoration: Larvik/GFPGANx (ONNX)
- Inpainting: Carve/LaMa-ONNX (ONNX, Apache 2.0)
- Enhancement: xiongjie/lightweight-real-ESRGAN-anime (ONNX)
- Object detection: hustvl/yolos-tiny (6.5M params, Apache 2.0, convertible to ONNX)
- Scene classification: keremberke/yolov8n-scene-classification (ONNX-ready)
- Face recognition: astaileyyoung/facenet-onnx (MIT)
All ONNX models enable cross-platform C++, Python, or C# deployment with ONNX Runtime, minimizing dependencies and maximizing portability. Suitable for distribution to end users with modest hardware.
Balanced pipeline (quality and performance optimized)
Total model size: ~2GB; processing time: <5 seconds per image on GPU
- Rotation detection: amaye15/Beit-Base-Image-Orientation-Fixer (Apache 2.0)
- Face detection: py-feat/retinaface (MIT, industry standard)
- Face restoration: bluefoxcreation/Codeformer-ONNX (ONNX native)
- General restoration: hoangmnsd/Bringing-Old-Photos-Back-to-Life (Apache 2.0, handles multiple degradations)
- Color enhancement: google/maxim-s2-dehazing-sots-outdoor (Apache 2.0)
- Upscaling: ai-forever/Real-ESRGAN (most proven, 100+ spaces)
- Object detection: facebook/detr-resnet-50 (Apache 2.0, 19.2M downloads)
- Scene classification: corenet-community/places365-224x224-vit-base (365 categories)
- Face recognition: minchul/cvlface_arcface_ir101_webface4m (state-of-the-art embeddings)
Converts to ONNX where not native, balances quality with reasonable computational requirements, all Apache 2.0 or MIT licensing where specified.
Maximum quality pipeline (best possible results)
Total model size: ~5GB; processing time: 10-30 seconds per image on high-end GPU
- Rotation detection: amaye15/Beit-Base-Image-Orientation-Fixer
- Face detection: py-feat/retinaface (with landmark alignment)
- Face restoration: TencentARC/GFPGANv1 or wwy/codeformer (highest quality, verify licensing)
- General restoration: hoangmnsd/Bringing-Old-Photos-Back-to-Life
- Inpainting large regions: diffusers/stable-diffusion-xl-1.0-inpainting-0.1 (OpenRAIL++)
- Small scratch inpainting: smartywu/big-lama (Apache 2.0)
- Color enhancement: google/maxim-s2-dehazing + Real-ESRGAN pipeline
- Deblurring: deepinv/Restormer (transformer-based)
- Object detection: facebook/detr-resnet-101 or SenseTime/deformable-detr
- Scene classification: corenet-community/places365-224x224-vit-large
- Face recognition: minchul/cvlface_arcface_ir101_webface4m
Prioritizes quality over speed, suitable for professional restoration services or archival projects where processing time is secondary to output quality. Requires substantial GPU memory (12GB+ recommended).
Performance characteristics and hardware requirements
Model size and speed relationships directly impact user experience. RetinaFace, YOLOS-tiny, and NAFNet deliver real-time performance on CPU, processing hundreds of images per minute without GPU acceleration. These lightweight architectures suit desktop applications where users process personal photo collections on modest hardware.
GAN-based models (Real-ESRGAN, GFPGAN, LaMa) achieve excellent quality-speed balance, typically completing in 0.5-2 seconds per image on mid-range GPUs (RTX 3060, similar). Single forward passes without iterative optimization make them production-ready, contrasting sharply with diffusion models requiring 20-50 inference steps.
Transformer models (DETR, BEiT, ViT, CodeFormer) trade speed for accuracy, generally requiring 1-5 seconds per inference on GPU. Their attention mechanisms capture global context better than CNNs, justifying the overhead for tasks like rotation detection and scene classification. However, CPU inference becomes impractically slow – 30+ seconds is typical – making GPU acceleration essential for reasonable throughput.
Diffusion models (Stable Diffusion XL, FLUX.1) deliver exceptional quality at severe computational cost: 20-30 seconds minimum per image on high-end GPUs, with multi-gigabyte memory requirements. Reserve these for specific high-value use cases like creative border reconstruction or extreme quality demands, not batch processing workflows.
Memory requirements scale roughly with model parameters: tiny models (5-10M params) need ~500MB, medium models (50-100M params) require 2-4GB, large models (300M+ params) demand 8GB+. Account for batch processing memory multiplication and framework overhead when sizing deployment hardware.
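The sizing rule above can be captured in a back-of-the-envelope estimator. The 4 bytes/parameter figure assumes fp32 weights, and the 2x factor for activations and framework overhead is an assumption for rough planning, not a measured constant.

```python
def weight_memory_mb(params_millions: float, bytes_per_param: int = 4,
                     overhead: float = 2.0) -> float:
    """Rough memory estimate in MB: fp32 weights at 4 bytes per parameter,
    times an assumed overhead factor for activations and framework state."""
    return params_millions * bytes_per_param * overhead
```

For example, an 86M-parameter BEiT classifier lands around 688 MB by this rule, comfortably within the 2-4GB band the text gives for medium models.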
Conclusion: assembling the optimal restoration system
The Hugging Face model ecosystem provides comprehensive coverage for old photo restoration with mature, production-ready solutions across all seven required tasks. Three-tier architecture emerges as optimal: dedicated rotation correction first, specialized face processing for portraits, and efficient general restoration for remaining content. This approach maximizes per-region quality while minimizing computational waste.
Critical success factors: prioritize Apache 2.0 and MIT licensed models for commercial deployment; favor ONNX format for cross-platform compatibility and deployment flexibility; validate model outputs on representative degraded photos from your actual slide collection before full system integration; implement fallback strategies for edge cases where automated processing fails; and provide manual override tools for user correction of misclassified orientations or missed damage.
The balanced pipeline recommended above combines proven models with clear licensing, achieving professional restoration quality at reasonable computational cost. Start with this foundation, then optimize based on your specific slide collection characteristics – adjust face model selection for demographic match, tune inpainting aggressiveness for damage severity patterns, and scale computational resources to match throughput requirements. Most importantly, process thousands of representative slides through your pipeline before committing to architecture decisions, as real-world performance often surprises initial expectations.