CLI for ML engineers, AI researchers, and dataset builders

Turn GitHub repositories into clean, AI-ready datasets.

RepoCurator is a TypeScript-powered CLI that scrapes, filters, scores, and validates codebases for machine learning and LLM training.

Protected filtering

Drop .git, binaries, lockfiles, and non-training artifacts before they leak into exports.

Per-file scoring

Promote useful code paths and demote low-signal files with deterministic heuristics.

Export + validate

Produce ML-ready datasets with verification checkpoints instead of one-off scripts.

Core Extraction Pipeline

GitHub Repo → Clone → Filter → Score → Export → Validate

Problem

GitHub repositories are valuable, but raw source trees are not model-ready.

Training data quality collapses when collection is naive. RepoCurator exists to filter noise, preserve signal, and make code exports repeatable.

Raw repositories are noisy

Source trees include generated assets, vendored blobs, duplicate files, and long-tail junk that poison downstream training data.

Quality is inconsistent

Even clean repositories still mix strong signal with stale, trivial, or malformed files that need scoring before model usage.

Structure is missing

ML workflows need repeatable exports, validations, and pipeline checkpoints instead of ad hoc scripts and manual cleanup.

Solution

A CLI and pipeline designed specifically for repository-to-dataset workflows.

RepoCurator combines scraping, cleanup, scoring, export, and validation into one deliberate ingestion path.

Intelligent repository scraping

Clone and process public GitHub repositories with a CLI flow designed for dataset generation at scale.

Protected system filtering

Skip binaries, lockfiles, protected directories, and other non-training artifacts before they contaminate your corpus.
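As an illustration only (a sketch, not RepoCurator's published rule set), a protected-path predicate for this kind of filtering might look like:

```typescript
// Illustrative filter predicate: drop common non-training artifacts by path.
// Directory, lockfile, and extension lists here are assumptions for the sketch.
const PROTECTED_DIRS = new Set([".git", "node_modules", "dist", "vendor"]);
const LOCKFILES = new Set(["package-lock.json", "yarn.lock", "pnpm-lock.yaml", "Cargo.lock"]);
const BINARY_EXTENSIONS = new Set([".png", ".jpg", ".gif", ".zip", ".pdf", ".exe", ".so", ".wasm"]);

function isTrainingCandidate(relativePath: string): boolean {
  const segments = relativePath.split("/");
  const fileName = segments[segments.length - 1];
  const ext = fileName.includes(".") ? fileName.slice(fileName.lastIndexOf(".")).toLowerCase() : "";

  if (segments.some((s) => PROTECTED_DIRS.has(s))) return false; // protected directories
  if (LOCKFILES.has(fileName)) return false;                     // lockfiles
  if (BINARY_EXTENSIONS.has(ext)) return false;                  // binary blobs
  return true;
}
```

The point of a single predicate like this is that it runs before any file content is read, so contaminated paths never enter the corpus in the first place.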

File quality scoring engine

Apply heuristics from 0.0 to 1.0 so only useful files reach export and model preparation stages.
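A deterministic 0.0–1.0 scorer can be sketched as a handful of additive signals. The specific rules and weights below are hypothetical; RepoCurator's actual heuristics are not spelled out here:

```typescript
interface FileSample {
  path: string;
  content: string;
}

// Hypothetical scoring heuristics: combine simple deterministic signals
// into a clamped 0.0-1.0 quality score.
function scoreFile(file: FileSample): number {
  const lines = file.content.split("\n");
  let score = 0.5; // neutral baseline

  // Promote files with a healthy amount of code.
  if (lines.length >= 20 && lines.length <= 2000) score += 0.2;

  // Promote files that carry comments/documentation.
  if (lines.some((l) => l.trimStart().startsWith("//"))) score += 0.1;

  // Demote likely generated or minified content (very long lines).
  if (lines.some((l) => l.length > 500)) score -= 0.4;

  // Clamp to the documented 0.0-1.0 range.
  return Math.min(1, Math.max(0, score));
}
```

Because every signal is a pure function of the file contents, the same input always produces the same score, which is what makes re-runs reproducible.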

Per-file dataset export

Write structured JSON or TXT outputs for downstream indexing, labeling, and fine-tuning workflows.
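One common shape for such an export is JSON Lines: one record per file, one file per line. The field names below are assumptions for illustration, not RepoCurator's published schema:

```typescript
// Illustrative per-file export record (field names are assumptions).
interface ExportRecord {
  repo: string;         // e.g. "owner/name"
  path: string;         // path within the repository
  language: string;     // detected language label
  qualityScore: number; // heuristic score in [0.0, 1.0]
  content: string;      // file contents
}

// JSON Lines: one record per line, easy to stream into indexing,
// labeling, and fine-tuning tooling downstream.
function toJsonLine(record: ExportRecord): string {
  return JSON.stringify(record);
}
```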

Dataset validation CLI

Run verification passes before model usage so bad samples are caught early, on every run.
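A validation gate of this kind can be sketched as a pure check that returns a list of problems; the specific checks and thresholds below are illustrative assumptions:

```typescript
interface Sample {
  path: string;
  content: string;
  qualityScore: number;
}

// Illustrative validation gate: return a list of problems for a sample.
// An empty list means the sample passes and may be exported.
function validateSample(sample: Sample): string[] {
  const problems: string[] = [];
  if (sample.content.length === 0) problems.push("empty file");
  if (sample.content.includes("\u0000")) problems.push("binary/NUL bytes");
  if (sample.qualityScore < 0.3) problems.push("below quality threshold");
  return problems;
}
```

Returning the full problem list, rather than failing on the first check, makes repeated validation runs useful as a report as well as a gate.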

Batch processing support

Move from one repository to hundreds without rebuilding your ingestion process each time.

GitHub auth and rate-limit insight

Inspect authenticated scraping flows and understand when API constraints affect ingestion throughput.
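GitHub reports remaining quota via `x-ratelimit-remaining` and `x-ratelimit-reset` response headers. A small helper (a sketch, not RepoCurator's API) can turn those into a backoff decision:

```typescript
// Decide how long to pause ingestion based on GitHub's rate-limit headers.
// x-ratelimit-reset is a Unix timestamp in seconds.
function backoffMs(headers: Record<string, string>, nowMs: number = Date.now()): number {
  const remaining = Number(headers["x-ratelimit-remaining"] ?? "1");
  const resetEpochSec = Number(headers["x-ratelimit-reset"] ?? "0");
  if (remaining > 0) return 0; // quota left: keep scraping
  // Quota exhausted: wait until the reset timestamp (never negative).
  return Math.max(0, resetEpochSec * 1000 - nowMs);
}
```

Surfacing this number to the operator is what turns rate limits from silent stalls into visible throughput constraints.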

Architecture

A clean curation pipeline from raw repository input to validated output.

Each stage is isolated so the flow stays understandable as the system scales: Clone Repo → Filter Engine → Quality Scoring → Staging → Export → Validate.
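The staged flow above can be modeled as composable steps over a batch of files. This is a minimal sketch under assumed types, not the system's internals:

```typescript
type FileEntry = { path: string; content: string; score?: number };
type Stage = (batch: FileEntry[]) => FileEntry[];

// Compose stages left to right, mirroring Filter -> Score -> ... -> Validate.
const runPipeline = (stages: Stage[]) => (input: FileEntry[]): FileEntry[] =>
  stages.reduce((batch, stage) => stage(batch), input);

// Toy stage implementations for the sketch:
const filterStage: Stage = (batch) => batch.filter((f) => !f.path.startsWith(".git/"));
const scoreStage: Stage = (batch) => batch.map((f) => ({ ...f, score: 0.5 }));
```

Because each stage shares one narrow interface, a filtering rule or scoring heuristic can be swapped out without touching its neighbors.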

Stage control

Each step is isolated so filtering rules, scoring heuristics, and validation policies can evolve without tearing up the full workflow.

Traceable exports

A staging layer creates a clean handoff before JSON or TXT export, keeping downstream dataset generation auditable.

Validation-first

Validation is treated as a built-in pipeline gate rather than an optional post-processing script.

Use Cases

Built for the teams turning public code into training signal.

RepoCurator fits both one-off research projects and disciplined ingestion pipelines that need repeatability.

LLM fine-tuning datasets

Build clean corpora from targeted repositories without hand-curating every file.

Code intelligence models

Feed scoring-aware code samples into search, ranking, and code-understanding systems.

AI research labs

Stand up repeatable repository ingestion for experiments that depend on consistent code quality.

Dataset engineering pipelines

Scale ingestion to 1,000+ GitHub repos without hand-cleaning every export.

CLI Preview

A terminal experience that feels purpose-built for dataset engineers.

The preview below walks through a typed command sequence with success-state feedback, so the workflow is visible before anything is installed.

Trust

Engineered like a real system, not a landing-page promise.

RepoCurator is positioned as an opinionated CLI pipeline with strict typing, phase ownership, and architecture validation baked in.

Source-to-JSON extraction preserves repository metadata and file-level quality scores.

Automated validation gates prevent corrupted or low-signal blobs from leaking into model exports.

Deep repository scanning identifies high-quality code patterns while skipping boilerplate noise.

Verified Data Integrity
High-Signal Filtering

Quality First

Noise-free dataset ingestion

RepoCurator applies deterministic heuristics to every file, ensuring only the most useful code reaches your final training corpus.

Infrastructure

Build datasets at scale

Standardize your ingestion pipeline with a purpose-built CLI that handles GitHub rate limits and complex repository structures automatically.

User validation

Would you use RepoCurator?

We are building RepoCurator as a focused CLI product. Register your interest to help us validate demand and get notified of the first public release.

Who it is for

ML engineers, researchers, founders, and students building code datasets for real model workflows.

What you will get

Early release notification, product progress context, and a direct signal that shapes what ships first.