Raw repositories are noisy
Source trees include generated assets, vendored blobs, duplicate files, and long-tail junk that poison downstream training data.
RepoCurator is a TypeScript-powered CLI that scrapes, filters, scores, and validates codebases for machine learning and LLM training.
Drop .git, binaries, lockfiles, and non-training artifacts before they leak into exports.
Promote useful code paths and demote low-signal files with deterministic heuristics.
Produce ML-ready datasets with verification checkpoints instead of one-off scripts.
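The drop-before-export step above can be sketched as a simple path predicate. The directory, lockfile, and extension lists here are illustrative defaults, not RepoCurator's shipped rules, and `shouldKeep` is a hypothetical name:

```typescript
// Illustrative path filter: skip VCS internals, lockfiles, and binary
// assets before any file reaches the export stage. Rules are examples only.
const SKIP_DIRS = new Set([".git", "node_modules", "vendor", "dist"]);
const SKIP_FILES = new Set(["package-lock.json", "yarn.lock", "pnpm-lock.yaml"]);
const BINARY_EXTS = new Set([".png", ".jpg", ".zip", ".wasm", ".pdf"]);

function shouldKeep(relPath: string): boolean {
  const parts = relPath.split("/");
  const file = parts[parts.length - 1];
  if (parts.some((p) => SKIP_DIRS.has(p))) return false; // protected directory
  if (SKIP_FILES.has(file)) return false;                // lockfile
  const dot = file.lastIndexOf(".");
  const ext = dot >= 0 ? file.slice(dot) : "";
  if (BINARY_EXTS.has(ext)) return false;                // binary artifact
  return true;
}
```

A deny-list predicate like this is deterministic and cheap, so it can run on every path before any content is read from disk.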
Core Extraction Pipeline
GitHub Repo → Clone → Filter → Score → Export → Validate
Problem
Training data falls apart when collection is naive. RepoCurator exists to filter noise, preserve signal, and make code exports repeatable.
Collected as-is, source trees carry generated assets, vendored blobs, duplicate files, and long-tail junk that poison downstream training data.
Even clean repositories still mix strong signal with stale, trivial, or malformed files that need scoring before model usage.
ML workflows need repeatable exports, validations, and pipeline checkpoints instead of ad hoc scripts and manual cleanup.
Solution
RepoCurator combines scraping, cleanup, scoring, export, and validation into one deliberate ingestion path.
Clone and process public GitHub repositories with a CLI flow designed for dataset generation at scale.
Skip binaries, lockfiles, protected directories, and other non-training artifacts before they contaminate your corpus.
Apply quality heuristics that score each file from 0.0 to 1.0 so only useful files reach export and model-preparation stages.
Write structured JSON or TXT outputs for downstream indexing, labeling, and fine-tuning workflows.
Run verification passes before model usage so bad samples are caught early, on every run.
Move from one repository to hundreds without rebuilding your ingestion process each time.
Inspect authenticated scraping flows and understand when API constraints affect ingestion throughput.
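The scoring and export steps in the list above can be sketched together: a per-file heuristic that returns a value in [0.0, 1.0], and a record shape suitable for structured JSON output. Both the signals (line count, average line length, comment ratio) and the `ExportRecord` fields are assumptions for illustration, not RepoCurator's actual internals:

```typescript
// Hypothetical export record for structured JSON output.
interface ExportRecord {
  repo: string;    // e.g. "owner/name"
  path: string;    // path relative to the repo root
  score: number;   // quality score in [0.0, 1.0]
  content: string;
}

// Illustrative deterministic heuristic: penalize trivial and likely
// minified/generated files, reward moderately commented code.
function scoreFile(content: string): number {
  const nonEmpty = content.split("\n").filter((l) => l.trim().length > 0);
  if (nonEmpty.length < 3) return 0.1; // trivial or near-empty file
  const avgLen = nonEmpty.reduce((s, l) => s + l.length, 0) / nonEmpty.length;
  if (avgLen > 300) return 0.2; // likely minified or generated
  const commentRatio =
    nonEmpty.filter((l) => l.trim().startsWith("//")).length / nonEmpty.length;
  // Cap each bonus so the result stays in [0.0, 1.0].
  return Math.min(
    0.5 + Math.min(commentRatio, 0.3) + Math.min(nonEmpty.length / 500, 0.2),
    1.0
  );
}
```

Because the heuristic is a pure function of file content, re-running the pipeline over the same repository yields identical scores, which is what makes exports repeatable.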
Architecture
Each stage is visually isolated so the flow stays understandable as the system scales: Clone Repo → Filter Engine → Quality Scoring → Staging → Export → Validate.
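The stage chain above can be modeled as typed functions composed left to right, so the code reads like the diagram. The stage signatures and the toy filter/score/validate bodies below are assumptions for illustration; RepoCurator's real interfaces may differ:

```typescript
// A stage maps one pipeline payload to the next.
type Stage<I, O> = (input: I) => O;

interface RepoFile { path: string; content: string; }
interface ScoredFile extends RepoFile { score: number; }

// Compose two stages left-to-right.
function pipe<A, B, C>(f: Stage<A, B>, g: Stage<B, C>): Stage<A, C> {
  return (a) => g(f(a));
}

// Toy stage bodies, standing in for the real filter/score/validate logic.
const filterStage: Stage<RepoFile[], RepoFile[]> = (files) =>
  files.filter((f) => !f.path.startsWith(".git/"));

const scoreStage: Stage<RepoFile[], ScoredFile[]> = (files) =>
  files.map((f) => ({ ...f, score: f.content.trim().length > 0 ? 0.8 : 0.0 }));

const validateStage: Stage<ScoredFile[], ScoredFile[]> = (files) =>
  files.filter((f) => f.score >= 0.5);

const run = pipe(pipe(filterStage, scoreStage), validateStage);
```

Keeping each stage behind its own typed signature is what lets the flow stay understandable as stages are added or swapped.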
Use Cases
RepoCurator fits both one-off research projects and disciplined ingestion pipelines that need repeatability.
Build clean corpora from targeted repositories without hand-curating every file.
Feed scoring-aware code samples into search, ranking, and code-understanding systems.
Stand up repeatable repository ingestion for experiments that depend on consistent code quality.
Scale to datasets drawn from 1000+ GitHub repos without manual cleanup passes.
CLI Preview
The preview below uses a typed command sequence, ember glow terminal styling, and success-state feedback tuned to the product aesthetic.
Trust
RepoCurator is positioned as an opinionated CLI pipeline with strict typing, phase ownership, and architecture validation baked into the design.
Source-to-JSON extraction preserves repository metadata and file-level quality scores.
Automated validation gates prevent corrupted or low-signal blobs from leaking into model exports.
Deep repository scanning identifies high-quality code patterns while skipping boilerplate noise.
User validation
We are building RepoCurator as a focused CLI product. Register your interest to validate demand and get notified of the first public release.
Pipeline preview
The interface is tuned for batch scraping, dataset validation, and predictable per-file exports.