Repo · Microsoft · published May 26, 2026

microsoft/selective-repo-fetch

TypeScript

Description: Enforce docs-as-code by declaratively separating documentation from code. Match manifest glob patterns against repo file trees to fetch only the files your build needs.

Language: TypeScript

License: MIT

Stars: 2

Forks: 0

Open issues: 0

Created: 2026-05-26T23:02:27Z

Pushed: 2026-05-27T01:36:00Z

Default branch: main

Fork: no

Archived: no

README:

selective-repo-fetch

Docs-as-code made practical.

Declaratively define which files are documentation and which are code. When documentation lives alongside code in large repositories, building a documentation site shouldn't require cloning the entire repo. selective-repo-fetch reads a JSON manifest that declares which files your doc pipeline needs, matches those patterns against a file listing, and tells you exactly what to fetch — nothing more.

The Problem

Docs-as-code means your documentation is:

  • ✅ Versioned in git alongside source code
  • ✅ Reviewed through pull requests
  • ✅ Built by CI/CD pipelines

But large monorepos create real pain:

  • Full clones are slow — repos with 100K+ files take minutes to clone
  • API throttling is real — GitHub/Azure DevOps/GitLab rate-limit file downloads
  • Doc builds only need a fraction — your manifest already declares what files matter

The Solution

selective-repo-fetch sits between your git provider API and your doc build pipeline:

┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  Git Provider   │     │ selective-repo-fetch │     │  Doc Pipeline   │
│ (file listing)  │────▶│ (manifest matching   │────▶│ (build only     │
│                 │     │  + reference filter) │     │  matched files) │
└─────────────────┘     └──────────────────────┘     └─────────────────┘

1. Get a file listing from any git API (a cheap metadata call).
2. Resolve the manifest to get content matches (markdown, configs) and resource matches (images, videos).
3. Fetch the content files (small text files, so this is fast and cheap).
4. Filter resources by reference: scan the content for references such as ![](...) and keep only the resources actually used.
5. Fetch only the referenced resources and skip unreferenced large binaries entirely.
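Step 1 can be sketched for one provider. The example below assumes GitHub's git trees endpoint (GET /repos/{owner}/{repo}/git/trees/{tree_sha}?recursive=1), whose response carries a `tree` array and a `truncated` flag. The helper name `toFileEntries` and the leading-slash convention (copied from the Quick Start paths) are assumptions for illustration, not part of this package.

```typescript
// Subset of the fields GitHub returns for each tree entry.
interface GitTreeEntry {
  path: string; // repo-relative, no leading slash
  type: 'blob' | 'tree' | 'commit'; // file, folder, or submodule
  size?: number; // present for blobs only
}

interface GitTreeResponse {
  tree: GitTreeEntry[];
  truncated: boolean; // true when the listing exceeded GitHub's limit
}

// Mirrors the { path: string } shape used in the Quick Start.
interface FileEntry {
  path: string;
}

// Hypothetical helper: keep only files and add the leading slash
// that the Quick Start paths use.
function toFileEntries(response: GitTreeResponse): FileEntry[] {
  if (response.truncated) {
    throw new Error('Tree listing was truncated; enumerate subtrees instead.');
  }
  return response.tree
    .filter((entry) => entry.type === 'blob')
    .map((entry) => ({ path: '/' + entry.path }));
}

const sampleResponse: GitTreeResponse = {
  truncated: false,
  tree: [
    { path: 'docs', type: 'tree' },
    { path: 'docs/getting-started.md', type: 'blob', size: 1204 },
    { path: 'src/main.ts', type: 'blob', size: 88 },
  ],
};

// Two entries remain: '/docs/getting-started.md' and '/src/main.ts'.
console.log(toFileEntries(sampleResponse));
```

Failing loudly on a truncated listing is one possible design choice here; a silently incomplete listing would make manifest matching miss files without any signal.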

Installation

npm install github:microsoft/selective-repo-fetch

Quick Start

import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

// Your manifest declares what your doc site needs
const manifest = {
  build: {
    content: [{ files: ['**/*.md'], src: 'docs' }],
    resource: [{ files: ['**/*.{png,jpg,svg}'], src: 'docs/images' }],
    template: ['templates/custom'],
  },
};

// Step 1: Get file listing from any git API (cheap metadata call)
const repoFiles = [
  { path: '/docs/getting-started.md' },
  { path: '/docs/api-reference.md' },
  { path: '/docs/images/architecture.png' },
  { path: '/docs/images/unused-screenshot.png' },
  { path: '/src/main.ts' }, // ← not documentation
  { path: '/scripts/deploy.ps1' }, // ← not documentation
];

// Step 2: Resolve manifest patterns → content + resource matches
const result = resolveFileMatches(repoFiles, manifest, '/', '/manifest.json');

console.log(result.contentMatches);
// ['/docs/getting-started.md', '/docs/api-reference.md']

console.log(result.resourceMatches);
// ['/docs/images/architecture.png', '/docs/images/unused-screenshot.png']

// Step 3: Fetch the content files (small text — fast and cheap)
const contentFileTexts = {
  '/docs/getting-started.md': '# Getting Started\n![Architecture](images/architecture.png)',
  '/docs/api-reference.md': '# API Reference\nNo images here.',
};

// Step 4: Filter resources to only those actually referenced in content
const referencedResources = filterReferencedResources(result.resourceMatches, contentFileTexts);

console.log(referencedResources);
// ['/docs/images/architecture.png']
// ↑ unused-screenshot.png is dropped — it matched the glob but no content file references it

// Step 5: Fetch only the referenced resources — skip unreferenced large binaries

Use Cases

Documentation portals pulling from multiple repos

Your portal builds docs from 50+ repos. Instead of cloning each one, get the tree listing and resolve only the doc files.

AI agent knowledge bases

Selectively ingest documentation from multiple repos into a RAG pipeline — only content files, not code, tests, or CI configs. The manifest-driven separation means your agents always have fresh, accurate documentation without processing entire repositories.

Monorepo doc builds

A 200K-file monorepo where docs live in /docs, /api-docs, and scattered README.md files. The manifest declares exactly which paths matter.

Incremental content pipelines

Combined with a git diff, resolve which *documentation* files changed — not which *code* files changed.
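A minimal sketch of that intersection, assuming the changed paths come from something like `git diff --name-only` (which prints repo-relative paths without a leading slash) and the documentation paths come from `contentMatches`. The helper `changedDocFiles` is hypothetical and not part of this package.

```typescript
// Hypothetical helper: keep only the changed paths that the manifest
// already classified as documentation.
function changedDocFiles(changedPaths: string[], docMatches: string[]): string[] {
  const docs = new Set(docMatches);
  return changedPaths
    // git prints 'docs/x.md'; the Quick Start paths look like '/docs/x.md'
    .map((p) => (p.startsWith('/') ? p : '/' + p))
    .filter((p) => docs.has(p));
}

const changed = ['src/main.ts', 'docs/getting-started.md'];
const contentMatches = ['/docs/getting-started.md', '/docs/api-reference.md'];

console.log(changedDocFiles(changed, contentMatches));
// ['/docs/getting-started.md']
```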

Static site generators (DocFX, MkDocs, Sphinx, Docusaurus)

Any SSG that uses a manifest/config to declare its inputs can benefit from pre-filtering the repo file list.

API

resolveFileMatches(files, manifest, patternPrefix?, manifestPath?)

The core function. Resolves manifest patterns against a file listing.

Parameters:

  • files: FileEntry[] — array of { path: string } representing all files in the repo (from any git tree API)
  • manifest: object — the manifest JSON declaring content/resource patterns
  • patternPrefix: string — prefix for relative patterns (usually the manifest folder path)
  • manifestPath: string — path to the manifest file, used to resolve relative src paths

Returns:

{
  contentMatches: string[]; // Files needed for content (markdown, notebooks, configs)
  resourceMatches: string[]; // Files needed as resources (images, videos, binaries)
}
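To make the matching step concrete, here is a simplified sketch of how one manifest section could be matched against a file listing. It is an illustration only, not this package's implementation: `globToRegExp` and `matchSection` are hypothetical names, the glob support is limited to `**`, `*`, `?` and `{a,b}`, and it assumes `src` is a base folder relative to the manifest folder, as the Quick Start output suggests.

```typescript
interface FileEntry { path: string; }
interface ManifestSection { files: string[]; src?: string; }

// Convert a small glob subset to an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  let source = '';
  let braceDepth = 0;
  let i = 0;
  while (i < glob.length) {
    const ch = glob.charAt(i);
    if (glob.startsWith('**/', i)) {
      source += '(?:.*/)?'; // zero or more leading folders
      i += 3;
    } else if (glob.startsWith('**', i)) {
      source += '.*';
      i += 2;
    } else {
      if (ch === '*') source += '[^/]*'; // anything within one segment
      else if (ch === '?') source += '[^/]';
      else if (ch === '{') { source += '(?:'; braceDepth++; }
      else if (ch === '}' && braceDepth > 0) { source += ')'; braceDepth--; }
      else if (ch === ',' && braceDepth > 0) source += '|';
      else source += ch.replace(/[.+^$()|[\]\\{}]/g, '\\$&'); // escape literals
      i += 1;
    }
  }
  return new RegExp('^' + source + '$');
}

// Match files under manifestDir/src against the section's patterns.
function matchSection(files: FileEntry[], section: ManifestSection, manifestDir: string): string[] {
  const base = [manifestDir, section.src ?? ''].join('/')
    .replace(/\/+/g, '/')   // collapse duplicate slashes
    .replace(/\/$/, '') + '/';
  const patterns = section.files.map(globToRegExp);
  return files
    .map((f) => f.path)
    .filter((p) => p.startsWith(base))
    .filter((p) => patterns.some((re) => re.test(p.slice(base.length))));
}

const listing: FileEntry[] = [
  { path: '/docs/getting-started.md' },
  { path: '/docs/images/architecture.png' },
  { path: '/src/main.ts' },
];

console.log(matchSection(listing, { files: ['**/*.md'], src: 'docs' }, '/'));
// ['/docs/getting-started.md']
console.log(matchSection(listing, { files: ['**/*.{png,jpg,svg}'], src: 'docs/images' }, '/'));
// ['/docs/images/architecture.png']
```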

resolveExternalPatterns(manifest, manifestPath?)

Discovers patterns that reference files outside the manifest folder (via src: "../other-folder"). Use this to know which additional tree paths to enumerate before calling resolveFileMatches.
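The idea of an "external" `src` can be illustrated with plain path arithmetic. The sketch below is not the package's implementation and does not reproduce the return shape of resolveExternalPatterns, which this excerpt does not show. The helpers `resolveSrc` and `isExternalSrc` and the example paths are hypothetical.

```typescript
// Resolve a src value against the folder that contains the manifest.
function resolveSrc(manifestPath: string, src: string): string {
  const parts = manifestPath.split('/').slice(0, -1); // manifest folder
  for (const segment of src.split('/')) {
    if (segment === '..') {
      if (parts.length > 1) parts.pop(); // never climb above the repo root
    } else if (segment !== '.' && segment !== '') {
      parts.push(segment);
    }
  }
  return parts.length > 1 ? parts.join('/') : '/';
}

// True when src points outside the manifest folder.
function isExternalSrc(manifestPath: string, src: string): boolean {
  const manifestDir = resolveSrc(manifestPath, '.');
  const target = resolveSrc(manifestPath, src);
  const dirPrefix = manifestDir === '/' ? '/' : manifestDir + '/';
  return target !== manifestDir && !target.startsWith(dirPrefix);
}

console.log(resolveSrc('/docs/site/manifest.json', '../shared'));    // '/docs/shared'
console.log(isExternalSrc('/docs/site/manifest.json', '../shared')); // true
console.log(isExternalSrc('/docs/site/manifest.json', 'images'));    // false
```

A folder flagged this way is one that a tree enumeration scoped to the manifest folder would miss, which is why it has to be listed separately before matching.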

extractStaticPathPrefix(pattern)

Extracts the non-glob prefix from a pattern — useful for converting glob patterns to API-compatible folder paths.

extractStaticPathPrefix('/docs/**/*.md') // → '/docs'
extractStaticPathPrefix('**/*.md') // → '/'
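One way to get the two results shown above is to keep path segments until the first one that contains a glob character. The function below, `staticPrefixSketch`, is a hypothetical reimplementation for illustration. It reproduces those two documented outputs but may differ from the package's behavior on other inputs.

```typescript
// Keep leading path segments until the first one with glob syntax.
function staticPrefixSketch(pattern: string): string {
  const staticSegments: string[] = [];
  for (const segment of pattern.split('/')) {
    if (/[*?{}\[\]]/.test(segment)) break; // first glob segment ends the prefix
    staticSegments.push(segment);
  }
  const prefix = staticSegments.join('/');
  return prefix === '' ? '/' : prefix; // no static part means the repo root
}

console.log(staticPrefixSketch('/docs/**/*.md')); // '/docs'
console.log(staticPrefixSketch('**/*.md'));       // '/'
```

The resulting folder path can then be passed to a provider API that lists a subtree, so that only that folder is enumerated.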

resolveResourceFiles(resourcePaths, resourceSections, manifestPath)

Resolves candidate file system paths for resource files relative to a manifest.

filterReferencedResources(resourcePaths, contentFileTexts)

Filters resource paths to only include files that are actually…
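The reference filtering idea can be sketched as follows. This is a simplified illustration, not the package's implementation: it only recognizes markdown image syntax `![alt](path)`, resolves relative links against the folder of the content file, and uses the hypothetical names `resolveLink` and `filterReferencedSketch`.

```typescript
// Resolve a link found in a content file to a repo-absolute path.
function resolveLink(fromFile: string, link: string): string {
  if (link.startsWith('/')) return link;
  const parts = fromFile.split('/').slice(0, -1); // folder of the content file
  for (const segment of link.split('/')) {
    if (segment === '..') parts.pop();
    else if (segment !== '.' && segment !== '') parts.push(segment);
  }
  return parts.join('/');
}

// Keep only the resources that some content file links to.
function filterReferencedSketch(
  resourcePaths: string[],
  contentFileTexts: Record<string, string>,
): string[] {
  const referenced = new Set<string>();
  for (const file of Object.keys(contentFileTexts)) {
    const text = contentFileTexts[file] ?? '';
    // ![alt](target "optional title"): capture the target only
    const imageLink = /!\[[^\]]*\]\(([^)\s]+)[^)]*\)/g;
    let match: RegExpExecArray | null;
    while ((match = imageLink.exec(text)) !== null) {
      const target = match[1];
      if (target) referenced.add(resolveLink(file, target));
    }
  }
  return resourcePaths.filter((p) => referenced.has(p));
}

const resources = ['/docs/images/architecture.png', '/docs/images/unused-screenshot.png'];
const texts = {
  '/docs/getting-started.md': '# Getting Started\n![Architecture](images/architecture.png)',
  '/docs/api-reference.md': '# API Reference\nNo images here.',
};

console.log(filterReferencedSketch(resources, texts));
// ['/docs/images/architecture.png']
```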
