digitalocean/pgremapper

Description: CLI tool for manipulating Ceph's upmap exception table.

Language: Go

License: Apache-2.0

Stars: 68

Forks: 21

Open issues: 10

Created: 2021-04-30T13:06:42Z

Pushed: 2026-05-30T01:20:03Z

Default branch: main

Fork: no

Archived: no

README:

pgremapper

When working with Ceph clusters, there are actions that cause backfill (CRUSH map changes) and cases where you want to cause backfill (moving data between OSDs or hosts). Trying to manage backfill via CRUSH is difficult because changes to the CRUSH map cause many ancillary data movements that can be wasteful.

Additionally, controlling the amount of in-progress backfill is difficult, and having PGs in backfill_wait state has consequences:

  • Any PG performing recovery or backfill must obtain local and remote reservations.
  • A PG in a wait state may hold some of its necessary reservations, but not all. This may, in turn, block other recoveries or backfills that could otherwise make independent progress.
  • For EC pools, the source of a backfill read is likely not the primary, and this is not considered as a part of the reservation scheme. A single OSD could have any number of backfills reading from it; no knobs outside of recovery sleep can be used to mitigate this. Pacific's [mclock scheduler](https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/) should theoretically improve this situation.

  • There are no reservation slots held for recoveries, meaning that a recovery could be waiting behind another backfill (or several backfills if they stack in a wait state).

The primary control knob for backfills, osd-max-backfills, sets the number of local and remote reservations available on a given OSD. Because of the behaviors above, this knob is not sufficient on its own: backfill can pile up in the face of a large-scale change, and one sometimes has to set it unacceptably high to achieve backfill concurrency across many OSDs.
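The blocking behavior described above can be illustrated with a toy model. This is only a sketch, not Ceph's actual reservation code: the `osd`, `pg`, `schedule`, and `tryReserve` names are hypothetical, and it assumes each OSD has separate local and remote slot pools, each sized by the max-backfills setting, with a PG taking its local slot before asking for remote ones.

```go
package main

import "fmt"

// osd is a toy stand-in for one OSD's free reservation slots.
// Assumption: local and remote slots are separate pools of equal size.
type osd struct {
	localFree, remoteFree int
}

// pg is a hypothetical placement group that wants to backfill.
type pg struct {
	name    string
	primary int   // OSD that must grant the local reservation
	targets []int // OSDs that must each grant a remote reservation
}

// tryReserve takes the local slot first, then each remote slot.
// A PG that cannot get every slot keeps whatever it already holds.
func tryReserve(osds []osd, p pg) string {
	if osds[p.primary].localFree == 0 {
		return "backfill_wait"
	}
	osds[p.primary].localFree--
	for _, t := range p.targets {
		if osds[t].remoteFree == 0 {
			return "backfill_wait" // the local slot stays held while waiting
		}
		osds[t].remoteFree--
	}
	return "backfilling"
}

// schedule processes PGs in order and reports the state each one ends up in.
func schedule(pgs []pg, numOSDs, maxBackfills int) map[string]string {
	osds := make([]osd, numOSDs)
	for i := range osds {
		osds[i] = osd{localFree: maxBackfills, remoteFree: maxBackfills}
	}
	state := make(map[string]string, len(pgs))
	for _, p := range pgs {
		state[p.name] = tryReserve(osds, p)
	}
	return state
}

func main() {
	pgs := []pg{
		{"1.a", 0, []int{1}}, // gets both slots and backfills
		{"1.b", 2, []int{1}}, // holds osd.2's local slot, waits on osd.1
		{"1.c", 2, []int{3}}, // osd.3 is idle, but osd.2's local slot is taken
	}
	for _, limit := range []int{1, 2} {
		state := schedule(pgs, 4, limit)
		for _, p := range pgs {
			fmt.Printf("max-backfills=%d %s %s\n", limit, p.name, state[p.name])
		}
	}
}
```

With a limit of 1, PG 1.c waits even though its target OSD is idle, because 1.b is parked on the slot it needs. Raising the limit to 2 unblocks it, but only by also allowing twice as many backfills onto every OSD.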

This tool, pgremapper, is intended to aid with all of the above use cases and problems. It operates by manipulating the pg-upmap exception table available in Luminous+ to override CRUSH decisions based on a number of algorithms exposed as commands, outlined below. Many of these commands are intended to be run in a loop in order to achieve some target state.

Acknowledgments

The initial version of this tool, which became the cancel-backfill command below, was heavily inspired by techniques developed by CERN IT.

Requirements

As mentioned above, the upmap exception table was introduced in Luminous (v12), and this is a hard requirement for pgremapper. However, there were significant improvements to the upmap code in the first couple of years after it was introduced, so we recommend running Luminous v12.2.13 (the last release), Mimic v13.2.7+, Nautilus v14.2.5+, or any newer major release, at least on the mons/mgrs.

pgremapper has been tested on a variety of versions of Luminous, Nautilus, and Pacific.
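The version guidance above can be expressed as a small check. This is an illustrative sketch only; pgremapper itself is not known to perform such a check, and the `version`, `supported`, and `recommended` names are hypothetical.

```go
package main

import "fmt"

// version is a hypothetical parsed Ceph release number, e.g. 14.2.5.
type version struct{ major, minor, patch int }

// recommendedFloor holds the minimum recommended release for each major
// version named in the requirements text.
var recommendedFloor = map[int]version{
	12: {12, 2, 13}, // Luminous (last release)
	13: {13, 2, 7},  // Mimic
	14: {14, 2, 5},  // Nautilus
}

// supported reports the hard requirement: upmap exists from Luminous (v12).
func supported(v version) bool {
	return v.major >= 12
}

// atLeast compares two versions of the same major series field by field.
func atLeast(v, floor version) bool {
	if v.minor != floor.minor {
		return v.minor > floor.minor
	}
	return v.patch >= floor.patch
}

// recommended reports whether v meets the recommended minimum. Majors newer
// than those listed are treated as recommended.
func recommended(v version) bool {
	if !supported(v) {
		return false
	}
	if floor, ok := recommendedFloor[v.major]; ok {
		return atLeast(v, floor)
	}
	return true
}

func main() {
	for _, v := range []version{{10, 2, 11}, {12, 2, 8}, {14, 2, 5}, {16, 2, 0}} {
		fmt.Printf("%d.%d.%d supported=%t recommended=%t\n",
			v.major, v.minor, v.patch, supported(v), recommended(v))
	}
}
```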

Caveats

  • If the system is still processing osdmaps and peering, pgremapper can become confused and make incorrect decisions, since upmap entries at the mon layer may not yet be reflected in current PG state. If making CRUSH changes or running pgremapper multiple times, give the system time to finish processing osdmaps before running pgremapper.
  • Given a recent enough Ceph version, CRUSH cannot be violated by an upmap entry. This is good, but it can make certain manipulations impossible; consider a case where a backfill is swapping EC chunks between two racks. To the best of our knowledge today, no upmap entry can be created to counteract such a backfill, as Ceph will evaluate the correctness of the upmap entry in parts, rather than as a whole. (If you have evidence to the contrary or this is actually possible in newer versions of Ceph, let us know!)

Bug Reports

If you find a situation where pgremapper isn't working right, please file a report with a clear description of how pgremapper was invoked and any of its output, what the system was doing at the time, and output from the following Ceph commands:

  • ceph osd dump -f json
  • ceph osd tree -f json
  • ceph pg dump pgs_brief -f json
  • If a specific PG is named in pgremapper error output, then ceph pg <pgid> query -f json
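If you collect this output often, the list of commands can be generated programmatically. The following is a minimal sketch with a hypothetical `bugReportCommands` helper; it assumes the PG ID goes between `pg` and `query`, as in the usual `ceph pg <pgid> query` form, and it only prints the commands rather than running them.

```go
package main

import (
	"fmt"
	"strings"
)

// bugReportCommands returns the Ceph commands whose output a bug report
// should include. pgid is optional: pass "" when no specific PG was named
// in pgremapper's error output.
func bugReportCommands(pgid string) [][]string {
	cmds := [][]string{
		{"ceph", "osd", "dump", "-f", "json"},
		{"ceph", "osd", "tree", "-f", "json"},
		{"ceph", "pg", "dump", "pgs_brief", "-f", "json"},
	}
	if pgid != "" {
		// Assumption: the PG ID sits between "pg" and "query".
		cmds = append(cmds, []string{"ceph", "pg", pgid, "query", "-f", "json"})
	}
	return cmds
}

func main() {
	// "1.2f" is a made-up PG ID used purely for illustration.
	for _, c := range bugReportCommands("1.2f") {
		fmt.Println(strings.Join(c, " "))
	}
}
```

Each returned slice can be handed to `exec.Command(c[0], c[1:]...)` from the standard `os/exec` package if you want to capture the output directly.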

Building

If you have a Go environment configured, you can use go install:

go install github.com/digitalocean/pgremapper@latest

Otherwise, clone this repository and use a golang Docker container to build:

docker run --rm -v $(pwd):/pgremapper -w /pgremapper golang:1.21.4 go build -o pgremapper .

You can also download one of the pre-built binaries from the releases page.

Usage

pgremapper makes no changes by default and has some global options:

$ ./pgremapper [--concurrency <n>] [--yes] [--verbose]

  • --concurrency: For commands that can be issued in parallel, this controls the concurrency. This is set at a reasonable default that generally doesn't lead to too much concurrent peering in the cluster when manipulating the pg-upmap table.
  • --yes: Apply changes instead of emitting the diff output that would show which changes would be applied.
  • --verbose: Display Ceph commands being run, for debugging purposes.
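When scripting around pgremapper, it can help to build the dry-run and apply invocations from one place so that `--yes` is only ever added deliberately. The sketch below is an assumption-laden illustration: `globalOpts` and `buildArgs` are hypothetical names, only the three global flags listed above are used, and `cancel-backfill` is used as the example command because it is the one named earlier in this README.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// globalOpts mirrors the three global options described above.
type globalOpts struct {
	concurrency int  // 0 leaves pgremapper's own default in place
	yes         bool // false means dry run: only the diff is printed
	verbose     bool
}

// buildArgs returns the argument list for one pgremapper invocation:
// global options first, then the command and its own arguments.
func buildArgs(o globalOpts, command ...string) []string {
	var args []string
	if o.concurrency > 0 {
		args = append(args, "--concurrency", strconv.Itoa(o.concurrency))
	}
	if o.yes {
		args = append(args, "--yes")
	}
	if o.verbose {
		args = append(args, "--verbose")
	}
	return append(args, command...)
}

func main() {
	// First review the proposed changes, then apply them.
	dryRun := buildArgs(globalOpts{verbose: true}, "cancel-backfill")
	apply := buildArgs(globalOpts{yes: true, verbose: true}, "cancel-backfill")
	fmt.Println("./pgremapper", strings.Join(dryRun, " "))
	fmt.Println("./pgremapper", strings.Join(apply, " "))
}
```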

osdspec

For commands or options that take a list of OSDs, pgremapper uses the concept of an osdspec (inspired by Git's refspec) to simplify the command line. An osdspec can either be an OSD ID (e.g. 42) or a CRUSH bucket prefixed by bucket: (e.g. bucket:rack1 or bucket:host4). In the latter case, all OSDs found under that CRUSH bucket are included.
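To make the two osdspec forms concrete, here is a minimal parsing sketch. It is not pgremapper's own parser: `osdSpec` and `parseOSDSpec` are hypothetical names, and rejecting negative IDs and empty bucket names is an assumption about sensible validation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// osdSpec is a hypothetical parsed form of an osdspec: either a single
// OSD ID or the name of a CRUSH bucket whose OSDs should all be included.
type osdSpec struct {
	isBucket bool
	osdID    int    // valid when isBucket is false
	bucket   string // valid when isBucket is true
}

const bucketPrefix = "bucket:"

// parseOSDSpec accepts "42" or "bucket:rack1" style strings.
func parseOSDSpec(s string) (osdSpec, error) {
	if strings.HasPrefix(s, bucketPrefix) {
		name := strings.TrimPrefix(s, bucketPrefix)
		if name == "" {
			return osdSpec{}, fmt.Errorf("osdspec %q: empty bucket name", s)
		}
		return osdSpec{isBucket: true, bucket: name}, nil
	}
	id, err := strconv.Atoi(s)
	if err != nil || id < 0 {
		return osdSpec{}, fmt.Errorf("osdspec %q: want an OSD ID or bucket:<name>", s)
	}
	return osdSpec{osdID: id}, nil
}

func main() {
	for _, s := range []string{"42", "bucket:rack1", "bucket:host4", "rack1"} {
		spec, err := parseOSDSpec(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if spec.isBucket {
			fmt.Printf("%s: all OSDs under CRUSH bucket %q\n", s, spec.bucket)
		} else {
			fmt.Printf("%s: OSD %d\n", s, spec.osdID)
		}
	}
}
```

Resolving a bucket into its OSDs would then require the CRUSH hierarchy, for example from `ceph osd tree -f json`; that step is left out here.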

diff output

When --yes is not specified, pgremapper will make no changes to the system, and will print the proposed changes in a diff-like format. For many of the subcommands below, goals are accomplished through a combination of adding and removing mappings to and from the upmap exception table. Unchanged mappings, which will be left alone, or stale mappings, which will be removed, are also noted. (Stale mappings are those that currently have no effect and should probably have been cleaned up by Ceph; we've seen cases of…
