Repo: Databricks (DBRX), published Jul 18, 2022

databricks/mlops-stacks

Python

Description: This repo provides a customizable stack for starting new ML projects on Databricks that follow production best-practices out of the box.

Language: Python

License: Apache-2.0

Stars: 685

Forks: 257

Open issues: 13

Created: 2022-07-18T17:24:44Z

Pushed: 2026-05-01T16:23:26Z

Default branch: main

Fork: no

Archived: no

README:

Databricks MLOps Stacks

> _NOTE:_ This feature is in public preview.

This repo provides a customizable stack for starting new ML projects on Databricks that follow production best-practices out of the box.

Using Databricks MLOps Stacks, data scientists can quickly get started iterating on ML code for new projects while ops engineers set up CI/CD and ML resources management, with an easy transition to production. You can also use MLOps Stacks as a building block in automation for creating new data science projects with production-grade CI/CD pre-configured. More information can be found at https://docs.databricks.com/en/dev-tools/bundles/mlops-stacks.html.

The default stack in this repo includes three modular components:

| Component | Description | Why it's useful |
|-----------|-------------|-----------------|
| [ML Code](template/{{.input_root_dir}}/{{template%20project_name_alphanumeric_underscore%20.}}/) | Example ML project structure ([training](template/{{.input_root_dir}}/{{template%20project_name_alphanumeric_underscore%20.}}/training) and [batch inference](template/{{.input_root_dir}}/{{template%20project_name_alphanumeric_underscore%20.}}/deployment/batch_inference), etc.), with unit-tested Python modules and notebooks | Quickly iterate on ML problems, without worrying about refactoring your code into tested modules for productionization later on. |
| [ML Resources as Code](template/{{.input_root_dir}}/{{template%20project_name_alphanumeric_underscore%20.}}/resources) | ML pipeline resources ([training](template/{{.input_root_dir}}/{{template%20project_name_alphanumeric_underscore%20.}}/resources/model-workflow-resource.yml.tmpl) and [batch inference](template/{{.input_root_dir}}/{{template%20project_name_alphanumeric_underscore%20.}}/resources/batch-inference-workflow-resource.yml.tmpl) jobs, etc.) defined through Databricks CLI bundles | Govern, audit, and deploy changes to your ML resources (e.g. "use a larger instance type for automated model retraining") through pull requests, rather than ad hoc changes made via the UI. |
| CI/CD ([GitHub Actions](template/{{.input_root_dir}}/.github/) or [Azure DevOps](template/{{.input_root_dir}}/.azure/)) | GitHub Actions or Azure DevOps workflows to test and deploy ML code and resources | Ship ML code faster and with confidence: ensure all production changes are performed through automation and that only tested code is deployed to prod. |

See the [FAQ](#FAQ) for questions on common use cases.

ML pipeline structure and development loops

An ML solution comprises data, code, and models. These resources need to be developed, validated (staging), and deployed (production). In this repository, we use the notion of dev, staging, and prod to represent the execution environments of each stage.

An instantiated project from MLOps Stacks contains an ML pipeline with CI/CD workflows to test and deploy automated model training and batch inference jobs across your dev, staging, and prod Databricks workspaces.

Data scientists can iterate on ML code and file pull requests (PRs). Opening a PR triggers unit tests and integration tests in an isolated staging Databricks workspace. When a PR is merged into main, the model training and batch inference jobs in staging immediately update to run the latest code. After the merge, you can cut a new release branch as part of your regularly scheduled release process to promote ML code changes to production.
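The promotion flow described above can be sketched in Python as a mapping from Git events to actions. This is only an illustration under stated assumptions: `GitEvent`, `plan_actions`, the event names, and the `release` branch name are hypothetical and are not taken from the stack's actual GitHub Actions or Azure DevOps workflows.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GitEvent:
    """A simplified Git event (hypothetical shape, for illustration only)."""
    kind: str    # "pull_request" or "push" (assumed event names)
    branch: str  # target branch of the PR, or the branch that was pushed to


def plan_actions(
    event: GitEvent,
    main_branch: str = "main",
    release_branch: str = "release",  # assumed name; the README only says "release branch"
) -> List[str]:
    """Return the actions the README describes for a given event."""
    if event.kind == "pull_request" and event.branch == main_branch:
        # PRs are tested in an isolated staging workspace before merging.
        return ["run unit tests", "run integration tests in staging workspace"]
    if event.kind == "push" and event.branch == main_branch:
        # Merging to main updates the staging jobs to the latest code.
        return ["update training and batch inference jobs in staging"]
    if event.kind == "push" and event.branch == release_branch:
        # Cutting or updating the release branch promotes code to production.
        return ["update training and batch inference jobs in prod"]
    return []  # Other branches and events trigger nothing in this sketch.


if __name__ == "__main__":
    for ev in (
        GitEvent("pull_request", "main"),
        GitEvent("push", "main"),
        GitEvent("push", "release"),
    ):
        print(ev, "->", plan_actions(ev))
```

The point of the sketch is that each environment is reached only through a Git event, never through manual changes in a workspace.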

Develop ML pipelines

https://github.com/databricks/mlops-stacks/assets/87999496/00eed790-70f4-4428-9f18-71771051f92a

Create a PR and CI

https://github.com/databricks/mlops-stacks/assets/87999496/f5b3c82d-77a5-4ee5-85f5-8f00b026ae05

Merge the PR and deploy to Staging

https://github.com/databricks/mlops-stacks/assets/87999496/7239e4d0-2327-4d30-91cc-5e7f8328ef73

https://github.com/databricks/mlops-stacks/assets/87999496/013c0d32-c283-494b-8c3f-2a9a60366207

Deploy to Prod

https://github.com/databricks/mlops-stacks/assets/87999496/0d220d55-465e-4a69-bd83-1e66ad2e8464

[See this page](Pipeline.md) for detailed description and diagrams of the ML pipeline structure defined in the default stack.

Using MLOps Stacks

Prerequisites

The Databricks CLI includes Databricks asset bundle templates for creating projects.

Follow the instructions to install and set up the Databricks CLI. Releases of the Databricks CLI can be found in the releases section of the databricks/cli repository.

Databricks asset bundles and Databricks asset bundle templates are in public preview.

Start a new project

To create a new project, run:

databricks bundle init mlops-stacks

This will prompt for parameters for initialization. Some of these parameters are required to get started:

  • `input_setup_cicd_and_project`: Whether to set up both CI/CD and the project, or only one of them.
      • `CICD_and_Project` - Set up both CI/CD and the project (the default option).
      • `Project_Only` - Set up the project only; the easiest way for data scientists to get started.

*…
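For the automation use case mentioned earlier, the interactive prompts can in principle be replaced by a parameter file. The sketch below writes such a file and builds the command line without running it. It assumes that `databricks bundle init` accepts a `--config-file` flag pointing at a JSON file of parameter values (check `databricks bundle init --help` for your CLI version). The helper names and the file name `mlops_stacks_config.json` are hypothetical, and only the `Project_Only` value shown in the list above is used.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_init_config(params: Dict[str, str], directory: Path) -> Path:
    """Write template parameters to a JSON file and return its path."""
    config_path = directory / "mlops_stacks_config.json"  # hypothetical file name
    config_path.write_text(json.dumps(params, indent=2))
    return config_path


def build_init_command(config_path: Path) -> List[str]:
    """Build the CLI invocation as an argument list (not executed here)."""
    return [
        "databricks", "bundle", "init", "mlops-stacks",
        "--config-file", str(config_path),  # assumed flag; verify with --help
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = write_init_config(
            {"input_setup_cicd_and_project": "Project_Only"}, Path(tmp)
        )
        # Print the command instead of running it, so the sketch works
        # without the Databricks CLI installed.
        print(" ".join(build_init_command(cfg)))
```

To actually create a project, the returned list could be passed to `subprocess.run(..., check=True)` on a machine where the Databricks CLI is installed and configured.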
