Anthropic, published May 20, 2024

Reflections On Our Responsible Scaling Policy



Last summer we published our first Responsible Scaling Policy (RSP), which focuses on addressing catastrophic safety failures and misuse of frontier models. In adopting this policy, our primary goal is to help turn high-level safety concepts into practical guidelines for fast-moving technical organizations and demonstrate their viability as possible standards. As we operationalize the policy, we expect to learn a great deal and plan to share our findings. This post shares reflections from implementing the policy so far. We are also working on an updated RSP and will share this soon.

We have found having a clearly-articulated policy on catastrophic risks extremely valuable. It has provided a structured framework to clarify our organizational priorities and frame discussions around project timelines, headcount, threat models, and tradeoffs. The process of implementing the policy has also surfaced a range of important questions, projects, and dependencies that might otherwise have taken longer to identify or gone undiscussed.

Balancing the desire for strong commitments with the reality that we are still seeking the right answers is challenging. In some cases, the original policy is ambiguous and needs clarification. In cases where there are open research questions or uncertainties, setting overly-specific requirements is unlikely to stand the test of time. That said, as industry actors face increasing commercial pressures we hope to move from voluntary commitments to established best practices and then well-crafted regulations.

As we continue to iterate on and improve the original policy, we are actively exploring ways to incorporate practices from existing risk management and operational safety domains. While none of these domains alone will be perfectly analogous, we expect to find valuable insights from nuclear security, biosecurity, systems safety, autonomous vehicles, aerospace, and cybersecurity. We are building an interdisciplinary team to help us integrate the most relevant and valuable practices from each.

Our current framework for doing so is summarized below, as a set of five high-level commitments.

1. Establishing Red Line Capabilities. We commit to identifying and publishing "Red Line Capabilities" which might emerge in future generations of models and would present too much risk if stored or deployed under our current safety and security practices (referred to as the ASL-2 Standard).

2. Testing for Red Line Capabilities (Frontier Risk Evaluations). We commit to demonstrating that the Red Line Capabilities are not present in models, or - if we cannot do so - taking action as if they are (more below). This involves collaborating with domain experts to design a range of "Frontier Risk Evaluations": empirical tests which, if failed, would give strong evidence against a model being at or near a red line capability. We also commit to maintaining a clear evaluation process and a summary of our current evaluations publicly.

3. Responding to Red Line Capabilities. We commit to develop and implement a new standard for safety and security sufficient to handle models that have the Red Line Capabilities. This set of measures is referred to as the ASL-3 Standard. We commit not only to define the risk mitigations comprising this standard, but also detail and follow an assurance process to validate the standard's effectiveness. Finally, we commit to pause training or deployment if necessary to ensure that models with Red Line Capabilities are only trained, stored and deployed when we are able to apply the ASL-3 standard.

4. Iteratively extending this policy. Before we proceed with activities which require the ASL-3 standard, we commit to publish a clear description of its upper bound of suitability: a new set of Red Line Capabilities for which we must build Frontier Risk Evaluations, and which would require a higher standard of safety and security (ASL-4) before proceeding with training and deployment. This includes maintaining a clear evaluation process and summary of our evaluations publicly.

5. Assurance Mechanisms. We commit to ensuring this policy is executed as intended, by implementing Assurance Mechanisms. These should ensure that our evaluation process is stress-tested; our safety and security mitigations are validated publicly or by disinterested experts; our Board of Directors and Long-Term Benefit Trust have sufficient oversight over the policy implementation to identify any areas of non-compliance; and that the policy itself is updated via an appropriate process.
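As an illustration only, the decision logic implied by the testing and response commitments above can be sketched in a few lines of Python. Every name here (`Action`, `ModelStatus`, `required_action`) is hypothetical and invented for this sketch; it is not Anthropic's tooling, and it deliberately ignores the ASL-4 upper bound and the assurance process described in the other commitments.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes; the labels paraphrase the commitments above."""
    PROCEED_ASL2 = "proceed under the ASL-2 Standard"
    PROCEED_ASL3 = "proceed under the ASL-3 Standard"
    PAUSE = "pause training or deployment"


@dataclass
class ModelStatus:
    # True only if Frontier Risk Evaluations gave strong evidence that the
    # model is not at or near a Red Line Capability (assumed input).
    red_lines_ruled_out: bool
    # True only if the ASL-3 Standard can actually be applied (assumed input).
    asl3_standard_in_place: bool


def required_action(status: ModelStatus) -> Action:
    """Map an evaluation outcome to the response the commitments describe."""
    if status.red_lines_ruled_out:
        return Action.PROCEED_ASL2
    # Capabilities could not be ruled out, so act as if they are present.
    if status.asl3_standard_in_place:
        return Action.PROCEED_ASL3
    return Action.PAUSE
```

The point of the sketch is the asymmetry: failing to rule out a Red Line Capability is treated the same as confirming it, so the only ways forward are the ASL-3 Standard or a pause.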

Threat Modeling and Evaluations

Our Frontier Red Team and Alignment Science teams have focused on threat modeling and engaging with domain experts. They are primarily focused on (a) improving threat models to determine which capabilities would warrant the ASL-3 standard of security and safety, (b) working with teams developing ASL-3 controls to ensure that those controls are tailored to the correct risks, and (c) mapping capabilities which the ASL-3 standard would be insufficient to handle, and which we would continue to test for even once it is implemented.

Some key reflections are:

- Each new generation of models has emergent capabilities, making anticipating properties of future models unusually challenging. There is a serious need for further threat modeling.
- There is reasonable disagreement amongst experts over which risks to prioritize and how new capabilities might cause harm, even in relatively established Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Talking to a wide variety of experts in different sub-domains has been valuable, given the lack of consensus view.
- Attempting to make threat models quantitative has been helpful for deciding which capabilities and scenarios to prioritize.

Our Frontier Red Team, Alignment Science, Finetuning, and Alignment Stress Testing teams are focused on building evaluations and improving our overall methodology. Currently, we conduct pre-deployment testing in the domains of cybersecurity, CBRN, and Model Autonomy for frontier models which have reached 4x the compute of our most recently tested model (you can read a more detailed description of our most recent set of evaluations on Claude 3 Opus here). We also test models mid-training if they reach this threshold, and re-test our most capable model every 3 months to account for…
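The two testing triggers mentioned in this paragraph (a 4x compute threshold and a 3-month re-test cadence) can be expressed as a minimal sketch. The function names, the use of a single compute number in any consistent unit, and the approximation of "3 months" as 91 days are all assumptions made for illustration; the source does not say how compute is measured or how the interval is counted.

```python
from datetime import date, timedelta

# From the text: test models that reach 4x the compute of the most
# recently tested model.
COMPUTE_MULTIPLE = 4.0

# Assumption: "every 3 months" is approximated here as 91 days.
RETEST_INTERVAL = timedelta(days=91)


def needs_evaluation(current_compute: float, last_tested_compute: float) -> bool:
    """True once a model reaches the compute threshold, including mid-training.

    Both arguments must use the same (hypothetical) unit of compute.
    """
    if last_tested_compute <= 0:
        raise ValueError("last_tested_compute must be positive")
    return current_compute >= COMPUTE_MULTIPLE * last_tested_compute


def retest_due(last_test: date, today: date) -> bool:
    """True when the most capable model is due for its periodic re-test."""
    return today - last_test >= RETEST_INTERVAL
```

Under these assumptions, either trigger alone is enough to schedule testing: `needs_evaluation` covers new and in-training models, while `retest_due` covers the most capable existing model.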

