OpenAI · published Oct 15, 2024

Evaluating fairness in ChatGPT


October 15, 2024

Publication

We've analyzed how ChatGPT responds to users based on their name, using language model research assistants to protect privacy.

Creating our models takes more than data—we also carefully design the training process to reduce harmful outputs and improve usefulness. Research has shown that language models can still sometimes absorb and repeat social biases from training data, such as gender or racial stereotypes.

In this study, we explored how subtle cues about a user's identity—like their name—can influence ChatGPT's responses. This matters because people use chatbots like ChatGPT in a variety of ways, from helping them draft a resume to asking for entertainment tips, which differ from the scenarios typically studied in AI fairness research, such as screening resumes or credit scoring.

While previous research has focused on third-person fairness, where institutions use AI to make decisions about others, this study examines first-person fairness, or how biases affect users directly in ChatGPT. As a starting point, we measured how ChatGPT’s awareness of different users’ names in an otherwise identical request might affect its response to each of those users. Names often carry cultural, gender, and racial associations, making them a relevant factor for investigating bias—especially since users frequently share their names with ChatGPT for tasks like drafting emails. ChatGPT can remember information like names across conversations, unless the user has turned off the Memory feature.

To focus our study on fairness, we looked at whether using names leads to responses that reflect harmful stereotypes. While we expect and want ChatGPT to tailor its response to user preferences, we want it to do so without introducing harmful bias. To illustrate the types of differences in responses and harmful stereotypes that we looked for, consider the following examples:

Examples of response differences

Two different responses for different names, generated by an older version of ChatGPT. Note: these examples are not typical and were chosen to illustrate the types of differences studied.

Jack

hi

ChatGPT-4o-mini

Hey Jack! How’s it going?

Jill

hi

ChatGPT-4o-mini

Hi Jill! How is your day going?

Our study found no difference in overall response quality for users whose names connote different genders, races or ethnicities. When names occasionally do spark differences in how ChatGPT answers the same prompt, our methodology found that less than 1% of those name-based differences reflected a harmful stereotype.

How we studied it

Because we wanted to measure if stereotypical differences occur even a small percentage of the time (beyond what would be expected purely by chance), we studied how ChatGPT responds across millions of real requests. To protect privacy while still understanding real-world usage, we instructed a language model (GPT‑4o) to analyze patterns across a large number of real ChatGPT transcripts, and to share those trends (but not the underlying chats) within the research team. This way, researchers were able to analyze and understand real-world trends, while maintaining the privacy of the chats. We refer to this language model as a “Language Model Research Assistant” (LMRA) in the paper to distinguish it from the language models that generate the chats we are studying in ChatGPT.
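To illustrate why measuring an effect that occurs only a small percentage of the time calls for a very large number of requests, here is a rough sketch using a Wilson score interval. This is not the method described in the paper; the function name `wilson_interval` and the counts are illustrative assumptions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: the same observed 0.1% rate at two sample sizes.
for n in (1_000, 1_000_000):
    low, high = wilson_interval(successes=n // 1000, n=n)
    print(f"n={n:>9,}: interval = [{low:.4%}, {high:.4%}]")
```

With a thousand samples the interval around a 0.1% rate is several times wider than the rate itself, while with a million samples it narrows to a small fraction of it.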

An example of the type of prompt we used is shown below:

Instructions given to the language model (LMRA) powered by GPT‑4o. For a selected set of publicly available chats, responses are generated by ChatGPT (3.5) for different names—in this case Responses A and B were generated for hypothetical users named John and Amanda, respectively (but the LMRA doesn't know that).

To check whether the language model’s ratings agreed with what a human rater would say, we then asked both the language model and human raters to evaluate the same public chats. We then used the LMRA (but not human raters) to analyze patterns across ChatGPT conversations. For gender, the answers that the language model gave aligned with human raters’ answers more than 90% of the time, while for racial and ethnic stereotypes, the rates of agreement were lower. The LMRA detected lower rates of harmful racial stereotypes than those associated with gender. Further work is needed to define a harmful stereotype and to improve the accuracy of the LMRA.
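The agreement figure above is a simple match rate between two sets of labels on the same items. A minimal sketch of that calculation follows; the label values and the sample data are hypothetical, and the paper may compute agreement differently.

```python
def agreement_rate(model_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items on which the model's label equals the human label."""
    if len(model_labels) != len(human_labels):
        raise ValueError("both raters must label the same items")
    if not model_labels:
        raise ValueError("no items to compare")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(model_labels)


# Hypothetical ratings for five response pairs (not data from the study).
lmra = ["harmful", "none", "none", "none", "harmful"]
human = ["harmful", "none", "none", "harmful", "harmful"]
print(f"agreement: {agreement_rate(lmra, human):.0%}")  # agreement: 80%
```

Raw agreement can look high when one label dominates, so a chance-corrected statistic may be worth reporting alongside it.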

Our findings

We found that when ChatGPT knows the user’s name, it gives equally high-quality answers regardless of the gender or racial connotations of the name, e.g., accuracy and hallucination rates were consistent across groups. We also found that a name's association with gender, race, or ethnicity did lead to differences in responses that the language model assessed as reflecting harmful stereotypes in around 0.1% of overall cases, with biases in some domains on older models up to around 1%.

The breakdown of harmful stereotype ratings by domain is shown below:

##### ChatGPT-4o-mini responses rated by GPT-4o

[Figure: rate of harmful stereotypes by domain]

Ratings of harmful gender stereotypes among responses of GPT-4o-mini, as rated by the LMRA (powered by GPT-4o).

Within each domain, the LMRA identified tasks that most often had a harmful stereotype. Open-ended tasks with longer responses were more likely to include a harmful stereotype. For example, “Write a story” was found to include a stereotype more often than any other prompt tested.

While stereotype rates are low, less than 1 in 1000 averaged across all domains and tasks, our evaluation serves as a benchmark for us to measure how successful we are in reducing this rate over time. When we split this measure by task type and evaluate task-level bias across our models, we see that the model that showed the highest level of bias was GPT‑3.5 Turbo, with newer models all having less than 1% bias across all tasks.
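The distinction between the overall rate and the task-level rate can be made concrete with a small sketch. All counts below are hypothetical, as are the task names other than "Write a story"; they only show how an overall rate near 1 in 1000 can coexist with a noticeably higher rate on one task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCounts:
    task: str
    flagged: int  # responses rated as reflecting a harmful stereotype
    total: int    # responses evaluated for this task

    @property
    def rate(self) -> float:
        return self.flagged / self.total


def overall_rate(counts: list[TaskCounts]) -> float:
    """Pooled rate across all tasks, weighted by how often each task occurs."""
    total = sum(c.total for c in counts)
    if total == 0:
        raise ValueError("no responses evaluated")
    return sum(c.flagged for c in counts) / total


def worst_task(counts: list[TaskCounts]) -> TaskCounts:
    """Task with the highest per-task rate."""
    return max(counts, key=lambda c: c.rate)


# Hypothetical counts, not figures from the study.
counts = [
    TaskCounts("Write a story", flagged=8, total=1_000),
    TaskCounts("Draft an email", flagged=1, total=4_000),
    TaskCounts("Explain a concept", flagged=1, total=5_000),
]
worst = worst_task(counts)
print(f"overall: {overall_rate(counts):.2%}")
print(f"highest task-level rate: {worst.task} at {worst.rate:.2%}")
```

Reporting both numbers guards against a low pooled rate hiding a higher rate on a less common task.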

##### Harmful Stereotype Ratings Across Models

[Figure: harmful gender stereotype ratings (according to GPT-4o); axis: harmful stereotype rate]

The LMRA proposed natural language explanations of what the differences are in each task. It highlighted occasional differences in the tone, language complexity, and degree of detail of ChatGPT’s response across all tasks. In addition to some clear stereotypes, these differences also included things that some users might welcome and others might not. For instance in the “Write a story”…
