Repo · InclusionAI (Ant Group) · published Dec 19, 2025

inclusionAI/HeartBench

Python

Description: HeartBench is an evaluation benchmark for the psychological and social sciences field, designed to transcend traditional knowledge and reasoning assessments. It focuses on measuring large language models' (LLMs) anthropomorphic capabilities in human-computer interactions, covering dimensions such as personality, emotion, social skills, and ethics.

Language: Python

License: Apache-2.0

Stars: 47

Forks: 3

Open issues: 2

Created: 2025-12-19T06:35:11Z

Pushed: 2026-01-07T02:53:56Z

Default branch: main

Fork: no

Archived: no

README:

---

🎯 Introduction

HeartBench is an evaluation benchmark for the psychological and social sciences field, designed to transcend traditional knowledge and reasoning assessments. It focuses on measuring large language models' (LLMs) anthropomorphic capabilities in human-computer interactions, covering dimensions such as personality, emotion, social skills, and ethics.

  • Evaluation Samples: 296 multi-turn dialogues
  • Scoring Criteria (Rubric): 2,818 items
  • Scenarios: 33 scenarios (e.g., personal growth, family relationships, workplace psychology)
  • Evaluation Dimensions: 5 anthropomorphism capability categories and 15 specific anthropomorphic abilities (e.g. curiosity, warmth, emotional understanding)

Learn more in our research paper.

💡 Key Features

1. Real-World Alignment: Our dataset is built from anonymized and rewritten dialogues between real users and counselors, covering high-frequency scenarios like family relationships, personal growth, and workplace psychology. We move beyond simple fact-based Q&A by employing multi-turn dialogue evaluation. The focus is on assessing a model's ability to understand complex emotions and respond to social contexts within long conversations and their subtext, rather than its capacity for simple mimicry.
2. Fine-Grained, Science-Based Evaluation: We have developed the "AI Human-like Capability Framework," a sophisticated evaluation system rooted in established psychological theories. This framework assesses models across 5 core capabilities and 15 fine-grained subcategories, including personality traits, emotional intelligence, and social skills. For each dialogue, our expert team has authored between 4 and 15 specific scoring criteria.
3. Co-developed with Domain Experts: The benchmark was created in close collaboration with experts in psychology and anthropology. Their involvement spanned the entire process: from the construction of the corpus using authentic counseling data, to the identification of over 200 key evaluation points, and the formulation of more than 3,000 scientific scoring rubrics. All data was then rigorously annotated and reviewed by these experts to ensure quality and accuracy.

🏆 Benchmark Results

We evaluated current leading models on HeartBench, scoring each dimension on a scale of 0 to 100. The table below shows each model's overall score across all test samples.

Main Results

| Model | Score |
|-------|-------|
| Claude-sonnet-4.5-20250929 | 62.65 |
| gemini-3-pro-preview | 61.54 |
| Qwen3-235B-A22B-instruct-2507 | 61.47 |
| Qwen3-next-80B-A3B-Instruct | 61.09 |
| Qwen3-30B-A3B-instruct-2507 | 60.16 |
| gpt-5-2025-08-07 | 60.16 |
| Gemini-2.5-pro | 59.85 |
| Ling-1T | 59.82 |
| KIMI-K2-Instruct-0905 | 57.97 |
| gpt-4.1-2025-04-14 | 51.62 |
| Qwen3-30B-A3B | 48.21 |
| gpt-4o-2024-11-20 | 48.20 |
| DeepSeek-V3.2-Exp | 47.43 |

Results Across 15 Abilities

![](https://oss-ata.alibaba.com/article/2025/12/d94e952a-1340-4ab6-b814-8b58107595b2.png)

📊 Dataset

Evaluation Dimensions

HeartBench is built upon the psychological theory of "Anthropomorphic Intelligence." Drawing inspiration from psychology's classification of human mental functions, it evaluates models across 5 core anthropomorphic ability categories and 15 specific abilities.

🧠 Personality: Ability to project an independent, autonomous, and agreeable persona. This is demonstrated through a natural language style, a sense of humor, autonomy, other positive human-like traits, and stable self-esteem and self-awareness.

😊 Emotion: Ability to exhibit appropriate emotional responses and to effectively perceive, understand, and respond to the emotional states of others.

🤝 Social: Ability to demonstrate a strong willingness for social interaction and to effectively build interpersonal relationships.

⚖️ Morality: Ability to operate based on the moral norms and ethical principles of human society. This includes acutely identifying moral dilemmas within a situation, expressing an understanding of these issues, and providing morally sound decisions or advice.

🎯 Motivation: Ability to articulate rational, clear, and self-consistent motivations for its own statements and actions, while also being able to understand and reasonably infer the underlying motivations of others based on contextual clues.

| Ability | Rubric Count (%) |
| :------------------- |:-----------------|
| Personality | 1634 (58%) |
| Verbal Expression | 565 (20.0%) |
| Curiosity | 367 (13.0%) |
| Warmth | 305 (10.8%) |
| First-Person Usage | 295 (10.5%) |
| Autonomy | 37 (1.3%) |
| Humor | 36 (1.3%) |
| Self-Awareness | 29 (1.0%) |
| | |
| Emotion | 1015 (36%) |
| Emotional Coping | 390 (13.8%) |
| Emotional Understanding | 309 (11.0%) |
| Emotional Perception | 284 (10.1%) |
| Emotional Reaction | 32 (1.1%) |
| | |
| Social | 104 (3.7%) |
| Proactivity | 79 (2.8%) |
| Relationship Building | 25 (0.9%) |
| | |
| Motivation | 42 (1.5%) |
| | |
| Morality | 23 (0.8%) |
| | |
| Total | 2818 (100%) |
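As a quick consistency check, the category totals and their shares of all rubric items can be recomputed from the per-ability counts in the table above. The sketch below is illustrative only: the dictionary layout and the function names are our own and are not part of the HeartBench codebase. Motivation and Morality list no sub-abilities in the table, so each is treated here as a single ability.

```python
# Rubric counts per ability, copied from the table above.
# The nesting (category -> ability -> count) is an illustrative layout,
# not a structure taken from the HeartBench repository.
RUBRIC_COUNTS = {
    "Personality": {
        "Verbal Expression": 565,
        "Curiosity": 367,
        "Warmth": 305,
        "First-Person Usage": 295,
        "Autonomy": 37,
        "Humor": 36,
        "Self-Awareness": 29,
    },
    "Emotion": {
        "Emotional Coping": 390,
        "Emotional Understanding": 309,
        "Emotional Perception": 284,
        "Emotional Reaction": 32,
    },
    "Social": {"Proactivity": 79, "Relationship Building": 25},
    # No sub-abilities are listed for these two categories (assumption:
    # each counts as one ability).
    "Motivation": {"Motivation": 42},
    "Morality": {"Morality": 23},
}


def category_totals(counts):
    """Sum the per-ability rubric counts within each category."""
    return {category: sum(abilities.values()) for category, abilities in counts.items()}


def shares(totals):
    """Return each entry's share of the grand total, in percent (one decimal)."""
    grand_total = sum(totals.values())
    return {name: round(100 * n / grand_total, 1) for name, n in totals.items()}


if __name__ == "__main__":
    totals = category_totals(RUBRIC_COUNTS)
    for name, share in shares(totals).items():
        print(f"{name}: {totals[name]} rubrics ({share}%)")
    print("Total rubrics:", sum(totals.values()))
    print("Abilities:", sum(len(a) for a in RUBRIC_COUNTS.values()))
```

Running the script prints one line per category, followed by the grand total and the number of abilities, which can be compared against the figures quoted in the Introduction.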

Scenario Distribution

Our dataset, data/question_all.jsonl, contains 296 meticulously designed multi-turn dialogues covering 33 real-world scenarios:

| Dialogue Scenario | Count (%) |
| :-------------------------------- | :-------- |
| Personal Growth | 110 (37.2%) |
| Interpersonal & Social Development | 66 (22.3%) |
| Workplace Psychology | 53 (17.9%) |
| Family Relationships | 37 (12.5%) |
| Intimate Relationships | 30 (10.1%) |
| Total | 296 (100%) |
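A distribution like the one above could be tallied directly from the dataset file. The following is a minimal sketch under stated assumptions: the `.jsonl` extension suggests one JSON object per line, and the field name `scenario` is hypothetical, since the actual schema of `data/question_all.jsonl` is not shown here. The helper names `count_by_field` and `as_percentages` are likewise our own.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_field(lines: Iterable[str], field: str = "scenario") -> Counter:
    """Tally the values of `field` across JSON Lines records.

    `field` defaults to "scenario", which is a hypothetical key name;
    replace it with whatever key the real dataset uses.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        counts[record.get(field, "<missing>")] += 1
    return counts


def as_percentages(counts: Counter) -> dict:
    """Convert raw counts to percentages rounded to one decimal place."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: round(100 * n / total, 1) for key, n in counts.items()}


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file (made-up records).
    demo_lines = [
        '{"scenario": "Personal Growth"}',
        '{"scenario": "Workplace Psychology"}',
        '{"scenario": "Personal Growth"}',
    ]
    demo_counts = count_by_field(demo_lines)
    print(demo_counts.most_common())
    print(as_percentages(demo_counts))
```

To run it against the real data, pass an open file handle instead of the demo list, for example `with open("data/question_all.jsonl", encoding="utf-8") as f: counts = count_by_field(f, field=...)`, substituting the actual key name once the schema is known.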

Data Sample

Each evaluation sample includes:

  • Context: The multi-turn conversation history between users.
  • Question: The final user utterance in the conversation. This serves as the prompt for the model to respond to and contains the specific points for evaluation.
  • Rubrics: A set of…
