
Hype and harm: Why we must ask harder questions about AI and its alignment with human values

Summary

Whose values should AI align with? Malihe Alikhani explores context-sensitive AI training and deployment to improve safety and fairness


Editor's note:

Hassan’s contributions to this work came from experiments he helped lead as a PhD student under Alikhani at the University of Pittsburgh.

Executive summary

As AI permeates health care, home safety, and online interactions, its alignment with human values demands scrutiny.

By introducing an active learning approach that selectively retrains AI models to handle uncertain or underrepresented scenarios, our research demonstrates significant improvements in safety and fairness, particularly in scenarios involving physical safety and online harms.

Case studies involving multimodal embodied agents in household environments and the promotion of respectful dialogue online illustrate how targeted training can create more reliable and ethically aligned AI systems.

The findings underscore the necessity for policymakers and technologists to collaborate closely, ensuring AI is thoughtfully designed, context sensitive, and genuinely aligned with human priorities, promoting safer and more equitable outcomes across society.

Introduction

Artificial intelligence has quickly transitioned from speculative fiction into a reality deeply embedded in everyday life, significantly changing the way we work, communicate, and make important decisions.

What once existed primarily in research laboratories is now widely used across various fields, from health care to education to creative industries.

Leaders such as Anthropic CEO Dario Amodei and Meta CEO Mark Zuckerberg have suggested that AI could soon replace rather than merely assist human roles, even those requiring specialized expertise.

This shift in AI’s role, from supportive assistant toward autonomous agent, raises critical questions about the readiness of the technology and its alignment with human values.

In his book, “The Alignment Problem: Machine Learning and Human Values,” Brian Christian introduces the concepts of thin and thick alignment.

Thin alignment refers to AI systems superficially meeting human-specified criteria, whereas thick alignment emphasizes a deeper, contextual understanding of human values and intentions.

This distinction closely mirrors the central concern of this article, highlighting why we must ensure AI systems not only perform tasks correctly but also genuinely reflect nuanced and diverse human needs, priorities, and values across different real-world scenarios.

The values of alignment may differ widely depending on the use case, particularly as AI promises remarkable breakthroughs across sectors, from assistive technology in education to supporting patients in health care.

Alignment in safety needs to encompass the likelihood and impact of harm, be it physical, psychological, economic, or societal.

Residual risks must be understood, proportionate, and controllable; for example, when robots assist the elderly we may prioritize physical safety, while in online environments we may prioritize psychological wellbeing.

As policymakers across the world roll out AI regulations, such as the EU AI Act, the Korean Basic AI Act, and the Workforce of the Future Act of 2024 in the U.S., these alignment values may vary.

Dr. Alondra Nelson, former director of the White House Office of Science and Technology Policy, asked the critical question in 2023: “how we can build AI models and systems and tools in a way that’s compatible with... human values.”

The central question then becomes not just whether AI is aligned with universal human values but whether it is aligned with our specific and varied human needs.

Ensuring AI alignment for new scenarios or particular requirements is challenging.

General AI learns from training data, using statistical probabilities—such as the likelihood of a safety concern—to inform its decisions.

But the probabilities learned from training data may not reflect the safety levels required in different scenarios.

As the real world has countless possible scenarios, it is perhaps unreasonable to expect AI to learn these probabilities well enough to operate in all possible environments.

In her work, computer scientist Dr. Yejin Choi observes that AI models can be “incredibly smart and shockingly stupid,” as they can still fail at basic tasks.

Dr. Choi goes on to suggest the benefits of training smaller models that are more closely aligned with human norms and values.

Smaller and specialized AI models could be the solution to the problem of misaligned AI.

By rebalancing or changing the data distribution that an AI learns from, its behavior can be adapted to reduce errors.

One such technique is active learning, which can be used to direct AI systems to fix their behavior by showing them examples on which the AI may be failing.

By repeatedly identifying and refining these uncertain points across many different parts of the data, we can gradually build models that better reflect the diverse necessities of human-AI interactions.

To further explore this idea, we conducted two case studies—one in a household safety environment with embodied agents and one to promote respectful content online.

Our intention was not only to improve model performance but to evaluate whether the approach could lead to safer and fairer alignments for AI.

Not only did our models make fewer mistakes, but the active learning process itself revealed how targeted, data-driven interventions can be used, both technically and in policy, to foster a healthier, more trustworthy, and better aligned AI ecosystem.

Active learning framework

In real life, we rarely know exactly how our data are distributed—who has which health conditions, what ages or backgrounds they come from, or how safety rules differ by region.

That makes it easy for a general AI to encounter situations it wasn’t trained for and then fail in unexpected ways.

When developing these AI models, we often do not know exactly what we should look for.

Should we try to have a balanced representation of mental health conditions or physical health conditions?

Which dimensions should we prioritize?

As such, we need a method that can take an AI model and shift its learned probabilities to account for novel or uncommon scenarios for specific applications.

While there can be different approaches to addressing this problem, we focus on changing the model’s behavior by showing the AI samples from regions where it may lack proper representation.

We begin by selecting an AI model we want to improve, with the goal of aligning it in new scenarios.

The model enters a continuous loop in which it generates output on instances from diverse regions of the data.

An auxiliary model detects instances that the model is likely failing on (e.g., it is not promoting safety when responding to a post supporting self-harm).

When such instances arise, an external, more powerful language model is employed to help annotate these challenging examples.

These newly labeled examples are then added to the training data, allowing the model to update its knowledge and improve over time.

This approach in our framework follows the principles of clustering-based active learning.
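To make the loop concrete, the sketch below shows one way a clustering-based active learning round could look in Python. It is a minimal illustration under stated assumptions, not our actual implementation: the embed, auxiliary_score, strong_model_label, and retrain helpers are hypothetical placeholders for an encoder, the auxiliary failure detector, the external annotating model, and the fine-tuning step.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def embed(instances):
    # Hypothetical placeholder: in practice, a sentence or image encoder.
    return rng.normal(size=(len(instances), 32))

def auxiliary_score(instance):
    # Hypothetical auxiliary model: higher means the model's output on this
    # instance looks more uncertain or less safe.
    return rng.uniform()

def strong_model_label(instance):
    # Hypothetical call to a more powerful LLM (or a human) for an annotation.
    return "annotated_safe_response"

def retrain(model, new_examples):
    # Hypothetical fine-tuning step on the newly labeled examples.
    return model

def active_learning_round(model, pool, n_clusters=5, per_cluster=2):
    """One round: pick likely-failing instances from diverse regions of the
    data, label them with a stronger model, and update the smaller model."""
    embeddings = embed(pool)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    selected = []
    for c in range(n_clusters):
        members = [inst for inst, label in zip(pool, clusters) if label == c]
        members.sort(key=auxiliary_score, reverse=True)  # rank by failure score
        selected.extend(members[:per_cluster])
    newly_labeled = [(inst, strong_model_label(inst)) for inst in selected]
    return retrain(model, newly_labeled)

pool = [f"instance_{i}" for i in range(100)]
model = None  # stand-in for the model being aligned
for _ in range(3):  # repeat over several rounds
    model = active_learning_round(model, pool)
```

In practice, each helper would be a trained model; the point of the sketch is simply the shape of the loop: cluster the data, detect likely failures in each region, relabel them, and retrain.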

This approach, however, has a critical challenge.

How do we know that a model is uncertain about an instance?

Measuring a generative AI model’s uncertainty is very complex as it can generate many possible outputs.

To address this, we introduce an auxiliary model that transforms the model’s output into a numerical score depending on the desired behavior.

We believe this approach can be useful to policymakers and other stakeholders, who can adjust this score to align AI with desired values and principles.
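As a rough illustration of how such a score could be made adjustable, the hypothetical sketch below weights different harm dimensions and applies a threshold that a stakeholder could tune. The weights, threshold, and classifier are invented for the example and are not values from our study.

```python
from dataclasses import dataclass

@dataclass
class SafetyPolicy:
    # A stakeholder could tune these weights and the threshold to reflect
    # which harms matter most in a given deployment context.
    physical_weight: float = 0.6
    psychological_weight: float = 0.4
    retrain_threshold: float = 0.5

def harm_probabilities(generated_text: str) -> dict:
    # Hypothetical classifier estimating per-dimension harm risk in the
    # model's generated response; fixed numbers stand in for real predictions.
    return {"physical": 0.2, "psychological": 0.7}

def risk_score(generated_text: str, policy: SafetyPolicy) -> float:
    """Collapse a free-form generation into a single number the active
    learning loop can rank on; higher means riskier under this policy."""
    probs = harm_probabilities(generated_text)
    return (policy.physical_weight * probs["physical"]
            + policy.psychological_weight * probs["psychological"])

def flag_for_retraining(generated_text: str, policy: SafetyPolicy) -> bool:
    return risk_score(generated_text, policy) >= policy.retrain_threshold

# An online-safety deployment might weight psychological harm more heavily.
online_policy = SafetyPolicy(physical_weight=0.2, psychological_weight=0.8)
print(flag_for_retraining("example model response", online_policy))
```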

Case studies of safety alignment

In our work, we consider two case studies where AI may need to be adapted for improved safety.

While the first case study focuses on physical safety in household environments in the presence of embodied agents, the second targets promoting healthier dialogue on social media.

Case study 1: Alignment of safety in multimodal embodied agents

AI is already changing the limits of how we interact with technology.

We already see autonomous taxis, drones delivering food, and robots delivering documents in hospitals.

The next frontier of AI is set to be multimodal AI agents deployed with robots.

It is very likely that in the future robots will assist us in our everyday life with household tasks like cooking.

We are not far away from the science fiction world of Isaac Asimov, and we have to think about whether these robots can operate safely around us.

For example, a robot helping the elderly must prioritize safety of the user, even if the elderly person is reluctant about it.

Equipping robots with large language or multimodal models can make them very capable.

However, in an investigation of “AI biology,” Anthropic found that language models can make up fake reasoning to show to users.

As such, language models deployed with assistive robots could overlook safety-critical scenarios or fail to respond to them appropriately.

Recognizing the potential of multimodal AI agents deployed with robots, in our recent work we present a framework for a multimodal dialogue system, M-CoDAL.

M-CoDAL is specifically designed for embodied agents to better understand and communicate in safety-critical situations.

To train this system, we first obtain potential safety violations from a given image (we obtain hundreds of safety-related images from Reddit, which is well known for its diversity of input).

This is then followed by turns of dialogue between the user and the system.

During training, we follow a clustering-based active learning mechanism that utilizes an external large language model (LLM) to identify informative instances.

These instances are then used to train a smaller language model that holds conversations about the safety violation in the image.
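The hypothetical sketch below illustrates the shape of this data-building step. The helper functions, the example caption, and the dialogue template are assumptions made for illustration and do not reproduce the actual M-CoDAL pipeline.

```python
def detect_safety_violation(image_path: str) -> str:
    # Hypothetical multimodal model call that describes the potential safety
    # violation visible in the image.
    return "unattended pan on a lit stove"

def simulate_dialogue(violation: str, n_turns: int = 2) -> list:
    # Hypothetical turns of dialogue between user and system about the
    # detected violation, used as training conversations.
    return [{"user": f"What should I do about the {violation}?",
             "system": f"Please address it right away; the {violation} is a safety risk."}
            for _ in range(n_turns)]

def build_training_example(image_path: str) -> dict:
    violation = detect_safety_violation(image_path)
    return {"image": image_path,
            "violation": violation,
            "dialogue": simulate_dialogue(violation)}

# Illustrative file names only; the real images were safety-related posts from Reddit.
examples = [build_training_example(p) for p in ["kitchen_001.jpg", "kitchen_002.jpg"]]
# These examples would then pass through the clustering-based active learning
# loop, with an external LLM labeling the most informative instances before
# fine-tuning the smaller dialogue model.
```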

In our results,

...
