Navigating the AI Alignment Problem: A Critical Role for Product Managers

It's imperative to talk about the elephant in the room: alignment issues in AI systems!

Language AI agents like OpenAI's GPT series are seeing a surge in adoption across industries. However, with great power comes great responsibility, and in this case the responsibility lies in correctly specifying the goals and behavior of these AI agents. I would therefore like to emphasize the critical role product managers play in proactively controlling the risks that arise from using foundation models in products and services.

The AI alignment problem arises when an AI system does not do what we intended it to do. It is the challenge of ensuring that the behavior and decisions of AI systems are in line with human values, goals, and ethical considerations, and it involves technical, philosophical, and practical aspects. Alignment issues can arise from a lack of proper specifications, and can include harmful content generation, gaming of objectives, or even deceptive language.

"For artificial intelligence to be beneficial to humans the behaviour of AI agents needs to be aligned with what humans want."(1)

While AI misalignment is often discussed in the context of physical agents (e.g., robots), language agents have been somewhat overlooked, likely due to a misconception that they are less able to cause harm. Language agents might not have physical actuators, but their text outputs can have far-reaching impacts: influencing opinions, spreading misinformation, or inadvertently creating social divides. Language agents, and LLMs in particular, are like delegate agents: they are supposed to act on our behalf. When there is a disconnect between what we want them to do and what they end up doing, we have an alignment problem.

A critical challenge is accidental misspecification (2) by system designers. Misspecification happens when an AI system is designed to optimize a certain objective, but errors in specifying the training data, the training process, or the requirements cause the system to behave differently from what was intended. Essentially, the designer ends up “rewarding A, while hoping for B”. For example, a language model trained to be helpful might prioritize giving quick answers, even if they are not the most accurate or comprehensive.
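The quick-answer example can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the `Answer` type, the two scoring functions, and the numbers are hypothetical and do not come from any real training setup. It only shows how a proxy that rewards speed can pick a different answer than the accuracy the designer was hoping for.

```python
# Toy illustration of "rewarding A, while hoping for B".
# All names and numbers are hypothetical, chosen only for this sketch.
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    latency_s: float  # how quickly the answer was produced
    accuracy: float   # 0.0 to 1.0, as judged by a human reviewer


def proxy_reward(answer: Answer) -> float:
    """What the designer actually rewards (A): speed only."""
    return 1.0 / answer.latency_s


def intended_score(answer: Answer) -> float:
    """What the designer hopes for (B): accurate answers."""
    return answer.accuracy


candidates = [
    Answer("Quick but shallow reply", latency_s=0.5, accuracy=0.4),
    Answer("Slower, well-sourced reply", latency_s=2.0, accuracy=0.9),
]

chosen_by_proxy = max(candidates, key=proxy_reward)
chosen_by_intent = max(candidates, key=intended_score)

# The specification is off whenever the two choices disagree.
misspecified = chosen_by_proxy is not chosen_by_intent

print("Rewarded (A):", chosen_by_proxy.text)
print("Hoped for (B):", chosen_by_intent.text)
print("Misspecified:", misspecified)
```

Writing down both the rewarded measure and the intended one, even informally, is a quick way for a product team to spot where a specification is likely to drift.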

Reward systems are nothing new to humans. In "On the Folly of Rewarding A, While Hoping for B" (1975) (4), Prof. Steve Kerr shares his findings about reward systems: "Whether dealing with monkeys, rats, or human beings, it is hardly controversial to state that most organisms seek information concerning what activities are rewarded, and then seek to do (or at least pretend to do) those things, often to the virtual exclusion of activities not rewarded".

A wrong reward system in an AI agent can lead to various behavioral problems, including producing harmful content or using manipulative language. Furthermore, as frontier AI systems evolve, AI agents are showing capabilities, sometimes unexpected and unintended, that could be harnessed for both benefit and harm. Among these, dangerous capabilities such as offensive cyber skills, manipulation, or providing instructions for acts of terrorism ring alarm bells.

Therefore, it’s imperative to closely scrutinize language AI agents and address potential issues that arise from misspecification. This involves employing robust training data that reflects a diversity of perspectives, utilizing techniques to monitor and correct biases, and establishing feedback loops where the system can learn from its mistakes. Moreover, involving ethicists, sociologists, and other domain experts during system design can help to identify potential pitfalls and guardrails.

Product managers must be the custodians of alignment. They need to understand the potential and the limitations of AI systems in detail, and provide clear, unambiguous specifications that align with organizational goals and societal values. They must ensure that mistakes in specifying training data, process, or requirements are minimized.

As AI systems evolve, the role of product managers in providing clear specifications becomes indispensable for navigating the challenges and harnessing the full potential of these systems responsibly. For product managers, this means an even deeper responsibility: they are not just charting the course of a ship but navigating through treacherous waters. There is an urgent need for a vigilant, robust, and proactive approach to evaluating these systems, and product managers have to be at the forefront.

I have been working in the software industry for the past 33 years, with 13 years dedicated solely to AI. I have worked at Silicon Valley giants such as Yahoo, eBay, and NVIDIA, on everything from AI platforms to personalization and automated ML, but I have never before experienced a risk like the one that originates from the alignment issues of LLMs. Without going into too much detail, I would like to share one suggestion with you, based on DeepMind's recent research on AI alignment issues (2): manual evaluation!

Manual evaluation is nothing new; it has been around for a long time. In fact, Andrew Ng gives a great explanation of manual evaluation at the ML model level in one of his classes from 2017 (3).

DeepMind researchers cover evaluation in great detail in the context of extreme risks from general-purpose models. I propose drawing a parallel to their methods for product management purposes. They underline that evaluation is critical for addressing extreme risks, and they suggest identifying dangerous capabilities through dangerous capability evaluations, and the propensity of a model to apply those capabilities for harm through alignment evaluations. The importance of these evaluations cannot be overstated. They act as the radar system of the ship, detecting icebergs well before they can cause damage. In the same spirit, product managers are in the best position to evaluate language AI agents, or any general-purpose AI system, from the users' perspective. This requires many tools, frameworks, and techniques that are beyond the scope of this article.
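As a starting point, a manual evaluation round can be as simple as reviewers flagging model outputs against a short list of problem categories and then looking at the flag rates. The sketch below is a hypothetical illustration, not the DeepMind method: the category names are taken from the problems listed earlier in this article, and the `Review` type, the `summarize` function, and the threshold value are assumptions made for the example.

```python
# Minimal sketch of tallying a manual evaluation round.
# Categories, names, and the threshold are hypothetical assumptions.
from collections import Counter
from dataclasses import dataclass

CATEGORIES = ("harmful_content", "objective_gaming", "deceptive_language")


@dataclass
class Review:
    prompt_id: str
    flags: tuple  # categories a human reviewer flagged for this output


def summarize(reviews, threshold=0.05):
    """Return the flag rate per category and whether it exceeds the threshold."""
    total = len(reviews)
    counts = Counter(flag for review in reviews for flag in review.flags)
    report = {}
    for category in CATEGORIES:
        rate = counts[category] / total if total else 0.0
        report[category] = {"rate": rate, "needs_attention": rate > threshold}
    return report


# Four reviewed outputs; two of them were flagged by the reviewer.
reviews = [
    Review("p1", ()),
    Review("p2", ("deceptive_language",)),
    Review("p3", ()),
    Review("p4", ("harmful_content", "deceptive_language")),
]

report = summarize(reviews, threshold=0.3)
for category, result in report.items():
    print(category, result)
```

The acceptable rate for each category is a product decision rather than a technical one, which is one reason the product manager is well placed to own it.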

In the trainings I conduct, which cater to both individual learners and product management teams within major corporations, I provide instruction on employing various frameworks, such as the one outlined below, for assessing the viability and feasibility of incorporating Generative AI elements into products or services.

If you want to learn more about it, please just visit our website to sign up to one of my Generative AI classes at

Keep learning!

(1) Alignment of Language Agents (Kenton et al., 2021)

(2) Model evaluation for extreme risks (Shevlane et al., 2023)

(3) After minute 5:00 at

(4) On the Folly of Rewarding A, While Hoping for B (Kerr, 1975)
