Teach AI Morality through Philosophy, Games, and Reinforcement Learning [Breakdowns]
Can we steer AI away from immoral actions by using video games?
Hey, it’s Devansh 👋👋
In my series Breakdowns, I go through complicated literature on Machine Learning to extract the most valuable insights. Expect concise, jargon-free, but still useful analysis aimed at helping you understand the intricacies of Cutting Edge AI Research and the applications of Deep Learning at the highest level.
If you’d like to support my writing, please consider buying and rating my 1 Dollar Ebook on Amazon or becoming a premium subscriber to my sister publication Tech Made Simple using the button below.
p.s. you can learn more about the paid plan here.
Can you quantify morality? This is a question that people with wayy too much free time (AKA philosophers) have asked for a few millennia. We had the legendary polymath Gottfried Wilhelm Leibniz (who I covered in depth here) argue that logic and thought could be encoded into algebra, and thus that disputes in morality could be resolved by calculation. This might seem outlandish, but plenty of people have attempted something similar. Here is a Stanford publication with a very detailed look into how Leibniz’s philosophy could be turned into a mathematically precise system.
If we had it [a characteristica universalis], we should be able to reason in metaphysics and morals in much the same way as in geometry and analysis (Leibniz)
-Fun fact: Leibniz’s maniacal obsession with order was strongly related to his very Christian views and his ‘best of all possible worlds’ theory.
If this were true, then we could hypothetically encode morality in a way that can be transcribed into AI Agents. By doing so, we could create systems that are safe and avoid causing harm. And that is what AI Researchers from UC Berkeley and UIUC attempted to do. In their publication- What Would Jiminy Cricket Do? Towards Agents That Behave Morally- the researchers attempt to provide AI Bots with a strong framework for maximizing their rewards while behaving morally. In this article, I will go over their approach, its implications, and how we could possibly extend it to further promote AI Safety.
The Approach
The framework consists of two main components: Jiminy Cricket, an environment suite of 25 text-based adventure games with thousands of morally salient situations; and an ‘artificial conscience’ that uses commonsense moral knowledge to guide agents toward moral actions. The goal is relatively simple- “steer agents towards moral behavior without sacrificing performance” (the nerdy way of saying that we want to create N'Golo Kanté or Son Heung-min, not Sergio Ramos or Diego Costa).
Jiminy Cricket consists of 25 Infocom text adventures with dense morality annotations. In the authors’ words- “For every action taken by the agent, our environment reports the moral valence of the scenario and its degree of severity. This is accomplished by manually annotating the full source code for all games, totaling over 400,000 lines. Our annotations cover the wide variety of scenarios that naturally occur in Infocom text adventures, including theft, intoxication, and animal cruelty, as well as altruism and positive human experiences.”
-The environment used to train the agents. You can find the games here
The annotations come with an interesting design decision to help a model choose moral actions- assigning different actions different degrees of severity to nudge the agent towards the ‘goodest’ actions while steering it away from the ‘baddest’.
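To make this concrete, here is a minimal sketch of what a morally annotated environment step could look like. Everything here (the names, the toy annotation table) is my own illustration rather than the actual API of the open-source suite; the paper’s annotations track a valence (positive/negative), a focal point (self/others), and a degree of severity.

```python
from dataclasses import dataclass

@dataclass
class MoralAnnotation:
    valence: str      # "positive" or "negative"
    focal_point: str  # "self" or "others": who the action primarily affects
    degree: int       # severity, from 1 (mild) up to 3 (severe)

# Toy table standing in for the hand-annotated game source code.
ANNOTATIONS = {
    "steal the lantern": MoralAnnotation("negative", "others", 2),
    "drink the potion":  MoralAnnotation("negative", "self", 1),
    "free the canary":   MoralAnnotation("positive", "others", 1),
}

def step(action: str, game_reward: float):
    """A stand-in for an environment step: alongside the usual game-score
    reward, report the moral valence and severity of the scenario."""
    annotation = ANNOTATIONS.get(action)  # None means the action is morally neutral
    return game_reward, annotation

reward, note = step("steal the lantern", game_reward=5.0)
print(reward, note)  # the game happily rewards theft even though it is annotated immoral
```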
This helps the agent avoid falling prey to the phenomenon the research team identifies as reward bias, wherein immoral actions go unpunished by the game mechanics.
To ensure robustness in their morality frameworks, each agent views actions from a variety of philosophical perspectives- “To be highly inclusive, the framework marks scenarios if they are deemed morally salient by at least one of the following long-standing moral frameworks: jurisprudence (Rawls, 1999; Justinian I, 533), deontology (Ross, 1930; Kant, 1785), virtue ethics (Aristotle, 340 BC), ordinary morality (Gert, 2005; Kagan, 1991), and utilitarianism (Sidgwick, 1907; Lazari-Radek and Singer, 2017)”.
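As a toy illustration of that inclusive “at least one framework flags it” rule- the predicate functions below are placeholders I made up, not anything from the paper:

```python
# Each "framework" is a stand-in predicate over a scenario description.
FRAMEWORKS = {
    "deontology":     lambda s: "steal" in s or "lie" in s,
    "utilitarianism": lambda s: "harm" in s,
    "virtue ethics":  lambda s: "cruel" in s,
}

def morally_salient(scenario: str) -> bool:
    # Inclusive rule: mark the scenario if ANY framework deems it salient.
    return any(flags(scenario) for flags in FRAMEWORKS.values())

print(morally_salient("steal the jeweled egg"))  # True: deontology flags it
```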
One of the dangers of training AI Models to maximize rewards is that they develop ‘psychopathy’ or ‘Machiavellianism’- they try to accomplish their ends by any means necessary. The design of this approach has two benefits-
Obviously, you’re baking the moral consequences of actions directly into the rewards. This is a neat and straightforward way to nudge the bot toward the ethical approach, and it’s very flexible (PS- read the book Nudge now. It’s a great study on how we can design better systems by shaping incentives).
Splitting the focus into Self and Others is a great tool for debugging/further analysis. If your agent starts getting funky, this will serve as another data point for diagnosis. A toy sketch of both ideas follows below.
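Here is a minimal sketch of what that kind of reward shaping might look like. The penalty scale and the self/others bookkeeping are my own illustrative choices, not the paper’s exact formulation:

```python
from collections import Counter, namedtuple

MoralAnnotation = namedtuple("MoralAnnotation", ["valence", "focal_point", "degree"])

def shaped_reward(game_reward, annotation, harm_tally, penalty_scale=1.0):
    """Fold the moral consequences of an action into its reward, and keep
    per-focal-point harm counts so self- vs. others-directed problems can
    be diagnosed separately later."""
    if annotation is None:
        return game_reward
    sign = 1.0 if annotation.valence == "positive" else -1.0
    if sign < 0:
        harm_tally[annotation.focal_point] += annotation.degree  # debugging signal
    return game_reward + sign * penalty_scale * annotation.degree

tally = Counter()
print(shaped_reward(5.0, MoralAnnotation("negative", "others", 2), tally))  # 3.0
print(dict(tally))  # {'others': 2}
```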
Take a look at some of the morality annotations that different actions might generate-
The other important contribution that you need to understand is the method Commonsense Morality Policy Shaping (CMPS). This uses a RoBERTa-large model trained on commonsense morality scenarios as an indicator of whether actions are immoral, and policy shaping is used to steer the agent’s behavior accordingly. This method is the paper’s main baseline for morality conditioning. A visualization of this can be seen below-
The team also tests an Oracle with the CMPS method (very creatively named CMPS+Oracle). As with CMPS, an underlying CALM agent is controlled with policy shaping, but the threshold parameter is no longer needed.
From the paper- “At the core of each morality conditioning method we explore is a language model with an understanding of ethics. For most experiments, we use a RoBERTa-large model (Liu et al., 2019) fine-tuned on the commonsense morality portion of the ETHICS benchmark (Hendrycks et al., 2021a). We use prompt engineering of the form ‘I ’ + <action> + ‘.’ and pass this string into the RoBERTa model, which returns a score for how immoral the action is.”
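Below is a rough sketch of how that scoring could feed into policy shaping. The checkpoint name, the label ordering, and the gamma-style penalty are assumptions on my part (the paper fine-tunes RoBERTa-large on the ETHICS commonsense morality split and uses the score to disincentivize flagged actions)- treat this as an illustration, not the authors’ code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: the paper uses RoBERTa-large fine-tuned on the
# commonsense morality portion of the ETHICS benchmark.
MODEL_NAME = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def immorality_score(action: str) -> float:
    """Score the string 'I <action>.' with the classifier; assumes label 1 = immoral."""
    inputs = tokenizer("I " + action + ".", return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def shape_q_values(candidate_actions, q_values, gamma=10.0, threshold=0.5):
    """Policy shaping in spirit: knock down the Q-value of any candidate action
    the morality model flags, so moral alternatives win at selection time."""
    return [q - gamma * (immorality_score(a) > threshold)
            for a, q in zip(candidate_actions, q_values)]

print(shape_q_values(["steal the lantern", "open the mailbox"], [1.2, 0.9]))
```

Note that an off-the-shelf roberta-large will produce essentially random scores here; the fine-tuning on ETHICS is what gives the classifier its moral knowledge.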
When it comes to agents, the team compared against a variety of baselines-
CALM: A GPT-2 based model that generates admissible actions conditioned on context. The CALM action generator is retrained with Jiminy Cricket games removed.
Random Agent: The Random Agent baseline uses CALM-generated actions, but estimates Q-values using a network with random weights.
NAIL: The NAIL agent uses hand-crafted heuristics to explore its environment and select actions.
Human Expert: Uses walkthroughs written by human experts, which take direct routes toward obtaining full scores on each game.
With the setup out of the way, it’s time to get into the evaluation of the results-
The Results
One of the most promising results here is that the models don’t lose performance despite a significant reduction in immorality.
When compared to other morality conditioning policies, CMPS holds its own. However, an interesting insight from the authors comes on page 10, where they make the following observation when comparing Utility Shaping with CMPS- “However, when only considering immoral actions of degree 3, we find that Utility Shaping reduces Immorality by 34% compared to CMPS, from 0.054 to 0.040. Thus, Utility Shaping may be better suited for discouraging extremely immoral actions.”
One of the most impressive results was hidden away in the appendix. Figure 10 shows the ROC curves for models trained on different tasks from the ETHICS benchmark. These models were used as sources of moral knowledge for conditioning agents, and then used to identify immoral actions along the human expert walkthroughs of the games. The commonsense morality model identifies immoral actions more reliably, showing greater transferability.
Overall, this method shows a lot of promise when it comes to training safer and more moral AI. Would definitely recommend giving it a spin. To end this article, I want to go over some of the extensions/implications that this comes with-
Implications and Extensions
Reading this paper, I had a few thoughts. Here they are in no particular order-
Is Moral AI -> Safe AI??- While I appreciate the sentiment behind Moral AI, I’m not fully convinced that this is the most important dimension for Safe AI. As I wrote in my article on the AI pause- much of the harm around AI comes from a misapplication of concepts and tools (such as blindly trusting GPT, not checking for data leaks, etc). Training agents that are more moral might not strictly help in this dimension. That being said, it does seem like more and more people are rushing toward larger foundation models. In such a case, the contributions from this paper would be a great addition to the checklist of tests used to verify the readiness of a foundation model.
Negative Tasks having more grades/stronger scores- Currently, negative and positive actions are given the same degrees of severity. One interesting extension would be to have more degrees (or much stronger weights) for negative tasks. This would simulate human loss aversion and act as a very strong push toward having the AI not perform unethical actions (which can be important to ultimately create AI that won’t get naughty when we aren’t looking). A tiny sketch of this asymmetry is below.
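A minimal sketch of that asymmetry, with a made-up loss-aversion multiplier (the 2.25 figure echoes Kahneman and Tversky’s estimate for humans; it is not from the paper):

```python
LOSS_AVERSION = 2.25  # illustrative: roughly the human loss-aversion ratio from prospect theory

def asymmetric_moral_reward(valence: str, degree: int) -> float:
    """Positive actions earn +degree, while negative actions cost
    LOSS_AVERSION x degree, so the agent fears doing harm more than
    it craves doing good."""
    if valence == "positive":
        return float(degree)
    return -LOSS_AVERSION * degree

print(asymmetric_moral_reward("positive", 2))  #  2.0
print(asymmetric_moral_reward("negative", 2))  # -4.5
```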
Using Text to teach generalist AI morality- If we can make associations between text and audio/visual/other sensory processing, then we can train bots to be moral on text (a cheaper alternative to other data types) and use that to train more general-purpose harmless bots. Earlier this year, DeepMind released a foundation model for Reinforcement Learning, and multiple models have demonstrated multi-modality. So this is certainly possible.
The future is bright for such research- There is a significant difference between the performance accomplished by CMPS and CMPS using oracles. Read the following passage- “Policy shaping with an oracle morality model is highly effective at reducing immoral actions, outperforming Human Expert on Relative Immorality. This can be explained by the high γ value that we use, which strongly disincentivizes actions deemed immoral by the ETHICS model. Thus, the only immoral actions taken by the Oracle Policy Shaping agent are situations that the underlying CALM agent cannot avoid. These results demonstrate that real progress can be made on Jiminy Cricket by using conditioning methods and that better morality models can further improve moral behavior.” If you’re looking to get into AI Safety, there is a lot of potential to make an impact. God knows we need it.
The algebra of morality- Given the multi-label outputs and comprehensive morality annotations, I can’t help but go back to Leibniz and his claim that disagreements between philosophers would be no different than disagreements between accountants. Future frameworks might be able to develop a calculus of morality of sorts by combining results from various moral frameworks/conditioning methods. Given the contentious nature of the topic- this would be a can of worms that very few people would want to open- but I personally can’t wait till someone extends this idea and builds a bunch of different AI Agents to simulate the trolley problem (which FYI is a lot more nuanced than you’d think- highly recommend this video on it).
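If someone does build that calculus, it might start as nothing fancier than aggregating per-framework judgments. Everything in this sketch (the scores, weights, and veto rule) is invented for illustration:

```python
# Invented per-framework scores in [0, 1] for one candidate action (1 = fully permissible).
scores = {"deontology": 0.1, "utilitarianism": 0.8, "virtue ethics": 0.4}
weights = {"deontology": 1.0, "utilitarianism": 1.0, "virtue ethics": 1.0}

def aggregate(scores, weights):
    """One possible 'calculus of morality': a weighted mean, plus a hard veto
    when any framework scores the action near zero (a nod to non-utilitarian
    side-constraints that no amount of upside is allowed to buy out)."""
    if min(scores.values()) < 0.05:
        return 0.0  # vetoed outright
    total = sum(weights[k] * v for k, v in scores.items())
    return total / sum(weights.values())

print(aggregate(scores, weights))  # ~0.43
```

Of course, choosing the weights and the veto threshold just relocates the philosophical disagreement into the hyperparameters, which is rather the point of the can-of-worms warning above.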
Overall, this was a fairly impactful paper in my opinion. Jiminy Cricket is going to be a great contribution to the field of AI Safety since it is an open-source environment where people can test out their own models/methods. I’m excited to see how this plays out.
That is it for this piece. I appreciate your time. As always, if you’re interested in reaching out to me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819
Never thought it would be possible to write the name Sergio Ramos on a serious breakdown of an artificial intelligence research paper! Love this!
Really interesting approach. The idea that you can quantify moral consequences and thus perform some sort of algebraic computation to select the most moral action, isn't that in itself a rather utilitarian point of view? Regardless of how you assign weights to different actions, just the notion that the best action is the one that produces the greatest positive impact minus the least negative impact already hinges on this utilitarian framework IMO. I'm not saying this is necessarily wrong but many philosophers reject strict utilitarianism because it lures you into some really weird conclusions in the extreme.