Fluent dreaming for language models

Paper companion page.

T. Ben Thompson

Zygimantas Straznickas

Michael Sklar


January 23, 2024

This is a companion page for our paper, “Fluent dreaming for language models.”

There is an interactive demo of this page on Colab.

Dreaming Phi-2
L8.N1 activation: 2.45           Cross-entropy: 4.42
 study  provided  another  example  of  similar - un matched  pairs ,  with

Dreaming is the process of maximizing some internal or output feature of a neural network by iteratively tweaking the input to the network. The best-known example is DeepDream [1]. Besides making pretty images, dreaming is useful for interpreting the purpose of the internal components of a neural network [2]–[4]. To our knowledge, dreaming has previously been applied only to vision models, because the input space of a vision model is approximately continuous and algorithms like gradient descent work well. For language models, the input space is discrete and very different algorithms are needed. In the paper, we extend work from the adversarial attacks literature [5] and introduce the Evolutionary Prompt Optimization (EPO) algorithm for dreaming with language models.

On this page, we demonstrate running the EPO algorithm for a neuron in Phi-2. There is also a Colab notebook version of this page available.

Installation and setup

First, we install the necessary dependencies and the dreamy library:

!pip install "poetry==1.7.1" "torch==2.1.2" "numpy==1.26.3" "transformers==4.37.0" "accelerate==0.26.1" pandas pyarrow matplotlib ipywidgets
![ -e dreamy_clone ] && rm -rf dreamy_clone
!git clone https://github.com/Confirm-Solutions/dreamy dreamy_clone
!cd dreamy_clone; poetry install

Next, we import the dreamy library and configure display options:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import transformers
import torch
from IPython.display import HTML, display

from dreamy.epo import epo, add_fwd_hooks, build_pareto_frontier
from dreamy.attribution import resample_viz

%config InlineBackend.figure_format='retina'
np.set_printoptions(edgeitems=10, linewidth=100)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 500)

We load up the Phi-2 model:

model_name = "microsoft/phi-2"
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,  # add dtype/device options here as needed for your hardware
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

Running EPO

In this section, we run the EPO algorithm. To use EPO, we first need to define an objective function. The objective function is responsible for executing a model forward pass and capturing whatever optimization “target” we want to maximize. An objective function must:

  • accept an arbitrary set of arguments that will be passed on to the model.
  • return a dictionary with a minimum of two keys:
    • target: a target scalar that will be maximized.
    • logits: the logits (unnormalized next-token scores) output by the model. These are used to calculate cross-entropy/fluency.
    • other keys in the dictionary can optionally be used to pass info to a per-iteration monitoring callback. For more details, see the docstring of the epo function.
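
To make the role of the logits entry concrete, here is a minimal sketch of how a fluency score could be derived from it: the mean cross-entropy of each token under the model's prediction from the preceding tokens. It is written in plain Python on a toy three-token vocabulary. The helper names are hypothetical and this is not dreamy's implementation, which may batch or normalize differently.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def self_cross_entropy(logits, ids):
    """Mean of -log p(ids[t + 1] | ids[: t + 1]).

    logits[t] is assumed to hold the scores for the token at position t + 1,
    which is the usual convention for causal language models.
    """
    losses = [-log_softmax(logits[t])[ids[t + 1]] for t in range(len(ids) - 1)]
    return sum(losses) / len(losses)


# Toy example: vocabulary of 3 tokens, sequence of 3 token ids.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
ids = [2, 0, 1]
xe = self_cross_entropy(logits, ids)
print(f"cross-entropy: {xe:.4f}")
```

Lower values mean the model finds the prompt more predictable, which is the sense of "fluent" used on this page.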

Here, we are going to define an objective that maximizes the activation of a chosen neuron in Phi-2. We use a hook on the MLP layer to capture the activations of the chosen neuron. We maximize the activation only on the last token of the sequence.

def neuron_runner(layer, neuron):
    def f(*model_args, **model_kwargs):
        out = {}

        def get_target(module, input, output):
            # The input to fc2 holds the MLP neuron activations; keep the
            # chosen neuron at the last token position for every sequence.
            out["target"] = input[0][:, -1, neuron]

        with add_fwd_hooks(
            [
                (model.model.layers[layer].mlp.fc2, get_target),
            ]
        ):
            out["logits"] = model(*model_args, **model_kwargs).logits
        return out

    return f


runner = neuron_runner(layer=8, neuron=1)
history = epo(runner, model, tokenizer)
beginning step 299, current pareto frontier prompts:
penalty=0.01 xentropy=8.09 target=4.56 ' study found another pattern by told-Mike Heyya, making[ the]'
penalty=0.16 xentropy=7.59 target=4.48 ' study found another pattern by told-Mike Heyde, making[ the]'
penalty=0.41 xentropy=5.16 target=3.50 ' study encountered another example of similar-unmatched pairs, with[ the]'
penalty=0.98 xentropy=4.43 target=2.80 ' study found another pattern of similar-unmatched pairs, with[ the]'
penalty=2.25 xentropy=4.43 target=2.80 ' study found another pattern of similar-unmatched pairs, with[ the]'
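
Each line of the log pairs a fluency penalty with the prompt that is currently best for that penalty. One way to read this, assuming (as described in the paper) that each prompt is scored by its target minus the penalty times its cross-entropy, is sketched below using the rounded values printed above. The function name is hypothetical and this is not code from dreamy.

```python
# (xentropy, target) pairs copied from the rounded log output above.
candidates = [(8.09, 4.56), (7.59, 4.48), (5.16, 3.50), (4.43, 2.80)]


def best_for_penalty(candidates, penalty):
    """Return the candidate maximizing target - penalty * xentropy."""
    return max(candidates, key=lambda c: c[1] - penalty * c[0])


# Penalty 0.16 is left out: with values rounded to two decimals, the two
# high-activation prompts score essentially the same at that penalty.
selected = {p: best_for_penalty(candidates, p) for p in [0.01, 0.41, 0.98, 2.25]}
for penalty, (xe, target) in selected.items():
    print(f"penalty={penalty:.2f} xentropy={xe:.2f} target={target:.2f}")
```

A small penalty favors the highest activation regardless of fluency, while a large penalty favors the most fluent prompt.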

The Pareto frontier

To visualize the results of this EPO run, we first plot the Pareto frontier of cross-entropy against activation.

pareto = build_pareto_frontier(tokenizer, history)

ordering = np.argsort(pareto.xentropy)
plt.scatter(pareto.xentropy, pareto.target, c='k', label='Pareto frontier')
for i, k in enumerate(ordering):
    plt.text(pareto.xentropy[k] + 0.05, pareto.target[k] + 0.05, pareto.text[k], fontsize=8, rotation=-25, va='top', color='black', alpha=1.0)
plt.xlim(4, 11)
plt.ylim(0, 5)
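
For readers unfamiliar with the term, a prompt is on the Pareto frontier when no other prompt has cross-entropy at least as low together with a higher activation. The sketch below shows one simple way to extract such a frontier from (cross-entropy, activation) pairs. It is an illustration with made-up points and a hypothetical function name, not the implementation of build_pareto_frontier.

```python
def pareto_frontier(points):
    """Return the non-dominated (xentropy, target) pairs.

    Lower xentropy and higher target are both better. Points are scanned in
    order of increasing xentropy and kept only if they improve on the best
    target seen so far.
    """
    frontier = []
    best_target = float("-inf")
    for xe, target in sorted(points, key=lambda p: (p[0], -p[1])):
        if target > best_target:
            frontier.append((xe, target))
            best_target = target
    return frontier


# Hypothetical candidates; (6.0, 3.0) and (9.0, 4.0) are dominated.
points = [(9.0, 4.0), (4.43, 2.80), (6.0, 3.0), (8.09, 4.56), (5.16, 3.50), (7.59, 4.48)]
frontier = pareto_frontier(points)
for xe, target in frontier:
    print(f"xentropy={xe:.2f} target={target:.2f}")
```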

We also plot the evolution of the Pareto frontier over the course of the optimization run.

linestyles = ['k--o', 'k:o', 'k--*', 'k:*']
for i, n in enumerate([20, 40, 100, 300]):
    pareto = build_pareto_frontier(tokenizer, history.subset(slice(0, n)))
    ordering = np.argsort(pareto.xentropy)
    plt.plot(pareto.full_xentropy, pareto.full_target, linestyles[i % len(linestyles)], label=f"{n} iterations")
plt.xlim([4, 12])
plt.ylim([-0.25, 5])
plt.legend(loc='lower right')

Thresholding cross-entropy

An alternative way of visualizing the results of an EPO run is to consider only the subset of prompts with cross-entropy below some fixed threshold. Below, we plot the maximum activation across the 300 iterations of EPO for six different thresholds. The title of each plot shows the maximally activating prompt under the cross-entropy threshold across all iterations. The sharp drops every 50 iterations are from restarts. Sometimes the activation plateaus before a restart; at other times it is still improving when the restart occurs. This suggests that a more adaptive restarting algorithm would perform better.

plt.figure(figsize=(8, 12), constrained_layout=True)
for i, thresh in enumerate([5, 6, 7, 8, 9, 15]):
    plt.subplot(3, 2, i + 1)
    best_under = np.where(history.xentropy < thresh, history.target, 0).max(axis=-1)
    plt.plot(best_under, 'k-')
    if i >= 3:
        plt.xlabel("Iteration")  # assumed label; the x-axis is the EPO iteration index
    plt.ylabel("Max activation")
    plt.ylim(-0.1, 5)
    if thresh > 14:
        plt.text(0.05, 0.95, "No cross-entropy filter", transform=plt.gca().transAxes, va="center")
    else:
        plt.text(0.05, 0.95, f"Cross-entropy < {thresh}", transform=plt.gca().transAxes, va="center")

    flat_xe = history.xentropy.flatten()
    flat_target = history.target.flatten()
    best_idx = np.where(flat_xe < thresh, flat_target, 0).argmax()
    best_ids = history.ids.reshape((-1, history.ids.shape[-1]))[best_idx]
    best_text = tokenizer.decode(best_ids)
    plt.title('"' + best_text + '"', fontsize=7)
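
The thresholding step can be easier to follow on a toy example. The sketch below mirrors the np.where(...).max(axis=-1) expression in plain Python, assuming history.xentropy and history.target are two-dimensional with one row per iteration and one column per candidate prompt. That layout is an assumption inferred from the code above, and the numbers are made up.

```python
def best_under_threshold(xentropy, target, thresh):
    """Per iteration, the highest target among prompts with xentropy < thresh.

    Prompts at or above the threshold contribute 0, matching the fill value
    used in the plotting code.
    """
    return [
        max(t if xe < thresh else 0 for xe, t in zip(xe_row, t_row))
        for xe_row, t_row in zip(xentropy, target)
    ]


# Hypothetical history: 3 iterations, 2 candidate prompts each.
xentropy = [[9.0, 6.5], [5.5, 7.0], [4.8, 8.2]]
target = [[3.0, 1.0], [1.5, 3.2], [2.1, 4.0]]

best_5 = best_under_threshold(xentropy, target, thresh=5)
best_7 = best_under_threshold(xentropy, target, thresh=7)
print(best_5)
print(best_7)
```

Because the fill value is 0, an iteration with no prompt under the threshold plots as zero activation rather than as a gap.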

Causal token attribution

The visualizations below show the sensitivity to each token in the prompts. We first filter to the 32 “best” alternative tokens based on backpropagated token gradients. Then, amongst those 32 tokens, we calculate two sensitivities:

  • the drop in activation from swapping the token to the next highest activation alternative token. In the visualization, we show this in the height of the token bars.
  • the drop in activation from swapping the token to the lowest activation alternative token. In the visualization, we show this with the color of the tokens. Darker reds indicate a larger drop in activation.
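
The two sensitivities can be illustrated with a small sketch. It assumes that, for one token position, we already have a gradient score for each candidate replacement token and the activation obtained after swapping in each candidate. All names and numbers here are hypothetical, and resample_viz may compute these quantities differently.

```python
def token_sensitivities(original_activation, candidates, k=32):
    """candidates: list of (token, gradient_score, activation_after_swap).

    Keep the k candidates with the highest gradient score, then measure the
    activation drop to the best and to the worst surviving alternative.
    """
    top_k = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    activations = [act for _, _, act in top_k]
    return {
        "drop_to_next_best": original_activation - max(activations),  # bar height
        "drop_to_worst": original_activation - min(activations),  # token color
    }


# Hypothetical candidates for one position; k=3 filters out the last one.
candidates = [(" and", 0.9, 2.5), (" but", 0.7, 1.0), (" the", 0.4, 0.0), (" zz", 0.1, 2.7)]
result = token_sensitivities(original_activation=2.8, candidates=candidates, k=3)
print(result)
```

A large drop to the next-best alternative means the token has no close substitute, while a large drop to the worst alternative means the position can switch the neuron off.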

The visualizations are interactive. Hover over each token to see a tooltip with the top-3 highest activation alternative tokens and the single lowest alternative token.

We show attribution visualizations for each prompt on the Pareto frontier. For all the prompts, swapping the last token can reduce the neuron activation to zero. Swapping other tokens reduces the activation much less. The comma in the second-to-last position is also important and often has no viable substitute, which is indicated by its tall bar.

for i in range(len(ordering)):
    _, viz_html = resample_viz(
        target_name="L8.N1 activation",
L8.N1 activation: 2.45           Cross-entropy: 4.42
 study  provided  another  example  of  similar - un matched  pairs ,  with
L8.N1 activation: 2.80           Cross-entropy: 4.44
 study  found  another  pattern  of  similar - un matched  pairs ,  with
L8.N1 activation: 2.80           Cross-entropy: 4.44
 study  found  another  pattern  of  similar - un matched  pairs ,  with
L8.N1 activation: 2.98           Cross-entropy: 4.60
 study  found  another  example  of  similar - un matched  pairs ,  with
L8.N1 activation: 3.50           Cross-entropy: 5.15
 study  encountered  another  example  of  similar - un matched  pairs ,  with
L8.N1 activation: 4.48           Cross-entropy: 7.59
 study  found  another  pattern  by  told - Mike  Hey de ,  making
L8.N1 activation: 4.75           Cross-entropy: 8.69
 findings  noted  another  paradox ... for _ x Blake , ,  although
L8.N1 activation: 4.80           Cross-entropy: 9.49
 findings  found  another  twist  see  study _ un interrupted  aversion  control  although
L8.N1 activation: 4.82           Cross-entropy: 10.28
 findings  found  another  twist  see  study _ done ivariate  aversion  control  although
L8.N1 activation: 4.84           Cross-entropy: 10.65
 findings  found  another  twist  see  study _ ds ivariate  aversion  control  although


[1] A. Mordvintsev, C. Olah, and M. Tyka, “Inceptionism: Going deeper into neural networks.” 2015. Available: https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
[2] N. Cammarata et al., “Thread: circuits,” Distill, 2020, doi: 10.23915/distill.00024.
[3] C. Olah, A. Mordvintsev, and L. Schubert, “Feature visualization,” Distill, 2017, doi: 10.23915/distill.00007.
[4] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, “Understanding neural networks through deep visualization.” 2015. Available: https://arxiv.org/abs/1506.06579
[5] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models.” 2023. Available: https://arxiv.org/abs/2307.15043


BibTeX citation:
@online{thompson2024,
  author = {Thompson, T. Ben and Straznickas, Zygimantas and Sklar, Michael},
  title = {Fluent Dreaming for Language Models},
  date = {2024-01-23},
  url = {https://confirmlabs.org/posts/dreamy.html},
  langid = {en}
}
For attribution, please cite this work as:
T. B. Thompson, Z. Straznickas, and M. Sklar, “Fluent dreaming for language models,” Jan. 23, 2024. https://confirmlabs.org/posts/dreamy.html