Fluent dreaming for language models

T. Ben Thompson; Zygimantas Straznickas; Michael Sklar

Paper companion page.

Published

January 23, 2024

This is a companion page for our paper, “Fluent dreaming for language models.”

There is an interactive demo of this page on Colab.

Dreaming Phi-2

[Interactive token attribution visualization for the prompt "study provided another example of similar-unmatched pairs, with". L8.N1 activation: 2.45. Cross-entropy: 4.42. The per-token tooltip values are listed in the "Causal token attribution" section.]

Dreaming is the process of maximizing some internal or output feature of a neural network by iteratively tweaking the input to the network. The best-known example is DeepDream [1]. Besides making pretty images, dreaming is useful for interpreting the purpose of the internal components of a neural network [2]–[4]. To our knowledge, dreaming has previously been applied only to vision models, because the input space of a vision model is approximately continuous and algorithms like gradient descent work well. For language models, the input space is discrete and very different algorithms are needed. In the paper, we extend work from the adversarial attacks literature [5] and introduce the Evolutionary Prompt Optimization (EPO) algorithm for dreaming with language models.
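For intuition about the continuous case, here is a minimal sketch of dreaming by gradient ascent on the input of a toy one-neuron "network". It is not DeepDream and not EPO; the weights, step size, and the clipping range are illustrative assumptions.

```python
import math

def neuron(x, w, b):
    """Toy 'neuron': tanh of an affine function of the input (an assumption)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)

def dream(w, b, steps=200, lr=0.1, clip=1.0):
    """Gradient ascent on the input x, keeping each coordinate in [-clip, clip]."""
    x = [0.0] * len(w)
    for _ in range(steps):
        a = neuron(x, w, b)
        # d tanh(z) / d x_i = (1 - tanh(z)^2) * w_i
        grad = [(1.0 - a * a) * wi for wi in w]
        # Step uphill, then project back into the allowed input range.
        x = [max(-clip, min(clip, xi + lr * gi)) for xi, gi in zip(x, grad)]
    return x

w, b = [0.5, -1.0, 2.0], 0.0   # hypothetical weights
x_star = dream(w, b)
print(x_star, neuron(x_star, w, b))
```

This kind of update relies on being able to move the input by a small real-valued step. Token sequences do not allow that, which is why a different search procedure is needed for language models.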

On this page, we demonstrate running the EPO algorithm for a neuron in Phi-2. There is also a Colab notebook version of this page available.

Installation and setup

First, we install necessary dependencies and install the dreamy library:

!pip install "poetry==1.7.1" "torch==2.1.2" "numpy==1.26.3" "transformers==4.37.0" "accelerate==0.26.1" pandas pyarrow matplotlib ipywidgets
![ -e dreamy_clone ] && rm -rf dreamy_clone
!git clone https://github.com/Confirm-Solutions/dreamy dreamy_clone
!cd dreamy_clone; poetry install

Next, we import the dreamy library and set up plotting and display options:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import transformers
import torch
from IPython.display import HTML, display

from dreamy.epo import epo, add_fwd_hooks, build_pareto_frontier
from dreamy.attribution import resample_viz

%config InlineBackend.figure_format='retina'
np.set_printoptions(edgeitems=10, linewidth=100)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 500)

We load up the Phi-2 model:

model_name = "microsoft/phi-2"
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype="auto",
    use_cache=False,
    device_map="cuda"
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

Running EPO

In this section, we run the EPO algorithm. To use EPO, we first need to define an objective function. The objective function is responsible for executing a model forward pass and capturing the optimization “target” that we want to maximize. An objective function must:

- accept an arbitrary set of arguments that will be passed on to the model.
- return a dictionary with at least two keys:
  - target: a target scalar that will be maximized.
  - logits: the logits output by the model (unnormalized scores over next tokens, not probabilities). These are used to calculate cross-entropy/fluency.
  - other keys in the dictionary can optionally be used to pass info to a per-iteration monitoring callback. For more details, see the docstring of the epo function.
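To make the role of the logits concrete, the sketch below computes an average next-token cross-entropy from a logits array in plain Python. It is a generic calculation, not the dreamy implementation: the function names are hypothetical, and whether dreamy averages or sums, and over which positions, is an assumption here.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one vocabulary-sized row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def mean_cross_entropy(logits, token_ids):
    """Average negative log-likelihood of each token given the tokens before it.

    logits[i] is assumed to score the token at position i + 1.
    """
    total = 0.0
    for pos in range(len(token_ids) - 1):
        logp = log_softmax(logits[pos])
        total -= logp[token_ids[pos + 1]]
    return total / (len(token_ids) - 1)

# Toy example: a vocabulary of 3 tokens and a sequence of 3 token ids.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
token_ids = [1, 0, 1]
print(mean_cross_entropy(logits, token_ids))
```

Lower values mean the model finds the prompt more predictable, which is the sense in which cross-entropy serves as a fluency measure.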

Here, we define an objective that maximizes the activation of a chosen neuron in Phi-2. We attach a forward hook to the second linear layer of the MLP (fc2) and read the neuron's value from that layer's input, so the target is the MLP hidden activation. We maximize the activation only at the last token of the sequence.

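def neuron_runner(layer, neuron):
    """Build an EPO objective that targets one MLP neuron in the given layer."""
    def f(*model_args, **model_kwargs):
        out = {}

        def get_target(module, input, output):
            # input[0] is the tensor fed into fc2, i.e. the MLP hidden activations.
            # Keep the chosen neuron at the last token position of each sequence.
            out["target"] = input[0][:, -1, neuron]

        # Attach the hook around a single forward pass.
        with add_fwd_hooks(
            [
                (model.model.layers[layer].mlp.fc2, get_target),
            ]
        ):
            out["logits"] = model(*model_args, **model_kwargs).logits
        return out

    return f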

runner = neuron_runner(layer=8, neuron=1)
history = epo(runner, model, tokenizer)

beginning step 299, current pareto frontier prompts:
penalty=0.01 xentropy=8.09 target=4.56 ' study found another pattern by told-Mike Heyya, making[ the]'
penalty=0.16 xentropy=7.59 target=4.48 ' study found another pattern by told-Mike Heyde, making[ the]'
penalty=0.41 xentropy=5.16 target=3.50 ' study encountered another example of similar-unmatched pairs, with[ the]'
penalty=0.98 xentropy=4.43 target=2.80 ' study found another pattern of similar-unmatched pairs, with[ the]'
penalty=2.25 xentropy=4.43 target=2.80 ' study found another pattern of similar-unmatched pairs, with[ the]'

The Pareto frontier

To visualize the results of this EPO run, we first plot the Pareto frontier of cross-entropy against activation.

pareto = build_pareto_frontier(tokenizer, history)

ordering = np.argsort(pareto.xentropy)
plt.scatter(pareto.xentropy, pareto.target, c='k', label='Pareto frontier')
for i, k in enumerate(ordering):
    plt.text(pareto.xentropy[k] + 0.05, pareto.target[k] + 0.05, pareto.text[k], fontsize=8, rotation=-25, va='top', color='black', alpha=1.0)
plt.xlim(4, 11)
plt.ylim(0, 5)
plt.xlabel('Cross-entropy')
plt.ylabel('Activation')
plt.show()
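As a rough illustration of what a Pareto frontier over (cross-entropy, activation) pairs means, the sketch below keeps only the points that no other point beats on both axes. It is a generic plain-Python version, not the build_pareto_frontier implementation. The first four points are taken from the log output above, and the fifth is a made-up dominated point added for illustration.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    Each point is (xentropy, target, label). Lower xentropy and higher
    target are both better.
    """
    frontier = []
    best_target = float("-inf")
    # Sweep from most fluent to least fluent; break ties by higher target first.
    for xe, tgt, label in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a point only if it activates more than every more-fluent point.
        if tgt > best_target:
            frontier.append((xe, tgt, label))
            best_target = tgt
    return frontier

points = [
    (8.09, 4.56, "A"),
    (7.59, 4.48, "B"),
    (5.16, 3.50, "C"),
    (4.43, 2.80, "D"),
    (6.00, 3.00, "made-up dominated point"),  # hypothetical, beaten by C
]
for xe, tgt, label in pareto_frontier(points):
    print(f"xentropy={xe:.2f} target={tgt:.2f} {label}")
```

Each prompt on the frontier is the best available trade-off at its fluency level: getting a higher activation requires accepting a higher cross-entropy.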

We also plot the evolution of the Pareto frontier over the course of the optimization run.

linestyles = ['k--o', 'k:o', 'k--*', 'k:*']
for i, n in enumerate([20, 40, 100, 300]):
    pareto = build_pareto_frontier(tokenizer, history.subset(slice(0, n)))
    ordering = np.argsort(pareto.xentropy)
    plt.plot(pareto.full_xentropy, pareto.full_target, linestyles[i % len(linestyles)], label=f"{n} iterations")
plt.xlabel('Cross-entropy')
plt.ylabel('Activation')
plt.xlim([4, 12])
plt.ylim([-0.25, 5])
plt.legend(loc='lower right')
plt.show()

Thresholding cross-entropy

An alternative way of visualizing the results of an EPO run is to consider only the subset of prompts with cross-entropy below some fixed threshold. Below, we plot the maximum activation at each of the 300 EPO iterations for six different thresholds; the largest threshold, 15, is labeled “No cross-entropy filter”. The title of each plot shows the maximally activating prompt under the cross-entropy threshold across all iterations. The sharp drops every 50 iterations are from restarts. Sometimes the activation has plateaued before a restart, and other times it is still improving when the restart occurs. This suggests that a more adaptive restarting algorithm would perform better.

plt.figure(figsize=(8, 12), constrained_layout=True)
for i, thresh in enumerate([5, 6, 7, 8, 9, 15]):
    plt.subplot(3, 2, i + 1)
    best_under = np.where(history.xentropy < thresh, history.target, 0).max(axis=-1)
    plt.plot(best_under, 'k-')
    if i >= 3:
        plt.xlabel("Iteration")
    plt.ylabel("Max activation")
    plt.ylim(-0.1, 5)
    if thresh > 14:
        plt.text(0.05, 0.95, "No cross-entropy filter", transform=plt.gca().transAxes, va="center")
    else:
        plt.text(0.05, 0.95, f"Cross-entropy < {thresh}", transform=plt.gca().transAxes, va="center")

    flat_xe = history.xentropy.flatten()
    flat_target = history.target.flatten()
    best_idx = np.where(flat_xe < thresh, flat_target, 0).argmax()
    best_ids = history.ids.reshape((-1, history.ids.shape[-1]))[best_idx]
    best_text = tokenizer.decode(best_ids)
    plt.title('"' + best_text + '"', fontsize=7)
plt.show()

Causal token attribution

The visualizations below show the sensitivity to each token in the prompts. We first filter to the 32 “best” alternative tokens based on backpropagated token gradients. Then, amongst those 32 tokens, we calculate two sensitivities:

- the drop in activation from swapping the token to the next-highest-activation alternative token. In the visualization, we show this as the height of the token bars.
- the drop in activation from swapping the token to the lowest-activation alternative token. In the visualization, we show this with the color of the tokens. Darker reds indicate a larger drop in activation.
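The two sensitivities reduce to simple arithmetic once the alternative-token activations are known. The sketch below is a hedged illustration with a hypothetical function name, not the resample_viz implementation. It uses the comma token from the first visualization below (prompt activation 2.45), with only the four alternative activations visible in the tooltip rather than all 32.

```python
def token_sensitivities(base_activation, alternative_activations):
    """Return (drop to best alternative, drop to worst alternative).

    base_activation is the activation of the unmodified prompt.
    alternative_activations holds the activation after swapping one token
    for each candidate replacement.
    """
    best_alt = max(alternative_activations)
    worst_alt = min(alternative_activations)
    return base_activation - best_alt, base_activation - worst_alt

# Comma token: Top-3 alternatives 1.866, 1.574, 1.567 and worst 0.000.
drop_best, drop_worst = token_sensitivities(2.45, [1.866, 1.574, 1.567, 0.000])
print(round(drop_best, 3), round(drop_worst, 3))
```

A negative first value would mean that some alternative token activates the neuron more than the original token does.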

The visualizations are interactive. Hover over each token to see a tooltip with the top-3 highest activation alternative tokens and the single lowest alternative token.

We show attribution visualizations for each prompt on the Pareto frontier. For all the prompts, swapping the last token can reduce the neuron activation to zero. Swapping other tokens reduces the activation much less. The comma in the second-to-last position is also important and often has no viable substitute, which is indicated by its tall bar.

for i in range(len(ordering)):
    _, viz_html = resample_viz(
        model,
        tokenizer,
        runner,
        torch.tensor(pareto.ids[ordering[i]]).to(model.device),
        target_name="L8.N1 activation",
    )
    display(HTML(viz_html))

L8.N1 activation: 2.45 Cross-entropy: 4.42

study Worst: '<|endoftext|>', 0.777; Top-3: (' surprise', 2.711), ('Health', 2.654), (' highlights', 2.639)
provided Worst: ' past', 1.745; Top-3: (' take', 3.113), (' seen', 3.088), (' takes', 3.043)
another Worst: ' storing', 0.721; Top-3: ('The', 2.707), (' The', 2.664), ('Take', 2.650)
example Worst: 'If', 0.704; Top-3: (' fascinating', 2.355), (' highlights', 2.025), (' highlight', 1.979)
of Worst: ' decreases', 1.067; Top-3: ('uing', 2.213), ('are', 2.111), ('ining', 2.070)
similar Worst: '((', 0.290; Top-3: (' is', 3.100), (' In', 2.729), (' inconsistent', 2.721)
- Worst: '99', 2.152; Top-3: ('iz', 2.635), ('ores', 2.619), ('ok', 2.611)
un Worst: ' Despite', 1.612; Top-3: (' Maybe', 2.947), ('Almost', 2.727), (' Perhaps', 2.709)
matched Worst: ' Island', 1.874; Top-3: (' inverted', 2.672), (' lucky', 2.615), (' shuffle', 2.586)
pairs Worst: ' and', 0.147; Top-3: ('For', 2.793), ('r', 2.750), ('get', 2.748)
, Worst: ' f', 0.000; Top-3: ('�', 1.866), ('2', 1.574), ('to', 1.567)
with Worst: '<|endoftext|>', 0.000; Top-3: (' potentially', 2.424), (' without', 2.160), (' accompanied', 1.850)

L8.N1 activation: 2.80 Cross-entropy: 4.44

study Worst: 'If', 0.960; Top-3: (' remember', 2.883), (' potion', 2.854), (' journey', 2.840)
found Worst: ' perfect', 0.740; Top-3: (' encountered', 3.115), (' noticed', 3.016), (' highlighted', 3.008)
another Worst: ''t', 0.618; Top-3: (' interesting', 2.617), (' fascinating', 2.607), ('known', 2.445)
pattern Worst: ' where', 0.757; Top-3: (' take', 2.508), (' like', 2.385), (' this', 2.213)
of Worst: ' Even', 1.514; Top-3: (' I', 3.070), (' i', 2.977), (' 1', 2.893)
similar Worst: '-(', 1.491; Top-3: (' Like', 3.100), (' adjustable', 3.012), (' Takes', 3.008)
- Worst: '<|endoftext|>', 1.514; Top-3: ('True', 2.936), ('ille', 2.934), ('iz', 2.852)
un Worst: ' Despite', 1.522; Top-3: ('That', 3.006), (' maybe', 2.900), (']]', 2.895)
matched Worst: ' both', 2.188; Top-3: ('-', 2.809), ('jit', 2.785), ('action', 2.768)
pairs Worst: ' Random', 1.240; Top-3: ('num', 3.160), ('FIL', 3.084), ('bool', 3.072)
, Worst: ' System', 0.000; Top-3: ('()', 2.291), ('names', 2.018), ('//', 1.967)
with Worst: ' There', 0.000; Top-3: (' as', 2.463), (' enough', 2.244), (' so', 2.195)

L8.N1 activation: 2.80 Cross-entropy: 4.44

study Worst: 'If', 0.960; Top-3: (' remember', 2.883), (' potion', 2.854), (' journey', 2.840)
found Worst: ' perfect', 0.740; Top-3: (' encountered', 3.115), (' noticed', 3.016), (' highlighted', 3.008)
another Worst: ''t', 0.618; Top-3: (' interesting', 2.617), (' fascinating', 2.607), ('known', 2.445)
pattern Worst: ' where', 0.757; Top-3: (' take', 2.508), (' like', 2.385), (' this', 2.213)
of Worst: ' Even', 1.514; Top-3: (' I', 3.070), (' i', 2.977), (' 1', 2.893)
similar Worst: '-(', 1.491; Top-3: (' Like', 3.100), (' adjustable', 3.012), (' Takes', 3.008)
- Worst: '<|endoftext|>', 1.514; Top-3: ('True', 2.936), ('ille', 2.934), ('iz', 2.852)
un Worst: ' Despite', 1.522; Top-3: ('That', 3.006), (' maybe', 2.900), (']]', 2.895)
matched Worst: ' both', 2.188; Top-3: ('-', 2.809), ('jit', 2.785), ('action', 2.768)
pairs Worst: ' Random', 1.240; Top-3: ('num', 3.160), ('FIL', 3.084), ('bool', 3.072)
, Worst: ' System', 0.000; Top-3: ('()', 2.291), ('names', 2.018), ('//', 1.967)
with Worst: ' There', 0.000; Top-3: (' as', 2.463), (' enough', 2.244), (' so', 2.195)

L8.N1 activation: 2.98 Cross-entropy: 4.60

study Worst: ' ensure', 1.014; Top-3: (' remember', 3.246), (' journey', 3.178), (' outing', 3.160)
found Worst: 'Represent', 1.725; Top-3: (' encountered', 3.502), (' noticed', 3.170), (' encounter', 3.094)
another Worst: ' getting', 0.897; Top-3: (' out', 2.584), (' take', 2.510), ('our', 2.428)
example Worst: 'If', 1.215; Top-3: (' this', 2.213), (' is', 2.172), (' my', 2.145)
of Worst: '((', 0.761; Top-3: (''m', 3.395), (' called', 3.254), ('�', 2.783)
similar Worst: '("', 1.018; Top-3: (' After', 3.141), (' like', 3.123), (' In', 3.109)
- Worst: '<|endoftext|>', 1.471; Top-3: ('ille', 3.125), ('iz', 3.086), ('�', 3.068)
un Worst: ' devices', 2.066; Top-3: (' Maybe', 3.170), (' maybe', 3.119), (' liked', 3.076)
matched Worst: ' ozone', 2.521; Top-3: (' inverted', 3.166), (' decision', 3.037), (' incidence', 2.996)
pairs Worst: ' False', 1.593; Top-3: ('num', 3.480), ('match', 3.439), ('design', 3.357)
, Worst: ' jaw', 0.000; Top-3: (' here', 2.363), (' however', 2.311), (' but', 2.104)
with Worst: '<|endoftext|>', 0.000; Top-3: (' although', 2.957), (' but', 2.484), (' additionally', 2.473)

L8.N1 activation: 3.50 Cross-entropy: 5.15

study Worst: ' ensure', 1.600; Top-3: (' observation', 3.592), (' town', 3.572), (' remember', 3.551)
encountered Worst: ' Across', 1.575; Top-3: ('rov', 2.795), ('rett', 2.773), (' counters', 2.736)
another Worst: ' make', 2.018; Top-3: (' John', 3.430), (' take', 3.365), (' town', 3.307)
example Worst: 'If', 1.681; Top-3: (' phenomenon', 3.490), (' fascinating', 3.307), (' episode', 2.926)
of Worst: '?"', 1.208; Top-3: (''m', 3.816), ('ivating', 3.238), ('--', 3.035)
similar Worst: '("', 1.020; Top-3: (' is', 3.891), (' -', 3.814), (' are', 3.658)
- Worst: '�', 3.246; Top-3: ('ille', 3.617), ('iz', 3.602), ('et', 3.598)
un Worst: ' Despite', 1.779; Top-3: (' blind', 3.664), (' blending', 3.594), (' match', 3.582)
matched Worst: 'ás', 3.066; Top-3: (' inverted', 3.633), (' shuffle', 3.617), (' matching', 3.568)
pairs Worst: ' motivated', 1.583; Top-3: ('match', 3.795), ('num', 3.760), ('haus', 3.729)
, Worst: '<|endoftext|>', 0.023; Top-3: (' here', 2.568), ('ames', 2.469), ('ets', 2.428)
with Worst: '<|endoftext|>', 0.000; Top-3: (' relying', 3.643), (' including', 2.344), (' especially', 2.332)

L8.N1 activation: 4.48 Cross-entropy: 7.59

study Worst: '_', 2.846; Top-3: (' involvement', 4.547), (' development', 4.488), (' Data', 4.281)
found Worst: 'agraph', 1.615; Top-3: (' noticed', 4.297), (' seen', 4.172), (' highlighted', 4.133)
another Worst: ' If', 0.915; Top-3: (' strange', 3.670), (' clear', 3.377), (' was', 3.314)
pattern Worst: ';', 0.543; Top-3: (' seen', 3.510), (' use', 3.502), (' take', 3.467)
by Worst: 'agu', 3.014; Top-3: (' during', 4.383), (' at', 4.371), (' using', 4.316)
told Worst: 'Answer', 3.424; Top-3: (' attributed', 4.391), (' received', 4.387), (' noticed', 4.383)
- Worst: ' considerations', 2.334; Top-3: (' attract', 4.277), (' borrow', 4.227), (' attracted', 4.199)
Mike Worst: ' losses', 2.174; Top-3: ('ixie', 4.395), ('ogo', 4.359), ('ony', 4.336)
Hey Worst: '"?', 2.256; Top-3: ('ention', 4.238), ('onto', 4.230), ('weet', 4.219)
de Worst: ' cannibal', 3.965; Top-3: ('gil', 4.621), ('laus', 4.605), ('endi', 4.594)
, Worst: ' decision', 0.113; Top-3: ('Ev', 3.006), ('vert', 3.000), ('ipation', 2.850)
making Worst: ' How', 0.000; Top-3: (' without', 3.291), (' because', 3.133), (' to', 3.070)

L8.N1 activation: 4.75 Cross-entropy: 8.69

findings Worst: ' Unlike', 2.154; Top-3: ('bl', 4.746), ('oys', 4.688), ('isk', 4.676)
noted Worst: 'Neither', 1.698; Top-3: (' notices', 4.719), (' noticed', 4.699), (' spotted', 4.562)
another Worst: ' which', 1.312; Top-3: (' two', 3.750), ('A', 3.641), ('a', 3.512)
paradox Worst: ' never', 2.424; Top-3: (' case', 4.195), (' one', 3.656), (' taking', 3.598)
... Worst: ';', 2.314; Top-3: (' –', 4.234), (' -', 4.211), (' —', 4.203)
for Worst: ' who', 1.547; Top-3: (' off', 4.395), (' #', 4.266), (' by', 4.266)
_ Worst: ' Within', 3.043; Top-3: (' Rob', 3.996), (' size', 3.986), (' recalls', 3.986)
x Worst: ' Human', 3.998; Top-3: ('May', 4.801), ('Bi', 4.742), ('Co', 4.711)
Blake Worst: ' flooded', 3.434; Top-3: (' Aval', 4.754), (' Mish', 4.742), (' Rip', 4.715)
, Worst: ' Although', 0.357; Top-3: (' *', 4.410), (' +', 4.340), ('..', 4.312)
, Worst: ' However', 0.227; Top-3: (' heard', 2.730), (' enjoyed', 2.502), (' basically', 2.400)
although Worst: '-', 0.000; Top-3: (' as', 3.809), (' to', 3.791), (' if', 3.525)

L8.N1 activation: 4.80 Cross-entropy: 9.49

findings Worst: 'Get', 3.012; Top-3: (' Wan', 4.387), (' campaign', 4.379), (' Mas', 4.324)
found Worst: '?"', 2.156; Top-3: (' remember', 4.449), (' remembered', 4.395), (' mentions', 4.262)
another Worst: ' although', 0.634; Top-3: (' à', 3.934), ('Next', 3.537), (' Other', 3.529)
twist Worst: ' cancel', 3.672; Top-3: (' kicker', 4.465), (' funny', 4.387), (' humour', 4.387)
see Worst: ' Flight', 3.949; Top-3: (' football', 4.520), (' tennis', 4.512), ('OSS', 4.508)
study Worst: '`,', 3.445; Top-3: ('*', 4.719), (' 1950', 4.707), (' ()', 4.641)
_ Worst: 'A', 0.000; Top-3: (' …', 3.555), (' where', 1.688), ('def', 0.069)
un Worst: ' Load', 3.869; Top-3: ('Log', 4.906), ('Answer', 4.902), ('Hash', 4.887)
interrupted Worst: ' Statistical', 4.020; Top-3: (' Principle', 4.848), ('ABC', 4.824), (' Hardy', 4.820)
aversion Worst: ' cite', 3.535; Top-3: ('omsday', 4.637), ('tymology', 4.621), (' Argon', 4.605)
control Worst: ' which', 1.398; Top-3: ('os', 4.711), (' II', 4.711), ('IS', 4.707)
although Worst: '-', 0.000; Top-3: (' to', 3.316), (' if', 3.266), (' into', 3.166)

L8.N1 activation: 4.82 Cross-entropy: 10.28

findings Worst: 'Bob', 3.613; Top-3: (' Wan', 4.379), (' Bond', 4.301), (' team', 4.297)
found Worst: '?"', 1.916; Top-3: (' remember', 4.348), (' remembered', 4.312), (' showcases', 4.195)
another Worst: ' ((', 0.625; Top-3: (' à', 3.912), (' In', 3.451), (' de', 3.373)
twist Worst: ' slides', 3.492; Top-3: (' kicker', 4.449), (' humour', 4.344), (' funny', 4.301)
see Worst: ' Star', 3.791; Top-3: ('�', 4.426), ('alsa', 4.410), ('apo', 4.406)
study Worst: ','', 3.037; Top-3: ('*', 4.602), (' 1950', 4.590), (' online', 4.551)
_ Worst: 'A', 0.000; Top-3: (' from', 4.699), (' do', 4.355), (' called', 4.230)
done Worst: '(', 2.484; Top-3: ('pped', 4.891), (' by', 4.887), ('aked', 4.867)
ivariate Worst: ' Statistical', 3.707; Top-3: ('wik', 4.793), (' graphical', 4.770), ('iminary', 4.762)
aversion Worst: ' Comment', 3.689; Top-3: ('pin', 4.621), (' pH', 4.594), (' inclusion', 4.586)
control Worst: ' some', 0.629; Top-3: ('ining', 4.715), ('ing', 4.691), ('IS', 4.668)
although Worst: '-', 0.000; Top-3: (' to', 3.254), (' into', 3.117), (' as', 3.107)

L8.N1 activation: 4.84 Cross-entropy: 10.65

findings Worst: 'Under', 3.744; Top-3: (' Roll', 4.477), (' Wan', 4.387), (' team', 4.367)
found Worst: '?"', 2.031; Top-3: (' remember', 4.387), (' remembered', 4.363), (' showcases', 4.227)
another Worst: ' Although', 0.654; Top-3: (' An', 3.635), ('Next', 3.506), (' In', 3.492)
twist Worst: ' slides', 3.438; Top-3: (' kicker', 4.453), (' funny', 4.375), (' humour', 4.359)
see Worst: ' Sodium', 3.941; Top-3: ('apo', 4.535), ('anna', 4.496), ('uan', 4.492)
study Worst: ';', 3.201; Top-3: ('*', 4.691), (' found', 4.555), (' software', 4.551)
_ Worst: 'A', 0.000; Top-3: (' did', 4.305), (' where', 1.965), ('def', 0.320)
ds Worst: ' Galaxy', 4.102; Top-3: ('itivity', 4.832), ('azz', 4.797), ('istent', 4.797)
ivariate Worst: '===', 4.176; Top-3: ('iminary', 4.828), (' Nielsen', 4.766), (' Draper', 4.758)
aversion Worst: ' «', 2.484; Top-3: (' filming', 4.602), ('reporting', 4.598), (' pseudonym', 4.566)
control Worst: ' some', 0.612; Top-3: ('ining', 4.668), ('ing', 4.664), ('IS', 4.617)
although Worst: '-', 0.000; Top-3: (' to', 3.211), (' into', 3.053), (' as', 3.039)

References

[1] A. Mordvintsev, C. Olah, and M. Tyka, “Inceptionism: Going deeper into neural networks.” 2015. Available: https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html

[2] N. Cammarata et al., “Thread: circuits,” Distill, 2020, doi: 10.23915/distill.00024.

[3] C. Olah, A. Mordvintsev, and L. Schubert, “Feature visualization,” Distill, 2017, doi: 10.23915/distill.00007.

[4] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, “Understanding neural networks through deep visualization.” 2015. Available: https://arxiv.org/abs/1506.06579

[5] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models.” 2023. Available: https://arxiv.org/abs/2307.15043

Citation

BibTeX citation:

@online{thompson2024,
  author = {Thompson, T. Ben and Straznickas, Zygimantas and Sklar,
    Michael},
  title = {Fluent Dreaming for Language Models},
  date = {2024-01-23},
  url = {https://confirmlabs.org/posts/dreamy.html},
  langid = {en}
}

For attribution, please cite this work as:

T. B. Thompson, Z. Straznickas, and M. Sklar, “Fluent dreaming for language models,” Jan. 23, 2024. https://confirmlabs.org/posts/dreamy.html