The AI tower of babbling
Fog of Worn
I obviously can’t speak for anyone in other areas, but the end of May and start of June has been a rough time for a lot of us in Te Waipounamu cities who have children attending school or who have family and friends working in the health and education systems. Omicron, winter flu, and assorted respiratory illnesses are tearing about the place like a pack of terriers in a rat-infested barn. Hospitals are overwhelmed. Schools have large percentages of staff out on any given day. I’ve never seen anything like this level of sickness in the community before.
For many people around the world, this was the defining experience of 2020 and 2021. For New Zealand, our rapid shift in Covid policy from the elimination strategy to letting it rip is going to have long term consequences that we still don’t understand for both public health and politics.
Dealing with these disruptions—along with several weeks of low level sinus infections, headaches, brain fog and coughing fits—has unavoidably impacted what I’m doing with Fictiveworks. We’ve got exciting work in progress we really want to share but everyone I’m collaborating with has had similar experiences of getting sick. It seems unrealistic and unreasonable to keep on trucking or plan for a return to normal. I’m sure we’ll be able to figure out ways to do better, but doing better will have to wait until we’re all feeling better.
Mass panic, the new normal, and living in the past
Three insightful pieces that resonated with me this week, offering divergent—yet perhaps complementary—ways to interrogate and understand the present moment.
The Myth of Panic by Tanner Greer looks at how distrust in mass behaviour gave rise to an orthodoxy amongst Western ruling classes “that every city was only one catastrophe away from barbarity”. Yet when we observe how people act in a punishing crisis, we see group solidarity, altruism and caring behaviour, not panic and mob rule.
From crisis to common sense by Anna Carlson explains how crisis and emergency measures are not exceptional, but rather, a standard operating feature of systems that rely on hegemonic domination.
We Are Not Living in a Simulation, We Are Living In the Past by Michael Sacasas will absolutely make you question what you think is wrong with the contemporary internet. Provocative and useful as a sharpening stone (though I’ll save my digression on how this framing breaks down for later, when I can go into more detail about the semantics of internet information and the relationship of human cognition to the dynamic, dissipative structure of text retrieval and algorithmic timelines).
DALL-E’s in space
Whatcha doin’ out there, man? That’s pretty freaky, DALL-E
Tower of Babel made from Autumn leaves.
If we’re going to wonder whether our sense of the present is really a restricted accumulation of content from the recent past, we need to talk about DALL-E 2, successor to the 12-billion parameter GPT-3-style transformer known as DALL-E (as well as Google’s Imagen). Being vaguely aware of the impending release of these new models in May–June, I returned to Twitter after months away, in part because of my curiosity about the impression they would create and the meme production that might follow.
Recent developments in 2021 and 2022 are really the first time we’ve seen these types of AI models surface such structured seams of stylistic concordance that they can be used as generalised text-to-image generators without confusing people. Helping this trend along is increasing attention being paid to the usability of their interfaces. Not long ago, it simply wasn’t possible to get anywhere without hacking through ad-hoc accumulations of code and parameters in collaborative notebook platforms or wrangling Docker containers with gigabytes of brittle Python and C++ dependencies, presenting a significant barrier to entry for artists and researchers outside the field.
While these are still not mainstream-friendly user experiences and access remains limited, the barriers to entry are significantly reduced. Anyone who has used a search engine will understand the principle of how a text-to-image prompt works, and big AI research labs are increasingly making their findings more accessible in this way. There’s even talk of ‘prompt engineering’ becoming a legit product and tech role (though what that entails is still open to debate).
Researchers deploying these large generative models are now in the habit of accompanying their previews with notes on harm reduction and adversarial analysis. Beyond more widely-known risk mitigations around bias, harassment and deepfakery, one of the most revealing details discussed was a phenomenon they’ve termed ‘reference collisions’:
An interesting cause of spurious content is what we informally refer to as ‘reference collisions’: contexts where a single word may reference multiple concepts (like an eggplant emoji), and an unintended concept is generated. The line between benign collisions (those without malicious intent, such as ‘A person eating an eggplant’) and those involving purposeful collisions (those with adversarial intent or which are more akin to visual synonyms, such as ‘A person putting a whole eggplant into her mouth’) is hard to draw and highly contextual. This example would rise to the level of ‘spurious content’ if a clearly benign example (‘A person eating eggplant for dinner’) contained phallic imagery in the response.
This is framed as common sense in the context of risk mitigation. But there’s something strangely unsettling about this notion of generating unintended concepts, implying the prompt text is (or should be?) composed in a wholly intentional and deliberative way, whether ‘benign’ or ‘adversarial’. It reflects a design ethos that assumes users want to receive the most literal, direct and unambiguous image result from their supplied text. This doesn’t accord with the open-ended creative way many people think about and use these systems, where sharing the prompt has become part of the spectacle itself, nor with the irreducible ambiguity and stylistic quirks of the generated results. Also, isn’t all this just an explanation of how memes and visual comedy work?
When I fed in prompts like ‘octopus in coffee cup’, I did not get images of an idealised platonic octopus or cup. Instead, the results settled around simulacra of hand-drawn heavy outline illustrations with black ink and watercolours:
Trying to concretise and separate the idea of concepts from the ideas of styles or aesthetics is not a game that seems winnable at this scale.
As researchers explored the DALL-E 2 preview and the floodgates of generated images opened up on social media, wild claims emerged about a ‘secret language’ embedded in the model, with a paper rushed to arXiv and later amended in response to criticism that it misrepresented semantics and language.
The adversarial hack they identified is quite straightforward. If you’re interested, you can try it out for yourself. The first step is to generate images targeting a particular object, topic or theme, with the prompt arranged to include text in the output (‘...with subtitles’ is a suffix that seems to work consistently). These images should surface various garbled and gibberish lines of text which can then be fed into the prompt individually to scan for repeatable concept associations to images.
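The probing loop described above can be sketched in a few lines of Python. Here `generate_images` is a hypothetical stand-in for whichever text-to-image interface you have access to, assumed to return the legible-ish text fragments found in the output images (in practice you’d transcribe them by eye or with OCR):

```python
def probe_gibberish(seed_prompt, generate_images):
    """Collect garbled text fragments from generated images, then re-feed
    each fragment as its own prompt to check for repeatable associations."""
    # Step 1: coax the model into rendering text alongside the subject.
    fragments = generate_images(seed_prompt + ", with subtitles")
    # Step 2: feed each gibberish fragment back in individually and record
    # what imagery it produces, so repeatable associations can be spotted.
    associations = {}
    for frag in fragments:
        associations[frag] = generate_images(frag)
    return associations
```

The seed prompt ‘two whales talking about food’ and fragments like ‘vicootes’ are the examples reported in the paper; everything else here is an illustrative scaffold, not any lab’s actual API.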
Although hugely entertaining and of genuine interest to artists and computer scientists, it’s likely these patterns are complex random artefacts derived from regularities of glyph forms and mappings to text tokens with no coherent or meaningful symbolic associations. Rather than hyperbole about language, we should be more open to acknowledging these things as haunted artefacts of the process itself: colossal volumes of human communication data sluiced into pixel kernels and character arrays and projected through linear algebra.
Philosophy of science and sociology of knowledge used to be understood as disciplines with the focus and range to engage with questions like these. There’s a considerable body of research from the 1950s onwards exploring problems of scientific representation, the use of metaphor and analogy, and the relationship between models and theory. Some of the best recent work dealing with contextualising and explaining contemporary AI practices has come from scholars of science and technology studies whose work links back in part to this intellectual tradition.
In response to discussions around DALL-E and Imagen, it’s striking to see so many people dismissing critical AI scholarship as ‘doing nothing’ or ‘attempting to block progress’. Making this worse is the state of contemporary philosophy, bedevilled by speculative AI consciousness freaks and AGI cheerleaders. A continuing problem is the general unwillingness to treat examining the technology actually being deployed as a higher priority than thought experiments.
Everyone working close to this technology needs to keep reminding themselves that ‘artificial intelligence’ and ‘learning’ are metaphors and analogies, not concrete descriptions of how it works. The naive view of progress espoused by so many in the field masks an extraordinary capture of resources by a small group of companies and leads to misguided claims about AI replacing artists and writers.
“When’s the first novel written by AI coming?” people wonder. But it’s a trick question. The efforts going into these large foundational language models are not seriously aimed at outcomes like this. The research community and fandom pay no attention to the artistic works written by AI that already exist, nor do they cite the papers on automated story synthesis and computational creativity that paved the way.
GPT-3 and related statistical methods are insouciant babblers. It’s amazing when these models generate poems, song lyrics, and garrulous op-eds from a prompt, but this mostly seems to function as spectacle with the audience in on the ruse. As creative and expressive tools, what does this give us? Do we really want endless streams of individual words that only cohere into sentences and paragraphs in a probabilistic sense, decoupled from any notions of authorial voice or synthesis of discourse, tone, characters, themes and plot events?
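To make ‘cohere only in a probabilistic sense’ concrete, here is the crudest possible babbler: a toy bigram model. Real language models condition on far longer contexts with learned representations, but the generation step is the same in spirit—each word is sampled from a distribution conditioned on what came before, with no notion of voice, theme or plot:

```python
import random
from collections import defaultdict

def train_bigrams(text):
    """Record, for each word, the words observed to follow it."""
    model = defaultdict(list)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)
    return model

def babble(model, start, length=10, seed=0):
    """Generate a word stream where each word is chosen only by its
    probability of following the previous one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:  # dead end: no word ever followed this one
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

Trained on any corpus, every adjacent pair in the output is locally plausible, yet nothing above the pair level was ever ‘meant’—which is the point.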
Robin Sloan is probably right to say that these AI tools are not useful for serious writers. For some, there’s an implied yet in that statement; for others, a suggestion that refusing to use AI in their practice might be a stance marking their seriousness. But there’s a refreshing optimism in the analogy to 1970s synths:
AI art recalls the early days of synthesizers, perhaps; what was Switched-On Bach if not “I see what you did there”? I hope that analogy is right, because the synth provides a healthy, sustainable prototype for this genre. Ubiquitous and unremarkable, controllable and hackable, with flavors ranging from the fully corporate to the gloriously DIY.
In videogames and procedural narrative, we work towards story volumes rather than individual stories, with both hand-authored content and systems combining to make a possibility space that is resonant with the particular themes, moods, tones and genre-specific details that will shape how players and readers experience the work. For example, the emergent stories in Dwarf Fortress might be random, but we perceive them as part of the possibility space of Dwarf Fortress, identifiable and distinct like no other sandbox game. Pulling text fragments or pixels from a gargantuan vectorised fatberg encoding the entire internet is a process at odds with this.
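By contrast, a hand-authored possibility space can be sketched as a tiny generative grammar. Every symbol and phrase below is curated (and invented here purely for illustration, not taken from any actual game), which is why anything the system produces stays recognisably in voice:

```python
import random

# A minimal hand-authored story grammar: each symbol expands to one of a
# small set of curated alternatives, so the whole possibility space is
# known and tonally consistent by construction.
GRAMMAR = {
    "story": ["A {dwarf} {deed} and the fortress {outcome}."],
    "dwarf": ["grizzled miner", "melancholy brewer", "legendary engraver"],
    "deed": ["struck adamantine", "tantrumed in the dining hall",
             "befriended a cave crocodile"],
    "outcome": ["rejoiced", "flooded with magma", "carried on, barely"],
}

def expand(symbol, rng):
    """Pick one alternative for the symbol, then recursively fill in any
    {slot} references with expansions of their own."""
    template = rng.choice(GRAMMAR[symbol])
    while "{" in template:
        start = template.index("{")
        end = template.index("}", start)
        inner = template[start + 1:end]
        template = template[:start] + expand(inner, rng) + template[end + 1:]
    return template
```

The output is random, but every sentence it can ever produce was implicitly authored—the opposite design stance to pulling fragments from a web-scale model.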
It’s no accident that many of the most successful AI art projects have involved training and building their own models from carefully curated input data. This is not to deny that generalised models like GPT-3 can outperform more specialised models trained for a particular task, but to ask how meaningful this scale and performance actually is in a domain judged almost entirely on how it vibes with an authorial voice, intentionality and aesthetics.
Behind overstated notions of automating and replacing artists and writers is a set of dogmatic assumptions about progress as a consolidation of power, speed and scale. But this is not a fait accompli. The sooner we can begin to understand this new medium for what it really is—the combination of curation and synthesis—the sooner we can effectively shift industry focus towards tools and systems that surface these processes directly to offer more genuine creative control. Embracing co-creation rather than AI supremacy is the way forward if we are to genuinely transcend spectacle, parlour tricks and pomo recapitulation.