Hello dear readers,
It is that time of year. The heat has not yet let up, but other things are stirring. Invisible bureaucratic forces tug just a bit more every day. They make themselves known through our calendars and inboxes, especially if we are teachers, or students, or parents of students. The school year is upon us here in the United States.
And this year, that means many an e-mail or newsletter trying to cope with ChatGPT. Reader, this is one of those e-mails. (That's your warning. There's still time to close your e-mail and do something else, like read Mary Shelley's Frankenstein or go for a very long walk.)
A few weeks ago a mentor and friend asked me how I planned to address ChatGPT, or how we as faculty ought to engage with this technology as we try to prepare our students to be scholars and thinkers in a world where it exists. This is my sort of question: so far my entire scholarly career has been devoted to trying to understand how corporations, states, and the experts they employ have used, abused, made, and re-made the categories that structure our daily experiences of being human. I've written books about two of the biggest and most important aggregators of personal data in American history: the life insurance industry and the US Census. (Did you know, btw, that Democracy's Data is out in paperback next week?)
Yet, up until I received that e-mail, I had never even opened a ChatGPT account or attempted a conversation. I had in fact been avoiding it. So, this newsletter is an attempt to explain why I was (and am) wary. I hope to aid my fellow teachers (broadly defined) in thinking about how to embrace that wariness in a productive way.
A fair number of you among my readers found me thanks to my best friend, Robin Sloan, who has been generous with his mentions in the past. One of the things I admire about Robin is the way he approaches new technological systems with great sophistication, but also a buoyant openness. He will acknowledge the ways that systems fail us---after all, he briefly worked for Twitter, but then was bitten by the bug for writing novels, fell in love with the magic of book publishing (an old technology in many ways, but one that produces beautiful art), and left almost every scrolling timeline so he could imagine other ways to share ideas and make connections. He isn’t perfect as a forecaster by any means. (Ahem, Sloan, remember that time you convinced me that cryptocurrencies were not going away?... Let's ignore the part where I believed you.) Robin would never claim to be perfect. I think one of his superpowers is his willingness to really investigate the possibilities for invention opened up by new tech.
And so, I took it seriously when he wrote this in his June newsletter:
I see where Sloan is coming from. A few years ago, he put together an artisanal dataset to train a language model. I was visiting for a few days when it went live for the first time, and the "voyages in sentence space" I witnessed were revelatory and delightful. And as Robin recently reminded me when he read a draft of this piece, he was talking about much more than ChatGPT in that June newsletter. There are excellent examples of people trying to do something really different with language models. Check out, for example, this 2019 talk by Everest Pipkin about "the role of curation in building generative textual systems." (We'll come back to this.) So: what I am about to say has more to do with the now-dominant generation of large language models controlled by very large companies---the sort of endeavors that really like to brand themselves as doing "AI."
Is it possible to train students to use something like ChatGPT in a manner that is appropriate? Or right? Or ethical? Or, most fundamentally, good?
Those are probably apt questions to ask about any big new system we’re set to engage with. But, for three reasons, I approach really big, corporate large language models (LLMs) suspecting that the answer might be: no.
First, there is the evidence that these large language models generate unsound or harmful results.
Second, there’s the way that the process of building better large language models contributes to the monopolization of this new means of making knowledge.
Third, there’s the problem of possible theft, and the degree to which the labor done for us by large language models is unfairly harvested from unwitting and uncredited writers at scale.
I’ll begin with the potential for generating bad knowledge. Any large language model, or any system we now usually just call “AI,” is built from an underlying data set---and the assumptions, biases, and curations that built that set will necessarily be reproduced by the model. In an influential California Law Review article from 2016, Solon Barocas and Andrew D. Selbst explained how models built on biased data lean toward injustice, inevitably producing illegal or unethical acts of discrimination. In her investigation of Google’s search algorithms, Safiya Umoja Noble illustrated the ways that the system privileged a vision of the world as seen by white, straight, American men; how it guided searchers to white supremacist literature to an alarming degree; and how the FCC and similar regulators failed to protect marginalized groups from this powerful new information utility. As Google grew into a search monopoly, the particular stance it took toward the world became all the more important, and so potentially dangerous.
After an LLM digests its training data, it returns and reproduces any inherent limitations in new and unexpected ways, generating surprising discoveries and also strange, disturbing, or dangerous errors. In what is far and away the best metaphor for how ChatGPT (probably) works, Ted Chiang calls the system a “blurry jpeg of the web.” The system compresses a vast store of the web’s data much as jpegs compress photographic information or MP3s compress music. The result is a certain loss of quality and also a profusion of “artifacts,” which are sometimes called “hallucinations” by those committed to unhinged anthropomorphizing projections.
One response to the growth of all these systems has been to study the origins of the data sets that trained them. That’s undoubtedly good. It’s also really, really difficult: do we know what ChatGPT is built from? Can we know? The short answer seems to be: no. We have, instead, a general idea.
A July 2023 New York Times Magazine piece about Wikipedia’s response to the chat bots said the training data “included not just Wikipedia but also Google’s patent database, government documents, Reddit’s Q. and A. corpus, books from online libraries and vast numbers of news articles on the web.” But we can’t say for sure because “tech companies have stopped disclosing what data sets go into their A.I. models.”
(The ChatGPT 3.5 bot, an untrustworthy source as it repeatedly reminds its users, told me recently that
“specific details about the data sources and proportions used in training GPT-3.5 are proprietary and not publicly disclosed due to considerations like intellectual property and security.”
That much at least seems to be true.)
In their essay suggesting that data scientists could learn a thing or two from archivists, Eun Seo Jo and Timnit Gebru take an earlier iteration of the GPT model (GPT-2, trained largely on Reddit posts and non-Wikipedia pages linked to by those posts) and show how a clear “mission statement” can be written explaining how and why the underlying dataset came into being. An important point of the statement is to make clear that the set’s limitations preclude widespread commercial use—it would simply be too incomplete and too narrowly constrained by one internet community’s input. They then go a step further and write up a mission statement for a new data set that could be built as a complement to the original, one intended to make a more representative and complete set.
We can investigate data sets and call out their problems, but then what comes next? One response has been to find harmful or biased datasets and demand their deletion. As Nanna Bonde Thylstrup explained, this happened in 2019 with the “Brainwash data set” (which took its troubling name, apparently, from San Francisco’s Brainwash Café, where live-cam images were captured and stored) and with “80 million tiny images,” a collection used in computer vision studies that Vinay Prabhu and Abeba Birhane showed to be plagued by distressing and harmful content. Yet Thylstrup notes that removal or deletion is only a first step. You can remove a data set, but getting rid of its traces on the web, or the marks it left in models it already trained, is much trickier and maybe impossible. (On that note, I attended a meeting of Berkeley grad students that involved clever people studying “machine unlearning.” At first, I thought this was a joke, but it is actually an emerging field, one arising out of efforts to comply with legal requirements that biased data be removed from models.)
And this brings me to my second fear: that our attempts to use and improve large language models will feed the beast, concentrating data, power, and money in a few corporate hands.
A common, and common sense, response to biased data is: well, just add more data. If the current sample is insufficiently inclusive, include more people. (We see this even in the idea behind the complementary data set proposed by Jo and Gebru.) The nasty trick here is that a flawed system can end up justifying expanding that system’s reach. Sure, the AI is bad, but let it do more and see more and it will get better. But what if it doesn’t? The cost has been increased surveillance.
And not just more surveillance: but increased monopolistic power among those few companies capable of aggregating truly massive data sets. Denton et al, in a paper that I think about often, explain the prevailing way of framing the problem like this:
“discursive concerns about fairness, accountability, transparency, and explainability are often reduced to concerns about sufficient data examples. When failures are attributed to the underrepresentation of a marginalized population within a dataset, solutions are subsumed to a logic of accumulation, the underlying presumption being that larger and more diverse datasets will, in the limit, converge to the (mythical) unbiased dataset.” (9)
Next, they trace out a troubling implication of this reasoning:
“A major consequence of making the problem one of unrepresentative data is that those entities who already sit on massive caches of data and computer power will be the only ones who can make models more ‘fair,’ and therefore are the only ones who are well-suited and equipped to engage in the work of critique.” (9)
Only the Googles or Microsofts can solve the problem if the problem is one of inclusion.
We’ll be destined to turn over more and more control of our data, and the interpretation of that data, to a few corporations. That doesn’t sound good.
My third fear about the current crop of popular large language models is that they are extractive. I think we should take the metaphor of “data mining” more literally and think about these systems in the same light as the ravages of mountain-top removal.
In another newsletter, I'll get into the particular experience that made me feel this acutely. For now, it's enough to state the undeniable point: the material that trained a system like ChatGPT has millions upon millions of authors, vanishingly few of whom have been compensated or informed or credited.
Is this just fair use to the nth degree? Is the blending and remaking of authorial voices really so different from other artistic practices of collage, quotation, or sampling? Maybe. (Some of this will be left to courts to decide.) Surely the scale we're talking about raises the stakes.
If we could look under the hood, I suspect we’d find that the prevailing large language models have built their power from other people’s words, voices, and energies, without giving to those contributors meaningful control or compensation. My friend Joanna Radin, looking at the history of an older training data set, has already made a great case that consent alone is not nearly enough for us to venture further in this world of large language models. Nor is the promise of the powerful that your harvested data will be used to help you or your community.
If I squint, I can imagine a way that a democratically owned and managed system might be safely and justly entrusted with the mass aggregation of data. I mean, I couldn't describe exactly how this inclusive and egalitarian LLM would operate. I suspect it would work more like the census.
And now, having aired my fears, the question is: does this justify limiting or banning ChatGPT and similar large language models in our classrooms?
But I can also see the counter-argument: isn’t the classroom a good place to help students start to learn about these dangers directly? And more to the point, the problems I’ve outlined are not simply or even primarily data problems. Excessive monopoly power, vast inequality, and discriminatory and extractive markets plague "AI" because they plague our society more generally. The best way to fix LLMs would be to attack monopoly, concentrations of private capital, and inequality at their roots. Would I be so worried about ChatGPT if people had an enforceable right to a good-paying job, a home, and health care?
(Some public interest technologists can point to quite convincing examples of possible ways to use LLM chatbots to make for better government. Then again, critics worry with reason that the hype will just make it easier to impose more austerity in government under the guise of turning over services to AI systems.)
I turned to the Feminist Data Manifest-No, wondering what, if any, guidance it might offer. It rejects many of the ways of thinking that I believe are embedded in these systems. But its tenth point is this one:
“10. We refuse to ‘close the door behind’ ourselves. We commit to entering ethically compromised spaces like the academy and industry not to imbricate ourselves into the hierarchies of power but to subvert, undermine, open, make possible.”
So, if we were to decide it’s better to wrestle with LLM chatbots in classrooms, rather than just let the market teach everyone how to use them, how might all these fears shape our practices? Can we teach with ChatGPT in ways that “subvert, open, make possible?”
First, I think it’s clear that we don’t want our students to feed the beast. Or we don’t want to compel them to. Making students create accounts and driving up account numbers is probably not a great practice. But students could always work in groups. (I thought we might be able to do a shared classroom account, but that would take some work. It appears that one needs to use an e-mail and two-factor authentication to access ChatGPT. That’s hard to do with 30+ people.) Just as importantly, instructors can encourage students to “opt out” of training the system: according to OpenAI, if you opt out, it won’t continue to use your data to make the system better. (Is this true? I guess we just have to hope so…) Click here for my own opt-out experience/instructions.
(This step reminds me of how murky my waters already are. My reliance on Google for Gmail and Drive means that I’ve already contributed to building massive training sets, and I get students to do the same when they e-mail me or submit papers online. If we listen to music or watch TV on a streaming service, we are likely engaging with some form of machine learning algorithm through the service's recommender engine. Still: why not opt out now when we can?)
Next, the problems of flawed data are precisely those that can be explored in guided practice, supplemented by some reading of the work of those (like many cited above) who have revealed inherent and discriminatory patterns in these systems.
The problem of possible theft is the hardest one. Many of the academics I hear from are concerned about ChatGPT mostly because of the ways it muddies the waters around what counts as plagiarism. That’s one aspect of my concern too. But I think this is a place where we are invited to work with students to think about why “credit” matters in academic writing in the first place, and to work out the political economy of creative work. If that sounds like a big ask for someone who just wants to get some help writing an essay, well: I guess that’s college for you.
For the time being, I plan to design writing exercises that don't make it that beneficial to use something like ChatGPT in the first place. I suppose I could ban the use of it, but I don't know how I'd enforce it (and I've always been wary of automatic plagiarism checkers, since they seem like yet another way to feed some data beast). So, I'll probably discourage the use of ChatGPT, but ask students who do employ a bot to acknowledge or cite it. That way, at least they’re nodding to the work of others that undergirds their own writing.
I suppose I might amend my grading rubric to say that an A or B essay has to be significantly better than whatever a chatbot can do. After all, while the writing is surprisingly fluent, I don't think ChatGPT outputs have the spark of something that's really interesting. Dan Cohen makes this pitch in his most recent newsletter, imagining his ideal messages from profs to their students, ones that emphasize the personal purpose of communication: the relation of a distinct self to equally distinctive others. We scrutinize our lovers' every comma, he notes, and we search the most important messages for deeper meanings. We can see personality in writing, because it's in there:
You have lived your life, read a quirky selection of books, appreciated certain kinds of art and hated others, assembled an eclectic music playlist. In college, you will do much more of that, and sample entirely new genres of writing, sights, and sounds that will expand your palate.
Out of this distinct set of inputs will arise your distinct set of outputs: your personal style, your unique inflections, the little ornaments of language that help us project ourselves as us and get across our ideas, needs, and wants. ChatGPT was trained on a gigantic but generic mass of inputs — why would you want to reuse the outputs from that?
Cohen explores this line of thinking further in an earlier newsletter titled "Can Engineered Writing Ever Be Great?" Where he lands is similar to where Everest Pipkin landed in the talk I mentioned earlier. Pipkin ponders why the output of massive LLMs tends, on the whole, to be "boring." They make the case that "to work with generative systems is to be a curator" and explain that working with hand-built datasets, or even giant ones, really requires getting to know the rules and poetics of the whole. That's just really hard to do with a catch-all LLM. In contrast, Pipkin says: "I pick my corpora because I care for them, deeply." Indeed, their underlying rule is a good one for any artist or scholar or person: "work with sources you care about."
Pipkin made a credit page for "I've never picked a protected flower," one of their art works built with generative text. The credits take up 14 pages at the end and name every person who contributed text used to train the model.
I love the gesture, especially for the way it reminds us of the debt such a model owes to others' creative work. It also feels like a very distant cousin to an experiment in film-making epistemology by my friend, the dazzling documentarian, Penny Lane, who accompanied her film Nuts with "footnotes" expressing the truth content of the film's scenes.
Okay, so some of you have made it this far and you're screaming: "this is all very interesting, but what do I write in my syllabus?!?!?!?"
Maybe this will help. My dear friend Heather Roller passed along this massive Google Doc of "Classroom Policies for AI Generative Tools." It is, um, extensive. I am glad it exists.
(Heather, by the way, is a brilliant historian and author of two award-winning books, most recently Contact Strategies: Histories of Native Autonomy in Brazil.)
The photos in this newsletter have nothing to do with ChatGPT. They're just some interesting towers I encountered on our Canadian excursions to Toronto and Montreal.
Thank you to Barbara Welke for stimulating this essay and Robin Sloan for helping me make it better.
Take care all,