Autocompleting Inequality: Large language models and the "alignment problem"


Mike Zajko, University of British Columbia

The latest wave of AI hype has been driven by ‘generative AI’ systems exemplified by ChatGPT, which was created by OpenAI’s ‘fine-tuning’ of a large language model (LLM). The process of fine-tuning involves using human labour to provide feedback on generative outputs in order to bring these into greater ‘alignment’ with particular values. While these values typically include truthfulness, helpfulness, and non-offensiveness, this research focuses on those that address inequalities between groups, particularly based on gender and race, under the broader heading of ‘AI safety’. While previous sociological analysis has documented the algorithmic reproduction of inequality through various systems, what is notable about the current generation of generative AI is the concerted effort to build ‘guard rails’ that counteract these tendencies. When asked to comment on marginalized groups, these guard rails direct generative AI systems to affirm fundamental human equalities and push back against derogatory language. As a result, services such as ChatGPT have been criticized for promoting ‘woke’ or liberal values, and their guard rails become sites of struggle as users attempt to ‘jailbreak’ these systems.

This article analyzes the fine-tuning of generative AI as a process of social ordering, beginning with the encoding of cultural dispositions into LLMs, their containment and redirection into vectors of ‘safety’, and the subsequent challenges to these guard rails by users. Fine-tuning becomes a means by which some social hierarchies are reproduced, reshaped, and flattened. I analyze documentation provided by leading generative AI developers, including the instructional materials used to coordinate their workforces towards certain goals, datasets recording the interactions between these workers and the chatbots they are responsible for fine-tuning (through ‘reinforcement learning from human feedback’, or RLHF), and documentation accompanying the release of new generative AI systems, which describes the ‘mitigations’ taken to counteract inequality. I show how fine-tuning makes use of human judgement to reshape the algorithmic reproduction of inequality, while also arguing that the most important values driving AI alignment are commercial imperatives and alignment with political economy.

To explain how inequalities persist in generative AI despite its fine-tuning, this research builds on a Bourdieusian perspective that has been valuable in connecting the cultural reproduction of social order with machine learning. To explain how generative AI has been ‘tuned’ to avoid reproducing particular inequalities (namely sexism and racism), we can study the work involved in fine-tuning through methods adapted from institutional ethnography. This helps us understand how the human labour required to make AI ‘safe’ is textually mediated and coordinated towards certain goals across time and space. However, understanding what the goals or ‘values’ of fine-tuning are requires grounding our analysis in political economy. This is because generative AI has been an expensive investment in what is intended as a profit-making enterprise. Commercial exploitation is a primary consideration, and the cultural reproduction of other forms of oppression can actually be a threat to business interests. Therefore, my argument is that AI’s alignment problem has less to do with lofty human values, and more to do with aligning these systems with political economy and whatever is conducive to commercialization.
To the extent that these systems are being aligned towards equality, this remains a particular (liberal) form of equality oriented towards equal treatment or neutrality, particularly along lines of gender and race, rather than more radical or transformative alternatives.