Language models are AI systems that generate natural language from a given text prompt, and OpenAI's GPT family is among the most advanced examples available.
But they also have a problem: their behavior is hard to understand and predict. To make language models more transparent and trustworthy, OpenAI is developing a tool that automatically identifies which parts of a model are responsible for which aspects of its behavior and explains them in natural language.
The tool's core idea is to use one language model, GPT-4, to analyze the internal structure of other language models. Language models consist of many "neurons", each of which can detect a particular pattern in the text and influence the model's next output.
OpenAI's tool uses this mechanism to break the model down into parts. First, it runs text sequences through the model being evaluated and looks for neurons that "activate" frequently on them. It then "presents" these highly active neurons, along with the text that triggered them, to GPT-4 and has GPT-4 generate an interpretation of what each neuron is looking for.
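To make that step concrete, here is a minimal sketch, not OpenAI's actual tool code: it records one GPT-2 MLP neuron's activations over a few sentences using Hugging Face transformers, picks the tokens that activate the neuron most strongly, and asks GPT-4 (via the openai client) to guess what the neuron responds to. The layer index, neuron index, example sentences, and prompt wording are all illustrative assumptions.

```python
# Sketch of the "explain" step: find a neuron's top-activating tokens,
# then ask GPT-4 to interpret them. Not OpenAI's released code.
import torch
from transformers import GPT2Tokenizer, GPT2Model
from openai import OpenAI

LAYER, NEURON = 5, 300          # which MLP neuron to inspect (arbitrary choice)
texts = ["The quick brown fox jumps over the lazy dog.",
         "Stock prices rose sharply after the earnings report."]

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

records = []                    # (activation, token) pairs across all texts

def hook(_module, _inp, out):
    # out: [batch, seq, 3072] post-GELU activations of the layer's MLP
    hook.acts = out[0, :, NEURON].detach()

model.h[LAYER].mlp.act.register_forward_hook(hook)

for text in texts:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        model(**ids)
    tokens = tok.convert_ids_to_tokens(ids["input_ids"][0].tolist())
    records += list(zip(hook.acts.tolist(), tokens))

top = sorted(records, reverse=True)[:10]   # strongest activations
examples = "\n".join(f"{t!r}: {a:.2f}" for a, t in top)

client = OpenAI()               # assumes OPENAI_API_KEY is set
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "These tokens most strongly activate one GPT-2 "
                          f"neuron (token: activation):\n{examples}\n"
                          "In one sentence, what is this neuron looking for?"}])
print(resp.choices[0].message.content)
```

OpenAI's released pipeline does essentially this at scale, sweeping across every neuron in every layer rather than a single hand-picked one.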
To determine how accurate an interpretation is, the tool gives GPT-4 some further text sequences and asks it to predict, or simulate, how the neuron would behave on them. It then compares the simulated neuron's behavior with the actual neuron's behavior.
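As a toy illustration of the scoring idea (not the released scoring code): given GPT-4's token-by-token predictions of how strongly the neuron should fire under the candidate explanation, one simple way to measure agreement with the real activations is a correlation coefficient. The activation values below are made up.

```python
# Toy scoring example: compare simulated vs. actual neuron activations.
import numpy as np

actual    = np.array([0.1, 0.0, 3.2, 0.2, 2.9, 0.0])  # real neuron activations
simulated = np.array([0.0, 0.3, 2.8, 0.1, 3.1, 0.2])  # GPT-4's predictions

score = np.corrcoef(actual, simulated)[0, 1]
print(f"explanation score: {score:.2f}")   # close to 1.0 means a good match
```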
"With this approach, we can basically generate some initial natural language interpretations for each neuron, and there's a score to measure how well those interpretations match the actual behavior." Jeff Wu, head of OpenAI's Scalable Alignment Team, said, "We use GPT-4 as part of the process to generate explanations of what the neuron is looking for and to assess how well those explanations match what it actually does."
The researchers generated explanations for all 307,200 neurons in GPT-2 and compiled them into a dataset, which was released as open source on GitHub along with the tool's code. Tools like this could one day be used to improve language models, for example by reducing bias or harmful speech. But the researchers acknowledge that there's a long way to go before the tool is truly useful: it produced confident interpretations for only about 1,000 neurons, a small fraction of the total.
Some might argue that the tool is really an advertisement for GPT-4, since it requires GPT-4 to run. But Wu says that is not the tool's purpose: it uses GPT-4 "by accident," and if anything it exposes GPT-4's weaknesses in this area. He adds that it was not created for commercial use and could in theory be adapted to language models other than GPT-4.
"Most of the explanations have low scores or don't explain much of the behavior of the actual neurons." Wu says, "It's hard to tell how many neurons are active -- for example, they activate on five or six different things, but there's no obvious pattern. Sometimes there's an obvious pattern, but the GPT-4 can't find it."
That's without even getting into newer, larger, and more complex models, or models that can browse the web for information. For the latter, though, Wu believes web browsing wouldn't change the tool's basic mechanics much. It would only need a little tweaking, he says, to figure out why neurons decide to make certain search-engine queries or visit particular websites.
"We hope this will open up a promising avenue to solve interpretability problems in an automated way that others can build on and contribute to." Wu said, "We hope we'll really be able to have good explanations for the behavior of these models."