Evaluating neural toxic degeneration in language models

RealToxicityPrompts

Pre-trained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language, which hinders their safe deployment. We investigate the extent to which pre-trained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.

This paper highlights how language models used to automatically generate text can produce toxic, offensive, and potentially harmful language. The authors describe various techniques that can be employed to avoid or limit this behavior, but demonstrate that no current method is fail-safe in preventing it entirely.
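One simple family of mitigations in this line of work is word filtering: blocking a list of banned tokens at decoding time so the model can never emit them. The sketch below is a toy, self-contained illustration of that idea only; the vocabulary, probabilities, and blocklist are hypothetical stand-ins for a real language model's next-token distribution, not the paper's actual setup.

```python
import random

# Hypothetical next-token distribution standing in for a real LM's output.
VOCAB_PROBS = {
    "the": 0.3, "cat": 0.2, "sat": 0.2, "badword": 0.2, "mat": 0.1,
}

# Hypothetical blocklist of tokens we refuse to generate.
BLOCKLIST = {"badword"}


def sample_token(probs, blocklist, rng):
    """Sample a token after zeroing out blocked tokens and renormalizing."""
    allowed = {t: p for t, p in probs.items() if t not in blocklist}
    total = sum(allowed.values())
    r = rng.random() * total
    acc = 0.0
    tok = None
    for tok, p in allowed.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # fallback for floating-point edge cases


def generate(n_tokens, seed=0):
    """Generate a sequence of tokens, none of which are on the blocklist."""
    rng = random.Random(seed)
    return [sample_token(VOCAB_PROBS, BLOCKLIST, rng) for _ in range(n_tokens)]
```

As the paper's summary above notes, such filtering is not fail-safe: toxicity can surface through combinations of individually innocuous words, which is why the authors find no current method prevents toxic degeneration entirely.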