Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks
A report on an experiment I conducted for Boaz Barak's "CS 2881r: AI Alignment and Safety" course at Harvard. I evaluate how different system prompt styles (minimal vs. principles vs. rules), alone and in combination, affect over-refusal, toxic-refusal, and capability benchmarks.