How (un)Stable Are LLM Occupational Exposure Scores? Evidence from Multi-Model Replication
A rapidly growing literature estimates AI's labor-market effects using large language models (LLMs) to self-assess occupational exposure. We demonstrate that these measures are highly fragile. Replicating the dominant rubric with three frontier models on identical tasks, we find a 3.6-fold divergence in mean exposure, with agreement as low as 57%. This measurement instability alters downstream empirical conclusions: in a difference-in-differences framework, individual-level coefficient magnitudes vary 2.4-fold across annotators, and county-level estimates flip from a significant negative to an insignificant positive depending on the choice of annotator. We formalize this non-classical measurement error, highlighting the risks of treating evolving LLMs as static instruments.
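As a sketch of the measurement-error logic described in the abstract (notation ours, not necessarily the paper's): let E_i^* denote occupation i's true exposure and E_i^m the score assigned by annotating model m. If the annotation error u_i^m = E_i^m - E_i^* is correlated with true exposure or varies systematically across models, the error is non-classical, and in the simplest bivariate analogue the estimated exposure coefficient is rescaled rather than merely attenuated, so magnitudes and even signs can differ across annotators.

\[
E_i^m = E_i^* + u_i^m, \qquad \operatorname{Cov}\!\left(u_i^m, E_i^*\right) \neq 0
\]
\[
y_{it} = \alpha_i + \gamma_t + \beta^m \left(E_i^m \times \mathrm{Post}_t\right) + \varepsilon_{it}
\]
\[
\operatorname{plim} \hat{\beta}^m = \beta \cdot \frac{\operatorname{Cov}\!\left(E_i^m, E_i^*\right)}{\operatorname{Var}\!\left(E_i^m\right)}
\]

Because the rescaling factor depends on which model generated E_i^m, two researchers running the same difference-in-differences specification with different LLM annotators can reach different substantive conclusions, which is the instability the paper documents.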
Michelle Yin, Hoa Vu, and Claudia Persico, "How (un)Stable Are LLM Occupational Exposure Scores? Evidence from Multi-Model Replication," NBER Working Paper 35110 (2026), https://doi.org/10.3386/w35110.