Artificial Jagged Intelligence: When AI Benchmarks Misstate Deployment Value
Organisations increasingly select and deploy artificial intelligence systems on the strength of public benchmarks. A benchmark, however, scores a system on a single distribution of tasks, whereas each organisation meets its own. Because AI performance is uneven across tasks, a property called artificial jagged intelligence, these distributions diverge, and a system that looks reliable on average can fail on the tasks a given workflow uses most. We model this gap and show that it is not noise but a predictable exposure effect: deployment loss exceeds benchmark loss exactly when the tasks an organisation uses most are those the system handles worst. This single mechanism links managerial choices usually studied in isolation. It governs when to roll out a system, where to direct scarce reliability investment, whether to audit one’s own task mix before committing, and when to verify outputs after deployment. Better information about the workflow redirects investment towards targeted fixes whose value a public benchmark hides. The same logic explains why a single benchmark score is not enough: providers should report performance by task category so that organisations can reweight it for their own use.
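The exposure effect can be illustrated with a small numerical sketch. This is not the paper's model: it assumes only that loss under any task mix is a weighted average of per-category losses. The category names, error rates, and task mixes below are hypothetical values chosen for illustration.

```python
# Illustrative sketch only: the categories, error rates and task mixes are
# hypothetical and are not taken from the paper.

def weighted_loss(weights, losses):
    """Expected loss under a task mix: sum over categories of weight * loss."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("task weights must sum to 1")
    return sum(weights[c] * losses[c] for c in weights)

# Hypothetical per-category error rates, as a provider might report them.
losses = {"summarise": 0.02, "extract": 0.05, "reason": 0.30}

# Hypothetical task mixes: the benchmark's and one organisation's workflow.
benchmark_mix = {"summarise": 0.5, "extract": 0.3, "reason": 0.2}
workflow_mix = {"summarise": 0.1, "extract": 0.2, "reason": 0.7}

benchmark_loss = weighted_loss(benchmark_mix, losses)
deployment_loss = weighted_loss(workflow_mix, losses)

# The gap equals the sum of (workflow weight - benchmark weight) * loss, so it
# is positive when the workflow over-weights the categories with high loss.
exposure_gap = sum(
    (workflow_mix[c] - benchmark_mix[c]) * losses[c] for c in losses
)

print(round(benchmark_loss, 3), round(deployment_loss, 3), round(exposure_gap, 3))
```

In this sketch the workflow leans on the category the system handles worst, so the reweighted loss is well above the benchmark figure even though no per-category number has changed. The same function reweights per-category results for any other task mix, which is the use the abstract has in mind for category-level reporting.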
Joshua S. Gans, "Artificial Jagged Intelligence: When AI Benchmarks Misstate Deployment Value," NBER Working Paper 34712 (2026), https://doi.org/10.3386/w34712.