Textual Factors: A Scalable, Interpretable, and Data-driven Approach to Analyzing Unstructured Information

Lin William Cong; Tengyuan Liang; Xiao Zhang; Wu Zhu

doi:10.3386/w33168

Working Paper 33168

Issue Date November 2024

We introduce a general approach for analyzing large-scale text-based data, combining the strengths of neural network language processing and generative statistical modeling to create a factor structure of unstructured data for downstream regressions typically used in social sciences. We generate textual factors by (i) representing texts using vector word embedding, (ii) clustering the vectors using Locality-Sensitive Hashing to generate supports of topics, and (iii) identifying relatively interpretable spanning clusters (i.e., textual factors) through topic modeling. Our data-driven approach captures complex linguistic structures while ensuring computational scalability and economic interpretability, plausibly attaining certain advantages over and complementing other unstructured data analytics used by researchers, including emergent large language models. We conduct initial validation tests of the framework and discuss three types of its applications: (i) enhancing prediction and inference with texts, (ii) interpreting (non-text-based) models, and (iii) constructing new text-based metrics and explanatory variables. We illustrate each of these applications using examples in finance and economics such as macroeconomic forecasting from news articles, interpreting multi-factor asset pricing models from corporate filings, and measuring theme-based technology breakthroughs from patents. Finally, we provide a flexible statistical package of textual factors for online distribution to facilitate future research and applications.
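The three steps in the abstract can be illustrated with a minimal sketch. This is not the authors' package: the toy two-dimensional word vectors, the function names, and the parameters are all hypothetical, random-hyperplane hashing is just one common form of Locality-Sensitive Hashing, and the per-cluster frequency share is a crude stand-in for the topic-modeling step, which the abstract does not specify in detail.

```python
import random
from collections import defaultdict

def lsh_signature(vec, hyperplanes):
    """One bit per random hyperplane: which side of the plane vec falls on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vec, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def cluster_words(embeddings, n_planes=2, seed=0):
    """Step (ii), sketched: bucket words whose vectors share an LSH signature."""
    rng = random.Random(seed)
    dim = len(next(iter(embeddings.values())))
    hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for word, vec in embeddings.items():
        buckets[lsh_signature(vec, hyperplanes)].append(word)
    # Each bucket plays the role of a topic's support (its word list).
    return [sorted(words) for _, words in sorted(buckets.items())]

def factor_loadings(doc_tokens, clusters):
    """Step (iii), stand-in: share of a document's tokens in each cluster."""
    n = len(doc_tokens)
    loadings = []
    for cluster in clusters:
        support = set(cluster)
        hits = sum(1 for tok in doc_tokens if tok in support)
        loadings.append(hits / n if n else 0.0)
    return loadings

# Step (i), assumed given: toy embeddings (hypothetical values, not trained).
embeddings = {
    "inflation": [1.0, 0.5],
    "prices": [2.0, 1.0],      # parallel to "inflation", so same signature
    "patent": [-1.0, 0.5],
    "invention": [-0.5, 1.0],
}
clusters = cluster_words(embeddings)
document = ["inflation", "prices", "prices", "patent"]
loadings = factor_loadings(document, clusters)
```

The resulting `loadings` vector has one entry per cluster and could serve as regressors in a downstream model, which is the role the abstract assigns to textual factors. Which words share a bucket depends on the random hyperplanes, except that vectors pointing in the same direction always hash together.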

We are especially grateful to Agostino Capponi, Gerard Hoberg, and Gustavo Schwenkler for detailed comments and direction. We thank Chunrong Ai, Kwan Chen, Tony Cookson, Tarek Hassan, Shiyang Huang, Sanya Kohli, Kai Li, Alejandro Lopez-Lira, Nadya Malenko, Alan Moreira, Deniz Okat, Lubos Pastor, Lauren Sutioso, George Tauchen, Baozhong Yang, Weiyi Zhao, and seminar and conference participants at the AEA/CERNA Joint Meeting, Ansatz Capital, Conference on Big Data, Machine Learning and AI in Economics, Baidu Du Xiaoman Financial, DataYes/KDD China AI x FinTech Workshop, Erasmus University (Rotterdam), Financial Intermediation Research Society Annual Conference (Savannah), Global Digital Economy Summit for Small and Medium Enterprises (DES2020), Guanghua International Symposium, University of Hong Kong, Hong Kong University of Science and Technology, INQUIRE Europe Autumn Seminar (Krakow), IIF International Research Conference & Award Summit (Delhi), JD.com JDD (Financial Arm), Kenan Institute Frontiers of Entrepreneurship Conference, Nanyang Technological University, New Technologies in Finance Conference (Columbia GSB), 1st NY Fed FinTech Research Conference, Singapore Management University, Tilburg University, the Second Toronto FinTech Conference, and the Zhongnan University of Economics and Law for their feedback and suggestions. Michael Fortunato, Fujie Wang, Oliver Xie, and Guanyu Zhou provided excellent research and programming assistance. Thanks also go to Shuyan Huang, Chloe Shin, Raj Shukla, Ellis Soodak, Jiashu Sun, and Connie Xu for their research assistance. We gratefully acknowledge the financial support from the Ewing Marion Kauffman Foundation, the Becker Friedman Institute of Economics, the Fama-Miller Center for Research in Finance, INQUIRE Europe, the Kenan Institute of Private Enterprise, and the Risk Institute at OSU Fisher College of Business (while Cong was a fellow at the institute).
The contents of this publication are solely the responsibility of the authors. Please send correspondence to Cong at will.cong@cornell.edu. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research.
Lin William Cong, Tengyuan Liang, Xiao Zhang, and Wu Zhu, "Textual Factors: A Scalable, Interpretable, and Data-driven Approach to Analyzing Unstructured Information," NBER Working Paper 33168 (2024), https://doi.org/10.3386/w33168.
