Evaluating the Coherence and Comprehensiveness of Data Tests Generated by LLMs (Dr. Rohan Alexander)
Date and Time
Abstract: We investigated the potential of Large Language Models (LLMs) for developing dataset validation tests. We carried out 95 experiments each with GPT-3.5 and GPT-4, examining different prompt scenarios, learning modes, temperature settings, and roles. The prompt scenarios were:
1) Asking for expectations,
2) Asking for expectations with a given context,
3) Asking for expectations after requesting a simulation, and
4) Asking for expectations with a provided data sample.
For learning modes, we tested zero-shot, one-shot, and few-shot learning. We also tested four temperature settings: 0, 0.3, 0.6, and 1. Furthermore, two distinct roles were considered. To gauge consistency, every setup was tested five times. The LLM-generated responses were benchmarked against a gold-standard suite created by an experienced data scientist knowledgeable about the data in question. We find considerable returns to the use of few-shot learning, and that the more explicit the data context, the better the results. The best LLM configurations complement, rather than substitute for, the gold standard. This study underscores the value LLMs can bring to the data cleaning and preparation stages of the data science workflow.
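To make the notion of dataset "expectations" concrete, the sketch below shows the kind of validation tests an LLM might be asked to generate, written as plain Python checks. The toy dataset, column names, and rules are illustrative assumptions, not the data or gold-standard suite from the study.

```python
# Illustrative toy dataset; the columns and values are assumptions for this sketch.
records = [
    {"id": 1, "age": 34, "country": "CA"},
    {"id": 2, "age": 27, "country": "US"},
    {"id": 3, "age": 45, "country": "CA"},
]

def expect_column_values_not_null(rows, column):
    """Expectation: every row has a non-null value for `column`."""
    return all(row.get(column) is not None for row in rows)

def expect_column_values_between(rows, column, low, high):
    """Expectation: every value of `column` falls within [low, high]."""
    return all(low <= row[column] <= high for row in rows)

def expect_column_values_unique(rows, column):
    """Expectation: no duplicate values appear in `column`."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

# A small suite of expectations, each returning True (pass) or False (fail).
results = {
    "id not null": expect_column_values_not_null(records, "id"),
    "age in [0, 120]": expect_column_values_between(records, "age", 0, 120),
    "id unique": expect_column_values_unique(records, "id"),
}
print(results)
```

A gold-standard suite would contain many such checks tailored to the dataset's semantics; evaluating an LLM's output then amounts to comparing which of these checks it manages to propose.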
Rohan Alexander is an assistant professor at the University of Toronto, jointly appointed in the Faculty of Information and the Department of Statistical Sciences. He is also the assistant director of CANSSI Ontario, a senior fellow at Massey College, a faculty affiliate at the Schwartz Reisman Institute for Technology and Society, and a co-lead of the DSI Thematic Program in Reproducibility. His book, Telling Stories With Data, argues that a trustworthiness revolution is needed in data science, and proposes a view of what it could look like. His research investigates how we can develop workflows that improve the trustworthiness of data science, and he is particularly interested in the role of testing in that process.