BrockNLP Lab Presenting Two Papers at COLING 2025

Congratulations to the teams behind two accepted publications! The BrockNLP Lab will deliver two oral presentations at COLING 2025 in Abu Dhabi, UAE:

Can We Afford The Perfect Prompt? Balancing Cost and Accuracy with the Economical Prompting Index - Tyler McDonald (Undergraduate Researcher), Anthony Colosimo (Undergraduate Researcher), Yifeng Li, and Ali Emami (Director)

Abstract:

As prompt engineering research rapidly evolves, evaluations beyond accuracy are crucial for developing cost-effective techniques. We present the Economical Prompting Index (EPI), a novel metric that combines accuracy scores with token consumption, adjusted by a user-specified cost concern level to reflect different resource constraints. Our study examines 6 advanced prompting techniques, including Chain-of-Thought, Self-Consistency, and Tree of Thoughts, across 10 widely-used language models and 4 diverse datasets. We demonstrate that approaches such as Self-Consistency often provide statistically insignificant gains while becoming cost-prohibitive. For example, on high-performing models like Claude 3.5 Sonnet, the EPI of simpler techniques like Chain-of-Thought (0.72) surpasses more complex methods like Self-Consistency (0.64) at slight cost concern levels. Our findings suggest a reevaluation of complex prompting strategies in resource-constrained scenarios, potentially reshaping future research priorities and improving cost-effectiveness for end-users.
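The abstract does not spell out how the EPI is computed, but the core idea — discounting accuracy by token consumption, weighted by how much the user cares about cost — can be sketched in a few lines. The exponential form, the scale of the `cost_concern` parameter, and the function name below are illustrative assumptions, not the paper's actual definition:

```python
import math

# Hypothetical sketch of a cost-adjusted prompting metric.
# NOTE: this is NOT the published EPI formula (the abstract does not
# state it); the exponential-decay form and the parameter names are
# illustrative assumptions only.

def cost_adjusted_score(accuracy: float, avg_tokens: float,
                        cost_concern: float) -> float:
    """Discount accuracy by token consumption.

    accuracy     -- task accuracy in [0, 1]
    avg_tokens   -- average tokens consumed per query
    cost_concern -- user-specified weight; 0 ignores cost entirely
    """
    # A larger cost_concern penalizes token-hungry techniques more steeply.
    return accuracy * math.exp(-cost_concern * avg_tokens)

# A cheaper technique can outrank a slightly more accurate but far more
# expensive one once the cost concern is nonzero:
print(cost_adjusted_score(0.85, 500, 1e-4))    # ~0.81
print(cost_adjusted_score(0.88, 5_000, 1e-4))  # ~0.53
```

At `cost_concern = 0` a metric of this shape reduces to plain accuracy, which mirrors the paper's point: the ranking of prompting techniques shifts as resource constraints tighten.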

NYT-Connections: A Deceptively Simple Text Classification Task That Stumps System-1 Thinkers - Angel Yahir Loredo Lopez (Mitacs Globalink Intern 2024), Tyler McDonald (Undergraduate Researcher), and Ali Emami (Director)

Abstract:

Large Language Models (LLMs) have shown impressive performance on various benchmarks, yet their ability to engage in deliberate reasoning remains questionable. We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game. This benchmark is designed to penalize quick, intuitive “System 1” thinking, isolating fundamental reasoning skills. We evaluated six recent LLMs, a simple machine learning heuristic, and humans across three configurations: single-attempt, multiple attempts without hints, and multiple attempts with contextual hints. Our findings reveal a significant performance gap: even top-performing LLMs like GPT-4 fall short of human performance by nearly 30%. Notably, advanced prompting techniques such as Chain-of-Thought and Self-Consistency show diminishing returns as task difficulty increases. NYT-Connections uniquely combines linguistic isolation, resistance to intuitive shortcuts, and regular updates to mitigate data leakage, offering a novel tool for assessing LLM reasoning capabilities.
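For readers unfamiliar with the game: each Connections puzzle presents sixteen words to be partitioned into four hidden categories of four. A minimal, all-or-nothing scorer for a single attempt might look like the sketch below; the toy puzzle and the exact-match scoring are illustrative assumptions, not the paper's evaluation harness:

```python
# Toy Connections-style puzzle: four hidden categories of four words.
# The words and the all-or-nothing scoring are illustrative assumptions.
GOLD_GROUPS = [
    frozenset({"bass", "flounder", "salmon", "trout"}),
    frozenset({"ant", "drill", "island", "opal"}),
    frozenset({"bowling", "golf", "ping-pong", "tennis"}),
    frozenset({"april", "june", "august", "september"}),
]

def score_attempt(predicted_groups) -> int:
    """Count gold categories matched exactly (no partial credit).

    predicted_groups -- iterable of four-word collections proposed by
    the solver; a group scores only if it matches a hidden category
    word-for-word.
    """
    predictions = {frozenset(group) for group in predicted_groups}
    return sum(1 for gold in GOLD_GROUPS if gold in predictions)

attempt = [
    ["bass", "flounder", "salmon", "trout"],    # exact match: scores
    ["ant", "drill", "island", "april"],        # one word off: no credit
    ["bowling", "golf", "ping-pong", "opal"],
    ["june", "august", "september", "tennis"],
]
print(score_attempt(attempt))  # 1 of 4 categories solved
```

The multi-attempt configurations described in the abstract layer retries (with or without contextual hints) on top of a basic scoring loop like this one.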

Both papers will be presented at COLING 2025, to be held in Abu Dhabi, United Arab Emirates, from January 19th to 26th.

Tyler McDonald
MSc Student - Computer Science

My research focuses on the use and deployment of regional LLMs, the evaluation of deliberate reasoning in LLMs, and the cost efficacy of recent LLM research.