reasoning evaluation 5
- A Systematic Evaluation of Large Language Models on Out-of-Distribution Logical Reasoning Tasks
- Are Large Language Models Really Good Logical Reasoners?
- Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples
- Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity