Come Now, Let Us Reason Together
Zero-Shot AgentInstruct leverages big general-purpose LLMs to give domain-specific models a step-by-step reasoning process that improves their accuracy.
If you surveyed a large group of people about their opinions on large language models (LLMs), what you would find is that…well, it’s complicated. On the one hand, these powerful artificial intelligence algorithms have incredible reasoning capabilities and a knack for understanding natural language. On the other hand, LLMs are well known for their tendency to very confidently tell lies (or, more politely put, to hallucinate), and the cost and energy consumption that go into building and operating these models are frequently astronomical.
For reasons such as these, some people love LLMs, while others see them as a fad that they wish would just go away. But if researchers at Washington University in St. Louis and UC Berkeley have their way, we might be able to have the best of both worlds — models that are more accurate and consume far less energy and computational horsepower. Maybe we really can have our cake and eat it too?
When a single LLM is trained with the goal of handling any imaginable task, the training costs skyrocket. It may also be the case that as one task area is improved, others simultaneously get worse. Tired of playing Whac-A-Mole all day, engineers have started to develop smaller, purpose-built LLMs that are fine-tuned for specific tasks. But since these pint-sized models do not have the broad knowledge of a general-purpose model, they can have some problems with clear reasoning.
The research team’s solution, called Zero-Shot AgentInstruct, seeks to overcome these issues through collaboration between multiple models. Their approach begins with a large, general-purpose LLM, which is prompted to produce step-by-step instructions for completing a task. That model may not have the domain knowledge necessary to carry out the task, at least not with sufficient accuracy, but its generalized reasoning capabilities give it an understanding of how the task should be approached.
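To make that first stage concrete, here is a minimal Python sketch of how the instruction-generating call might look. Everything in it is illustrative: call_llm is a hypothetical helper standing in for whatever inference API is available, and the prompt wording is an assumption rather than the researchers’ actual prompt.

```python
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical stand-in for an LLM inference call.

    Replace the body with a real API or local-runtime call; here it
    just echoes a placeholder so the sketch runs end to end.
    """
    return f"[{model} response to: {prompt[:60]}...]"


def generate_instructions(task_description: str) -> str:
    """Stage 1: ask a large, general-purpose LLM how to do the task.

    The big model is never asked for the final answer, only for a
    numbered, step-by-step plan for solving instances of the task.
    """
    prompt = (
        "You are writing instructions for another model.\n"
        f"Task: {task_description}\n"
        "Write clear, numbered, step-by-step instructions for solving "
        "instances of this task. Do not solve any instance yourself."
    )
    return call_llm(model="large-general-purpose-llm", prompt=prompt)
```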
The instructions generated by the large model are then used to prompt a much smaller, domain-specific LLM to answer the user’s query. With very clear instructions about how to carry out the task, the answer can be much more accurate and targeted. Furthermore, the smaller model consumes far less energy and computational power than a large, general-purpose model would need to answer a complex question.
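Continuing the sketch above, the second stage simply prepends those instructions to the user’s actual input and hands the combined prompt to the small model. Again, the model names and prompt format here are illustrative assumptions, not the paper’s exact setup.

```python
def answer_with_instructions(user_input: str, instructions: str) -> str:
    """Stage 2: a smaller, domain-specific LLM answers the actual query.

    The step-by-step instructions from stage 1 are prepended so the
    small model can follow the big model's reasoning plan.
    """
    prompt = (
        f"Instructions:\n{instructions}\n\n"
        f"Input:\n{user_input}\n\n"
        "Follow the instructions step by step, then give a final answer."
    )
    return call_llm(model="small-domain-specific-llm", prompt=prompt)


# Example wiring of the two stages (reuses call_llm and
# generate_instructions from the sketch above):
instructions = generate_instructions(
    "Classify a movie review as positive or negative."
)
answer = answer_with_instructions(
    "The plot dragged, but the acting was superb.", instructions
)
```

Note the division of labor in this sketch: the expensive general-purpose model is consulted once per task to produce instructions, while the cheap specialized model handles each individual query.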
This all sounds great in theory, but we need to know if it works out as well in practice, so the team evaluated Zero-Shot AgentInstruct. The evaluation was conducted using 29 datasets that included 53 subsets spanning tasks such as generation, classification, and reasoning. Task-specific instructions were then generated and fed into three prominent LLMs: Vicuna, Llama-2-chat, and GPT-3.5 Turbo. Results showed that Zero-Shot AgentInstruct led to an average performance improvement of 17.8 percent across these models. It was noted that reasoning in math and logic, in particular, benefited greatly from this approach.
It is important to mention that Zero-Shot AgentInstruct is not perfect and does make mistakes from time to time. But because the approach outputs the step-by-step reasoning that leads to each result, it is always possible to check the work when an answer seems suspect. In any case, Zero-Shot AgentInstruct is helping to push the limits of what is possible with smaller models, and that is a development that we can all get behind.