The Mighty LLM Match-Up

Pitting the top 5 LLMs against each other on Accuracy, Reliability, Customization, Cost, and Safety: the ARCCS criteria.

Nov 01, 2023

Not all LLMs are created equal.

Since ChatGPT launched, there has been an explosion in available LLMs: Claude, Llama, Falcon, and PaLM, to name a few.

Each one differs in size, parameters, and capabilities, and each offers a unique balance of strengths and weaknesses depending on the intended application.

So how do you choose?

Use the ARCCS framework to evaluate LLMs

Here are the dimensions I consider when I evaluate LLMs:

Accuracy: An LLM is only as good as its last answer. Here I look for two things. First, does it produce higher-quality results than the rest? Second, are the answers consistent?
Reliability: As your application grows, you need a reliable LLM. Non-functional factors such as speed of response (inference time), usability, and availability start to matter.
Customization: Some LLMs allow customization and fine-tuning for your specific needs, while others don't.
Cost: Some LLMs are more expensive than others, and you must choose one that fits your budget.
Safety & Security: LLMs are a black box, so pick a model that helps mitigate the risks of bias and abusive content, and one that prioritizes data privacy and protects sensitive information.
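
One way to turn these five dimensions into a comparison is a simple weighted score. The sketch below is only an illustration of that idea, not a method from this post: the function names, the 1-to-5 rating scale, the weights, and the "model_a" / "model_b" candidates are all hypothetical placeholders, and the ratings are made-up numbers, not measurements of any real LLM. For cost, a higher rating is assumed to mean more affordable.

```python
# Minimal sketch: weighted ARCCS scoring. All ratings and weights are
# hypothetical placeholders, not measurements of any real model.

# The five ARCCS dimensions from the list above.
DIMENSIONS = ("accuracy", "reliability", "customization", "cost", "safety")


def arccs_score(ratings, weights):
    """Return the weighted average of 1-5 ratings across the ARCCS dimensions."""
    missing = [d for d in DIMENSIONS if d not in ratings or d not in weights]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(ratings[d] * weights[d] for d in DIMENSIONS) / total_weight


def rank_models(candidates, weights):
    """Return (name, score) pairs sorted from highest to lowest score."""
    scored = [(name, arccs_score(r, weights)) for name, r in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights: this imaginary application cares most about accuracy.
weights = {"accuracy": 3, "reliability": 2, "customization": 1, "cost": 2, "safety": 2}

# Placeholder ratings on a 1-5 scale (for cost, higher = more affordable).
candidates = {
    "model_a": {"accuracy": 4, "reliability": 4, "customization": 4, "cost": 4, "safety": 4},
    "model_b": {"accuracy": 5, "reliability": 3, "customization": 2, "cost": 4, "safety": 3},
}

ranking = rank_models(candidates, weights)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The weights are where your own application comes in: a customer-facing chatbot might weight safety highest, while an internal prototyping tool might weight cost and customization more.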

How the Top LLMs Rank on ARCCS

Here is how they all stack up:

ChatGPT (GPT-4) is the most robust and balanced LLM. It delivers highly accurate and detailed responses to a diverse range of prompts. It also rates high in reliability, with proven performance in speed, scalability, and ease of use. And while it can get expensive, it is the best game in town (as of now).
Claude 2 is a good conservative option that maintains a consistent personality and style. Its strength lies in its ability to carry on safe, natural dialogue. I use it for reviewing all my blog posts (including this one).
Llama 2 is a top free open-source LLM. It is good for answering questions, writing summaries, debugging code, and generating text. With Meta behind it, this one could give ChatGPT a run for its money.
PaLM 2 is capable of performing complex tasks such as translation, following multilingual instructions, and coding. Given that it is trained on Google's data, you would expect it to be top of the class. However, it lags behind other LLMs in accuracy and consistency. And, what hurts the most, it has limited availability.
Falcon performs well on tests when compared to ChatGPT and Llama. It is predominantly optimized for performance and efficiency. But with a model that is half the size of the competition, many worry about accuracy and consistency issues.

ChatGPT Leads the Pack, For Now.

ChatGPT remains the LLM leader for now, given its exceptional accuracy, reliability, and customizability. However, hungry contenders like Claude and Llama are rapidly improving and could dethrone it soon.

The competition for the best LLM is far from over.

In part 2 of this article, we will dig deeper into each of the LLMs.

Happy building!!

The AI Empowered Product Manager
