LLM quantization naming explained
LLMs are everywhere right now. Every month, if not every two weeks, companies release new open-source models that try to challenge OpenAI’s position and take a share of the pie.
Regular users can only watch and enjoy the new heights these models reach in terms of performance, size, and speed. There are several publicly available leaderboards, for example the Open LLM Leaderboard on Hugging Face [1], where any user can review models and choose the one best suited to the task according to selected metrics.
Once you have settled on a model, you move to the next stage and enter the world of its quantized versions. Their names may be unfamiliar and are not listed on the leaderboard, which can be bewildering at first sight.
I will try to briefly explain this naming using the Llama-2-13B-chat model [2] as an example, so that everyone can understand in simple terms what the parts that follow the model name, separated by dots and underscores, actually mean.
Let’s dive into the naming structure element by element. Two parts are shared by all the file names: the model name, llama-2-13b-chat, and the file format, ggmlv3 (version 3 of the GGML format). Then the magic begins.
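q2, q3, q4, q5, q6, q8: The number after ‘q’ is the nominal number of bits used to store each of the model’s weights. Lower numbers mean a smaller file and less memory usage, but lower precision. The effective size per weight is slightly above the nominal value, because every block of weights also stores a scale (and, in some schemes, a minimum).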
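To get a feel for what the number after ‘q’ means in practice, here is a rough back-of-the-envelope estimate. It assumes that “13B” means exactly 13 billion weights and that the number after ‘q’ is the only cost per weight; real files are somewhat larger because of per-block metadata. The function name is made up for this illustration.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "13B" is taken as exactly 13 billion weights.
N_PARAMS = 13e9

# Nominal bit widths only; per-block scales and minimums are ignored here.
for name, bits in [("q2", 2), ("q4", 4), ("q8", 8), ("fp16", 16)]:
    print(f"{name:>4}: ~{estimate_size_gb(N_PARAMS, bits):.1f} GB")
```

Treat the printed numbers as a lower bound: they are useful for comparing levels with each other, not for predicting an exact file size.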
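_0, _1, _K: This suffix indicates the quantization scheme, that is, how the weights are mapped to low-bit integers.
- 0 and 1: They indicate block-wise uniform quantization, where the range of values in each block of weights is divided into equal steps. This is the basic approach. With _0 each block stores only a scale (weight ≈ scale × q); with _1 it stores a scale and a minimum (weight ≈ scale × q + min), which takes slightly more memory but is usually more accurate.
- K: This stands for k-quantization (“k-quants”), a more advanced family of schemes. It groups blocks into larger super-blocks, quantizes the per-block scales and minimums as well, and can use different bit widths for different tensors of the model to make better use of memory.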
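The two basic uniform schemes can be sketched in a few lines of plain Python. This is a simplified illustration and not the actual llama.cpp code: the function names are hypothetical, the block is a short list instead of a fixed-size block of weights, and the reading of _0 as “scale only” and _1 as “scale plus minimum” is an assumption stated here for the sake of the example.

```python
def quantize_type0(block, bits=4):
    """Scale-only scheme: weight is approximated by scale * q."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    amax = max(abs(w) for w in block)
    scale = amax / qmax if amax > 0 else 1.0
    # Signed integers, clamped to the representable range.
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in block]
    return scale, q


def dequantize_type0(scale, q):
    return [scale * v for v in q]


def quantize_type1(block, bits=4):
    """Scale-plus-minimum scheme: weight is approximated by scale * q + minimum."""
    levels = 2 ** bits - 1                          # 15 for 4 bits
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Unsigned integers counting equal steps up from the minimum.
    q = [max(0, min(levels, round((w - lo) / scale))) for w in block]
    return scale, lo, q


def dequantize_type1(scale, lo, q):
    return [scale * v + lo for v in q]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# A block whose values are all positive, so the range is far from symmetric.
block = [0.1, 0.5, 0.9, 1.3]

scale0, q0 = quantize_type0(block)
scale1, min1, q1 = quantize_type1(block)

print("type 0 error:", max_error(block, dequantize_type0(scale0, q0)))
print("type 1 error:", max_error(block, dequantize_type1(scale1, min1, q1)))
```

On a block like this one, where all values sit on one side of zero, the scheme with a minimum spends its integer levels only on the range that is actually used, which is why it tends to lose less precision at the cost of storing one extra number per block.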
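_S, _M, _L: These letters appear only with k-quants and indicate the size variant, from small to large. Smaller means less memory usage but lower precision. The letter does not describe the block size; according to the model card [2], it describes which mix of quantization types is used across the model’s tensors.
- S (Small): The base type is used for all tensors (for example, q3_K_S uses q3_K everywhere). This gives the smallest file and the lowest precision of the three.
- M (Medium): Some of the attention and feed-forward tensors are stored with a higher-bit type, the rest with the base type.
- L (Large): The same idea with an even higher-bit type for those tensors, giving the largest file and the highest precision of the three.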
That’s it, now you should easily understand what is written in the quantization part of the file name, and why (at least for the llama-2-13b-chat model 🤭)
So, here’s a quick test of your understanding of the naming convention:
- llama-2-13b-chat.ggmlv3.q4_0.bin: Uses 4-bit quantization with the basic uniform scheme (the range is divided into equal steps).
- llama-2-13b-chat.ggmlv3.q2_K.bin: Uses 2-bit quantization with the k-quant scheme (more advanced, not limited to a single bit width). There is no S/M/L suffix because there is only one q2_K variant.
- llama-2-13b-chat.ggmlv3.q3_K_M.bin: Uses 3-bit quantization with the k-quant scheme, medium variant (larger and more precise than q3_K_S, smaller than q3_K_L).
- llama-2-13b-chat.ggmlv3.q5_K_S.bin: Uses 5-bit quantization with the k-quant scheme, small variant (the smallest q5_K file).
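The same reading of a file name can be done mechanically. Below is a minimal sketch of a parser; the pattern is inferred only from the examples in this article, the names `QuantName` and `parse_quant_name` are invented for the illustration, and files that follow a different convention will simply be rejected.

```python
import re
from typing import NamedTuple, Optional


class QuantName(NamedTuple):
    model: str              # e.g. "llama-2-13b-chat"
    file_format: str        # e.g. "ggmlv3"
    bits: int               # number after "q"
    scheme: str             # "0", "1" or "K"
    variant: Optional[str]  # "S", "M", "L", or None when absent


# Assumed layout: <model>.<format>.q<bits>_<scheme>[_<variant>].bin
PATTERN = re.compile(
    r"""^(?P<model>.+)
        \.(?P<fmt>ggmlv\d+)
        \.q(?P<bits>\d+)
        _(?P<scheme>[01K])
        (?:_(?P<variant>[SML]))?
        \.bin$""",
    re.VERBOSE,
)


def parse_quant_name(filename: str) -> QuantName:
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename!r}")
    return QuantName(
        model=match["model"],
        file_format=match["fmt"],
        bits=int(match["bits"]),
        scheme=match["scheme"],
        variant=match["variant"],  # None if the optional group did not match
    )


for name in [
    "llama-2-13b-chat.ggmlv3.q4_0.bin",
    "llama-2-13b-chat.ggmlv3.q2_K.bin",
    "llama-2-13b-chat.ggmlv3.q3_K_M.bin",
    "llama-2-13b-chat.ggmlv3.q5_K_S.bin",
]:
    print(parse_quant_name(name))
```

Keeping the variant optional mirrors the list above: q4_0 and q2_K have no size letter, while q3_K_M and q5_K_S do.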
Thanks for reading! I hope this helps you understand the evolving LLM world a bit better.
References
[1] Open LLM Leaderboard v2 by Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf (2024), published by Hugging Face, available at https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.
[2] Quantized Llama 2 13B Chat — GGML models list, available at https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML