The AI Glossary: Every Term You Need to Know in 2025

Entering the AI space feels like learning a new language. Everyone throws around RAG, RLHF, GGUF, MoE, MCP like you’re supposed to know what they mean. This glossary cuts through the noise. Grouped by topic, no fluff, just what each term means and why it matters.

Model Training

Pretraining — The first and most expensive step. Feed a model trillions of tokens from the internet so it learns grammar, facts, and reasoning by predicting the next word. Costs millions in GPU compute. Produces a base model — powerful but unable to follow instructions. Think of it as reading the entire internet and learning to autocomplete sentences.

Fine-Tuning — Specializing a pretrained model for a specific task. The most common form is instruction tuning (training on Q&A pairs so the model learns to follow instructions and chat). This is what turns a base model into a chat model.

LoRA (Low-Rank Adaptation) — A cheap fine-tuning technique. Instead of updating all weights, it freezes them and trains small adapter matrices on top. Slashes memory requirements; combined with 4-bit loading (see QLoRA below), even 70B-class models become fine-tunable on a single GPU. See github.com/microsoft/LoRA.

QLoRA — LoRA + quantization. Loads the base model in 4-bit precision to further cut memory, then applies LoRA adapters. Fine-tuning on a budget.
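
For a feel of what this looks like in practice, here is a minimal QLoRA sketch using HuggingFace transformers, peft, and bitsandbytes. The model name and hyperparameters are illustrative, not recommendations:

```python
# Minimal QLoRA sketch: 4-bit base model + small LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # frozen base model loads in 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", quantization_config=bnb  # illustrative model name
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # small low-rank adapter matrices
    target_modules=["q_proj", "v_proj"],     # attach to the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # typically well under 1% of all weights
```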

RLHF (Reinforcement Learning from Human Feedback) — The model generates multiple responses and humans rank them; a reward model is trained on those rankings, then reinforcement learning (classically PPO) tunes the model to prefer higher-scoring outputs. How models become “helpful and harmless.”

DPO (Direct Preference Optimization) — A simpler alternative to RLHF. Directly trains on human preference data without needing a separate reward model. More stable, cheaper. See arxiv.org/abs/2305.18290.
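
The core of DPO fits in a few lines. A sketch of the loss from the paper, assuming you have already computed per-response log-probabilities under the policy and a frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective (Rafailov et al., 2023): widen the policy's log-prob
    gap between chosen and rejected answers, measured relative to the
    frozen reference model. No reward model, no RL rollouts."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```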

GRPO (Group Relative Policy Optimization) — The technique behind reasoning models like DeepSeek-R1. Generates a group of outputs, rewards the best ones relative to the group. Produces models that can think step-by-step before answering.

RL (Reinforcement Learning) — Teaching through trial, error, and reward. In the LLM world, RL is what enables reasoning models — models that generate a chain of thought before answering. The reasoning process gets reinforced when the final answer is correct.

Distillation — Training a smaller model to mimic a larger one. The small model (student) learns from the large model’s (teacher) outputs. Example: DeepSeek-R1-Distill-Qwen-7B is a 7B model trained to reproduce the reasoning of the full 671B DeepSeek-R1. You get much of the teacher’s capability at a fraction of the cost.

SFT (Supervised Fine-Tuning) — Training the model on high-quality input-output examples curated by humans. The step between pretraining and RLHF. Teaches the model the format and style of conversations.

Model Architecture

Transformer — The architecture behind virtually every modern LLM. Processes all tokens simultaneously using self-attention, letting the model understand relationships between any parts of the text. The “T” in GPT. Introduced in the 2017 paper “Attention Is All You Need”.

MoE (Mixture of Experts) — Splits the model into multiple “expert” networks. A router picks which experts handle each token. E.g., Mixtral 8x7B has 8 experts but only activates 2 per token — quality of a 47B model at the speed of a 13B model. The tradeoff: the full model must fit in memory even though only a fraction runs per token.

Total Parameters — The full size of the model on disk. For MoE models, this includes all experts combined. For an 8x7B MoE model like Mixtral, the total is ~47B parameters (shared attention layers mean it’s less than 8 × 7B).

Active Parameters — How many parameters are actually used per token during inference. In MoE models, active params are much less than total. In dense (non-MoE) models, they’re the same. Active parameters determine inference speed; total parameters determine memory usage.
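
A toy sketch of how a top-k router makes active parameters smaller than total parameters (simplified; real MoE layers batch this far more efficiently):

```python
import torch

def moe_layer(x, experts, router, k=2):
    """Toy top-k MoE routing. Only k experts run per token (active params),
    but every expert must sit in memory (total params)."""
    weights, idx = torch.topk(router(x).softmax(dim=-1), k)  # [tokens, k]
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e                # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```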

Dense Model — A standard model where every parameter is used for every token. The opposite of MoE. Most models under 30B are dense.

Attention — The mechanism that lets transformers understand relationships between tokens. Each token “looks at” every other token to decide how much to weight it. This is why LLMs can maintain context and understand nuance across long texts.
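
The whole mechanism is a few lines of math. A minimal scaled dot-product attention sketch in PyTorch:

```python
import torch

def attention(q, k, v):
    """Scaled dot-product attention: each query scores every key, softmax
    turns the scores into weights, and the output is the weighted sum of values."""
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v
```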

RoPE (Rotary Position Embedding) — How transformers understand the order of tokens. Encodes position information by rotating the token embeddings. Used by Llama, Qwen, Mistral, and most modern models. Different “rope” frequency settings affect how well the model handles long contexts.

GQA (Grouped-Query Attention) — An optimization that reduces memory and compute by sharing key/value heads across multiple query heads in the attention mechanism. Used by Llama 2 (70B), Llama 3, and most modern models. The alternative is MHA (Multi-Head Attention), which gives every query head its own key/value head and therefore a much larger KV cache.

Model Capabilities & Modalities

Multimodal — A model that processes more than just text. Can handle images, audio, or video alongside text. Examples: GPT-4o, Claude, Gemini, Qwen3-VL. Typically adds a vision encoder to the text transformer.

Vision — The ability to understand images and screenshots. Vision models can describe images, read text in screenshots, analyze charts, and answer questions about visual content. Achieved by pairing a text transformer with a vision encoder like CLIP or SigLIP.

VL (Vision-Language) — A model natively trained on both images and text. Denoted by “VL” in model names (e.g., Qwen2.5-VL, InternVL). Not the same as a text model with a bolted-on vision adapter — VL models are trained end-to-end on image-text pairs.

Audio / Speech — Models that can listen to and/or generate audio. Input: transcribe speech, understand voice commands (Whisper). Output: generate natural-sounding speech from text (ElevenLabs, Bark, XTTS). Some models like GPT-4o and Gemini handle both directions natively.

Code Generation — A model trained heavily on source code that can write, explain, and debug code. Examples: DeepSeek-Coder, Qwen2.5-Coder, StarCoder, CodeLlama. Often evaluated on HumanEval and SWE-Bench.

Embedding — A vector (list of numbers) representing the meaning of text. Used to compare documents by similarity — the basis of RAG search. Embeddings for similar texts end up close together in vector space. Model families often have dedicated embedding models (e.g., BGE, E5, GTE).
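
A quick sketch with the sentence-transformers library (the model name here is just one popular choice):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # any embedding model works
docs = ["How do I reset my password?", "Best pizza dough recipe"]
d = model.encode(docs)                   # one vector per document
q = model.encode("I forgot my login")    # one vector for the query

# Cosine similarity: closer to 1.0 means closer in meaning
sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
print(sims)  # the password doc should score far above the recipe
```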

Reasoning / Thinking — Models that generate a chain of thought before answering, similar to how humans “think out loud.” Denoted by <think> tags in output. They work through problems step-by-step, which dramatically improves performance on math, logic, and complex multi-step tasks. Examples: DeepSeek-R1, QwQ, o1, o3, Grok-4.3.

Thinking Budget — A parameter that controls how long a reasoning model “thinks” before answering. Higher budget = more tokens spent reasoning = better answers on hard problems but slower and more expensive. Some APIs expose this as “reasoning effort” (low/medium/high).

Tool Use / Function Calling — The model can output structured requests to invoke external functions (search, calculator, API calls). Essential for agents. See Anthropic’s tool use docs and OpenAI’s function calling docs.
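
A sketch of the round trip in the OpenAI-style schema (field names follow the public docs; my_weather_api is a hypothetical backend function):

```python
import json

# Declaring a tool to the model (OpenAI-style schema):
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# The model never runs code itself: it emits a structured request,
# your code executes it, and you feed the result back.
def handle(tool_call):
    args = json.loads(tool_call["function"]["arguments"])  # model emits JSON args
    if tool_call["function"]["name"] == "get_weather":
        return my_weather_api(args["city"])  # hypothetical backend function
```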

Structured Output — Forcing the model to respond in a specific format (JSON, XML, matching a JSON Schema). Essential for extracting data, generating API payloads, or any programmatic use. Supported natively by OpenAI, Anthropic, and local models via constrained decoding in llama.cpp.

JSON Mode — A lighter version of structured output. Tells the model to respond in valid JSON without enforcing a specific schema. Faster to set up but less reliable than full schema-based structured output.

Core Concepts

Token — The basic unit of text a model processes. Can be a word, part of a word, or a character. English text averages ~1.3 tokens per word. The tokenizer breaks text into these pieces before the model sees them. Try tiktokenizer to explore how text gets tokenized.
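
You can poke at this in code too, with OpenAI’s tiktoken library:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the GPT-4-era tokenizer
tokens = enc.encode("Tokenization is fun!")
print(len(tokens))          # likely 5 tokens for this 3-word sentence
print(enc.decode(tokens))   # round-trips back to the original text
```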

Context Window / Context Length — How many tokens the model can process at once (input + output). 1 token ≈ ¾ of an English word. 8K context ≈ 6,000 words. 128K ≈ 96,000 words. Longer is better for RAG and long conversations, but uses more memory and compute.

Logits — The raw scores a model produces for every possible next token before they’re converted to probabilities. Higher logit = more likely token. Sampling parameters (temperature, top-p) manipulate logits.

Parameters — The “weights” or “connections” in a neural network. More parameters generally means more capability but also more memory and compute. The “B” in model names (7B, 70B) refers to billions of parameters.

FLOPS — Floating-point operations per second, a measure of hardware compute power. Total training compute is counted in plain FLOPs (no “per second”): frontier-scale training runs are on the order of 10^25 operations. Inference speed is usually quoted in tokens per second instead.

Hallucination — When a model generates confident but incorrect information. Not a bug in the traditional sense — the model is doing exactly what it was trained to do (predict likely text), but it doesn’t have a concept of “truth.” Mitigated by RAG, grounding, and tool use.

System Prompt — Instructions given to the model before the conversation starts. Sets behavior, tone, and constraints. E.g., “You are a helpful coding assistant. Respond concisely.”

Few-Shot / Zero-Shot — Zero-shot: the model handles a task with no examples. Few-shot: you provide a few input/output examples in the prompt to show the model what you want. More examples = better performance on specific formats.

Chain-of-Thought (CoT) — Prompting the model to reason step-by-step before giving a final answer. Dramatically improves performance on math, logic, and reasoning tasks. Reasoning models do this internally (in <think> tags); for other models, you add “think step by step” to your prompt.

Tokenizer — The component that converts text into tokens and back. Different models use different tokenizers. A code-optimized tokenizer will split code into fewer tokens than a general-purpose one. BPE (Byte Pair Encoding) and SentencePiece are the most common tokenization methods.

Workflow Patterns

RAG (Retrieval-Augmented Generation) — Fetches relevant documents from a knowledge base and injects them into the prompt. The model then generates an answer based on those documents. Pipeline: embed query → vector search → stuff documents into prompt → generate. How “chat with your PDFs” works. Cheaper and more updatable than fine-tuning for knowledge tasks.
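
The whole pipeline, as a bare-bones sketch (embed and llm are hypothetical wrappers around whatever embedding model and chat model you use):

```python
import numpy as np

def rag_answer(query, docs, embed, llm, k=3):
    """embed() returns vectors, llm() returns text; both are hypothetical helpers."""
    d = embed(docs)                        # [n_docs, dim]
    q = embed([query])[0]
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    top = [docs[i] for i in np.argsort(sims)[-k:][::-1]]   # k best matches
    context = "\n\n".join(top)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```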

Vector Database — A database optimized for storing and searching embeddings. Used in RAG to find documents similar to a query. Examples: Pinecone, Chroma, Weaviate, Qdrant, Milvus.

Agent — An LLM that autonomously decides what actions to take to accomplish a goal. Unlike a chatbot that just responds, an agent can plan, use tools, iterate, and self-correct. The loop: task → think → act → observe → repeat until done. Frameworks: LangChain, CrewAI, AutoGen.
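
Strip away the frameworks and the loop looks roughly like this (chat and run_tool are hypothetical helpers):

```python
def run_agent(task, chat, run_tool, max_steps=10):
    """chat() calls the model, run_tool() executes a tool request; both hypothetical."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                  # hard cap so it can't loop forever
        reply = chat(messages)                  # think: model picks the next move
        messages.append(reply)
        if reply.get("tool_call"):              # act: model asked for a tool
            result = run_tool(reply["tool_call"])
            messages.append({"role": "tool", "content": str(result)})  # observe
        else:
            return reply["content"]             # done: final answer
```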

MCP (Model Context Protocol) — An open standard by Anthropic for connecting AI models to external tools and data sources. Think USB-C for AI — write an MCP server once, use it with any MCP-compatible client (Claude Desktop, LM Studio, IDEs). See modelcontextprotocol.io.
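
A minimal server sketch, assuming the official Python SDK’s FastMCP helper (pip install mcp):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default; register it in any MCP client's config
```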

Prompt Engineering — Crafting inputs to get better outputs from a model. Includes system prompts, few-shot examples, chain-of-thought (“think step by step”), and role assignment.

Agentic Coding — Using an AI agent to write, edit, and debug code autonomously. The agent reads files, runs tests, fixes errors, and iterates. Examples: Claude Code, Cursor, Copilot Workspace, Aider. Powered by tool calling + file system access.

Speculative Decoding — A speed optimization that uses a small “draft” model to generate candidate tokens, then a larger “verification” model checks them in parallel. If the draft is correct (which it often is), you get the speed of the small model with the quality of the large one. Supported by LM Studio and llama.cpp.
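
Conceptually it looks like this. The sketch shows the verification sequentially for clarity; real engines check all draft tokens in one big-model forward pass, and draft_next / target_next are hypothetical greedy decoders:

```python
def speculative_step(ctx, draft_next, target_next, k=4):
    """One round of greedy speculative decoding (conceptual sketch)."""
    guesses, tmp = [], list(ctx)
    for _ in range(k):                       # small model drafts k cheap tokens
        t = draft_next(tmp)
        guesses.append(t)
        tmp.append(t)
    accepted = []
    for g in guesses:                        # big model checks each position
        expected = target_next(ctx + accepted)
        accepted.append(expected)
        if expected != g:                    # first mismatch: keep the fix, stop
            break
    return accepted                          # often all k, at draft-model cost
```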

Guardrails — Safety filters that sit between the user and the model. They block harmful content, enforce output formats, and prevent the model from going off-topic. Examples: Guardrails AI, NeMo Guardrails.

Evaluation & Benchmarks

MMLU — Tests knowledge across 57 academic subjects (STEM, humanities, law, medicine). The standard general-knowledge benchmark. See arxiv.org/abs/2009.03300.

HumanEval — Coding benchmark. The model writes Python functions from docstrings, tested against unit tests. Measures practical coding ability.

SWE-Bench — Real-world software engineering benchmark. Models must resolve actual GitHub issues from popular open-source repos. Much harder than HumanEval — tests end-to-end coding ability.

GSM8K — Grade-school math word problems. Tests multi-step arithmetic reasoning.

GPQA — Graduate-level science questions. Hard enough that even PhD experts struggle. See arxiv.org/abs/2311.12022.

ARC-AGI — Abstract visual reasoning puzzles. Tests whether models can solve novel patterns they haven’t seen before. Considered a measure of genuine reasoning ability. See arcprize.org.

LiveBench — Continuously updated benchmark with new questions to prevent models from memorizing the test (data contamination). See livebench.ai.

Chatbot Arena (LMSYS) — Crowd-sourced Elo rating system where models are compared head-to-head by human evaluators. The most trusted leaderboard for real-world chat quality. See chat.lmsys.org.

Elo Rating — A ranking system (borrowed from chess) where models are scored based on pairwise comparisons. Higher Elo = better. GPT-4 class is ~1250, Claude 3.5 Sonnet class is ~1270.

Data Contamination — When benchmark test data leaks into the model’s training data. The model “memorizes” the answers instead of actually learning to solve them. Makes benchmark scores unreliable. LiveBench and ARC-AGI are designed to resist this.

Model Formats

GGUF — The standard format for running models on consumer hardware via llama.cpp. Contains quantized weights. Works on CPU, GPU, or both. What LM Studio, Ollama, and most local AI tools use.

Safetensors — HuggingFace’s format for storing full-precision weights. Replaces the older PyTorch .bin format (and, unlike pickle-based .bin files, can’t execute code on load). Used for training and fine-tuning. The source format before converting to GGUF or other compressed formats.

MLX — Apple’s format optimized for Apple Silicon (M1/M2/M3/M4 chips). Leverages unified memory for fast inference on Macs.

GPTQ — Quantization format optimized for Nvidia GPU inference. Faster than GGUF on GPU, but less portable.

AWQ — Activation-aware quantization. Similar to GPTQ but preserves accuracy better by considering how the model actually uses each weight during inference.

EXL2 — Format used by ExLlamaV2. Optimized for maximum speed on Nvidia GPUs. Popular for running large models fast.

ONNX — Microsoft’s cross-platform format. Good for running models on edge devices and non-GPU hardware via ONNX Runtime.

Quantization

Quantization — Reducing model size by lowering weight precision. Full precision = 16 bits per parameter. Quantization compresses to 8-bit, 4-bit, or lower. Cuts RAM requirements dramatically with minimal quality loss.
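
Here is the core idea in miniature, an absmax 8-bit quantizer in numpy (real GGUF schemes use per-block scales and cleverer rounding):

```python
import numpy as np

def quantize_int8(w):
    """Absmax quantization: int8 weights plus one float scale."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale   # 1 byte per weight

def dequantize(q, scale):
    return q.astype(np.float32) * scale                 # approximate originals

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())               # small reconstruction error
```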

Q4_K_M — The go-to quantization level. ~4.8 bits per weight. Roughly 70% smaller than full precision with barely perceptible quality loss. The default choice for most use cases.

Q5_K_M — Slightly higher quality than Q4_K_M at roughly 15% more size (5 GB vs. 4.4 GB for a 7B model). Use if you have the RAM and want better output.

Q8_0 — 8-bit quantization. Near-perfect quality at half the size of FP16. Use when quality matters more than memory savings.

Q2_K / Q3_K — Heavy compression. Noticeable quality loss. Only for running very large models on very limited hardware.

F16 / BF16 — Full precision. FP16 is float16, BF16 is bfloat16 (better for training). Not quantized — used as the baseline for comparison.

K-quant (K_M, K_S, K_L) — Per-block scaling quantization that preserves quality better than simple uniform compression. M = medium (best balance), S = small (more compression), L = large (less compression).

IQ (Imatrix Quantization) — Uses an importance matrix calculated from calibration data to determine which weights matter most and compress others more aggressively. IQ4_XS gives better quality than plain Q4 at similar size.

| Format | Size (7B model) | Size (70B model) | Quality |
|--------|-----------------|------------------|---------|
| F16 | 14 GB | 140 GB | Full precision |
| Q8_0 | 7 GB | 70 GB | Near-perfect |
| Q6_K | 6 GB | 58 GB | Excellent |
| Q5_K_M | 5 GB | 48 GB | Very good |
| Q4_K_M | 4.4 GB | 41 GB | Good (recommended) |
| Q4_K_S | 3.9 GB | 37 GB | Decent |
| Q3_K_M | 3.5 GB | 33 GB | Okay |
| Q2_K | 2.9 GB | 28 GB | Noticeable loss |

Latest Open-Source Models (May 2025)

Top trending models on HuggingFace and OpenRouter, with their capabilities and hardware requirements.

| Model | Params | Type | Capabilities | Min Hardware (Q4) |
|-------|--------|------|--------------|-------------------|
| DeepSeek-V4-Pro | 862B (MoE) | Reasoning | Text, Code, Thinking | Multi-GPU / Cloud |
| DeepSeek-V4-Flash | 158B (MoE) | Reasoning | Text, Code, Thinking | 96 GB VRAM |
| Qwen3.6-27B | 28B | Vision | Text, Image, Vision | 20 GB VRAM |
| Qwen3.6-35B-A3B | 35B (MoE) | General | Text, Code, 3B active | 24 GB VRAM |
| Kimi-K2.6 | 1.1T (MoE) | Vision | Text, Image, Vision, Code | Multi-GPU / Cloud |
| Gemma-4-31B-it | 31B | Vision | Text, Image, Vision | 24 GB VRAM |
| Mistral Medium 3.5 | 128B | Vision | Text, Image, Vision, Code | Multi-GPU / Cloud |
| Granite 4.1 8B | 8B | General | Text, Code, Tools, 12 langs | 8 GB VRAM |
| Nemotron-3 Nano 30B | 30B (MoE) | Reasoning | Text, Audio, Vision, 3B active | 24 GB VRAM |
| GLM-5.1 | ~130B (MoE) | Reasoning | Text, Code, Thinking | Multi-GPU / Cloud |
| DeepSeek-R1 | 671B (MoE) | Reasoning | Text, Code, Thinking | Multi-GPU / Cloud |
| Mixtral 8x7B | 47B (MoE) | General | Text, Code, 13B active | 24 GB VRAM |

Common suffixes:

  • Instruct / Chat / -it — Instruction-tuned, ready to chat
  • Base — Raw pretrained model (for fine-tuning, not chatting)
  • Distill — A smaller model trained to mimic a larger one
  • -Q4_K_M / -GGUF — Quantized for local inference
  • -VL — Vision-Language variant (handles images)
  • -Coder — Code-specialized variant

Size classes at a glance:

  • 0.5B – 3B — Runs anywhere, including phones and Raspberry Pi. Limited capability.
  • 7B – 9B — Laptop class (8 GB VRAM). Good for general tasks and coding.
  • 12B – 15B — Needs 12 GB+ VRAM. Noticeable quality jump over 7B.
  • 32B – 35B — Sweet spot for quality vs. accessibility. Needs 24 GB VRAM.
  • 70B+ — Near frontier quality. Multi-GPU or 48 GB+ unified memory.

LM Studio Settings

The inference parameters in the right sidebar, explained. See LM Studio docs.

Temperature — Controls randomness. Scales the model’s confidence before picking the next word.

  • 0.0 – 0.3: Focused, deterministic. Best for code, facts, data extraction.
  • 0.4 – 0.7: Balanced. Default for general chat.
  • 0.8 – 1.0: Creative, unpredictable. Best for brainstorming, storytelling.
  • 1.0+: Increasingly incoherent. Avoid.

Top P (Nucleus Sampling) — Only considers tokens whose cumulative probability reaches P. At 0.9, the bottom 10% of unlikely tokens are discarded. Lower = more focused. 1.0 = disabled.

Top K — Only considers the K most probable tokens. Top K = 40 means pick from the top 40 options. Lower = more predictable.

Min P — Filters tokens below a probability threshold relative to the top token. Adapts to the model’s confidence — strict when sure, permissive when uncertain. Often works better than Top P.
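
Here is how those four samplers chain together, as a simplified sketch (real samplers renormalize between filters and run inside the inference engine, not numpy):

```python
import numpy as np

def sample(logits, temperature=0.7, top_k=40, top_p=0.95, min_p=0.05):
    """Temperature, Top K, Min P, and Top P applied to one logit vector."""
    scaled = logits / temperature                     # temperature rescales confidence
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                              # softmax over the vocabulary
    probs[np.argsort(probs)[:-top_k]] = 0.0           # Top K: keep the 40 best
    probs[probs < min_p * probs.max()] = 0.0          # Min P: cutoff relative to best
    order = np.argsort(probs)[::-1]                   # Top P: keep the nucleus
    keep = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    probs[order[keep:]] = 0.0
    probs /= probs.sum()                              # renormalize and draw
    return np.random.choice(len(probs), p=probs)
```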

Repetition Penalty — Reduces probability of tokens already used. 1.0 = off. 1.1 – 1.2 = prevents looping. Above 1.3 = text sounds forced.

Frequency Penalty — Like repetition penalty but proportional — the more a token appears, the harder it gets penalized. Smoother than flat repetition penalty.

Max Tokens — Maximum length of the model’s response. 1 token ≈ ¾ word. Default 2048 is fine for most tasks. Use 4096+ for code generation.

CPU Threads — How many CPU cores to use for inference. On most systems, set it to your physical core count (not the logical thread count); on Macs, the total thread count works well. In hybrid GPU/CPU mode, lower is fine since the GPU handles most of the work.

GPU Offload — How many model layers run on the GPU. “Max” = everything on GPU (fastest, needs enough VRAM). Lower values split between GPU and CPU (slower, but lets you run bigger models).

Quick presets:

| Use Case | Temp | Top P | Rep Penalty | Max Tokens |
|----------|------|-------|-------------|------------|
| Coding / Facts | 0.2 | 0.9 | 1.1 | 4096 |
| General Chat | 0.6 | 0.95 | 1.1 | 2048 |
| Creative Writing | 0.8 | 0.95 | 1.15 | 4096 |
| Reasoning / Math | 0.1 | 0.9 | 1.05 | 8192 |
| Summarization | 0.3 | 0.9 | 1.1 | 1024 |
| Brainstorming | 0.9 | 0.98 | 1.1 | 2048 |

Inference & Serving

vLLM — High-throughput serving engine for LLMs. Uses PagedAttention to manage KV cache memory efficiently. The go-to for production deployments. See github.com/vllm-project/vllm.

KV Cache — Key-Value cache. Stores previously computed attention results so the model doesn’t recompute them for every new token. Without it, inference would be impossibly slow for long contexts. The KV cache is often the biggest memory consumer during inference.
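
A back-of-the-envelope calculator. The formula is standard; the example numbers are in the ballpark of an 8B-class model with GQA:

```python
def kv_cache_gib(layers, kv_heads, head_dim, context, bytes_per_value=2):
    """2 tensors (K and V) per layer per token, stored at 16-bit by default."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value / 1024**3

print(kv_cache_gib(layers=32, kv_heads=8, head_dim=128, context=128_000))
# ≈ 15.6 GiB just for the cache at a full 128K context
```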

Throughput — How many tokens per second the model can generate. Depends on hardware, model size, quantization, and batch size. A 7B Q4 model on a modern GPU does 30-80 tokens/sec.

TTFT (Time To First Token) — How long you wait before the model starts generating. Important for chat UX — lower is better. Depends on prompt length, model size, and hardware.

Batching — Processing multiple requests simultaneously. Continuous batching (dynamic batching) mixes requests at different stages to maximize GPU utilization. Critical for serving multiple users.

OpenAI-Compatible API — The de facto standard API format for LLMs. Originally from OpenAI, now supported by LM Studio, Ollama, vLLM, Together AI, and most providers. Same /v1/chat/completions endpoint everywhere. See OpenAI API reference.
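
In practice this means the official openai Python client works against a local server just by swapping the base URL. LM Studio defaults to port 1234; Ollama is typically http://localhost:11434/v1. Adjust for your setup:

```python
from openai import OpenAI

# Same client, local server; the api_key just needs to be non-empty.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="local-model",   # whatever model name your server reports
    messages=[{"role": "user", "content": "Explain KV cache in one sentence."}],
    temperature=0.4,
)
print(resp.choices[0].message.content)
```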

HuggingFace — The GitHub of AI. Hosts models, datasets, and spaces (demos). Where most open-source models are published and downloaded. See huggingface.co.

OpenRouter — An API aggregator that gives you access to hundreds of models (open and closed) through a single API. Compare prices, capabilities, and switch models easily. See openrouter.ai.

How It All Connects

Build a coding assistant: pick a Q4_K_M GGUF model from HuggingFace → load in LM Studio at temperature 0.2 → connect MCP servers for filesystem access → the model uses tool calling to read files → generates structured output (JSON) → optionally add RAG with a vector database for codebase context.

Every term in that pipeline is in this glossary. Bookmark it. Come back when something doesn’t click.
