AI and Machine Learning Vocabulary: LLM, RAG, Embeddings Explained

Plain-English definitions of 35 AI and machine learning terms: LLM, RAG, embeddings, tokens, hallucination, fine-tuning, prompt engineering, vector database, and more.

AI and machine learning have their own dense vocabulary — and since 2022, much of it has moved from research papers into everyday engineering conversations. If you work with AI tools, LLMs, or ML pipelines, you need to know this language. This guide explains the most important terms clearly, without assuming a mathematics background.


Large Language Models (LLMs)

LLM (Large Language Model)

A Large Language Model is a type of AI model trained on massive amounts of text data, capable of generating, translating, summarising, and answering questions in natural language. Examples: GPT-4, Claude, Gemini, Llama.

“We use an LLM to power the support chatbot.”

Token

In the context of LLMs, a token is a unit of text the model processes. A token is roughly 3–4 characters or about ¾ of a word in English. “Hello, world!” ≈ 4 tokens.

Why it matters: LLMs have a context window — a limit on how many tokens they can process at once. Pricing for cloud LLM APIs is usually per token.
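The rule of thumb above can be turned into a quick budgeting helper. This is a rough character-based estimate, not a real tokenizer — BPE tokenizers (such as the ones LLM providers use) give exact counts; the function names here are illustrative:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token
    rule of thumb for English text. Real tokenizers give exact
    counts; this is only for quick budgeting."""
    return max(1, round(len(text) / 4))

def fits_context(prompt: str, context_window: int, reserved_for_output: int) -> bool:
    """Check whether a prompt leaves enough room for the reply
    inside the model's context window."""
    return estimate_tokens(prompt) + reserved_for_output <= context_window

print(estimate_tokens("Hello, world!"))  # 3 with this heuristic (a real tokenizer may differ)
```

Estimates like this are fine for alerting when a prompt is close to the limit; use the provider's actual tokenizer before relying on an exact count.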

Context Window

The context window is the total amount of text (in tokens) that an LLM can take in and consider in one request. Larger context windows allow more background information, longer conversations, or bigger documents.

“The model has a 128k context window — it can process the entire codebase at once.”

Prompt

A prompt is the input you give to an LLM. In a chat interface, it is your message. In an API call, it is the text you send to the model.

Prompt Engineering

Prompt engineering is the practice of designing and refining prompts to get better outputs from LLMs. Techniques include: providing examples (few-shot prompting), assigning a role (“You are a senior engineer…”), and structuring instructions clearly.
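The techniques above compose naturally. A minimal sketch of building a few-shot prompt with a role — the role text, examples, and function name are all made up for illustration:

```python
# Build a few-shot prompt: a role, labelled examples, then the new input.
def build_prompt(role: str, examples: list[tuple[str, str]], query: str) -> str:
    lines = [role, ""]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model continues from here
    return "\n".join(lines)

prompt = build_prompt(
    role="You are a sentiment classifier. Answer with positive or negative.",
    examples=[("Great service!", "positive"), ("Broken on arrival.", "negative")],
    query="Works exactly as described.",
)
print(prompt)
```

Ending the prompt at “Output:” nudges the model to complete the pattern established by the examples.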

Hallucination

Hallucination is when an LLM confidently generates information that is factually incorrect, made up, or inconsistent with the context. Inventing a function that does not exist, citing a paper that was never published, or describing an API incorrectly are all hallucinations.

“The model hallucinated a nonexistent library method — always verify LLM-generated code.”

System Prompt

A system prompt is a special instruction given to the LLM before the user’s input, setting its behaviour, role, or constraints. Usually not visible to end users.

Fine-Tuning

Fine-tuning is the process of continuing to train a pre-trained model on a smaller, specific dataset to specialise its behaviour. A general LLM can be fine-tuned on medical records, legal texts, or company-specific data.

RAG (Retrieval-Augmented Generation)

RAG is a technique that enhances LLM responses by retrieving relevant documents from a knowledge base and including them in the prompt. Instead of relying on the model’s trained memory, RAG retrieves fresh, relevant context before generating a response.

“We use RAG so the chatbot can answer questions about our internal documentation — it searches the latest docs before generating an answer.”
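The retrieve-then-generate flow can be sketched end to end. Real systems rank documents by embedding similarity; this toy version scores by word overlap so it stays self-contained, and the documents are invented:

```python
# Minimal RAG sketch: score documents against the question (real systems
# use embedding similarity; word overlap stands in for it here), then
# put the best matches into the prompt sent to the LLM.
docs = [
    "Deploys run automatically when a tag is pushed to main.",
    "The on-call rotation is defined in the team handbook.",
    "API keys are rotated every 90 days by the platform team.",
]

def score(question: str, doc: str) -> int:
    """Count shared words between question and document."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k highest-scoring documents."""
    return sorted(docs, key=lambda d: score(question, d), reverse=True)[:k]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_rag_prompt("How often are API keys rotated?"))
```

The key idea is unchanged at any scale: fetch relevant context first, then ask the model to answer from that context rather than from its trained memory.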


Embedding

An embedding is a numerical representation of text (or images, or other data) as a vector of floating-point numbers. Semantically similar texts have vectors that are close together in mathematical space. LLMs and search systems use embeddings to understand meaning, not just keywords.

“We embed user queries and search for the most similar document embeddings.”
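“Close together in mathematical space” is usually measured with cosine similarity. A sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions and come from a trained model):

```python
import math

# Toy embeddings: values invented to show that related phrases
# end up with similar vectors.
embeddings = {
    "reset my password":    [0.9, 0.1, 0.0],
    "forgot login details": [0.8, 0.2, 0.1],
    "cancel subscription":  [0.1, 0.9, 0.3],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_related = cosine(embeddings["reset my password"], embeddings["forgot login details"])
sim_unrelated = cosine(embeddings["reset my password"], embeddings["cancel subscription"])
print(sim_related > sim_unrelated)  # True: related phrases are closer
```

Note that the two password-related phrases share no keywords — their similarity comes entirely from the vectors, which is what makes embeddings useful beyond keyword search.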

Vector Database

A vector database is a database optimised for storing and searching embeddings. It can find vectors that are most similar to a query vector — enabling semantic search. Examples: Pinecone, Weaviate, Qdrant, pgvector (PostgreSQL extension).

Semantic Search

Semantic search finds results based on meaning, not just keyword matching. It uses embeddings to find documents that are conceptually similar to the query, even if they do not share exact words.
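At its core, a vector database answers one query: given a query vector, return the k most similar stored vectors. A brute-force sketch with invented 2-dimensional vectors — real databases use approximate nearest-neighbour indexes to do this efficiently over millions of vectors:

```python
# What a vector database does, brute force: rank stored vectors by
# similarity to a query vector and return the top k document IDs.
store = {
    "doc-pricing": [0.1, 0.9],
    "doc-login":   [0.9, 0.2],
    "doc-billing": [0.2, 0.8],
}

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec: list[float], k: int = 2) -> list[str]:
    ranked = sorted(store.items(), key=lambda kv: dot(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

print(top_k([0.15, 0.85]))  # pricing and billing docs rank above the login doc
```

Swapping this linear scan for an index (HNSW, IVF, and similar structures) is essentially what distinguishes a vector database from a list of vectors.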


Model Training Concepts

Training Data

Training data is the dataset used to train a machine learning model. The quality and size of training data heavily influence model quality.

Overfitting

Overfitting occurs when a model learns the training data too specifically — including its noise — and performs poorly on new, unseen data.

Underfitting

Underfitting means the model has not learned enough from the training data and performs poorly even on training examples.

Inference

Inference is the process of using a trained model to make predictions on new data. Training is done once (or periodically); inference happens every time a user makes a request.

Ground Truth

Ground truth is the verified correct answer used to evaluate model predictions. In supervised learning, training data includes ground truth labels.

Label / Annotation

A label (or annotation) is the correct output associated with a training example. In image classification, a label might be “cat” or “dog.” Humans often annotate training data manually.

Supervised vs. Unsupervised Learning

  • Supervised learning: the model learns from labelled examples (input → known output)
  • Unsupervised learning: the model finds patterns in unlabelled data (e.g., clustering)
  • Reinforcement learning: the model learns by receiving rewards or penalties for its actions

Evaluation & Performance

Accuracy

Accuracy = the proportion of correct predictions out of all predictions. But accuracy alone is misleading when classes are imbalanced (e.g., 99% of emails are not spam).

Precision and Recall

  • Precision = of all predicted positives, how many were actually positive? (Avoids false positives)
  • Recall = of all actual positives, how many did the model find? (Avoids false negatives)

F1 Score

The F1 score is the harmonic mean of precision and recall — a single number that balances both.
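The four metrics above can be computed directly from a confusion count. A sketch on an invented, deliberately imbalanced dataset (1 = spam, 0 = not spam), which also shows why accuracy alone misleads:

```python
# Accuracy, precision, recall, and F1 from predictions on an
# imbalanced toy dataset (8 negatives, 2 positives).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)  # 0.8 0.5 0.5 0.5
```

Note that a baseline predicting “not spam” for everything would also score 0.8 accuracy here, but with a recall of 0 — which is exactly the imbalance problem precision, recall, and F1 are designed to expose.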

Benchmark

A benchmark is a standardised test used to measure model performance. LLM benchmarks: MMLU, HumanEval (coding), HellaSwag, BIG-bench.


AI Infrastructure & Tools

GPU / TPU

  • GPU (Graphics Processing Unit) — originally for rendering, now the standard hardware for training and running neural networks
  • TPU (Tensor Processing Unit) — Google’s custom chip, optimised specifically for machine learning

Model Weights

Model weights (or parameters) are the numerical values learned during training. When someone says “a 7B model,” they mean a model with 7 billion parameters.

Checkpoint

A checkpoint is a saved snapshot of model weights during training. Used to resume training and to evaluate model quality at different training stages.

Pipeline

In ML, a pipeline is a sequence of data processing steps — preprocessing, transformation, model inference, and post-processing — usually automated.

MLOps

MLOps (Machine Learning Operations) applies DevOps practices to machine learning: versioning models, automating training and deployment, monitoring model performance in production.


Practical Terms for Engineers Using LLM APIs

  • Temperature — controls randomness (0 = deterministic, 1+ = creative)
  • Top-p (nucleus sampling) — alternative to temperature for controlling output diversity
  • Max tokens — the maximum output length
  • Stop sequence — a string that tells the model to stop generating
  • Few-shot — providing examples in the prompt
  • Zero-shot — no examples, just the instruction
  • Chain-of-thought — prompting the model to reason step by step
  • Streaming — receiving output token by token as it is generated
  • API rate limit — max requests per minute/hour/day from the model provider
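Several of these terms appear together in a typical request payload. The sketch below mirrors common conventions for chat-style LLM APIs; field names and the model name are illustrative, not any one vendor's exact schema:

```python
import json

# Illustrative chat-completion request payload. Field names vary by
# provider; check your provider's API reference for the exact schema.
payload = {
    "model": "example-model",  # hypothetical model name
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise this changelog."},
    ],
    "temperature": 0.2,   # low randomness for factual tasks
    "max_tokens": 500,    # cap on output length
    "stop": ["\n\n"],     # stop generating at a blank line
    "stream": True,       # receive tokens as they are generated
}
print(json.dumps(payload, indent=2))
```

Lower temperature for extraction and summarisation, higher for brainstorming, and a sensible max_tokens cap are the usual starting points before any finer tuning.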