Internal Working of Prometheus: How Metrics Flow from Ingestion to Storage

When we talk about observability in modern distributed systems, Prometheus is almost always at the center of the discussion. It's simple, open-source, and purpose-built for time series metrics. But while most people know how to query Prometheus using PromQL or visualize metrics on Grafana, very few take the time to understand what’s happening under the hood.

In this post, I’ll take you through the internal architecture of Prometheus metric ingestion and storage, covering the WAL (Write-Ahead Log), Head Block, Time Series Database (TSDB), and the immutable blocks that form the backbone of Prometheus persistence. By the end, you’ll have a clear mental model of how Prometheus transforms millions of incoming metrics into structured blocks that can be efficiently queried for dashboards, alerts, and analysis.


Why This Matters

Before diving into the internals, it’s worth asking: why should an SRE or DevOps engineer care about the internal working of Prometheus?

The answer is simple: performance, reliability, and scalability.

  • Understanding WAL and Head Block helps when tuning retention, storage, and durability.

  • Knowing how and when blocks are flushed to disk explains disk space usage and makes storage issues easier to debug.

  • Grasping the ingestion pipeline helps optimize scrape intervals, federation, and remote writes.

  • It also provides clarity when designing HA (High Availability) setups or long-term storage integrations (Thanos, Cortex, Mimir).


Prometheus Architecture (High-Level)

At a very high level, Prometheus follows a pull-based model for metric ingestion:

  1. Prometheus scrapes targets (applications, exporters, services) over HTTP endpoints (/metrics).

  2. The scraped data (text exposition format) is parsed into structured time series.

  3. These samples are first written into the Write-Ahead Log (WAL) for durability.

  4. Then they’re stored in memory in the Head Block for fast querying and aggregation.

  5. Periodically, the Head Block is compacted and flushed into immutable on-disk blocks.

  6. Queries run over a combination of Head Block + persisted blocks.

Here’s a simple flow diagram:

+-------------------+         +------------------+         +------------------+
|    Target / App   |  --->   |   Prometheus     |  --->   |  Grafana / Alert |
| (metrics exposed) |         |  (scrape + TSDB) |         |   (visualize)    |
+-------------------+         +------------------+         +------------------+
                                   |
                                   v
                      +----------------------------+
                      | Ingestion Pipeline         |
                      |                            |
                      |  WAL -> Head Block -> Disk |
                      +----------------------------+

This looks straightforward, but the devil is in the details. Let’s dive into those details.
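Steps 1 and 2 above (scraping a `/metrics` endpoint and parsing the text exposition format into structured series) can be sketched in a few lines. This is a toy parser over a hypothetical payload, not Prometheus's real parser, which handles histograms, escaping, and many more edge cases:

```python
import re

# Hypothetical scrape payload in the Prometheus text exposition format.
EXPOSITION = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",handler="/login"} 225
http_requests_total{method="POST",handler="/login"} 12
"""

# name, optional {labels}, numeric value
LINE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+([0-9.eE+-]+)$')

def parse_exposition(text):
    """Turn exposition text into (name, labels, value) samples."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        m = LINE_RE.match(line)
        if m:
            name, labels, value = m.groups()
            samples.append((name, labels or "", float(value)))
    return samples

samples = parse_exposition(EXPOSITION)
```

Each parsed sample then gets a scrape timestamp attached and enters the ingestion pipeline described below.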


Time Series in Prometheus

Prometheus internally organizes metrics as time series. A time series is uniquely identified by:

  • Metric name (e.g., http_requests_total)

  • Label set (e.g., {method="GET", handler="/login"})

Each time series has a stream of samples:

(timestamp, value)

Example:

http_requests_total{method="GET", handler="/login"} => 
  (1692548800, 201) 
  (1692548810, 212)
  (1692548820, 225)

So, when Prometheus scrapes data, it parses each metric line into these structured time series.
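Because the label set is part of the identity, label order must not matter. A minimal sketch of deriving a stable series key (Prometheus itself hashes the label set internally rather than building strings like this):

```python
def series_key(name, labels):
    """Build a canonical key: metric name plus sorted label pairs."""
    # Sorting ensures {a="1", b="2"} and {b="2", a="1"} are the same series.
    parts = [name] + [f'{k}="{v}"' for k, v in sorted(labels.items())]
    return "|".join(parts)

key = series_key("http_requests_total", {"method": "GET", "handler": "/login"})
# A different label value yields a different series:
other = series_key("http_requests_total", {"method": "POST", "handler": "/login"})
```

This is also why high-cardinality labels are dangerous: every distinct label combination creates a brand-new series with its own sample stream.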


Step 1: Ingestion and Write-Ahead Log (WAL)

When Prometheus scrapes data, the very first place it goes is the WAL (Write-Ahead Log).

Why? Because durability is critical. If Prometheus crashes or the node reboots, we don’t want to lose scraped metrics that were in memory but not yet flushed to disk. The WAL acts like a journal in databases.

  • The WAL lives in:

    <data-dir>/wal
    
    
  • Data is written in binary format for efficiency.

  • Each record contains samples, series, and metadata.

  • WAL files are segmented (~128 MB each).

So, whenever a new metric sample arrives:

  1. Prometheus appends it to the WAL.

  2. Then it’s available in the Head Block for queries.

If Prometheus restarts, it can replay the WAL to reconstruct the Head Block.

WAL Workflow Diagram

Scraped Metric ---> WAL Append ---> Head Block (in memory)
                       |
                       +--> Persisted on disk for crash recovery

Think of the WAL as the safety net.
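The append-then-replay pattern can be illustrated with a toy WAL. Prometheus's real WAL uses a segmented binary record format with checksums; JSON lines are used here purely for readability:

```python
import json
import os
import tempfile

class ToyWAL:
    """A toy append-only log illustrating WAL append and replay."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")

    def append(self, series, ts, value):
        # Append-only: records are never rewritten in place.
        self.f.write(json.dumps([series, ts, value]) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # make the record durable before acking

    def replay(self):
        # On restart, re-read every record to rebuild in-memory state.
        with open(self.path) as f:
            return [tuple(json.loads(line)) for line in f]

path = os.path.join(tempfile.mkdtemp(), "wal.jsonl")
wal = ToyWAL(path)
wal.append("http_requests_total", 1692548800, 201)
wal.append("http_requests_total", 1692548810, 212)
recovered = wal.replay()
```

The key property is that a record hits durable storage before the sample is considered ingested, so a crash after `append` loses nothing.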


Step 2: Head Block (In-Memory Storage)

The Head Block is the in-memory representation of the most recent time series data.

  • It holds recent samples (typically last 2 hours).

  • It’s very fast for queries and aggregations.

  • Data is organized into chunks of samples.

But here’s the catch:

  • Keeping everything in memory is expensive.

  • Memory is volatile — if Prometheus restarts, we’d lose it (hence WAL replay).

That’s why Prometheus periodically cuts and flushes the Head Block into a persisted on-disk block.

Head Block Flow

+---------------------------+
|        Head Block         |
| (recent samples in RAM)   |
|                           |
|  + Fast queries           |
|  + Chunk management       |
|  + Holds ~2h of data      |
+---------------------------+

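A sketch of a Head-Block-like structure: recent samples kept in RAM, grouped per series into chunks. Real head chunks are compressed with delta/XOR encoding and cut by time span, not by a fixed sample count as in this simplification:

```python
from collections import defaultdict

CHUNK_SIZE = 3  # toy limit; real chunks are cut by time span / byte size

class ToyHead:
    """In-memory store: series key -> list of chunks of (ts, value)."""

    def __init__(self):
        self.series = defaultdict(lambda: [[]])

    def append(self, key, ts, value):
        chunks = self.series[key]
        if len(chunks[-1]) >= CHUNK_SIZE:
            chunks.append([])  # cut a new chunk when the current one fills
        chunks[-1].append((ts, value))

    def query(self, key):
        # Flatten all chunks: reads are fast because everything is in RAM.
        return [s for chunk in self.series[key] for s in chunk]

head = ToyHead()
for i in range(5):
    head.append("http_requests_total", 1692548800 + 10 * i, 200 + i)
```

Chunking matters because it is the unit that gets compressed and later written into a block's `chunks/` directory.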
Step 3: Compaction and On-Disk Blocks

Every 2 hours (by default), Prometheus performs a process called compaction.

  • It takes the Head Block.

  • Compresses and writes it to disk as a new immutable block.

  • Clears the memory for fresh samples.

Each block is stored as its own directory under the data dir, alongside the WAL:

<data-dir>/
   ├── wal/              # write-ahead log
   ├── chunks_head/      # memory-mapped head chunks
   └── 01GZABCXYZ/       # example block (ULID-named directory)

A block directory contains:

  • chunks/ → raw metric samples.

  • index → an index file mapping labels → series → chunks.

  • meta.json → metadata about block (time range, compaction level).

  • tombstones → records of deleted series.

Block Diagram

<data-dir>/01FZ4XY123/
   ├── chunks/       # actual metric samples
   ├── index         # mapping of series to chunks
   ├── meta.json     # block metadata
   └── tombstones    # deletions

Blocks are immutable. Prometheus never modifies them — only creates new blocks during compaction. This ensures consistency and makes replication/federation easier.
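You can inspect blocks programmatically via their `meta.json`. The sketch below fabricates a block directory for illustration; the field names (`ulid`, `minTime`, `maxTime`, `compaction.level`) match what Prometheus writes, with times in milliseconds since the epoch:

```python
import json
import os
import tempfile

# Fabricate a data dir with one fake block for the example.
data_dir = tempfile.mkdtemp()
block_dir = os.path.join(data_dir, "01GZABCXYZ0000000000000000")
os.makedirs(os.path.join(block_dir, "chunks"))

meta = {
    "ulid": "01GZABCXYZ0000000000000000",
    "minTime": 1692540000000,  # ms since epoch
    "maxTime": 1692547200000,
    "compaction": {"level": 1},
}
with open(os.path.join(block_dir, "meta.json"), "w") as f:
    json.dump(meta, f)

# Enumerate blocks the way a retention or compaction pass might:
for name in sorted(os.listdir(data_dir)):
    with open(os.path.join(data_dir, name, "meta.json")) as f:
        m = json.load(f)
    span_h = (m["maxTime"] - m["minTime"]) / 3_600_000  # span in hours
```

A level-1 block like this one came straight from the Head Block; higher compaction levels indicate blocks produced by merging smaller blocks.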


Step 4: Block Compaction & Retention

Prometheus doesn’t just write raw blocks — it also compacts old blocks into larger ones.

  • Small 2-hour blocks are merged into progressively larger blocks (by default up to 10% of the retention window, capped at 31 days).

  • This reduces the number of blocks on disk, improving query performance.

Prometheus also enforces retention policies:

  • By default, 15 days.

  • Configurable with --storage.tsdb.retention.time=30d or similar.

Expired blocks are deleted to free space.
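The retention check itself is simple: a block is deletable once its newest sample falls outside the retention window. A sketch, using millisecond timestamps as in `meta.json`:

```python
# Default matches --storage.tsdb.retention.time=15d, expressed in ms.
RETENTION_MS = 15 * 24 * 3600 * 1000

def expired(block_max_time_ms, now_ms, retention_ms=RETENTION_MS):
    """True if the whole block is older than the retention window."""
    return block_max_time_ms < now_ms - retention_ms

now = 1_700_000_000_000
old_block = now - 20 * 24 * 3600 * 1000   # maxTime 20 days ago
fresh_block = now - 2 * 3600 * 1000       # maxTime 2 hours ago
```

Note that deletion operates on whole immutable blocks, which is why disk usage drops in steps rather than continuously.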


Querying Path

When you run a PromQL query, Prometheus merges results from:

  1. Head Block (in-memory recent samples)

  2. Immutable on-disk blocks

This hybrid approach balances speed (RAM) and durability (disk).
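Conceptually, the query engine merges time-sorted samples from both sources and keeps those inside the query window. A toy sketch for a single series (the real engine works per series over many blocks, with overlap handling):

```python
import bisect

def query_range(block_samples, head_samples, start, end):
    """Merge persisted and in-memory samples, keep those in [start, end]."""
    merged = sorted(block_samples + head_samples)  # (timestamp, value) pairs
    lo = bisect.bisect_left(merged, (start,))
    hi = bisect.bisect_right(merged, (end, float("inf")))
    return merged[lo:hi]

blocks = [(100, 1.0), (110, 2.0)]   # older, persisted samples
head   = [(120, 3.0), (130, 4.0)]   # recent, in-memory samples
result = query_range(blocks, head, 105, 125)
```

Because blocks are immutable and non-overlapping in time, the engine can also skip any block whose `[minTime, maxTime]` range falls entirely outside the query window.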


End-to-End Flow Diagram

Here’s a consolidated diagram of the lifecycle:

              +-------------------+
              |   Scrape Target   |
              +-------------------+
                        |
                        v
              +-------------------+
              | Parse & Ingest    |
              +-------------------+
                        |
                        v
              +-------------------+
              |       WAL         |  <-- crash recovery log
              +-------------------+
                        |
                        v
              +-------------------+
              |    Head Block     |  <-- in-memory ~2h
              +-------------------+
                        |
                        v
              +-------------------+
              |   TSDB Blocks     |  <-- immutable, on disk
              |  (2h, 10h, etc)   |
              +-------------------+

Why WAL + Head + Blocks?

At this point, you might ask: why so many layers?

The answer is that Prometheus needs to balance:

  • Durability (WAL ensures no data loss on crash).

  • Speed (Head Block in memory for recent fast queries).

  • Scalability (Immutable blocks reduce locking/contention).

  • Efficiency (Compaction reduces query overhead).

It’s a classic example of database design trade-offs.

Simple Analogy

Think of Prometheus ingestion like writing a diary:

  • WAL → Quick notes on a sticky pad (temporary but ensures nothing is lost).

  • Head Block → Fresh entries in your open notebook (fast access to current info).

  • Immutable Blocks → Archived notebooks on your shelf (read-only, structured, indexed).

This layered system ensures you can always reconstruct history, even if your sticky notes or active notebook get lost.