Getting Started with Apache Kafka

Introduction

Apache Kafka® has become a foundational technology for modern data systems. Whether you're building data pipelines for analytics, connecting microservices, or moving data across systems, Kafka is likely to be a core part of your infrastructure.

This tutorial introduces you to the core concepts of Apache Kafka, from understanding events to grasping how Kafka stores, partitions, and replicates data across a distributed cluster.

What You'll Learn

  • Understanding events vs. things
  • How Kafka topics work as immutable logs
  • Partitioning for scalability
  • Brokers and cluster architecture
  • Replication for fault tolerance

Prerequisites

  • Basic understanding of distributed systems concepts
  • Familiarity with command-line interfaces
  • No prior Kafka experience required

Understanding Events: The Foundation of Kafka

From Things to Events

Mental Model Shift

When you think about data, you probably picture tables—representations of things in the world: items in inventory, IoT devices, user accounts, and so on.

Kafka encourages you to shift your thinking from **things** to **events**—moments in time when something happens.

Examples of Events:

  • An item getting sold
  • A driver using their turn signal
  • A user clicking on your site
  • A temperature sensor reporting a reading

```mermaid
graph LR
    A[Traditional Database] --> B[Tables store THINGS]
    B --> C[Update rows when things change]

    D[Apache Kafka] --> E[Topics store EVENTS]
    E --> F[Append events as they happen]
```

Real-Time Processing

Real-Time Focus

Events happen at a specific point in time, and Kafka is focused on processing them in real time. This means:

- ✅ Process events as they occur
- ✅ Don't wait for a later batch job to process them
- ✅ React immediately to what's happening now

However, Kafka **can remember events**—it stores them for future replay and analysis.

Example: Smart Thermostat

Consider a network of smart thermostats in homes. In a traditional database:

Traditional Approach (Tables):

| sensor_id | location | temperature | timestamp        |
|-----------|----------|-------------|------------------|
| 42        | kitchen  | 22°C        | 2024-12-05 10:00 |
| 42        | kitchen  | 24°C        | 2024-12-05 10:15 |

Problem: Lost Context

When the temperature updates, we lose context—we don't remember what the temperature used to be.

Kafka Approach (Events Log):

```json
{"sensor_id": 42, "location": "kitchen", "temperature": 22, "timestamp": "2024-12-05T10:00:00Z"}  // (1)!
{"sensor_id": 42, "location": "kitchen", "temperature": 24, "timestamp": "2024-12-05T10:15:00Z"}  // (2)!
```

1. First reading at 10:00 - retained in the log
2. Second reading at 10:15 - appended after it
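The contrast can be sketched in a few lines of plain Python (a toy illustration, not Kafka client code): the table overwrites state on each update, while the log appends it.

```python
# Toy comparison: table-style "update in place" vs. Kafka-style append-only log.
table = {}   # sensor_id -> latest reading only
log = []     # every reading, in arrival order

readings = [
    {"sensor_id": 42, "location": "kitchen", "temperature": 22},
    {"sensor_id": 42, "location": "kitchen", "temperature": 24},
]

for event in readings:
    table[event["sensor_id"]] = event  # overwrite: previous value is lost
    log.append(event)                  # append: full history is preserved

print(len(table))  # 1 -- only the latest state survives
print(len(log))    # 2 -- both readings available for replay
```

The table can answer "what is the temperature now?", but only the log can answer "how did it get there?".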

Benefits of Event Logs

We preserve the entire history of temperature changes, enabling insights like:

- How quickly does the kitchen heat up?
- What time of day does it heat up?
- Historical trends and patterns

Kafka Topics: Immutable Event Logs

What is a Topic?

In Kafka, a topic is where we store messages. Unlike traditional database tables, topics use logs—ordered sequences of immutable records.

```mermaid
graph TD
    A[Database Table] -->|Update| B[Row changes, history lost]

    C[Kafka Topic] -->|Append| D[New event added to end]
    D --> E[Previous events remain unchanged]
```

Key Characteristics of Topics

!!! note "Core Properties"
    1. **Immutable Messages**: Once written, messages cannot be changed
    2. **Append-Only**: New messages are added to the end
    3. **Ordered Sequence**: Messages maintain their order within the log
    4. **Replayable**: Messages can be read multiple times

Topic Structure Example

```mermaid
graph LR
    subgraph "Topic: thermostat_readings"
    A[Message 0<br/>sensor:42<br/>temp:22°C] --> B[Message 1<br/>sensor:43<br/>temp:20°C]
    B --> C[Message 2<br/>sensor:42<br/>temp:24°C]
    C --> D[Message 3<br/>sensor:44<br/>temp:21°C]
    end
```

Message Anatomy

Every message in a Kafka topic consists of:

```mermaid
graph TB
    subgraph Message
    A[Key<br/>Identifier: sensor_id, user_id]
    B[Value<br/>Actual data payload]
    C[Timestamp<br/>When event occurred]
    D[Headers<br/>Optional metadata]
    E[Offset<br/>Position in partition]
    end
```

Components:

  • Key (optional): Identifier that the event relates to (e.g., sensor_id: 42); messages with the same key are assigned to the same partition
  • Value (required): The actual event data, in any serialized format: JSON, Avro, Protocol Buffers, plain text, etc.
  • Timestamp: When the event occurred; can be producer-set (event time) or broker-set (ingestion time)
  • Headers (optional): Key-value pairs of lightweight metadata, useful for tracing, routing, or adding context without modifying the value
  • Offset: Sequential ID starting at 0 and incrementing by 1 for each message; unique within a partition and used by consumers to track their reading position

Example Message

```json
{
  "key": 42,
  "value": {
    "sensor_id": 42,
    "location": "kitchen",
    "temperature": 22,
    "unit": "celsius",
    "read_at": "2024-12-05T10:00:00Z"
  },
  "timestamp": 1733396400000,
  "headers": {
    "source": "iot-gateway-01",
    "version": "1.0"
  },
  "offset": 1234567
}
```
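Note that brokers only ever see bytes: the producer serializes the key and value before sending. A minimal Python sketch of that step, assuming a JSON-encoded value like the example above (real deployments often use Avro or Protobuf instead):

```python
import json

def serialize(key: int, value: dict) -> tuple[bytes, bytes]:
    """Turn a key and value into the byte payloads a producer would send."""
    key_bytes = str(key).encode("utf-8")
    value_bytes = json.dumps(value).encode("utf-8")
    return key_bytes, value_bytes

key_bytes, value_bytes = serialize(42, {
    "sensor_id": 42,
    "location": "kitchen",
    "temperature": 22,
    "unit": "celsius",
    "read_at": "2024-12-05T10:00:00Z",
})

print(key_bytes)  # b'42'
print(json.loads(value_bytes)["temperature"])  # 22
```

A consumer performs the mirror-image deserialization when it reads the message back.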

Topics Are Not Queues

Critical Distinction

**Queue**: When you read a message, it's gone—nobody else can read it

**Kafka Topic**: Messages remain available for multiple consumers to read independently

```mermaid
graph TB
    subgraph "Queue Behavior"
    Q1[Message 1] --> C1[Consumer reads]
    C1 --> X1[Message deleted]
    end

    subgraph "Kafka Topic Behavior"
    M1[Message 1] --> CR1[Consumer A reads]
    M1 --> CR2[Consumer B reads]
    M1 --> CR3[Consumer C reads]
    M1 -.-> S[Message remains]
    end
```

This enables:

!!! success "Key Advantages"
    - **Replayability**: Read messages again
    - **Multiple consumers**: Different applications can process the same events
    - **Fault tolerance**: If a consumer fails, it can resume from where it left off
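These advantages follow from one design choice: each consumer tracks its own offset into the log, and reading never deletes anything. A toy illustration in plain Python (no Kafka client involved):

```python
# Toy topic: a list of messages; each consumer keeps its own read position.
topic = ["msg-0", "msg-1", "msg-2", "msg-3"]

class ToyConsumer:
    def __init__(self):
        self.offset = 0  # committed position, starts at the beginning

    def poll(self):
        """Read the next message without removing it from the topic."""
        if self.offset >= len(topic):
            return None
        msg = topic[self.offset]
        self.offset += 1  # "commit" the new position
        return msg

a, b = ToyConsumer(), ToyConsumer()
a.poll(); a.poll()   # Consumer A has read two messages
print(b.poll())      # msg-0 -- B starts from 0, unaffected by A
print(len(topic))    # 4 -- reading deleted nothing
```

If consumer A crashed here, a replacement starting from A's committed offset (2) would resume exactly where A left off.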

Transforming Immutable Data

Since messages are immutable, how do we transform data?

Answer: Create new topics with transformed data.

```mermaid
graph LR
    A[Topic: thermostat_readings<br/>All sensor data] -->|Filter| B[Topic: hot_locations<br/>Temperature > 25°C only]
```

Example transformation using stream processing:

```sql
-- Filter to only hot readings
INSERT INTO hot_locations
SELECT * FROM thermostat_readings
WHERE temperature > 25;
```

Log Compaction and Retention

Kafka offers flexible data management:

Retention Policies:

```mermaid
graph TB
    A[Retention Strategies]
    A --> B[Time-based<br/>Keep for 7 days]
    A --> C[Size-based<br/>Keep latest 1GB]
    A --> D[Compaction<br/>Keep latest value per key]
    A --> E[Forever<br/>Keep all messages]
```

Log Compaction is particularly useful when you only care about the latest state:

```mermaid
graph LR
    subgraph "Before Compaction"
    A1[Key:42, Temp:20] --> A2[Key:43, Temp:22] --> A3[Key:42, Temp:24] --> A4[Key:43, Temp:21]
    end

    subgraph "After Compaction"
    B1[Key:42, Temp:24] --> B2[Key:43, Temp:21]
    end
```
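The "keep latest value per key" behavior can be sketched in a few lines of Python. This is a simplification: a real broker compacts log segments incrementally in the background rather than in one pass, and recent messages may not be compacted yet.

```python
def compact(log):
    """Keep only the newest message for each key, preserving log order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later messages overwrite earlier ones
    # Re-emit the surviving messages in their original order
    return [(key, value) for key, (offset, value) in
            sorted(latest.items(), key=lambda kv: kv[1][0])]

before = [(42, 20), (43, 22), (42, 24), (43, 21)]
print(compact(before))  # [(42, 24), (43, 21)]
```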

Partitions: Scaling Kafka Horizontally

Why Partition?

Single Partition Limitations

If a topic were stored entirely on one machine, it would be limited by:

- ❌ That machine's storage capacity
- ❌ That machine's processing power
- ❌ That machine's network bandwidth

Partitioning solves this by splitting a topic into multiple logs distributed across different machines.

```mermaid
graph TB
    subgraph "Single Partition - Limited"
    A[Topic on one machine] --> B[❌ Storage limit]
    A --> C[❌ Processing limit]
    A --> D[❌ Bandwidth limit]
    end

    subgraph "Multiple Partitions - Scalable"
    E[Topic split across machines] --> F[✅ Distributed storage]
    E --> G[✅ Parallel processing]
    E --> H[✅ High throughput]
    end
```

Partition Structure

```mermaid
graph TB
    subgraph "Topic: thermostat_readings"
    T[Topic] --> P0[Partition 0]
    T --> P1[Partition 1]
    T --> P2[Partition 2]

    P0 --> M00[Offset 0: sensor:42]
    P0 --> M01[Offset 1: sensor:45]

    P1 --> M10[Offset 0: sensor:43]
    P1 --> M11[Offset 1: sensor:46]

    P2 --> M20[Offset 0: sensor:44]
    P2 --> M21[Offset 1: sensor:47]
    end
```

Key Points:

!!! info "Partition Characteristics"
    - Each partition is an ordered, immutable log
    - Partitions are independent of each other
    - A topic can have hundreds or thousands of partitions
    - Kafka supports up to 2 million partitions with KRaft

Message Ordering

```mermaid
graph TB
    A[Message Ordering]
    A --> B[Within Partition<br/>✅ Strict ordering guaranteed]
    A --> C[Across Partitions<br/>❌ No global ordering]
```

Within a partition: Messages are read in the exact sequence they were written.

Across partitions: No ordering guarantee between partitions.

Design Consideration

If order matters for your use case, ensure related messages share the same key—they'll go to the same partition and maintain order.

Partition Assignment Strategies

1. With Message Key (Hash-Based)

```mermaid
graph LR
    M1[Message<br/>Key: 42] --> H1[Hash function] --> MOD1[hash % 3 = 0] --> P0[Partition 0]
    M2[Message<br/>Key: 43] --> H2[Hash function] --> MOD2[hash % 3 = 1] --> P1[Partition 1]
    M3[Message<br/>Key: 44] --> H3[Hash function] --> MOD3[hash % 3 = 2] --> P2[Partition 2]
    M4[Message<br/>Key: 42] --> H4[Hash function] --> MOD4[hash % 3 = 0] --> P0
```

Process:

  1. Hash the message key
  2. Apply modulo with number of partitions: hash(key) % num_partitions
  3. Route message to the resulting partition

Result: All messages with the same key go to the same partition, preserving order for that key.

Practical Example

  • All messages with sensor_id=42 → Partition 0
  • All messages with sensor_id=43 → Partition 1

This ensures all temperature readings from sensor 42 are processed in chronological order.
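A Python sketch of the keyed strategy. Kafka's default partitioner actually hashes the serialized key with Murmur2; the built-in `hash()` here is a stand-in that happens to reproduce the assignments in the diagram for these integer keys.

```python
NUM_PARTITIONS = 3

def partition_for(key: int) -> int:
    # Kafka's default partitioner uses Murmur2 on the serialized key;
    # Python's hash() is a simplified stand-in to show the principle.
    return hash(key) % NUM_PARTITIONS

for sensor_id in (42, 43, 44, 42):
    print(sensor_id, "->", partition_for(sensor_id))
# 42 -> 0, 43 -> 1, 44 -> 2, 42 -> 0  (same key, same partition)
```

Because the function is deterministic, sensor 42's readings always land on the same partition and therefore stay in order.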

2. Without Message Key (Round-Robin)

```mermaid
graph LR
    M1[Message 1<br/>No key] --> P0[Partition 0]
    M2[Message 2<br/>No key] --> P1[Partition 1]
    M3[Message 3<br/>No key] --> P2[Partition 2]
    M4[Message 4<br/>No key] --> P0[Partition 0]
    M5[Message 5<br/>No key] --> P1[Partition 1]
```

Process: Messages are distributed evenly across partitions in a round-robin fashion.

Result:

  • ✅ Benefit: Even load distribution across partitions
  • ❌ Trade-off: No ordering guarantee (messages from the same source can go to different partitions)
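Round-robin assignment is just a counter modulo the partition count, as this sketch shows. (Newer Kafka clients actually default to a "sticky" partitioner for keyless messages, filling one batch per partition before moving on, but the load-balancing idea is the same.)

```python
from itertools import count

NUM_PARTITIONS = 3
counter = count()  # one shared counter per producer

def next_partition() -> int:
    """Assign keyless messages to partitions in turn."""
    return next(counter) % NUM_PARTITIONS

print([next_partition() for _ in range(5)])  # [0, 1, 2, 0, 1]
```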

Choosing Partition Count

Partition Count Guidelines

Consider these factors:

| Factor | Consideration |
|--------|--------------|
| **Throughput** | More partitions = higher throughput |
| **Parallelism** | Max consumers = number of partitions |
| **Latency** | Too many partitions can increase latency |
| **Storage** | Each partition consumes disk and memory |

**Rule of thumb:** Start with the number of brokers, then adjust based on workload.

Brokers: The Kafka Cluster Infrastructure

What is a Broker?

A broker is a Kafka server that:

  • Stores partition data
  • Handles read and write requests
  • Manages replication
  • Coordinates with other brokers

```mermaid
graph TB
    subgraph "Kafka Broker"
    B[Broker Process<br/>JVM Application]
    B --> S[Local Storage<br/>Partition data]
    B --> N[Network Handler<br/>Client requests]
    B --> R[Replication Manager<br/>Data sync]
    end
```

Broker Deployment Options

Deployment Flexibility

Brokers can run on:

- 🖥️ Physical servers (bare metal)
- ☁️ Cloud instances (AWS EC2, Azure VMs, GCP Compute)
- 🐳 Containers (Docker, Kubernetes)
- 🔌 IoT devices (Raspberry Pi, edge computing)

**Typical setup:** Brokers have access to **local SSD storage** for high-performance I/O.

Kafka Cluster Architecture

Multiple brokers form a Kafka cluster:

```mermaid
graph TB
    subgraph "Kafka Cluster"
    B1[Broker 1<br/>ID: 1] --> P0[Partition 0]
    B1 --> P3[Partition 3]

    B2[Broker 2<br/>ID: 2] --> P1[Partition 1]
    B2 --> P4[Partition 4]

    B3[Broker 3<br/>ID: 3] --> P2[Partition 2]
    B3 --> P5[Partition 5]
    end

    C1[Producer] --> B1
    C1 --> B2
    C1 --> B3

    B1 --> CS1[Consumer]
    B2 --> CS1
    B3 --> CS1
```

Topic Distribution Across Brokers

Example with 2 topics:

```mermaid
graph TB
    subgraph "Cluster with 3 Brokers"
    B1[Broker 1]
    B2[Broker 2]
    B3[Broker 3]
    end

    subgraph "Topic A - 3 Partitions"
    B1 --> TA0[Partition 0]
    B2 --> TA1[Partition 1]
    B3 --> TA2[Partition 2]
    end

    subgraph "Topic B - 2 Partitions"
    B1 --> TB0[Partition 0]
    B2 --> TB1[Partition 1]
    end
```

Note: Topics can have different numbers of partitions based on their scaling needs.

Broker Responsibilities

  1. Handle Client Requests

```mermaid
graph LR
P[Producer] -->|Write Request| B[Broker]
B -->|Acknowledgment| P
C[Consumer] -->|Read Request| B
B -->|Messages| C
```

  2. Manage Storage

     - Write messages to partition logs
     - Maintain indexes for fast lookups
     - Clean up old messages based on retention policies

  3. Coordinate Replication

     - Sync data between replica brokers
     - Manage leader elections
     - Ensure data consistency

Metadata Management: KRaft

Modern Kafka (4.0+) uses KRaft (Kafka Raft) for metadata management:

```mermaid
graph TB
    subgraph "Legacy Architecture (< 4.0)"
    K1[Kafka Brokers] <--> Z[Apache ZooKeeper<br/>External dependency]
    end

    subgraph "Modern Architecture (4.0+)"
    K2[Kafka Brokers<br/>with KRaft] --> M[Built-in Metadata<br/>No external dependency]
    end
```

KRaft Benefits:

!!! success "Why KRaft Matters"
    - ✅ No external ZooKeeper dependency
    - ✅ Simplified operations
    - ✅ Faster metadata updates
    - ✅ Better scalability (supports 2M+ partitions)
    - ✅ Built on the Raft consensus protocol

Breaking Change

As of Kafka 4.0, ZooKeeper is no longer used or supported.


Replication: Ensuring Fault Tolerance

Why Replication?

Hardware Failure is Inevitable

Disk and server failures will happen. Replication protects against data loss by maintaining multiple copies of each partition.

```mermaid
graph LR
    A[Without Replication] --> B[Broker fails] --> C[❌ Data lost forever]

    D[With Replication] --> E[Broker fails] --> F[✅ Data available from replicas]
```

Replication Factor

The replication factor defines how many copies of each partition to maintain.

Example: Replication Factor = 3

```mermaid
graph TB
    subgraph "Partition 0 - Replicated 3x"
    L[Leader Replica<br/>Broker 1]
    F1[Follower Replica<br/>Broker 2]
    F2[Follower Replica<br/>Broker 3]
    end

    L -.->|Replicates to| F1
    L -.->|Replicates to| F2
```

Leader and Followers

For each partition replica set:

  • 1 Leader: Handles all reads and writes
  • n-1 Followers: Sync data from the leader

```mermaid
sequenceDiagram
    participant P as Producer
    participant L as Leader (Broker 1)
    participant F1 as Follower (Broker 2)
    participant F2 as Follower (Broker 3)

    P->>L: Write message
    L->>F1: Replicate message
    L->>F2: Replicate message
    F1->>L: Ack
    F2->>L: Ack
    L->>P: Write successful
```
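The sequence can be mimicked with plain Python objects: the leader acknowledges the producer only once every follower has a copy, which is roughly what the producer's acks=all setting requests. (In a real cluster, followers fetch from the leader asynchronously and the leader tracks an in-sync replica set.)

```python
class Replica:
    """A broker's copy of one partition, reduced to a log of messages."""
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []

leader = Replica(1)
followers = [Replica(2), Replica(3)]

def write(message) -> bool:
    """Append to the leader, replicate to followers, then acknowledge."""
    leader.log.append(message)
    acks = 0
    for f in followers:
        f.log.append(message)  # in reality, followers fetch from the leader
        acks += 1
    return acks == len(followers)  # ack the producer only when all copies exist

print(write({"sensor_id": 42, "temperature": 22}))  # True
print(len(followers[0].log))  # 1
```

Because every replica holds the message before the producer is acknowledged, any follower can take over as leader without losing it.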

Leader Election and Failover

When a broker fails, Kafka automatically elects a new leader:

```mermaid
graph TB
    subgraph "Normal Operation"
    L1[Leader<br/>Broker 1] --> F1[Follower<br/>Broker 2]
    L1 --> F2[Follower<br/>Broker 3]
    end

    subgraph "Broker 1 Fails"
    X[❌ Broker 1<br/>Down]
    F1B[Follower<br/>Broker 2]
    F2B[Follower<br/>Broker 3]
    end

    subgraph "After Leader Election"
    L2[New Leader<br/>Broker 2] --> F3[Follower<br/>Broker 3]
    L2 --> F4[New Follower<br/>Broker 4<br/>Replacing Broker 1]
    end
```

Process:

  1. Leader (Broker 1) fails
  2. Remaining followers detect failure
  3. One follower (Broker 2) is elected as new leader
  4. Clients automatically connect to new leader
  5. Cluster creates new replica on another broker to restore replication factor

Read and Write Patterns

Default Behavior

```mermaid
graph TB
    P[Producer] -->|Writes| L[Leader Replica]
    L -->|Reads| C[Consumer]

    L -.->|Replicates| F1[Follower<br/>Broker 2]
    L -.->|Replicates| F2[Follower<br/>Broker 3]
```

  • Writes: Always go to the leader
  • Reads: By default, read from the leader

Follower Reads (Optional)

For improved latency, consumers can read from the nearest replica:

```mermaid
graph TB
    P[Producer<br/>US East] -->|Write| L[Leader<br/>US East]

    L -.->|Replicate| F1[Follower<br/>EU West]
    L -.->|Replicate| F2[Follower<br/>Asia]

    L -->|Read| C1[Consumer<br/>US East]
    F1 -->|Read| C2[Consumer<br/>EU West]
    F2 -->|Read| C3[Consumer<br/>Asia]
```

Benefits:

!!! success "Follower Read Advantages"
    - ✅ Lower latency for geographically distributed consumers
    - ✅ Reduced load on the leader broker

Trade-off

⚠️ Potential for slightly stale data (eventual consistency)

Replication Guarantees

Durability & Fault Tolerance

Kafka provides strong durability guarantees:

| Guarantee | Description |
|-----------|-------------|
| **At least once** | Messages are never lost, though consumers may see duplicates |
| **Ordering** | Maintained within each partition |
| **Durability** | Configurable via `acks` parameter |
| **Fault tolerance** | Survives f broker failures with replication factor f+1 |

**Example:**

- Replication factor = 3
- Can tolerate 2 broker failures
- Data remains available and consistent

Putting It All Together

Complete Kafka Architecture

```mermaid
graph TB
    subgraph "Kafka Cluster"
    B1[Broker 1<br/>with KRaft]
    B2[Broker 2<br/>with KRaft]
    B3[Broker 3<br/>with KRaft]

    B1 <--> B2
    B2 <--> B3
    B3 <--> B1
    end

    subgraph "Topic: thermostat_readings"
    P0L[Partition 0<br/>Leader on B1]
    P0F1[Partition 0<br/>Follower on B2]
    P0F2[Partition 0<br/>Follower on B3]

    P1L[Partition 1<br/>Leader on B2]
    P1F1[Partition 1<br/>Follower on B1]
    P1F2[Partition 1<br/>Follower on B3]
    end

    B1 --> P0L
    B2 --> P0F1
    B3 --> P0F2

    B2 --> P1L
    B1 --> P1F1
    B3 --> P1F2

    PROD[IoT Gateway<br/>Producer] -->|Write events| B1
    PROD -->|Write events| B2

    B1 -->|Read events| CONS[Analytics App<br/>Consumer]
    B2 -->|Read events| CONS
```

Data Flow Summary

```mermaid
sequenceDiagram
    participant IoT as IoT Device
    participant P as Producer
    participant B1 as Broker 1 (Leader)
    participant B2 as Broker 2 (Follower)
    participant B3 as Broker 3 (Follower)
    participant C as Consumer

    IoT->>P: Temperature reading: 22°C
    P->>P: Serialize to bytes
    P->>P: Hash key (sensor_id=42)
    P->>P: Select partition (hash % 3 = 0)
    P->>B1: Write to Partition 0
    B1->>B2: Replicate message
    B1->>B3: Replicate message
    B2->>B1: Ack replication
    B3->>B1: Ack replication
    B1->>P: Ack write successful

    Note over C: Consumer polls for new messages
    C->>B1: Fetch from Partition 0
    B1->>C: Return messages
    C->>C: Deserialize and process
    C->>B1: Commit offset
```

Key Takeaways

!!! summary "Events vs Things"
    - ✅ Kafka focuses on events (things that happen) rather than things (objects)
    - ✅ Events capture when and what happened
    - ✅ Enables real-time processing and historical analysis

!!! summary "Topics"
    - ✅ Immutable, append-only logs of events
    - ✅ Messages are never modified, only added
    - ✅ Support multiple independent consumers
    - ✅ Configurable retention and compaction policies

!!! summary "Partitions"
    - ✅ Enable horizontal scalability
    - ✅ Distribute load across multiple brokers
    - ✅ Maintain strict ordering within each partition
    - ✅ Use hashing to route messages with the same key to the same partition

!!! summary "Brokers"
    - ✅ Server processes that store and serve data
    - ✅ Form a distributed cluster
    - ✅ Handle client read/write requests
    - ✅ Use KRaft for built-in metadata management (no ZooKeeper)

!!! summary "Replication"
    - ✅ Protects against data loss
    - ✅ Configurable replication factor (typically 3)
    - ✅ Leader handles writes, followers sync data
    - ✅ Automatic failover when brokers fail
    - ✅ Supports follower reads for lower latency


What's Next?

Continue Your Kafka Journey

Now that you understand the core concepts, you're ready to:

1. [:fontawesome-solid-paper-plane: **Produce Messages**](../how-to/kafka-produce-messages.md) - Learn how to write data to Kafka
2. [:fontawesome-solid-download: **Consume Messages**](../how-to/kafka-consume-messages.md) - Learn how to read data from Kafka
3. [:fontawesome-solid-diagram-project: **Use Schema Registry**](../how-to/kafka-use-schema-registry.md) - Manage data schemas
4. [:fontawesome-solid-link: **Connect External Systems**](../how-to/kafka-connect-systems.md) - Integrate Kafka with databases and other systems
5. [:fontawesome-solid-stream: **Process Streams**](../how-to/kafka-stream-processing.md) - Transform data in real-time

Additional Resources

!!! note "Further Learning"
    - Apache Kafka Official Documentation
    - Kafka Improvement Proposals (KIPs)
    - Confluent Developer Resources
    - Kafka: The Definitive Guide (Book)


Course Source

This tutorial is based on the Apache Kafka 101 course from Confluent.