Getting Started with Apache Kafka¶
Introduction¶
Apache Kafka® has become the universal foundation on which modern data systems are built. Whether you're building data pipelines for analytics, connecting microservices, or moving data across systems, Kafka is likely to be a foundational part of your infrastructure.
This tutorial introduces you to the core concepts of Apache Kafka, from understanding events to grasping how Kafka stores, partitions, and replicates data across a distributed cluster.
What You'll Learn¶
- Understanding events vs. things
- How Kafka topics work as immutable logs
- Partitioning for scalability
- Brokers and cluster architecture
- Replication for fault tolerance
Prerequisites¶
- Basic understanding of distributed systems concepts
- Familiarity with command-line interfaces
- No prior Kafka experience required
Understanding Events: The Foundation of Kafka¶
From Things to Events¶
Mental Model Shift
When you think about data, you probably picture tables: rows representing things in the world, such as items in inventory, IoT devices, or user accounts.
Kafka encourages you to shift your thinking from **things** to **events**—moments in time when something happens.
Examples of Events:
- An item getting sold
- A driver using their turn signal
- A user clicking on your site
- A temperature sensor reporting a reading
graph LR
A[Traditional Database] --> B[Tables store THINGS]
B --> C[Update rows when things change]
D[Apache Kafka] --> E[Topics store EVENTS]
E --> F[Append events as they happen]
Real-Time Processing¶
Real-Time Focus
Events happen at a specific point in time, and Kafka is focused on processing them in real time. This means:
- ✅ Process events as they occur, rather than accumulating them for a later batch job
- ✅ React immediately to what's happening now
That said, Kafka does remember events: it retains them in the log for future replay and analysis.
Example: Smart Thermostat¶
Consider a network of smart thermostats in homes. In a traditional database:
Traditional Approach (Tables):
| sensor_id | location | temperature | timestamp |
|---|---|---|---|
| 42 | kitchen | 22°C | 2024-12-05 10:00 |
| 42 | kitchen | 24°C | 2024-12-05 10:15 |
Problem: Lost Context
When the temperature updates, we lose context—we don't remember what the temperature used to be.
Kafka Approach (Events Log):
{"sensor_id": 42, "location": "kitchen", "temperature": 22, "timestamp": "2024-12-05T10:00:00Z"} // (1)!
{"sensor_id": 42, "location": "kitchen", "temperature": 24, "timestamp": "2024-12-05T10:15:00Z"} // (2)!
- First reading at 10:00 - preserved forever
- Second reading at 10:15 - appended to log
Benefits of Event Logs
We preserve the entire history of temperature changes, enabling insights like:
- How quickly does the kitchen heat up?
- What time of day does it heat up?
- Historical trends and patterns
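To make the shift concrete, here is a plain-Python sketch (no Kafka involved, purely illustrative) contrasting the two models with the thermostat numbers from above; the variable names are made up for the example:

```python
from datetime import datetime

# "Thing" view: one row per sensor; each update overwrites the previous state.
current_state = {"sensor_id": 42, "location": "kitchen", "temperature": 22}
current_state["temperature"] = 24  # the 22°C reading is gone

# "Event" view: an append-only log; every reading is preserved.
events = [
    {"sensor_id": 42, "temperature": 22,
     "timestamp": datetime.fromisoformat("2024-12-05T10:00:00")},
    {"sensor_id": 42, "temperature": 24,
     "timestamp": datetime.fromisoformat("2024-12-05T10:15:00")},
]

# With the full history we can answer questions the single row cannot,
# e.g. how quickly the kitchen is heating up.
first, last = events[0], events[-1]
hours = (last["timestamp"] - first["timestamp"]).total_seconds() / 3600
rate = (last["temperature"] - first["temperature"]) / hours
print(f"Kitchen warming at {rate:.1f}°C per hour")  # -> 8.0°C per hour
```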
Kafka Topics: Immutable Event Logs¶
What is a Topic?¶
In Kafka, a topic is where we store messages. Unlike traditional database tables, topics use logs—ordered sequences of immutable records.
graph TD
A[Database Table] -->|Update| B[Row changes, history lost]
C[Kafka Topic] -->|Append| D[New event added to end]
D --> E[Previous events remain unchanged]
Key Characteristics of Topics¶
!!! note "Core Properties" 1. Immutable Messages: Once written, messages cannot be changed 2. Append-Only: New messages are added to the end 3. Ordered Sequence: Messages maintain their order within the log 4. Replayable: Messages can be read multiple times
Topic Structure Example¶
graph LR
subgraph "Topic: thermostat_readings"
A[Message 0<br/>sensor:42<br/>temp:22°C] --> B[Message 1<br/>sensor:43<br/>temp:20°C]
B --> C[Message 2<br/>sensor:42<br/>temp:24°C]
C --> D[Message 3<br/>sensor:44<br/>temp:21°C]
end
Message Anatomy¶
Every message in a Kafka topic consists of:
graph TB
subgraph Message
A[Key<br/>Identifier: sensor_id, user_id]
B[Value<br/>Actual data payload]
C[Timestamp<br/>When event occurred]
D[Headers<br/>Optional metadata]
E[Offset<br/>Position in partition]
end
Components:
- Key (optional): Identifier that the event relates to (e.g., sensor_id: 42). Used for partition assignment: messages with the same key go to the same partition.
- Value (required): The actual event data. Can be any serialized format: JSON, Avro, Protocol Buffers, plain text, etc.
- Timestamp: When the event was produced or received. Can be producer-set (event time) or broker-set (ingestion time).
- Headers (optional): Key-value pairs for lightweight metadata. Useful for tracing, routing, or adding context without modifying the value.
- Offset: Sequential ID starting at 0, incrementing by 1 for each message. Unique within a partition and used by consumers to track their reading position.
Example Message¶
```json
{
  "key": 42,
  "value": {
    "sensor_id": 42,
    "location": "kitchen",
    "temperature": 22,
    "unit": "celsius",
    "read_at": "2024-12-05T10:00:00Z"
  },
  "timestamp": 1733392800000,
  "headers": {
    "source": "iot-gateway-01",
    "version": "1.0"
  },
  "offset": 1234567
}
```
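As a rough illustration of how these fields map onto a producer call, here is a minimal sketch using the confluent-kafka Python client. It assumes a broker reachable at localhost:9092 and a topic named thermostat_readings; the broker assigns the timestamp and offset, so only key, value, and headers are set explicitly:

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # The broker assigns the offset; the producer learns it on delivery.
    if err is None:
        print(f"delivered to partition {msg.partition()} at offset {msg.offset()}")
    else:
        print(f"delivery failed: {err}")

value = {
    "sensor_id": 42,
    "location": "kitchen",
    "temperature": 22,
    "unit": "celsius",
    "read_at": "2024-12-05T10:00:00Z",
}

producer.produce(
    topic="thermostat_readings",
    key="42",                                  # same key -> same partition
    value=json.dumps(value).encode("utf-8"),   # to Kafka, the value is just bytes
    headers=[("source", "iot-gateway-01"), ("version", "1.0")],
    on_delivery=on_delivery,
)
producer.flush()  # block until outstanding messages are delivered
```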
Topics Are Not Queues¶
Critical Distinction
Queue: When you read a message, it's gone—nobody else can read it
**Kafka Topic**: Messages remain available for multiple consumers to read independently
graph TB
subgraph "Queue Behavior"
Q1[Message 1] --> C1[Consumer reads]
C1 --> X1[Message deleted]
end
subgraph "Kafka Topic Behavior"
M1[Message 1] --> CR1[Consumer A reads]
M1 --> CR2[Consumer B reads]
M1 --> CR3[Consumer C reads]
M1 -.-> S[Message remains]
end
This enables:
!!! success "Key Advantages" - Replayability: Read messages again - Multiple consumers: Different applications can process the same events - Fault tolerance: If a consumer fails, it can resume from where it left off
Transforming Immutable Data¶
Since messages are immutable, how do we transform data?
Answer: Create new topics with transformed data.
graph LR
A[Topic: thermostat_readings<br/>All sensor data] -->|Filter| B[Topic: hot_locations<br/>Temperature > 25°C only]
Example transformation using stream processing:
```sql
-- Filter to only hot readings
INSERT INTO hot_locations
SELECT * FROM thermostat_readings
WHERE temperature > 25;
```
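The SQL above is in the spirit of a stream-processing layer such as ksqlDB or Flink SQL; the same transformation can also be hand-rolled as a consume-filter-produce loop. A rough sketch with the confluent-kafka Python client, assuming the topic names and broker address used above:

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "hot-locations-filter",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["thermostat_readings"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    reading = json.loads(msg.value())
    if reading["temperature"] > 25:
        # The source message is untouched; we append to a new topic instead.
        producer.produce("hot_locations", key=msg.key(), value=msg.value())
        producer.poll(0)  # serve delivery callbacks
```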
Log Compaction and Retention¶
Kafka offers flexible data management:
Retention Policies:
graph TB
A[Retention Strategies]
A --> B[Time-based<br/>Keep for 7 days]
A --> C[Size-based<br/>Keep latest 1GB]
A --> D[Compaction<br/>Keep latest value per key]
A --> E[Forever<br/>Keep all messages]
Log Compaction is particularly useful when you only care about the latest state:
graph LR
subgraph "Before Compaction"
A1[Key:42, Temp:20] --> A2[Key:43, Temp:22] --> A3[Key:42, Temp:24] --> A4[Key:43, Temp:21]
end
subgraph "After Compaction"
B1[Key:42, Temp:24] --> B2[Key:43, Temp:21]
end
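Retention and compaction are ordinary topic configurations. A hedged sketch using the confluent-kafka AdminClient, assuming a local single-broker cluster; thermostat_latest is a hypothetical compacted topic name:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topics = [
    # Time-based retention: keep events for 7 days.
    NewTopic("thermostat_readings", num_partitions=3, replication_factor=1,
             config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)}),
    # Compaction: keep only the latest value per key (current reading per sensor).
    NewTopic("thermostat_latest", num_partitions=3, replication_factor=1,
             config={"cleanup.policy": "compact"}),
]

for topic, future in admin.create_topics(topics).items():
    try:
        future.result()  # raises if creation failed
        print(f"created {topic}")
    except Exception as exc:
        print(f"failed to create {topic}: {exc}")
```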
Partitions: Scaling Kafka Horizontally¶
Why Partition?¶
Single Partition Limitations
If a topic were stored entirely on one machine, it would be limited by:
- ❌ That machine's storage capacity
- ❌ That machine's processing power
- ❌ That machine's network bandwidth
Partitioning solves this by splitting a topic into multiple logs distributed across different machines.
graph TB
subgraph "Single Partition - Limited"
A[Topic on one machine] --> B[❌ Storage limit]
A --> C[❌ Processing limit]
A --> D[❌ Bandwidth limit]
end
subgraph "Multiple Partitions - Scalable"
E[Topic split across machines] --> F[✅ Distributed storage]
E --> G[✅ Parallel processing]
E --> H[✅ High throughput]
end
Partition Structure¶
graph TB
subgraph "Topic: thermostat_readings"
T[Topic] --> P0[Partition 0]
T --> P1[Partition 1]
T --> P2[Partition 2]
P0 --> M00[Offset 0: sensor:42]
P0 --> M01[Offset 1: sensor:45]
P1 --> M10[Offset 0: sensor:43]
P1 --> M11[Offset 1: sensor:46]
P2 --> M20[Offset 0: sensor:44]
P2 --> M21[Offset 1: sensor:47]
end
Key Points:
!!! info "Partition Characteristics" - Each partition is an ordered, immutable log - Partitions are independent of each other - A topic can have hundreds or thousands of partitions - Kafka supports up to 2 million partitions with KRaft
Message Ordering¶
graph TB
A[Message Ordering]
A --> B[Within Partition<br/>✅ Strict ordering guaranteed]
A --> C[Across Partitions<br/>❌ No global ordering]
Within a partition: Messages are read in the exact sequence they were written.
Across partitions: No ordering guarantee between partitions.
Design Consideration
If order matters for your use case, ensure related messages share the same key—they'll go to the same partition and maintain order.
Partition Assignment Strategies¶
1. With Message Key (Hash-Based)¶
graph LR
M1[Message<br/>Key: 42] --> H1[Hash function] --> MOD1[hash % 3 = 0] --> P0[Partition 0]
M2[Message<br/>Key: 43] --> H2[Hash function] --> MOD2[hash % 3 = 1] --> P1[Partition 1]
M3[Message<br/>Key: 44] --> H3[Hash function] --> MOD3[hash % 3 = 2] --> P2[Partition 2]
M4[Message<br/>Key: 42] --> H4[Hash function] --> MOD4[hash % 3 = 0] --> P0
Process:
- Hash the message key
- Apply modulo with the number of partitions: hash(key) % num_partitions
- Route the message to the resulting partition
Result: All messages with the same key go to the same partition, preserving order for that key.
Practical Example
All messages with sensor_id=42 → Partition 0
All messages with sensor_id=43 → Partition 1
This ensures all temperature readings from sensor 42 are processed in chronological order.
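A simplified sketch of the routing idea in plain Python. This is an illustration only: the Java client's default partitioner actually applies a murmur2 hash to the serialized key bytes (other clients may use a different hash), but the hash-then-modulo pattern is the same:

```python
# Stand-in for "hash the key, then take it modulo the partition count".
def choose_partition(key: str, num_partitions: int) -> int:
    return hash(key) % num_partitions

NUM_PARTITIONS = 3
for sensor_id in ["42", "43", "44", "42"]:
    print(f"sensor {sensor_id} -> partition {choose_partition(sensor_id, NUM_PARTITIONS)}")

# Within a single run, both "42" messages map to the same partition,
# so readings from sensor 42 keep their relative order.
```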
2. Without Message Key (Round-Robin)¶
graph LR
M1[Message 1<br/>No key] --> P0[Partition 0]
M2[Message 2<br/>No key] --> P1[Partition 1]
M3[Message 3<br/>No key] --> P2[Partition 2]
M4[Message 4<br/>No key] --> P0[Partition 0]
M5[Message 5<br/>No key] --> P1[Partition 1]
Process: Messages are distributed evenly across partitions in a round-robin fashion.
Result:
Benefits
✅ Even load distribution across partitions
Trade-off
❌ No ordering guarantee (messages from the same source can land in different partitions)
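To observe where keyless messages land, you can inspect the partition in the delivery callback. A minimal sketch with the confluent-kafka Python client, assuming a local broker; depending on the client version, keyless messages may be spread per message or per batch ("sticky" partitioning), but the load evens out across partitions either way:

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    if err is None:
        print(f"message landed in partition {msg.partition()}")

# No key: the client chooses partitions itself to balance load.
for i in range(6):
    producer.produce("thermostat_readings",
                     value=f"reading-{i}".encode("utf-8"),
                     on_delivery=on_delivery)
producer.flush()
```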
Choosing Partition Count¶
Partition Count Guidelines
Consider these factors:
| Factor | Consideration |
|--------|--------------|
| **Throughput** | More partitions = higher throughput |
| **Parallelism** | Max parallel consumers in a group = number of partitions |
| **Latency** | Too many partitions can increase latency |
| **Storage** | Each partition consumes disk and memory |
**Rule of thumb:** Start with the number of brokers, then adjust based on workload.
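Partition counts can be raised later (never lowered), so starting small and growing is workable. A hedged sketch using the confluent-kafka AdminClient; the target count of 6 is arbitrary, and note that adding partitions changes which partition a given key hashes to from that point on:

```python
from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Grow thermostat_readings to 6 partitions (counts can only ever increase).
futures = admin.create_partitions([NewPartitions("thermostat_readings", 6)])
for topic, future in futures.items():
    try:
        future.result()
        print(f"{topic} now has 6 partitions")
    except Exception as exc:
        print(f"failed to add partitions to {topic}: {exc}")
```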
Brokers: The Kafka Cluster Infrastructure¶
What is a Broker?¶
A broker is a Kafka server that:
- Stores partition data
- Handles read and write requests
- Manages replication
- Coordinates with other brokers
graph TB
subgraph "Kafka Broker"
B[Broker Process<br/>JVM Application]
B --> S[Local Storage<br/>Partition data]
B --> N[Network Handler<br/>Client requests]
B --> R[Replication Manager<br/>Data sync]
end
Broker Deployment Options¶
Deployment Flexibility
Brokers can run on:
- 🖥️ Physical servers (bare metal)
- ☁️ Cloud instances (AWS EC2, Azure VMs, GCP Compute)
- 🐳 Containers (Docker, Kubernetes)
- 🔌 IoT devices (Raspberry Pi, edge computing)
**Typical setup:** Brokers have access to **local SSD storage** for high-performance I/O.
Kafka Cluster Architecture¶
Multiple brokers form a Kafka cluster:
graph TB
subgraph "Kafka Cluster"
B1[Broker 1<br/>ID: 1] --> P0[Partition 0]
B1 --> P3[Partition 3]
B2[Broker 2<br/>ID: 2] --> P1[Partition 1]
B2 --> P4[Partition 4]
B3[Broker 3<br/>ID: 3] --> P2[Partition 2]
B3 --> P5[Partition 5]
end
C1[Producer] --> B1
C1 --> B2
C1 --> B3
B1 --> CS1[Consumer]
B2 --> CS1
B3 --> CS1
Topic Distribution Across Brokers¶
Example with 2 topics:
graph TB
subgraph "Cluster with 3 Brokers"
B1[Broker 1]
B2[Broker 2]
B3[Broker 3]
end
subgraph "Topic A - 3 Partitions"
B1 --> TA0[Partition 0]
B2 --> TA1[Partition 1]
B3 --> TA2[Partition 2]
end
subgraph "Topic B - 2 Partitions"
B1 --> TB0[Partition 0]
B2 --> TB1[Partition 1]
end
Note: Topics can have different numbers of partitions based on their scaling needs.
Broker Responsibilities¶
1. Handle Client Requests
graph LR
P[Producer] -->|Write Request| B[Broker]
B -->|Acknowledgment| P
C[Consumer] -->|Read Request| B
B -->|Messages| C
2. Manage Storage
- Write messages to partition logs
- Maintain indexes for fast lookups
- Clean up old messages based on retention policies
3. Coordinate Replication
- Sync data between replica brokers
- Manage leader elections
- Ensure data consistency
Metadata Management: KRaft¶
Modern Kafka (4.0+) uses KRaft (Kafka Raft) for metadata management:
graph TB
subgraph "Legacy Architecture (< 4.0)"
K1[Kafka Brokers] <--> Z[Apache ZooKeeper<br/>External dependency]
end
subgraph "Modern Architecture (4.0+)"
K2[Kafka Brokers<br/>with KRaft] --> M[Built-in Metadata<br/>No external dependency]
end
KRaft Benefits:
!!! success "Why KRaft Matters" - ✅ No external ZooKeeper dependency - ✅ Simplified operations - ✅ Faster metadata updates - ✅ Better scalability (supports 2M+ partitions) - ✅ Built on Raft consensus protocol
Breaking Change
As of Kafka 4.0, ZooKeeper is no longer used or supported.
Replication: Ensuring Fault Tolerance¶
Why Replication?¶
Hardware Failure is Inevitable
Disk and server failures will happen. Replication protects against data loss by maintaining multiple copies of each partition.
graph LR
A[Without Replication] --> B[Broker fails] --> C[❌ Data lost forever]
D[With Replication] --> E[Broker fails] --> F[✅ Data available from replicas]
Replication Factor¶
The replication factor defines how many copies of each partition to maintain.
Example: Replication Factor = 3
graph TB
subgraph "Partition 0 - Replicated 3x"
L[Leader Replica<br/>Broker 1]
F1[Follower Replica<br/>Broker 2]
F2[Follower Replica<br/>Broker 3]
end
L -.->|Replicates to| F1
L -.->|Replicates to| F2
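You can see each partition's leader and replica placement in the cluster metadata. A small sketch with the confluent-kafka AdminClient, assuming a local broker and the thermostat_readings topic:

```python
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
metadata = admin.list_topics(topic="thermostat_readings", timeout=10)

for pid, part in sorted(metadata.topics["thermostat_readings"].partitions.items()):
    # leader: broker serving reads/writes; replicas: all copies; isrs: in-sync replicas
    print(f"partition {pid}: leader={part.leader} replicas={part.replicas} isr={part.isrs}")
```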
Leader and Followers¶
For each partition replica set:
- 1 Leader: Handles all writes and, by default, all reads
- n-1 Followers: Sync data from the leader
sequenceDiagram
participant P as Producer
participant L as Leader (Broker 1)
participant F1 as Follower (Broker 2)
participant F2 as Follower (Broker 3)
P->>L: Write message
L->>F1: Replicate message
L->>F2: Replicate message
F1->>L: Ack
F2->>L: Ack
L->>P: Write successful
Leader Election and Failover¶
When a broker fails, Kafka automatically elects a new leader:
graph TB
subgraph "Normal Operation"
L1[Leader<br/>Broker 1] --> F1[Follower<br/>Broker 2]
L1 --> F2[Follower<br/>Broker 3]
end
subgraph "Broker 1 Fails"
X[❌ Broker 1<br/>Down]
F1B[Follower<br/>Broker 2]
F2B[Follower<br/>Broker 3]
end
subgraph "After Leader Election"
L2[New Leader<br/>Broker 2] --> F3[Follower<br/>Broker 3]
L2 --> F4[New Follower<br/>Broker 4<br/>Replacing Broker 1]
end
Process:
- Leader (Broker 1) fails
- Remaining followers detect failure
- One follower (Broker 2) is elected as new leader
- Clients automatically connect to new leader
- Cluster creates new replica on another broker to restore replication factor
Read and Write Patterns¶
Default Behavior¶
graph TB
P[Producer] -->|Writes| L[Leader Replica]
L -->|Reads| C[Consumer]
L -.->|Replicates| F1[Follower<br/>Broker 2]
L -.->|Replicates| F2[Follower<br/>Broker 3]
- Writes: Always go to the leader
- Reads: By default, read from the leader
Follower Reads (Optional)¶
For improved latency, consumers can read from the nearest replica:
graph TB
P[Producer<br/>US East] -->|Write| L[Leader<br/>US East]
L -.->|Replicate| F1[Follower<br/>EU West]
L -.->|Replicate| F2[Follower<br/>Asia]
L -->|Read| C1[Consumer<br/>US East]
F1 -->|Read| C2[Consumer<br/>EU West]
F2 -->|Read| C3[Consumer<br/>Asia]
Benefits:
!!! success "Follower Read Advantages" - ✅ Lower latency for geographically distributed consumers - ✅ Reduced load on leader broker
Trade-off
⚠️ Potential for slightly stale data (eventual consistency)
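Fetch-from-follower is driven by configuration rather than code. A hedged sketch of the consumer side using the confluent-kafka Python client; the broker side must also be configured with a rack-aware replica selector (replica.selector.class) and broker.rack values, and the address and rack names here are hypothetical:

```python
from confluent_kafka import Consumer

# Consumer side of fetch-from-follower (KIP-392): client.rack tells the
# cluster where this consumer runs, so a rack-aware replica selector on the
# brokers can serve it from the nearest in-sync replica.
consumer = Consumer({
    "bootstrap.servers": "broker-eu.example.com:9092",  # hypothetical address
    "group.id": "eu-dashboard",
    "client.rack": "eu-west-1a",   # should match the broker.rack of nearby brokers
    "auto.offset.reset": "latest",
})
consumer.subscribe(["thermostat_readings"])
```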
Replication Guarantees¶
Durability & Fault Tolerance
Kafka provides strong durability guarantees:
| Guarantee | Description |
|-----------|-------------|
| **At-least-once delivery** | With appropriate settings, messages are not lost (though they may be redelivered) |
| **Ordering** | Maintained within each partition |
| **Durability** | Configurable via `acks` parameter |
| **Fault tolerance** | Survives f broker failures with replication factor f+1 |
**Example:**
- Replication factor = 3
- Can tolerate 2 broker failures
- Data remains available and consistent
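The `acks` setting mentioned in the table is a producer configuration. A minimal sketch with the confluent-kafka Python client showing common durability-oriented settings; the values are illustrative and depend on your cluster (min.insync.replicas, for example, is a broker/topic setting and is not set here):

```python
from confluent_kafka import Producer

# Durability-oriented producer settings (values here are illustrative):
#   acks=all             wait until all in-sync replicas have the message
#   enable.idempotence   retry safely without creating duplicates
# Pair this with a broker/topic setting such as min.insync.replicas=2
# (with replication factor 3) so writes fail if too few replicas are in sync.
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",
    "enable.idempotence": True,
})
producer.produce("thermostat_readings", key="42", value=b'{"temperature": 22}')
producer.flush()
```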
Putting It All Together¶
Complete Kafka Architecture¶
graph TB
subgraph "Kafka Cluster"
B1[Broker 1<br/>with KRaft]
B2[Broker 2<br/>with KRaft]
B3[Broker 3<br/>with KRaft]
B1 <--> B2
B2 <--> B3
B3 <--> B1
end
subgraph "Topic: thermostat_readings"
P0L[Partition 0<br/>Leader on B1]
P0F1[Partition 0<br/>Follower on B2]
P0F2[Partition 0<br/>Follower on B3]
P1L[Partition 1<br/>Leader on B2]
P1F1[Partition 1<br/>Follower on B1]
P1F2[Partition 1<br/>Follower on B3]
end
B1 --> P0L
B2 --> P0F1
B3 --> P0F2
B2 --> P1L
B1 --> P1F1
B3 --> P1F2
PROD[IoT Gateway<br/>Producer] -->|Write events| B1
PROD -->|Write events| B2
B1 -->|Read events| CONS[Analytics App<br/>Consumer]
B2 -->|Read events| CONS
Data Flow Summary¶
sequenceDiagram
participant IoT as IoT Device
participant P as Producer
participant B1 as Broker 1 (Leader)
participant B2 as Broker 2 (Follower)
participant B3 as Broker 3 (Follower)
participant C as Consumer
IoT->>P: Temperature reading: 22°C
P->>P: Serialize to bytes
P->>P: Hash key (sensor_id=42)
P->>P: Select partition (hash % 3 = 0)
P->>B1: Write to Partition 0
B1->>B2: Replicate message
B1->>B3: Replicate message
B2->>B1: Ack replication
B3->>B1: Ack replication
B1->>P: Ack write successful
Note over C: Consumer polls for new messages
C->>B1: Fetch from Partition 0
B1->>C: Return messages
C->>C: Deserialize and process
C->>B1: Commit offset
Key Takeaways¶
!!! summary "Events vs Things" - ✅ Kafka focuses on events (things that happen) rather than things (objects)
- ✅ Events capture when and what happened
- ✅ Enables real-time processing and historical analysis
!!! summary "Topics" - ✅ Immutable, append-only logs of events
- ✅ Messages are never modified, only added
- ✅ Support multiple independent consumers
- ✅ Configurable retention and compaction policies
!!! summary "Partitions" - ✅ Enable horizontal scalability
- ✅ Distribute load across multiple brokers
- ✅ Maintain strict ordering within each partition
- ✅ Use hashing to route messages with same key to same partition
!!! summary "Brokers" - ✅ Server processes that store and serve data
- ✅ Form a distributed cluster
- ✅ Handle client read/write requests
- ✅ Use KRaft for built-in metadata management (no ZooKeeper)
!!! summary "Replication" - ✅ Protects against data loss
- ✅ Configurable replication factor (typically 3)
- ✅ Leader handles writes, followers sync data
- ✅ Automatic failover when brokers fail
- ✅ Supports follower reads for lower latency
What's Next?¶
Continue Your Kafka Journey
Now that you understand the core concepts, you're ready to:
1. [:fontawesome-solid-paper-plane: **Produce Messages**](../how-to/kafka-produce-messages.md) - Learn how to write data to Kafka
2. [:fontawesome-solid-download: **Consume Messages**](../how-to/kafka-consume-messages.md) - Learn how to read data from Kafka
3. [:fontawesome-solid-diagram-project: **Use Schema Registry**](../how-to/kafka-use-schema-registry.md) - Manage data schemas
4. [:fontawesome-solid-link: **Connect External Systems**](../how-to/kafka-connect-systems.md) - Integrate Kafka with databases and other systems
5. [:fontawesome-solid-stream: **Process Streams**](../how-to/kafka-stream-processing.md) - Transform data in real-time
Additional Resources¶
!!! note "Further Learning" - Apache Kafka Official Documentation - Kafka Improvement Proposals (KIPs) - Confluent Developer Resources - Kafka: The Definitive Guide (Book)
Course Source
This tutorial is based on the Apache Kafka 101 course from Confluent.