Kafka Introduction

OordaMage
Jan 25

Updated: Mar 11

Apache Kafka is an open-source distributed streaming platform used for building real-time data pipelines and streaming applications.

The unit of data in Kafka is a message. A message is an array of bytes, meaning the data in the message has no specific format or meaning to Kafka. Messages are written to Kafka in batches. A batch is a collection of messages that are all produced to the same topic and partition.
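As a rough illustration of batching, the sketch below groups opaque byte payloads by their (topic, partition) pair. The `Record` type and `group_into_batches` function are hypothetical names made up for this example; they are not part of any Kafka client.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    topic: str
    partition: int
    value: bytes  # opaque payload; its contents are never inspected


def group_into_batches(records):
    """Group payloads so that each batch targets exactly one (topic, partition)."""
    batches = defaultdict(list)
    for record in records:
        batches[(record.topic, record.partition)].append(record.value)
    return dict(batches)


records = [
    Record("orders", 0, b"order-1"),
    Record("orders", 1, b"order-2"),
    Record("orders", 0, b"order-3"),
]
batches = group_into_batches(records)
```

Here the two messages for partition 0 end up in one batch and the message for partition 1 in another.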

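A message can also carry an optional key, which is likewise an array of bytes. Keys give more control over partitioning: with the default partitioner, messages that share a key are written to the same partition.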
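A minimal sketch of key-based partitioning, assuming a simple "hash the key, take it modulo the partition count" rule. The function name `choose_partition` is hypothetical, and `zlib.crc32` is only a stand-in hash for illustration (Kafka's Java client uses a murmur2 hash), so the partition numbers produced here will not match a real cluster.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; the same key always maps to the same partition."""
    # crc32 is a stand-in hash chosen for this sketch, not what Kafka uses.
    return zlib.crc32(key) % num_partitions


p1 = choose_partition(b"user-42", 6)
p2 = choose_partition(b"user-42", 6)
```

Because the mapping depends on the partition count, changing the number of partitions would generally move keys to different partitions.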

Message Schemas: Because Kafka treats messages as opaque bytes, a schema is usually imposed on the message content so that producers and consumers agree on its structure. Common formats include:

- JSON

- XML

- Apache Avro
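To make the idea concrete, here is a small sketch that encodes an event as JSON bytes and decodes it again, which is roughly what a producer-side serializer and a consumer-side deserializer would do. The `serialize` and `deserialize` helpers and the event fields are assumptions made up for this example.

```python
import json


def serialize(event: dict) -> bytes:
    """Turn a dict into the byte array that would be sent as the message value."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def deserialize(payload: bytes) -> dict:
    """Turn the received bytes back into a dict."""
    return json.loads(payload.decode("utf-8"))


event = {"user_id": 42, "action": "login"}
payload = serialize(event)
restored = deserialize(payload)
```

JSON is easy to read but carries no enforced schema; a format such as Avro adds an explicit schema definition on top of the same bytes-in, bytes-out idea.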

# Kafka: Use Cases

- Messaging system

- Activity Tracking

- Gathering metrics from many different locations

- Application log gathering

- Stream Processing

# Kafka Topics:

A Kafka topic is a named stream of data, loosely comparable to a table in a database.

A topic can hold messages in any format, and the sequence of messages in it is called a data stream. Unlike a database table, a topic cannot be queried. Instead, Kafka producers send data to the topic and Kafka consumers read data from it.

Topics are split into **partitions**, and messages within each partition are ordered. Each message within a partition gets an incremental ID, called an **offset**. Partitions provide scalability and redundancy: different partitions of a topic can be hosted on different servers, and each partition can be replicated so that multiple servers store a copy of it in case one server fails.
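The toy model below sketches how offsets behave, assuming each partition is just an append-only list kept in memory. The `PartitionLog` class is hypothetical and leaves out everything a real broker does (disk storage, replication, retention).

```python
class PartitionLog:
    """In-memory stand-in for one partition: an append-only list of messages."""

    def __init__(self):
        self._messages = []

    def append(self, value: bytes) -> int:
        """Store a message and return the offset it was assigned."""
        offset = len(self._messages)  # offsets start at 0 and grow by 1
        self._messages.append(value)
        return offset

    def read(self, offset: int) -> bytes:
        if not 0 <= offset < len(self._messages):
            raise IndexError(f"offset {offset} is out of range")
        return self._messages[offset]


# A topic with two partitions, each with its own independent offset sequence.
topic = [PartitionLog(), PartitionLog()]
first = topic[0].append(b"a")
second = topic[0].append(b"b")
other = topic[1].append(b"c")
```

Offsets in partition 0 run 0, 1, ... while partition 1 starts again at 0, so an offset is only meaningful together with its partition.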

Each partition is made up of multiple segments, each covering a range of offsets. The last segment is the active segment, the only one that data is currently being written to.

Several settings can alter segment properties:

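- **log.segment.bytes:** The maximum size of a single segment (default 1 GB; topic-level equivalent: `segment.bytes`). Once the active segment reaches this size, it is closed and a new segment is created.

- **log.roll.ms:** The maximum time Kafka waits before closing the active segment even if it is not full (default 1 week, taken from `log.roll.hours`; topic-level equivalent: `segment.ms`).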
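The two limits above can be sketched as a single check. This is a simplified model, not the broker's actual logic: the `Segment` class, the `should_roll` function, and the parameter names `max_bytes` and `max_age_ms` are hypothetical, and segment age is measured from an assumed creation time.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int      # first offset stored in this segment
    created_at_ms: int    # assumed creation time, in milliseconds
    size_bytes: int = 0   # bytes written so far


def should_roll(segment, incoming_bytes, now_ms, max_bytes, max_age_ms):
    """Return True if the active segment should be closed and a new one started."""
    too_big = segment.size_bytes + incoming_bytes > max_bytes
    too_old = now_ms - segment.created_at_ms >= max_age_ms
    return too_big or too_old


active = Segment(base_offset=0, created_at_ms=0, size_bytes=900)
roll_on_size = should_roll(active, 200, now_ms=10, max_bytes=1000, max_age_ms=1000)
roll_on_age = should_roll(active, 50, now_ms=1000, max_bytes=1000, max_age_ms=1000)
no_roll = should_roll(active, 50, now_ms=10, max_bytes=1000, max_age_ms=1000)
```

Smaller limits mean more, smaller segment files; larger limits mean fewer files but coarser cleanup.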

Data in a Kafka topic is immutable: once a message is written to a partition, it cannot be changed. Data is also kept only for a limited time (one week by default), after which it becomes eligible for deletion.
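Time-based retention can be pictured as dropping whole closed segments once their newest message is older than the retention window. The sketch below assumes that simplified rule; the `expired_segments` function and the dictionary fields are hypothetical names for this example.

```python
DAY_MS = 24 * 60 * 60 * 1000
WEEK_MS = 7 * DAY_MS


def expired_segments(segments, now_ms, retention_ms=WEEK_MS):
    """Return closed segments whose newest message is older than the retention window."""
    closed = segments[:-1]  # the last segment is active and is skipped here
    return [s for s in closed if now_ms - s["newest_ms"] > retention_ms]


segments = [
    {"base_offset": 0, "newest_ms": 0},
    {"base_offset": 100, "newest_ms": 5 * DAY_MS},
    {"base_offset": 200, "newest_ms": 9 * DAY_MS},  # active segment
]
expired = expired_segments(segments, now_ms=10 * DAY_MS)
```

Existing messages are never edited in place; old data simply disappears a segment at a time.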

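Segments come with two indexes:

- An offset-to-position index: Helps Kafka find the position in the segment file from which to start reading a given message.

- A timestamp-to-offset index: Helps Kafka find messages by timestamp.

A few more points about offsets and partitions:

- An offset only has meaning within a specific partition.

- Offset 3 in partition 0 doesn't represent the same data as offset 3 in partition 1.

- Order is guaranteed only within a partition, not across partitions.

- Unless a key is provided, messages are spread across partitions with no fixed mapping (round-robin or sticky batching, depending on the client version). When a key is provided, the partition is chosen from the key.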
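Returning to the offset-to-position index: a lookup can be sketched as a binary search for the closest indexed offset at or below the target, followed by a scan forward from that file position. The index contents and the `lookup_position` function below are made-up values for illustration, assuming the index holds only some offsets rather than every one.

```python
import bisect

# Hypothetical sparse index: parallel lists of offsets and byte positions.
index_offsets = [0, 40, 85, 130]
index_positions = [0, 4096, 8192, 12288]


def lookup_position(target_offset: int) -> int:
    """Return the file position to start scanning from for target_offset."""
    i = bisect.bisect_right(index_offsets, target_offset) - 1
    if i < 0:
        raise ValueError("offset is before the start of this segment")
    return index_positions[i]


start = lookup_position(100)  # nearest indexed offset at or below 100 is 85
```

The timestamp-to-offset index can be searched the same way, with timestamps in place of offsets and offsets in place of positions.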
