Kafka: Exploring the Enigmatic Realms of Existence

Sayantan Samanta
26 min read · Jun 8, 2023


Diving deep into a top streaming platform

Welcome, fellow seekers of knowledge and curious minds, to the realm of Kafka!

So guys, fasten your seatbelts and let the Kafkaesque adventure begin! 🔥

APACHE KAFKA

Apache Kafka is a distributed streaming platform designed to handle high-throughput, fault-tolerant, and scalable real-time data streams.

Kafka consists of servers and clients communicating via a high-performance TCP network protocol.

It allows you to publish and subscribe to streams of records, similar to a message queue or enterprise messaging system, but with additional capabilities to store, process, and analyze data streams. In short, it moves massive amounts of data, not just from point A to B, but from points A to Z and anywhere else you need, all at the same time.

Apache Kafka is an alternative to traditional enterprise messaging systems. It started out as an internal system developed by LinkedIn to handle 1.4 trillion messages per day, but it has been an open-source data streaming solution since 2011, with applications for a variety of enterprise needs.

It is now leveraged by big companies such as Uber, Airbnb, Netflix, Yahoo, and Udemy, as well as by more than 35% of Fortune 500 companies and more than 80% of Fortune 100 companies.

The project is now maintained by the Apache Software Foundation and has become one of the most widely used technologies for handling real-time data streams.

Kafka can be deployed on bare-metal hardware, virtual machines, and containers in on-premise as well as cloud environments.

Kafka Architecture

Why Kafka Matters

The platform is typically used to build real-time streaming data pipelines that support streaming analytics and mission-critical use cases with guaranteed ordering, no message loss, and exactly-once processing.

Apache Kafka is massively scalable because it allows data to be distributed across multiple servers.

Since Apache Kafka minimizes the need for point-to-point integrations for data sharing in certain applications, it can reduce latency to milliseconds.

This means data is available to users faster, which can be advantageous in use cases that require real-time data availability, such as IT operations and e-commerce. It can also distribute and replicate partitions across many servers, which protects against server failure.

Which domains use Kafka

Apache Kafka is being widely used across various industries and domains due to its robust features and real-time streaming capabilities.

Kafka is everywhere!! Believe me

1. Finance and Banking:

  • Real-time transaction processing and fraud detection.
  • Market data streaming and analysis.
  • Risk management and compliance.
  • Companies: PayPal, Goldman Sachs, Capital One

2. E-commerce and Retail:

  • Real-time inventory management and supply chain optimization.
  • Personalized recommendations and customer behavior tracking.
  • Order tracking and fulfillment.
  • Companies: Amazon, Walmart, Alibaba Group

3. Social Media and Entertainment:

  • Real-time data streaming and analytics for user engagement and content delivery.
  • Social media sentiment analysis and trending topic detection.
  • Real-time ad bidding and targeting.
  • Companies: Twitter, Netflix, Spotify

4. Telecommunications:

  • Call detail record (CDR) processing and analytics.
  • Network monitoring and anomaly detection.
  • Real-time billing and charging.
  • Companies: AT&T, Verizon, Vodafone

5. IoT (Internet of Things):

  • Sensor data ingestion, processing, and analytics.
  • Smart home automation and monitoring.
  • Industrial IoT for predictive maintenance and asset tracking.
  • Companies: Bosch, General Electric, Philips

6. Healthcare:

  • Real-time patient monitoring and alerting systems.
  • Health data integration and interoperability.
  • Clinical research and analytics.
  • Companies: Philips Healthcare, Cerner Corporation, Epic Systems

7. Gaming:

  • Real-time player telemetry and analytics.
  • Multiplayer game synchronization.
  • In-game events and notifications.
  • Companies: Electronic Arts (EA), Unity Technologies, Epic Games

8. Transportation and Logistics:

  • Real-time tracking and management of vehicles and shipments.
  • Supply chain visibility and optimization.
  • Traffic monitoring and route optimization.
  • Companies: Uber, FedEx, DHL

9. Energy and Utilities:

  • Real-time monitoring of power grids and smart meters.
  • Predictive maintenance for infrastructure and equipment.
  • Energy consumption analytics and optimization.
  • Companies: General Electric (GE), Enel, Schneider Electric

10. Government and Public Sector:

  • Real-time data processing for smart city initiatives.
  • Public safety and emergency response systems.
  • Citizen engagement and feedback platforms.
  • Organizations: United States Department of Defense, NASA, City of Chicago

As you can see, these examples demonstrate the broad adoption of Kafka across different domains by both industry giants and innovative startups.

Things you must know before getting started

Events

An event represents a fact that happened in the past. Events are immutable and never stay in one place. They always travel from one system to another system, carrying the state changes that happened.

Streams

An event stream represents related events in motion.

Commit log

A commit log (also referred to as a write-ahead log or a transaction log) is a central component of Kafka. It is a durable, append-only log of all records that have been published to Kafka. The commit log is used to store data for both producers and consumers.

(P.S. You can configure Kafka topics to delete messages once the log grows too large or after a retention period has passed.)

It is read from left to right and guarantees item ordering.

Sample illustration of a commit log

Kafka actually stores all of its messages to disk (more on that later), and having them ordered in the structure lets it take advantage of sequential disk reads.

  • Reads and writes are constant time O(1) (given the record's offset), which is a huge advantage compared to the O(log N) disk operations of other structures, since each disk seek is expensive.
  • Reads and writes don't affect each other: writing doesn't lock reading and vice versa (as opposed to balanced trees).

These two points have huge performance benefits since the data size is completely decoupled from performance. Kafka has the same performance whether you have 100 KB or 100 TB of data on your server.
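
To make the idea concrete, here is a purely conceptual Python sketch of an append-only log with offset-based reads. This is not how Kafka is implemented internally (Kafka uses segmented files on disk); it only illustrates why appending and reading at a known offset are O(1).

```python
class CommitLog:
    """A toy append-only log: records are only ever appended, never modified."""

    def __init__(self):
        self._records = []  # in-memory stand-in for Kafka's on-disk segment files

    def append(self, record) -> int:
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int):
        """Read the record at a known offset in O(1); reads never block writes here."""
        return self._records[offset]


log = CommitLog()
first = log.append({"event": "user_signed_up", "user_id": 42})
log.append({"event": "user_logged_in", "user_id": 42})
print(log.read(first))  # {'event': 'user_signed_up', 'user_id': 42}
```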

Topic

Kafka organizes data streams into topics, which are similar to categories or channels to which data is published. Producers write data to specific topics, and consumers subscribe to topics to read and process the data.

Kafka Topics

In Kafka's universe, a topic is a materialized event stream. In other words, a topic is a stream at rest.

There are two types of topics: compacted and regular. 👇

Records in compacted topics do not expire based on time or space bounds. A newer message updates older messages that have the same key, and Apache Kafka retains the latest message per key unless it is explicitly deleted by the user.

For regular topics, records can be configured to expire, deleting old data to free storage space.

Topic and partitions

Partitions

Kafka topics are divided into multiple partitions to allow for parallel processing and scalability. Each partition is an ordered, immutable sequence of records. The records within a partition are assigned a unique offset, which represents the position of the record in the partition.

How Kafka Partitioning Works

Having broken a topic up into partitions, we need a way of deciding which messages to write to which partitions. Typically, if a message has no key, messages are distributed round-robin among all the topic's partitions. In this case, all partitions get an even share of the data, but related messages end up scattered across partitions, so their overall ordering cannot be preserved.

The key takeaway is to use a partition key to put related events together in the same partition, in the exact order in which they were sent.

If the message does have a key, then the destination partition will be computed from a hash of the key. This allows Kafka to guarantee that messages having the same key always land in the same partition, and therefore are always in order.

For example, if you are producing events that are all associated with the same customer, using the customer ID as the key guarantees that all of the events from a given customer will always arrive in order. This creates the possibility that a very active key will create a larger and more active partition, but this risk is small in practice and is manageable when it presents itself. It is often worth it in order to preserve the ordering of keys.
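
As a hedged illustration of this, here is roughly what producing keyed events could look like with the confluent-kafka Python client (one of several available Kafka clients). The broker address, the topic name "orders", and the customer key are placeholder assumptions.

```python
from confluent_kafka import Producer

# Assumed local broker; adjust bootstrap.servers for your cluster.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Called once the broker has acknowledged (or rejected) the message.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} partition {msg.partition()} at offset {msg.offset()}")

# Using the customer ID as the key: every event for "customer-42" hashes to the
# same partition, so the events are stored and read back in the order sent.
for event in ["order_created", "order_paid", "order_shipped"]:
    producer.produce("orders", key="customer-42", value=event, callback=on_delivery)

producer.flush()  # block until all outstanding messages are delivered
```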

Producers and Consumers

Producers are responsible for publishing data to Kafka topics, while consumers subscribe to topics and process the published data. Producers and consumers are decoupled: they can run on the same or on different systems and can scale independently to handle large volumes of data.

Producer and consumer are typically different services
A single service can act as both producer and consumer

Kafka appends messages to the partitions as they arrive from the producers. By default, it uses a round-robin partitioner to spread messages uniformly across partitions.

Producers can modify this behavior to create logical streams of messages. For example, in a multi-tenant application, we might want to key messages by each message's tenant ID. In an IoT scenario, we might want each producer's identity to map to a specific partition consistently. Making sure all messages from the same logical stream map to the same partition guarantees their in-order delivery to consumers.

Kafka Producers

Consumers consume messages by maintaining an offset (or index) to these partitions and reading them sequentially.

A single consumer can consume multiple topics, and consumers can scale up to the number of partitions available.

Kafka Consumers

As a result, when creating a topic, one should carefully consider the expected throughput of messaging on that topic. A group of consumers working together to consume a topic is called a consumer group.

Kafka's API typically handles the balancing of partition processing between consumers in a consumer group and the storing of consumers' current partition offsets.
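
For illustration only, a member of a consumer group might look like the sketch below; the broker address, group id, and topic name are assumptions. Running several copies of this script with the same group.id makes Kafka split the topic's partitions among them.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "order-processors",         # members of this group share the partitions
    "auto.offset.reset": "earliest",        # where to start when no committed offset exists
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)    # pull model: the consumer asks the broker for data
        if msg is None:
            continue                        # no message arrived within the timeout
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(f"partition={msg.partition()} offset={msg.offset()} value={msg.value()}")
finally:
    consumer.close()                        # leave the group cleanly
```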

The Apache Kafka Streams API allows writing Java applications that pull data from Topics and write results back to Apache Kafka. External stream processing systems such as Apache Spark, Apache Apex, Apache Flink, Apache NiFi, and Apache Storm can also be applied to these message streams.

Replication

It would not do if we stored each partition on only one broker. Whether brokers are bare-metal servers or managed containers, they and their underlying storage are susceptible to failure, so we need to copy partition data to several other brokers to keep it safe. Those copies are called follower replicas, whereas the main partition is called the leader replica. When you produce data to the leader (in general, reading and writing are done to the leader), the leader and the followers work together to replicate those new writes to the followers.

So in short, Replicas are copies of the partition. They are used to guarantee that no data is lost in the case of a breakdown or a scheduled shutdown.

This happens automatically, and while you can tune some settings in the producer to produce varying levels of durability guarantees, this is not usually a process you have to think about as a developer building systems on Kafka. All you really need to know as a developer is that your data is safe and that if one node in the cluster dies, another will take over its role.
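
Those durability knobs are ordinary producer settings. A minimal sketch, assuming the confluent-kafka Python client and a local broker:

```python
from confluent_kafka import Producer

# acks controls how many replicas must acknowledge a write before it is
# considered successful:
#   acks=0    fire-and-forget (fastest, weakest durability)
#   acks=1    only the leader replica must persist the write
#   acks=all  all in-sync replicas must persist the write (strongest durability)
producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "acks": "all",
    "enable.idempotence": True,             # avoid duplicates when retries happen
})

producer.produce("payments", key="customer-42", value="charge:19.99")  # hypothetical topic
producer.flush()
```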

Brokers

Kafka brokers form the backbone of the Kafka system.

Kafka employs a distributed architecture with a cluster of servers called brokers. Each broker holds a subset of records that belongs to the entire cluster.

Brokers are responsible for storing and handling the published data. They maintain a log of records and replicate data across multiple brokers for fault tolerance.

Cluster and Brokers

Cluster

Multiple brokers are combined to form a Kafka cluster, providing fault tolerance and horizontal scalability.

We will discuss the clustering concepts later and dive deep into them. So stay tuned.

Offsets and Consumer Groups

Kafka maintains an offset, which is a unique identifier for each message within a partition.

The offset of a message works as a consumer-side cursor at this point.

Consumers use offsets to keep track of the messages they have consumed, enabling fault-tolerant and scalable processing.

Advancing and remembering the last read offset within a partition is the responsibility of the consumer; Kafka brokers do not track each consumer's position on their own.

Consumer groups allow multiple consumers to work together to process data from a topic, with each consumer in the group reading from a different partition.

An offset is roughly like an index (sequence number) in KDS (Kinesis Data Streams). 😉

Using the offset, a consumer can seek to any message (be it the 1st, the 2nd, or the nth) within the partition allocated to that consumer.

Consumer offsets are important because, if a consumer goes down, it will know from which offset to resume reading once it is back up.

Offsets can be committed in three ways:

At most once: Offsets are committed as soon as a message is received. If processing then fails, the message is lost. (Not the preferred approach.)

At least once: Offsets are committed after the message is processed. If processing fails, the message will be read again, which can produce duplicates. (The preferred approach.)

Exactly once: Achieved with Kafka's transactional APIs, and only for Kafka-to-Kafka workflows (for example, Kafka Streams).
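
As an illustrative sketch (the broker address, group id, topic, and the process() function are assumptions), at-least-once behaviour is usually achieved by disabling auto-commit and committing the offset only after a message has been processed successfully:

```python
from confluent_kafka import Consumer

def process(payload: bytes) -> None:
    """Hypothetical business logic; raises an exception on failure."""
    print(f"processing {payload!r}")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "billing",
    "enable.auto.commit": False,    # we decide when offsets are committed
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payments"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    process(msg.value())
    # Commit only after successful processing: if we crash before this line,
    # the message is re-read on restart (at-least-once semantics).
    consumer.commit(message=msg, asynchronous=False)
```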

Before proceeding further, let's clear up some basic doubts and confusions.

FOR FUN 😂

📌 Doubt 1: Are a topic in Kafka and a shard in KDS similar?

No, a topic in Kafka is not the same as a shard in KDS.

A topic in Kafka is a logical grouping of data records. It is a way to organize data so that it can be easily consumed by applications. A shard in KDS is a partition of data within a stream; it is the base unit of throughput and parallelism in KDS.

There are a few key differences between topics and shards.

  • Capacity: A Kafka topic can hold as much data as its retention settings and disk capacity allow, while a Kinesis shard is bounded by throughput limits (roughly 1 MB/s or 1,000 records/s for writes and 2 MB/s for reads per shard).
  • Data distribution: In Kafka, records are spread across a topic's partitions (round-robin or by key hash), while in Kinesis, records are routed to shards by a hash of each record's partition key.
  • Data access: Kafka consumers subscribe to a topic and are assigned partitions automatically, while Kinesis consumers read from individual shards using shard iterators.

In general, topics are a more flexible, purely logical way to organize data, while shards sit at a lower level and determine throughput and parallelism.

So, the question arises β€”

Is there anything in Kafka that is similar to KDS shards? 😳

YES, THERE IS!! 👉😃

Partitions in Kafka and Shards in Kinesis are similar in that they are both logical divisions of a topic or stream.

However, there are some key differences between the two.

  • Number of partitions: A Kafka topic can have any number of partitions, while a Kinesis stream has a set number of shards that must be changed by resharding (or is managed automatically in on-demand mode).
  • Data distribution: In Kafka, data is distributed across partitions based on the partition key (or round-robin when there is no key). In Kinesis, data is distributed across shards based on a hash of the partition key.
  • Replication: In Kafka, each partition is replicated to a configurable number of brokers (the replication factor). In Kinesis, each record is synchronously replicated across three Availability Zones.
  • Durability: In Kafka, data is persisted to disk and replicated before it is acknowledged as written (depending on producer settings). In Kinesis, data is also stored durably, with a retention period of 24 hours by default, extendable up to 365 days.

Overall, partitions in Kafka are more flexible and customizable than shards in Kinesis, while Kinesis is a fully managed service that handles replication, durability, and scaling for you.

📌 Doubt 2: Do consumers come to Kafka to fetch the data, or does Kafka go to the consumers and push the data to them?

Let's understand it properly. 🎯

Consumers are the ones who come to Kafka to fetch the data.

Here's how.

When a producer publishes data to a Kafka topic, the data is stored in a durable way and can be consumed by any number of consumers.

Consumers use a pull model to consume data from Kafka. This means that the consumer periodically sends a request to a Kafka broker in order to fetch data from it. The broker then returns a batch of records to the consumer. The consumer can then process these records as needed.

Kafka does not push data to consumers. This is because Kafka is designed to be a scalable and distributed platform. If Kafka pushed data to consumers, it would have to keep track of all of the consumers and their current positions in the data stream. This would be a very complex and expensive task.

By having consumers pull data from Kafka, Kafka can scale horizontally and be more efficient.

Kafka APIs

Based on the documentation, the Apache Kafka architecture has five core APIs:

The Producer API allows an application to publish a stream of records (data) to one or more Kafka topics.

The Consumer API allows an application to subscribe to one or more topics and process data produced to them.

The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.

The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table, letting us integrate relational database data into the Kafka cluster and automatically pull changes onto the cluster as they happen.

The Admin API allows managing and inspecting topics, brokers, and other Kafka objects.

Any application can play any of the roles mentioned above, as the Kafka cluster is highly flexible.
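
As a small, hedged example of the Admin API, the confluent-kafka Python client can list a cluster's brokers and topics like this (the broker address is an assumption):

```python
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed broker address

# list_topics() returns cluster metadata: brokers, topics, and their partitions.
metadata = admin.list_topics(timeout=10)

print("Brokers:")
for broker in metadata.brokers.values():
    print(f"  id={broker.id} host={broker.host}:{broker.port}")

print("Topics:")
for topic in metadata.topics.values():
    print(f"  {topic.topic} ({len(topic.partitions)} partitions)")
```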

Best Apache Kafka Use Cases: What is Kafka Used for?

There are several compelling reasons to use Kafka in various data-centric applications and architectures. Here are some key benefits and use cases for using Kafka:

Various use-cases

Website Activity Tracking

This was the original use case for Kafka. LinkedIn needed to rebuild its user activity tracking pipeline as a set of real-time publish-subscribe feeds. Activity tracking is often very high volume, as each user page view generates many activity messages (events):

  • user clicks
  • registrations
  • likes
  • time spent on certain pages
  • orders
  • environmental changes
  • and so on

These events can be published (produced) to dedicated Kafka topics. Each feed is available for (consumed by) any number of use cases, such as loading into a data lake or warehouse for offline processing and reporting.

Other applications subscribe to topics, receive the data, and process it as needed (monitoring, analysis, reports, newsfeeds, personalization, and so on).

Example scenario: An online e-commerce platform could use Kafka to track user activities in real time. Each user activity, such as product views, cart additions, purchases, reviews, and search queries, could be published as an event to specific Kafka topics. These events can then be written to storage or consumed in real time by various microservices for recommendations, personalized offers, reporting, and fraud detection, as sketched below.
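
For instance, publishing a product-view event could look like the sketch below, assuming the confluent-kafka Python client, a local broker, and a hypothetical page-views topic. Keying by user_id keeps one user's activity in order.

```python
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker address

# A hypothetical page-view event published by the web frontend.
event = {
    "event_type": "product_view",
    "user_id": "user-123",
    "product_id": "sku-987",
    "timestamp_ms": int(time.time() * 1000),
}

producer.produce(
    "page-views",                             # hypothetical topic name
    key=event["user_id"],                     # one user's activity stays in order
    value=json.dumps(event).encode("utf-8"),  # serialize the event as JSON bytes
)
producer.flush()
```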

Real-Time Data Streaming

Kafka is designed for handling high-throughput, low-latency, and real-time data streaming. It enables the ingestion, processing, and delivery of data in near real-time, making it ideal for use cases that require real-time analytics, event-driven architectures, and real-time decision-making.

Scalability and Fault Tolerance

Any system can be scaled vertically or horizontally. Vertical scaling means adding more resources, such as CPU and memory, to the same nodes, and it incurs a high operational cost. Horizontal scaling is achieved by simply adding more nodes to the cluster as capacity demands increase. Kafka scales horizontally: we can add new nodes/brokers to the cluster whenever we run out of capacity or space.

Horizontal scaling becomes much cheaper after a certain threshold

Kafka is highly scalable and fault-tolerant. It can handle high-volume data streams and can be distributed across multiple brokers, allowing for horizontal scaling and high availability. Kafka's replication mechanism ensures data durability and fault tolerance, making it reliable for mission-critical applications.

Log Aggregation

Kafka's ability to handle high-throughput data streams makes it ideal for log aggregation. It can collect logs from various sources, such as application servers, network devices, and databases, and store them in a centralized Kafka cluster. This data can then be consumed in real time or stored for batch processing and analysis.

Data Integration and Pipelines

Kafka acts as a central data hub, enabling seamless integration and data flow between various systems and applications. It provides connectors that facilitate the ingestion of data from external sources and the export of data to other systems. Kafka's distributed architecture allows for building robust and scalable data pipelines for data processing, transformation, and synchronization.

Event-Driven Architecture

Kafka is well-suited for event-driven architectures, where events are produced, consumed, and processed asynchronously. It enables the decoupling of components, making systems more modular, scalable, and flexible. Event-driven architectures powered by Kafka can support real-time data processing, event sourcing, and event-driven microservices.

Microservices and Distributed Systems

Kafka plays a crucial role in building distributed systems and microservices architectures. It facilitates communication and data sharing between microservices by acting as a messaging backbone. Kafka ensures reliable message delivery, scalability, and loose coupling between services, allowing for the development of scalable and resilient distributed systems.

Data Streaming and Analytics

Kafka's real-time streaming capabilities make it valuable for streaming data processing and analytics. It allows for continuous data ingestion, processing, and analysis, enabling real-time monitoring, alerting, and decision-making. Kafka integrates well with popular stream processing frameworks like Apache Flink, Apache Spark, and Kafka Streams, providing powerful tools for real-time data processing and analytics.

Change Data Capture (CDC)

Change Data Capture is a technique used to capture and propagate data changes from databases to other systems in real-time. Kafka, with its ability to handle large volumes of data and ensure fault tolerance, is commonly used as the underlying platform for CDC solutions. By capturing database changes as events in Kafka topics, other systems can consume and react to those changes in real-time.

Internet of Things (IoT)

Kafka is used extensively in IoT applications for handling high-volume, real-time data generated by IoT devices. It enables ingestion, processing, and analysis of sensor data, machine data, and telemetry data. Kafka's ability to handle massive data streams and support real-time analytics is crucial for IoT use cases such as smart cities, asset tracking, and industrial monitoring.

Now let's dive deep into the concepts of the Kafka cluster and ZooKeeper

Kafka Cluster Infrastructure With Zookeeper

Kafka relies on Apache ZooKeeper as its distributed coordination service. ZooKeeper plays a critical role in the Kafka infrastructure by managing the configuration, synchronization, and leadership election processes among the Kafka brokers in a cluster.

Apache Kafka and Apache Zookeeper

Let's first understand what a Kafka cluster is and why we need one.

Kafka Cluster diagram

A Kafka cluster is just a set of Kafka brokers coupled with ZooKeeper and, optionally, a Schema Registry.

Here is the flow of how exactly the cluster works.

One big Kafka cluster is created with multiple brokers. In the diagram, there are only 3 brokers or nodes: B1, B2, and B3.

The producer sends the message into the Kafka cluster. Here, a message consists of three components:

Main data, which is called the payload

Timestamp: when the data was captured by the producer

Partition key and topic name

The producer has to decide one more thing:

"How many replicas of the message are needed?"

Suppose the message is crucial and is stored on broker B2. If B2 somehow falters and shuts down or terminates, the message will be lost.

So, we want another broker to hold the exact same message as a replica of the original message sent by the producer.

So basically, the producer has to say how many replicas it needs for the message; more technically, it has to specify the replication factor.

Replication Factor

The replication factor in Kafka is the number of copies of a topic partition that are stored on different brokers in a Kafka cluster. When a producer writes data to a topic, it is written to the leader broker for that partition. The leader broker then replicates the data to the follower brokers for that partition.

Recommended practice: The replication factor can be set for each topic in a Kafka cluster. The default replication factor is 1, which means that each topic partition is stored on a single broker. However, it is recommended to set the replication factor to at least 2, which means that each topic partition is stored on two different brokers.

In our example, we have a cluster with 3 brokers (as per the Kafka cluster diagram above).

P.S. Don't be confused that the diagram shows the message stored on broker 2; that was before we set the replication factor to 2. After setting the RF to 2, brokers 1 and 3 hold the replicas, and broker 2 has no replica.

And one more thing, just to make it a little clearer: messages are stored in partitions, and partitions live in their respective topics.

So having replicas of a message means replicating the message along with the partition and topic it belongs to.

As we have an RF of 2, each partition will be stored on 2 different brokers. This means that if broker 1 fails, the data for that partition will still be available on broker 3.

So consumers can pull the message from broker 3 in case broker 1 gets terminated or fails, and vice versa.

Special Note:

📌 The replication factor is set per topic

So, what does it mean?

As discussed above, the replication factor is the number of copies of a topic's data that are stored on different brokers in a Kafka cluster. This means that if one broker fails, the data for that topic will still be available on the other brokers.

The replication factor is configured per topic. This means that you can have different replication factors for different topics. For example, you might have a topic with a replication factor of 3, which means that there are 3 copies of the data for that topic stored on 3 different brokers.

You might also have a topic with a replication factor of 1, which means that there is only 1 copy of the data for that topic stored on 1 broker.

The default replication factor for a topic is 1.

However, setting the replication factor to 3 or more is recommended for production environments. In development and testing environments, 2 replicas can be used if not 3.
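
As a hedged sketch, the replication factor is set when a topic is created, for example via the Admin API (the topic name, partition count, and broker address are assumptions); the kafka-topics.sh command-line tool can do the same.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed broker address

# 3 partitions, each replicated to 3 brokers (requires a cluster of at least 3 brokers).
new_topic = NewTopic("sales", num_partitions=3, replication_factor=3)

futures = admin.create_topics([new_topic])
for topic, future in futures.items():
    try:
        future.result()  # raises if creation failed (e.g., not enough brokers)
        print(f"Created topic {topic}")
    except Exception as exc:
        print(f"Failed to create topic {topic}: {exc}")
```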

Here we have multiple brokers: 3 individual systems, brokers A, B, and C.

The producer produces a message, m1. An RF of 2 means a total of 2 copies: one is stored on B and one on C.

The message is stored in a partition, and the partition belongs to a topic.

Sales is our topic, and the partition key is IN.

One of the replicas gets elected as the leader through the leader election process: the IN partition on B is the leader, and the same IN partition on C is the follower.

Next time, one more message comes along (let's say m2). It has the same partition key, so it will go to the Sales topic and the IN partition.

Since the replication factor is still 2, m2 will again be stored in the partition on both B and C.

Changing the RF on the fly is not recommended.

Here comes a very important point: the producer does not send the message to both B and C. Instead, the producer sends the data to one replica, the leader partition, which here is on B.

Once the leader partition gets the message, the broker node copies it into the follower partition on C.

This is what we mean by coordination here.

For this coordination to be successful, every broker should know about the others: B should know which partition is where and which topics the other brokers (A and C in our case) are handling.

So, for better coordination, Kafka relies heavily on a third-party tool: Apache ZooKeeper.

It is a wonderful coordination program. ZooKeeper helps the brokers coordinate, especially around metadata.

ZooKeeper keeps track of broker state: it knows a broker is alive as long as the broker keeps sending regular heartbeats, and it also tracks whether a broker that is responsible for replication can keep up with its replication needs.

It is basically a "watch list" concept.

Suppose B goes down; B was the broker holding the leader replica in our cluster.

ZooKeeper will help elect a new leader: the follower partition on C can be promoted to leader.

There are many more contributions of ZooKeeper to Kafka; we will discuss them in detail later.

Special Note:

📌 Changing the RF on the fly is not recommended

Reason 👇🎯

The replication factor can technically be changed on the fly. However, some precautions should be taken when doing so.

  • Data loss: If the replication factor is decreased, there is a higher risk of data loss, because fewer copies of the data exist and losing the brokers holding the remaining copies means losing the data.
  • Performance: Changing the replication factor can also affect performance. If the replication factor is increased, more data needs to be replicated, which can slow down the system while the new replicas catch up.
  • Consistency: Changing the replication factor can also affect consistency. While replicas are being added or removed, some data may exist on some nodes but not on others, which can lead to temporary inconsistencies.

In some cases, the benefits of changing the replication factor may outweigh the risks. For example, if you need to increase the replication factor to protect against data loss, then the benefits of doing so may outweigh the risks. However, if you are only changing the replication factor for performance reasons, then it may be better to wait until the system is not under load to make the change.

Here are the things we have to take care of before changing the replication factor:

Make sure that you have a backup of your data. This will protect you in case of data loss.

Change the replication factor gradually. This will help to minimize the impact on performance.

Monitor the system after you have changed the replication factor. This will help you to identify any problems that may have occurred.

It is also important to note that not all systems allow you to change the replication factor on the fly. Some systems require you to stop the system before you can change the replication factor.

So, it is always better to plan these things beforehand.

So, let's get back to the main discussion.

📌 How do the brokers coordinate with each other?

Kafka uses a third-party tool: ZooKeeper.

Kafka and ZooKeeper

What is ZooKeeper

Apache ZooKeeper plays a very important role in system architecture, working in the shadow of more visible big data tools such as Apache Spark or Apache Kafka. In other words, Apache ZooKeeper is a distributed, open-source configuration and synchronization service with a naming registry for distributed applications.

History and Origin of ZooKeeper

The ZooKeeper framework was originally built at Yahoo! in 2007 to make it easier to coordinate and access their applications. It later became a standard coordination service for Hadoop, HBase, and other distributed frameworks.

Role of ZooKeeper in Kafka

a. Kafka Brokers

Below are the roles ZooKeeper plays for Kafka brokers:

i. State
ZooKeeper tracks broker state: a broker is considered alive as long as it keeps sending regular heartbeats to ZooKeeper, and ZooKeeper also tracks whether a broker can keep up with the replication it is responsible for.

ii. Quotas
Kafka brokers allow some clients to have different producing and consuming quotas. These values are stored in ZooKeeper under the /config/clients path and can be changed with the bin/kafka-configs.sh script.

iii. Replicas
ZooKeeper also keeps track of the set of in-sync replicas (ISR) for each partition. If the previously selected leader fails, a new leader is elected from the currently live in-sync nodes.

iv. Nodes and Topics Registry
ZooKeeper also stores the node and topic registries. Under the /brokers/ids and /brokers/topics zNodes you can find all available brokers and, more precisely, which topics are held by each broker. A broker creates its registration automatically when it starts.
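
Purely for illustration, and only for ZooKeeper-based clusters, these zNodes can be inspected with a ZooKeeper client such as kazoo; the ZooKeeper address below is an assumption.

```python
import json

from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")  # assumed ZooKeeper address
zk.start()

# Broker registrations live under /brokers/ids; each child is a broker id.
for broker_id in zk.get_children("/brokers/ids"):
    data, _stat = zk.get(f"/brokers/ids/{broker_id}")
    info = json.loads(data)
    print(f"broker {broker_id}: {info.get('host')}:{info.get('port')}")

# Topic metadata (partition and replica assignments) lives under /brokers/topics.
print("topics:", zk.get_children("/brokers/topics"))

zk.stop()
```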

b. Kafka Consumers

i. Offsets
In older Kafka releases, ZooKeeper was the default storage engine for consumer offsets, i.e., the record of how far each consumer has read. (Since the 0.9 consumer, offsets are stored in an internal Kafka topic, __consumer_offsets, by default.)

ii. Registry
Consumers in Kafka also have their own registry, just as brokers do. The same rules apply: it is an ephemeral zNode that is destroyed once the consumer goes down, and the registration is made automatically by the consumer.

Overall, ZooKeeper contributes to:

  • Leader election: ZooKeeper helps ensure that only the current leader for a partition processes new messages published to it. This works through the watch concept: when a client registers a watch on the node that represents a topic, ZooKeeper notifies the client if the leader changes, so the client can update its own state and start working with the new leader.
  • Broker registration: When a new broker joins the cluster, it registers itself with ZooKeeper by creating a new node that represents the broker. The broker's registration information, such as its IP address and port number, is stored in this node and is used by other brokers and clients to route requests to it.
  • Topic configuration: ZooKeeper stores the configuration for Kafka topics, including the number of partitions, the replication factor, and the retention policy. Kafka brokers use this information to store and process messages, and when a topic's configuration changes, ZooKeeper notifies the brokers involved so they can update their state and apply the new configuration.
  • Quotas: ZooKeeper can be used to enforce quotas on Kafka clients. A node in ZooKeeper stores the quota information, such as the maximum rate at which messages can be produced or consumed, and Kafka brokers use it to enforce the quota.

Conclusion

By leveraging its key features such as replication, distributed architecture, and seamless fault recovery, Kafka provides reliable and durable data streaming capabilities. As the industry continues to embrace real-time data processing, Apache Kafka stands as a proven solution for building robust and scalable data streaming architectures. Its enduring legacy is a testament to its value in enabling organizations to harness the power of real-time data.

Thanks for reading this article. It took me a lot of time to research and understand all of this.

I hope you like it too.

I will discuss its case study for different companies in the next article soon.

Happy learning! 😊

For more content like this, follow me on Medium and LinkedIn too.

To understand the basics of AWS KDS, go to my blog 👇😊

My Contact Info:

📩 Email: sayantansamanta098@gmail.com
LinkedIn:-
https://www.linkedin.com/in/sayantan-samanta/

You can appreciate my efforts on my LinkedIn post 👇😊💚

