There are a few decent resources out there for learning Kafka, but really it comes down to the Apache Documentation and Michael Knoll’s publications. While these are both excellent, I still think there could be better information out there to help developers get started. Hopefully this post can help.
Why use Apache Kafka?
There are many use cases, and some of those are discussed in Kafka’s documentation. The benefits of Kafka are many: scalability, speed, durability. That’s all great, but here’s my biggest reason for using it: it serves as a central data bus for all streaming data. This is especially important when you may not know in advance who will be producing data, and who will be consuming that data.
Kafka is nothing more than a streaming log system. Think of it as tail -f in UNIX speak. In Linux, if a process is producing some log output, it is very common to run tail -f <filename> on the file to track the log file updates as they happen. A Kafka topic is exactly that: it’s just a log file that lives in the Kafka broker ecosystem. The big difference is that instead of tailing a single file on a single server, you can consume from a topic from anywhere that has access to Kafka. That topic could also have multiple producers writing to it from many different places. For example:
You have a distributed application that lives across more than one server. This application has some output. Where should that output be written? The usual choices are stdout, a flat file, or a database. But what if you don’t have a database set up or don’t know which one to use at first? What if you need to do some additional processing on this distributed data before sending it to a database? You could write all the data out to flat files, do some processing on it, and then ingest the data into the database. But then you have to worry about managing all the data between the servers.
Enter Kafka. What if, instead of each server node writing out its data to a different place, they all wrote their data to a common Kafka topic? This is the power of Kafka. In Kafka speak, a client that writes data to a topic is called a producer. Now, if someone wants to read that data stream, they have one single place to go. The client reading the data is called a consumer.
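To make the producer and consumer roles concrete, here is a minimal sketch of the idea using a toy, in-memory log. This is not a Kafka client: ToyTopic and its methods are names I made up for illustration, and a real topic lives on the brokers rather than in a Python list.

```python
# Toy, in-memory stand-in for a Kafka topic (hypothetical, not a real client).
class ToyTopic:
    """An append-only log that many producers can write to."""

    def __init__(self):
        self._log = []

    def produce(self, message):
        """Append a message and return the offset it was stored at."""
        self._log.append(message)
        return len(self._log) - 1

    def consume(self, start_offset=0):
        """Return every message from start_offset to the end of the log."""
        return self._log[start_offset:]


topic = ToyTopic()

# Three application servers all write to the same topic.
for server in ("app-1", "app-2", "app-3"):
    topic.produce(f"{server}: request handled")

# Any consumer reads the combined stream from one single place.
messages = topic.consume()
```

The point of the sketch is only the shape of the interaction: producers append, consumers read from a position, and neither side needs to know about the other.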
Inside the Kafka Topic
Two other built-in features of Kafka are parallelism and redundancy. Kafka provides these by giving each topic a certain number of partitions and replicas.
Partitions: A single piece of a Kafka topic. The number of partitions is configurable on a per-topic basis. More partitions allow for greater parallelism when reading from the topic. The number of partitions determines how many consumers in a consumer-group can actively read at the same time. For example, if a topic has 3 partitions, you can have 3 consumers in a consumer-group balancing consumption between the partitions. In this way you have a parallelism of 3; a fourth consumer in the same group would sit idle. The right number of partitions is somewhat hard to determine until you know how fast you are producing data vs. how fast you are consuming it. If you have a topic that you know will be high volume, I would err on the side of more partitions. This also allows room for growth. Aim for between 10 and 50 partitions to start.
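Here is a rough sketch of why the partition count caps the parallelism of a consumer-group. It spreads partitions over consumers in a simple round-robin way; Kafka's real assignment strategies are more involved, and assign_partitions is a hypothetical helper, not part of any Kafka library.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition ids over consumers round-robin.

    Returns a dict mapping each consumer name to its list of partitions.
    Simplified illustration only; not Kafka's actual assignor.
    """
    if not consumers:
        raise ValueError("need at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# 3 partitions, 3 consumers: every consumer gets exactly one partition.
three = assign_partitions(3, ["c1", "c2", "c3"])

# 3 partitions, 4 consumers: one consumer has nothing to read.
four = assign_partitions(3, ["c1", "c2", "c3", "c4"])
idle = [name for name, parts in four.items() if not parts]
```

With fewer consumers than partitions, some consumers simply own more than one partition, which is why starting with extra partitions leaves room to add consumers later.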
Replicas: These are copies of the partitions. Clients never write to or read from them directly. Their only purpose is data redundancy. If your topic has a replication factor of n, then n-1 brokers can fail before there is any data loss. Additionally, you cannot have a topic with a replication factor greater than the number of brokers that you have. For example, if you have 5 Kafka brokers, you could have a topic with a maximum replication factor of 5, and 5-1=4 brokers could go down before there is any data loss.
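The replication arithmetic is small enough to write down directly. This is just the rule of thumb from the paragraph above as a function; tolerated_broker_failures is a name I chose for illustration.

```python
def tolerated_broker_failures(replication_factor, num_brokers):
    """Return how many brokers can fail before a partition loses all copies.

    Encodes two rules: the replication factor cannot exceed the number of
    brokers, and a factor of n tolerates n - 1 broker failures.
    """
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    if replication_factor > num_brokers:
        raise ValueError("replication factor cannot exceed the broker count")
    return replication_factor - 1


# 5 brokers with the maximum replication factor of 5.
max_case = tolerated_broker_failures(5, 5)

# A more typical factor of 3 on the same cluster.
typical_case = tolerated_broker_failures(3, 5)
```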
Offsets: An “offset” is just a pointer to a location in the logfile or “topic”. Each client or “consumer” has its own “consumer-group” that is used to track the offset where it is in the topic. The actual offset values are stored in a special Kafka topic called “__consumer_offsets”. Why is it called a “consumer-group” and not just a “consumer”? This is because Kafka supports balanced consuming, meaning that you can have more than one consumer in the same group sharing the work of reading a topic to increase parallelism.
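A small sketch may help show what tracking an offset per consumer-group buys you. It models a single partition as a Python list and commits the offset automatically on every poll; real Kafka tracks offsets per group, topic, and partition, and OffsetTracker is a made-up class for illustration.

```python
class OffsetTracker:
    """Toy model of per-group offsets over a single-partition log."""

    def __init__(self, log):
        self._log = log
        self._committed = {}  # group name -> next offset to read

    def poll(self, group, max_messages=1):
        """Return the next batch for a group and advance its offset."""
        start = self._committed.get(group, 0)
        batch = self._log[start:start + max_messages]
        self._committed[group] = start + len(batch)
        return batch

    def position(self, group):
        """Return the next offset this group will read from."""
        return self._committed.get(group, 0)


log = ["m0", "m1", "m2", "m3"]
tracker = OffsetTracker(log)

first = tracker.poll("analytics", max_messages=3)   # analytics reads m0..m2
second = tracker.poll("archiver", max_messages=1)   # archiver starts at m0
third = tracker.poll("analytics", max_messages=3)   # analytics resumes at m3
```

Each group moves through the same log at its own pace, and a group that comes back later picks up where it left off instead of starting over.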
Leaders and In Sync Replicas (ISRs): Once your topic has been created, you can run Kafka’s built-in tool ./kafka-topics.sh --describe --zookeeper <zookeeper-node>:2181 to describe the topics on your Kafka cluster. You might see something like this:
Topic: test.cleaned_firehose PartitionCount:3 ReplicationFactor:3 Configs:
    Topic: test.cleaned_firehose Partition: 0 Leader: 4 Replicas: 4,5,1 Isr: 1,4,5
    Topic: test.cleaned_firehose Partition: 1 Leader: 5 Replicas: 5,1,2 Isr: 1,2,5
    Topic: test.cleaned_firehose Partition: 2 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
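If you want to check this output from a script, one option is to parse it and flag any partition whose ISR list no longer matches its replica list. The sketch below assumes the text format shown in the sample above, which may differ between Kafka versions; parse_partitions and under_replicated are hypothetical helpers, not part of Kafka's tooling.

```python
import re

# Sample text copied from the describe output above.
DESCRIBE_OUTPUT = """\
Topic: test.cleaned_firehose PartitionCount:3 ReplicationFactor:3 Configs:
Topic: test.cleaned_firehose Partition: 0 Leader: 4 Replicas: 4,5,1 Isr: 1,4,5
Topic: test.cleaned_firehose Partition: 1 Leader: 5 Replicas: 5,1,2 Isr: 1,2,5
Topic: test.cleaned_firehose Partition: 2 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
"""

# Matches one per-partition entry; the summary line has no "Partition:" field.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_partitions(text):
    """Extract leader, replicas, and ISR for every partition entry."""
    partitions = []
    for match in PARTITION_LINE.finditer(text):
        partitions.append({
            "topic": match["topic"],
            "partition": int(match["partition"]),
            "leader": int(match["leader"]),
            "replicas": {int(b) for b in match["replicas"].split(",")},
            "isr": {int(b) for b in match["isr"].split(",") if b},
        })
    return partitions


def under_replicated(partitions):
    """Return ids of partitions whose ISR set differs from the replica set."""
    return [p["partition"] for p in partitions if p["isr"] != p["replicas"]]


partitions = parse_partitions(DESCRIBE_OUTPUT)
lagging = under_replicated(partitions)
```

In the sample every ISR set matches its replica set, so nothing is flagged; a broker dropping out of an ISR list would show up as its partition id in the result.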
Each partition has a broker leader, and the replicas simply “follow” the leader and duplicate the data. If a broker that is a leader goes down, Kafka will automatically elect a new broker leader by default. Note that if you have consumers consuming from a topic that temporarily loses its leader, they may need to reconnect to fetch the new metadata from the cluster.
The biggest problem I’ve encountered is with brokers randomly going down and then becoming unavailable for leader election. I haven’t gotten to the bottom of this issue but I’m hopeful that some of this stuff has been fixed in 0.9. Rebooting the Kafka broker fixes this problem most of the time.
Another common problem I encounter when using Kafka is that a broker goes down, Kafka elects a new leader, and the consumer doesn’t get the message that there is a new leader in town. This results in the dreaded NotLeaderForPartition errors. This can be solved by updating the metadata for the Kafka consumer. In the case of a Python client, it appears that the libraries I have tried, pykafka among them, cannot handle this situation on their own. Therefore, the error needs to be caught, and the consumer needs to be re-created.
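The catch-and-re-create pattern looks roughly like this. It is a sketch under assumptions: the NotLeaderForPartition class here is a local stand-in for whatever exception your client library actually raises, make_consumer is a hypothetical factory you would write around your own client, and the fake consumer at the bottom only simulates a single leader change.

```python
class NotLeaderForPartition(Exception):
    """Local stand-in for the leader-change error your client raises."""


def consume_with_recreate(make_consumer, handle, max_recreates=3):
    """Consume until the stream ends, rebuilding the consumer on leader errors.

    make_consumer: hypothetical zero-argument factory returning an iterable
    of messages. A freshly built consumer fetches fresh cluster metadata.
    Returns how many times the consumer had to be re-created.
    """
    recreates = 0
    while True:
        consumer = make_consumer()
        try:
            for message in consumer:
                handle(message)
            return recreates
        except NotLeaderForPartition:
            recreates += 1
            if recreates > max_recreates:
                raise


# Fake consumer for demonstration: fails once, then resumes where it stopped.
state = {"next": 0, "failed": False}


def fake_consumer():
    data = ["a", "b", "c"]
    while state["next"] < len(data):
        if state["next"] == 1 and not state["failed"]:
            state["failed"] = True
            raise NotLeaderForPartition("leader moved")
        message = data[state["next"]]
        state["next"] += 1
        yield message


seen = []
recreate_count = consume_with_recreate(fake_consumer, seen.append)
```

Capping the number of re-creations matters: if the leader never comes back, you want the error to surface rather than loop forever.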
Tips and Tricks
Check out kafkacat on GitHub for a nice non-JVM CLI tool for inspecting Kafka topics and for consuming from or producing to them.
Kafka is a great tool, but it is still in development: APIs are in flux, and new features are still being added. As of this writing Kafka 0.9.0 has just come out, which introduces a new consumer API (although the old one is still supported) and a security protocol. Before 0.9.0 you could only control access to Kafka via a whitelist or a VPN or firewall.
One of the most lacking areas of Kafka is any kind of built-in monitoring or “health status” support. When things go wrong, it’s very hard to figure out the root cause, and Kafka will often still be “running” while ERROR messages spew out of the logs. Some kind of built-in status check API would be very useful for monitoring the tool and figuring out what’s going on. There are some OK open source solutions out there for monitoring consumer lag, offsets, and broker status, but they aren’t sufficient to solve this problem.