Apache Kafka
What is Kafka? A distributed event streaming platform.
Put simply, Kafka is a distributed log.
Log (data structure)
Basics of Kafka:
* Append-only
* Write-optimized data structure
Basic concepts:
* Distributed
* Partitioned
Background: why Kafka was created
From Jay Kreps' blog post, "The Log: What every software engineer should know about real-time data's unifying abstraction":
We came up against the limits of our monolithic, centralized database and needed to start the transition. We built, deployed, and run to this day a distributed graph database, a distributed search backend, a Hadoop installation, and a first- and second-generation key-value store. The things we were building had a very simple concept at their heart: the log.
Logs are sometimes called write-ahead logs, commit logs, or transaction logs.
A log is perhaps the simplest possible storage abstraction. It is an append-only, totally ordered sequence of records ordered by time.
Records are appended to the end of the log, and reads proceed left to right. Each entry is assigned a unique, sequential log entry number.
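The abstraction above can be sketched in a few lines. This is an illustrative toy (the class name and methods are assumptions for the example), not Kafka's actual implementation:

```python
# A minimal sketch of the log abstraction: an append-only, totally
# ordered sequence of records, each assigned a sequential entry number
# (in Kafka terminology, an "offset").

class Log:
    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record to the end of the log; return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Reads proceed left to right, starting from a given offset."""
        return self._records[offset:]

log = Log()
first = log.append({"event": "user_created", "id": 1})   # offset 0
second = log.append({"event": "user_renamed", "id": 1})  # offset 1
# Replaying from offset 0 yields the records in exactly the order
# they were appended, i.e. the order in which things happened.
history = log.read(0)
```

Note that nothing is ever updated in place; the only write operation is an append at the tail, which is what makes the structure write-optimized.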
A log is not all that different from a file or a table. A file is an array of bytes, a table is an array of records, and a log is really just a kind of table or file where the records are sorted by time.
Logs have a specific purpose: they record what happened and when.
For distributed data systems this is, in many ways, the very heart of the problem.
Every programmer is familiar with another definition of logging—the unstructured error messages or trace info an application might write out to a local file using Syslog or log4j. For clarity, I will call this “application logging”.
The biggest difference is that text logs are meant to be primarily for humans to read and the “journal” or “data logs” I’m describing are built for programmatic access.
Logs in distributed systems
The two problems a log solves: ordering changes and distributing data.
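A toy illustration (my own, not from the post) of why these two problems are really one: if every replica applies the same totally ordered sequence of changes, all replicas end up in the same state, so shipping the log around *is* distributing the data.

```python
# Each log entry is a (operation, key, value) tuple. Because the log
# is totally ordered, every replica that replays it deterministically
# arrives at the same final state.

log = [
    ("set", "x", 1),
    ("set", "y", 2),
    ("set", "x", 3),  # later entries win -- ordering matters
]

def apply_log(entries):
    state = {}
    for op, key, value in entries:
        if op == "set":
            state[key] = value
    return state

replica_a = apply_log(log)
replica_b = apply_log(log)
# Both replicas converge: {"x": 3, "y": 2}
```

If the replicas saw the entries in different orders, they could disagree on the final value of "x"; the total order is what rules that out.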
Log-structured data flow
The log is the natural data structure for handling data flow between systems.
It is worth emphasizing that the log is still just the infrastructure.
We used a few tricks in Kafka to support this kind of scale:
1. Partitioning the log
2. Optimizing throughput by batching reads and writes
3. Avoiding needless data copies

To allow horizontal scaling, we chop up our log into partitions:
Each partition is a totally ordered log, but there is no global ordering between partitions (other than perhaps some wall-clock time you might include in your messages).
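Partitioning can be sketched as follows. The helper names here are assumptions for the example (Kafka's real partitioner lives in the client library); the point is that records with the same key always hash to the same partition, so each partition stays totally ordered even though there is no global order across partitions.

```python
import zlib

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key: str) -> int:
    # Use a stable hash (crc32) so the key -> partition mapping is
    # deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def append(key: str, value: str):
    """Append a record to the partition chosen by its key."""
    p = partition_for(key)
    partitions[p].append((key, value))
    return p, len(partitions[p]) - 1  # (partition, offset within it)

# All records for "user-42" land in the same partition, in append
# order -- so per-key ordering is preserved. Records for different
# keys may land in different partitions, with no ordering between them.
append("user-42", "created")
append("user-7", "created")
append("user-42", "renamed")
```

This is also why adding partitions to an existing topic is disruptive in practice: it changes the key-to-partition mapping, breaking the per-key ordering guarantee for old versus new records.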