kafka-commits mailing list archives

From jun...@apache.org
Subject svn commit: r1325999 - /incubator/kafka/site/design.html
Date Fri, 13 Apr 2012 22:43:04 GMT
Author: junrao
Date: Fri Apr 13 22:43:04 2012
New Revision: 1325999

URL: http://svn.apache.org/viewvc?rev=1325999&view=rev
Log:
minor update to design doc

Modified:
    incubator/kafka/site/design.html

Modified: incubator/kafka/site/design.html
URL: http://svn.apache.org/viewvc/incubator/kafka/site/design.html?rev=1325999&r1=1325998&r2=1325999&view=diff
==============================================================================
--- incubator/kafka/site/design.html (original)
+++ incubator/kafka/site/design.html Fri Apr 13 22:43:04 2012
@@ -203,7 +203,7 @@ Kafka does two unusual things with respe
 </p>
 <h3>Consumer state</h3>
 <p>
-Kafka also delegates the state of what has been consumed to the client. This provides
an easy out for some simple cases, and has a few side benefits. In the simplest cases the
consumer may simply be entering some aggregate value into a centralized, transactional OLTP
database. In this case the consumer can store the state of what is consumed in the same transaction
as the database modification. This solves a distributed consensus problem, by removing the
distributed part! A similar trick works for some non-transactional systems as well. A search
system can store its consumer state with its index segments. Though it may provide no durability
guarantees, this means that the index is always in sync with the consumer state: if an unflushed
index segment is lost in a crash, the indexes can always resume consumption from the latest
checkpointed offset. Likewise our Hadoop load job which does parallel loads from Kafka, does
a similar trick. Individual mappers write the offset of the last consumed message to HDFS
at the end of the map task. If a job fails and gets restarted,
each mapper simply restarts from the offsets stored in HDFS.
+In Kafka, the consumers are responsible for maintaining state information (the offset) on what
has been consumed. Typically, the Kafka consumer library writes its state data to Zookeeper.
However, it may be beneficial for consumers to write state data into the same datastore where
they are writing the results of their processing. For example, the consumer may simply be
entering some aggregate value into a centralized transactional OLTP database. In this case
the consumer can store the state of what is consumed in the same transaction as the database
modification. This solves a distributed consensus problem, by removing the distributed part!
A similar trick works for some non-transactional systems as well. A search system can store
its consumer state with its index segments. Though it may provide no durability guarantees,
this means that the index is always in sync with the consumer state: if an unflushed index
segment is lost in a crash, the indexes can always resume consumption from the latest
checkpointed offset. Likewise, our Hadoop load job, which does parallel loads from Kafka, does
a similar trick. Individual mappers write the offset of the last consumed
message to HDFS at the end of the map task. If a job fails and gets restarted, each mapper
simply restarts from the offsets stored in HDFS.
 </p>
 <p>
 There is a side benefit of this decision. A consumer can deliberately <i>rewind</i>
back to an old offset and re-consume data. This violates the common contract of a queue, but
turns out to be an essential feature for many consumers. For example, if the consumer code
has a bug that is discovered after some messages are consumed, the consumer can re-consume
those messages once the bug is fixed.
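
The replacement paragraph in the diff describes a consumer that stores the offset of what it
has consumed in the same transaction as the results of its processing. A minimal sketch of that
idea in Java over plain JDBC, assuming hypothetical tables page_view_counts and consumer_offsets
(neither the tables nor the class below come from Kafka itself):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransactionalOffsetSketch {
    // Commit the aggregate derived from a batch of consumed messages together with
    // the offset to resume from, in a single database transaction.
    public static void commitBatch(Connection conn, long viewCount, long nextOffset)
            throws SQLException {
        conn.setAutoCommit(false);
        try {
            // Update the aggregate computed from the consumed messages.
            PreparedStatement agg = conn.prepareStatement(
                "UPDATE page_view_counts SET views = views + ? WHERE page = ?");
            agg.setLong(1, viewCount);
            agg.setString(2, "home");
            agg.executeUpdate();

            // Record the consumer's position in the same transaction.
            PreparedStatement off = conn.prepareStatement(
                "UPDATE consumer_offsets SET next_offset = ? " +
                "WHERE topic = ? AND kafka_partition = ?");
            off.setLong(1, nextOffset);
            off.setString(2, "page_views");
            off.setInt(3, 0);
            off.executeUpdate();

            // Either both writes are committed or neither is, so the stored offset can
            // never get ahead of, or fall behind, the stored results.
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}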

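The same paragraph also mentions the Hadoop load job, in which each mapper writes the offset of
its last consumed message to HDFS at the end of the map task. A rough sketch of such a
per-partition checkpoint, using the standard Hadoop FileSystem API (the class name and path
layout are assumptions, not the actual job's code):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OffsetCheckpointSketch {
    // Overwrite the checkpoint for one topic/partition with the offset of the last
    // consumed message; called at the end of a map task.
    public static void write(Configuration conf, String topic, int partition, long offset)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/kafka-offsets/" + topic + "/" + partition);
        FSDataOutputStream out = fs.create(path, true); // true = overwrite previous checkpoint
        try {
            out.writeLong(offset);
        } finally {
            out.close();
        }
    }

    // Read the checkpoint back when a job is restarted, so the mapper can resume
    // from where the previous run left off.
    public static long read(Configuration conf, String topic, int partition)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/kafka-offsets/" + topic + "/" + partition);
        if (!fs.exists(path)) {
            return 0L; // no checkpoint yet: start from the beginning
        }
        FSDataInputStream in = fs.open(path);
        try {
            return in.readLong();
        } finally {
            in.close();
        }
    }
}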


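The closing paragraph points out that a consumer can deliberately rewind to an old offset and
re-consume data. With a consumer-owned offset store like the one sketched above, rewinding
amounts to resetting the stored position to a known-good value; the next fetch then re-reads
everything after it. A hypothetical maintenance snippet (the table and connection URL are
assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class RewindConsumerSketch {
    public static void main(String[] args) throws SQLException {
        // Offset to rewind to, e.g. the last checkpoint taken before the buggy code ran.
        long knownGoodOffset = Long.parseLong(args[0]);

        Connection conn = DriverManager.getConnection("jdbc:postgresql://dbhost/example"); // placeholder URL
        try {
            // Overwrite the consumer's stored position; on its next fetch it will
            // re-consume every message after this offset.
            PreparedStatement stmt = conn.prepareStatement(
                "UPDATE consumer_offsets SET next_offset = ? " +
                "WHERE topic = ? AND kafka_partition = ?");
            stmt.setLong(1, knownGoodOffset);
            stmt.setString(2, "page_views");
            stmt.setInt(3, 0);
            stmt.executeUpdate();
        } finally {
            conn.close();
        }
    }
}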