kafka-commits mailing list archives

From jun...@apache.org
Subject [2/2] kafka git commit: KAFKA-2809; Improve documentation linking
Date Mon, 16 Nov 2015 22:14:22 GMT
KAFKA-2809; Improve documentation linking

It is often useful to link to a specific header within the documentation, especially when referencing docs in the mailing lists.

This adds anchors and links for all headers in the docs.

Author: Grant Henke <granthenke@gmail.com>

Reviewers: Jun Rao <junrao@gmail.com>

Closes #498 from granthenke/doc-links
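
As a rough illustration of the change described in the commit message above, the bulk of the diff follows one pattern: an anchor that only had an id gains an href pointing at that same id. The sketch below is not how the patch was produced (the commit does not say); the regex and the function name add_self_links are assumptions made for illustration. It only covers headers that already had an anchor. Headers without one (for example the h4 and h5 headers in connect.html) received new, hand-chosen ids in this commit, which a mechanical rewrite cannot infer.

```python
import re

# Matches an anchor that carries only an id, e.g. <a id="producerapi">.
# Anchors that already have an href do not match, so the rewrite is idempotent.
ANCHOR_WITHOUT_HREF = re.compile(r'<a id="([^"]+)">')


def add_self_links(html: str) -> str:
    """Give every id-only anchor an href that points at its own id."""
    return ANCHOR_WITHOUT_HREF.sub(r'<a id="\1" href="#\1">', html)


before = '<h3><a id="producerapi">2.1 Producer API</a></h3>'
after = add_self_links(before)
print(after)
```

With the self-referencing href in place, the header text itself becomes a clickable link whose target can be copied into a mailing-list message.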


Project: http://git-wip-us.apache.org/repos/asf/kafka/repo
Commit: http://git-wip-us.apache.org/repos/asf/kafka/commit/6cbd9759
Tree: http://git-wip-us.apache.org/repos/asf/kafka/tree/6cbd9759
Diff: http://git-wip-us.apache.org/repos/asf/kafka/diff/6cbd9759

Branch: refs/heads/trunk
Commit: 6cbd97597ccf456a4f01f19553da5a03e12c9366
Parents: 5fc4546
Author: Grant Henke <granthenke@gmail.com>
Authored: Mon Nov 16 14:14:17 2015 -0800
Committer: Jun Rao <junrao@gmail.com>
Committed: Mon Nov 16 14:14:17 2015 -0800

----------------------------------------------------------------------
 docs/api.html            |  8 ++--
 docs/configuration.html  | 14 +++----
 docs/connect.html        | 34 +++++++--------
 docs/design.html         | 72 ++++++++++++++++----------------
 docs/documentation.html  | 16 ++++----
 docs/ecosystem.html      |  6 +--
 docs/implementation.html | 55 ++++++++++++-------------
 docs/introduction.html   | 20 ++++-----
 docs/migration.html      |  8 ++--
 docs/ops.html            | 96 +++++++++++++++++++++----------------------
 docs/quickstart.html     | 34 +++++++--------
 docs/security.html       | 40 +++++++++---------
 docs/upgrade.html        | 12 +++---
 docs/uses.html           | 20 ++++-----
 14 files changed, 217 insertions(+), 218 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kafka/blob/6cbd9759/docs/api.html
----------------------------------------------------------------------
diff --git a/docs/api.html b/docs/api.html
index 9b739da..8d79b20 100644
--- a/docs/api.html
+++ b/docs/api.html
@@ -17,7 +17,7 @@
 
 Apache Kafka includes new java clients (in the org.apache.kafka.clients package). These are meant to supplant the older Scala clients, but for compatability they will co-exist for some time. These clients are available in a seperate jar with minimal dependencies, while the old Scala clients remain packaged with the server.
 
-<h3><a id="producerapi">2.1 Producer API</a></h3>
+<h3><a id="producerapi" href="#producerapi">2.1 Producer API</a></h3>
 
 We encourage all new development to use the new Java producer. This client is production tested and generally both faster and more fully featured than the previous Scala client. You can use this client by adding a dependency on the client jar using the following example maven co-ordinates (you can change the version numbers with new releases):
 <pre>
@@ -36,7 +36,7 @@ For those interested in the legacy Scala producer api, information can be found
 here</a>.
 </p>
 
-<h3><a id="highlevelconsumerapi">2.2 High Level Consumer API</a></h3>
+<h3><a id="highlevelconsumerapi" href="#highlevelconsumerapi">2.2 High Level Consumer API</a></h3>
 <pre>
 class Consumer {
   /**
@@ -108,7 +108,7 @@ public interface kafka.javaapi.consumer.ConsumerConnector {
 </pre>
 You can follow
 <a href="https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example" title="Kafka 0.8 consumer example">this example</a> to learn how to use the high level consumer api.
-<h3><a id="simpleconsumerapi">2.3 Simple Consumer API</a></h3>
+<h3><a id="simpleconsumerapi" href="#simpleconsumerapi">2.3 Simple Consumer API</a></h3>
 <pre>
 class kafka.javaapi.consumer.SimpleConsumer {
   /**
@@ -144,7 +144,7 @@ class kafka.javaapi.consumer.SimpleConsumer {
 For most applications, the high level consumer Api is good enough. Some applications want features not exposed to the high level consumer yet (e.g., set initial offset when restarting the consumer). They can instead use our low level SimpleConsumer Api. The logic will be a bit more complicated and you can follow the example in
 <a href="https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example" title="Kafka 0.8 SimpleConsumer example">here</a>.
 
-<h3><a id="newconsumerapi">2.4 New Consumer API</a></h3>
+<h3><a id="newconsumerapi" href="#newconsumerapi">2.4 New Consumer API</a></h3>
 As of the 0.9.0 release we have added a replacement for our existing simple and high-level consumers. This client is considered beta quality. You can use this client by adding a dependency on the client jar using the following example maven co-ordinates (you can change the version numbers with new releases):
 <pre>
 	&lt;dependency&gt;

http://git-wip-us.apache.org/repos/asf/kafka/blob/6cbd9759/docs/configuration.html
----------------------------------------------------------------------
diff --git a/docs/configuration.html b/docs/configuration.html
index abaff63..2dfc757 100644
--- a/docs/configuration.html
+++ b/docs/configuration.html
@@ -17,7 +17,7 @@
 
 Kafka uses key-value pairs in the <a href="http://en.wikipedia.org/wiki/.properties">property file format</a> for configuration. These values can be supplied either from a file or programmatically.
 
-<h3><a id="brokerconfigs">3.1 Broker Configs</a></h3>
+<h3><a id="brokerconfigs" href="#brokerconfigs">3.1 Broker Configs</a></h3>
 
 The essential configurations are the following:
 <ul>
@@ -32,7 +32,7 @@ Topic-level configurations and defaults are discussed in more detail <a href="#t
 
 <p>More details about broker configuration can be found in the scala class <code>kafka.server.KafkaConfig</code>.</p>
 
-<a id="topic-config">Topic-level configuration</a>
+<a id="topic-config" href="#topic-config">Topic-level configuration</a>
 
 Configurations pertinent to topics have both a global default as well an optional per-topic override. If no per-topic configuration is given the global default is used. The override can be set at topic creation time by giving one or more <code>--config</code> options. This example creates a topic named <i>my-topic</i> with a custom max message size and flush rate:
 <pre>
@@ -147,7 +147,7 @@ The following are the topic-level configurations. The server's default configura
     </tr>
 </table>
 
-<h3><a id="producerconfigs">3.2 Producer Configs</a></h3>
+<h3><a id="producerconfigs" href="#producerconfigs">3.2 Producer Configs</a></h3>
 
 Below is the configuration of the Java producer:
 <!--#include virtual="producer_config.html" -->
@@ -157,7 +157,7 @@ Below is the configuration of the Java producer:
     here</a>.
 </p>
 
-<h3><a id="consumerconfigs">3.3 Consumer Configs</a></h3>
+<h3><a id="consumerconfigs" href="#consumerconfigs">3.3 Consumer Configs</a></h3>
 The essential consumer configurations are the following:
 <ul>
         <li><code>group.id</code>
@@ -327,9 +327,9 @@ The essential consumer configurations are the following:
 
 <p>More details about consumer configuration can be found in the scala class <code>kafka.consumer.ConsumerConfig</code>.</p>
 
-<h3><a id="newconsumerconfigs">3.4 New Consumer Configs</a></h3>
+<h3><a id="newconsumerconfigs" href="#newconsumerconfigs">3.4 New Consumer Configs</a></h3>
 Since 0.9.0.0 we have been working on a replacement for our existing simple and high-level consumers. The code can be considered beta quality. Below is the configuration for the new consumer:
 <!--#include virtual="consumer_config.html" -->
 
-<h3><a id="connectconfigs">3.5 Kafka Connect Configs</a></h3>
-<!--#include virtual="connect_config.html" -->
\ No newline at end of file
+<h3><a id="connectconfigs" href="#connectconfigs">3.5 Kafka Connect Configs</a></h3>
+<!--#include virtual="connect_config.html" -->

http://git-wip-us.apache.org/repos/asf/kafka/blob/6cbd9759/docs/connect.html
----------------------------------------------------------------------
diff --git a/docs/connect.html b/docs/connect.html
index 8791ab0..0a1a867 100644
--- a/docs/connect.html
+++ b/docs/connect.html
@@ -15,7 +15,7 @@
   ~ limitations under the License.
   ~-->
 
-<h3><a id="connect_overview">8.1 Overview</a></h3>
+<h3><a id="connect_overview" href="#connect_overview">8.1 Overview</a></h3>
 
 Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define <i>connectors</i> that move large collections of data into and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export job can deliver data from Kafka topics into secondary storage and query systems or into batch systems for offline analysis.
 
@@ -29,11 +29,11 @@ Kafka Connect features include:
     <li><b>Streaming/batch integration</b> - leveraging Kafka's existing capabilities, Kafka Connect is an ideal solution for bridging streaming and batch data systems</li>
 </ul>
 
-<h3><a id="connect_user">8.2 User Guide</a></h3>
+<h3><a id="connect_user" href="#connect_user">8.2 User Guide</a></h3>
 
 The quickstart provides a brief example of how to run a standalone version of Kafka Connect. This section describes how to configure, run, and manage Kafka Connect in more detail.
 
-<h4>Running Kafka Connect</h4>
+<h4><a id="connect_running" href="#connect_running">Running Kafka Connect</a></h4>
 
 Kafka Connect currently supports two modes of execution: standalone (single process) and distributed.
 
@@ -64,7 +64,7 @@ The difference is in the class which is started and the configuration parameters
 Note that in distributed mode the connector configurations are not passed on the command line. Instead, use the REST API described below to create, modify, and destroy connectors.
 
 
-<h4>Configuring Connectors</h4>
+<h4><a id="connect_configuring" href="#connect_configuring">Configuring Connectors</a></h4>
 
 Connector configurations are simple key-value mappings. For standalone mode these are defined in a properties file and passed to the Connect process on the command line. In distributed mode, they will be included in the JSON payload for the request that creates (or modifies) the connector.
 
@@ -84,7 +84,7 @@ Sink connectors also have one additional option to control their input:
 For any other options, you should consult the documentation for the connector.
 
 
-<h4>REST API</h4>
+<h4><a id="connect_rest" href="#connect_rest">REST API</a></h4>
 
 Since Kafka Connect is intended to be run as a service, it also supports a REST API for managing connectors. By default this service runs on port 8083. The following are the currently supported endpoints:
 
@@ -98,13 +98,13 @@ Since Kafka Connect is intended to be run as a service, it also supports a REST
     <li><code>DELETE /connectors/{name}</code> - delete a connector, halting all tasks and deleting its configuration</li>
 </ul>
 
-<h3><a id="connect_development">8.3 Connector Development Guide</a></h3>
+<h3><a id="connect_development" href="#connect_development">8.3 Connector Development Guide</a></h3>
 
 This guide describes how developers can write new connectors for Kafka Connect to move data between Kafka and other systems. It briefly reviews a few key concepts and then describes how to create a simple connector.
 
-<h4>Core Concepts and APIs</h4>
+<h4><a id="connect_concepts" href="#connect_concepts">Core Concepts and APIs</a></h4>
 
-<h5>Connectors and Tasks</h5>
+<h5><a id="connect_connectorsandtasks" href="#connect_connectorsandtasks">Connectors and Tasks</a></h5>
 
 To copy data between Kafka and another system, users create a <code>Connector</code> for the system they want to pull data from or push data to. Connectors come in two flavors: <code>SourceConnectors</code> import data from another system (e.g. <code>JDBCSourceConnector</code> would import a relational database into Kafka) and <code>SinkConnectors</code> export data (e.g. <code>HDFSSinkConnector</code> would export the contents of a Kafka topic to an HDFS file).
 
@@ -113,24 +113,24 @@ To copy data between Kafka and another system, users create a <code>Connector</c
 With an assignment in hand, each <code>Task</code> must copy its subset of the data to or from Kafka. In Kafka Connect, it should always be possible to frame these assignments as a set of input and output streams consisting of records with consistent schemas. Sometimes this mapping is obvious: each file in a set of log files can be considered a stream with each parsed line forming a record using the same schema and offsets stored as byte offsets in the file. In other cases it may require more effort to map to this model: a JDBC connector can map each table to a stream, but the offset is less clear. One possible mapping uses a timestamp column to generate queries incrementally returning new data, and the last queried timestamp can be used as the offset.
 
 
-<h5>Streams and Records</h5>
+<h5><a id="connect_streamsandrecords" href="#connect_streamsandrecords">Streams and Records</a></h5>
 
 Each stream should be a sequence of key-value records. Both the keys and values can have complex structure -- many primitive types are provided, but arrays, objects, and nested data structures can be represented as well. The runtime data format does not assume any particular serialization format; this conversion is handled internally by the framework.
 
 In addition to the key and value, records (both those generated by sources and those delivered to sinks) have associated stream IDs and offsets. These are used by the framework to periodically commit the offsets of data that have been processed so that in the event of failures, processing can resume from the last committed offsets, avoiding unnecessary reprocessing and duplication of events.
 
-<h5>Dynamic Connectors</h5>
+<h5><a id="connect_dynamicconnectors" href="#connect_dynamicconnectors">Dynamic Connectors</a></h5>
 
 Not all jobs are static, so <code>Connector</code> implementations are also responsible for monitoring the external system for any changes that might require reconfiguration. For example, in the <code>JDBCSourceConnector</code> example, the <code>Connector</code> might assign a set of tables to each <code>Task</code>. When a new table is created, it must discover this so it can assign the new table to one of the <code>Tasks</code> by updating its configuration. When it notices a change that requires reconfiguration (or a change in the number of <code>Tasks</code>), it notifies the framework and the framework updates anycorresponding <code>Tasks</code>.
 
 
-<h4>Developing a Simple Connector</h4>
+<h4><a id="connect_developing" href="#connect_developing">Developing a Simple Connector</a></h4>
 
 Developing a connector only requires implementing two interfaces, the <code>Connector</code> and <code>Task</code>. A simple example is included with the source code for Kafka in the <code>file</code> package. This connector is meant for use in standalone mode and has implementations of a <code>SourceConnector</code>/<code>SourceTask</code> to read each line of a file and emit it as a record and a <code>SinkConnector</code>/<code>SinkTask</code> that writes each record to a file.
 
 The rest of this section will walk through some code to demonstrate the key steps in creating a connector, but developers should also refer to the full example source code as many details are omitted for brevity.
 
-<h5>Connector Example</h5>
+<h5><a id="connect_connectorexample" href="#connect_connectorexample">Connector Example</a></h5>
 
 We'll cover the <code>SourceConnector</code> as a simple example. <code>SinkConnector</code> implementations are very similar. Start by creating the class that inherits from <code>SourceConnector</code> and add a couple of fields that will store parsed configuration information (the filename to read from and the topic to send data to):
 
@@ -187,7 +187,7 @@ Even with multiple tasks, this method implementation is usually pretty simple. I
 
 Note that this simple example does not include dynamic input. See the discussion in the next section for how to trigger updates to task configs.
 
-<h5>Task Example - Source Task</h5>
+<h5><a id="connect_taskexample" href="#connect_taskexample">Task Example - Source Task</a></h5>
 
 Next we'll describe the implementation of the corresponding <code>SourceTask</code>. The implementation is short, but too long to cover completely in this guide. We'll use pseudo-code to describe most of the implementation, but you can refer to the source code for the full example.
 
@@ -244,7 +244,7 @@ Again, we've omitted some details, but we can see the important steps: the <code
 
 Note that this implementation uses the normal Java <code>InputStream</code>interface and may sleep if data is not avaiable. This is acceptable because Kafka Connect provides each task with a dedicated thread. While task implementations have to conform to the basic <code>poll()</code>interface, they have a lot of flexibility in how they are implemented. In this case, an NIO-based implementation would be more efficient, but this simple approach works, is quick to implement, and is compatible with older versions of Java.
 
-<h5>Sink Tasks</h5>
+<h5><a id="connect_sinktasks" href="#connect_sinktasks">Sink Tasks</a></h5>
 
 The previous section described how to implement a simple <code>SourceTask</code>. Unlike <code>SourceConnector</code>and <code>SinkConnector</code>, <code>SourceTask</code>and <code>SinkTask</code>have very different interfaces because <code>SourceTask</code>uses a pull interface and <code>SinkTask</code>uses a push interface. Both share the common lifecycle methods, but the <code>SinkTask</code>interface is quite different:
 
@@ -263,7 +263,7 @@ The <code>flush()</code>method is used during the offset commit process, which a
 delivery. For example, an HDFS connector could do this and use atomic move operations to make sure the <code>flush()</code>operation atomically commits the data and offsets to a final location in HDFS.
 
 
-<h5>Resuming from Previous Offsets</h5>
+<h5><a id="connect_resuming" href="#connect_resuming">Resuming from Previous Offsets</a></h5>
 
 The <code>SourceTask</code>implementation included a stream ID (the input filename) and offset (position in the file) with each record. The framework uses this to commit offsets periodically so that in the case of a failure, the task can recover and minimize the number of events that are reprocessed and possibly duplicated (or to resume from the most recent offset if Kafka Connect was stopped gracefully, e.g. in standalone mode or due to a job reconfiguration). This commit process is completely automated by the framework, but only the connector knows how to seek back to the right position in the input stream to resume from that location.
 
@@ -281,7 +281,7 @@ To correctly resume upon startup, the task can use the <code>SourceContext</code
 
 Of course, you might need to read many keys for each of the input streams. The <code>OffsetStorageReader</code> interface also allows you to issue bulk reads to efficiently load all offsets, then apply them by seeking each input stream to the appropriate position.
 
-<h4>Dynamic Input/Output Streams</h4>
+<h4><a id="connect_dynamicio" href="#connect_dynamicio">Dynamic Input/Output Streams</a></h4>
 
 Kafka Connect is intended to define bulk data copying jobs, such as copying an entire database rather than creating many jobs to copy each table individually. One consequence of this design is that the set of input or output streams for a connector can vary over time.
 
@@ -299,7 +299,7 @@ Ideally this code for monitoring changes would be isolated to the <code>Connecto
 
 <code>SinkConnectors</code> usually only have to handle the addition of streams, which may translate to new entries in their outputs (e.g., a new database table). The framework manages any changes to the Kafka input, such as when the set of input topics changes because of a regex subscription. <code>SinkTasks</code>should expect new input streams, which may require creating new resources in the downstream system, such as a new table in a database. The trickiest situation to handle in these cases may be conflicts between multiple <code>SinkTasks</code>seeing a new input stream for the first time and simultaneoulsy trying to create the new resource. <code>SinkConnectors</code>, on the other hand, will generally require no special code for handling a dynamic set of streams.
 
-<h4>Working with Schemas</h4>
+<h4><a id="connect_schemas" href="#connect_schemas">Working with Schemas</a></h4>
 
 The FileStream connectors are good examples because they are simple, but they also have trivially structured data -- each line is just a string. Almost all practical connectors will need schemas with more complex data formats.
 

http://git-wip-us.apache.org/repos/asf/kafka/blob/6cbd9759/docs/design.html
----------------------------------------------------------------------
diff --git a/docs/design.html b/docs/design.html
index 347f602..5d3090c 100644
--- a/docs/design.html
+++ b/docs/design.html
@@ -5,9 +5,9 @@
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at
- 
+
     http://www.apache.org/licenses/LICENSE-2.0
- 
+
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -15,7 +15,7 @@
  limitations under the License.
 -->
 
-<h3><a id="majordesignelements">4.1 Motivation</a></h3>
+<h3><a id="majordesignelements" href="#majordesignelements">4.1 Motivation</a></h3>
 <p>
 We designed Kafka to be able to act as a unified platform for handling all the real-time data feeds <a href="#introduction">a large company might have</a>. To do this we had to think through a fairly broad set of use cases.
 <p>
@@ -31,8 +31,8 @@ Finally in cases where the stream is fed into other data systems for serving we
 <p>
 Supporting these uses led use to a design with a number of unique elements, more akin to a database log then a traditional messaging system. We will outline some elements of the design in the following sections.
 
-<h3><a id="persistence">4.2 Persistence</a></h3>
-<h4>Don't fear the filesystem!</h4>
+<h3><a id="persistence" href="#persistence">4.2 Persistence</a></h3>
+<h4><a id="design_filesystem" href="#design_filesystem">Don't fear the filesystem!</a></h4>
 <p>
 Kafka relies heavily on the filesystem for storing and caching messages. There is a general perception that "disks are slow" which makes people skeptical that a persistent structure can offer competitive performance. In fact disks are both much slower and much faster than people expect depending on how they are used; and a properly designed disk structure can often be as fast as the network.
 <p>
@@ -52,7 +52,7 @@ This suggests a design which is very simple: rather than maintain as much as pos
 <p>
 This style of pagecache-centric design is described in an <a href="http://varnish.projects.linpro.no/wiki/ArchitectNotes">article</a> on the design of Varnish here (along with a healthy dose of arrogance).
 
-<h4>Constant Time Suffices</h4>
+<h4><a id="design_constanttime" href="#design_constanttime">Constant Time Suffices</a></h4>
 <p>
 The persistent data structure used in messaging systems are often a per-consumer queue with an associated BTree or other general-purpose random access data structures to maintain metadata about messages. BTrees are the most versatile data structure available, and make it possible to support a wide variety of transactional and non-transactional semantics in the messaging system. They do come with a fairly high cost, though: Btree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data increases with fixed cache--i.e. doubling your data makes things much worse then twice as slow.
 <p>
@@ -60,7 +60,7 @@ Intuitively a persistent queue could be built on simple reads and appends to fil
 <p>
 Having access to virtually unlimited disk space without any performance penalty means that we can provide some features not usually found in a messaging system. For example, in Kafka, instead of attempting to deleting messages as soon as they are consumed, we can retain messages for a relative long period (say a week). This leads to a great deal of flexibility for consumers, as we will describe.
 
-<h3><a id="maximizingefficiency">4.3 Efficiency</a></h3>
+<h3><a id="maximizingefficiency" href="#maximizingefficiency">4.3 Efficiency</a></h3>
 <p>
 We have put significant effort into efficiency. One of our primary use cases is handling web activity data, which is very high volume: each page view may generate dozens of writes. Furthermore we assume each message published is read by at least one consumer (often many), hence we strive to make consumption as cheap as possible.
 <p>
@@ -74,7 +74,7 @@ To avoid this, our protocol is built around a "message set" abstraction that nat
 <p>
 This simple optimization produces orders of magnitude speed up. Batching leads to larger network packets, larger sequential disk operations, contiguous memory blocks, and so on, all of which allows Kafka to turn a bursty stream of random message writes into linear writes that flow to the consumers.
 <p>
-The other inefficiency is in byte copying. At low message rates this is not an issue, but under load the impact is significant. To avoid this we employ a standardized binary message format that is shared by the producer, the broker, and the consumer (so data chunks can be transferred without modification between them). 
+The other inefficiency is in byte copying. At low message rates this is not an issue, but under load the impact is significant. To avoid this we employ a standardized binary message format that is shared by the producer, the broker, and the consumer (so data chunks can be transferred without modification between them).
 <p>
 The message log maintained by the broker is itself just a directory of files, each populated by a sequence of message sets that have been written to disk in the same format used by the producer and consumer. Maintaining this common format allows optimization of the most important operation: network transfer of persistent log chunks. Modern unix operating systems offer a highly optimized code path for transferring data out of pagecache to a socket; in Linux this is done with the <a href="http://man7.org/linux/man-pages/man2/sendfile.2.html">sendfile system call</a>.
 <p>
@@ -94,7 +94,7 @@ This combination of pagecache and sendfile means that on a Kafka cluster where t
 <p>
 For more background on the sendfile and zero-copy support in Java, see this <a href="http://www.ibm.com/developerworks/linux/library/j-zerocopy">article</a>.
 
-<h4>End-to-end Batch Compression</h4>
+<h4><a id="design_compression" href="#design_compression">End-to-end Batch Compression</a></h4>
 <p>
 In some cases the bottleneck is actually not CPU or disk but network bandwidth. This is particularly true for a data pipeline that needs to send messages between data centers over a wide-area network. Of course the user can always compress its messages one at a time without any support needed from Kafka, but this can lead to very poor compression ratios as much of the redundancy is due to repetition between messages of the same type (e.g. field names in JSON or user agents in web logs or common string values). Efficient compression requires compressing multiple messages together rather than compressing each message individually.
 <p>
@@ -102,25 +102,25 @@ Kafka supports this by allowing recursive message sets. A batch of messages can
 <p>
 Kafka supports GZIP and Snappy compression protocols. More details on compression can be found <a href="https://cwiki.apache.org/confluence/display/KAFKA/Compression">here</a>.
 
-<h3><a id="theproducer">4.4 The Producer</a></h3>
+<h3><a id="theproducer" href="#theproducer">4.4 The Producer</a></h3>
 
-<h4>Load balancing</h4>
+<h4><a id="design_loadbalancing" href="#design_loadbalancing">Load balancing</a></h4>
 <p>
 The producer sends data directly to the broker that is the leader for the partition without any intervening routing tier. To help the producer do this all Kafka nodes can answer a request for metadata about which servers are alive and where the leaders for the partitions of a topic are at any given time to allow the producer to appropriate direct its requests.
 <p>
 The client controls which partition it publishes messages to. This can be done at random, implementing a kind of random load balancing, or it can be done by some semantic partitioning function. We expose the interface for semantic partitioning by allowing the user to specify a key to partition by and using this to hash to a partition (there is also an option to override the partition function if need be). For example if the key chosen was a user id then all data for a given user would be sent to the same partition. This in turn will allow consumers to make locality assumptions about their consumption. This style of partitioning is explicitly designed to allow locality-sensitive processing in consumers.
 
-<h4>Asynchronous send</h4>
+<h4><a id="design_asyncsend" href="#design_asyncsend">Asynchronous send</a></h4>
 <p>
 Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer will attempt to accumulate data in memory and to send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of messages and to wait no longer than some fixed latency bound (say 64k or 10 ms). This allows the accumulation of more bytes to send, and few larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput.
 <p>
 Details on <a href="#newproducerconfigs">configuration</a> and <a href="http://kafka.apache.org/082/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html">api</a> for the producer can be found elsewhere in the documentation.
 
-<h3><a id="theconsumer">4.5 The Consumer</a></h3>
+<h3><a id="theconsumer" href="#theconsumer">4.5 The Consumer</a></h3>
 
 The Kafka consumer works by issuing "fetch" requests to the brokers leading the partitions it wants to consume. The consumer specifies its offset in the log with each request and receives back a chunk of log beginning from that position. The consumer thus has significant control over this position and can rewind it to re-consume data if need be.
 
-<h4>Push vs. pull</h4>
+<h4><a id="design_pull" href="#design_pull">Push vs. pull</a></h4>
 <p>
 An initial question we considered is whether consumers should pull data from brokers or brokers should push data to the consumer. In this respect Kafka follows a more traditional design, shared by most messaging systems, where data is pushed to the broker from the producer and pulled from the broker by the consumer. Some logging-centric systems, such as <a href="http://github.com/facebook/scribe">Scribe</a> and <a href="http://flume.apache.org/">Apache Flume</a> follow a very different push based path where  data is pushed downstream. There are pros and cons to both approaches. However a push-based system has difficulty dealing with diverse consumers as the broker controls the rate at which data is transferred. The goal is generally for the consumer to be able to consume at the maximum possible rate; unfortunately in a push system this means the consumer tends to be overwhelmed when its rate of consumption falls below the rate of production (a denial of service attack, in essence). A pull-based system has the nicer property that the consumer simply falls behind and catches up when it can. This can be mitigated with some kind of backoff protocol by which the consumer can indicate it is overwhelmed, but getting the rate of transfer to fully utilize (but never over-utilize) the consumer is trickier than it seems. Previous attempts at building systems in this fashion led us to go with a more traditional pull model.
 <p>
@@ -130,7 +130,7 @@ The deficiency of a naive pull-based system is that if the broker has no data th
 <p>
 You could imagine other possible designs which would be only pull, end-to-end. The producer would locally write to a local log, and brokers would pull from that with consumers pulling from them. A similar type of "store-and-forward" producer is often proposed. This is intriguing but we felt not very suitable for our target use cases which have thousands of producers. Our experience running persistent data systems at scale led us to feel that involving thousands of disks in the system across many applications would not actually make things more reliable and would be a nightmare to operate. And in practice we have found that we can run a pipeline with strong SLAs at large scale without a need for producer persistence.
 
-<h4>Consumer Position</h4>
+<h4><a id="design_consumerposition" href="#design_consumerposition">Consumer Position</a></h4>
 Keeping track of <i>what</i> has been consumed, is, surprisingly, one of the key performance points of a messaging system.
 <p>
 Most messaging systems keep metadata about what messages have been consumed on the broker. That is, as a message is handed out to a consumer, the broker either records that fact locally immediately or it may wait for acknowledgement from the consumer. This is a fairly intuitive choice, and indeed for a single machine server it is not clear where else this state could go. Since the data structures used for storage in many messaging systems scale poorly, this is also a pragmatic choice--since the broker knows what is consumed it can immediately delete it, keeping the data size small.
@@ -141,13 +141,13 @@ Kafka handles this differently. Our topic is divided into a set of totally order
 <p>
 There is a side benefit of this decision. A consumer can deliberately <i>rewind</i> back to an old offset and re-consume data. This violates the common contract of a queue, but turns out to be an essential feature for many consumers. For example, if the consumer code has a bug and is discovered after some messages are consumed, the consumer can re-consume those messages once the bug is fixed.
 
-<h4>Offline Data Load</h4>
+<h4><a id="design_offlineload" href="#design_offlineload">Offline Data Load</a></h4>
 
 Scalable persistence allows for the possibility of consumers that only periodically consume such as batch data loads that periodically bulk-load data into an offline system such as Hadoop or a relational data warehouse.
 <p>
 In the case of Hadoop we parallelize the data load by splitting the load over individual map tasks, one for each node/topic/partition combination, allowing full parallelism in the loading. Hadoop provides the task management, and tasks which fail can restart without danger of duplicate data&mdash;they simply restart from their original position.
 
-<h3><a id="semantics">4.6 Message Delivery Semantics</a></h3>
+<h3><a id="semantics" href="#semantics">4.6 Message Delivery Semantics</a></h3>
 <p>
 Now that we understand a little about how producers and consumers work, let's discuss the semantic guarantees Kafka provides between producer and consumer. Clearly there are multiple possible message delivery guarantees that could be provided:
 <ul>
@@ -160,7 +160,7 @@ Now that we understand a little about how producers and consumers work, let's di
   <li>
     <i>Exactly once</i>&mdash;this is what people actually want, each message is delivered once and only once.
   </li>
-</ul>	
+</ul>
 
 It's worth noting that this breaks down into two problems: the durability guarantees for publishing a message and the guarantees when consuming a message.
 <p>
@@ -181,7 +181,7 @@ Now let's describe the semantics from the point-of-view of the consumer. All rep
 <p>
 So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at most once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
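The difference between committing the offset before or after processing can be sketched with a small simulation. This is an illustration only, not Kafka client code: the function `run`, the in-memory `log` list, and the simulated crash are all hypothetical names introduced here.

```python
def run(log, commit_first, crash_offset):
    """Consume `log` from the last committed offset, crashing once at `crash_offset`.

    Returns the list of messages that were actually processed.
    """
    processed = []
    committed = 0          # offset the consumer resumes from after a restart
    crashed = False
    while committed < len(log):
        offset = committed
        if commit_first:
            # At-most-once: save the position, then process.
            committed = offset + 1
            if offset == crash_offset and not crashed:
                crashed = True
                continue   # crash before processing: the message is skipped
            processed.append(log[offset])
        else:
            # At-least-once: process, then save the position.
            processed.append(log[offset])
            if offset == crash_offset and not crashed:
                crashed = True
                continue   # crash before commit: the message is processed again
            committed = offset + 1
    return processed


log = ["a", "b", "c"]
at_most_once = run(log, commit_first=True, crash_offset=1)
at_least_once = run(log, commit_first=False, crash_offset=1)
```

With a crash at offset 1, the commit-first run drops message "b", while the process-first run handles it twice.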
 
-<h3><a id="replication">4.7 Replication</a></h3>
+<h3><a id="replication" href="#replication">4.7 Replication</a></h3>
 <p>
 Kafka replicates the log for each topic's partitions across a configurable number of servers (you can set this replication factor on a topic-by-topic basis). This allows automatic failover to these replicas when a server in the cluster fails so messages remain available in the presence of failures.
 <p>
@@ -206,7 +206,7 @@ The guarantee that Kafka offers is that a committed message will not be lost, as
 <p>
 Kafka will remain available in the presence of node failures after a short fail-over period, but may not remain available in the presence of network partitions.
 
-<h4>Replicated Logs: Quorums, ISRs, and State Machines (Oh my!)</h4>
+<h4><a id="design_replicatedlog" href="#design_replicatedlog">Replicated Logs: Quorums, ISRs, and State Machines (Oh my!)</a></h4>
 
 At its heart a Kafka partition is a replicated log. The replicated log is one of the most basic primitives in distributed data systems, and there are many approaches for implementing one. A replicated log can be used by other systems as a primitive for implementing other distributed systems in the <a href="http://en.wikipedia.org/wiki/State_machine_replication">state-machine style</a>.
 <p>
@@ -230,7 +230,7 @@ For most use cases we hope to handle, we think this tradeoff is a reasonable one
 <p>
 Another important design distinction is that Kafka does not require that crashed nodes recover with all their data intact. It is not uncommon for replication algorithms in this space to depend on the existence of "stable storage" that cannot be lost in any failure-recovery scenario without potential consistency violations. There are two primary problems with this assumption. First, disk errors are the most common problem we observe in real operation of persistent data systems and they often do not leave data intact. Secondly, even if this were not a problem, we do not want to require the use of fsync on every write for our consistency guarantees as this can reduce performance by two to three orders of magnitude. Our protocol for allowing a replica to rejoin the ISR ensures that before rejoining, it must fully re-sync again even if it lost unflushed data in its crash.
 
-<h4>Unclean leader election: What if they all die?</h4>
+<h4><a id="design_uncleanleader" href="#design_uncleanleader">Unclean leader election: What if they all die?</a></h4>
 
 Note that Kafka's guarantee with respect to data loss is predicated on at least one replica remaining in sync. If all the nodes replicating a partition die, this guarantee no longer holds.
 <p>
@@ -245,10 +245,10 @@ This is a simple tradeoff between availability and consistency. If we wait for r
 This dilemma is not specific to Kafka. It exists in any quorum-based scheme. For example in a majority voting scheme, if a majority of servers suffer a permanent failure, then you must either choose to lose 100% of your data or violate consistency by taking what remains on an existing server as your new source of truth.
 
 
-<h4>Availability and Durability Guarantees</h4>
+<h4><a id="design_ha" href="#design_ha">Availability and Durability Guarantees</a></h4>
 
 When writing to Kafka, producers can choose whether they wait for the message to be acknowledged by 0,1 or all (-1) replicas.
-Note that "acknowledgement by all replicas" does not guarantee that the full set of assigned replicas have received the message. By default, when request.required.acks=-1, acknowledgement happens as soon as all the current in-sync replicas have received the message. For example, if a topic is configured with only two replicas and one fails (i.e., only one in sync replica remains), then writes that specify request.required.acks=-1 will succeed. However, these writes could be lost if the remaining replica also fails. 
+Note that "acknowledgement by all replicas" does not guarantee that the full set of assigned replicas have received the message. By default, when request.required.acks=-1, acknowledgement happens as soon as all the current in-sync replicas have received the message. For example, if a topic is configured with only two replicas and one fails (i.e., only one in sync replica remains), then writes that specify request.required.acks=-1 will succeed. However, these writes could be lost if the remaining replica also fails.
 
 Although this ensures maximum availability of the partition, this behavior may be undesirable to some users who prefer durability over availability. Therefore, we provide two topic-level configurations that can be used to prefer message durability over availability:
 <ol>
@@ -258,13 +258,13 @@ This setting offers a trade-off between consistency and availability. A higher s
 </ol>
 
 
-<h4>Replica Management</h4>
+<h4><a id="design_replicamanagment" href="#design_replicamanagment">Replica Management</a></h4>
 
 The above discussion on replicated logs really covers only a single log, i.e. one topic partition. However a Kafka cluster will manage hundreds or thousands of these partitions. We attempt to balance partitions within a cluster in a round-robin fashion to avoid clustering all partitions for high-volume topics on a small number of nodes. Likewise we try to balance leadership so that each node is the leader for a proportional share of its partitions.
 <p>
 It is also important to optimize the leadership election process as that is the critical window of unavailability. A naive implementation of leader election would end up running an election per partition for all partitions a node hosted when that node failed. Instead, we elect one of the brokers as the "controller". This controller detects failures at the broker level and is responsible for changing the leader of all affected partitions in a failed broker. The result is that we are able to batch together many of the required leadership change notifications which makes the election process far cheaper and faster for a large number of partitions. If the controller fails, one of the surviving brokers will become the new controller.
 
-<h3><a id="compaction">4.8 Log Compaction</a></h3>
+<h3><a id="compaction" href="#compaction">4.8 Log Compaction</a></h3>
 
 Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition.  It addresses use cases and scenarios such as restoring state after application crashes or system failure, or reloading caches after application restarts during operational maintenance. Let's dive into these use cases in more detail and then describe how compaction works.
 <p>
@@ -299,10 +299,10 @@ The general idea is quite simple. If we had infinite log retention, and we logge
 Log compaction is a mechanism to give finer-grained per-record retention, rather than the coarser-grained time-based retention. The idea is to selectively remove records where we have a more recent update with the same primary key. This way the log is guaranteed to have at least the last state for each key.
 <p>
 This retention policy can be set per-topic, so a single cluster can have some topics where retention is enforced by size or time and other topics where retention is enforced by compaction.
-<p> 
+<p>
 This functionality is inspired by one of LinkedIn's oldest and most successful pieces of infrastructure&mdash;a database changelog caching service called <a href="https://github.com/linkedin/databus">Databus</a>. Unlike most log-structured storage systems Kafka is built for subscription and organizes data for fast linear reads and writes. Unlike Databus, Kafka acts as a source-of-truth store so it is useful even in situations where the upstream data source would not otherwise be replayable.
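The per-key retention rule described in this section (drop a record when a more recent record with the same key exists) can be sketched as follows. This is a simplified in-memory model, not the log cleaner itself; `compact` and the `(key, value)` tuple layout are assumptions made for illustration.

```python
def compact(log):
    """Keep only the most recent record for each key.

    `log` is a list of (key, value) pairs whose list index plays the role of
    the offset. Surviving records keep their original offsets and order.
    """
    last_offset = {}
    for offset, (key, _value) in enumerate(log):
        last_offset[key] = offset          # later records overwrite earlier ones
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if last_offset[key] == offset
    ]


log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", "e")]
compacted = compact(log)
```

Offsets are not renumbered, so a consumer position stays meaningful after compaction.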
 
-<h4>Log Compaction Basics</h4>
+<h4><a id="design_compactionbasics" href="#design_compactionbasics">Log Compaction Basics</a></h4>
 
 Here is a high-level picture that shows the logical structure of a Kafka log with the offset for each message.
 <p>
@@ -316,7 +316,7 @@ The compaction is done in the background by periodically recopying log segments.
 <p>
 <img src="images/log_compaction.png">
 <p>
-<h4>What guarantees does log compaction provide?</h4>
+<h4><a id="design_compactionguarantees" href="#design_compactionguarantees">What guarantees does log compaction provide?</a></h4>
 
 Log compaction guarantees the following:
 <ol>
@@ -327,7 +327,7 @@ Log compaction guarantees the following:
 <li>Any consumer progressing from the start of the log, will see at least the <em>final</em> state of all records in the order they were written.  All delete markers for deleted records will be seen provided the consumer reaches the head of the log in a time period less than the topic's <code>delete.retention.ms</code> setting (the default is 24 hours).  This is important as delete marker removal happens concurrently with read, and thus it is important that we do not remove any delete marker prior to the consumer seeing it.
 </ol>
 
-<h4>Log Compaction Details</h4>
+<h4><a id="design_compactiondetails" href="#design_compactiondetails">Log Compaction Details</a></h4>
 
 Log compaction is handled by the log cleaner, a pool of background threads that recopy log segment files, removing records whose key appears in the head of the log. Each compactor thread works as follows:
 <ol>
@@ -337,7 +337,7 @@ Log compaction is handled by the log cleaner, a pool of background threads that
 <li>The summary of the log head is essentially just a space-compact hash table. It uses exactly 24 bytes per entry. As a result with 8GB of cleaner buffer one cleaner iteration can clean around 366GB of log head (assuming 1k messages).
 </ol>
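The 366GB figure in the last item can be reproduced with a short calculation. The unit interpretation is an assumption: the sketch reads "8GB" as 8 * 1024^3 bytes, "1k" as 1024-byte messages, and reports the result in decimal gigabytes.

```python
BYTES_PER_ENTRY = 24                # size of one hash table entry, from the text
buffer_bytes = 8 * 1024**3          # assumed meaning of "8GB of cleaner buffer"
message_bytes = 1024                # assumed meaning of "1k messages"

# Each entry tracks one message key, so the buffer bounds the message count.
entries = buffer_bytes // BYTES_PER_ENTRY
cleanable_bytes = entries * message_bytes
cleanable_gb = cleanable_bytes / 10**9
```

Under these assumptions the buffer holds about 358 million entries, which corresponds to roughly 366 GB of log head per cleaner iteration.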
 <p>
-<h4>Configuring The Log Cleaner</h4>
+<h4><a id="design_compactionconfig" href="#design_compactionconfig">Configuring The Log Cleaner</a></h4>
 
 The log cleaner is disabled by default. To enable it set the server config
   <pre>  log.cleaner.enable=true</pre>
@@ -347,21 +347,21 @@ This can be done either at topic creation time or using the alter topic command.
 <p>
 Further cleaner configurations are described <a href="/documentation.html#brokerconfigs">here</a>.
 
-<h4>Log Compaction Limitations</h4>
+<h4><a id="design_compactionlimitations" href="#design_compactionlimitations">Log Compaction Limitations</a></h4>
 
 <ol>
   <li>You cannot configure yet how much log is retained without compaction (the "head" of the log).  Currently all segments are eligible except for the last segment, i.e. the one currently being written to.</li>
   <li>Log compaction is not yet compatible with compressed topics.</li>
 </ol>
-<h3><a id="semantics">4.9 Quotas</a></h3>
+<h3><a id="design_quotas" href="#design_quotas">4.9 Quotas</a></h3>
 <p>
     Starting in 0.9, the Kafka cluster has the ability to enforce quotas on produce and fetch requests. Quotas are basically byte-rate thresholds defined per client-id. A client-id logically identifies an application making a request. Hence a single client-id can span multiple producer and consumer instances and the quota will apply for all of them as a single entity i.e. if client-id="test-client" has a produce quota of 10MB/sec, this is shared across all instances with that same id.
 
-<h4>Why are quotas necessary?</h4>
+<h4><a id="design_quotasnecessary" href="#design_quotasnecessary">Why are quotas necessary?</a></h4>
 <p>
 It is possible for producers and consumers to produce/consume very high volumes of data and thus monopolize broker resources, cause network saturation and generally DOS other clients and the brokers themselves. Having quotas protects against these issues and is all the more important in large multi-tenant clusters where a small set of badly behaved clients can degrade user experience for the well behaved ones. In fact, when running Kafka as a service this even makes it possible to enforce API limits according to an agreed upon contract.
 </p>
-<h4>Enforcement</h4>
+<h4><a id="design_quotasenforcement" href="#design_quotasenforcement">Enforcement</a></h4>
 <p>
     By default, each unique client-id receives a fixed quota in bytes/sec as configured by the cluster (quota.producer.default, quota.consumer.default).
     This quota is defined on a per-broker basis. Each client can publish/fetch a maximum of X bytes/sec per broker before it gets throttled. We decided that defining these quotas per broker is much better than having a fixed cluster wide bandwidth per client because that would require a mechanism to share client quota usage among all the brokers. This can be harder to get right than the quota implementation itself!
@@ -372,9 +372,9 @@ It is possible for producers and consumers to produce/consume very high volumes
 <p>
 Client byte rate is measured over multiple small windows (for e.g. 30 windows of 1 second each) in order to detect and correct quota violations quickly. Typically, having large measurement windows (for e.g. 10 windows of 30 seconds each) leads to large bursts of traffic followed by long delays which is not great in terms of user experience.
 </p>
-<h4>Quota overrides</h4>
+<h4><a id="design_quotasoverrides" href="#design_quotasoverrides">Quota overrides</a></h4>
 <p>
     It is possible to override the default quota for client-ids that need a higher (or even lower) quota. The mechanism is similar to the per-topic log config overrides.
     Client-id overrides are written to ZooKeeper under <i><b>/config/clients</b></i>. These overrides are read by all brokers and are effective immediately. This lets us change quotas without having to do a rolling restart of the entire cluster. See <a href="/ops.html#quotas">here</a> for details.
 
-</p>
\ No newline at end of file
+</p>

http://git-wip-us.apache.org/repos/asf/kafka/blob/6cbd9759/docs/documentation.html
----------------------------------------------------------------------
diff --git a/docs/documentation.html b/docs/documentation.html
index eddc0c6..53c801e 100644
--- a/docs/documentation.html
+++ b/docs/documentation.html
@@ -131,37 +131,37 @@ Prior releases: <a href="/07/documentation.html">0.7.x</a>, <a href="/08/documen
     </li>
 </ul>
 
-<h2><a id="gettingStarted">1. Getting Started</a></h2>
+<h2><a id="gettingStarted" href="#gettingStarted">1. Getting Started</a></h2>
 <!--#include virtual="introduction.html" -->
 <!--#include virtual="uses.html" -->
 <!--#include virtual="quickstart.html" -->
 <!--#include virtual="ecosystem.html" -->
 <!--#include virtual="upgrade.html" -->
 
-<h2><a id="api">2. API</a></h2>
+<h2><a id="api" href="#api">2. API</a></h2>
 
 <!--#include virtual="api.html" -->
 
-<h2><a id="configuration">3. Configuration</a></h2>
+<h2><a id="configuration" href="#configuration">3. Configuration</a></h2>
 
 <!--#include virtual="configuration.html" -->
 
-<h2><a id="design">4. Design</a></h2>
+<h2><a id="design" href="#design">4. Design</a></h2>
 
 <!--#include virtual="design.html" -->
 
-<h2><a id="implementation">5. Implementation</a></h2>
+<h2><a id="implementation" href="#implementation">5. Implementation</a></h2>
 
 <!--#include virtual="implementation.html" -->
 
-<h2><a id="operations">6. Operations</a></h2>
+<h2><a id="operations" href="#operations">6. Operations</a></h2>
 
 <!--#include virtual="ops.html" -->
 
-<h2><a id="security">7. Security</a></h2>
+<h2><a id="security" href="#security">7. Security</a></h2>
 <!--#include virtual="security.html" -->
 
-<h2><a id="connect">8. Kafka Connect</a></h2>
+<h2><a id="connect" href="#connect">8. Kafka Connect</a></h2>
 <!--#include virtual="connect.html" -->
 
 <!--#include virtual="../includes/footer.html" -->

http://git-wip-us.apache.org/repos/asf/kafka/blob/6cbd9759/docs/ecosystem.html
----------------------------------------------------------------------
diff --git a/docs/ecosystem.html b/docs/ecosystem.html
index e99a446..73d5706 100644
--- a/docs/ecosystem.html
+++ b/docs/ecosystem.html
@@ -5,9 +5,9 @@
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at
- 
+
     http://www.apache.org/licenses/LICENSE-2.0
- 
+
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -15,6 +15,6 @@
  limitations under the License.
 -->
 
-<h3><a id="ecosystem">1.4 Ecosystem</a></h3>
+<h3><a id="ecosystem" href="#ecosystem">1.4 Ecosystem</a></h3>
 
 There are a plethora of tools that integrate with Kafka outside the main distribution. The <a href="https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem"> ecosystem page</a> lists many of these, including stream processing systems, Hadoop integration, monitoring, and deployment tools.

http://git-wip-us.apache.org/repos/asf/kafka/blob/6cbd9759/docs/implementation.html
----------------------------------------------------------------------
diff --git a/docs/implementation.html b/docs/implementation.html
index b95d36f..0b603d4 100644
--- a/docs/implementation.html
+++ b/docs/implementation.html
@@ -15,9 +15,9 @@
  limitations under the License.
 -->
 
-<h3><a id="apidesign">5.1 API Design</a></h3>
+<h3><a id="apidesign" href="#apidesign">5.1 API Design</a></h3>
 
-<h4>Producer APIs</h4>
+<h4><a id="impl_producer" href="#impl_producer">Producer APIs</a></h4>
 
 <p>
 The Producer API wraps the 2 low-level producers - <code>kafka.producer.SyncProducer</code> and <code>kafka.producer.async.AsyncProducer</code>.
@@ -68,7 +68,7 @@ The partition API uses the key and the number of available broker partitions to
 </ul>
 </p>
 
-<h4>Consumer APIs</h4>
+<h4><a id="impl_consumer" href="#impl_consumer">Consumer APIs</a></h4>
 <p>
 We have 2 levels of consumer APIs. The low-level "simple" API maintains a connection to a single broker and has a close correspondence to the network requests sent to the server. This API is completely stateless, with the offset being passed in on every request, allowing the user to maintain this metadata however they choose.
 </p>
@@ -76,7 +76,7 @@ We have 2 levels of consumer APIs. The low-level "simple" API maintains a connec
 The high-level API hides the details of brokers from the consumer and allows consuming off the cluster of machines without concern for the underlying topology. It also maintains the state of what has been consumed. The high-level API also provides the ability to subscribe to topics that match a filter expression (i.e., either a whitelist or a blacklist regular expression).
 </p>
 
-<h5>Low-level API</h5>
+<h5><a id="impl_lowlevel" href="#impl_lowlevel">Low-level API</a></h5>
 <pre>
 class SimpleConsumer {
 
@@ -99,7 +99,7 @@ class SimpleConsumer {
 
 The low-level API is used to implement the high-level API as well as being used directly for some of our offline consumers which have particular requirements around maintaining state.
 
-<h5>High-level API</h5>
+<h5><a id="impl_highlevel" href="#impl_highlevel">High-level API</a></h5>
 <pre>
 
 /* create a connection to the cluster */
@@ -138,15 +138,15 @@ This API is centered around iterators, implemented by the KafkaStream class. Eac
 The createMessageStreams call registers the consumer for the topic, which results in rebalancing the consumer/broker assignment. The API encourages creating many topic streams in a single call in order to minimize this rebalancing. The createMessageStreamsByFilter call (additionally) registers watchers to discover new topics that match its filter. Note that each stream that createMessageStreamsByFilter returns may iterate over messages from multiple topics (i.e., if multiple topics are allowed by the filter).
 </p>
 
-<h3><a id="networklayer">5.2 Network Layer</a></h3>
+<h3><a id="networklayer" href="#networklayer">5.2 Network Layer</a></h3>
 <p>
 The network layer is a fairly straight-forward NIO server, and will not be described in great detail. The sendfile implementation is done by giving the <code>MessageSet</code> interface a <code>writeTo</code> method. This allows the file-backed message set to use the more efficient <code>transferTo</code> implementation instead of an in-process buffered write. The threading model is a single acceptor thread and <i>N</i> processor threads which handle a fixed number of connections each. This design has been pretty thoroughly tested <a href="http://sna-projects.com/blog/2009/08/introducing-the-nio-socketserver-implementation">elsewhere</a> and found to be simple to implement and fast. The protocol is kept quite simple to allow for future implementation of clients in other languages.
 </p>
-<h3><a id="messages">5.3 Messages</a></h3>
+<h3><a id="messages" href="#messages">5.3 Messages</a></h3>
 <p>
 Messages consist of a fixed-size header and variable length opaque byte array payload. The header contains a format version and a CRC32 checksum to detect corruption or truncation. Leaving the payload opaque is the right decision: there is a great deal of progress being made on serialization libraries right now, and any particular choice is unlikely to be right for all uses. Needless to say a particular application using Kafka would likely mandate a particular serialization type as part of its usage. The <code>MessageSet</code> interface is simply an iterator over messages with specialized methods for bulk reading and writing to an NIO <code>Channel</code>.
 
-<h3><a id="messageformat">5.4 Message Format</a></h3>
+<h3><a id="messageformat" href="#messageformat">5.4 Message Format</a></h3>
 
 <pre>
 	/**
@@ -173,7 +173,7 @@ Messages consist of a fixed-size header and variable length opaque byte array pa
 	 */
 </pre>
 </p>
-<h3><a id="log">5.5 Log</a></h3>
+<h3><a id="log" href="#log">5.5 Log</a></h3>
 <p>
 A log for a topic named "my_topic" with two partitions consists of two directories (namely <code>my_topic_0</code> and <code>my_topic_1</code>) populated with data files containing the messages for that topic. The format of the log files is a sequence of "log entries"; each log entry is a 4 byte integer <i>N</i> storing the message length which is followed by the <i>N</i> message bytes. Each message is uniquely identified by a 64-bit integer <i>offset</i> giving the byte position of the start of this message in the stream of all messages ever sent to that topic on that partition. The on-disk format of each message is given below. Each log file is named with the offset of the first message it contains. So the first file created will be 00000000000.kafka, and each additional file will have an integer name roughly <i>S</i> bytes from the previous file where <i>S</i> is the max log file size given in the configuration.
 </p>
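The length-prefixed entry layout described above can be sketched in a few lines. This is a minimal model under stated assumptions: the length is encoded as a big-endian signed 32-bit integer, and `append_entry` and `read_entries` are hypothetical helper names, not Kafka APIs.

```python
import struct


def append_entry(buf: bytearray, message: bytes) -> int:
    """Append one log entry and return the byte position where it starts."""
    position = len(buf)
    buf.extend(struct.pack(">i", len(message)))   # 4-byte length N
    buf.extend(message)                           # N message bytes
    return position


def read_entries(data: bytes):
    """Yield (byte_position, message) pairs, stopping at a partial entry."""
    pos = 0
    while pos + 4 <= len(data):
        (n,) = struct.unpack_from(">i", data, pos)
        end = pos + 4 + n
        if end > len(data):
            break                                 # trailing partial message
        yield pos, data[pos + 4:end]
        pos = end


log_file = bytearray()
first = append_entry(log_file, b"hello")
second = append_entry(log_file, b"kafka!")
entries = list(read_entries(bytes(log_file)))
truncated = list(read_entries(bytes(log_file[:-2])))
```

Because every entry carries its own size, a reader can tell when a buffer ends in the middle of a message and simply stop there.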
@@ -192,11 +192,11 @@ payload        : n bytes
 The use of the message offset as the message id is unusual. Our original idea was to use a GUID generated by the producer, and maintain a mapping from GUID to offset on each broker. But since a consumer must maintain an ID for each server, the global uniqueness of the GUID provides no value. Furthermore the complexity of maintaining the mapping from a random id to an offset requires a heavy weight index structure which must be synchronized with disk, essentially requiring a full persistent random-access data structure. Thus to simplify the lookup structure we decided to use a simple per-partition atomic counter which could be coupled with the partition id and node id to uniquely identify a message; this makes the lookup structure simpler, though multiple seeks per consumer request are still likely. However once we settled on a counter, the jump to directly using the offset seemed natural&mdash;both after all are monotonically increasing integers unique to a partition. Since the offset is hidden from the consumer API this decision is ultimately an implementation detail and we went with the more efficient approach.
 </p>
 <img src="images/kafka_log.png">
-<h4>Writes</h4>
+<h4><a id="impl_writes" href="#impl_writes">Writes</a></h4>
 <p>
 The log allows serial appends which always go to the last file. This file is rolled over to a fresh file when it reaches a configurable size (say 1GB). The log takes two configuration parameters: <i>M</i>, which gives the number of messages to write before forcing the OS to flush the file to disk, and <i>S</i>, which gives a number of seconds after which a flush is forced. This gives a durability guarantee of losing at most <i>M</i> messages or <i>S</i> seconds of data in the event of a system crash.
 </p>
-<h4>Reads</h4>
+<h4><a id="impl_reads" href="#impl_reads">Reads</a></h4>
 <p>
 Reads are done by giving the 64-bit logical offset of a message and an <i>S</i>-byte max chunk size. This will return an iterator over the messages contained in the <i>S</i>-byte buffer. <i>S</i> is intended to be larger than any single message, but in the event of an abnormally large message, the read can be retried multiple times, each time doubling the buffer size, until the message is read successfully. A maximum message and buffer size can be specified to make the server reject messages larger than some size, and to give a bound to the client on the maximum it need ever read to get a complete message. It is likely that the read buffer ends with a partial message; this is easily detected by the size delimiting.
 </p>
@@ -228,12 +228,11 @@ messageSetSend 1
 ...
 messageSetSend n
 </pre>
-
-<h4>Deletes</h4>
+<h4><a id="impl_deletes" href="#impl_deletes">Deletes</a></h4>
 <p>
 Data is deleted one log segment at a time. The log manager allows pluggable delete policies to choose which files are eligible for deletion. The current policy deletes any log with a modification time of more than <i>N</i> days ago, though a policy which retained the last <i>N</i> GB could also be useful. To avoid locking reads while still allowing deletes that modify the segment list we use a copy-on-write style segment list implementation that provides consistent views to allow a binary search to proceed on an immutable static snapshot view of the log segments while deletes are progressing.
 </p>
-<h4>Guarantees</h4>
+<h4><a id="impl_guarantees" href="#impl_guarantees">Guarantees</a></h4>
 <p>
 The log provides a configuration parameter <i>M</i> which controls the maximum number of messages that are written before forcing a flush to disk. On startup a log recovery process is run that iterates over all messages in the newest log segment and verifies that each message entry is valid. A message entry is valid if the sum of its size and offset are less than the length of the file AND the CRC32 of the message payload matches the CRC stored with the message. In the event corruption is detected the log is truncated to the last valid offset.
 </p>
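The recovery pass described above can be sketched as a scan that stops at the first invalid entry. The entry layout here is a simplification chosen for illustration (4-byte length, then a 4-byte CRC32 of the payload, then the payload); it is not the actual on-disk message format, and `encode` and `valid_length` are hypothetical names.

```python
import struct
import zlib


def encode(payload: bytes) -> bytes:
    """Build one entry: length prefix, CRC32 of the payload, payload."""
    body = struct.pack(">I", zlib.crc32(payload)) + payload
    return struct.pack(">i", len(body)) + body


def valid_length(data: bytes) -> int:
    """Return the number of leading bytes made up of valid entries.

    An entry is accepted only if it fits inside the file and its stored CRC
    matches the CRC computed over its payload.
    """
    pos = 0
    while pos + 4 <= len(data):
        (n,) = struct.unpack_from(">i", data, pos)
        end = pos + 4 + n
        if n < 4 or end > len(data):
            break                                  # truncated or nonsense size
        (stored_crc,) = struct.unpack_from(">I", data, pos + 4)
        if zlib.crc32(data[pos + 8:end]) != stored_crc:
            break                                  # corrupt payload
        pos = end
    return pos


good = encode(b"a") + encode(b"bb")
clean = valid_length(good)
after_truncation = valid_length(good[:-1])
after_garbage = valid_length(good + b"\x00\x00\x00\x05garbage!")
```

A recovery process would then truncate the file to the returned length, discarding everything from the first invalid entry onward.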
@@ -241,8 +240,8 @@ The log provides a configuration parameter <i>M</i> which controls the maximum n
 Note that two kinds of corruption must be handled: truncation in which an unwritten block is lost due to a crash, and corruption in which a nonsense block is ADDED to the file. The reason for this is that in general the OS makes no guarantee of the write order between the file inode and the actual block data so in addition to losing written data the file can gain nonsense data if the inode is updated with a new size but a crash occurs before the block containing that data is written. The CRC detects this corner case, and prevents it from corrupting the log (though the unwritten messages are, of course, lost).
 </p>
 
-<h3><a id="distributionimpl">5.6 Distribution</a></h3>
-<h4>Consumer Offset Tracking</h4>
+<h3><a id="distributionimpl" href="#distributionimpl">5.6 Distribution</a></h3>
+<h4><a id="impl_offsettracking" href="#impl_offsettracking">Consumer Offset Tracking</a></h4>
 <p>
 The high-level consumer tracks the maximum offset it has consumed in each partition and periodically commits its offset vector so that it can resume from those offsets in the event of a restart. Kafka provides the option to store all the offsets for a given consumer group in a designated broker (for that group) called the <i>offset manager</i>. i.e., any consumer instance in that consumer group should send its offset commits and fetches to that offset manager (broker). The high-level consumer handles this automatically. If you use the simple consumer you will need to manage offsets manually. This is currently unsupported in the Java simple consumer which can only commit or fetch offsets in ZooKeeper. If you use the Scala simple consumer you can discover the offset manager and explicitly commit or fetch offsets to the offset manager. A consumer can look up its offset manager by issuing a ConsumerMetadataRequest to any Kafka broker and reading the ConsumerMetadataResponse which will contain the offset manager. The consumer can then proceed to commit or fetch offsets from the offsets manager broker. In case the offset manager moves, the consumer will need to rediscover the offset manager. If you wish to manage your offsets manually, you can take a look at these <a href="https://cwiki.apache.org/confluence/display/KAFKA/Committing+and+fetching+consumer+offsets+in+Kafka">code samples that explain how to issue OffsetCommitRequest and OffsetFetchRequest</a>.
 </p>
@@ -255,7 +254,7 @@ When the offset manager receives an OffsetCommitRequest, it appends the request
 When the offset manager receives an offset fetch request, it simply returns the last committed offset vector from the offsets cache. In case the offset manager was just started or if it just became the offset manager for a new set of consumer groups (by becoming a leader for a partition of the offsets topic), it may need to load the offsets topic partition into the cache. In this case, the offset fetch will fail with an OffsetsLoadInProgress exception and the consumer may retry the OffsetFetchRequest after backing off. (This is done automatically by the high-level consumer.)
 </p>
 
-<h5><a id="offsetmigration">Migrating offsets from ZooKeeper to Kafka</a></h5>
+<h5><a id="offsetmigration" href="#offsetmigration">Migrating offsets from ZooKeeper to Kafka</a></h5>
 <p>
 Kafka consumers in earlier releases store their offsets by default in ZooKeeper. It is possible to migrate these consumers to commit offsets into Kafka by following these steps:
 <ol>
@@ -271,17 +270,17 @@ Kafka consumers in earlier releases store their offsets by default in ZooKeeper.
 A roll-back (i.e., migrating from Kafka back to ZooKeeper) can also be performed using the above steps if you set <code>offsets.storage=zookeeper</code>.
 </p>
 
-<h4>ZooKeeper Directories</h4>
+<h4><a id="impl_zookeeper" href="#impl_zookeeper">ZooKeeper Directories</a></h4>
 <p>
 The following gives the ZooKeeper structures and algorithms used for co-ordination between consumers and brokers.
 </p>
 
-<h4>Notation</h4>
+<h4><a id="impl_zknotation" href="#impl_zknotation">Notation</a></h4>
 <p>
 When an element in a path is denoted [xyz], that means that the value of xyz is not fixed and there is in fact a ZooKeeper znode for each possible value of xyz. For example /topics/[topic] would be a directory named /topics containing a sub-directory for each topic name. Numerical ranges are also given such as [0...5] to indicate the subdirectories 0, 1, 2, 3, 4. An arrow -> is used to indicate the contents of a znode. For example /hello -> world would indicate a znode /hello containing the value "world".
 </p>
 
-<h4>Broker Node Registry</h4>
+<h4><a id="impl_zkbroker" href="#impl_zkbroker">Broker Node Registry</a></h4>
 <pre>
 /brokers/ids/[0...N] --> host:port (ephemeral node)
 </pre>
@@ -291,7 +290,7 @@ This is a list of all present broker nodes, each of which provides a unique logi
 <p>
 Since the broker registers itself in ZooKeeper using ephemeral znodes, this registration is dynamic and will disappear if the broker is shutdown or dies (thus notifying consumers it is no longer available).
 </p>
-<h4>Broker Topic Registry</h4>
+<h4><a id="impl_zktopic" href="#impl_zktopic">Broker Topic Registry</a></h4>
 <pre>
 /brokers/topics/[topic]/[0...N] --> nPartions (ephemeral node)
 </pre>
@@ -300,7 +299,7 @@ Since the broker registers itself in ZooKeeper using ephemeral znodes, this regi
 Each broker registers itself under the topics it maintains and stores the number of partitions for that topic.
 </p>
 
-<h4>Consumers and Consumer Groups</h4>
+<h4><a id="impl_zkconsumers" href="#impl_zkconsumers">Consumers and Consumer Groups</a></h4>
 <p>
 Consumers of topics also register themselves in ZooKeeper, in order to coordinate with each other and balance the consumption of data. Consumers can also store their offsets in ZooKeeper by setting <code>offsets.storage=zookeeper</code>. However, this offset storage mechanism will be deprecated in a future release. Therefore, it is recommended to <a href="#offsetmigration">migrate offsets storage to Kafka</a>.
 </p>
@@ -314,7 +313,7 @@ For example if one consumer is your foobar process, which is run across three ma
 The consumers in a group divide up the partitions as fairly as possible, each partition is consumed by exactly one consumer in a consumer group.
 </p>
 
-<h4>Consumer Id Registry</h4>
+<h4><a id="impl_zkconsumerid" href="#impl_zkconsumerid">Consumer Id Registry</a></h4>
 <p>
 In addition to the group_id which is shared by all consumers in a group, each consumer is given a transient, unique consumer_id (of the form hostname:uuid) for identification purposes. Consumer ids are registered in the following directory.
 <pre>
@@ -323,7 +322,7 @@ In addition to the group_id which is shared by all consumers in a group, each co
 Each of the consumers in the group registers under its group and creates a znode with its consumer_id. The value of the znode contains a map of &lt;topic, #streams&gt;. This id is simply used to identify each of the consumers which is currently active within a group. This is an ephemeral node so it will disappear if the consumer process dies.
 </p>
 
-<h4>Consumer Offsets</h4>
+<h4><a id="impl_zkconsumeroffsets" href="#impl_zkconsumeroffsets">Consumer Offsets</a></h4>
 <p>
 Consumers track the maximum offset they have consumed in each partition. This value is stored in a ZooKeeper directory if <code>offsets.storage=zookeeper</code>.
 </p>
@@ -331,7 +330,7 @@ Consumers track the maximum offset they have consumed in each partition. This va
 /consumers/[group_id]/offsets/[topic]/[broker_id-partition_id] --> offset_counter_value (persistent node)
 </pre>
 
-<h4>Partition Owner registry</h4>
+<h4><a id="impl_zkowner" href="#impl_zkowner">Partition Owner registry</a></h4>
 
 <p>
 Each broker partition is consumed by a single consumer within a given consumer group. The consumer must establish its ownership of a given partition before any consumption can begin. To establish its ownership, a consumer writes its own id in an ephemeral node under the particular broker partition it is claiming.
@@ -341,13 +340,13 @@ Each broker partition is consumed by a single consumer within a given consumer g
 /consumers/[group_id]/owners/[topic]/[broker_id-partition_id] --> consumer_node_id (ephemeral node)
 </pre>
 
-<h4>Broker node registration</h4>
+<h4><a id="impl_brokerregistration" href="#impl_brokerregistration">Broker node registration</a></h4>
 
 <p>
 The broker nodes are basically independent, so they only publish information about what they have. When a broker joins, it registers itself under the broker node registry directory and writes information about its host name and port. The broker also registers the list of existing topics and their logical partitions in the broker topic registry. New topics are registered dynamically when they are created on the broker.
 </p>
 
-<h4>Consumer registration algorithm</h4>
+<h4><a id="impl_consumerregistration" href="#impl_consumerregistration">Consumer registration algorithm</a></h4>
 
 <p>
 When a consumer starts, it does the following:
@@ -363,7 +362,7 @@ When a consumer starts, it does the following:
 </ol>
 </p>
 
-<h4>Consumer rebalancing algorithm</h4>
+<h4><a id="impl_consumerrebalance" href="#impl_consumerrebalance">Consumer rebalancing algorithm</a></h4>
 <p>
 The consumer rebalancing algorithm allows all the consumers in a group to come into consensus on which consumer is consuming which partitions. Consumer rebalancing is triggered on each addition or removal of both broker nodes and other consumers within the same group. For a given topic and a given consumer group, broker partitions are divided evenly among consumers within the group. A partition is always consumed by a single consumer. This design simplifies the implementation. Had we allowed a partition to be concurrently consumed by multiple consumers, there would be contention on the partition and some kind of locking would be required. If there are more consumers than partitions, some consumers won't get any data at all. During rebalancing, we try to assign partitions to consumers in such a way that reduces the number of broker nodes each consumer has to connect to.
 </p>
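
Editor's note on the Guarantees text in docs/implementation.html above: the recovery rule (an entry is valid only if it fits inside the file and its stored CRC matches the payload, and the log is truncated at the first invalid entry) can be sketched roughly as follows. This is a minimal illustration, not Kafka's code. The entry layout (4-byte length, 4-byte CRC32, payload) and the names encode_entry, last_valid_offset and recover are assumptions made up for this example.

```python
import struct
import zlib

# Hypothetical entry layout for illustration only (not Kafka's on-disk format):
# 4-byte big-endian payload length, 4-byte CRC32 of the payload, then the payload.
HEADER = struct.Struct(">II")


def encode_entry(payload: bytes) -> bytes:
    """Serialize one entry in the assumed layout."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def last_valid_offset(segment: bytes) -> int:
    """Return the byte position just past the last valid entry."""
    pos = 0
    while pos + HEADER.size <= len(segment):
        length, stored_crc = HEADER.unpack_from(segment, pos)
        end = pos + HEADER.size + length
        if end > len(segment):
            break  # entry runs past end of file: a truncated write
        if zlib.crc32(segment[pos + HEADER.size:end]) != stored_crc:
            break  # payload does not match its CRC: a nonsense block
        pos = end
    return pos


def recover(segment: bytes) -> bytes:
    """Truncate the segment to the last valid offset."""
    return segment[:last_valid_offset(segment)]
```

Both corruption cases named in the text stop the scan at the same point: a partially written entry fails the length check, and a garbage block fails the CRC check.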

http://git-wip-us.apache.org/repos/asf/kafka/blob/6cbd9759/docs/introduction.html
----------------------------------------------------------------------
diff --git a/docs/introduction.html b/docs/introduction.html
index 7e0b150..e5b2e78 100644
--- a/docs/introduction.html
+++ b/docs/introduction.html
@@ -5,9 +5,9 @@
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at
- 
+
     http://www.apache.org/licenses/LICENSE-2.0
- 
+
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -15,7 +15,7 @@
  limitations under the License.
 -->
 
-<h3><a id="introduction">1.1 Introduction</a></h3>
+<h3><a id="introduction" href="#introduction">1.1 Introduction</a></h3>
 Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.
 <p>
 What does all that mean?
@@ -35,7 +35,7 @@ So, at a high level, producers send messages over the network to the Kafka clust
 
 Communication between the clients and the servers is done with a simple, high-performance, language agnostic <a href="https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol">TCP protocol</a>. We provide a Java client for Kafka, but clients are available in <a href="https://cwiki.apache.org/confluence/display/KAFKA/Clients">many languages</a>.
 
-<h4>Topics and Logs</h4>
+<h4><a id="intro_topics" href="#intro_topics">Topics and Logs</a></h4>
 Let's first dive into the high-level abstraction Kafka provides&mdash;the topic.
 <p>
 A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log that looks like this:
@@ -50,19 +50,19 @@ In fact the only metadata retained on a per-consumer basis is the position of th
 <p>
 This combination of features means that Kafka consumers are very cheap&mdash;they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
 <p>
-The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism&mdash;more on that in a bit. 
+The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism&mdash;more on that in a bit.
 
-<h4>Distribution</h4>
+<h4><a id="intro_distribution" href="#intro_distribution">Distribution</a></h4>
 
 The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
 <p>
 Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
 
-<h4>Producers</h4>
+<h4><a id="intro_producers" href="#intro_producers">Producers</a></h4>
 
 Producers publish data to the topics of their choice. The producer is responsible for choosing which message to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the message). More on the use of partitioning in a second.
 
-<h4><a id="intro_consumers">Consumers</a></h4>
+<h4><a id="intro_consumers" href="#intro_consumers">Consumers</a></h4>
 
 Messaging traditionally has two models: <a href="http://en.wikipedia.org/wiki/Message_queue">queuing</a> and <a href="http://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern">publish-subscribe</a>. In a queue, a pool of consumers may read from a server and each message goes to one of them; in publish-subscribe the message is broadcast to all consumers. Kafka offers a single consumer abstraction that generalizes both of these&mdash;the <i>consumer group</i>.
 <p>
@@ -70,7 +70,7 @@ Consumers label themselves with a consumer group name, and each message publishe
 <p>
 If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
 <p>
-If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers. 
+If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
 <p>
 More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
 <p>
@@ -88,7 +88,7 @@ Kafka does it better. By having a notion of parallelism&mdash;the partition&mdas
 <p>
 Kafka only provides a total order over messages <i>within</i> a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over messages this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
 
-<h4>Guarantees</h4>
+<h4><a id="intro_guarantees" href="#intro_guarantees">Guarantees</a></h4>
 
 At a high-level Kafka gives the following guarantees:
 <ul>
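
Editor's note on the Producers text in docs/introduction.html above: the two partition-selection strategies it mentions (round-robin for load balancing, or a semantic function of a message key) can be sketched as below. This is an illustration under stated assumptions, not the Kafka producer API. The function names are hypothetical, and CRC32 is used only as a convenient stable hash, not because Kafka's partitioner uses it.

```python
import itertools
import zlib


def round_robin_partitioner(num_partitions: int):
    """Return a chooser that cycles through partitions to balance load."""
    counter = itertools.count()

    def choose(_key=None) -> int:
        return next(counter) % num_partitions

    return choose


def key_partitioner(num_partitions: int):
    """Return a chooser that maps equal keys to the same partition."""

    def choose(key: bytes) -> int:
        # CRC32 is an illustrative stable hash (an assumption of this sketch).
        return zlib.crc32(key) % num_partitions

    return choose
```

With key-based selection, all messages that share a key land in one partition, which is what makes the per-partition ordering described later in the same file useful to applications.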

http://git-wip-us.apache.org/repos/asf/kafka/blob/6cbd9759/docs/migration.html
----------------------------------------------------------------------
diff --git a/docs/migration.html b/docs/migration.html
index 18ab6d4..2da6a7e 100644
--- a/docs/migration.html
+++ b/docs/migration.html
@@ -5,9 +5,9 @@
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at
- 
+
     http://www.apache.org/licenses/LICENSE-2.0
- 
+
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -16,11 +16,11 @@
 -->
 
 <!--#include virtual="../includes/header.html" -->
-<h2>Migrating from 0.7.x to 0.8</h2>
+<h2><a id="migration" href="#migration">Migrating from 0.7.x to 0.8</a></h2>
 
 0.8 is our first (and hopefully last) release with a non-backwards-compatible wire protocol, ZooKeeper     layout, and on-disk data format. This was a chance for us to clean up a lot of cruft and start fresh. This means performing a no-downtime upgrade is more painful than normal&mdash;you cannot just swap in the new code in-place.
 
-<h3>Migration Steps</h3>
+<h3><a id="migration_steps" href="#migration_steps">Migration Steps</a></h3>
 
 <ol>
     <li>Setup a new cluster running 0.8.

