flume-user mailing list archives

From Gonzalo Herreros <gherre...@gmail.com>
Subject Re: Batchsize in kafka sink
Date Sun, 27 Sep 2015 09:46:51 GMT
There are subtle but significant differences.

When you configure "batchSize" in the sink, you are specifying how many
messages are taken from the channel in a single transaction (as in any
other sink).
The Kafka property "batch.num.messages" (which in the Flume config is
specified as "kafka.batch.num.messages"), on the other hand, specifies the
batch size used by an asynchronous producer when sending messages to the
broker. By default the producer is synchronous, so that property has no
effect.

If you use the synchronous producer (which is the default), the messages
taken from the channel as a batch (100 by default) will be sent together to
the Kafka broker.
However, if you change the producer to async, it gets more complicated. By
default "kafka.batch.num.messages" is 200, which means the sink will take
100 messages from the channel and commit that transaction, but those
messages will be kept in memory until another 100 are taken (so there is a
risk of losing messages).
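
To make the numbers above concrete, here is a rough sketch in Python of
how the two batch sizes interact. It is only a toy model, not Flume or
Kafka code: the function name "simulate" and its parameters are made up for
illustration, and it ignores the fact that a real async producer can also
flush its buffer on a timer.

```python
from collections import deque


def simulate(total_events, sink_batch_size=100, producer_batch_size=200,
             async_producer=True):
    """Toy model of a sink draining a channel into a Kafka producer.

    Returns (committed, delivered, at_risk):
      committed: events removed from the channel by committed transactions
      delivered: events actually sent to the broker
      at_risk:   events committed but still sitting in the producer buffer
    Assumption: the async producer sends only when its buffer reaches
    producer_batch_size (time-based flushing is not modelled).
    """
    channel = deque(range(total_events))
    producer_buffer = []
    committed = 0
    delivered = 0

    while channel:
        # The sink takes up to sink_batch_size events in one transaction.
        take = min(sink_batch_size, len(channel))
        batch = [channel.popleft() for _ in range(take)]

        if async_producer:
            # Async: events are only queued in memory for now.
            producer_buffer.extend(batch)
            while len(producer_buffer) >= producer_batch_size:
                delivered += producer_batch_size
                del producer_buffer[:producer_batch_size]
        else:
            # Sync: the whole batch reaches the broker before the commit.
            delivered += len(batch)

        # The channel transaction is committed either way.
        committed += len(batch)

    return committed, delivered, len(producer_buffer)


if __name__ == "__main__":
    print("sync :", simulate(300, async_producer=False))
    print("async:", simulate(300, async_producer=True))
```

In this model, with the sync producer everything committed has also been
delivered, while with the async producer and the default sizes the last 100
committed events are still only in memory when the channel runs dry.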

I would stay away from the async producer in a Flume sink because you want
the sink to control the pace (a file or memory channel will be faster), so
that it doesn't need to buffer in memory and risk message loss. An async
producer is useful when the client is an online application that you don't
want to delay.

To answer your question: if you don't specify any batching properties, by
default it will deliver messages in batches of 100, which is probably fine
in most cases.
Hope that makes sense.

Regards,
Gonzalo


On 26 September 2015 at 05:19, Sharninder <sharninder@gmail.com> wrote:

> Anyone ?
>
> > On 25-Sep-2015, at 3:51 PM, Sharninder <sharninder@gmail.com> wrote:
> >
> > Hi,
> >
> > We want to move to the built-in kafka sink from our own custom
> implementation and I have a question about the batchsize config parameter.
> >
> > Looking at the code of the sink, I can tell that the batchsize is used
> to construct the list of keyed messages fed to the producer.
> >
> > My question is what is the difference between this variable and the
> kafka batch.num.messages parameter?
> >
> > Is the flume parameter necessary ?
> >
> > --
> > Sharninder
> >
> >
>
