flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Williams <jasonjwwilli...@gmail.com>
Subject Re: Kafka Sink random partition assignment
Date Tue, 07 Jun 2016 18:51:23 GMT
Hey Chris,

That's great. I'll look into that. Thank you very much!


Sent via iPhone

> On Jun 7, 2016, at 11:36, Chris Horrocks <chris@hor.rocks> wrote:
> You can implement sink groups load balance processor (https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors)
to spread the load over multiple producers to partly account for situation where you might
have one kafka sink rotating between mutliple partitions, it's not an exact science though
as you'd need multiple kafka sinks (n number more than you have partitions in the topic being
written to) in order to make the balance of probabilities such that you were always writing
to each partition from at least one sink.
> As with all hadoop technologies: depends on your use-case.
> -- 
> Chris Horrocks
> From: Jason J. W. Williams <jasonjwwilliams@gmail.com>
> Reply: user@flume.apache.org <user@flume.apache.org>
> Date: 7 June 2016 at 19:03:59
> To: user@flume.apache.org <user@flume.apache.org>
> Subject:  Re: Kafka Sink random partition assignment 
>> Thanks again Chris. I am curious why I see the round-robin behavior I expected when
using kafka-console-producer to inject messages though. 
>> -J
>>> On Tuesday, June 7, 2016, Chris Horrocks <chris@hor.rocks> wrote:
>>> It's by design of Kafka (and by extension flume). The producers are designed
to be many-to-one (producers to partitions) and as such picking a random partition every 10
minutes prevents separate producer instances from all randomly picking the same partition.

>>> -- 
>>> Chris Horrocks
>>> From: Jason Williams <jasonjwwilliams@gmail.com>
>>> Reply: user@flume.apache.org <user@flume.apache.org>
>>> Date: 7 June 2016 at 09:43:34
>>> To: user@flume.apache.org <user@flume.apache.org>
>>> Subject:  Re: Kafka Sink random partition assignment
>>>> Hey Chris,
>>>> Thanks for help! 
>>>> Is that a limitation of the Flume Kafka sink or Kafka itself? Because when
I use another Kafka producer and publish without a key, it randomly moves among the partitions
on every publish.
>>>> -J
>>>> Sent via iPhone
>>>> On Jun 7, 2016, at 00:08, Chris Horrocks <chris@hor.rocks> wrote:
>>>>> The producers bind to random partitions and move every 10 minutes. If
you leave it long enough you should see it in the producer flume agent logs, although there's
nothing to stop it from "randomly" choosing the same partition twice. Annoyingly there's no
concept of producer groups (yet) to ensure that producers apportion the available partitions
between them as this would create a synchronisation issue between what should be entirely
independent processes. 
>>>>> -- 
>>>>> Chris Horrocks
>>>>>> On 7 June 2016 at 00:32:29, Jason J. W. Williams (jasonjwwilliams@gmail.com)
>>>>>> Hi,
>>>>>> New to flume and I'm trying to relay log messages received over netcat
source to Kafka sink. 
>>>>>> Everything seems to be fine, except that Flume is acting like it
IS assigning a partition key to the produced messages though none is assigned. I'd like the
messages to be assigned to a random partition, so that consumers are load balanced.
>>>>>> * Flume 1.6.0
>>>>>> * Kafka
>>>>>> Flume config: https://gist.github.com/williamsjj/8ae025906955fbc4b5f990e162b75d7c
>>>>>> Kafka topic config: kafka-topics --zookeeper localhost/kafka --create
--topic activity.history --partitions 20 --replication-factor 1
>>>>>> Python consumer program: https://gist.github.com/williamsjj/9e67287f0154816c3a733a39ad008437
>>>>>> Test program (publishes to Flume): https://gist.github.com/williamsjj/1eb097a187a3edb17ec1a3913e47e58b
>>>>>> Flume agent listens on 3132tcp for connections, and publishes messages
received to the Kafka activity.history topic.  I'm running two instances of the Python consumer.
>>>>>> What happens however, is all logs messages get sent to a single Kafka
consumer...if I restart Flume (leave consumers running) and re-run the test, all messages
get published to the other consumer. So it feels like Flume is assigning a permanent partition
key even though one is not defined (and should therefore be random).
>>>>>> Any advice is greatly appreciated.
>>>>>> -J

View raw message