flume-user mailing list archives

From Guillermo Ortiz <konstt2...@gmail.com>
Subject Re: Flume with Kafka , Architecture.
Date Wed, 18 Feb 2015 07:33:29 GMT
I mean

Machine TypeA (Kafka + Flume Agent with Source Kafka and Sink HDFS)
--> Machine TypeB (DataNode)
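
A minimal sketch of what the agent on a TypeA machine could look like. The
agent name (a1), host names, topic, group id and HDFS path are placeholders,
not values from this thread, and the source properties assume the Flume 1.6
Kafka source (later releases use kafka.bootstrap.servers and kafka.topics
instead). A memory channel is used only to keep the example short.

```properties
# Hypothetical agent "a1" running on a Kafka machine (TypeA).
a1.sources = kafka-src
a1.channels = mem-ch
a1.sinks = hdfs-sink

# Kafka source, Flume 1.6 property names (placeholder ZooKeeper host and topic).
a1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafka-src.zookeeperConnect = zk1.example.com:2181
a1.sources.kafka-src.topic = events
a1.sources.kafka-src.groupId = flume
a1.sources.kafka-src.channels = mem-ch

# In-memory channel; events buffered here are lost if the agent dies.
a1.channels.mem-ch.type = memory
a1.channels.mem-ch.capacity = 10000
a1.channels.mem-ch.transactionCapacity = 1000

# HDFS sink writing gzip-compressed files to the remote cluster (TypeB).
a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.channel = mem-ch
a1.sinks.hdfs-sink.hdfs.path = hdfs://namenode.example.com:8020/flume/events/%Y-%m-%d
a1.sinks.hdfs-sink.hdfs.fileType = CompressedStream
a1.sinks.hdfs-sink.hdfs.codeC = gzip
a1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
```

Agents on several TypeA machines that share the same groupId would split the
topic's partitions between them.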

2015-02-17 22:40 GMT+01:00 Asim Zafir <asim.zafir@gmail.com>:
> Not sure what you mean by Kafka+Flume to HDFS, but in our experience we
> have seen significant data loss with Flume being used as a transport
> mechanism to sync data to HDFS. Some things that haven't worked for us:
> 1) Flume appenders on the source: installing appenders and a Flume agent on
> the application server side seriously causes performance issues. The
> appenders appear to reach a deadlock state due to thread locking.
> 2) log4j v1 and appenders are a bad option with Flume.
> 3) log4j v2 + embedded agent solves the thread-locking problem and relieves
> the stress on the application servers, since you now have one less JVM to
> manage, so no performance issues there. For any high-traffic server
> generating data it really works.
> 4) Flume has issues with some meta characters (some specific UTF codes): it
> will truncate what it commits to the data pipeline when it hits one of them,
> if the read of that character falls outside the limit of the read buffer.
> Since there is no logging, it's painful to even troubleshoot.
> thanks,
> Asim Zafir
> On Tue, Feb 17, 2015 at 12:29 PM, Gwen Shapira <gshapira@cloudera.com>
> wrote:
>> I like the first option (Kafka + Flume cluster to HDFS cluster)
>> Flume doesn't actually benefit much from being local to HDFS, and as you
>> noticed - it may take resources from Spark and Impala.
>> Flume can live on the same nodes as Kafka, especially if you are using it
>> with the Kafka channel - Kafka can be a bit sensitive to serious memory or disk
>> utilization.
>> Hope this helps.
>> Gwen
>> On Tue, Feb 17, 2015 at 2:13 AM, Guillermo Ortiz <konstt2000@gmail.com>
>> wrote:
>>> Hi,
>>> I have Kafka and the DataNodes on different machines. I
>>> want to use Flume to get the data from Kafka and store it in HDFS. What's
>>> the best architecture? I assume that all the machines have access to
>>> each other.
>>> Cluster1 (Kafka + Flume) ---> Cluster2 (Hdfs)
>>> There is an agent on each machine where Kafka is installed, and the
>>> sink writes to HDFS directly. Some compression option could be
>>> configured in the sink, etc.
>>> Cluster1 (Kafka + Flume + Avro) --> Cluster2(Flume + Avro + HDFS)
>>> There is an agent on each machine where Kafka is installed. Flume
>>> sends data to another Flume agent through Avro, and the Flume agent
>>> installed on the DataNode writes the data to HDFS.
>>> Cluster1 (Kafka) --> Cluster2(Flume + HDFS)
>>> Flume is installed only on the DataNodes.
>>> I don't like installing Flume on the DataNodes because these machines
>>> run processes such as Spark, Hive, Impala and MapReduce, which spend
>>> many resources on their tasks. On the other hand, that is where the data
>>> has to be sent.
>>> I could configure more than one source to get data from Kafka and
>>> more than one Flume agent to have more than one VM.
>>> Could someone comment on the advantages and disadvantages they find in
>>> each scenario?
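
For comparison, the extra hop in the second option quoted above (Kafka +
Flume + Avro --> Flume + Avro + HDFS) could be sketched roughly as follows.
Only the Avro link between the two tiers is shown; the Kafka source, the
channels and the HDFS sink are omitted. Agent names, host name and port are
hypothetical.

```properties
# Tier 1, on each Kafka machine (hypothetical agent "collector"):
# forward events to a DataNode over Avro RPC.
collector.sinks = avro-out
collector.sinks.avro-out.type = avro
collector.sinks.avro-out.channel = mem-ch
collector.sinks.avro-out.hostname = datanode1.example.com
collector.sinks.avro-out.port = 4545

# Tier 2, on each DataNode (hypothetical agent "writer"):
# listen for the Avro events sent by tier 1.
writer.sources = avro-in
writer.sources.avro-in.type = avro
writer.sources.avro-in.bind = 0.0.0.0
writer.sources.avro-in.port = 4545
writer.sources.avro-in.channels = mem-ch
```

This layout adds a second agent and a second channel to operate, and it puts
a Flume JVM on the DataNodes, which is the cost raised in the original
question.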
