flume-user mailing list archives

From Asim Zafir <asim.za...@gmail.com>
Subject Re: Flume with Kafka , Architecture.
Date Tue, 17 Feb 2015 21:40:24 GMT
Not sure what you mean by Kafka+Flume to HDFS, but in our experience we
have seen significant data loss when using Flume as the transport
mechanism for syncing data to HDFS. Some things that haven't worked for us:

1) Flume appenders at the source - installing appenders and a Flume agent on
the application server side caused serious performance issues. The
appenders appear to reach a deadlock state due to thread locking.
2) log4j v1 appenders are a bad option with Flume.
3) log4j v2 + the embedded agent solves the thread-locking problem and
relieves the stress on the application servers - since you now have one
less JVM to manage, there are no performance issues there. It really works
for any high-traffic server generating data.
4) Flume has issues with some meta characters (certain specific UTF codes):
it will truncate the commit to the data pipeline if it hits one of them and
the read on that character falls outside the limit of the read buffer.
Since there is no logging, it's painful even to troubleshoot.
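For reference, the log4j 2 setup in point 3 is configured through a
FlumeAppender running in embedded mode. The fragment below is only a
sketch - the host name, port, directories, and agent name are placeholder
assumptions, and the property names should be checked against the Flume
Appender documentation for your log4j 2 version:

```xml
<!-- log4j2.xml fragment: FlumeAppender running an embedded Flume agent.   -->
<!-- All hosts, ports, and paths below are illustrative placeholders.      -->
<Flume name="flumeEmbedded" type="Embedded">
  <!-- In-process file channel buffers events if the collector is down -->
  <Property name="channels">file</Property>
  <Property name="channels.file.type">file</Property>
  <Property name="channels.file.checkpointDir">/var/flume/checkpoint</Property>
  <Property name="channels.file.dataDirs">/var/flume/data</Property>
  <!-- Avro sink forwards events to a downstream Flume collector -->
  <Property name="sinks">agent1</Property>
  <Property name="sinks.agent1.channel">file</Property>
  <Property name="sinks.agent1.type">avro</Property>
  <Property name="sinks.agent1.hostname">collector.example.com</Property>
  <Property name="sinks.agent1.port">4141</Property>
</Flume>
```

Because the agent runs inside the application's own JVM, there is no
separate Flume process to manage on the application server.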

Asim Zafir

On Tue, Feb 17, 2015 at 12:29 PM, Gwen Shapira <gshapira@cloudera.com>
wrote:

> I like the first option (Kafka + Flume cluster to HDFS cluster)
> Flume doesn't actually benefit much from being local to HDFS, and as you
> noticed - it may take resources from Spark and Impala.
> Flume can live on the same nodes as Kafka, especially if you are using it
> with the Kafka channel - though Kafka can be a bit sensitive to serious
> memory or disk utilization.
> Hope this helps.
> Gwen
> On Tue, Feb 17, 2015 at 2:13 AM, Guillermo Ortiz <konstt2000@gmail.com>
> wrote:
>> Hi,
>> I have Kafka and DataNodes on different machines. I want to use Flume
>> to get the data from Kafka and store it in HDFS. What's the best
>> architecture? I assume that all the machines have access to the others.
>> Cluster1 (Kafka + Flume) ---> Cluster2 (HDFS)
>> There is an agent on each machine where Kafka is installed, and the
>> sink writes to HDFS directly; compression options could be configured
>> in the sink, etc.
>> Cluster1 (Kafka + Flume + Avro) --> Cluster2 (Flume + Avro + HDFS)
>> There is an agent on each machine where Kafka is installed. Flume
>> sends data to another Flume agent through Avro, and the Flume agent
>> installed on the DataNode writes the data to HDFS.
>> Cluster1 (Kafka) --> Cluster2 (Flume + HDFS)
>> Flume is installed only on the DataNodes.
>> I don't like installing Flume on the DataNodes because these machines
>> run processes such as Spark, Hive, Impala, and MapReduce, which spend
>> many resources on their tasks. On the other hand, that is where the
>> data has to be sent.
>> I could configure more than one source to get data from Kafka, and
>> more than one Flume agent to have more than one VM.
>> Could someone comment on the advantages and disadvantages of each
>> scenario?
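To make the first option concrete: a single Flume agent along the lines
Gwen describes would pair the Kafka channel directly with an HDFS sink, so
no source is needed at all. The configuration below is only a sketch -
agent name, broker/ZooKeeper addresses, topic, and paths are placeholder
assumptions, and the exact property names should be verified against the
Flume user guide for the version in use:

```properties
# Sketch: Flume agent reading from Kafka via the Kafka channel and
# writing to HDFS. All hosts, topics, and paths are placeholders.
tier1.channels = kafkaChannel
tier1.sinks = hdfsSink

# Kafka channel: events are consumed straight from a Kafka topic,
# so no Flume source is required.
tier1.channels.kafkaChannel.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.kafkaChannel.brokerList = kafka1:9092,kafka2:9092
tier1.channels.kafkaChannel.zookeeperConnect = zk1:2181
tier1.channels.kafkaChannel.topic = logs
tier1.channels.kafkaChannel.parseAsFlumeEvent = false

# HDFS sink: writes compressed, date-partitioned files.
tier1.sinks.hdfsSink.channel = kafkaChannel
tier1.sinks.hdfsSink.type = hdfs
tier1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/logs/%Y/%m/%d
tier1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
tier1.sinks.hdfsSink.hdfs.fileType = CompressedStream
tier1.sinks.hdfsSink.hdfs.codeC = snappy
tier1.sinks.hdfsSink.hdfs.rollInterval = 300
tier1.sinks.hdfsSink.hdfs.rollSize = 134217728
tier1.sinks.hdfsSink.hdfs.rollCount = 0
```

Such an agent can run on the Kafka nodes (or a separate small cluster),
keeping the DataNodes free for Spark, Hive, and Impala workloads.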
