flume-user mailing list archives

From Hari Shreedharan <hshreedha...@cloudera.com>
Subject Re: HDFSChannel?
Date Thu, 13 Dec 2012 08:25:30 GMT
There are several reasons we did not want a channel loading events into the next hop/final destination itself.

One reason is to clearly define the responsibilities of each component in the system. The channel's responsibility is to be a buffer, and nothing more; you can see this from the Channel interface. (It is the same reason you want separate classes and methods to exist: in theory you could put everything into your main method and expect it to work, but in reality that is not something you want to do.)

Another important consideration is that such an architecture is going to hit issues because a transaction is owned by a source thread. By making the same transaction responsible for writing to HDFS, you create a tight coupling between the hop-1-to-hop-2 writes and the hop-2-to-HDFS writes, which is exactly what Flume strives to remove by providing the channel as a buffer.
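The decoupling described above can be illustrated generically (this is not Flume code, just a sketch of the buffer pattern): a bounded queue lets the receiving thread finish its work as soon as the event is buffered, independently of when, or whether, the writing thread succeeds.

```python
import queue
import threading

# Generic illustration (not Flume code): a bounded queue as the "channel"
# decouples the thread that receives events from the thread that writes them.

channel = queue.Queue(maxsize=100)  # the buffer between hops
written = []

def source(events):
    # The source's "transaction" ends once the event is safely in the buffer;
    # it does not wait for, or even know about, the downstream write.
    for e in events:
        channel.put(e)

def sink():
    # The sink drains the buffer in its own loop; a slow or failing write
    # stalls only this thread, not the source.
    while True:
        e = channel.get()
        if e is None:  # shutdown marker
            break
        written.append(e)  # stand-in for the HDFS write

t = threading.Thread(target=sink)
t.start()
source(["event-%d" % i for i in range(5)])
channel.put(None)
t.join()
```

With one transaction spanning both hops, a stalled HDFS write would block the upstream sender; with the buffer in between, only the sink thread waits.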

In addition, such a single-threaded source-sink coupling existed in Flume OG, where it caused major issues and introduced so much complexity that things became nearly impossible to debug.

In your case, if you have a channel that also does the writes within the same transaction, you are going to have complex issues when HDFS writes fail or time out (I guarantee you this is going to happen), and handling such issues is complex. And if you instead have an extra thread within the channel trying to clear the data out of the "HDFS channel," it is not any different from an HDFS Sink. Having no channel and just a source+sink is also going to make things quite complex, and you are going to have to do a lot of handling if and when you hit a failure.

I don't recommend such an approach, and I don't think the File channel is going to hurt your performance much, so that is what I'd recommend you use.
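For reference, a last-hop agent using the File channel in front of the HDFS sink might look like the following sketch (the agent name, ports, and paths are hypothetical; adjust them for your deployment):

```properties
# Hypothetical last-hop agent: Avro source -> File channel -> HDFS sink
agent1.sources = avro-in
agent1.channels = file-ch
agent1.sinks = hdfs-out

agent1.sources.avro-in.type = avro
agent1.sources.avro-in.bind = 0.0.0.0
agent1.sources.avro-in.port = 4141
agent1.sources.avro-in.channels = file-ch

# Durable buffer: events survive a crash between receipt and the HDFS write
agent1.channels.file-ch.type = file
agent1.channels.file-ch.checkpointDir = /var/lib/flume/checkpoint
agent1.channels.file-ch.dataDirs = /var/lib/flume/data

agent1.sinks.hdfs-out.type = hdfs
agent1.sinks.hdfs-out.channel = file-ch
agent1.sinks.hdfs-out.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
```

Because the File channel persists events to disk, data accepted from the previous hop is not lost if the agent dies before the HDFS sink drains it.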

Hari Shreedharan

On Wednesday, December 12, 2012 at 11:34 PM, Guy Peleg wrote:

> Say I have a multi-hop flow, and let's say the last agent stores its data in HDFS using the HDFS sink.
> In the last agent, as in every agent, there is the source-channel-sink trio. My question is: why do we need that channel if the only thing that agent does is store the events in HDFS (or another data store)?
> Won't it be more efficient to have an 'HDFSChannel' that is part of the transaction, and no sink at all? Otherwise I might need to use a persistent channel (JDBC, File) to make sure that data is not lost before
> it is moved to the sink, which again is redundant, since ideally I would like the incoming events on the 'last agent' to be stored as quickly as possible in their destination without paying the extra channel cost.
