flume-user mailing list archives

From Gintautas Sulskus <gintautas.suls...@gmail.com>
Subject Re: Use case for Flume
Date Tue, 05 Sep 2017 19:13:31 GMT

Thanks for the quick replies, guys.

Donat, sorry, I do not have example configs. At the moment I am just
considering available solutions to tackle the problem at hand. I would very
much prefer Flume for its modular and scalable approach. I would like to
find an elegant solution that would be "native" to Flume.
I was considering the two-agent approach as well. But then, what would the
middle part look like? Which component would download the file? I assume I
would face the same problem as now.

Denes, files would be up to 5 megabytes in size. The interceptor approach
looks the most suitable in this situation.
Regarding the sink-side interceptor, wouldn't it have the same 64MB size
limit as the source-side one?


On 5 Sep 2017 16:54, "Denes Arvay" <denes@cloudera.com> wrote:

Hi GIntas,

What is the average (or expected maximum) size of the files you'd like to
transfer? In general it is not recommended to transfer large events (i.e.
>64MB if you use the file channel, as this is a hard limit of the protobuf
serialization it uses).
If your files fit into this limit then I'd suggest using an interceptor to
fetch the data, update the event's body, and push it through Flume.

In this case your setup would be:
Kafka source + data fetcher interceptor (custom code) -> file channel (or
memory) -> HDFS sink
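
A minimal sketch of that setup could look like the config below. The agent
name, topic, paths, and the interceptor class `com.example.DataFetcherInterceptor`
are placeholders; you would supply your own custom interceptor implementation:

```properties
# Hypothetical agent "a1": Kafka source + fetcher interceptor -> file channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Kafka source polling the links topic (Q1)
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = localhost:9092
a1.sources.r1.kafka.topics = Q1
a1.sources.r1.channels = c1

# Custom interceptor that downloads the linked file and replaces the event body
a1.sources.r1.interceptors = fetch
a1.sources.r1.interceptors.fetch.type = com.example.DataFetcherInterceptor$Builder

# File channel (or swap type to "memory")
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# HDFS sink writing the fetched file contents
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/files
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```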

If the files are larger, then you could use a customised HDFS sink which
fetches the URL and stores the file on HDFS.
In this case I'd recommend a Kafka channel -> custom HDFS sink setup
without configuring any source.
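
Sketched as a config, again with placeholder names (the sink class
`com.example.FetchingHdfsSink` stands in for your custom implementation):

```properties
# Hypothetical agent "a1": Kafka channel feeding a custom HDFS sink, no source
a1.channels = c1
a1.sinks = k1

# Kafka channel reading links directly from the topic (Q1)
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = localhost:9092
a1.channels.c1.kafka.topic = Q1
a1.channels.c1.parseAsFlumeEvent = false

# Custom sink that fetches each URL and writes the file to HDFS
a1.sinks.k1.type = com.example.FetchingHdfsSink
a1.sinks.k1.channel = c1
```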

Actually, sink-side interceptors would be a good solution for your problem
(https://issues.apache.org/jira/browse/FLUME-2580), but unfortunately they
are not implemented yet.


On Tue, Sep 5, 2017 at 2:00 PM Gintautas Sulskus <
gintautas.sulskus@gmail.com> wrote:

> Hi,
> I have a question regarding Flume suitability for a particular use case.
> Task: There is an incoming constant stream of links that point to files.
> Those files are to be fetched and stored in HDFS.
> Desired implementation:
> 1. Each link to a file is stored in Kafka queue Q1.
> 2. Flume A1.source monitors Q1 for new links.
> 3. Upon retrieving a link from Q1, A1.source fetches the file. The file
> eventually is stored in HDFS by A1.sink
> My concern here is the seemingly overloaded functionality of A1.source. The
> A1.source would have to perform two activities: 1.) periodically poll
> queue Q1 for new links to files and then 2.) fetch those files.
> What do you think? Is there a cleaner way to achieve this, e.g. by using
> an interceptor to fetch files? Would this be appropriate?
> Best,
> GIntas
