flume-user mailing list archives

From Bessenyei Balázs Donát <bes...@apache.org>
Subject Re: Use case for Flume
Date Sat, 09 Sep 2017 09:27:57 GMT
Hi Gintas,

I can't think of a completely out-of-the-box Flume solution, but I
believe Flume does suit your needs.
The multi-agent setup is doable: you would have to either implement a
custom source (probably based on the Avro Source) or implement an
interceptor to do the downloading, as previously discussed.
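For the interceptor route, here is a minimal sketch of the download helper such an interceptor could build on. The class and method names are my own, not part of Flume, and the Flume Interceptor wiring itself is omitted; only the size-capped read is shown, since blowing past the channel's event size budget is the main risk:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for a custom Flume interceptor: reads the file behind
// a link into memory, refusing anything over the event size budget.
// A real interceptor's intercept(Event) would call this with the stream
// opened from the URL in the event body, then replace the body with the bytes.
public class LinkFetcher {

    // The file channel's protobuf implementation has a hard 64 MB limit
    // (mentioned in this thread); in practice you would configure a much
    // smaller budget. The value here is illustrative.
    static final int MAX_BYTES = 64 * 1024 * 1024;

    static byte[] readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("file exceeds " + maxBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = readCapped(
                new ByteArrayInputStream("hello".getBytes()), MAX_BYTES);
        System.out.println(data.length); // 5
    }
}
```

Failing fast on oversized files (instead of buffering them fully) keeps a single bad link from stalling the channel.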

If you need any further help, please let us know.

Thank you,


2017-09-05 21:13 GMT+02:00 Gintautas Sulskus <gintautas.sulskus@gmail.com>:
> Hi,
> Thanks for the quick replies, guys.
> Donat, sorry, I do not have example configs. At the moment I am just
> considering available solutions to tackle the problem at hand. I would very
> much prefer Flume for its modular and scalable approach. I would like to
> find an elegant solution that would be "native" to Flume.
> I was considering the two-agent approach as well. But then, how would the
> middle part look? What component would download the file? I assume I would
> face the same problem as now.
> Denes, files would be up to 5 megabytes in size. The interceptor approach
> looks the most suitable in this situation.
> Regarding the sink-side interceptor, wouldn't it have the same 64MB size
> limit as the source-side one?
> Best,
> Gintas
> On 5 Sep 2017 16:54, "Denes Arvay" <denes@cloudera.com> wrote:
> Hi Gintas,
> What is the average (or expected maximum) size of the files you'd like to
> process?
> In general it is not recommended to transfer large events (i.e. >64 MB if
> you use the file channel, as this is a hard limit of its protobuf
> implementation).
> If your files fit into this limit then I'd suggest using an interceptor to
> fetch the data, update the event's body, and push it through Flume.
> In this case your setup would be:
> Kafka source + data fetcher interceptor (custom code) -> file channel (or
> memory) -> HDFS sink
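The topology Denes describes could be wired up roughly like this. The agent and component names (`a1`, `kafkaSrc`, `fc`, `hdfsSink`) and the interceptor class are illustrative; the interceptor is the custom code discussed in this thread, not something shipped with Flume:

```properties
a1.sources = kafkaSrc
a1.channels = fc
a1.sinks = hdfsSink

a1.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafkaSrc.kafka.bootstrap.servers = kafka1:9092
a1.sources.kafkaSrc.kafka.topics = file-links
a1.sources.kafkaSrc.channels = fc
# Hypothetical custom interceptor that downloads the file behind the link
# in the event body and replaces the body with the file's contents.
a1.sources.kafkaSrc.interceptors = fetch
a1.sources.kafkaSrc.interceptors.fetch.type = com.example.flume.UrlFetchInterceptor$Builder

a1.channels.fc.type = file
a1.channels.fc.checkpointDir = /var/flume/checkpoint
a1.channels.fc.dataDirs = /var/flume/data

a1.sinks.hdfsSink.type = hdfs
a1.sinks.hdfsSink.channel = fc
a1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/files
a1.sinks.hdfsSink.hdfs.fileType = DataStream
```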
> If the files are larger then you could use a customised HDFS sink which
> fetches the URL and stores the file on HDFS.
> In this case I'd recommend using a Kafka channel -> custom HDFS sink setup
> without configuring any source.
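The sourceless variant could be sketched as follows. Again, the agent and component names and the custom sink class are illustrative:

```properties
a1.channels = kc
a1.sinks = hdfsSink

a1.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kc.kafka.bootstrap.servers = kafka1:9092
a1.channels.kc.kafka.topic = file-links
# Events are raw link strings from Kafka, not serialized Flume events.
a1.channels.kc.parseAsFlumeEvent = false

# Hypothetical customised HDFS sink that reads the link from the event body,
# fetches the file itself, and writes it straight to HDFS.
a1.sinks.hdfsSink.type = com.example.flume.UrlFetchingHdfsSink
a1.sinks.hdfsSink.channel = kc
```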
> Actually, for your problem a sink-side interceptor would be a good
> solution (https://issues.apache.org/jira/browse/FLUME-2580), but
> unfortunately it is not implemented yet.
> Regards,
> Denes
> On Tue, Sep 5, 2017 at 2:00 PM Gintautas Sulskus
> <gintautas.sulskus@gmail.com> wrote:
>> Hi,
>> I have a question regarding Flume suitability for a particular use case.
>> Task: There is a constant incoming stream of links that point to files.
>> Those files are to be fetched and stored in HDFS.
>> Desired implementation:
>> 1. Each link to a file is stored in Kafka queue Q1.
>> 2. Flume A1.source monitors Q1 for new links.
>> 3. Upon retrieving a link from Q1, A1.source fetches the file. The file
>> is eventually stored in HDFS by A1.sink.
>> My concern here is the seemingly overloaded functionality of A1.source. The
>> A1.source would have to perform two activities: 1) periodically poll
>> queue Q1 for new links to files and then 2) fetch those files.
>> What do you think? Is there a cleaner way to achieve this, e.g. by using
>> an interceptor to fetch files? Would this be appropriate?
>> Best,
>> Gintas
