Hi Gintas,

What is the average (or expected maximum) size of the files you'd like to process?
In general it is not recommended to transfer large events (e.g. >64 MB if you use the file channel, as this is a hard limit of its protobuf implementation).
If your files fit within this limit, I'd suggest using an interceptor to fetch the data, update the event's body, and push it through Flume.
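A minimal sketch of the fetch step such an interceptor would perform (the real interceptor must implement org.apache.flume.interceptor.Interceptor and call something like this from intercept(); the class and method names here are my own, not Flume API):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class UrlFetcher {
    // Download all bytes behind a URL. In the interceptor, the URL
    // would come from the event body (the link read from Kafka),
    // and the returned bytes would replace the event body.
    public static byte[] fetch(String url) throws IOException {
        try (InputStream in = new URL(url).openStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Keep in mind that the fetch happens on the source's processing path, so slow downloads will slow down event delivery.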

In this case your setup would be:
Kafka source + data fetcher interceptor (custom code) -> file channel (or memory) -> HDFS sink
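As a rough sketch, the agent configuration for that setup could look like the following (agent name a1, broker address, topic name, HDFS path, and the interceptor class com.example.FileFetchInterceptor are all placeholders for your own values):

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Kafka source reading the links from Q1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = localhost:9092
a1.sources.r1.kafka.topics = Q1
a1.sources.r1.channels = c1

# Custom interceptor that fetches the file behind the link
# and replaces the event body with the file's contents
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.example.FileFetchInterceptor$Builder

# File channel (or memory channel)
a1.channels.c1.type = file

# HDFS sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/files
a1.sinks.k1.hdfs.fileType = DataStream
```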

If the files are larger, you could use a customised HDFS sink that fetches the URL and stores the file on HDFS.
In that case I'd recommend a Kafka channel -> custom HDFS sink setup, without configuring any source.
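The sourceless variant could be sketched like this (again, the broker address, topic, and the sink class com.example.UrlFetchingHdfsSink are placeholders, the latter being a custom sink you would write yourself):

```properties
a1.channels = c1
a1.sinks = k1

# Kafka channel reading the links from Q1; no source is configured
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = localhost:9092
a1.channels.c1.kafka.topic = Q1
a1.channels.c1.parseAsFlumeEvent = false

# Custom sink that fetches each link and writes the file to HDFS
a1.sinks.k1.type = com.example.UrlFetchingHdfsSink
a1.sinks.k1.channel = c1
```

Setting parseAsFlumeEvent = false lets the channel consume the raw Kafka messages (the links) directly, since they were not produced by a Flume source.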

Sink-side interceptors (https://issues.apache.org/jira/browse/FLUME-2580) would actually be a good fit for your problem, but unfortunately they are not implemented yet.


On Tue, Sep 5, 2017 at 2:00 PM Gintautas Sulskus <gintautas.sulskus@gmail.com> wrote:

I have a question regarding Flume suitability for a particular use case.

Task: There is a constant incoming stream of links that point to files. Those files are to be fetched and stored in HDFS.

Desired implementation:

1. Each link to a file is stored in Kafka queue Q1.
2. Flume A1.source monitors Q1 for new links.
3. Upon retrieving a link from Q1, A1.source fetches the file. The file is eventually stored in HDFS by A1.sink.

My concern here is the seemingly overloaded functionality of A1.source. It would have to perform two activities: 1) periodically poll queue Q1 for new links to files and 2) fetch those files.

What do you think? Is there a cleaner way to achieve this, e.g. by using an interceptor to fetch files? Would this be appropriate?