I have a question regarding Flume suitability for a particular use case.
Task: there is a constant incoming stream of links pointing to files. Those files need to be fetched and stored in HDFS.
1. Each link to a file is stored in Kafka queue Q1.
2. Flume A1.source monitors Q1 for new links.
3. Upon retrieving a link from Q1, A1.source fetches the file. The file is eventually stored in HDFS by A1.sink.
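For reference, this is roughly the agent wiring I have in mind, as a Flume properties sketch (topic name, broker address, and HDFS path are placeholders). Note that a stock Kafka source would deliver the link text itself as the event body, not the fetched file, which is exactly where the extra fetch step has to go:

```properties
# Hypothetical agent A1: Kafka source -> channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = localhost:9092
a1.sources.r1.kafka.topics = Q1
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/files
a1.sinks.k1.channel = c1
```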
My concern is that A1.source seems overloaded: it would have to perform two activities, 1) periodically poll queue Q1 for new links to files and 2) fetch the files those links point to.
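To make the second activity concrete, the fetch step itself is just "resolve link, read bytes"; a minimal, self-contained sketch (the `fetch` helper is hypothetical, not part of any Flume API) would look like:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class LinkFetcher {

    // Fetch the file a link points to and return its contents as bytes.
    // In the Flume scenario these bytes would become the event body
    // handed to the channel and, ultimately, the HDFS sink.
    public static byte[] fetch(String link) throws IOException {
        try (InputStream in = new URL(link).openStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

The open question is where this logic should live: inside a custom source, or in an interceptor that swaps the link in the event body for the fetched bytes.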
What do you think? Is there a cleaner way to achieve this, e.g. by using an interceptor to fetch the files? Would that be appropriate?