Would it be feasible to consider adding another extension point to Flume for custom filtering, enrichment, routing, etc., without evolving Flume into something it was never designed for (i.e. without going overboard)? The concept of some sort of intermediate processing unit is quite attractive to me personally: I have dedicated AvroSources purely for aggregating data, but in the interest of modularisation I may want to perform some enrichment/filtering exercise before I dump the events on my durable channel. I guess this also raises the conversation of flow, and some sort of declarative way of configuring the ordering of the processing units. Just thinking out loud.
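For per-event work, something close to this already exists: Flume's interceptor chains are attached to a source and run in the order they are declared, which gives you exactly that declarative ordering of processing units. A minimal sketch (the agent name "a1" and component names are made up for illustration; `host` and `regex_filter` are stock Flume interceptor types):

```properties
# Hypothetical agent "a1": an Avro source feeding a durable file channel.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# Interceptors run in the declared order: enrich first, then filter.
a1.sources.r1.interceptors = i1 i2
# Enrichment: stamps each event with the originating host header.
a1.sources.r1.interceptors.i1.type = host
# Filtering: drop events whose body matches the regex.
a1.sources.r1.interceptors.i2.type = regex_filter
a1.sources.r1.interceptors.i2.regex = ^DEBUG
a1.sources.r1.interceptors.i2.excludeEvents = true

a1.channels.c1.type = file
```

Interceptors only cover per-event transform/drop decisions on a single agent, so this doesn't rule out a richer extension point, but it is the existing hook for filtering and enrichment before events hit the channel.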
@Nitin/Mike, your experience in the field will help to validate this further.
Quoting Nitin Pawar <email@example.com>:

Mike, yes, I am not against the approach of Flume doing it. I would love to see it as part of Flume (it of course helps to remove the overload of a single processing engine). As Flume already supports the grouping of agents, the normal route of acquisition and sink can continue. In another route, we can have it sink to a processor source of Flume, which then converts the data, runs quick analysis on the data in memory, and updates the global-counters kind of things, which can then be sunk to live
On Fri, Feb 8, 2013 at 2:26 PM, Mike Percy <firstname.lastname@example.org> wrote:

Nitin,
Good to hear more of your thoughts. Please see inline.
On Thu, Feb 7, 2013 at 8:55 PM, Nitin Pawar <email@example.com> wrote:

I can understand the idea of having data processed inside Flume by streaming it to another Flume agent. But do we really need to re-engineer something inside Flume is what I am thinking? The core Flume dev team may have better ideas on this, but currently for streaming data processing Storm is a

Flume does have an open JIRA on this integration: FLUME-1286 <https://issues.apache.org/jira/browse/FLUME-1286>. Yes, a Storm sink could be useful. But that wouldn't preclude us from taking a hard look at what may be missing in Flume itself, right?
It will be interesting to draw up performance comparisons if the data processing logic is added to Flume. We do currently see people doing a little bit of pre-processing of their data (they have their own custom channel types where they modify the data and sink it).
It sounds like you have some experience with Flume. Are you guys using it
I work with a lot of folks to set up and deploy Flume, many of whom do lookups / joins with other systems, transformations, etc. in real time along their data ingest pipeline before writing the data to HDFS or HBase for further processing and archival. I wouldn't say these are really heavy number-crunching implementations in Flume, but I certainly see a lot of inline parsing, inspection, enrichment, routing, and the like going on. I think Flume could do a lot more, given the right abstractions.
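To make the "right abstractions" point concrete, here is a rough sketch in plain Java of what an ordered enrich/filter chain over events could look like. `SimpleEvent`, `Processor`, and `ProcessingChain` are made-up names for illustration, with a simplified stand-in for Flume's `org.apache.flume.Event` (header map plus byte-array body); the real per-event hook in Flume is `org.apache.flume.interceptor.Interceptor`, whose `intercept()` similarly returns null to drop an event.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a Flume-style event: string headers plus a byte[] body.
// (Hypothetical class for illustration; Flume's real Event interface is similar.)
class SimpleEvent {
    final Map<String, String> headers = new HashMap<>();
    final byte[] body;
    SimpleEvent(byte[] body) { this.body = body; }
}

// A processing unit in the spirit of a Flume interceptor: it may enrich an
// event (add headers) or filter it out (return null to drop it).
interface Processor {
    SimpleEvent process(SimpleEvent event);
}

public class ProcessingChain {
    // Run each event through an ordered chain of processors; dropped events
    // never reach the output (i.e. never reach the channel).
    static List<SimpleEvent> run(List<Processor> chain, List<SimpleEvent> in) {
        List<SimpleEvent> out = new ArrayList<>();
        for (SimpleEvent e : in) {
            SimpleEvent current = e;
            for (Processor p : chain) {
                current = p.process(current);
                if (current == null) break; // filtered out
            }
            if (current != null) out.add(current);
        }
        return out;
    }

    public static void main(String[] args) {
        // Enrichment step: tag every event with the agent that saw it.
        Processor enrich = e -> { e.headers.put("agent", "collector-1"); return e; };
        // Filtering step: drop empty events before they hit the durable channel.
        Processor filter = e -> e.body.length == 0 ? null : e;

        List<SimpleEvent> batch = new ArrayList<>();
        batch.add(new SimpleEvent("payload".getBytes()));
        batch.add(new SimpleEvent(new byte[0]));

        List<SimpleEvent> result = run(List.of(enrich, filter), batch);
        System.out.println(result.size());                      // 1
        System.out.println(result.get(0).headers.get("agent")); // collector-1
    }
}
```

The ordering of the chain is just the ordering of the list, which is what a declarative configuration of processing units would boil down to.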