flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steve Yates <sya...@stevendyates.com>
Subject Re: Analysis of Data
Date Fri, 08 Feb 2013 03:22:48 GMT
Thanks for your feedback Mike, I have been thinking about this a little more and just using
Mahout as an example I was considering the concept of somehow developing an enriched 'sink'
so to speak where it would accept input streams / msgs from a flume channel and onforward
specifically to a 'service' i.e Mahout service which would subsequently deliver the results
to the configured sink. So yes it would behave as an intercept->filter->process->sink
for applicable data items.

I apologise if that is still vague. It would be great to receive further feedback from the
user group.

-SteveMike Percy <mpercy@apache.org> wrote:Hi Steven,
Thanks for chiming in! Please see my responses inline:

On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <syates@stevendyates.com> wrote:
The only missing link within the Flume architecture I see in this conversation is the actual
channel's and brokers themselves which orchestrate this lovely undertaking of data collection.

Can you define what you mean by channels and brokers in this context? Since channel is a synonym
for queueing event buffer in Flume parlance. Also, can you elaborate more on what you mean
by orchestration? I think I know where you're going but I don't want to put words in your

One opportunity I do see (and I may be wrong) is for the data to offloaded into a system such
as Apache Mahout  before being sent to the sink. Perhaps the concept of a ChannelAdapter
of sorts? I.e Mahout Adapter ? Just thinking out loud and it may be well out of the question.

Why not a Mahout sink? Since Mahout often wants sequence files in a particular format to begin
its MapReduce processing (e.g. its k-Means clustering implementation), Flume is already a
good fit with its HDFS sink and EventSerializers allowing for writing a plugin to format your
data however it needs to go in. In fact that works today if you have a batch (even 5-minute
batch) use case. With today's functionality, you could use Oozie to coordinate kicking off
the Mahout M/R job periodically, as new data becomes available and the files are rolled.

Perhaps even more interestingly, I can see a use case where you might want to use Mahout to
do streaming / realtime updates driven by Flume in the form of an interceptor or a Mahout
sink. If online machine learning (e.g. stochastic gradient descent or something else online)
was what you were thinking, I wonder if there are any folks on this list who might have an
interest in helping to work on putting such a thing together.

In any case, I'd like to hear more about specific use cases for streaming analytics. :)


View raw message