flume-user mailing list archives

From alo alt <wget.n...@googlemail.com>
Subject Re: Logs used as Flume Sources and real-time analytics
Date Tue, 07 Feb 2012 12:46:44 GMT
Yes, I fully agree. Tailing is a useful mechanism, but since we also have to deliver events
on time and reliably, the core team decided to remove that feature. In your case tailing makes
sense; in a session-based application (banking, travel, car rental, pizza delivery and so on)
it does not, because one missing token or session can do real harm.

For Flume NG another option is implemented, the exec source. With it you can easily run a tail
command, but then it is up to you to make sure that everything keeps running well. For new
users I would point to Flume NG anyway, because Flume and Flume NG are not compatible at all;
Flume NG is a complete rewrite. I think once Flume NG releases its next milestone, support
for the old Flume will slowly wind down.
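
As a rough illustration, an exec source running tail in a Flume NG properties file could look
something like the sketch below. This is untested; the agent, channel, and sink names and the
log path are all made up, and the logger sink is just a stand-in for a real one:

  # hypothetical agent named agent1
  agent1.sources = tailsrc
  agent1.channels = mem
  agent1.sinks = logsink

  # exec source running tail -F; note Flume NG keeps no marker of its
  # position in the file, so a crash can duplicate or drop events
  agent1.sources.tailsrc.type = exec
  agent1.sources.tailsrc.command = tail -F /var/log/app/events.log
  agent1.sources.tailsrc.channels = mem

  agent1.channels.mem.type = memory

  agent1.sinks.logsink.type = logger
  agent1.sinks.logsink.channel = mem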

best, and thanks for the discussion,
 Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com

On Feb 7, 2012, at 1:16 PM, Michal Taborsky wrote:

> Hi Alex,
> 
> truth be told, I am quite satisfied with the file tailing and I'll try to explain why I like
> it. The main reason is that, at least for us, the web application itself is business critical
> and the event collection is not. Writing to a plain file is something that rarely fails, and
> if it fails, it fails quickly and in a controlled fashion. But piping to a Flume agent, for
> example? How sure can I be that the write will work every time or fail immediately? That it
> will not wait on some timeout or other? Or throw some unexpected error and bring down the
> app?
> 
> The other aspect is simple development and debugging. Any developer can read a plain file
> and check whether the data he's writing is correct, but with any more sophisticated method
> you either need a more complicated testing environment or redirection switches that write
> to files in development and to Flume in testing and production, which complicates things.
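>
> Just to make that concrete, the kind of redirection switch I mean could look roughly like
> this (a Python sketch; the environment variable, file path, and endpoint are all invented):
>
>   import json
>   import os
>   import socket
>
>   EVENT_LOG = "/var/log/app/events.log"   # hypothetical event file
>   FLUME_ADDR = ("localhost", 5140)        # hypothetical Flume/syslog endpoint
>
>   def write_event(event):
>       line = json.dumps(event) + "\n"
>       if os.environ.get("EVENT_SINK", "file") == "file":
>           # development: append one JSON object per line to a plain file
>           with open(EVENT_LOG, "a") as f:
>               f.write(line)
>       else:
>           # testing/production: ship the same line over the network
>           sock = socket.create_connection(FLUME_ADDR, timeout=0.1)
>           try:
>               sock.sendall(line.encode("utf-8"))
>           finally:
>               sock.close()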
> 
> --
> Michal Táborský
> chief systems architect
> Netretail Holding, BV
> nrholding.com
> 
> 
> 
> 
> 2012/2/7 alo alt <wget.null@googlemail.com>
> Hi,
> 
> sorry for pitching in, but Flume NG will not support tailing sources, because we had a lot
> of problems with them. The first and worst problem is the marker, i.e. the saved position in
> the tailed file. If the agent, the server, or the collector crashes, the marker is lost, so
> after a restart you get all the events delivered again. Sure, you can resume at the end of
> the file instead, but then you lose events.
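>
> To make the marker problem concrete, here is a sketch (Python; the paths and the deliver
> function are invented) of what a tailing source has to do. The window between delivering a
> line and saving the offset is exactly where a crash causes duplicates, and seeking to the
> end of the file instead causes losses:
>
>   LOG = "/var/log/app/events.log"    # hypothetical tailed file
>   MARKER = "/var/lib/agent/offset"   # hypothetical marker file
>
>   def deliver(line):
>       print(line, end="")            # stand-in for sending downstream
>
>   def read_new_events():
>       try:
>           with open(MARKER) as m:
>               offset = int(m.read())
>       except (IOError, ValueError):
>           offset = 0                 # marker lost -> replay from the start
>       with open(LOG) as f:
>           f.seek(offset)
>           line = f.readline()
>           while line:
>               deliver(line)          # crash here: offset not saved yet,
>               line = f.readline()    # so these events get delivered twice
>           offset = f.tell()
>       with open(MARKER, "w") as m:
>           m.write(str(offset))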
> 
> For an easy migration from Flume to Flume NG, use sources which are supported in NG, syslog
> for example. More sources are listed here: https://cwiki.apache.org/FLUME/flume-ng.html
> 
> You could use Avro for the sessions and pipe directly to a local Flume agent. Syslog with a
> buffering mode could also work. Also, in Flume NG we now have an HBase handler and Thrift.
> Another idea for collecting sessions could be WebHDFS, a REST API for HDFS:
> http://hadoop.apache.org/common/docs/r1.0.0/webhdfs.html
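>
> As a rough sketch of the two-step create that the WebHDFS docs describe (the namenode host,
> user name, and target path here are invented, and the Python requests library is assumed):
>
>   import requests
>
>   NAMENODE = "http://namenode:50070"       # hypothetical namenode
>   path = "/user/flume/sessions/part-0001"  # hypothetical target file
>
>   # step 1: ask the namenode where to write; it answers with a 307
>   # redirect pointing at a datanode
>   r = requests.put(
>       NAMENODE + "/webhdfs/v1" + path,
>       params={"op": "CREATE", "user.name": "flume", "overwrite": "false"},
>       allow_redirects=False,
>   )
>   datanode_url = r.headers["Location"]
>
>   # step 2: send the actual bytes to the datanode we were redirected to
>   r = requests.put(datanode_url, data=b'{"session": "..."}\n')
>   r.raise_for_status()                     # expect 201 Created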
> 
> - Alex
> 
> 
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
> 
> On Feb 7, 2012, at 11:14 AM, Alain RODRIGUEZ wrote:
> 
> > Thank you for your answer, it helps me a lot to know I'm doing things the right way.
> >
> > I've got another question: how do you manage restarting the service after a crash? I mean,
> > you tail the log file, so if your server crashes or you stop the tail for any reason, how
> > do you avoid tailing all the logs from the start? How do you manage restarting from the
> > exact point where you left your tail process?
> >
> > Thanks again for your help, I really appreciate it :-).
> >
> > Alain
> >
> > 2012/2/2 Michal Taborsky <michal.taborsky@nrholding.com>
> > Hello Alain,
> >
> > we are using Flume for probably the same purposes. We are writing JSON encoded event data
> > to a flat file on every application server. Since each application server writes only maybe
> > tens of events per second, the performance hit of writing to disk is negligible (and the
> > events are written to disk only after the content is generated and sent to the user, so
> > there is no latency for the end user). This file is tailed by Flume and delivered through
> > collectors to HDFS. The collectors fork the events to RabbitMQ as well. We have a Node.js
> > application that picks up these events and does some real-time analytics on them. The delay
> > between event origination and analytics is below 10 seconds, usually 1-3 seconds in total.
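> >
> > For a flavor of the consuming side, here is a minimal sketch of such a real-time counter
> > (in Python with the pika client rather than our actual Node.js code; the queue name and
> > event shape are invented):
> >
> >   import json
> >   import pika   # RabbitMQ client, assumed installed
> >
> >   # hypothetical queue that the collectors fork events into
> >   conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
> >   channel = conn.channel()
> >   channel.queue_declare(queue="events")
> >
> >   counts = {}
> >
> >   def on_event(ch, method, properties, body):
> >       event = json.loads(body)
> >       etype = event.get("type", "unknown")
> >       counts[etype] = counts.get(etype, 0) + 1   # rolling per-type count
> >       print(etype, counts[etype])
> >
> >   channel.basic_consume(queue="events", on_message_callback=on_event,
> >                         auto_ack=True)
> >   channel.start_consuming()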
> >
> > Hope this helps.
> >
> > --
> > Michal Táborský
> > chief systems architect
> > Netretail Holding, BV
> > nrholding.com
> >
> >
> >
> >
> > 2012/2/2 Alain RODRIGUEZ <arodrime@gmail.com>
> > Hi,
> >
> > I'm new to Flume and I'd like to use it to get a stable flow of data to my database (to
> > be able to handle rush hours by delaying the database writes, without introducing any
> > timeouts or latency for the user).
> >
> > My questions are:
> >
> > What is the best way to create the log file that will be used as a source for Flume?
> >
> > Our production environment is running Apache servers and PHP scripts.
> > I can't just use the access log because some information is stored in the session, so I
> > need to build a custom source.
> > Another point is that writing to a file seems primitive and not really efficient, since it
> > hits the disk instead of memory for every event I store (many events every second).
> >
> > How can I use this system (as Facebook does with Scribe) to do real-time analytics?
> >
> > I'm open to hearing about HDFS, HBase or whatever could help me reach my goals, which are
> > a stable flow to the database and near real-time analytics (seconds to minutes).
> >
> > Thanks for your help.
> >
> > Alain
> >
> >
> 
> 

