flume-user mailing list archives

From Alain RODRIGUEZ <arodr...@gmail.com>
Subject Re: Logs used as Flume Sources and real-time analytics
Date Thu, 09 Feb 2012 09:37:51 GMT
Hi again :).

1 - How do we replay all the logs after a crash, given that we rotate
them? I mean, we tail a specific file, so how can we tell tail to also
replay the old logs that have already been rotated?

2 - To avoid duplicates, couldn't we just use checkpoints inserted into
the logs, say every hour? When something crashes I would only have to
erase every entry that comes after the last checkpoint and replay the
logs from that checkpoint. Is this a bad idea?
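
To make the idea concrete, here is a rough sketch of the replay side
(the "#CHECKPOINT" marker format is something I just made up):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class CheckpointReplay {
      // Collect every log line written after the last checkpoint marker,
      // i.e. the entries to erase downstream and then replay.
      public static List<String> linesAfterLastCheckpoint(String path)
          throws IOException {
        List<String> pending = new ArrayList<String>();
        BufferedReader r = new BufferedReader(new FileReader(path));
        try {
          String line;
          while ((line = r.readLine()) != null) {
            if (line.startsWith("#CHECKPOINT ")) {
              pending.clear(); // everything before this marker is safe
            } else {
              pending.add(line);
            }
          }
        } finally {
          r.close();
        }
        return pending;
      }
    }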

3 - What should I use to store the checkpoint separately from the real
log? Are decorators meant for this kind of work?
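
What I picture is roughly this (plain Java just to show the shape; I
don't know the actual Flume decorator interfaces):

    // The decorator idea as I picture it: wrap a sink and inject extra
    // checkpoint records into the same stream. Not the real Flume API.
    interface Sink {
      void append(String record);
    }

    class CheckpointDecorator implements Sink {
      private final Sink wrapped;
      private long lastCheckpoint = System.currentTimeMillis();

      CheckpointDecorator(Sink wrapped) {
        this.wrapped = wrapped;
      }

      public void append(String record) {
        long now = System.currentTimeMillis();
        if (now - lastCheckpoint >= 3600 * 1000L) { // hourly marker
          wrapped.append("#CHECKPOINT " + now);
          lastCheckpoint = now;
        }
        wrapped.append(record);
      }
    }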

4 - I would like to use Cassandra for log storage. I saw some plugins
that provide Cassandra sinks, but I would like to store the data in a
custom layout. How can I do that? Do I need to build a custom
plugin/sink?
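
From the 0.9.x plugin guide, I would expect a custom sink to look
roughly like this (the class and method names are as far as I remember
the guide, so treat them as an assumption; the Cassandra parts are only
comments):

    import java.io.IOException;

    import com.cloudera.flume.core.Event;
    import com.cloudera.flume.core.EventSink;

    public class CustomCassandraSink extends EventSink.Base {
      @Override
      public void open() throws IOException {
        // open the Thrift/Hector connection to the Cassandra cluster
      }

      @Override
      public void append(Event e) throws IOException {
        byte[] body = e.getBody(); // the raw log record
        // decode the record and write it into my own column layout
      }

      @Override
      public void close() throws IOException {
        // release the Cassandra connection
      }

      // Plus a static builder() returning a SinkBuilder, so the sink
      // can be registered as a plugin in flume-site.xml.
    }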

5 - My business process also uses my Cassandra DB (without flume,
directly via thrift). How can I ensure that writing the logs won't
overload my database and introduce latency into my business process? I
mean, is there a way to manage the throughput sent by flume's tails and
slow them down when my Cassandra cluster is overloaded?
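
Roughly, I am thinking of something like a token bucket in front of the
Cassandra writes (plain Java, just to show what I mean by slowing the
writers down):

    // A simple token-bucket throttle in front of the Cassandra writes.
    // Plain Java to show the idea; not tied to any flume API.
    public class WriteThrottle {
      private final double ratePerSec; // max writes per second
      private double tokens;
      private long last = System.nanoTime();

      public WriteThrottle(double ratePerSec) {
        this.ratePerSec = ratePerSec;
        this.tokens = ratePerSec; // start with a full bucket
      }

      // Block until one more write is allowed.
      public synchronized void acquire() throws InterruptedException {
        while (true) {
          long now = System.nanoTime();
          tokens = Math.min(ratePerSec,
              tokens + (now - last) / 1e9 * ratePerSec);
          last = now;
          if (tokens >= 1.0) {
            tokens -= 1.0;
            return;
          }
          Thread.sleep(10); // wait for the bucket to refill
        }
      }
    }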

I hope I'm not flooding you with all these (stupid?) questions.

Alain

2012/2/7 alo alt <wget.null@googlemail.com>

> Yes, I fully agree. Tailing is a useful mechanism, but since we also
> have to deliver on time and reliably, the core team decided to remove
> that feature. In your case tail makes sense; in a session-based
> application (bank, travel, car rental, pizza service and so on) it does
> not. One missing token or session can do harm.
>
> For flumeNG another source is implemented, the exec source. There you
> can easily run a tail command, but then you have to make sure yourself
> that everything keeps running well. New users I would point to flumeNG,
> because flume and flumeNG are not compatible; flumeNG is a complete
> rewrite. I think once flumeNG releases its next milestone, support for
> flume will slowly wind down.
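>
> A rough sketch of what such an exec/tail setup could look like in
> flumeNG (the agent, channel and file names are invented, and this
> assumes a build that already uses the properties-file configuration):
>
>   agent1.sources = tailsrc
>   agent1.channels = mem1
>   agent1.sinks = logsink
>
>   # run tail -F and turn each line into an event
>   agent1.sources.tailsrc.type = exec
>   agent1.sources.tailsrc.command = tail -F /var/log/app/events.log
>   agent1.sources.tailsrc.channels = mem1
>
>   agent1.channels.mem1.type = memory
>
>   agent1.sinks.logsink.type = logger
>   agent1.sinks.logsink.channel = mem1
>
> But as said, you have to watch the tail process yourself.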
>
> best, and thanks for the discussion,
>  Alex
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
>
> On Feb 7, 2012, at 1:16 PM, Michal Taborsky wrote:
>
> > Hi Alex,
> >
> > truth be told, I am quite satisfied with the file tailing, and I'll
> > try to explain why I like it. The main reason is that, at least for
> > us, the web application itself is business critical and the event
> > collection is not. Writing to a plain file is something that rarely
> > fails, and when it fails, it fails quickly and in a controlled
> > fashion. But piping to a flume agent, for example? How sure can I be
> > that the write will work all the time or fail immediately? That it
> > will not wait for some timeout or other? Or throw some unexpected
> > error and bring down the app?
> >
> > The other aspect is simple development and debugging. Any developer
> > can read a plain file and check whether the data he's writing is
> > correct, but with any more sophisticated method you either need a more
> > complicated testing environment or redirection switches that write to
> > files in development and to flume in testing and production, which
> > complicates things.
> >
> > --
> > Michal Táborský
> > chief systems architect
> > Netretail Holding, BV
> > nrholding.com
> >
> >
> >
> >
> > 2012/2/7 alo alt <wget.null@googlemail.com>
> > Hi,
> >
> > sorry for pitching in, but FlumeNG will not support tailing sources,
> > because we had a lot of problems there. The first, and worst, problem
> > is the marker in a tailed file. If the agent, the server, or the
> > collector crashes, the marker is lost. So if you restart, you'll get
> > all the events again. Sure, you can use append instead, but then you
> > can lose events.
> >
> > For an easy migration from flume to flumeNG, use sources which are
> > supported in NG, syslog for example. More sources can be found here:
> > https://cwiki.apache.org/FLUME/flume-ng.html
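> >
> > For example, a syslog source in NG could be configured roughly like
> > this (agent and channel names are invented here; check the wiki page
> > above for the real options):
> >
> >   agent1.sources = syslogsrc
> >   agent1.sources.syslogsrc.type = syslogtcp
> >   agent1.sources.syslogsrc.host = 0.0.0.0
> >   agent1.sources.syslogsrc.port = 5140
> >   agent1.sources.syslogsrc.channels = mem1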
> >
> > You could use Avro for the sessions, and you could pipe directly to a
> > local flume agent. Syslog with a buffering mode could also work. And
> > in flumeNG we now have an hBase handler and thrift.
> > Another idea for collecting sessions could be
> > http://hadoop.apache.org/common/docs/r1.0.0/webhdfs.html , a REST API
> > for hdfs.
> >
> > - Alex
> >
> >
> > --
> > Alexander Lorenz
> > http://mapredit.blogspot.com
> >
> > On Feb 7, 2012, at 11:14 AM, Alain RODRIGUEZ wrote:
> >
> > > Thank you for your answer, it helps me a lot to know I'm doing
> > > things the right way.
> > >
> > > I've got another question: how do you manage restarting the service
> > > after a crash? I mean, you tail the log file, so if your server
> > > crashes or you stop the tail for any reason, how do you avoid
> > > tailing all the logs from the start? How do you manage restarting
> > > from the exact point where you left your tail process?
> > >
> > > Thanks again for your help, I really appreciate it :-).
> > >
> > > Alain
> > >
> > > 2012/2/2 Michal Taborsky <michal.taborsky@nrholding.com>
> > > Hello Alain,
> > >
> > > we are using Flume for probably the same purposes. We are writing
> > > JSON-encoded event data to a flat file on every application server.
> > > Since each application server writes only maybe tens of events per
> > > second, the performance hit of writing to disk is negligible (and
> > > the events are written to disk only after the content is generated
> > > and sent to the user, so there is no latency for the end user). This
> > > file is tailed by Flume and delivered through collectors to HDFS.
> > > The collectors fork the events to RabbitMQ as well. We have a
> > > Node.js application that picks up these events and does some
> > > real-time analytics on them. The delay between event origination and
> > > analytics is below 10 seconds, usually 1-3 seconds in total.
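> > >
> > > To illustrate the flat-file side, a minimal sketch in Java of the
> > > kind of append we do (the field names and path are made up; the
> > > point is one JSON event per line, which is what Flume then tails):
> > >
> > >   import java.io.FileWriter;
> > >   import java.io.IOException;
> > >
> > >   public class EventLogger {
> > >     // Append one JSON-encoded event per line so tail picks it up.
> > >     public static synchronized void logEvent(String type, String user)
> > >         throws IOException {
> > >       String json = String.format(
> > >           "{\"ts\":%d,\"type\":\"%s\",\"user\":\"%s\"}",
> > >           System.currentTimeMillis(), type, user);
> > >       FileWriter w = new FileWriter("/var/log/app/events.log", true);
> > >       try {
> > >         w.write(json + "\n");
> > >       } finally {
> > >         w.close();
> > >       }
> > >     }
> > >   }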
> > >
> > > Hope this helps.
> > >
> > > --
> > > Michal Táborský
> > > chief systems architect
> > > Netretail Holding, BV
> > > nrholding.com
> > >
> > >
> > >
> > >
> > > 2012/2/2 Alain RODRIGUEZ <arodrime@gmail.com>
> > > Hi,
> > >
> > > I'm new to Flume and I'd like to use it to get a steady flow of
> > > data into my database (to be able to handle rush hours by delaying
> > > the database writes, without introducing any timeout or latency for
> > > the user).
> > >
> > > My questions are:
> > >
> > > What is the best way to create the log file that will be used as
> > > the source for flume?
> > >
> > > Our production environment runs Apache servers and PHP scripts.
> > > I can't just use the access log, because some information is stored
> > > in the session, so I need to build a custom source.
> > > Another point is that writing to a file seems primitive and not
> > > really efficient, since it hits the disk instead of memory for every
> > > event I store (many events every second).
> > >
> > > How can I use this system (as Facebook does with Scribe) to do
> > > real-time analytics?
> > >
> > > I'm open to hearing about hdfs, hbase or whatever else could help
> > > me reach my goals, which are a steady flow to the database and near
> > > real-time analytics (seconds to minutes).
> > >
> > > Thanks for your help.
> > >
> > > Alain
> > >
> > >
> >
> >
>
>
