flume-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Felix Giguere Villegas <felix.gigu...@mate1inc.com>
Subject Re: Best way to extract mysql bin logs?
Date Thu, 11 Aug 2011 21:00:34 GMT
Quick update.

Sometimes the simplest solutions are the best hehe... We decided to just use
a perl script to modify the data as it's being outputted by mysqlbinlog

perl -e 'while (<STDIN>) {s/\r|\n|\r\n/ /g;s/# at/\n# at/g;print;}'

We write this modified binlog to a file and tail it using flume. This way,
mysql events are grouped per line, and even if they get to their destination
in the wrong order, it's easy to sort them correctly downstream :)

Eventually, we will probably wrap it all up in a script and use the execStream
Flume source. This way, we won't need to write an intermediate/temporary
file to disk, which will allow us to have greater disk IO throughput for DFO
or E2E reliability...


On Tue, Aug 9, 2011 at 4:34 PM, Felix Giguere Villegas <
felix.giguere@mate1inc.com> wrote:

> Ah, so a source would make more sense than a decorator then. I see...
> We are definitely open to the idea of contributing back, and this is indeed
> probably something that a lot of people could use...
> We are still evaluating what we will do, as we have a lot of stuff going on
> at once, but if we do decide to develop a custom source, then I'll do as you
> suggest and open a JIRA issue.
> Thanks for your time and info :)
> --
> Felix
> On Tue, Aug 9, 2011 at 4:06 PM, Eric Sammer <esammer@cloudera.com> wrote:
>> Felix:
>> You definitely need to implement a custom source that knows how to
>> read the bin logs and pack each transaction into an event rather than
>> just tailing it. This will give you a discreet event for each
>> transaction that can be treated as a single unit and make downstream
>> processing MUCH easier.
>> Things to keep in mind:
>> * Flume does NOT guarantee order so make sure each event has a
>> timestamp or transaction ID that you can order by.
>> * Flume does NOT guarantee that you won't get duplicates so make sure
>> you have a globally unique transaction ID so you can deduplicate
>> transactions.
>> This would be interesting functionality to get back into Flume. If you
>> can / want to contribute it back in the form of a custom source, feel
>> free to open a JIRA so others can help / watch progress.
>> Thanks!
>> On Tue, Aug 9, 2011 at 11:42 AM, Felix Giguere Villegas
>> <felix.giguere@mate1inc.com> wrote:
>> > Hi :) !
>> >
>> > I have a use case where I want to keep a historical record of all the
>> > changes (insert/update/delete) happening on a MySQL DB.
>> >
>> > I am able to tail the bin logs and record them in HDFS, but they are not
>> > easy to parse because one operation is split on many lines. There are
>> some
>> > comments that include the timestamp, the total time it took to execute
>> the
>> > query and other stuff. A lot of this extra info is not relevant, but the
>> > timestamp is important for me, and I thought I might as well keep the
>> rest
>> > of the info as well since the raw data gives me the option of going back
>> to
>> > look for these other fields if I determine later on that I need them.
>> >
>> > Now, the fact that it's split over many lines makes it harder to use
>> with
>> > Map/Reduce.
>> >
>> > I have thought of using a custom M/R RecordReader but I still have the
>> > problem that some of the lines related to one operation will be at the
>> end
>> > of one HDFS file and the rest will be at the beginning of the next HDFS
>> > file, since I am opening and closing those files at an arbitrary roll
>> time.
>> >
>> > I think the easiest way would be to do some minimal ETL at the source. I
>> > think I could use a custom decorator for this. Basically, that decorator
>> > would group together on a single line all the bin log lines that relate
>> to a
>> > single DB operation. The original lines would be separated by
>> semi-colons or
>> > some other character in the final output.
>> >
>> > I wanted to check with you guys to see if that approach made sense. If
>> you
>> > have better suggestions, then I'm all ears, of course. Also, if you
>> think
>> > there is an easier way than reading the bin logs to accomplish my
>> original
>> > goal, then I'd like to hear about it as well :)
>> >
>> > Thanks :) !
>> >
>> > --
>> > Felix
>> >
>> >
>> --
>> Eric Sammer
>> twitter: esammer
>> data: www.cloudera.com

View raw message