flume-user mailing list archives

From Stephen Layland <stephen.layl...@gmail.com>
Subject Re: Syslog Source Performance: (Was: is collectorSink(dest, prefix, millis, format) broken or am istupid?)
Date Tue, 20 Sep 2011 22:36:40 GMT
I recently found https://issues.cloudera.org/browse/FLUME-648 which I
noticed is not in the flume git repo yet.  boo! Any idea when this will make
it in?

This Jira also helped solve my previous questions below.  I applied
Chetan's super simple patch and saw the expected increase in
throughput.  We went from 6k msgs/sec up to 16k msgs/sec on the same remote
syslogTcp -> hdfs sink.  The syslog-ng test util loggen is also invaluable
for this benchmarking, so thanks to Chetan for indirectly pointing me in
that direction from the jira and answering my question "is there a better
way to test this?".
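
For anyone else benchmarking this way, here's roughly how I've been driving it (a sketch; the loggen flags are from the syslog-ng manpage, so double-check them against your version, and host2/5145 are just my test node and port):

```shell
# Hypothetical loggen invocation (not executed here):
#   loggen --inet --stream --rate 16000 --interval 60 host2 5145
# The runnable part below just derives how many messages such a run
# should push, which is handy when counting what lands in hdfs.
rate=16000        # msgs/sec loggen is asked to sustain
interval=60       # seconds to keep generating
expected=$((rate * interval))
echo "expect ~$expected messages at the sink"
```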

Thanks,

-Steve


On Mon, Sep 19, 2011 at 6:39 PM, Stephen Layland
<stephen.layland@gmail.com>wrote:

> Jeff, I actually just finished banging my head against this today :)  I
> used the thrift 0.6.1 compiler; flume built and the tests seemed to pass
> (before I got tired of waiting for them all to finish).  mvn package then
> builds a distribution package in path/to/src/flume-distribution/target/.
>  You can unzip and install that on your machine and run flume from that
> directory (or just run from the target/flume-distribution-...-bin dir.)
>
> Ed, thanks for the heads up re: syslog TCP code.  When testing the tcp
> syslog source I noticed pretty abysmal performance (though I could be
> testing the wrong thing...)  While a hadoop fs -put was handling around 40
> M/s raw hdfs write speed, flume's tcp syslog collector was writing at only
> 1 M/s.  Unfortunately, after upgrading to the 0.9.5 SNAPSHOT I'm not
> seeing any difference in my tests.  It is still taking quite a while to
> stream syslog data into hadoop, and I haven't even gotten to multiple flume
> tiers yet.
>
> Is this throughput expected, or is there a better way I can test this[1]?
>  We're planning on having around 8k syslog streams coming in at a reasonable
> clip and these numbers have me worried.  Perhaps I just need to try it out
> and see.  Does anyone have a better way of benchmarking 'live' syslog data?
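
For what it's worth, a back-of-envelope sketch of the sizing question (the per-stream rate is a made-up assumption, and the per-node capacity is just a placeholder for whatever one collector sustains in your own benchmark):

```shell
streams=8000            # planned number of syslog streams
msgs_per_stream=10      # assumed average msgs/sec per stream (a guess)
node_cap=16000          # assumed msgs/sec one collector node sustains
total=$((streams * msgs_per_stream))
nodes=$(( (total + node_cap - 1) / node_cap ))   # ceiling division
echo "$total msgs/sec aggregate -> at least $nodes collector nodes"
```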
>
> Thanks all,
>
> -Steve
>
>
> [1] Here's my test setup:
>
>     host1 = flume master
>     host2 = flume node
>     node1 = logical config mapped to host2
>     host3 = host sending syslog data
>
> cru@host1:~/flume-0.9.5-SNAPSHOT$ flume master
> .....
> cru@host1:~/flume-bench$ cat bench.flume
> exec unmapAll
> exec refreshAll
> exec config node1 'syslogTcp(5145)'
> 'collectorSink("hdfs://master-hadoop:54310/user/cru/bench/%Y-%m-%d/%H00/",
> "test-")'
> exec map node1 flume-host1
> exec refreshAll
> cru@host1:~/flume-bench$ cat bench.flume |
> ~/flume-0.9.5-SNAPSHOT/bin/flume shell -c localhost
> .....
>
> cru@host2:~/flume-0.9.5-SNAPSHOT$ ./bin/flume node
> .....
>
> Once the flume node is up and listening to port 5145 on host2 (sometimes it
> goes straight to ERROR so I have to refresh it a couple of times), I simply
> netcat a 72M file to the listening flume node:
>
> cru@host2:~$ ls -l mysyslog.file
> -rw-r--r-- 1 cru cru 75565855 Sep 12 18:35
> cru@host2:~$ time cat mysyslog.file | sed -e 's/^/<37>/' | nc host1 5145
> -q 1
>
> real    1m1.043s
> user    0m0.316s
> sys     0m0.360s
> cru@host2:~$ echo '75565855 / 61.043' | bc
> 1237911
>
> 1M/s?!
>
> Here's a comparable test using hadoop fs -put which also takes into account
> network traffic:
>
> Listen on port 5155 on host1, and when the file's all here, just fs
> -put it:
>
> cru@host1:~$ nc -l -p 5155 > test.txt && time hadoop fs -put test.txt
> /user/cru/test.txt
> ....
>
> cru@host2:~$ time cat mysyslog.file | sed -e 's/^/<37>/' | nc host1 5155
> -q 1
>
> real    0m1.500s
> user    0m0.332s
> sys     0m0.448s
>
> (back on host1, time shows:
>
> real    0m1.816s
> user    0m2.600s
> sys     0m0.312s
> )
>
> cru@host1:~$ echo '75565855 / ( 1.816 + 1.5 )' | bc
> 22788255
>
> So, that's ~20M/s vs 1M/s.  While I know this isn't a *completely* fair
> fight, it seems flume is loitering a bit more than it should be.  Just out
> of curiosity, instead of syslogTcp(), I also tested a simple
> tail("/tmp/test.txt") source of the same netcatted file.  Looking at the
> timestamps for the first and last file and making collector.roll.millis
> sufficiently small (~3s), I estimated the time it took to load the file as
> about 30s (~2M/s):
>
> 2011-09-20 01:10:27,707 [logicalNode node2-30] INFO rolling.RollSink:
> opening RollSink
>  'escapedCustomDfs("hdfs://master-hadoop:54310/user/cru/bench/%Y-%m-%d/%H00/","test-%{rolltag}"
> )'
>
> 2011-09-20 01:10:59,219 [Roll-TriggerThread-1] INFO hdfs.CustomDfsSink:
> Closing HDFS file:
> hdfs://master-hadoop:54310/user/cru/bench/2011-09-20/0100/test-20110920-011055986+0000.1102905343428388.00000033.tmp
>
>
>
> On Mon, Sep 19, 2011 at 11:02 AM, Jeff Hansen <dscheffy@gmail.com> wrote:
>
>> Hi Ed, I noticed the same issue Stephen mentioned in this thread a
>> week or two ago.  I'd like to try running against trunk, but I'm
>> having some difficulties compiling it.
>>
>> (kept getting a thrift error, so I compiled/installed thrift, still
>> kept getting thrift errors, finally noticed note in devnotes that I
>> need to specify thrift executable location -- even though I used the
>> default... got lots more thrift issues because I had installed trunk
>> and it looks like flume is going against 0.6.0, recompiled and
>> installed thrift 0.6.x branch, still seeing tons of test failures when
>> I run mvn install or mvn package -- I'm not finding any jar files
>> created anywhere in the project after the build)
>>
>> If any of these sound familiar to you and you found a good source of
>> developer information I'd be grateful (a lot of the READMEs and
>> DEVNOTES in the source contain out of date links as well as pointing
>> back to the google groups as the mailing list, so I'm hesitant to put
>> too much faith in them).  I suppose subscribing to the developer
>> mailing list might be a good idea...
>>
>> By the way, once you were able to successfully build, did you just
>> replace the flume-core-0.9.4-cdh3-u1.jar in your cdh distro's lib folder
>> with the one from the build?  In the short term I think I'll want to
>> run this from inside eclipse anyway for debugging purposes, but the
>> build docs were a bit spotty on how to deploy from a built project.
>>
>> Thanks!
>> Jeff
>>
>>
>>
>> On Fri, Sep 16, 2011 at 6:38 PM, Edward sanks <edsanks@hotmail.com>
>> wrote:
>> > Steve,
>> >
>> > If you noticed my mail last week about flume-0.9.4 hitting the roof with
>> just 3 syslogTcp streams on an AWS large machine, you may want to explore
>> moving to the latest code as well. Having said that, I have yet to prove that point.
>> >
>> > Ed.
>> > -----Original Message-----
>> > From: Stephen Layland <stephen.layland@gmail.com>
>> > Date: Fri, 16 Sep 2011 23:16:49
>> > To: <flume-user@incubator.apache.org>
>> > Subject: is collectorSink(dest, prefix, millis, format) broken or am i
>> >  stupid?
>> >
>> > Hi!
>> >
>> >
>> > Forgive the n00b question, but I'm trying to benchmark flume while
>> building out a hadoop-based central log store and am coming across some
>> weirdness.  The flume-conf.xml has the default flume.collector.output.format
>> set to 'avrojson'.  I had two simple configs:
>> >
>> >
>> > test1: syslogTcp(5140) | collectorSink("hdfs://...", "test", 30000,
>> "avrodata")
>> > test2: syslogTcp(5140) | collectorSink("hdfs://...", "test", 30000,
>> "raw")
>> >
>> >
>> > I then mapped a test flume node to each of these logical nodes in turn
> > (exec map node1 test1; exec refreshAll) and tested it out, but the actual dfs
> > files all appear to be the same size and all appear to be avrojson.
>> >
>> >
>> > Am I doing something wrong here?
>> >
>> >
>> > Using flume version: 0.9.4-cdh3u1.
>> >
>> >
>> > Thanks,
>> >
>> >
>> > -Steve
>> >
>>
>
>
