flume-user mailing list archives

From Stephen Layland <stephen.layland@gmail.com>
Subject Syslog Source Performance: (Was: is collectorSink(dest, prefix, millis, format) broken or am i stupid?)
Date Tue, 20 Sep 2011 01:39:33 GMT
Jeff, I actually just finished banging my head against this today :)  With
the thrift 0.6.1 compiler installed, flume built and the tests seemed to pass
(at least before I got tired of waiting for them all to finish).  mvn package
then builds a distribution package under path/to/src/flume-distribution/target/.
You can unzip and install that on your machine and run flume from that
directory (or just run straight out of the target/flume-distribution-...-bin dir).
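
In case it helps, the full sequence that worked for me was roughly this
(paths are from my checkout, and -DskipTests is optional if you don't want
to wait for the test suite):

cd path/to/src                        # assumes thrift 0.6.1 is on the PATH
mvn package -DskipTests
cd flume-distribution/target
unzip flume-distribution-*-bin.zip    # exact archive name may vary
cd flume-distribution-*-bin
./bin/flume node                      # or ./bin/flume master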

Ed, thanks for the heads up re: the syslog TCP code.  When testing the TCP
syslog source I noticed pretty abysmal performance (though I could be
testing the wrong thing...).  While a hadoop fs -put was handling around
40M/s raw hdfs write speeds, flume's TCP syslog collector was writing at
only 1M/s.  Unfortunately, after upgrading to the 0.9.5-SNAPSHOT I'm not
seeing any difference in my tests.  It is still taking quite a while to
stream syslog data into hadoop, and I haven't even gotten to multiple flume
tiers yet.

Is this throughput expected, or is there a better way I can test this[1]?
 We're planning on having around 8k syslog streams coming in at a reasonable
clip and these numbers have me worried.  Perhaps I just need to try it out
and see.  Does anyone have a better way of benchmarking 'live' syslog data?
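
One thing I may try is rate-limiting the replay so it looks more like a
steady stream than a single blast.  A rough sketch (assumes pv is
installed; the <37> prefix trick is the same as in my test below):

# one stream at a steady ~5M/s
cat mysyslog.file | sed -e 's/^/<37>/' | pv -q -L 5m | nc host1 5145 -q 1

# or a handful of parallel streams at ~1M/s each
for i in $(seq 1 8); do
  cat mysyslog.file | sed -e 's/^/<37>/' | pv -q -L 1m | nc host1 5145 -q 1 &
done; wait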

Thanks all,

-Steve


[1] Here's my test setup:

    host1 = flume master
    host2 = flume node
    node1 = logical config mapped to host2
    host3 = host sending syslog data

cru@host1:~/flume-0.9.5-SNAPSHOT$ flume master
.....
cru@host1:~/flume-bench$ cat bench.flume
exec unmapAll
exec refreshAll
exec config node1 'syslogTcp(5145)' 'collectorSink("hdfs://master-hadoop:54310/user/cru/bench/%Y-%m-%d/%H00/", "test-")'
exec map node1 flume-host1
exec refreshAll
cru@host1:~/flume-bench$ cat bench.flume | ~/flume-0.9.5-SNAPSHOT/bin/flume shell -c localhost
.....
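
(Before sending any data I also like to sanity-check that node1 actually
picked up the config -- if I'm remembering the 0.9.x shell commands right,
getnodestatus shows it:

cru@host1:~/flume-bench$ echo 'getnodestatus' | ~/flume-0.9.5-SNAPSHOT/bin/flume shell -c localhost
)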

cru@host2:~/flume-0.9.5-SNAPSHOT$ ./bin/flume node
.....

Once the flume node is up and listening to port 5145 on host2 (sometimes it
goes straight to ERROR so I have to refresh it a couple of times), I simply
netcat a 72M file to the listening flume node:

cru@host2:~$ ls -l mysyslog.file
-rw-r--r-- 1 cru cru 75565855 Sep 12 18:35 mysyslog.file
cru@host2:~$ time cat mysyslog.file | sed -e 's/^/<37>/' | nc host1 5145 -q 1

real    1m1.043s
user    0m0.316s
sys     0m0.360s
cru@host2:~$ echo '75565855 / 61.043' | bc
1237911

1M/s?!

Here's a comparable test using hadoop fs -put, which also takes network
traffic into account: listen on port 5155 on the flume node and, once the
file's all there, just fs -put it:

cru@host1:~$ nc -l -p 5155 > test.txt && time hadoop fs -put test.txt /user/cru/test.txt
....

cru@host2:~$ time cat mysyslog.file | sed -e 's/^/<37>/' | nc host1 5155 -q 1

real    0m1.500s
user    0m0.332s
sys     0m0.448s

(back on host1, time shows:

real    0m1.816s
user    0m2.600s
sys     0m0.312s
)

cru@host1:~$ echo '75565855 / ( 1.816 + 1.5 )' | bc
22788255

So, that's ~20M/s vs 1M/s.  While I know this isn't a *completely* fair
fight, it seems flume is loitering a bit more than it should be.  Just out
of curiosity, instead of syslogTcp() I also tested a simple
tail("/tmp/test.txt") source on the same netcatted file.  Looking at the
timestamps of the first and last files, with collector.roll.millis set
sufficiently small (~3s), I estimated it took about 30s to load the file
(~2M/s):

2011-09-20 01:10:27,707 [logicalNode node2-30] INFO rolling.RollSink: opening RollSink 'escapedCustomDfs("hdfs://master-hadoop:54310/user/cru/bench/%Y-%m-%d/%H00/","test-%{rolltag}")'

2011-09-20 01:10:59,219 [Roll-TriggerThread-1] INFO hdfs.CustomDfsSink: Closing HDFS file: hdfs://master-hadoop:54310/user/cru/bench/2011-09-20/0100/test-20110920-011055986+0000.1102905343428388.00000033.tmp
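
(Backing the rate out of those two timestamps, 01:10:27,707 to 01:10:59,219
is about 31.5s:

cru@host1:~$ echo '75565855 / 31.5' | bc
2398916

so a bit over 2M/s.)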



On Mon, Sep 19, 2011 at 11:02 AM, Jeff Hansen <dscheffy@gmail.com> wrote:

> Hi Ed, I noticed the same issue Stephen mentioned in this thread a
> week or two ago.  I'd like to try running against trunk, but I'm
> having some difficulties compiling it.
>
> (I kept getting a thrift error, so I compiled/installed thrift; still
> kept getting thrift errors; finally noticed a note in DEVNOTES that I
> need to specify the thrift executable location -- even though I used the
> default... I got lots more thrift issues because I had installed trunk
> and it looks like flume builds against 0.6.0; I recompiled and installed
> the thrift 0.6.x branch, and I'm still seeing tons of test failures when
> I run mvn install or mvn package -- I'm not finding any jar files
> created anywhere in the project after the build)
>
> If any of this sounds familiar to you and you've found a good source of
> developer information, I'd be grateful (a lot of the READMEs and
> DEVNOTES in the source contain out-of-date links as well as pointing
> back to the google groups as the mailing list, so I'm hesitant to put
> too much faith in them).  I suppose subscribing to the developer
> mailing list might be a good idea...
>
> By the way, once you were able to successfully build, did you just
> replace the flume-core-0.9.4-cdh3-u1.jar in your cdh distro's lib folder
> with the one from the build?  In the short term I think I'll want to
> run this from inside eclipse anyway for debugging purposes, but the
> build docs were a bit spotty on how to deploy from a built project.
>
> Thanks!
> Jeff
>
>
>
> On Fri, Sep 16, 2011 at 6:38 PM, Edward sanks <edsanks@hotmail.com> wrote:
> > Steve,
> >
> > If you noticed my mail last week about flume-0.9.4 hitting the roof with
> just 3 syslogTcp streams on an AWS large machine, you may want to explore
> moving to the latest code as well. Having said that, I have yet to prove that point.
> >
> > Ed.
> > -----Original Message-----
> > From: Stephen Layland <stephen.layland@gmail.com>
> > Date: Fri, 16 Sep 2011 23:16:49
> > To: <flume-user@incubator.apache.org>
> > Subject: is collectorSink(dest, prefix, millis, format) broken or am i
> >  stupid?
> >
> > Hi!
> >
> >
> > Forgive the n00b question, but I'm trying to benchmark flume while
> building out a hadoop-based central log store and am coming across some
> weirdness.  The flume-conf.xml has the default flume.collector.output.format
> set to 'avrojson'.  I had two simple configs:
> >
> >
> > test1: syslogTcp(5140) | collectorSink("hdfs://...", "test", 30000,
> "avrodata")
> > test2: syslogTcp(5140) | collectorSink("hdfs://...", "test", 30000,
> "raw")
> >
> >
> > I then mapped a test flume node to each of these logical nodes in turn
> (exec map node1 test1; exec refreshAll) and tested it out, but the actual dfs
> files all appear to be the same size and all appear to be avrojson?
> >
> >
> > Am I doing something wrong here?
> >
> >
> > Using flume version: 0.9.4-cdh3u1.
> >
> >
> > Thanks,
> >
> >
> > -Steve
> >
>
