flume-user mailing list archives

From Steve Hoffman <ste...@goofy.net>
Subject Re: File rollover and small files
Date Mon, 20 Feb 2012 21:56:45 GMT
I noticed this too.
I was using a custom sink based on AvroDataFileOutputFormat.
I think the problem comes from the sink.flush() call, which calls
sync() in DataFileWriter.
That in turn calls writeBlock(), which runs the buffered data through
the compression codec. I'm not sure exactly where the logic goes wrong,
but calling flush() so often means each block holds so little
compressible data that the codec doesn't accomplish much (I saw files
only about 1/2 the size of the original).
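The effect is easy to reproduce outside Avro and Flume entirely. This is a
minimal sketch using plain DEFLATE from the JDK (the class and method names
are mine, purely for illustration): restarting the codec on every small
block, which is roughly what a flush()-per-event pattern does, wrecks the
compression ratio compared to compressing one large block at roll time.

```java
import java.util.zip.Deflater;

public class SmallBlockDemo {
    // Compress data in independent blocks of at most chunk bytes and
    // return the total compressed size, mimicking a codec that is
    // restarted on every sync()/flush().
    static int compressedSize(byte[] data, int chunk) {
        int total = 0;
        byte[] out = new byte[chunk + 64];
        for (int off = 0; off < data.length; off += chunk) {
            int len = Math.min(chunk, data.length - off);
            Deflater d = new Deflater();
            d.setInput(data, off, len);
            d.finish();
            while (!d.finished()) total += d.deflate(out);
            d.end();
        }
        return total;
    }

    public static void main(String[] args) {
        // Repetitive, log-like data.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++)
            sb.append("127.0.0.1 - - \"GET /index.html\" 200 ").append(i % 10).append('\n');
        byte[] data = sb.toString().getBytes();

        int oneBlock = compressedSize(data, data.length); // compress at roll time
        int tinyBlocks = compressedSize(data, 200);       // flush after ~every event

        System.out.println(oneBlock + " bytes vs " + tinyBlocks + " bytes");
    }
}
```

The many-tiny-blocks total comes out several times larger than the
single-block total, which matches the file-size behavior described above.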

Just to verify the data wasn't already as small as it could be, I ran
avro-tools tojson followed by fromjson (with the snappy codec and the
schema extracted via avro-tools getschema) and noticed the resulting
file was about 1/10th the size of the original.
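For reference, that round trip looks roughly like this (the file names and
the avro-tools jar name are placeholders; the exact jar and version on your
machine will differ):

```shell
# Extract the schema from the original file
java -jar avro-tools.jar getschema events.avro > events.avsc

# Dump the records to JSON, then re-encode them in one pass with snappy
java -jar avro-tools.jar tojson events.avro > events.json
java -jar avro-tools.jar fromjson --codec snappy --schema-file events.avsc \
    events.json > repacked.avro

# Compare the two file sizes
ls -l events.avro repacked.avro
```

Because fromjson writes all the records before closing the file, the codec
sees full-sized blocks, which is the best-case ratio to compare against.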

So, in my own output format, I removed that flush() call (in my
format(), after the sink.append()), which brought the file size down to
almost the same as my tojson/fromjson experiment.
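For concreteness, the change amounts to something like the following. This
is a sketch only, not runnable code: the method and field names follow the
description above, and the real Flume output-format signature may differ.

```java
// Sketch: assumes an OutputFormat-style hook wrapping an Avro
// DataFileWriter-backed sink, per the description above.
public void format(OutputStream o, Event e) throws IOException {
    sink.append(e);   // hand the event to the Avro DataFileWriter
    // sink.flush();  // removed: flush() triggers DataFileWriter.sync(),
                      // which seals and compresses the current block while
                      // it still holds very little data
}
```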

So it seems that flush() should be removed, but I don't know what that
will do to DFO/E2E flows (not that a flush really flushes to HDFS the
way it makes data persistent on a POSIX filesystem).

Can anybody more familiar with the code say what removing this flush()
will do to Flume's event processing/acks? Is this a bug (do we need a
JIRA)?


On Tue, Feb 14, 2012 at 1:17 PM, Jeremy Custenborder
<jcustenborder@gmail.com> wrote:
> Hello All,
> I'm working on a POC using flume to aggregate our log files to s3
> which will later be imported to hdfs and consumed by hive.  Here is my
> problem. The web server I'm currently using for the POC is not pushing
> much traffic, maybe 3 to 5 requests per second. This is resulting in a
> huge number of small files on the web server. I have roll set to
> 900000, which I thought would generate a file every 15 minutes. Instead
> I'm getting files uploaded to s3 anywhere from 5 seconds to 50 seconds
> apart, and they are pretty small too, around 600 bytes. My goal is to
> have at most 4-6 files per hour.
> Web Server
> source: tailDir("/var/log/apache2/site1/", fileregex="access.log")
> sink:value("sitecode", "site1") value("subsitecode", "subsite1")
> agentDFOSink("collector node",35853)
> Collector node
> source: collectorSource(35853)
> sink: collector(35853) { webLogDecorator()  roll(900000) {
> escapedFormatDfs("s3n://<valid s3
> bucket>/hive/weblogs_live/dt=%Y-%m-%d/sitecode=%{sitecode}/subsitecode=%{subsitecode}/",
> "file-%{rolltag}", seqfile("snappy"))}}
> Here is what my config looks like.
>  <property>
>    <name>flume.collector.roll.millis</name>
>    <value>900000</value>
>    <description>The time (in milliseconds)
>    between when hdfs files are closed and a new file is opened
>    (rolled).
>    </description>
>  </property>
>  <property>
>    <name>flume.agent.logdir.retransmit</name>
>    <value>2700000</value>
>    <description>The time (in milliseconds) before an
>    unacknowledged batch of events is considered lost
>    and retransmitted.
>    </description>
>  </property>
>  <property>
>    <name>flume.agent.logdir.maxage</name>
>    <value>450000</value>
>    <description> number of milliseconds before a local log file is
>    considered closed and ready to forward.
>    </description>
>  </property>
> I have to be missing something. What am I doing wrong?
> J
