flume-user mailing list archives

From Mohit Durgapal <durgapalmo...@gmail.com>
Subject Re: Appending data into HDFS Sink!
Date Mon, 19 Jan 2015 13:49:10 GMT
But why do you want your MR job to read from the .tmp file? The .tmp suffix
marks a temporary file, i.e. its state is not well defined (at least not to
the user), so you're not supposed to read from it. Your MR job should only
consider files that do not end with .tmp. Also, there's a very high
probability that the MR job will not find the same .tmp file when it
actually looks the file up on the NameNode to read its contents, because
Flume removes the .tmp suffix (it renames the file) after flushing the
contents completely and closing it. In that case the job fails with a
"file does not exist" error. That's something I have faced. If you want
files at shorter intervals, you can reduce the
...hdfs.rollCount or hdfs.rollInterval properties of your Flume agent to
have files generated more frequently, but reading from the .tmp files seems
like a bad idea to me.


Just in case anyone's interested:

What we did was add a filter for .tmp files, using a PathFilter like the
one below:

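import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Skips files that Flume is still writing (they carry the .tmp suffix).
public class InputFilter implements PathFilter {

    @Override
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.endsWith(".tmp");
    }
}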


We then included it in the MR job's driver code.
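
The driver change itself is not shown here. With the org.apache.hadoop.mapreduce API, a filter class like this is normally registered through FileInputFormat.setInputPathFilter(job, InputFilter.class); whether our job did exactly that is left as an assumption. The sketch below is a Hadoop-free stand-in that only illustrates which file names such a filter lets through. The class TmpFilterDemo, its methods, and the file names are hypothetical and not part of Flume or Hadoop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, Hadoop-free stand-in for InputFilter: it applies the same
// ".tmp" suffix rule to plain file names instead of Hadoop Path objects.
class TmpFilterDemo {

    // Same predicate as InputFilter.accept(): reject names ending in ".tmp".
    static boolean accept(String name) {
        return !name.endsWith(".tmp");
    }

    // Keep only the names the filter accepts, preserving their order.
    static List<String> finishedOnly(List<String> names) {
        return names.stream()
                    .filter(TmpFilterDemo::accept)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up directory listing: one rolled file, one still being written.
        List<String> listing = Arrays.asList(
            "FlumeData.1421675350000",
            "FlumeData.1421675350001.tmp");
        System.out.println(finishedOnly(listing));
    }
}
```

Note that the rule is a pure suffix check, so a file whose name merely contains "tmp" elsewhere is still accepted.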



Mohit

On Mon, Jan 19, 2015 at 6:59 PM, Raj Kumar <rajkumartheone999@gmail.com>
wrote:

> Hello guys!
>
> I'm new to Flume and this group, so please be patient with me :-)
>
>
> I have a Flume agent which streams data into an HDFS sink (appending to the
> same file), which I can "hdfs dfs -cat" and see in HDFS. However, when I run
> a MapReduce job on that file (.tmp), it only picks up the first batch that
> was flushed (batchSize = 100) into HDFS. The rest are not picked up,
> although I can cat the file and see them. When I execute the MapReduce job
> after the file is rolled (closed), it picks up all the data.
>
> Do you know why the MR job is failing to find the rest of the batches even
> though they exist?
>
> Best regards,
>
> Raj.
>
>
