flume-user mailing list archives

From Mike Percy <mpe...@apache.org>
Subject Re: Append existing Avro file - HDFS Sink
Date Fri, 12 Oct 2018 18:16:22 GMT
Also consider setting up a Spark job or similar (Impala, Hive) to
periodically read the Avro files and write them out in a columnar format
(Parquet or ORC), which would give you small-files compaction (assuming you
delete the source files periodically) and better analytical read performance
on the columnar files.
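
As a rough sketch (the paths, partition layout, and spark-avro package
version below are just placeholders for whatever your setup looks like), a
small PySpark job for that compaction step could look like this:

from pyspark.sql import SparkSession

# Read the small Avro files Flume wrote for one date partition and rewrite
# them as a few larger Parquet files. Reading Avro needs the spark-avro
# package on the classpath, e.g.
#   --packages org.apache.spark:spark-avro_2.11:2.4.0   (Spark 2.4+)
# older Spark versions use the com.databricks:spark-avro package instead.
spark = SparkSession.builder.appName("avro-compaction").getOrCreate()

# Hypothetical paths -- substitute your Flume output and warehouse layout.
src = "hdfs:///flume/events/dt=2018-10-12"
dst = "hdfs:///warehouse/events_parquet/dt=2018-10-12"

df = spark.read.format("avro").load(src)

# coalesce() caps the number of output files per partition, which is what
# actually addresses the small-files problem on the read side.
df.coalesce(4).write.mode("overwrite").parquet(dst)

spark.stop()

Once the Parquet output for a partition is verified you can drop the source
Avro files for it, which is the "delete the source files periodically" part.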

Mike

On Fri, Oct 12, 2018 at 12:20 AM Rickard Cardell <rickard.cardell@klarna.com>
wrote:

>
>
> On Fri, 20 Apr 2018 at 20:49, Nitin Kumar <nitin.kumar2512@gmail.com> wrote:
>
>> Hi All,
>>
>> I am using Flume v1.8, in which the Flume agent consists of a Kafka Channel &
>> HDFS Sink.
>> I am able to write data as Avro files on HDFS into an external Hive table,
>> but the problem is that whenever Flume gets restarted it closes the current
>> file and opens a new one, which leaves me with many small files. (Data is
>> partitioned by date)
>>
>> Can't Flume append to the existing file to avoid creating a new one?
>>
> Hi
> No, not with the hdfs-sink at least.
>
>> Also, how can I solve this problem, which leads to the creation of too many
>> small files?
>>
>
>
> We also used the hdfs-sink, but because of the high maintenance we went with
> the hbase-sink instead, which also gave us deduplication. The major drawback
> is that it requires an extra step, an HBase-to-HDFS job.
>
> Your many-small-files problem might be solved with an extra step, e.g. an
> Oozie job, that merges the smaller files into larger ones.
>
> That would also solve the problem with the leftover temp files that Flume
> doesn't clean up in some circumstances.
>
> /Rickard
>
>
>> Any help would be appreciated.
>>
>> --
>>
>> Regards,
>> Nitin Kumar
>>
>
