flume-user mailing list archives

From Michal Klempa <michal.kle...@gmail.com>
Subject HiveSink not multithreaded?
Date Thu, 26 Jan 2017 07:17:20 GMT
Hi,
I have been working a lot with HiveSink to load data into Hive. Besides
discovering this bug:
https://issues.apache.org/jira/browse/HIVE-15658
I also found that HiveSink differs from HDFSEventSink in the way the
thread pool for delayed operations is created.

See this line in HDFSEventSink:
https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L522
It uses the argument threadsPoolSize, which defaults to 10
(https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L97)
but can be configured as hdfs.threadPoolSize in the Flume config
(https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java#L210).

In contrast, HiveSink creates the thread pool this way:
https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hive-sink/src/main/java/org/apache/flume/sink/hive/HiveSink.java#L493
It uses 1 thread, with the comment "// call timeout pool needs only 1 thd
as sink is effectively single threaded".
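
To make the difference concrete, here is a minimal stand-alone sketch. It
is not the actual Flume code: the class name CallTimeoutPoolSketch and the
method distinctWorkers are made up for illustration, and it only assumes
that both sinks end up with a fixed-size pool from java.util.concurrent.
It counts how many distinct worker threads serve a handful of tasks for a
given pool size.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from HDFSEventSink or HiveSink.
public class CallTimeoutPoolSketch {

    // Runs `tasks` trivial tasks on a fixed pool of `poolSize` threads and
    // returns the number of distinct worker threads that executed them.
    static int distinctWorkers(int poolSize, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Set<String> workerNames = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < tasks; i++) {
            // Each task records the name of the thread it ran on.
            pool.execute(() -> workerNames.add(Thread.currentThread().getName()));
        }
        pool.shutdown();
        try {
            // Wait for all submitted tasks to finish before counting.
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for pool", e);
        }
        return workerNames.size();
    }

    public static void main(String[] args) {
        // Pool sized like the HDFS sink default (10) versus the Hive sink (1).
        System.out.println("pool of 10, 4 tasks: " + distinctWorkers(10, 4) + " worker(s)");
        System.out.println("pool of 1, 4 tasks:  " + distinctWorkers(1, 4) + " worker(s)");
    }
}
```

With a pool of 1, every task is served by the same worker thread, so
calls handed to such a pool can only run one after another.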

Why is the Hive sink effectively single threaded? The documentation
(FlumeUserGuide) does not mention this. How should I handle this
situation? For performance reasons, I would like to have multithreaded
writes into Hive. Do I have to fan out (multiplex/round-robin) and
configure multiple HiveSinks? Probably I do, but it is ugly.

What is the reason that HiveSink is single threaded?

Thanks, Michal
