flume-user mailing list archives

From "Meyer, Dennis" <dennis.me...@adtech.com>
Subject Flume Reliability Issues
Date Fri, 02 Mar 2012 12:35:23 GMT

We encountered the following issues during our development with Flume. We are currently
investigating them, but it would be great if someone could send some feedback on whether each of these is

  *   Working as designed (but maybe misused)
  *   A known issue (in that version only?)
  *   An unsupported feature

Here is the list of the four issues we have seen:

  *   Flume version used: 0.9.4+25.40-1
  *   1) Feature "Duplicate Data" works inconsistently
     *   Data is not always duplicated (our usage: send data to a SAN for a full
backup and also send it to HDFS)
     *   If a receiving node goes out of service, the sending node stops sending data to all
receiving nodes
     *   If a failed receiving node reconnects:
        *   The sending node's CPU goes up to 100% usage, meaning that it stops handling
records from then on; even if the CPU recovers, Flume does not
        *   There is a chance that the sending source will not notice the reconnect.
This can only be fixed by a full restart of all involved sending/receiving nodes
  *   2) Flume is unable to recover failed/crashed/lost nodes reliably
     *   Failed nodes often come back up but are no longer integrated into the data flow (e.g.
a source not knowing that its sink reconnected)
     *   A node may be lost without either the master or any connected node knowing about it
     *   A failed node can only be reliably re-introduced into its flow if ALL nodes are restarted
  *   3) Flume is unable to run in the highest reliability mode for records without crashing
     *   If a node reconnects after a failure, there is a good chance that the master node
  *   4) Losing records on node failure
     *   Flume sends up to one thousand records as a batch from source to sink. If the sink
fails on the first record, the other 999 records sometimes get lost.
     *   In the highest reliability mode, Flume was unable to reroute records safely.
When we send data to a node that is or goes out of service, Flume saves this data for later,
when the node reconnects. What it should really do is take the events from the failed node
and reroute them, according to the defined flow, to another node.
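
The rerouting behaviour asked for in issue 4 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Flume code or Flume's API: `ListSink`, `SinkDown` and `deliver_batch` are hypothetical names, and the "flow" is modelled simply as an ordered list of sinks. A record is dropped from the batch only after a sink has accepted it, so a sink that fails on the first record leaves the remaining records pending for the next sink instead of losing them.

```python
from collections import deque


class SinkDown(Exception):
    """Raised by a sink that is out of service (hypothetical)."""


class ListSink:
    """Toy in-memory sink; fail_after=N makes it fail once N records are stored."""

    def __init__(self, name, fail_after=None):
        self.name = name
        self.fail_after = fail_after
        self.received = []

    def append(self, record):
        if self.fail_after is not None and len(self.received) >= self.fail_after:
            raise SinkDown(self.name)
        self.received.append(record)


def deliver_batch(batch, sinks):
    """Deliver a batch, rerouting the unacknowledged remainder to the
    next sink in the list whenever the current sink fails.

    Returns the records that no sink accepted (empty list on success).
    """
    pending = deque(batch)
    for sink in sinks:
        while pending:
            try:
                sink.append(pending[0])
            except SinkDown:
                break  # keep the remainder and try the next sink
            pending.popleft()  # accepted by the sink: safe to drop
        if not pending:
            return []
    return list(pending)  # caller must retain these for a later retry


# A batch of 1000 records where the primary sink fails on the first record.
primary = ListSink("primary", fail_after=0)
backup = ListSink("backup")
leftover = deliver_batch(range(1000), [primary, backup])
```

In this sketch the whole batch ends up in `backup` and `leftover` is empty. If every sink is down, `deliver_batch` returns the undelivered records rather than discarding them.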

