phoenix-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: data ingestion
Date Thu, 09 Oct 2014 18:17:22 GMT
Hi Ralph,

I think this depends a bit on how quickly you want to get the data
into Phoenix after it arrives, what kind of infrastructure you've got
available to run MR jobs, and how the data is actually arriving.

In general, the highest-throughput and least-flexible option is the
CsvBulkLoadTool. Of your transformation requirements above, it'll be
able to take care of #1 and #4, but parsing fields and combining
fields won't be covered by it. The main reason for the relatively high
throughput of the CsvBulkLoadTool is that it writes directly to HFiles
-- however, it has pretty high latency, so it would probably be best
used for one or two big loads each day rather than a load every 15
minutes.
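
To give an idea, an invocation looks roughly like this (I'm writing
the options from memory, and the table name, input path and zookeeper
quorum are just placeholders -- check the bulk load docs for the real
option list):

    hadoop jar phoenix-<version>-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        --table MY_TABLE \
        --input /data/incoming/2014-10-09 \
        --zookeeper zk1,zk2,zk3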

Two other options that are probably worth considering are Pig and
Flume. I believe Pig should be able to provide enough transformation
logic for what you need, and then you can plug it into the Phoenix
StoreFunc for Pig [1]. Although this won't give you the same
throughput as the CsvBulkLoadTool, it'll be more flexible, and it
will probably have slightly lower overall latency because there is
no reduce phase involved.
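
As a rough, untested sketch of what the Pig side could look like (the
field names, regex and date format here are made up, and the columns
of the final relation need to line up with the columns of your
Phoenix table):

    raw = LOAD '/data/incoming' USING PigStorage(',')
          AS (ts_str:chararray, f1:chararray, f2:chararray, f3:chararray);

    xformed = FOREACH raw GENERATE
          ToDate(ts_str, 'yyyy-MM-dd HH:mm:ss') AS event_time,   -- #1
          REGEX_EXTRACT(f1, '^([^;]+)', 1)      AS parsed_field, -- #2
          CONCAT(f2, f3)                        AS combined,     -- #3
          f1;                                   -- #4 reorder by projection order

    STORE xformed INTO 'hbase://MY_TABLE' USING
          org.apache.phoenix.pig.PhoenixHBaseStorage('zk-host', '-batchSize 5000');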

I don't think that Flume itself provides much in the way of
transformation logic, but with Kite Morphlines [2] you can plug in
some transformation logic within Flume, and then send the data through
to the Phoenix Flume plugin [3]. I haven't got much experience with
Flume, but I believe that this should work in theory.
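
Purely as a sketch of the Flume side (I'm not certain of these
property names, so treat them as approximate and verify against [3];
the morphline interceptor config would sit in front of the sink):

    agent.sinks.phoenix-sink.type = org.apache.phoenix.flume.sink.PhoenixSink
    agent.sinks.phoenix-sink.batchSize = 100
    agent.sinks.phoenix-sink.zookeeperQuorum = zk-host
    agent.sinks.phoenix-sink.table = MY_TABLE
    agent.sinks.phoenix-sink.serializer = regex
    agent.sinks.phoenix-sink.serializer.regex = ([^,]*),([^,]*),([^,]*)
    agent.sinks.phoenix-sink.serializer.columns = c1,c2,c3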

In any case, I would suggest trying to go with Pig first, and Flume
second. A custom solution will mean you'll need to worry about
scaling/parallelization to get high enough throughput, and both Pig
and Flume are better suited to what you're looking for.

Extending the CsvBulkLoadTool would also be an option, but I would
recommend using that as a last resort (if you can't get high enough
throughput with the other options).

- Gabriel


[1] http://phoenix.apache.org/pig_integration.html
[2] http://kitesdk.org/docs/current/kite-morphlines/index.html
[3] http://phoenix.apache.org/flume.html

On Thu, Oct 9, 2014 at 4:36 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov> wrote:
> Hi,  What is the best way to ingest large amounts of csv data coming in at
> regular intervals (about every 15min for a total of about 500G/daily or 1.5B
> records/daily) that requires a few transformations before being inserted?
>
> By transformation I mean the following:
> 1) 1 field is converted to a timestamp
> 2) 1 field is parsed to create a new field
> 3) several fields are combined into 1
> 4) a couple columns need to be reordered
>
> Is there any way to make these transformations through the bulk load tool, or
> is MR the best route?
> If I use MR, should I go purely through JDBC? Write directly to HBase? Do
> something similar to the CSV bulk load tool (perhaps even just customizing
> the CsvBulkLoadTool?), or something altogether different?
>
> Thanks!
> Ralph
>
> __________________________________________________
> Ralph Perko
> Pacific Northwest National Laboratory
>
