phoenix-user mailing list archives

From "Perko, Ralph J" <Ralph.Pe...@pnnl.gov>
Subject Re: data ingestion
Date Thu, 09 Oct 2014 21:44:03 GMT
Gabriel,

Thanks for the reply.

The data is arriving every 15 minutes in multiple text files.  This is not a
real-time system, so some waiting is acceptable.  We have a 40-node
cluster.

I am investigating the Pig option.  I have not used Pig much - it appears
the table must pre-exist in Phoenix/HBase before loading the data using
Pig?  Do I need to create two schemas - one for Pig and the other for
Phoenix?  Are there any other examples of using Pig with Phoenix aside
from what is online and in the source tree?
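For context, a Phoenix table is created up front with plain SQL DDL before anything can be loaded into it; a minimal sketch (the table name, columns, and types below are invented for illustration, not from this thread):

```sql
-- Illustrative Phoenix DDL; adjust columns/types to the real data.
CREATE TABLE IF NOT EXISTS EXAMPLE_TABLE (
    EVENT_TIME TIMESTAMP NOT NULL,
    HOST       VARCHAR,
    PAYLOAD    VARCHAR,
    CONSTRAINT pk PRIMARY KEY (EVENT_TIME, HOST)
);
```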

Writing MR is not an issue - this is my normal mode of ingesting data.  Is
latency your primary concern with extending the CsvBulkLoadTool?  Would
writing directly through the JDBC driver be a better approach in MR?
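On the JDBC question: writing from a mapper through the Phoenix driver generally comes down to executing a parameterized UPSERT per record and committing in batches. A minimal sketch of building such a statement - the helper class, table, and column names are invented for illustration, and the driver connection itself is omitted:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds a parameterized Phoenix UPSERT for use with a
// JDBC PreparedStatement inside an MR mapper. Names are illustrative only.
public class UpsertSql {
    public static String build(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        // one "?" placeholder per column
        String params = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "UPSERT INTO " + table + " (" + cols + ") VALUES (" + params + ")";
    }
}
```

In a mapper you would prepare this statement once per task, bind values per record, and call commit() every few thousand rows, since Phoenix buffers mutations until commit.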

Lots of questions - thanks for your time.

Ralph

__________________________________________________
Ralph Perko 
Pacific Northwest National Laboratory

On 10/9/14, 11:17 AM, "Gabriel Reid" <gabriel.reid@gmail.com> wrote:

>Hi Ralph,
>
>I think this depends a bit on how quickly you want to get the data
>into Phoenix after it arrives, what kind of infrastructure you've got
>available to run MR jobs, and how the data is actually arriving.
>
>In general, the highest-throughput and least-flexible option is the
>CsvBulkLoadTool. Of your transformation requirements above, it'll be
>able to take care of #1 and #4, but parsing fields and combining
>fields won't be covered by it. The main reason for the relatively high
>throughput of the CsvBulkLoadTool is that it writes directly to HFiles
>-- however, it is pretty high latency, and it would probably be best
>used to do one or two big loads each day instead of every 15 minutes.
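For reference, kicking off the bulk load tool from the command line typically looks something like the following; the jar name, table name, and input path are placeholders, not values from this thread:

```
hadoop jar phoenix-<version>-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table EXAMPLE_TABLE \
    --input /data/incoming/csv
```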
>
>Two other options that are probably worth considering are Pig and
>Flume. I believe Pig should be able to provide enough transformation
>logic for what you need, and then you can plug it into the Phoenix
>StoreFunc for Pig [1]. Although this won't give you the same
>throughput as the CsvBulkLoadTool, it'll be more flexible, as well as
>probably having slightly lower overall latency because there is no
>reduce involved.
>
>I don't think that Flume itself provides much in the way of
>transformation logic, but with Kite Morphlines [2] you can plug in
>some transformation logic within Flume, and then send the data through
>to the Phoenix Flume plugin [3]. I haven't got much experience with
>Flume, but I believe that this should work in theory.
>
>In any case, I would suggest trying to go with Pig first, and Flume
>second. A custom solution will mean you'll need to worry about
>scaling/parallelization to get high enough throughput, and both Pig
>and Flume are more made for what you're looking for.
>
>Extending the CsvBulkLoadTool would also be an option, but I would
>recommend using that as a last resort (if you can't get high enough
>throughput with the other options).
>
>- Gabriel
>
>
>[1] http://phoenix.apache.org/pig_integration.html
>[2] http://kitesdk.org/docs/current/kite-morphlines/index.html
>[3] http://phoenix.apache.org/flume.html
>
>On Thu, Oct 9, 2014 at 4:36 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
>wrote:
>> Hi,  What is the best way to ingest large amounts of csv data coming in
>> at regular intervals (about every 15min, for a total of about 500G/daily
>> or 1.5B records/daily) that requires a few transformations before being
>> inserted?
>>
>> By transformation I mean the following:
>> 1) 1 field is converted to a timestamp
>> 2) 1 field is parsed to create a new field
>> 3) several fields are combined into 1
>> 4) a couple columns need to be reordered
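The four transformations listed above are simple enough to sketch in plain Java; the field formats, the "host:port" example, and the delimiter are assumptions for illustration only:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative per-record transforms; input formats are assumptions,
// not taken from the thread.
public class CsvTransform {
    private static final DateTimeFormatter IN_FMT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // 1) convert a raw field to an epoch-millis timestamp (assumes UTC)
    public static long toTimestamp(String raw) {
        return LocalDateTime.parse(raw, IN_FMT)
                .toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    // 2) parse a field to derive a new one, e.g. the host out of "host:port"
    public static String parseHost(String hostPort) {
        return hostPort.split(":", 2)[0];
    }

    // 3) combine several fields into one
    public static String combine(String... fields) {
        return String.join("|", fields);
    }

    // 4) reorder columns into the target order given source indices
    public static String[] reorder(String[] fields, int[] order) {
        String[] out = new String[order.length];
        for (int i = 0; i < order.length; i++) {
            out[i] = fields[order[i]];
        }
        return out;
    }
}
```

Logic like this fits equally well in a Pig UDF or an MR mapper, which is part of why both routes are viable.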
>>
>> Is there any way to make these transformations through the bulk load
>> tool, or is MR the best route?
>> If I use MR should I go purely through JDBC? Write directly to HBase?
>> Do something similar to the csv bulk load tool (perhaps even just
>> customizing the CsvBulkLoadTool?), or something altogether different?
>>
>> Thanks!
>> Ralph
>>
>> __________________________________________________
>> Ralph Perko
>> Pacific Northwest National Laboratory
>>

