phoenix-user mailing list archives

From Gabriel Reid <>
Subject Re: data ingestion
Date Fri, 10 Oct 2014 06:57:24 GMT
Hi Ralph,

Inlined below.

> The data is arriving every 15min in multiple text files.  This is not a
> real-time system so some waiting is acceptable.  We have a 40 node
> cluster.
> I am investigating the Pig option.  I have not used Pig much - it appears
> the table must pre-exist in Phoenix/HBase before loading the data using
> Pig? Do I need to create two schemas - one for Pig and the other for
> Phoenix?  Are there any other examples of using Pig with Phoenix aside
> from what is online and in the source tree?

The table must indeed pre-exist in Phoenix before using the Pig
loader, but I don't think that should be a problem for you (unless you
want to create a new table for every import).

Pig can simply load from text files and write to Phoenix, so there's
no need to have a separate Pig schema. Probably the easiest way to
start is to just write a small Pig script that starts from a text file
(i.e. doesn't touch Phoenix), and writes to a text file in the form
that you could use for loading with the CsvBulkLoadTool. Once you're
up to that point, you can just replace the STORE call in your
text-to-text Pig script with a STORE call to PhoenixHBaseStorage, and
things should "just work".
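
To make that a bit more concrete, here's a rough sketch of what such a
script could look like. The table name (EVENTS), the field names, the
paths and the ZooKeeper host are all invented for the example, so treat
it as an illustration of the shape rather than something to copy
verbatim:

    -- assumes a Phoenix table (here called EVENTS) was already created
    -- via sqlline/psql; every name and path below is just an example
    raw = LOAD '/data/incoming/2014-10-10' USING PigStorage(',')
            AS (ts_raw:chararray, src:chararray, dst:chararray, bytes:long);
    rows = FOREACH raw GENERATE
             ToUnixTime(ToDate(ts_raw, 'yyyy-MM-dd HH:mm:ss')) * 1000 AS event_time,
             CONCAT(src, '-', dst) AS src_dst,
             bytes;
    -- step 1: sanity-check the output as plain text
    -- STORE rows INTO '/data/csv-out' USING PigStorage(',');
    -- step 2: swap the STORE for the Phoenix StoreFunc
    STORE rows INTO 'hbase://EVENTS'
          USING org.apache.phoenix.pig.PhoenixHBaseStorage('zk-host', '-batchSize 1000');

The FOREACH covers the kind of timestamp conversion and field
concatenation you described, and the column order is just the order of
the generated fields.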

> Writing MR is not an issue, this is my normal mode of ingesting data.  Is
> latency your primary concern with extending the CsvBulkLoadTool? Would
> writing directly to the JDBC driver be a better approach in MR?

The main reason I see to not extend the CsvBulkLoadTool is that it's
more work than using Pig, probably both in the long term and in the
short term. Depending on how reliable the sources of data are that
you're dealing with, there may be a need (now or in the future) to
intercept invalid records and write them elsewhere, or similar things.
Things like that are certainly possible to write in raw MapReduce, but
they're much easier with Pig. Using the Phoenix Pig plugin essentially
works the same as writing to the JDBC driver in MR, but means you just
need to create a Pig script with somewhere around 5 lines of code, as
opposed to the work involved in writing and packaging a MR ingester.
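
As a small illustration of the invalid-record handling I mentioned
above (again, the field names, the regex and the paths are invented),
something like this is usually all it takes in Pig:

    -- route records whose timestamp doesn't look parseable to a reject file
    SPLIT raw INTO good IF (ts_raw MATCHES '\\d{4}-\\d{2}-\\d{2} .*'),
                   bad OTHERWISE;
    STORE bad INTO '/data/rejects' USING PigStorage(',');
    -- then feed 'good' (rather than 'raw') into the FOREACH / Phoenix
    -- STORE from the sketch above

Doing the equivalent in a hand-written MR job means wiring up
MultipleOutputs and a fair amount of extra plumbing yourself.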

I typically do quite a bit of ETL stuff with raw MapReduce and/or
Crunch [1], but for things like what you're describing, I'll always go
for Pig if possible.

If you find that Pig won't cut it for your requirements, then writing
via JDBC in a MapReduce job should be fine. Ravi Magham has been
working on some stuff [2] to make writing to Phoenix via MR easier,
and although it's not in Phoenix yet, it should provide you with some
useful pointers in the meantime.

- Gabriel


> Lots of questions - thanks for your time.
> Ralph
> __________________________________________________
> Ralph Perko
> Pacific Northwest National Laboratory
> On 10/9/14, 11:17 AM, "Gabriel Reid" <> wrote:
>>Hi Ralph,
>>I think this depends a bit on how quickly you want to get the data
>>into Phoenix after it arrives, what kind of infrastructure you've got
>>available to run MR jobs, and how the data is actually arriving.
>>In general, the highest-throughput and least-flexible option is the
>>CsvBulkLoadTool. Of your transformation requirements above, it'll be
>>able to take care of #1 and #4, but parsing fields and combining
>>fields won't be covered by it. The main reason for the relatively high
>>throughput of the CsvBulkLoadTool is that it writes directly to HFiles
>>-- however, it is pretty high latency, and it would probably be best
>>used to do one or two big loads each day instead of every 15 minutes.
>>Two other options that are probably worth considering are Pig and
>>Flume. I believe Pig should be able to provide enough transformation
>>logic for what you need, and then you can plug it into the Phoenix
>>StoreFunc for Pig [1]. Although this won't give you the same
>>throughput as the CsvBulkLoadTool, it'll be more flexible, as well as
>>probably having slightly lower overall latency because there is no
>>reduce involved.
>>I don't think that Flume itself provides much in the way of
>>transformation logic, but with Kite Morphlines [2] you can plug in
>>some transformation logic within Flume, and then send the data through
>>to the Phoenix Flume plugin [3]. I haven't got much experience with
>>Flume, but I believe that this should work in theory.
>>In any case, I would suggest trying to go with Pig first, and Flume
>>second. A custom solution will mean you'll need to worry about
>>scaling/parallelization to get high enough throughput, and both Pig
>>and Flume are better suited to what you're looking for.
>>Extending the CsvBulkLoadTool would also be an option, but I would
>>recommend using that as a last resort (if you can't get high enough
>>throughput with the other options).
>>- Gabriel
>>On Thu, Oct 9, 2014 at 4:36 PM, Perko, Ralph J <> wrote:
>>> Hi,  What is the best way to ingest large amounts of csv data coming in at
>>> regular intervals (about every 15min for a total of about 500G/daily or
>>> records/daily) that requires a few transformations before being loaded?
>>> By transformation I mean the following:
>>> 1) 1 field is converted to a timestamp
>>> 2) 1 field is parsed to create a new field
>>> 3) several fields are combined into 1
>>> 4) a couple columns need to be reordered
>>> Is there any way to make these transformations through the bulk load
>>> tool or
>>> is MR the best route?
>>> If I use MR should I go purely through JDBC? Write directly to HBase?
>>> Write something similar to the csv bulk load tool (perhaps even just
>>> extending the CsvBulkLoadTool?) or something altogether different?
>>> Thanks!
>>> Ralph
>>> __________________________________________________
>>> Ralph Perko
>>> Pacific Northwest National Laboratory
