phoenix-user mailing list archives

From Bob Dole <w71...@yahoo.com>
Subject Re: bulk loading with dynamic columns
Date Fri, 17 Oct 2014 09:51:13 GMT
An update on my progress and troubles!

My understanding of JDBC is that there are two ways to batch queries: Statement#addBatch and
PreparedStatement#addBatch. I would prefer to use PreparedStatement as it frees me from having
to escape the SQL string and is probably faster.


When using a PreparedStatement, every batched statement must conform to the declared shape
of the PreparedStatement. This is where the dynamic columns cause trouble: in my use case the
dynamic columns carry semantic meaning, which leads to a high cardinality of dynamic-column
sets. For the sake of discussion, we can treat each row as having zero or more unique dynamic
columns. This forces me to create a new PreparedStatement per row, preventing batching.
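To make the batching problem concrete, here is a minimal sketch of the kind of grouping I have in mind: bucket rows by their exact set of dynamic columns, so that every bucket could share a single PreparedStatement. The class and method names are my own invention, not Phoenix or JDBC API, and rows are modeled simply as maps from dynamic column name to value:

```java
import java.util.*;

// Hypothetical sketch (my own names, not Phoenix API): group rows by their
// exact set of dynamic columns so each group can share one PreparedStatement
// and be batched together.
public class DynamicColumnGrouper {

    // Each row is modeled as a map from dynamic column name to value.
    // The grouping key is the row's column names in a canonical sorted order,
    // so {a, b} and {b, a} land in the same group.
    public static Map<List<String>, List<Map<String, Object>>> groupByColumns(
            List<Map<String, Object>> rows) {
        Map<List<String>, List<Map<String, Object>>> groups = new HashMap<>();
        for (Map<String, Object> row : rows) {
            List<String> key = new ArrayList<>(row.keySet());
            Collections.sort(key); // canonical order for the grouping key
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }
}
```

Each resulting group would then need its own statement text, since the dynamic column names differ between groups.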


I was hoping to alleviate this problem by grouping my rows by the cardinality of their dynamic
columns. As an example of batching with this approach, consider the following PreparedStatement
for a cardinality of one:

UPSERT INTO foo(? VARCHAR) VALUES(?, ?, ?)

I would then use the PreparedStatement to set the name of the dynamic column. Unfortunately,
this fails with ERROR 601 (42P00): Syntax error. Encountered "?" at line 1, column 17. Is this
a JDBC restriction or a Phoenix one? Would it be difficult to enable such a statement?


I suppose I'll try Statement#addBatch next. I am a bit concerned about performance here:
my intuition is that a Statement is slower than a PreparedStatement, and hopefully I won't
run into too many troubles with escaping.
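For the Statement#addBatch route, the statement text would have to carry the dynamic column names and literal values directly. A rough sketch of the string building I'd need, with single quotes escaped by doubling (the helper names, the PK column name, and the assumption that all dynamic columns are VARCHAR are mine, not anything from Phoenix):

```java
import java.util.List;

// Hypothetical helper (my own, not from Phoenix): build one UPSERT string
// with dynamic column declarations and escaped literal values, suitable for
// Statement#addBatch.
public class UpsertBuilder {

    // Escape a string for use as a SQL single-quoted literal by doubling
    // any embedded single quotes.
    public static String quote(String s) {
        return "'" + s.replace("'", "''") + "'";
    }

    // table: target table; pk: the row key value; cols: dynamic column names
    // (all assumed VARCHAR in this sketch); values: the corresponding values.
    public static String buildUpsert(String table, String pk,
                                     List<String> cols, List<String> values) {
        StringBuilder sql = new StringBuilder("UPSERT INTO ")
                .append(table).append("(PK");
        for (String c : cols) {
            sql.append(", ").append(c).append(" VARCHAR");
        }
        sql.append(") VALUES(").append(quote(pk));
        for (String v : values) {
            sql.append(", ").append(quote(v));
        }
        return sql.append(")").toString();
    }
}
```

This sidesteps the parameter-marker restriction at the cost of per-row string construction and hand-rolled escaping, which is exactly the overhead I was hoping PreparedStatement would avoid.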


On another note, can you comment on whether there are any performance issues with so many
dynamic columns? I'm guessing these become qualifiers?

-Bob








On Thursday, October 16, 2014 11:54 PM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
 


Bob,

Going via Scalding sounds like a fine idea as well -- the advantage of
using Pig is that you wouldn't need to implement anything custom in
terms of JDBC handling (because it already exists), but indeed I would
expect that you'll get comparable performance with Scalding.

If you want to generate HFiles, I would indeed look at extending (or
reusing parts of) the CsvBulkLoadTool, which currently creates HFiles.
However, I would definitely only go this way as a fallback if JDBC
performance isn't sufficient.

I actually never really considered a CSV file being dynamic, I was
more thinking along the lines of loading CSV files with different
schemas into the same table (via dynamic columns). If it's at all an
option, I would suggest splitting out records by schema first in a
pre-processing stage, and then loading the collection of files that
match a single schema together. CSV is a fine format for really simple
schemas, but I don't think it would be at all suited to storing
records with different schemas.

- Gabriel


On Fri, Oct 17, 2014 at 8:27 AM, Bob Dole <w71223@yahoo.com> wrote:
> Gabriel,
>
> Thanks for your response. My current plan is to implement the bulk load
> using scalding via jdbc. I have not played with Pig, but, my guess is my
> scalding solution will achieve comparable performance.
>
> I haven't done a performance test yet, but, if it turns out that loading via
> jdbc is too slow, I would need to generate the HFiles.
>
> I would be interested in your thoughts on how you'd approach generating
> hfiles. Would you extend the csv bulk loader? How would you represent
> dynamic columns in a csv? A general solution is also further complicated by
> the fact that a dynamic column may have heterogeneous types.
>
> -Bob
>
> On Thursday, October 16, 2014 12:24 AM, Gabriel Reid
> <gabriel.reid@gmail.com> wrote:
>
>
> Hi Bob,
>
> No, there currently isn't any support for bulk loading dynamic columns.
>
> I think that this would (in theory) be as simple as supplying a custom
> upsert statement to the bulk loader or PhoenixHBaseStorage (if you're
> using Pig), so it probably wouldn't be too tricky to implement.
>
> If you're interested in having something like this in Phoenix, could
> you log a ticket for it at
> https://issues.apache.org/jira/browse/PHOENIX? If you're interested in
> taking a crack at implementing it as well, feel free (as well as
> feeling free to ask for advice on how to go about it).
>
> - Gabriel
>
>
> On Thu, Oct 16, 2014 at 7:58 AM, Bob Dole <w71223@yahoo.com> wrote:
>> Is there any existing support to perform bulk loading with dynamic columns?
>>
>> Thanks!
>
>