phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: bulk loading with dynamic columns
Date Fri, 17 Oct 2014 15:15:53 GMT
Hi Bob,
Yes, you're correct - dynamic columns end up as column qualifiers.
Column names can't be supplied as bind parameters, though. How about
generating the UPSERT statement yourself, and double-quoting your dynamic
column names if you want a case-sensitive match? It would look like this:

UPSERT INTO foo(id, val, "col1" VARCHAR) VALUES(?, ?, ?)

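To batch these, you'd group your rows by the exact set of dynamic column
names (not just their count, since the names get baked into the SQL),
generate one UPSERT per group, and then use PreparedStatement#addBatch as
usual. Here's a minimal sketch of that (id and val stand in for foo's
static columns, and Row is a hypothetical type carrying the dynamic
values as a map):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import java.util.Map;

public class DynamicUpsert {

    // Hypothetical row: two static values plus a map of dynamic
    // column name -> value (all treated as VARCHAR for simplicity).
    public static class Row {
        final String id;
        final String val;
        final Map<String, String> dynamicCols;
        Row(String id, String val, Map<String, String> dynamicCols) {
            this.id = id; this.val = val; this.dynamicCols = dynamicCols;
        }
    }

    // Generate one UPSERT for a group of rows that share the same
    // dynamic column names; double-quoting makes the names case
    // sensitive, and embedded double quotes are doubled to escape them.
    static String buildUpsert(List<String> dynamicCols) {
        StringBuilder cols = new StringBuilder("UPSERT INTO foo(id, val");
        StringBuilder params = new StringBuilder(") VALUES(?, ?");
        for (String name : dynamicCols) {
            cols.append(", \"").append(name.replace("\"", "\"\"")).append("\" VARCHAR");
            params.append(", ?");
        }
        return cols.append(params).append(")").toString();
    }

    static void upsertGroup(Connection conn, List<String> dynamicCols,
                            List<Row> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(buildUpsert(dynamicCols))) {
            for (Row row : rows) {
                ps.setString(1, row.id);
                ps.setString(2, row.val);
                int i = 3;
                for (String name : dynamicCols) {
                    ps.setString(i++, row.dynamicCols.get(name));
                }
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit(); // Phoenix connections don't auto-commit by default
        }
    }
}
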
Thanks,
James

On Fri, Oct 17, 2014 at 2:51 AM, Bob Dole <w71223@yahoo.com> wrote:
> An update on my progress and troubles!
>
> My understanding of JDBC is that there are two ways to batch queries:
> Statement#addBatch and PreparedStatement#addBatch. I would prefer to use
> PreparedStatement as it frees me from having to escape the SQL string and is
> probably faster.
>
> When using PreparedStatement, all batched statements must conform to the
> declared format of the PreparedStatement. I encounter troubles here because
> of the dynamic columns. My use case is such that the dynamic columns have
> semantic meaning in the data, which results in a high cardinality of
> dynamic columns. For the sake of discussion, we can treat each row as
> having zero or more unique dynamic columns. This forces me to create a new
> PreparedStatement per row, preventing batching.
>
> I was hoping to alleviate this problem by grouping my rows by the
> cardinality of the dynamic columns. As an example of batching with this
> approach, consider the following PreparedStatement for a cardinality of
> one:
>
> UPSERT INTO foo(? VARCHAR) VALUES(?, ?, ?)
>
> I would then use the PreparedStatement to set the name of the dynamic
> column. Unfortunately, this fails with ERROR 601 (42P00): Syntax error.
> Encountered "?" at line 1, column 17. Is this a JDBC restriction or a
> Phoenix one?
> Would it be difficult to enable such a statement?
>
> I suppose I'll try to use Statement#addBatch next. I am a bit concerned
> about performance here. My intuition is that Statement is slower than a
> PreparedStatement, and hopefully I won't run into too much trouble with
> escaping.
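>
> Something like the following is what I have in mind -- just a rough
> sketch, assuming foo has static columns id and val, that the driver
> supports Statement#addBatch, and that doubling single quotes is the
> right way to escape string literals:
>
> import java.sql.SQLException;
> import java.sql.Statement;
> import java.util.Map;
>
> public class StatementBatch {
>
>     // My guess at escaping: double any single quote inside a literal.
>     static String escape(String s) {
>         return s.replace("'", "''");
>     }
>
>     // Build the full UPSERT string per row (every row may have a
>     // different set of dynamic columns) and add it to the batch.
>     static void addRow(Statement stmt, String id, String val,
>                        Map<String, String> dynamicCols) throws SQLException {
>         StringBuilder cols = new StringBuilder("UPSERT INTO foo(id, val");
>         StringBuilder vals = new StringBuilder(") VALUES('")
>                 .append(escape(id)).append("', '").append(escape(val)).append("'");
>         for (Map.Entry<String, String> e : dynamicCols.entrySet()) {
>             cols.append(", \"").append(e.getKey().replace("\"", "\"\"")).append("\" VARCHAR");
>             vals.append(", '").append(escape(e.getValue())).append("'");
>         }
>         stmt.addBatch(cols.append(vals).append(")").toString());
>     }
> }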
>
> On another note, can you comment on whether there are any performance issues
> with so many dynamic columns? I'm guessing these become qualifiers?
>
> -Bob
>
> On Thursday, October 16, 2014 11:54 PM, Gabriel Reid
> <gabriel.reid@gmail.com> wrote:
>
>
> Bob,
>
> Going via Scalding sounds like a fine idea as well -- the advantage of
> using Pig is that you wouldn't need to implement anything custom in
> terms of JDBC handling (because it already exists), but indeed I would
> expect that you'll get comparable performance with Scalding.
>
> If you want to generate HFiles, I would indeed look at extending (or
> reusing parts of) the CsvBulkLoadTool, which currently creates HFiles.
> However, I would definitely only go this way as a fallback if JDBC
> performance isn't sufficient.
>
> I actually never really considered a CSV file being dynamic; I was
> thinking more along the lines of loading CSV files with different
> schemas into the same table (via dynamic columns). If it's at all an
> option, I would suggest splitting out records by schema first in a
> pre-processing stage, and then loading the collection of files that
> match a single schema together. CSV is a fine format for really simple
> schemas, but I don't think it would be at all suited to storing
> records with different schemas.
>
> - Gabriel
>
> On Fri, Oct 17, 2014 at 8:27 AM, Bob Dole <w71223@yahoo.com> wrote:
>> Gabriel,
>>
>> Thanks for your response. My current plan is to implement the bulk load
>> using Scalding via JDBC. I have not played with Pig, but my guess is that
>> my Scalding solution will achieve comparable performance.
>>
>> I haven't done a performance test yet, but if it turns out that loading
>> via JDBC is too slow, I would need to generate HFiles.
>>
>> I would be interested in your thoughts on how you'd approach generating
>> HFiles. Would you extend the CSV bulk loader? How would you represent
>> dynamic columns in a CSV? A general solution is also further complicated
>> by the fact that a dynamic column may have heterogeneous types.
>>
>> -Bob
>>
>> On Thursday, October 16, 2014 12:24 AM, Gabriel Reid
>> <gabriel.reid@gmail.com> wrote:
>>
>>
>> Hi Bob,
>>
>> No, there currently isn't any support for bulk loading dynamic columns.
>>
>> I think that this would (in theory) be as simple as supplying a custom
>> upsert statement to the bulk loader or PhoenixHBaseStorage (if you're
>> using Pig), so it probably wouldn't be too tricky to implement.
>>
>> If you're interested in having something like this in Phoenix, could
>> you log a ticket for it at
>> https://issues.apache.org/jira/browse/PHOENIX? If you're interested in
>> taking a crack at implementing it as well, feel free (as well as
>> feeling free to ask for advice on how to go about it).
>>
>> - Gabriel
>>
>>
>> On Thu, Oct 16, 2014 at 7:58 AM, Bob Dole <w71223@yahoo.com> wrote:
>>> Is there any existing support for bulk loading with dynamic columns?
>>>
>>> Thanks!
>>
>>
>
>
