phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: Get region for row key
Date Tue, 12 Jul 2016 17:03:12 GMT
If our Hive support would solve your use case, perhaps you could look into
supporting Hive 0.13-1. I'm not sure of the level of effort, as others
contributed this integration. How about filing a JIRA to discuss?

We'll have an RC up for 4.8 in the next day or so.

Thanks,
James

On Tue, Jul 12, 2016 at 6:31 PM, Simon Wang <simon.wang@airbnb.com> wrote:

> Hi James,
>
> Sorry if I wasn’t clear enough. One example use case is:
> 1. load a Hive data frame,
> 2. repartition (using default hash function),
> 3. forEachPartition batch query the rows against Phoenix.
>
> This process is a bit slow. We figured that it might have something to do
> with each Spark executor accessing too many regions. If we can repartition
> according to the region each row will be in, we should see a performance
> improvement.
>
> Yes, I am aware of the Phoenix-Hive integration, and actually tried to use
> it. Sadly we are running Hive 0.13-1, and it doesn’t seem that we are moving to
> 1.2.0+ any time soon. It would be great if there were a 0.13-1
> compatible version.
>
> By the way, is there any target release date for 4.8?
>
> Thanks,
> Simon
>
>
> On Jul 12, 2016, at 12:28 AM, James Taylor <jamestaylor@apache.org> wrote:
>
> Hi Simon,
>
> I still don't understand the use case completely. Also, did you know
> Phoenix has Hive integration now (as of 4.8)? Would it be possible for you
> to try using that? My initial impression is that you're dipping down to too
> low a level here, using many non-public APIs which may change in
> incompatible ways in future releases.
>
> Thanks,
> James
>
> On Tue, Jul 12, 2016 at 7:14 AM, Simon Wang <simon.wang@airbnb.com> wrote:
>
>> As I read more Phoenix code, I feel that I should do:
>>
>> 1. Use `PhoenixRuntime.getTable` to get a `PTable`
>> 2. Use `table.getPKColumns` to get a list of `PColumn`s
>> 3. For each column, use `column.getDataType`; then
>> `dataType.toBytes(value, column.getSortOrder)`
>> 4. Finally, create a new `ImmutableBytesPtr`, and do `table.newKey(ptr,
>> pksByteArray)`
>> 5. Eventually, get salted key as `SaltingUtil.getSaltedKey(ptr,
>> table.getBucketNum())`
>>
>> I appreciate anyone that can help me check this is correct. :)
>>
>> Thanks a lot!
>>
>> Best,
>> Simon
>>
>>
>>
>> On Jul 10, 2016, at 4:24 PM, Simon Wang <simon.wang@airbnb.com> wrote:
>>
>> About the use case:
>>
>> We want to do JDBC queries for each row in a Hive partition. Currently,
>> we use Spark to partition the Hive dataFrame, then do batch query in
>> foreachPartition. Since each partition is accessing multiple regionservers,
>> there is a lot of overhead. So we are thinking about partitioning the
>> dataFrame according to the HBase region.
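Once each row's byte key is built, mapping it to a region reduces to finding the last region start key that is less than or equal to the key (region i covers the range [startKeys[i], startKeys[i+1])). A minimal sketch, assuming the region start keys have already been fetched up front from the HBase client and sorted, with the first region's start key being the empty byte array:

```java
public class RegionLookup {
    // Return the index of the region whose range contains rowKey.
    // startKeys must be sorted; startKeys[0] is the empty byte array
    // (the first region has no lower bound).
    static int regionIndexFor(byte[] rowKey, byte[][] startKeys) {
        int idx = 0;
        for (int i = 0; i < startKeys.length; i++) {
            if (compare(rowKey, startKeys[i]) >= 0) {
                idx = i;          // rowKey is at or past this region's start
            } else {
                break;            // past the containing region; stop
            }
        }
        return idx;
    }

    // Unsigned lexicographic comparison, matching HBase's row key ordering.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[][] startKeys = { {}, {0x40}, {(byte) 0x80} };
        System.out.println(regionIndexFor(new byte[]{0x41}, startKeys));
    }
}
```

The region index can then be used as the Spark partition key, so that each partition only talks to one regionserver.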
>>
>> Any help is appreciated!
>>
>> Best,
>> Simon
>>
>> On Jul 10, 2016, at 2:01 PM, Simon Wang <simon.wang@airbnb.com> wrote:
>>
>> Hi all,
>>
>> Happy weekend!
>>
>> I am writing to ask if there is a way that I can get the region number for
>> any given row key.
>>
>> For the case where salting is applied, I discovered the `
>> SaltingUtil.getSaltedKey` method, but I am not sure how I can
>> serialize the key as an `ImmutableBytesWritable`.
>>
>> In general, how should the client get the region number? Assume that
>> the client has no prior knowledge of the table. So the client needs to
>> read the metadata (salted or not, SPLIT ON or not), serialize the key, compare
>> it with the splits, etc.
>>
>> Thanks in advance!
>>
>>
>> Best,
>> Simon
>>
>>
>>
>>
>
>
