phoenix-user mailing list archives

From Simon Wang <simon.w...@airbnb.com>
Subject Re: Get region for row key
Date Tue, 12 Jul 2016 18:22:36 GMT
Thanks James. I will file a JIRA. Actually, I spent some time a few days ago trying to make Phoenix-Hive compatible with Hive 0.13-1, but it did not seem easy. So until 0.13-1 is supported, we may want to use the workflow I proposed as a temporary workaround for 4.7. It would be great to know if these steps look good to you.
 
>> 1. Use `PhoenixRuntime.getTable` to get a `PTable`
>> 2. Use `table.getPKColumns` to get a list of `PColumn`s
>> 3. For each column, use `column.getDataType`; then `dataType.toBytes(value, column.getSortOrder)`
>> 4. Finally, create a new `ImmutableBytesPtr`, and do `table.newKey(ptr, pksByteArray)`
>> 5. Eventually, get salted key as `SaltingUtil.getSaltedKey(ptr, table.getBucketNum())`
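If it helps to see what step 5 computes: as far as I can tell from the SaltingUtil source, the salt byte prepended to the row key is a 31-based rolling hash of the key bytes, taken mod the bucket count. A self-contained sketch; the class and method names here are mine, and the hash is my reading of the 4.7 code, so treat it as illustrative rather than as the Phoenix API:

```java
public class SaltSketch {
    // 31-based rolling hash over the key bytes -- my reading of what
    // Phoenix's SaltingUtil computes internally before taking the modulus.
    static int keyHash(byte[] key) {
        int result = 1;
        for (byte b : key) {
            result = 31 * result + b;
        }
        return result;
    }

    // Salt byte that gets prepended to the row key: abs(hash % bucketNum),
    // so it always falls in [0, bucketNum).
    static byte saltingByte(byte[] key, int bucketNum) {
        return (byte) Math.abs(keyHash(key) % bucketNum);
    }

    public static void main(String[] args) {
        byte[] rowKey = "user-123".getBytes();
        // The same key always maps to the same bucket in [0, 8).
        System.out.println(saltingByte(rowKey, 8));
    }
}
```

The point of the deterministic hash is that both the writer and any later reader compute the same salt byte, which is why a client-side repartitioner can predict which bucket (and hence which region range) a row lands in.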


Best,
Simon

> On Jul 12, 2016, at 10:03 AM, James Taylor <jamestaylor@apache.org> wrote:
> 
> If our Hive support would solve your use case, perhaps you could look into supporting Hive 0.13-1. I'm not sure of the level of effort, as others contributed this integration. How about filing a JIRA to discuss?
> 
> We'll have an RC up for 4.8 in the next day or so.
> 
> Thanks,
> James
> 
> On Tue, Jul 12, 2016 at 6:31 PM, Simon Wang <simon.wang@airbnb.com> wrote:
> Hi James,
> 
> Sorry if I wasn’t clear enough. One example use case is:
> 1. load a Hive data frame,
> 2. repartition (using the default hash function),
> 3. batch query the rows against Phoenix in foreachPartition.
> 
> This process is a bit slow. We figured that it might have something to do with each Spark executor accessing too many regions. If we can repartition according to the region each row will be in, we should see a performance improvement.
> 
> Yes, I am aware of the Phoenix-Hive integration, and actually tried to use it. Sadly, we are running Hive 0.13-1, and it doesn’t seem that we are moving to 1.2.0+ any time soon. It would be great if there were a 0.13-1 compatible version.
> 
> By the way, is there any target release date for 4.8?
> 
> Thanks,
> Simon
> 
> 
>> On Jul 12, 2016, at 12:28 AM, James Taylor <jamestaylor@apache.org> wrote:
>> 
>> Hi Simon,
>> 
>> I still don't understand the use case completely. Also, did you know Phoenix has Hive integration now (as of 4.8)? Would it be possible for you to try using that? My initial impression is that you're dipping down to too low a level here, using many non-public APIs which may change in incompatible ways in future releases.
>> 
>> Thanks,
>> James
>> 
>> On Tue, Jul 12, 2016 at 7:14 AM, Simon Wang <simon.wang@airbnb.com> wrote:
>> As I read more Phoenix code, I feel that I should do:
>> 
>> 1. Use `PhoenixRuntime.getTable` to get a `PTable`
>> 2. Use `table.getPKColumns` to get a list of `PColumn`s
>> 3. For each column, use `column.getDataType`; then `dataType.toBytes(value, column.getSortOrder)`
>> 4. Finally, create a new `ImmutableBytesPtr`, and do `table.newKey(ptr, pksByteArray)`
>> 5. Eventually, get salted key as `SaltingUtil.getSaltedKey(ptr, table.getBucketNum())`
>> 
>> I appreciate anyone that can help me check this is correct. :)
>> 
>> Thanks a lot!
>> 
>> Best,
>> Simon
>> 
>> 
>> 
>>> On Jul 10, 2016, at 4:24 PM, Simon Wang <simon.wang@airbnb.com> wrote:
>>> 
>>> About the use case:
>>> 
>>> We want to do JDBC queries for each row in a Hive partition. Currently, we use Spark to partition the Hive dataFrame, then do batch queries in foreachPartition. Since each partition is accessing multiple regionservers, there is a lot of overhead. So we are thinking about partitioning the dataFrame according to the HBase region.
>>> 
>>> Any help is appreciated!
>>> 
>>> Best,
>>> Simon
>>> 
>>>> On Jul 10, 2016, at 2:01 PM, Simon Wang <simon.wang@airbnb.com> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> Happy weekend!
>>>> 
>>>> I am writing to ask if there is a way to get the region number for any given row key.
>>>> 
>>>> For the case where salting is applied, I discovered the `SaltingUtil.getSaltedKey` method, but I am not sure how I can serialize the key as an `ImmutableBytesWritable`.
>>>> 
>>>> In general, how should the client get the region number? Assuming that the client has no prior knowledge of the table, it needs to read the metadata (salted or not, SPLIT ON or not), serialize the key, compare with splits, etc.
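The "compare with splits" part of the question above boils down to a binary search over the sorted region start keys, which the client can fetch from the HBase API (e.g. HTable#getStartKeys). A self-contained sketch with made-up split points; the class, method names, and split points here are mine for illustration, not Phoenix or HBase APIs:

```java
public class RegionLookup {
    // Unsigned lexicographic comparison, the same ordering HBase uses
    // for row keys (equivalent to Bytes.compareTo in the HBase client).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }

    // startKeys must be sorted, with startKeys[0] being the empty start key
    // of the first region. Returns the index of the region holding rowKey:
    // the largest i such that startKeys[i] <= rowKey.
    static int regionIndex(byte[][] startKeys, byte[] rowKey) {
        int lo = 0, hi = startKeys.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1; // round up so lo always advances
            if (compareUnsigned(startKeys[mid], rowKey) <= 0) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Hypothetical split points: three regions split at 0x40 and 0x80.
        byte[][] startKeys = { new byte[0], {0x40}, {(byte) 0x80} };
        System.out.println(regionIndex(startKeys, new byte[] {0x50})); // prints 1
    }
}
```

For a salted table, the row key compared here would be the salted key (salt byte prepended) from the earlier steps; the resulting index could then serve directly as the Spark partition id.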
>>>> 
>>>> Thanks in advance!
>>>> 
>>>> 
>>>> Best,
>>>> Simon
>>>> 
>>> 
>> 
>> 
> 
> 

