phoenix-user mailing list archives

From Simon Wang <simon.w...@airbnb.com>
Subject Re: Get region for row key
Date Tue, 12 Jul 2016 16:31:25 GMT
Hi James,

Sorry if I wasn’t clear enough. One example use case is: 
1. load a Hive data frame, 
2. repartition (using default hash function), 
3. in foreachPartition, batch query the rows against Phoenix. 

This process is a bit slow. We figured that it might have to do with each Spark executor
accessing too many regions. If we can repartition according to the region each row will land
in, we should see a performance improvement.
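
For example, this is roughly the kind of thing we have in mind: a Spark partitioner that maps
each serialized row key to the index of the HBase region containing it. Just a sketch, assuming
the HBase 1.x client API and that the RDD is keyed by the serialized row key; the table name is
a placeholder.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.Partitioner;

public class RegionPartitioner extends Partitioner {
    // Sorted region start keys, fetched once on the driver.
    private final byte[][] startKeys;

    public RegionPartitioner(String tableName) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf(tableName))) {
            this.startKeys = locator.getStartKeys();
        }
    }

    @Override
    public int numPartitions() {
        return startKeys.length;
    }

    @Override
    public int getPartition(Object key) {
        byte[] rowKey = (byte[]) key;  // assumes the RDD key is the serialized row key
        // The containing region is the last one whose start key is <= rowKey.
        int idx = 0;
        for (int i = 0; i < startKeys.length; i++) {
            if (Bytes.compareTo(startKeys[i], rowKey) <= 0) {
                idx = i;
            } else {
                break;
            }
        }
        return idx;
    }
}

We would then partition the pair RDD of (serialized row key, row) with this partitioner before
the foreachPartition step, so each task mostly talks to a single region server.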

Yes, I am aware of the Phoenix-Hive integration, and actually tried to use it. Sadly we are
running Hive 0.13-1. It doesn’t seem that we will be moving to 1.2.0+ any time soon. It would
be great if there were a 0.13-1 compatible version.

By the way, is there any target release date for 4.8?

Thanks,
Simon


> On Jul 12, 2016, at 12:28 AM, James Taylor <jamestaylor@apache.org> wrote:
> 
> Hi Simon,
> 
> I still don't understand the use case completely. Also, did you know Phoenix has Hive
> integration now (as of 4.8)? Would it be possible for you to try using that? My initial
> impression is that you're dipping down to too low a level here, using many non-public APIs
> which may change in incompatible ways in future releases.
> 
> Thanks,
> James
> 
> On Tue, Jul 12, 2016 at 7:14 AM, Simon Wang <simon.wang@airbnb.com> wrote:
> As I read more Phoenix code, I feel that I should do:
> 
> 1. Use `PhoenixRuntime.getTable` to get a `PTable`
> 2. Use `table.getPKColumns` to get a list of `PColumn`s
> 3. For each column, use `column.getDataType`; then `dataType.toBytes(value, column.getSortOrder)`
> 4. Then create a new `ImmutableBytesPtr`, and do `table.newKey(ptr, pksByteArray)`
> 5. Finally, if the table is salted, get the salted key as `SaltingUtil.getSaltedKey(ptr, table.getBucketNum())`
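> 
> Putting those steps together, here is roughly what I have in mind (untested, just a sketch;
> the salted-table handling is where I am least sure, so please correct me):
> 
> import java.sql.Connection;
> import java.sql.SQLException;
> import java.util.List;
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> import org.apache.phoenix.hbase.index.util.ImmutableBytesPtr;
> import org.apache.phoenix.schema.PColumn;
> import org.apache.phoenix.schema.PTable;
> import org.apache.phoenix.schema.SaltingUtil;
> import org.apache.phoenix.util.PhoenixRuntime;
> 
> static byte[] rowKeyFor(Connection conn, String tableName, Object[] pkValues) throws SQLException {
>     PTable table = PhoenixRuntime.getTable(conn, tableName);            // step 1
>     List<PColumn> pkColumns = table.getPKColumns();                     // step 2
>     // If the table is salted, getPKColumns() seems to include the salt column at
>     // position 0, so skip it (please confirm this is right).
>     int offset = table.getBucketNum() == null ? 0 : 1;
>     byte[][] pksByteArray = new byte[pkValues.length][];
>     for (int i = 0; i < pkValues.length; i++) {                         // step 3
>         PColumn col = pkColumns.get(i + offset);
>         pksByteArray[i] = col.getDataType().toBytes(pkValues[i], col.getSortOrder());
>     }
>     ImmutableBytesWritable ptr = new ImmutableBytesPtr();               // step 4
>     table.newKey(ptr, pksByteArray);
>     if (table.getBucketNum() == null) {
>         return ptr.copyBytes();                                         // not salted
>     }
>     return SaltingUtil.getSaltedKey(ptr, table.getBucketNum());         // step 5
> }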
> 
> I'd appreciate it if anyone could help me check whether this is correct. :)
> 
> Thanks a lot!
> 
> Best,
> Simon
> 
> 
> 
>> On Jul 10, 2016, at 4:24 PM, Simon Wang <simon.wang@airbnb.com> wrote:
>> 
>> About the use case:
>> 
>> We want to do JDBC queries for each row in a Hive partition. Currently, we use Spark to
>> partition the Hive DataFrame, then do a batch query in foreachPartition. Since each partition
>> accesses multiple region servers, there is a lot of overhead. So we are thinking about
>> partitioning the DataFrame according to the HBase region.
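>> 
>> Concretely, what we do today looks roughly like this (simplified; the JDBC URL, table, and
>> column names are just placeholders):
>> 
>> import java.sql.Connection;
>> import java.sql.DriverManager;
>> import java.sql.PreparedStatement;
>> import java.sql.ResultSet;
>> import org.apache.spark.sql.DataFrame;
>> import org.apache.spark.sql.Row;
>> 
>> void queryEachRow(DataFrame df) {
>>     df.toJavaRDD().foreachPartition(rows -> {
>>         // One Phoenix JDBC connection per Spark partition.
>>         try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
>>              PreparedStatement stmt = conn.prepareStatement(
>>                  "SELECT VAL FROM MY_TABLE WHERE ID = ?")) {
>>             while (rows.hasNext()) {
>>                 Row row = rows.next();
>>                 stmt.setString(1, row.getString(0));
>>                 try (ResultSet rs = stmt.executeQuery()) {
>>                     // Consume the result; each lookup may land on a different region server.
>>                 }
>>             }
>>         }
>>     });
>> }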
>> 
>> Any help is appreciated!
>> 
>> Best,
>> Simon
>> 
>>> On Jul 10, 2016, at 2:01 PM, Simon Wang <simon.wang@airbnb.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> Happy weekend!
>>> 
>>> I am writing to ask if there is a way to get the region number for any given row key.
>>> 
>>> For the case where salting is applied, I discovered the `SaltingUtil.getSaltedKey`
>>> method, but I am not sure how I can serialize the key as an `ImmutableBytesWritable`.
>>> 
>>> In general, how should the client get the region number? Assume that the client has no
>>> prior knowledge of the table, so it needs to read the table metadata (salted or not, SPLIT
>>> ON or not), serialize the key, compare it against the splits, etc.
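>>> 
>>> For a single key, I assume the plain HBase client can already do the lookup, along these
>>> lines (assuming the HBase 1.x API; the part I am missing is how to produce rowKeyBytes for
>>> an arbitrary Phoenix table):
>>> 
>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>> import org.apache.hadoop.hbase.HRegionLocation;
>>> import org.apache.hadoop.hbase.TableName;
>>> import org.apache.hadoop.hbase.client.Connection;
>>> import org.apache.hadoop.hbase.client.ConnectionFactory;
>>> import org.apache.hadoop.hbase.client.RegionLocator;
>>> 
>>> static HRegionLocation locate(String tableName, byte[] rowKeyBytes) throws Exception {
>>>     try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
>>>          RegionLocator locator = conn.getRegionLocator(TableName.valueOf(tableName))) {
>>>         // Returns the region whose [startKey, endKey) range contains rowKeyBytes.
>>>         return locator.getRegionLocation(rowKeyBytes);
>>>     }
>>> }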
>>> 
>>> Thanks in advance!
>>> 
>>> 
>>> Best,
>>> Simon
>>> 
>> 
> 
> 

