phoenix-user mailing list archives

From Kanagha <er.kana...@gmail.com>
Subject Re: phoenix spark options not supporint query in dbtable
Date Thu, 17 Aug 2017 18:36:30 GMT
Thanks for the details.

I tested this and saw that the number of partitions equals the number of
parallel scans run on DataFrame load in Phoenix 4.10.
Also, how can we ensure that the read is evenly distributed as tasks
across the executors configured for the job? I'm running the
phoenixTableAsDataFrame API on a table with 4-way parallel scans, with
4 executors set for the job. Thanks for the inputs.
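
One rough way to reason about the distribution question is to count task slots. Spark runs one task per partition, and each executor can run at most spark.executor.cores / spark.task.cpus tasks at a time. The sketch below is plain Python arithmetic, not a Phoenix or Spark API. The function names are hypothetical, and it assumes every executor has the same number of cores. It ignores data locality and scheduler behavior, so it only bounds what is possible and does not predict actual task placement.

```python
import math


def slots_per_executor(cores_per_executor, cpus_per_task=1):
    """Tasks one executor can run at once (hypothetical helper)."""
    slots = cores_per_executor // cpus_per_task
    if slots < 1:
        raise ValueError("an executor needs at least cpus_per_task cores")
    return slots


def max_concurrent_tasks(num_executors, cores_per_executor, cpus_per_task=1):
    """Upper bound on tasks running at the same time across the job."""
    return num_executors * slots_per_executor(cores_per_executor, cpus_per_task)


def task_waves(num_partitions, num_executors, cores_per_executor, cpus_per_task=1):
    """Number of rounds needed to run one task per partition."""
    total_slots = max_concurrent_tasks(num_executors, cores_per_executor, cpus_per_task)
    return math.ceil(num_partitions / total_slots)


def min_executors_to_hold(num_partitions, cores_per_executor, cpus_per_task=1):
    """Fewest executors that could host all tasks at once.

    If this is smaller than the executor count, the slot arithmetic alone
    does not force the tasks to spread over every executor.
    """
    return math.ceil(num_partitions / slots_per_executor(cores_per_executor, cpus_per_task))


if __name__ == "__main__":
    # The case from this message: 4 partitions, 4 executors.
    # One core per executor is an assumption, not stated in the thread.
    print(max_concurrent_tasks(4, 1))
    print(task_waves(4, 4, 1))
    print(min_executors_to_hold(4, 2))
```

With 4 partitions and 4 single-core executors there are 4 slots, so all four scans can run in one wave with at most one task per executor. With more cores per executor, the same 4 tasks could fit on fewer executors, so the slot count alone does not guarantee an even spread.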


Kanagha

On Thu, Aug 17, 2017 at 7:17 AM, Josh Mahonin <jmahonin@gmail.com> wrote:

> Hi,
>
> Phoenix is able to parallelize queries based on the underlying HBase
> region splits, as well as its own internal guideposts based on statistics
> collection [1].
>
> The phoenix-spark connector exposes those splits to Spark for the RDD /
> DataFrame parallelism. To test this out, you can try running an
> EXPLAIN SELECT... query [2] to mimic the DataFrame load to see how many
> parallel scans will be run, and then compare those to the RDD / DataFrame
> partition count (some_rdd.partitions.size). In Phoenix 4.10 and above [3],
> they will be the same. In versions below that, the partition count will
> equal the number of regions for that table.
>
> Josh
>
> [1] https://phoenix.apache.org/update_statistics.html
> [2] https://phoenix.apache.org/tuning_guide.html
> [3] https://issues.apache.org/jira/browse/PHOENIX-3600
>
>
> On Thu, Aug 17, 2017 at 3:07 AM, Kanagha <er.kanagha@gmail.com> wrote:
>
>> Also, I'm using the phoenixTableAsDataFrame API to read from a pre-split
>> Phoenix table. How can we ensure the read is parallelized across all
>> executors? Would salting/pre-splitting tables help in providing parallelism?
>> Appreciate any inputs.
>>
>> Thanks
>>
>>
>> Kanagha
>>
>> On Wed, Aug 16, 2017 at 10:16 PM, kanagha <er.kanagha@gmail.com> wrote:
>>
>>> Hi Josh,
>>>
>>> Your previous post mentions: "The phoenix-spark parallelism is
>>> based on the splits provided by the Phoenix query planner, and has no
>>> requirements on specifying partition columns or upper/lower bounds."
>>>
>>> Does it depend upon the region splits on the input table for parallelism?
>>> Could you please provide more details?
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>
>>
>
