Thanks for the details.

I tested out and saw that the no.of partitions equals to the no.of parallel scans run upon DataFrame load in phoenix 4.10.
Also, how can we ensure that the read gets evenly distributed as tasks across the no.of executors set for the job? I'm running phoenixTableAsDataFrame API on a table with 4-way parallel scans and with 4 executors set for the job. Thanks for the inputs.


On Thu, Aug 17, 2017 at 7:17 AM, Josh Mahonin <> wrote:

Phoenix is able to parallelize queries based on the underlying HBase region splits, as well as its own internal guideposts based on statistics collection [1]

The phoenix-spark connector exposes those splits to Spark for the RDD / DataFrame parallelism. In order to test this out, you can try run an EXPLAIN SELECT... query [2] to mimic the DataFrame load to see how many parallel scans will be run, and then compare those to the RDD / DataFrame partition count (some_rdd.partitions.size). In Phoenix 4.10 and above [3], they will be the same. In versions below that, the partition count will equal the number of regions for that table.



On Thu, Aug 17, 2017 at 3:07 AM, Kanagha <> wrote:
Also, I'm using phoenixTableAsDataFrame API to read from a pre-split phoenix table. How can we ensure read is parallelized across all executors? Would salting/pre-splitting tables help in providing parallelism? Appreciate any inputs.



On Wed, Aug 16, 2017 at 10:16 PM, kanagha <> wrote:
Hi Josh,

Per your previous post, it is mentioned "The phoenix-spark parallelism is
based on the splits provided by the Phoenix query planner, and has no
requirements on specifying partition columns or upper/lower bounds."

Does it depend upon the region splits on the input table for parallelism?
Could you please provide more details?


View this message in context:
Sent from the Apache Phoenix User List mailing list archive at