Phoenix is able to parallelize queries based on the underlying HBase region splits, as well as its own internal guideposts based on statistics collection [1]

The phoenix-spark connector exposes those splits to Spark for the RDD / DataFrame parallelism. In order to test this out, you can try run an EXPLAIN SELECT... query [2] to mimic the DataFrame load to see how many parallel scans will be run, and then compare those to the RDD / DataFrame partition count (some_rdd.partitions.size). In Phoenix 4.10 and above [3], they will be the same. In versions below that, the partition count will equal the number of regions for that table.


[1] https://phoenix.apache.org/update_statistics.html
[2] https://phoenix.apache.org/tuning_guide.html
[3] https://issues.apache.org/jira/browse/PHOENIX-3600

On Thu, Aug 17, 2017 at 3:07 AM, Kanagha <er.kanagha@gmail.com> wrote:
Also, I'm using phoenixTableAsDataFrame API to read from a pre-split phoenix table. How can we ensure read is parallelized across all executors? Would salting/pre-splitting tables help in providing parallelism? Appreciate any inputs.



On Wed, Aug 16, 2017 at 10:16 PM, kanagha <er.kanagha@gmail.com> wrote:
Hi Josh,

Per your previous post, it is mentioned "The phoenix-spark parallelism is
based on the splits provided by the Phoenix query planner, and has no
requirements on specifying partition columns or upper/lower bounds."

Does it depend upon the region splits on the input table for parallelism?
Could you please provide more details?


View this message in context: http://apache-phoenix-user-list.1124778.n5.nabble.com/phoenix-spark-options-not-supporint-query-in-dbtable-tp1915p3810.html
Sent from the Apache Phoenix User List mailing list archive at Nabble.com.