phoenix-user mailing list archives

From "Long, Xindian" <>
Subject How are Dataframes partitioned by default when using spark?
Date Mon, 19 Sep 2016 16:48:35 GMT
How are DataFrames/Datasets/RDDs partitioned by default when using Spark, assuming the
DataFrame/Dataset/RDD is the result of a query like this:

select col1, col2, col3 from table3 where col3 > xxx

I noticed that for HBase, a partitioner partitions the row keys based on region splits. Can
Phoenix do this as well?
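
To make "partitioning row keys based on region splits" concrete, here is a minimal sketch, not
HBase or Phoenix code. It assumes a table's regions are described by a sorted list of split keys,
and that each region covers the half-open range [start key, end key). The names region_index and
partition_by_region and the sample split keys are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

def region_index(row_key, split_keys):
    """Return the index of the region whose [start, end) range contains row_key.

    split_keys must be sorted; N split keys describe N + 1 regions.
    A key equal to a split key belongs to the region that starts at that key.
    """
    return bisect_right(split_keys, row_key)

def partition_by_region(row_keys, split_keys):
    """Group row keys into one bucket (one would-be Spark partition) per region."""
    buckets = defaultdict(list)
    for key in row_keys:
        buckets[region_index(key, split_keys)].append(key)
    return dict(buckets)

# Hypothetical split keys: 3 splits describe 4 regions.
splits = [b"g", b"n", b"t"]
keys = [b"apple", b"g", b"monkey", b"zebra"]
print(partition_by_region(keys, splits))
```

Byte strings are compared lexicographically here, which matches how HBase orders row keys.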

I also read that if I use Spark with the Phoenix JDBC interface, "it's only able to parallelize
queries by partitioning on a numeric column. It also requires a known lower bound, upper bound
and partition count in order to create split queries."

Question 1: If I specify these options, is the partitioning based on segmenting the range
evenly, i.e. does each partition cover a row key range of width (upperLimit - lowerLimit)/partitionCount?
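
For reference, a minimal sketch of the even-stride splitting that Question 1 describes. It is a
simplified approximation of how a bounded numeric column can be turned into one WHERE clause per
partition, not Spark's or Phoenix's actual code. The function name jdbc_range_predicates is
hypothetical, and the sketch assumes non-negative integer bounds.

```python
def jdbc_range_predicates(column, lower, upper, num_partitions):
    """Build one WHERE-clause string per partition over an evenly split range.

    Assumption: the first and last partitions are open-ended, so the bounds
    only set the stride and do not filter rows out of the result.
    """
    if num_partitions <= 1 or lower >= upper:
        return [None]  # a single partition with no extra predicate

    # Width of each partition's range; integer division, so it may be 0
    # when the range is narrower than the partition count.
    stride = upper // num_partitions - lower // num_partitions

    predicates = []
    current = lower
    for i in range(num_partitions):
        low = f"{column} >= {current}" if i != 0 else None
        current += stride
        high = f"{column} < {current}" if i != num_partitions - 1 else None
        if low is None:
            # First partition also picks up NULLs and values below the lower bound.
            predicates.append(f"{high} OR {column} IS NULL")
        elif high is None:
            # Last partition picks up everything at or above its start.
            predicates.append(low)
        else:
            predicates.append(f"{low} AND {high}")
    return predicates

for clause in jdbc_range_predicates("col3", 0, 100, 4):
    print(clause)
```

Under these assumptions, an even split only balances the partitions when the column's values are
spread roughly uniformly between the two bounds.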

Question 2: If I do not specify any range, or the row key is not a numeric column, how is
the result partitioned using JDBC?

If I use the phoenix-spark plugin, it is mentioned that it is able to leverage the underlying
splits provided by Phoenix.
Are there any example scenarios of that? E.g., can it partition the resulting DataFrame based
on the regions of the underlying HBase table, so that Spark can take advantage of the locality
of the data?


