phoenix-user mailing list archives

From "Long, Xindian" <Xindian.L...@sensus.com>
Subject RE: phoenix spark options not supporting query in dbtable
Date Thu, 09 Jun 2016 17:02:17 GMT
Hi, Josh:

Thanks for the answer. Do you know the underlying difference between the following two ways
of loading a DataFrame (using the Data Source API, or loading as a DataFrame directly using
a Configuration object)?

Is there a Java interface for the functionality of phoenixTableAsDataFrame and saveToPhoenix?

Thanks

Xindian

Load as a DataFrame using the Data Source API
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

val df = sqlContext.load(
  "org.apache.phoenix.spark",
  Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181")
)

df
  .filter(df("COL1") === "test_row_1" && df("ID") === 1L)
  .select(df("ID"))
  .show
Or load as a DataFrame directly using a Configuration object
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._

val configuration = new Configuration()
// Can set Phoenix-specific settings, requires 'hbase.zookeeper.quorum'

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

// Load the columns 'ID' and 'COL1' from TABLE1 as a DataFrame
val df = sqlContext.phoenixTableAsDataFrame(
  "TABLE1", Array("ID", "COL1"), conf = configuration
)

df.show



From: Josh Mahonin [mailto:jmahonin@gmail.com]
Sent: June 9, 2016 9:44
To: user@phoenix.apache.org
Subject: Re: phoenix spark options not supporting query in dbtable

Hi Xindian,

The phoenix-spark integration is based on the Phoenix MapReduce layer, which doesn't support
aggregate functions. However, as you mentioned, both filtering and pruning predicates are
pushed down to Phoenix. With an RDD or DataFrame loaded, all of Spark's various aggregation
methods are available to you.
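
For example, a rough sketch using the TABLE1 / COL1 names from your message (the groupBy/count
below runs in Spark, not in Phoenix):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

// Column pruning and the filter below are pushed down to Phoenix.
val df = sqlContext.load(
  "org.apache.phoenix.spark",
  Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181")
)

// The aggregation itself is executed by Spark, not by Phoenix.
df.filter(df("COL1") === "test_row_1")
  .groupBy(df("COL1"))
  .count()
  .show()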

Although the Spark JDBC data source supports the full complement of Phoenix's supported queries,
the way it achieves parallelism is to split the query across a number of workers and connections
based on a 'partitionColumn' with a 'lowerBound' and 'upperBound', which must be numeric.
If your use case has numeric primary keys, then that is potentially a good solution for you.
[1]
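
As an illustration only (the table, column, bounds and numPartitions below are made-up values to
show the shape of the options; check the Phoenix JDBC URL and driver class against your
environment):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

// Spark JDBC data source against Phoenix: parallelism comes from splitting the
// numeric partitionColumn between lowerBound and upperBound across numPartitions
// connections.
val jdbcDf = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:phoenix:phoenix-server:2181",
  "driver" -> "org.apache.phoenix.jdbc.PhoenixDriver",
  "dbtable" -> "TABLE1",
  "partitionColumn" -> "ID",
  "lowerBound" -> "0",
  "upperBound" -> "1000000",
  "numPartitions" -> "10"
)).load()

jdbcDf.show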

The phoenix-spark parallelism is based on the splits provided by the Phoenix query planner,
and has no requirements on specifying partition columns or upper/lower bounds. It's up to
you to evaluate which technique is the right method for your use case. [2]
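
Concretely, with a DataFrame loaded exactly as in the Data Source API example above (no partition
column or bounds supplied), the split-derived parallelism can be inspected directly; the actual
count depends on your table's regions and guideposts:

// df loaded via sqlContext.load("org.apache.phoenix.spark", ...) as above;
// one Spark partition per split from the Phoenix query planner.
println(df.rdd.partitions.length)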

Good luck,

Josh
[1] http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
[2] https://phoenix.apache.org/phoenix_spark.html


On Wed, Jun 8, 2016 at 6:01 PM, Long, Xindian <Xindian.Long@sensus.com> wrote:
The Spark JDBC data source supports specifying a query as the "dbtable" option.
I assume a query specified that way is pushed down to the database instead of being executed in
Spark.
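
For reference, the kind of usage I mean looks roughly like this (a rough sketch with made-up
table and column names; whether Phoenix's JDBC driver accepts this particular derived-table form
would need to be verified):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

// A whole query (not just a table name) passed as the "dbtable" option of the
// Spark JDBC data source; the subquery is evaluated by the database.
val df = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:phoenix:phoenix-server:2181",
  "driver" -> "org.apache.phoenix.jdbc.PhoenixDriver",
  "dbtable" -> "(SELECT COL1, COUNT(*) AS CNT FROM TABLE1 GROUP BY COL1) AS agg"
)).load()

df.show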

The phoenix-spark plug-in does not seem to support that. Why is that? Is there any plan to
support it in the future?

I know phoenix-spark does support an optional select clause and predicate push-down in some
cases, but it is limited.

Thanks

Xindian


-------------------------------------------
Xindian “Shindian” Long
Mobile: 919-9168651
Email: Xindian.Long@gmail.com



