phoenix-user mailing list archives

From Josh Mahonin <jmaho...@gmail.com>
Subject Re: spark plugin with java
Date Wed, 02 Dec 2015 21:26:48 GMT
It does. Under the hood, the DataFrame/RDD makes use of the
PhoenixInputFormat, which derives the split information from the query
planner and passes it back through to Spark to use for its
parallelization.

After you have the RDD / DataFrame handle, you're also free to use Spark's
repartition() operation as needed.
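As a rough sketch of the above (Spark 1.4-era Java API; the table name,
ZooKeeper quorum, and partition count below are placeholder assumptions,
not values from this thread), loading a Phoenix table and then
repartitioning it might look like:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PhoenixRepartitionSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("phoenix-repartition");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Initial partitioning comes from the Phoenix query planner's
        // splits, surfaced through PhoenixInputFormat.
        DataFrame df = sqlContext.read()
            .format("org.apache.phoenix.spark")
            .option("table", "TABLE1")                  // placeholder table
            .option("zkUrl", "phoenix-server:2181")     // placeholder quorum
            .load();

        // Override the Phoenix-derived split count if it doesn't suit
        // the job's parallelism.
        DataFrame repartitioned = df.repartition(16);   // placeholder count
        repartitioned.show();
    }
}
```

This requires the phoenix-spark plugin jar and a running cluster, so it is
a sketch rather than a drop-in program.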

On Wed, Dec 2, 2015 at 2:56 PM, Krishna <research800@gmail.com> wrote:

> Yes, I will create new tickets for any issues that I may run into.
> Another question: For now I'm pursuing the option of creating a dataframe
> as shown in my previous email. How does spark handle parallelization in
> this case? Does it use phoenix metadata on splits?
>
>
> On Wed, Dec 2, 2015 at 11:02 AM, Josh Mahonin <jmahonin@gmail.com> wrote:
>
>> Hi Krishna,
>>
>> That's great to hear. You're right, the plugin itself should be backwards
>> compatible to Spark 1.3.1 and should be for any version of Phoenix past
>> 4.4.0, though I can't guarantee that to be the case forever. As well, I
>> don't know how much usage there is across the board of the Java API and
>> DataFrames; you may in fact be the first. If you encounter any errors
>> with it, could you please file a JIRA with any stack traces you see?
>>
>> Since Spark is a very quickly changing project, they often update
>> internal functionality that we sometimes lag behind in supporting, and as
>> a result there's no direct mapping between specific Phoenix versions and
>> specific Spark versions. Essentially, we add new support as fast as we
>> get patches.
>>
>> My general recommendation is to stay back a major version on Spark if
>> possible, but if you need to use the latest Spark releases, try to use
>> the latest Phoenix release as well. The DataFrame support in Phoenix, for
>> instance, has had many patches and improvements recently that older
>> versions are missing.
>>
>> Thanks,
>>
>> Josh
>>
>> On Wed, Dec 2, 2015 at 1:40 PM, Krishna <research800@gmail.com> wrote:
>>
>>> Yes, that works for Spark 1.4.x. The website says Spark 1.3.1+ for the
>>> Spark plugin; is that accurate?
>>>
>>> For Spark 1.3.1, I created a dataframe as follows (could not use the
>>> plugin):
>>>         Map<String, String> options = new HashMap<String, String>();
>>>         options.put("url", PhoenixRuntime.JDBC_PROTOCOL +
>>>             PhoenixRuntime.JDBC_PROTOCOL_SEPARATOR + zkQuorum);
>>>         options.put("dbtable", "TABLE_NAME");
>>>
>>>         SQLContext sqlContext = new SQLContext(sc);
>>>         DataFrame jdbcDF = sqlContext.load("jdbc",
>>>             options).filter("COL_NAME > SOME_VALUE");
>>>
>>> Also, it isn't immediately obvious which version of Spark was used in
>>> building the Phoenix artifacts available on Maven. Maybe it's worth
>>> putting it on the website. Let me know if the mapping below is incorrect.
>>>
>>> Phoenix 4.4.x <--> Spark 1.4.0
>>> Phoenix 4.5.x <--> Spark 1.5.0
>>> Phoenix 4.6.x <--> Spark 1.5.0
>>>
>>>
>>> On Tue, Dec 1, 2015 at 7:05 PM, Josh Mahonin <jmahonin@gmail.com> wrote:
>>>
>>> > Hi Krishna,
>>> >
>>> > I've not tried it in Java at all, but as of Spark 1.4+ the DataFrame
>>> > API should be unified between Scala and Java, so the following may
>>> > work for you:
>>> >
>>> > DataFrame df = sqlContext.read()
>>> >     .format("org.apache.phoenix.spark")
>>> >     .option("table", "TABLE1")
>>> >     .option("zkUrl", "<phoenix-server:2181>")
>>> >     .load();
>>> >
>>> > Note that 'zkUrl' must be set to your Phoenix URL, and passing a
>>> > 'conf' parameter isn't supported. Please let us know back here if this
>>> > works out for you; I'd love to update the documentation and unit tests
>>> > if it works.
>>> >
>>> > Josh
>>> >
>>> > On Tue, Dec 1, 2015 at 6:30 PM, Krishna <research800@gmail.com> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> Is there a working example for using the spark plugin in Java?
>>> >> Specifically, what's the Java equivalent for creating a dataframe as
>>> >> shown here in Scala:
>>> >>
>>> >> val df = sqlContext.phoenixTableAsDataFrame("TABLE1", Array("ID",
>>> >>     "COL1"), conf = configuration)
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
