phoenix-user mailing list archives

From Josh Mahonin <jmaho...@gmail.com>
Subject Re: phoenix-spark and pyspark
Date Wed, 20 Jan 2016 01:59:44 GMT
Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on YARN as
well. I suppose the JAR is probably shipped by YARN, though I don't see any
logging saying it, so I'm not certain how the nuts and bolts of that work.
By explicitly setting the classpath, we're bypassing Spark's native JAR
broadcast though.

Taking a quick look at the config in Ambari (which ships the config to each
node after saving), in 'Custom spark-defaults' I have the following:

spark.driver.extraClassPath ->
/etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
spark.executor.extraClassPath ->
/usr/hdp/current/phoenix-client/phoenix-client-spark.jar

I'm not sure if the /etc/hbase/conf is necessarily needed, but I think that
gets the Ambari generated hbase-site.xml in the classpath. Each node has
the custom phoenix-client-spark.jar installed to that same path as well.
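
Since the executor classpath only works if the JAR really exists at that path on every node, a quick local check can help. The following is a minimal sketch, not anything Phoenix or Spark provides: the helper name missing_classpath_entries is made up for illustration, it uses only the Python standard library, and it would have to be run on each host separately. It does not understand wildcard entries such as "lib/*".

```python
import os


def missing_classpath_entries(classpath):
    """Return the entries of a ':'-separated classpath that do not exist locally.

    Hypothetical helper for illustration; wildcard entries ("dir/*") are not
    expanded and would be reported as missing.
    """
    entries = [e for e in classpath.split(":") if e]
    return [e for e in entries if not os.path.exists(e)]


if __name__ == "__main__":
    # Values copied from the 'Custom spark-defaults' settings quoted above.
    driver_cp = "/etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar"
    executor_cp = "/usr/hdp/current/phoenix-client/phoenix-client-spark.jar"
    for name, cp in (("driver", driver_cp), ("executor", executor_cp)):
        missing = missing_classpath_entries(cp)
        print(name, "missing:", missing if missing else "nothing")
```

An empty list for both settings on every host is what you would hope to see before restarting anything.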

I can pop into regular spark-shell and load RDDs/DataFrames using:
/usr/hdp/current/spark-client/bin/spark-shell --master yarn-client

or pyspark via:
/usr/hdp/current/spark-client/bin/pyspark

I also do this as the Ambari-created 'spark' user; I think there was some
fun HDFS permission issue otherwise.
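
On the netty VerifyError from the original message quoted below: one way to debug that sort of clash is to see which JARs in a lib directory bundle the same class. This is a rough sketch using only the Python standard library (a JAR is a ZIP archive); the function name jars_containing is invented here, and the class you search for is whatever the VerifyError names, which the thread does not show.

```python
import os
import zipfile


def jars_containing(lib_dir, class_name):
    """Return the sorted names of JARs in lib_dir that contain class_name.

    class_name is a fully qualified Java class name, e.g. "org.example.Foo".
    Hypothetical helper for illustration only.
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for name in sorted(os.listdir(lib_dir)):
        if not name.endswith(".jar"):
            continue
        path = os.path.join(lib_dir, name)
        try:
            with zipfile.ZipFile(path) as jar:
                if entry in jar.namelist():
                    hits.append(name)
        except zipfile.BadZipFile:
            # Skip files that are not readable ZIP archives.
            continue
    return hits
```

For example, jars_containing("/usr/hdp/current/hbase-client/lib", "org.example.Foo"), with the placeholder class name swapped for the real one, lists every JAR in that directory that ships the class. More than one hit suggests the classpath carries conflicting copies.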

On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <ndimiduk@apache.org> wrote:

> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
> colocated with RegionServers; all the hosts have everything. There are no
> spark workers to restart. You're sure it's not shipped by the YARN runtime?
>
> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jmahonin@gmail.com> wrote:
>
>> Sadly, it needs to be installed onto each Spark worker (for now). The
>> executor config tells each Spark worker to look for that file to add to its
>> classpath, so once you have it installed, you'll probably need to restart
>> all the Spark workers.
>>
>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>> consistently see is fine.
>>
>> One day we'll be able to have Spark ship the JAR around and use it
>> without this classpath nonsense, but we need to do some extra work on the
>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>> through Spark's weird wrapper version of it.
>>
>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <ndimiduk@apache.org>
>> wrote:
>>
>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmahonin@gmail.com>
>>> wrote:
>>>
>>>> What version of Spark are you using?
>>>>
>>>
>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and
>>> the welcome message in the pyspark console agrees.
>>>
>>>> Are there any other traces of exceptions anywhere?
>>>>
>>>
>>> No other exceptions that I can find. YARN apparently doesn't want to
>>> aggregate spark's logs.
>>>
>>>
>>>> Are all your Spark nodes set up to point to the same
>>>> phoenix-client-spark JAR?
>>>>
>>>
>>> Yes, as far as I can tell... though come to think of it, is that jar
>>> shipped by the spark runtime to workers, or is it loaded locally on each
>>> host? I only changed spark-defaults.conf on the client machine, the machine
>>> from which I submitted the job.
>>>
>>> Thanks for taking a look Josh!
>>>
>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>>>> using pyspark for now.
>>>>>
>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>> two netty versions in that lib directory...
>>>>>
>>>>> Thanks a lot,
>>>>> -n
>>>>>
>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>> [1]:
>>>>>
>>>>> java.lang.IllegalStateException: unread block data
>>>>> at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>> at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>> at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>>
>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <jamestaylor@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>
>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jmahonin@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>> for PySpark users.
>>>>>>>
>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jmahonin@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Nick,
>>>>>>>>
>>>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is there
>>>>>>>> for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>>>
>>>>>>>> df = sqlContext.read \
>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>   .load()
>>>>>>>>
>>>>>>>> And
>>>>>>>>
>>>>>>>> df.write \
>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>   .mode("overwrite") \
>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>   .save()
>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>>>>> tried this till just now. :)
>>>>>>>>
>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Heya,
>>>>>>>>>
>>>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>
>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with more
>>>>>>>>> experience in pyspark knows better? Would be a great addition to our
>>>>>>>>> documentation.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>> [0]:
>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
