phoenix-user mailing list archives

From Nick Dimiduk <ndimi...@apache.org>
Subject Re: phoenix-spark and pyspark
Date Thu, 21 Jan 2016 05:41:43 GMT
I finally got to the bottom of things. There were two issues at play in my
particular environment.

1. An Ambari bug [0] meant my spark-defaults.conf file was garbage. I hardly
thought about it when I hit the issue with MR job submission; its impact on
Spark was much more subtle.

2. YARN client version mismatch (Phoenix is compiled against Apache Hadoop 2.5.1
while my cluster is running HDP's 2.7.1 build), per my earlier email. Once I'd
worked around (1), I was able to work around (2) by setting
"/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar"
for both spark.driver.extraClassPath and spark.executor.extraClassPath. Per
the thread linked in my earlier message ([1] below), I believe this places the
correct YARN client version first in the classpath.
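
For reference, a minimal sketch of what that looks like in spark-defaults.conf
(the paths are from my HDP layout; adjust for your own install):

  spark.driver.extraClassPath   /usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
  spark.executor.extraClassPath /usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar

The hadoop-yarn-client wildcard deliberately comes first so the cluster's YARN
classes win over whatever is bundled in the Phoenix client jar.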

With the above in place, I'm able to submit work against the Phoenix tables to
the YARN cluster. Success!

Ironically enough, I would not have been able to work around (2) if we
had PHOENIX-2535 in place. Food for thought in tackling that issue. It may
be worthwhile to ship an uberclient jar that is entirely free of Hadoop (or
HBase) classes. I believe Spark does this for Hadoop in their builds as
well.

Thanks again for your help here Josh! I really appreciate it.

[0]: https://issues.apache.org/jira/browse/AMBARI-14751

On Wed, Jan 20, 2016 at 2:23 PM, Nick Dimiduk <ndimiduk@apache.org> wrote:

> Well, I spoke too soon. It's working, but in local mode only. When I
> invoke `pyspark --master yarn` (or yarn-client), the submitted application
> goes from ACCEPTED to FAILED, with a NumberFormatException [0] in my
> container log. Now that Phoenix is on my classpath, I'm suspicious that the
> versions of YARN client libraries are incompatible. I found an old thread
> [1] with the same stack trace I'm seeing and a similar conclusion. I tried
> setting spark.driver.extraClassPath and spark.executor.extraClassPath
> to /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
> but that appears to have no impact.
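>
> For reference, a sketch of the kind of invocation I'm testing with (the same
> settings can equivalently live in spark-defaults.conf; the paths are from my
> HDP install):
>
>   pyspark --master yarn-client \
>     --conf spark.driver.extraClassPath=/usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar \
>     --conf spark.executor.extraClassPath=/usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar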
>
> [0]:
> 16/01/20 22:03:45 ERROR yarn.ApplicationMaster: Uncaught exception:
> java.lang.IllegalArgumentException: Invalid ContainerId:
> container_e07_1452901320122_0042_01_000001
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
> at
> org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:85)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:574)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:572)
> at
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:599)
> at
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> Caused by: java.lang.NumberFormatException: For input string: "e07"
> at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Long.parseLong(Long.java:589)
> at java.lang.Long.parseLong(Long.java:631)
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
> at
> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
> ... 12 more
>
> [1]:
> http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAqMD1jSEvfyw9oUBymhZukm7f+WTDVZ8E6Zp3L4a9OBJ-hz=A@mail.gmail.com%3E
>
> On Wed, Jan 20, 2016 at 1:29 PM, Josh Mahonin <jmahonin@gmail.com> wrote:
>
>> That's great to hear. Looking forward to the doc patch!
>>
>> On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <ndimiduk@apache.org>
>> wrote:
>>
>>> Josh -- I deployed my updated phoenix build across the cluster, added
>>> the phoenix-client-spark.jar to configs on the whole cluster, and basic
>>> dataframe access is now working. Let me see about updating the docs page to
>>> be clearer; I'll send a patch by you for review.
>>>
>>> Thanks a lot for the help!
>>> -n
>>>
>>> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jmahonin@gmail.com>
>>> wrote:
>>>
>>>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark on
>>>> YARN as well. I suppose the JAR is probably shipped by YARN, though I don't
>>>> see any logging that says so, so I'm not certain how the nuts and bolts of
>>>> that work. By explicitly setting the classpath, we're bypassing Spark's
>>>> native JAR broadcast, though.
>>>>
>>>> Taking a quick look at the config in Ambari (which ships the config to
>>>> each node after saving), in 'Custom spark-defaults' I have the following:
>>>>
>>>> spark.driver.extraClassPath ->
>>>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>> spark.executor.extraClassPath ->
>>>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>
>>>> I'm not sure whether /etc/hbase/conf is strictly needed, but I think
>>>> that gets the Ambari-generated hbase-site.xml onto the classpath. Each node
>>>> has the custom phoenix-client-spark.jar installed at that same path as well.
>>>>
>>>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>>>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>>>
>>>> or pyspark via:
>>>> /usr/hdp/current/spark-client/bin/pyspark
>>>>
>>>> I also do this as the Ambari-created 'spark' user; I think there was
>>>> some fun HDFS permission issue otherwise.
>>>>
>>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>> wrote:
>>>>
>>>>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers are
>>>>> colocated with RegionServers; all the hosts have everything. There are no
>>>>> spark workers to restart. You're sure it's not shipped by the YARN runtime?
>>>>>
>>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jmahonin@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Sadly, it needs to be installed onto each Spark worker (for now). The
>>>>>> executor config tells each Spark worker to look for that file to add to its
>>>>>> classpath, so once you have it installed, you'll probably need to restart
>>>>>> all the Spark workers.
>>>>>>
>>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>>> consistently see is fine.
>>>>>>
>>>>>> One day we'll be able to have Spark ship the JAR around and use it
>>>>>> without this classpath nonsense, but we need to do some extra work on the
>>>>>> Phoenix side to make sure that Phoenix's calls to DriverManager actually go
>>>>>> through Spark's weird wrapper version of it.
>>>>>>
>>>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmahonin@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> What version of Spark are you using?
>>>>>>>>
>>>>>>>
>>>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install say,
>>>>>>> and the welcome message in the pyspark console agrees.
>>>>>>>
>>>>>>> Are there any other traces of exceptions anywhere?
>>>>>>>>
>>>>>>>
>>>>>>> No other exceptions that I can find. YARN apparently doesn't want to
>>>>>>> aggregate spark's logs.
>>>>>>>
>>>>>>>
>>>>>>>> Are all your Spark nodes set up to point to the same
>>>>>>>> phoenix-client-spark JAR?
>>>>>>>>
>>>>>>>
>>>>>>> Yes, as far as I can tell... though come to think of it, is that jar
>>>>>>> shipped by the spark runtime to workers, or is it loaded locally on each
>>>>>>> host? I only changed spark-defaults.conf on the client machine, the machine
>>>>>>> from which I submitted the job.
>>>>>>>
>>>>>>> Thanks for taking a look Josh!
>>>>>>>
>>>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> I'm doing my best to follow along with [0], but I'm hitting some
>>>>>>>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>>>>>>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>>>>>>>> using pyspark for now.
>>>>>>>>>
>>>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>>>>>> two netty versions in that lib directory...
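>>>>>>>>>
>>>>>>>>> To make the failing step concrete, here's a rough sketch of what I'm
>>>>>>>>> running (the table name and ZK quorum are placeholders):
>>>>>>>>>
>>>>>>>>> df = sqlContext.read \
>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>   .option("table", "MY_TABLE") \
>>>>>>>>>   .option("zkUrl", "zk-host:2181") \
>>>>>>>>>   .load()
>>>>>>>>> df.printSchema()  # schema comes back as expected
>>>>>>>>> df.take(5)        # this is the call that dies with "unread block data"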
>>>>>>>>>
>>>>>>>>> Thanks a lot,
>>>>>>>>> -n
>>>>>>>>>
>>>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>>>> [1]:
>>>>>>>>>
>>>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>>> at
>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>>> at
>>>>>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>>> at
>>>>>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>>> at
>>>>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>>> at
>>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>>> at
>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <jamestaylor@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jmahonin@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>>>>>> for PySpark users.
>>>>>>>>>>>
>>>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>
>>>>>>>>>>> Josh
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jmahonin@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>>
>>>>>>>>>>>> I think this used to work, and will again once PHOENIX-2503
>>>>>>>>>>>> gets resolved. With the Spark DataFrame support, all the necessary glue is
>>>>>>>>>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>>>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>>>>>>>
>>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>   .load()
>>>>>>>>>>>>
>>>>>>>>>>>> And
>>>>>>>>>>>>
>>>>>>>>>>>> df.write \
>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>   .mode("overwrite") \
>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>   .save()
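>>>>>>>>>>>>
>>>>>>>>>>>> The result of the read is an ordinary DataFrame, so the usual
>>>>>>>>>>>> operations work from there, for example (the column name here is just
>>>>>>>>>>>> a placeholder):
>>>>>>>>>>>>
>>>>>>>>>>>> df.filter(df.ID > 100).show()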
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, this should be added to the documentation. I hadn't
>>>>>>>>>>>> actually tried this till just now. :)
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Heya,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>>>>
>>>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat from
>>>>>>>>>>>>> pyspark, haven't tried extending it for phoenix. Maybe someone with more
>>>>>>>>>>>>> experience in pyspark knows better? Would be a great addition to our
>>>>>>>>>>>>> documentation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>
>>>>>>>>>>>>> [0]:
>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
