phoenix-user mailing list archives

From Josh Mahonin <jmaho...@gmail.com>
Subject Re: phoenix-spark and pyspark
Date Sun, 24 Jan 2016 18:20:19 GMT
Hey Nick,

Hopefully PHOENIX-2599 works out for you, but it's entirely possible
there are some other issues that crop up. My usage patterns are probably a
bit unique, in that I generally do a few JDBC UPSERT SELECT aggregations
from my "hot" tables to intermediate ones before Spark is ever involved. That
likely hides an entire class of bugs, like the kind PHOENIX-2599 fixes.
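
A minimal sketch of that pre-aggregation step, assuming made-up table and column names (HOT_EVENTS, EVENTS_DAILY, HOST, EVENT_DAY, CNT). The helper is hypothetical and only builds the Phoenix UPSERT SELECT statement; the statement would then be run over a regular JDBC connection before Spark reads the intermediate table.

```python
def build_upsert_select(target, target_cols, source, select_exprs, group_by):
    """Build a Phoenix UPSERT ... SELECT statement that aggregates rows from
    a "hot" source table into an intermediate target table.

    All table and column names passed in are hypothetical placeholders.
    """
    if len(target_cols) != len(select_exprs):
        raise ValueError("target columns and select expressions must line up")
    return (
        "UPSERT INTO {target} ({cols}) "
        "SELECT {exprs} FROM {source} GROUP BY {group}"
    ).format(
        target=target,
        cols=", ".join(target_cols),
        exprs=", ".join(select_exprs),
        source=source,
        group=", ".join(group_by),
    )


# Hypothetical example: roll raw events up into one row per host and day.
sql = build_upsert_select(
    target="EVENTS_DAILY",
    target_cols=["HOST", "EVENT_DAY", "CNT"],
    source="HOT_EVENTS",
    select_exprs=["HOST", "EVENT_DAY", "COUNT(*)"],
    group_by=["HOST", "EVENT_DAY"],
)
print(sql)
```

Spark then only ever reads the smaller, quieter intermediate table, which is probably why this pattern sidesteps some of the problems seen against tables under active load.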

However, as you're likely aware, the Spark integration is a pretty thin
wrapper over the regular MR integration, including Pig, so hopefully
there's not too much in the way of unexplored territory there in terms of
bulk loading and saving.

Good luck!

Josh

On Sat, Jan 23, 2016 at 9:12 PM, Nick Dimiduk <ndimiduk@apache.org> wrote:

> That looks about right. I was unaware of this patch; thanks!
>
> On Fri, Jan 22, 2016 at 5:10 PM, James Taylor <jamestaylor@apache.org>
> wrote:
>
>> FYI, Nick - do you know about Josh's fix for PHOENIX-2599? Does that help
>> here?
>>
>> On Fri, Jan 22, 2016 at 4:32 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>>
>>> On Thu, Jan 21, 2016 at 7:36 AM, Josh Mahonin <jmahonin@gmail.com>
>>> wrote:
>>>
>>>> Amazing, good work.
>>>>
>>>
>>> All I did was consume your code and configure my cluster. Thanks though
>>> :)
>>>
>>>> FWIW, I've got a support case in with Hortonworks to get the
>>>> phoenix-spark integration working out of the box. Assuming it gets
>>>> resolved, that'll hopefully help keep these classpath-hell issues to a
>>>> minimum going forward.
>>>>
>>>
>>> That's fine for HWX customers, but it doesn't solve the general problem
>>> community-side. Unless, of course, we assume the only Phoenix users are on
>>> HDP.
>>>
>>>> Interesting point re: PHOENIX-2535. Spark does offer builds for specific
>>>> Hadoop versions, and also no Hadoop at all (with the assumption you'll
>>>> provide the necessary JARs). Phoenix is pretty tightly coupled with its own
>>>> HBase (and by extension, Hadoop) versions though... do you think it would be
>>>> possible to work around (2) if you locally added the HDP Maven repo and
>>>> adjusted versions accordingly? I've had some success with that in other
>>>> projects, though as I recall when I tried it with Phoenix I ran into a snag
>>>> trying to resolve some private transitive dependency of Hadoop.
>>>>
>>>
>>> I could specify Hadoop version and build Phoenix locally to remove the
>>> issue. It would work for me because I happen to be packaging my own Phoenix
>>> this week, but it doesn't help for the Apache releases. Phoenix _is_
>>> tightly coupled to HBase versions, but I don't think HBase versions are
>>> that tightly coupled to Hadoop versions. We build HBase 1.x releases
>>> against a long list of Hadoop releases as part of our usual build. I think
>>> what HBase consumes re: Hadoop APIs is pretty limited.
>>>
>>>> Now that you have Spark working you can start hitting real bugs! If you
>>>> haven't backported the full patchset, you might want to take a look at the
>>>> phoenix-spark history [1], there's been a lot of churn there, especially
>>>> with regards to the DataFrame API.
>>>>
>>>
>>> Yeah, speaking of which... :)
>>>
>>> I find this integration is basically unusable when the underlying HBase
>>> table partitions are in flux; i.e., when you have data being loaded
>>> concurrent to query. Spark RDDs assume stable partitions; multiple queries
>>> against a single DataFrame, or even a single query/DataFrame with lots of
>>> underlying splits and not enough workers, will inevitably fail
>>> with ConcurrentModificationException like [0].
>>>
>>> I think in HBase/MR integration, we define split points for the job as
>>> region boundaries initially, but then we're just using the regular API, so
>>> repartitions are handled transparently by the client layer. I need to dig
>>> into this Phoenix/Spark stuff to see if we can do anything similar here.
>>>
>>> Thanks again Josh,
>>> -n
>>>
>>> [0]:
>>>
>>> java.lang.RuntimeException: java.util.ConcurrentModificationException
>>> at org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(PhoenixInputFormat.java:125)
>>> at org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(PhoenixInputFormat.java:69)
>>> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:131)
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>>> at org.apache.phoenix.spark.PhoenixRDD.compute(PhoenixRDD.scala:57)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.util.ConcurrentModificationException
>>> at java.util.Hashtable$Enumerator.next(Hashtable.java:1367)
>>> at org.apache.hadoop.conf.Configuration.iterator(Configuration.java:2154)
>>> at org.apache.phoenix.util.PropertiesUtil.extractProperties(PropertiesUtil.java:52)
>>> at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:60)
>>> at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
>>> at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:278)
>>> at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectStatement(PhoenixConfigurationUtil.java:306)
>>> at org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(PhoenixInputFormat.java:113)
>>> ... 32 more
>>>
>>>> On Thu, Jan 21, 2016 at 12:41 AM, Nick Dimiduk <ndimiduk@apache.org>
>>>> wrote:
>>>>
>>>>> I finally got to the bottom of things. There were two issues at play
>>>>> in my particular environment.
>>>>>
>>>>> 1. An Ambari bug [0] means my spark-defaults.conf file was garbage. I
>>>>> hardly thought of it when I hit the issue with MR job submission; its
>>>>> impact on Spark was much more subtle.
>>>>>
>>>>> 2. YARN client version mismatch (Phoenix is compiled vs Apache 2.5.1
>>>>> while my cluster is running HDP's 2.7.1 build), per my earlier email. Once
>>>>> I'd worked around (1), I was able to work around (2) by setting
>>>>> "/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar"
>>>>> for both spark.driver.extraClassPath and spark.executor.extraClassPath. Per
>>>>> the linked thread above, I believe this places the correct YARN client
>>>>> version first in the classpath.
>>>>>
>>>>> With the above in place, I'm able to submit work vs the Phoenix tables
>>>>> to the YARN cluster. Success!
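
The classpath workaround described in (2) can be sketched as follows, using the HDP paths quoted in that message. The helper name is hypothetical; it only assembles the value and prints the two spark-defaults.conf entries. One relevant JVM detail: a bare directory on the classpath contributes only loose .class files and resources, while a trailing `/*` expands to the JARs inside that directory, which may be why the attempt in the quoted message below (bare directory, no wildcard) had no effect.

```python
def build_extra_classpath(entries):
    """Join classpath entries with ':'; earlier entries win on conflicts."""
    return ":".join(entries)


# YARN client JARs first, so the cluster's versions shadow the Hadoop
# classes bundled inside the Phoenix client JAR.
extra_classpath = build_extra_classpath([
    "/usr/hdp/current/hadoop-yarn-client/*",
    "/usr/hdp/current/phoenix-client/phoenix-client-spark.jar",
])

# spark-defaults.conf takes whitespace-separated "key value" lines.
for key in ("spark.driver.extraClassPath", "spark.executor.extraClassPath"):
    print(key, extra_classpath)
```

The same value goes on both the driver and the executors, so every JVM in the application resolves the YARN client classes in the same order.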
>>>>>
>>>>> Ironically enough, I would not have been able to work around (2) if we
>>>>> had PHOENIX-2535 in place. Food for thought in tackling that issue. It may
>>>>> be worthwhile to ship an uberclient jar that is entirely without Hadoop (or
>>>>> HBase) classes. I believe Spark does this for Hadoop with their builds as
>>>>> well.
>>>>>
>>>>> Thanks again for your help here Josh! I really appreciate it.
>>>>>
>>>>> [0]: https://issues.apache.org/jira/browse/AMBARI-14751
>>>>>
>>>>> On Wed, Jan 20, 2016 at 2:23 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Well, I spoke too soon. It's working, but in local mode only. When I
>>>>>> invoke `pyspark --master yarn` (or yarn-client), the submitted application
>>>>>> goes from ACCEPTED to FAILED, with a NumberFormatException [0] in my
>>>>>> container log. Now that Phoenix is on my classpath, I'm suspicious that the
>>>>>> versions of YARN client libraries are incompatible. I found an old thread
>>>>>> [1] with the same stack trace I'm seeing, similar conclusion. I tried
>>>>>> setting spark.driver.extraClassPath and spark.executor.extraClassPath
>>>>>> to /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>> but that appears to have no impact.
>>>>>>
>>>>>> [0]:
>>>>>> 16/01/20 22:03:45 ERROR yarn.ApplicationMaster: Uncaught exception:
>>>>>> java.lang.IllegalArgumentException: Invalid ContainerId: container_e07_1452901320122_0042_01_000001
>>>>>> at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
>>>>>> at org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:85)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:574)
>>>>>> at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>>>>>> at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>>>>>> at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
>>>>>> at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:572)
>>>>>> at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:599)
>>>>>> at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
>>>>>> Caused by: java.lang.NumberFormatException: For input string: "e07"
>>>>>> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>> at java.lang.Long.parseLong(Long.java:589)
>>>>>> at java.lang.Long.parseLong(Long.java:631)
>>>>>> at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
>>>>>> at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
>>>>>> ... 12 more
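
The NumberFormatException on "e07" is consistent with an older YARN client parsing a newer-style container ID: the ID in the trace carries an extra `e07` (epoch) field that a parser expecting only numeric fields cannot handle. A simplified Python sketch of the two behaviours follows. It is an illustration only, not the actual ConverterUtils code, and both function names are hypothetical.

```python
def parse_container_id_strict(container_id):
    """Mimic a parser expecting container_<clusterTs>_<appId>_<attempt>_<id>."""
    parts = container_id.split("_")
    if parts[0] != "container":
        raise ValueError("Invalid ContainerId: " + container_id)
    # Every field after the prefix is assumed numeric; "e07" breaks this.
    cluster_ts, app_id, attempt_id, cid = (int(p) for p in parts[1:5])
    return {"epoch": 0, "cluster_ts": cluster_ts, "app_id": app_id,
            "attempt_id": attempt_id, "container": cid}


def parse_container_id_epoch_aware(container_id):
    """Also accept an optional e<epoch> field right after the prefix."""
    parts = container_id.split("_")
    if parts[0] != "container":
        raise ValueError("Invalid ContainerId: " + container_id)
    fields = parts[1:]
    epoch = 0
    if fields and fields[0].startswith("e"):
        epoch = int(fields[0][1:])
        fields = fields[1:]
    cluster_ts, app_id, attempt_id, cid = (int(p) for p in fields)
    return {"epoch": epoch, "cluster_ts": cluster_ts, "app_id": app_id,
            "attempt_id": attempt_id, "container": cid}


container_id = "container_e07_1452901320122_0042_01_000001"
try:
    parse_container_id_strict(container_id)
except ValueError as err:
    print("strict parser failed:", err)
print(parse_container_id_epoch_aware(container_id))
```

If the older parser is the one that wins on the classpath, the ApplicationMaster fails before any application code runs, which matches the ACCEPTED to FAILED transition described above.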
>>>>>>
>>>>>> [1]:
>>>>>> http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAqMD1jSEvfyw9oUBymhZukm7f+WTDVZ8E6Zp3L4a9OBJ-hz=A@mail.gmail.com%3E
>>>>>>
>>>>>> On Wed, Jan 20, 2016 at 1:29 PM, Josh Mahonin <jmahonin@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> That's great to hear. Looking forward to the doc patch!
>>>>>>>
>>>>>>> On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Josh -- I deployed my updated phoenix build across the cluster,
>>>>>>>> added the phoenix-client-spark.jar to configs on the whole cluster, and
>>>>>>>> basic dataframe access is now working. Let me see about updating the docs
>>>>>>>> page to be more clear, I'll send a patch by you for review.
>>>>>>>>
>>>>>>>> Thanks a lot for the help!
>>>>>>>> -n
>>>>>>>>
>>>>>>>> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jmahonin@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark
>>>>>>>>> on YARN as well. I suppose the JAR is probably shipped by YARN, though I
>>>>>>>>> don't see any logging saying it, so I'm not certain how the nuts and bolts
>>>>>>>>> of that work. By explicitly setting the classpath, we're bypassing Spark's
>>>>>>>>> native JAR broadcast though.
>>>>>>>>>
>>>>>>>>> Taking a quick look at the config in Ambari (which ships the
>>>>>>>>> config to each node after saving), in 'Custom spark-defaults' I have the
>>>>>>>>> following:
>>>>>>>>>
>>>>>>>>> spark.driver.extraClassPath ->
>>>>>>>>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>>>>> spark.executor.extraClassPath ->
>>>>>>>>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>>>>>
>>>>>>>>> I'm not sure if the /etc/hbase/conf is necessarily needed, but I
>>>>>>>>> think that gets the Ambari generated hbase-site.xml in the classpath. Each
>>>>>>>>> node has the custom phoenix-client-spark.jar installed to that same path as
>>>>>>>>> well.
>>>>>>>>>
>>>>>>>>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>>>>>>>>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>>>>>>>>
>>>>>>>>> or pyspark via:
>>>>>>>>> /usr/hdp/current/spark-client/bin/pyspark
>>>>>>>>>
>>>>>>>>> I also do this as the Ambari-created 'spark' user, I think there
>>>>>>>>> was some fun HDFS permission issue otherwise.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I'm using Spark on YARN, not spark stand-alone. YARN NodeManagers
>>>>>>>>>> are colocated with RegionServers; all the hosts have everything. There are
>>>>>>>>>> no spark workers to restart. You're sure it's not shipped by the YARN
>>>>>>>>>> runtime?
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jmahonin@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sadly, it needs to be installed onto each Spark worker (for
>>>>>>>>>>> now). The executor config tells each Spark worker to look for that file to
>>>>>>>>>>> add to its classpath, so once you have it installed, you'll probably need
>>>>>>>>>>> to restart all the Spark workers.
>>>>>>>>>>>
>>>>>>>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>>>>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>>>>>>>> consistently see is fine.
>>>>>>>>>>>
>>>>>>>>>>> One day we'll be able to have Spark ship the JAR around and use
>>>>>>>>>>> it without this classpath nonsense, but we need to do some extra work on
>>>>>>>>>>> the Phoenix side to make sure that Phoenix's calls to DriverManager
>>>>>>>>>>> actually go through Spark's weird wrapper version of it.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmahonin@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What version of Spark are you using?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install
>>>>>>>>>>>> say, and the welcome message in the pyspark console agrees.
>>>>>>>>>>>>
>>>>>>>>>>>>> Are there any other traces of exceptions anywhere?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> No other exceptions that I can find. YARN apparently doesn't
>>>>>>>>>>>> want to aggregate spark's logs.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Are all your Spark nodes set up to point to the same
>>>>>>>>>>>>> phoenix-client-spark JAR?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, as far as I can tell... though come to think of it, is
>>>>>>>>>>>> that jar shipped by the spark runtime to workers, or is it loaded locally
>>>>>>>>>>>> on each host? I only changed spark-defaults.conf on the client machine, the
>>>>>>>>>>>> machine from which I submitted the job.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for taking a look Josh!
>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm doing my best to follow along with [0], but I'm hitting
>>>>>>>>>>>>>> some stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My
>>>>>>>>>>>>>> phoenix build is much newer, basically 4.6-branch + PHOENIX-2503,
>>>>>>>>>>>>>> PHOENIX-2568. I'm using pyspark for now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>>>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>>>>>>>>>>>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>>>>>>>>>>>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>>>>>>>>>>>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>>>>>>>>>>>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>>>>>>>>>>>>> around, this is commonly a problem with classpath. Any ideas as to
>>>>>>>>>>>>>> specifically which jars are needed? Or better still, how to debug this
>>>>>>>>>>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>>>>>>>>>>>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>>>>>>>>>>>>> two netty versions in that lib directory...
>>>>>>>>>>>>>>
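
One way to start debugging this kind of classpath conflict is to list artifacts that appear in a lib directory under more than one version. The sketch below assumes the common `<artifact>-<version>.jar` naming convention; the function name and the JAR file names in the demo are made up for illustration. It will not catch the same classes shipped under differently named artifacts.

```python
import os
import re
import tempfile
from collections import defaultdict

# Matches names such as "netty-3.2.4.Final.jar" -> ("netty", "3.2.4.Final").
JAR_PATTERN = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_artifacts(lib_dir):
    """Return {artifact: [versions]} for artifacts present in 2+ versions."""
    versions = defaultdict(set)
    for name in os.listdir(lib_dir):
        match = JAR_PATTERN.match(name)
        if match:
            versions[match.group("artifact")].add(match.group("version"))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}


# Demo against a throwaway directory populated with made-up JAR names.
with tempfile.TemporaryDirectory() as lib_dir:
    for jar in ("netty-3.2.4.Final.jar", "netty-3.6.2.Final.jar",
                "hbase-client-1.1.2.jar"):
        open(os.path.join(lib_dir, jar), "w").close()
    print(find_duplicate_artifacts(lib_dir))
```

Pointing the function at a real directory such as /usr/hdp/current/hbase-client/lib would show which artifacts need to be excluded or ordered explicitly before adding the whole directory to the classpath.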
>>>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>>>> -n
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>>>> [1]:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>>>>>>>> at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>>>>>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>>>>>>>> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>>>>>>>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>>>>>>>> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>>>>>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>>>>>>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>>>>>>>> at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>>>>>>>> at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>>>>>>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <jamestaylor@apache.org>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jmahonin@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>>>>>>>>>>> committed for 4.7.0 and the docs have been updated to include these samples
>>>>>>>>>>>>>>>> for PySpark users.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Josh
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jmahonin@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think this used to work, and will again once
>>>>>>>>>>>>>>>>> PHOENIX-2503 gets resolved. With the Spark DataFrame support, all the
>>>>>>>>>>>>>>>>> necessary glue is there for Phoenix and pyspark to play nice. With that
>>>>>>>>>>>>>>>>> client JAR (or by overriding the com.fasterxml.jackson JARS), you can do
>>>>>>>>>>>>>>>>> something like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>>>>>   .load()
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> df.write \
>>>>>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>>>>>   .mode("overwrite") \
>>>>>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>>>>>   .save()
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, this should be added to the documentation. I hadn't
>>>>>>>>>>>>>>>>> actually tried this till just now. :)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <ndimiduk@apache.org>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Heya,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Has anyone any experience using phoenix-spark integration
>>>>>>>>>>>>>>>>>> from pyspark instead of scala? Folks prefer python around here...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat
>>>>>>>>>>>>>>>>>> from pyspark, haven't tried extending it for phoenix. Maybe someone with
>>>>>>>>>>>>>>>>>> more experience in pyspark knows better? Would be a great addition to our
>>>>>>>>>>>>>>>>>> documentation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [0]:
>>>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
