phoenix-user mailing list archives

From Josh Mahonin <>
Subject Re: bulk-upsert spark phoenix
Date Wed, 28 Sep 2016 15:10:19 GMT
Hi Antonio,

You're correct, the phoenix-spark output uses the Phoenix Hadoop
OutputFormat under the hood, which effectively does a parallel, batch JDBC
upsert. It should scale depending on the number of Spark executors,
RDD/DataFrame parallelism, and number of HBase RegionServers, though
admittedly there's a lot of overhead involved.
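For reference, the save path Josh describes looks roughly like the following sketch, assuming phoenix-spark on the classpath and a running cluster; the table name, column list, and ZooKeeper quorum are placeholders:

```scala
import org.apache.spark.SparkContext
import org.apache.phoenix.spark._ // adds saveToPhoenix to RDDs of tuples

val sc = new SparkContext("local", "phoenix-upsert-sketch")

// Each tuple becomes one row; the save goes through the Phoenix
// OutputFormat, i.e. parallel, batched JDBC upserts under the hood.
val data = sc.parallelize(Seq((1L, "foo"), (2L, "bar")))

// "OUTPUT_TABLE", the column names, and the zkUrl are hypothetical.
data.saveToPhoenix(
  "OUTPUT_TABLE",
  Seq("ID", "COL1"),
  zkUrl = Some("phoenix-server:2181")
)
```

Parallelism here follows the RDD's partitioning, which is why executor count and partition count matter for throughput.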

The CSV bulk loading tool uses MapReduce; it's not integrated with Spark.
It's likely possible to do so, but it's probably a non-trivial amount of
work. If you're interested in taking it on, I'd start by looking at the
following classes:
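In the meantime, a hedged workaround is to invoke the existing MapReduce tool programmatically via Hadoop's ToolRunner rather than shelling out; all argument values below are placeholders, and this still launches an MR job:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import org.apache.phoenix.mapreduce.CsvBulkLoadTool

// Runs the same CSV bulk load as the command-line tool, but from code.
val conf = new Configuration()
val exitCode = ToolRunner.run(conf, new CsvBulkLoadTool(), Array(
  "--table", "OUTPUT_TABLE",    // target Phoenix table (placeholder)
  "--input", "/tmp/input.csv",  // HDFS path to the CSV (placeholder)
  "--zookeeper", "zk-host:2181" // ZK quorum (placeholder)
))
// Non-zero exit code means the bulk load failed.
```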

Good luck,


On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia <> wrote:

> Hi,
> I would like to perform a bulk insert into HBase using Apache Phoenix from
> Spark. I tried the Apache Spark Phoenix library but, as far as I could
> tell from the code, it performs a JDBC batch of upserts (am I right?).
> Instead, I want to perform a bulk load like the one described in this
> blog post ( but taking advantage of
> the automatic conversion between Java/Scala types and bytes.
> I'm currently using Phoenix 4.5.2, so I cannot use Hive to
> manipulate the Phoenix table, and if possible I want to avoid
> spawning an MR job that reads data from CSV
> ( Actually, I just want to
> do what the CSV loader does with MR, but programmatically with Spark
> (since the data I want to persist is already loaded in memory).
> Thank you all!
