phoenix-user mailing list archives

From Antonio Murgia
Subject Re: bulk-upsert spark phoenix
Date Mon, 17 Oct 2016 16:07:51 GMT
Hi Josh,

Thanks for your reply. I'm trying to implement a bulk save to Phoenix
with Apache Spark, and the code you linked helped me a lot. I'm now
facing an issue with composite primary keys: I cannot find where in
the Phoenix code the row key is built from the individual primary key
columns. Can someone point me to the piece of code inside Phoenix
that does that?
Thank you in advance.
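
To make the question concrete, here is a minimal sketch of how I understand a composite row key to be assembled: the encoded primary key columns are concatenated in declaration order, fixed-width values are written as-is, and a variable-length value that is not the last column is terminated by a zero byte. The table layout (PRIMARY KEY (tenant VARCHAR, id INTEGER, name VARCHAR)) and all names below are hypothetical, and the sketch assumes ascending sort order, no salting and no null values. It is not Phoenix's own code, only an illustration of the layout.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from the Phoenix code base.
final class RowKeySketch {
    // Assumption: a zero byte terminates variable-length key parts.
    static final byte SEPARATOR = 0;

    // 4-byte big-endian int with the sign bit flipped, so that comparing
    // the bytes as unsigned values matches signed numeric order.
    static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(4).putInt(v ^ 0x80000000).array();
    }

    // Variable-length string encoded as its UTF-8 bytes.
    static byte[] encodeVarchar(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Row key for the hypothetical PRIMARY KEY (tenant VARCHAR, id INTEGER, name VARCHAR).
    static byte[] rowKey(String tenant, int id, String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] t = encodeVarchar(tenant);
        out.write(t, 0, t.length);
        out.write(SEPARATOR);        // variable-length and not last: terminate
        byte[] i = encodeInt(id);
        out.write(i, 0, i.length);   // fixed width: no separator needed
        byte[] n = encodeVarchar(name);
        out.write(n, 0, n.length);   // last column: no trailing separator
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : rowKey("a", 1, "b")) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim());
    }
}
```

What I am looking for is the place in Phoenix that does the equivalent of rowKey(...) for an arbitrary table definition, so that I can reuse it instead of re-implementing the encoding.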


On 09/28/2016 05:10 PM, Josh Mahonin wrote:
> Hi Antonio,
> You're correct, the phoenix-spark output uses the Phoenix Hadoop 
> OutputFormat under the hood, which effectively does a parallel, batch 
> JDBC upsert. It should scale depending on the number of Spark 
> executors, RDD/DataFrame parallelism, and number of HBase 
> RegionServers, though admittedly there's a lot of overhead involved.
> The CSV bulk loading tool uses MapReduce; it's not integrated with 
> Spark. It's likely possible to do so, but it's probably a non-trivial 
> amount of work. If you're interested in taking it on, I'd start with 
> looking at the following classes:
> Good luck,
> Josh
> On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia wrote:
>     Hi,
>     I would like to perform a bulk insert into HBase using Apache
>     Phoenix from Spark. I tried the Apache Spark Phoenix library but,
>     as far as I could understand from the code, it performs a JDBC
>     batch of upserts (am I right?). Instead, I want to perform a bulk
>     load like the one described in this blog post, while taking
>     advantage of the automatic conversion from Java/Scala types to
>     bytes.
>     I'm currently using Phoenix 4.5.2, so I cannot use Hive to
>     manipulate the Phoenix table, and if possible I want to avoid
>     spawning an MR job that reads data from CSV. I just want to do
>     what the CSV loader does with MR, but programmatically with Spark
>     (since the data I want to persist is already loaded in memory).
>     Thank you all!
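
For reference, the "parallel, batch JDBC upsert" discussed in this thread can be pictured, for a single partition, roughly as in the sketch below. It uses only the standard java.sql API. The class name, method names, table, columns and batch size are hypothetical, and this is not the phoenix-spark implementation, only an illustration of the pattern under the assumption that upserts are buffered on the client until commit when auto-commit is off.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not the phoenix-spark code.
final class BatchUpsertSketch {
    // Builds a parameterized statement such as
    // "UPSERT INTO T (A, B) VALUES (?, ?)".
    static String upsertSql(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "UPSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    // Writes the rows of one partition, committing every batchSize rows.
    static void writePartition(Connection conn, String table, List<String> columns,
                               Iterable<Object[]> rows, int batchSize) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(upsertSql(table, columns))) {
            int pending = 0;
            for (Object[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setObject(i + 1, row[i]);   // JDBC parameters are 1-based
                }
                ps.executeUpdate();
                if (++pending % batchSize == 0) {
                    conn.commit();                 // flush the current batch
                }
            }
            conn.commit();                         // flush whatever is left
        }
    }
}
```

Running one such writer per Spark partition gives the parallelism mentioned above; every row still goes through the JDBC layer, which is the overhead a direct bulk load would avoid.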
