phoenix-user mailing list archives

From Josh Elser <>
Subject Re: Phoenix as a source for Spark processing
Date Thu, 08 Mar 2018 23:06:32 GMT
I would guess that Hive will always be capable of outperforming what 
HBase/Phoenix can do for this type of workload (bulk transformation). 
That said, I'm not ready to tell you that you can't get the 
Phoenix-Spark integration to perform better. See the other thread where 
you provide more details.

It's important to remember that Phoenix is designed to shine when you 
have workloads which require updates to a single row/column. The 
underlying I/O system is much different in HBase compared to Hive in 
order to serve the random-update use case.

On 3/7/18 4:08 AM, Stepan Migunov wrote:
> Some more details... We have done some simple tests to compare the read/write
> performance of Spark+Hive and Spark+Phoenix, and we now have the following results:
> Copy table (with no transformations) (about 800 million records):
> Hive (TEZ) - 752 sec
> Spark:
>  From Hive to Hive: 2463 sec
>  From Phoenix to Hive - 13310 sec
>  From Hive to Phoenix - > 30240 sec
> We use Spark 2.2.1, HBase 1.1.2, Phoenix 4.13, Hive 2.1.1
> So it seems that Spark + Phoenix leads to a great performance degradation. Any thoughts?
> On 2018/03/04 11:08:56, Stepan Migunov <> wrote:
>> In our software we need to combine fast interactive access to the data with quite
>> complex data processing. I know that Phoenix is intended for fast access, but I hoped that
>> I would also be able to use Phoenix as a source for complex processing with Spark.
>> Unfortunately, Phoenix + Spark shows very poor performance. E.g., querying a big (about a
>> billion records) table with distinct takes about 2 hours. At the same time, this task with
>> a Hive source takes a few minutes. Is this expected? Does it mean that Phoenix is absolutely
>> not suitable for batch processing with Spark, and that I should duplicate the data to Hive
>> and process it with Hive?
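
To put the quoted timings on a common scale, here is a small Python sketch that turns them into approximate rows per second and a slowdown factor relative to the Hive (Tez) copy. It is only arithmetic on the numbers reported above. The row count uses the poster's rounded figure of 800 million, the "Hive to Phoenix" time was reported as "> 30240 sec" and is therefore a lower bound, and the names `timings_sec` and `summarize` are illustrative, not part of any Phoenix or Spark API.

```python
# Timings in seconds, copied from the quoted benchmark. The row count is the
# poster's approximate figure ("about 800 million rec"), so results are rough.
ROWS = 800_000_000

timings_sec = {
    "Hive (Tez)": 752,
    "Spark: Hive to Hive": 2463,
    "Spark: Phoenix to Hive": 13310,
    "Spark: Hive to Phoenix": 30240,  # reported as "> 30240": a lower bound
}


def summarize(timings, rows, baseline):
    """Return approximate rows/sec and slowdown vs. the baseline for each run."""
    base = timings[baseline]
    return {
        name: {
            "rows_per_sec": round(rows / secs),
            "slowdown": round(secs / base, 1),
        }
        for name, secs in timings.items()
    }


summary = summarize(timings_sec, ROWS, baseline="Hive (Tez)")
for name, stats in summary.items():
    print(f"{name:<24} {stats['rows_per_sec']:>10,} rows/s  {stats['slowdown']:>5}x")
```

On these figures the Spark copy from Hive to Hive is roughly 3.3 times slower than the Tez copy, the copy from Phoenix to Hive about 17.7 times slower, and the copy from Hive to Phoenix at least 40 times slower.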
