phoenix-user mailing list archives

From: Josh Elser <els...@apache.org>
Subject: Re: Phoenix as a source for Spark processing
Date: Mon, 05 Mar 2018 18:14:01 GMT
Hi Stepan,

Can you give a better ballpark of the Phoenix-Spark performance you've seen 
(e.g. how much hardware do you have, how many Spark executors did you use, 
how many RegionServers)? Also, what software versions are you using?

I don't think there are any firm guidelines on how to solve this 
problem, but you've already found the tools available to you:

* You can try Phoenix+Spark to run queries over the Phoenix tables in place (see the sketch below)
* You can use Phoenix+Hive to offload the data into Hive for queries
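
To make the first option concrete, here's a minimal sketch (in 
spark-shell, so "spark" is already defined) of reading a Phoenix table 
with the phoenix-spark connector and running the kind of DISTINCT job 
you described. The table name, column, and ZooKeeper quorum are just 
placeholders for your environment:

   // Load a Phoenix table as a DataFrame via the phoenix-spark connector.
   // "MY_TABLE" and the zkUrl below are placeholders -- substitute your own.
   val df = spark.read
     .format("org.apache.phoenix.spark")
     .option("table", "MY_TABLE")
     .option("zkUrl", "zkhost:2181")
     .load()

   // A distinct over one column: Spark does the distinct,
   // Phoenix does the (parallelized) scan underneath.
   df.select("SOME_COLUMN").distinct().count()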

If Phoenix-Spark wasn't fast enough, I'd expect the Phoenix-Hive 
integration to be similarly slow for querying the data.

It's possible that the bottleneck is something we could fix in the 
integration itself, or in the configuration of Spark and/or Phoenix. We'd 
need your help to quantify this better :)
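
For example, one quick data point from the Spark side: how many 
partitions does the scan produce, relative to your number of regions? 
If I remember right, the connector creates one Spark partition per 
Phoenix input split, so a very low number here would already point at 
insufficient parallelism. Using the df from the sketch above:

   // How parallel is the Phoenix scan on the Spark side?
   df.rdd.getNumPartitions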

On 3/4/18 6:08 AM, Stepan Migunov wrote:
> In our software we need to combine fast interactive access to the data with quite complex
> data processing. I know that Phoenix is intended for fast access, but I hoped I could also
> use Phoenix as a source for complex processing with Spark. Unfortunately, Phoenix + Spark
> shows very poor performance. E.g., querying a big table (about a billion records) with a
> DISTINCT takes about 2 hours. At the same time, the same task with a Hive source takes a
> few minutes. Is this expected? Does it mean that Phoenix is absolutely not suitable for
> batch processing with Spark, and that I should duplicate the data to Hive and process it there?
> 
