phoenix-user mailing list archives

From Chinmay Kulkarni <chinmayskulka...@gmail.com>
Subject Re: Fw: Read Performance in latest code
Date Tue, 23 Jul 2019 00:41:40 GMT
Hi Manohar,

What query are you using when reading the data into a DataFrame? Can you
share the DAG for your job? Perhaps you can filter the data further to
reduce the amount being shuffled. Also, are you doing any group-by or join
operations that could lead to significant data shuffling?
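For example, here is a minimal sketch of a read that projects and filters as
early as possible (the table name, columns, and ZK quorum are made up, and
the option names can differ slightly between connector versions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("phoenix-read-example")
      .getOrCreate()

    // Read the Phoenix table through the phoenix-spark connector.
    val df = spark.read
      .format("phoenix")
      .option("table", "MY_TABLE")   // hypothetical table name
      .option("zkUrl", "zk1:2181")   // hypothetical ZK quorum
      .load()

    // Project and filter up front; simple column predicates can be pushed
    // down to Phoenix so fewer rows are scanned and later shuffled.
    val recent = df
      .select("ID", "CREATED_DATE")
      .filter("CREATED_DATE >= '2019-07-01'")

    // Inspect the physical plan to confirm the filter was pushed down
    // and to see where the shuffle (Exchange) happens.
    recent.groupBy("ID").count().explain(true)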
Another thing worth trying is tuning the GC; see this answer:
https://stackoverflow.com/questions/38981772/spark-shuffle-operation-leading-to-long-gc-pause/39111205
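If GC does turn out to be the bottleneck, a common first step is switching
the executors to G1 and enabling GC logging so the pauses become visible.
A sketch with illustrative values only (these are normally passed via
spark-submit --conf; the right numbers depend on your executor heap and
workload):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("phoenix-read-gc-tuned")
      // G1 with earlier concurrent cycles, plus GC logging (JDK 8 flags).
      .config("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 " +
        "-verbose:gc -XX:+PrintGCDetails")
      .config("spark.executor.memory", "8g")
      .getOrCreate()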

Thanks,
Chinmay

On Sun, Jul 21, 2019 at 10:19 AM manohar mc <manohar_mc@yahoo.co.in> wrote:

>
>
> ----- Forwarded message -----
> *From:* manohar mc <manohar_mc@yahoo.co.in>
> *To:* user-allow@phoenix.apache.org <user-allow@phoenix.apache.org>
> *Sent:* Friday, 19 July, 2019, 11:14:41 am IST
> *Subject:* Read Performance in latest code
>
> Hi List, I am using the latest phoenix-spark connector:
> https://github.com/apache/phoenix-connectors/tree/master/phoenix-spark.
> We initially observed issues with write performance, and after some changes
> we got the write time down from 30 minutes to under 1 minute in our test
> environment. But we are seeing that a lot of CPU time is consumed while
> reading data into a DataFrame; as the picture below shows, more than 50% of
> the CPU time is spent in ShuffleMapTask.
>
> [image: Inline image]
>
> As the picture shows, there are many recursive calls before
> DataSourceRDD.compute gets called. I wanted to understand what is happening
> here and whether there is any way to reduce the CPU time spent in
> ShuffleMapTask.
>


-- 
Chinmay Kulkarni
