phoenix-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: bulk loader MR counters
Date Fri, 03 Apr 2015 14:04:49 GMT
About the record count differences: the output values of the mapper are
KeyValues, not Phoenix rows. Each column's value is stored in a separate
KeyValue, so one input row with a single-column primary key and five other
columns will result in six output KeyValues: one KeyValue for each of the
non-primary-key columns, plus an extra KeyValue for the internal Phoenix
marker column.

An index stores a single KeyValue per row, which is why the output record
count there is the same as the input record count.
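This also explains the 39x ratio in your counters. A quick sanity check, assuming the main table has 38 non-primary-key columns (that column count is inferred from the ratio, not stated anywhere in this thread):

```python
# Phoenix bulk-load mappers emit one KeyValue per non-primary-key
# column, plus one KeyValue for the internal marker column.
input_rows = 13_637_198   # "Map input records" from the main-table job
non_pk_columns = 38       # inferred from the 39x ratio; an assumption

keyvalues_per_row = non_pk_columns + 1  # +1 for the marker KeyValue
expected_output = input_rows * keyvalues_per_row
print(expected_output)    # 531850722, matching "Map output records"

# For an index table there is one KeyValue per row, so output == input.
```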

The fact that your spill count is a multiple of your output count is
definitely something that will lead to poor(er) performance, as this means
that all data is being serialized/deserialized multiple times. The tuning
parameters that Ravi pointed out should help there.
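To make that multiple concrete, and to show one way those settings could be passed to the job: the sketch below is illustrative only. The jar name, table name, and input path are placeholders, the 500 MB value is Ravi's suggestion, and the sort-factor value of 100 is my own assumed example (Ravi only said "higher").

```python
# The map-side spill counter is exactly twice the map output record
# count, i.e. every record was written to and re-read from disk once
# more than the unavoidable final spill.
map_output_records = 531_850_722
map_side_spills = 1_063_701_444
assert map_side_spills == 2 * map_output_records

# One way to apply the tuning: pass the properties as generic -D
# options when launching CsvBulkLoadTool (placeholder jar/table/path).
overrides = {
    "mapreduce.task.io.sort.mb": "500",      # sort buffer; default 100 MB
    "mapreduce.task.io.sort.factor": "100",  # streams merged at once; assumed value
}
d_args = [f"-D{k}={v}" for k, v in overrides.items()]
cmd = ["hadoop", "jar", "phoenix-client.jar",
       "org.apache.phoenix.mapreduce.CsvBulkLoadTool",
       *d_args,
       "--table", "MY_TABLE", "--input", "/path/to/input.csv"]
print(" ".join(cmd))
```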

About your performance degradation, could you give some more details on the
performance differences you're seeing (i.e. what is the difference in
throughput now compared to before)? Which version of Phoenix were you using
before? Are you running with exactly the same data and table definitions as
before, or have those changed by any chance?

- Gabriel


On Thu, Apr 2, 2015 at 11:43 PM Perko, Ralph J <Ralph.Perko@pnnl.gov> wrote:

>  Thanks - I will try your suggestion.  Do you know why there are so many
> more output records than input records on the main table (39x more)?
>
>
>
> *From:* Ravi Kiran [mailto:maghamravikiran@gmail.com]
> *Sent:* Thursday, April 02, 2015 2:35 PM
> *To:* user@phoenix.apache.org
> *Subject:* Re: bulk loader MR counters
>
>
>
> Hi Ralph.
>
>     I assume that when you run the MR job for the main table, you have a
> larger number of columns to load than in the MR job for the index table,
> which is why you see more spilled records.
>
> To tune the MR job for the main table, I would do the following first and
> then measure the counters to see if there is any improvement.
>
> a) To avoid the spilled records during the MR for the main table, I would
> recommend increasing *mapreduce.task.io.sort.mb* to a value like 500 MB
> rather than the default 100 MB.
>
> b) Increase *mapreduce.task.io.sort.factor* to merge a higher number of
> streams at once when sorting map output.
>
>  Regards
>
> Ravi
>
>
>
>
>
> *From:* Perko, Ralph J
> *Sent:* Thursday, April 02, 2015 2:36 PM
> *To:* user@phoenix.apache.org
> *Subject:* RE: bulk loader MR counters
>
>
>
> My apologies, the formatting did not come out as planned.  Here is another
> go:
>
>
>
> Hi, we recently upgraded our cluster (Phoenix 4.3 – HDP 2.2) and I’m
> seeing a significant degradation in performance.  I am going through the MR
> counters for a Phoenix CsvBulkLoad job and I am hoping you can help me
> understand some things.
>
>
>
> There is a base table with 4 index tables, so a total of 5 MR jobs run –
> one for each table.
>
>
>
> Here are the counters for an index table MR job:
>
>
>
> Note two things – the input and output record counts are the same, as
> expected.
>
> There seem to be a lot of spilled records.
>
> ===========================================================
>
> Category,Map,Reduce,Total
>
> Combine input records,0,0,0
>
> Combine output records,0,0,0
>
> CPU time spent (ms),1800380,156630,1957010
>
> Failed Shuffles,0,0,0
>
> GC time elapsed (ms),39738,1923,41661
>
> Input split bytes,690,0,690
>
> Map input records,*13637198*,0,13637198
>
> Map output bytes,2144112474,0,2144112474
>
> Map output materialized bytes,2171387170,0,2171387170
>
> Map output records,*13637198*,0,13637198
>
> Merged Map outputs,0,50,50
>
> Physical memory (bytes) snapshot,8493744128,10708692992,19202437120
>
> Reduce input groups,0,13637198,13637198
>
> Reduce input records,0,13637198,13637198
>
> Reduce output records,0,13637198,13637198
>
> Reduce shuffle bytes,0,2171387170,2171387170
>
> Shuffled Maps,0,50,50
>
> Spilled Records,*13637198*,13637198,27274396
>
> Total committed heap usage (bytes),11780751360,26862419968,38643171328
>
> Virtual memory (bytes) snapshot,25903271936,96590065664,122493337600
>
>
>
> Here are the counters for the main table MR job
>
> Please note the input record count is correct – same as above.
>
> The output record count is many times the input record count.
>
> The output bytes are many times the output bytes from above.
>
> The number of spilled records is many times the number of input records,
> and twice the number of output records.
>
> ===========================================================
>
> Category,Map,Reduce,Total
>
> Combine input records,0,0,0
>
> Combine output records,0,0,0
>
> CPU time spent (ms),5059340,2035910,7095250
>
> Failed Shuffles,0,0,0
>
> GC time elapsed (ms),38937,13748,52685
>
> Input split bytes,690,0,690
>
> Map input records,*13637198*,0,13637198
>
> Map output bytes,59638106406,0,59638106406
>
> Map output materialized bytes,60702718624,0,60702718624
>
> Map output records,*531850722*,0,531850722
>
> Merged Map outputs,0,50,50
>
> Physical memory (bytes) snapshot,8398745600,2756530176,11155275776
>
> Reduce input groups,0,13637198,13637198
>
> Reduce input records,0,531850722,531850722
>
> Reduce output records,0,531850722,531850722
>
> Reduce shuffle bytes,0,60702718624,60702718624
>
> Shuffled Maps,0,50,50
>
> Spilled Records,*1063701444*,531850722,1595552166
>
> Total committed heap usage (bytes),10136059904,19488309248,29624369152
>
> Virtual memory (bytes) snapshot,25926946816,96562970624,122489917440
>
>
>
>
>
> Is the large number of output records relative to input records normal?
>
> Is the large number of spilled records normal?
>
>
>
> Thanks for your help,
>
> Ralph
>
