phoenix-user mailing list archives

From talluri abhishek <abhishektall...@gmail.com>
Subject Re: Growth in table size and performance degradation on read-queries
Date Tue, 29 Jan 2019 22:13:58 GMT
Hi Vincent,

VERSIONS is set to 1 and KEEP_DELETED_CELLS is false. These are the
default settings; nothing has been changed.

Running describe on the HBase table gives the following:

VERSIONS => '1', MIN_VERSIONS => '0', TTL => 'FOREVER',
KEEP_DELETED_CELLS => 'FALSE'
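
To make explicit what these settings imply for a table that mostly receives updates, here is a minimal toy model in Python. It is a sketch only, not HBase or Phoenix code: the function name major_compact, the row keys, and the cell values are all hypothetical. It assumes that every update is written as a new cell, and that with VERSIONS => '1' only the newest cell per row and column survives a compaction of the files involved.

```python
from collections import defaultdict

def major_compact(cells, max_versions=1):
    """Toy model: keep only the newest `max_versions` cells per (row, column).

    `cells` is a list of (row, column, timestamp, value) tuples. This is an
    illustration of the retention rule, not how HBase is implemented.
    """
    by_key = defaultdict(list)
    for row, column, timestamp, value in cells:
        by_key[(row, column)].append((timestamp, value))

    kept = []
    for (row, column), versions in sorted(by_key.items()):
        versions.sort(reverse=True)  # newest timestamp first
        for timestamp, value in versions[:max_versions]:
            kept.append((row, column, timestamp, value))
    return kept

# Hypothetical data: day 1 writes two rows, days 2 and 3 update the same rows.
cells = [
    ("acct1-0001", "amount", 1, "10"),
    ("acct2-0001", "amount", 1, "20"),
    ("acct1-0001", "amount", 2, "11"),
    ("acct2-0001", "amount", 2, "21"),
    ("acct1-0001", "amount", 3, "12"),
    ("acct2-0001", "amount", 3, "22"),
]

compacted = major_compact(cells, max_versions=1)
print(len(cells), "cells before;", len(compacted), "cells after")
```

In this toy case three days of updates triple the number of stored cells until they are compacted away. Whether that is what is actually happening on our table is an assumption, not something the describe output confirms.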


Thanks,
Abhishek

On Tue, Jan 29, 2019 at 3:20 PM Vincent Poon <vincentpoon@apache.org> wrote:

> is your max_versions set to 1 ?  keep_deleted_cells?
>
> On Tue, Jan 29, 2019 at 10:41 AM talluri abhishek <
> abhishektalluri@gmail.com> wrote:
>
>> Hi All,
>>
>> We are seeing a couple of issues on some of our Phoenix tables: the
>> size of a table grows 2-3 times after around 2-3 days of ingestion,
>> and read performance takes a big hit after that. If we then insert
>> overwrite the data in that table into a new copy table, the data size
>> returns to normal and queries run fast on the copy table.
>>
>> Initial table size after 1st day ~ 5G
>> After 2 days of ingestion ~ 15G
>> Re-write into a copy table ~ 5-6 G
>>
>> Query performance becomes proportional to the size of the table. For
>> example, a query that took 40 secs on the original table after the first
>> day takes around 130-160 secs after 2 days of ingestion. The same query
>> run on the copy table finishes in around 40 secs.
>>
>> Most of the data ingested after the first day consists of updates to
>> existing rows, so we thought major compaction should solve the size
>> issue, but it does not shrink the size every time (the load runs in
>> parallel while the compaction is running).
>> Write performance is always good, and we have used salt buckets to even
>> out the writes. The primary key is a 12-bit string formed by
>> concatenating an account id and an auto-generated transaction number.
>>
>> One query whose performance suffers as described above is:
>> *select (list of 50-70 columns) from original_table where account_id IN
>> (list of 100k account ids) *[account_id in this query is the primary key
>> on that table]
>>
>> We are currently increasing the heap space on these region servers to
>> provide more memstore size, which could reduce the number of flushes for
>> the upserted data.
>>
>> Could there be any reason for the increase in the size of the table
>> other than the updated rows? How could we improve the performance of
>> those read queries?
>>
>> Thanks,
>> Abhishek
>>
>
