phoenix-user mailing list archives

From Shawn Li <shawnli...@gmail.com>
Subject Re: "upsert select" with "limit" clause
Date Mon, 17 Dec 2018 03:12:37 GMT
Hi Jaanai,

According to Phoenix website, " If auto commit is on, and both a) the
target table matches the source table, and b) the select performs no
aggregation, then the population of the target table will be done
completely on the server-side (with constraint violations logged, but
otherwise ignored). Otherwise, data is buffered on the client and, if auto
commit is on, committed in row batches as specified by the UpsertBatchSize
connection property (or the phoenix.mutate.upsertBatchSize HBase config
property which defaults to 10000 rows)"
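The client-side buffering path the docs describe can be sketched roughly like this (a plain Python illustration; `upsert_select_client_side`, its arguments, and the batching logic are hypothetical stand-ins for what the JDBC driver does, not Phoenix API):

```python
# Rough sketch of the client-side UPSERT SELECT path described above:
# rows are buffered on the client and committed in batches of
# UpsertBatchSize (default 10000). Hypothetical illustration only.

UPSERT_BATCH_SIZE = 10000  # phoenix.mutate.upsertBatchSize default

def upsert_select_client_side(source_rows, commit):
    """Buffer rows on the client, committing every UPSERT_BATCH_SIZE rows."""
    buffer = []
    commits = 0
    for row in source_rows:
        buffer.append(row)
        if len(buffer) >= UPSERT_BATCH_SIZE:
            commit(buffer)      # one row batch per commit
            commits += 1
            buffer = []
    if buffer:                  # trailing partial batch
        commit(buffer)
        commits += 1
    return commits
```

Under this model the client only ever holds one batch at a time, which is the point of the UpsertBatchSize property.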

And our SQL statement is just: *upsert into table2 select * from table1;* which
should match the first case, so all operations should happen on the server
side. But the memory usage on the client machine is higher than with the
"upsert select ... limit" form. The memory usage was checked by running the
'top' command on Linux, so we are sure it is caused by Phoenix's "upsert
select" and not by anything else, and we can't explain why memory usage is so
high on the client/gateway machine when all operations are supposed to happen
on the server side.
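The per-regionserver limit problem Vincent describes further down the thread can be sketched as follows (a plain Python illustration of why LIMIT forces results back to the client, not Phoenix code):

```python
# Why LIMIT cannot be pushed down to each regionserver independently:
# with N regionservers each applying LIMIT k, the total could be N * k
# rows. Hypothetical illustration only.

def per_region_limit(regions, k):
    """Wrong: each regionserver applies the limit on its own."""
    return [row for region in regions for row in region[:k]]

def client_side_limit(regions, k):
    """Right: results stream back to the client, which applies the limit once."""
    merged = [row for region in regions for row in region]
    return merged[:k]
```

With three regions of five rows each and a limit of four, the per-region version returns 12 rows while the client-side version correctly returns 4, which is why adding LIMIT pulls results through the client instead of staying server-side.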

Thanks,
Shawn

On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.poster@gmail.com> wrote:

> Shawn,
>
> The UPSERT SELECT will run in a coprocessor on the server side only if it
> has no limit clause, queries a single table, performs no aggregation, uses
> no sequences, and auto commit is on. Please check your SQL ... and you can
> also check whether some resources have not been released.
>
> ----------------------------------------
>    Jaanai Zhang
>    Best regards!
>
>
>
> Shawn Li <shawnlijob@gmail.com> 于2018年12月13日周四 下午12:10写道:
>
>> Hi Jaanai,
>>
>> Thanks for sharing your thoughts. The behavior you describe is correct on
>> the HBase region server side: memory usage for the block cache and memstore
>> will be high under such high throughput. But our Phoenix client runs on a
>> gateway machine (no HBase region server or any other HBase service on it),
>> so we are not sure how to explain such high memory usage for an upsert
>> select without a "limit" clause. The memory behavior looks as if all select
>> results were sent to the client machine, cached in its memory, and then
>> inserted back into the target table, which is not what should happen: all
>> of this should be done on the server side since the table schemas are
>> identical. By the way, this happens on both Phoenix 4.7 and Phoenix 4.14.
>>
>>
>> Thanks,
>> Shawn
>>
>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.poster@gmail.com>
>> wrote:
>>
>>> Shawn,
>>>
>>>
>>> For the upsert without limit, Phoenix will read the source table and
>>> write the target table on the server side.  I think the higher memory
>>> usage is caused by the scan cache and memstore under the higher
>>> throughput.
>>>
>>> ----------------------------------------
>>>    Jaanai Zhang
>>>    Best regards!
>>>
>>>
>>>
>>> Shawn Li <shawnlijob@gmail.com> 于2018年12月13日周四 上午10:13写道:
>>>
>>>> Hi Vincent,
>>>>
>>>> So you are describing that with "limit" the results are sent to the
>>>> client side and then written back to the server, which might explain why
>>>> upsert with "limit" is slower than without it. But that doesn't seem to
>>>> explain the memory usage: memory usage on the client machine is 8 GB
>>>> (without "limit") vs 2 GB (with "limit"), and upsert without "limit" can
>>>> even reach 20 GB for a big table.
>>>>
>>>> Thanks,
>>>> Shawn
>>>>
>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentpoon@apache.org>
>>>> wrote:
>>>>
>>>>> I think it's done client-side if you have LIMIT.  If you have e.g.
>>>>> LIMIT 1000, it would be incorrect for each regionserver to upsert 1000
>>>>> rows if you have more than one regionserver.  So instead the results are
>>>>> sent back to the client, where the LIMIT is applied, and then written
>>>>> back to the server in the UPSERT.
>>>>>
>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnlijob@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Vincent,
>>>>>>
>>>>>>
>>>>>>
>>>>>> The table creation statement is similar to the one below. We have about
>>>>>> 200 fields. The table is mutable and doesn't have any index on it.
>>>>>>
>>>>>>
>>>>>>
>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>
>>>>>>       state CHAR(2) NOT NULL,
>>>>>>
>>>>>>       city VARCHAR,
>>>>>>
>>>>>>       population BIGINT,
>>>>>>
>>>>>>       …
>>>>>>
>>>>>>       CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Shawn
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentpoon@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> For #2, can you provide the table definition and the statement
>>>>>>> used?  e.g. Is the table immutable, or does it have indexes?
>>>>>>>
>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li <shawnlijob@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 1.       Want to check what is actually running underneath for the
>>>>>>>> limit clause used in the following upsert statement (is any
>>>>>>>> coprocessor involved behind the scenes?):
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *                                  upsert into table2 select * from
>>>>>>>> table1 limit 3000000; * (table 1 and table 2 have the same schema)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>               The above statement runs a lot slower than the one
>>>>>>>> without the "limit" clause, shown below, even though the statement
>>>>>>>> with "limit" upserts less data:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *                                upsert into table2 select * from
>>>>>>>> table1;*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2.       We also observe that memory usage is pretty high without the
>>>>>>>> limit clause (8gb vs 2gb); for a big table it can sometimes reach
>>>>>>>> 20gb without the limit clause.  According to the Phoenix website
>>>>>>>> description of upsert select: "If auto commit is on, and both a) the
>>>>>>>> target table matches the source table, and b) the select performs no
>>>>>>>> aggregation, then the population of the target table will be done
>>>>>>>> completely on the server-side (with constraint violations logged, but
>>>>>>>> otherwise ignored)."
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                My question is If everything is done on server-side,
>>>>>>>> how come we have such high memory usage on the client machine?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Shawn
>>>>>>>>
>>>>>>>
