phoenix-user mailing list archives

From Shawn Li <shawnli...@gmail.com>
Subject Re: "upsert select" with "limit" clause
Date Tue, 18 Dec 2018 05:07:48 GMT
Hi Jonathan,

The single-threaded-on-one-side explanation sounds logical to me. Hopefully
Vincent can confirm it.

Thanks,
Shawn

On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonathaz@gmail.com> wrote:

> My guess is that in order to enforce the limit it's effectively
> single-threaded in either the select or the upsert.
>
> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnlijob@gmail.com> wrote:
>
> Hi Vincent,
>
> Thanks for explaining. That makes much more sense now, and it explains the
> high memory usage without the "limit" clause. Because "upsert select"
> without "limit" upserts much more quickly, the memory usage on the client
> machine is much higher than with "limit".
>
> So back to the other question. Can you explain the underlying Phoenix
> implementation for "upsert select" with "limit"? Why is it slower than
> without "limit" when inserting a huge number of rows (2M), as in "upsert
> into table2 select * from table1 limit 2000000;"? This is much slower than
> inserting the whole table (upsert into table2 select * from table1;).
>
> Thanks,
> Xiang
>
>
> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentpoon@apache.org>
> wrote:
>
>> Shawn,
>> Your query "upsert into table2 select * from table1;" would not be run
>> on the server - the source and target table are different.  It would have
>> to be something like:
>> "upsert into table1 select * from table1;"
>>
>> If you want to run server-side upsert select on a target table that is
>> different from the source table, you need to set
>> "phoenix.client.enable.server.upsert.select" to true on your client.
>> There are some other restrictions: the table can't have any global indexes,
>> and the statement can't have a join or where subquery.  We need to update
>> the documentation with this information.
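
[Editor's note: for reference, the client-side setting Vincent mentions would
go in the client's hbase-site.xml. This is a sketch based only on the property
name given above; placement follows the standard HBase client config
convention.]

```xml
<!-- Client-side hbase-site.xml: allow server-side UPSERT SELECT
     when the source and target tables differ -->
<property>
  <name>phoenix.client.enable.server.upsert.select</name>
  <value>true</value>
</property>
```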
>>
>> The reason for all these hurdles is that it's generally not recommended
>> to do server-side upsert select across different tables: it means you're
>> doing cross-regionserver RPCs (e.g. read data from a region of sourcetable,
>> and write to a region of targettable on a different regionserver),
>> potentially tying up handlers in an unpredictable way.
>>
>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnlijob@gmail.com> wrote:
>>
>>> Hi Jaanai,
>>>
>>> According to Phoenix website, " If auto commit is on, and both a) the
>>> target table matches the source table, and b) the select performs no
>>> aggregation, then the population of the target table will be done
>>> completely on the server-side (with constraint violations logged, but
>>> otherwise ignored). Otherwise, data is buffered on the client and, if auto
>>> commit is on, committed in row batches as specified by the UpsertBatchSize
>>> connection property (or the phoenix.mutate.upsertBatchSize HBase config
>>> property which defaults to 10000 rows)"
>>>
>>> And our SQL statement is just: upsert into table2 select * from
>>> table1; which should match the first case, so all operations should be on
>>> the server side. But the memory usage on the client machine is higher than
>>> with the "limit" clause. The memory usage is checked by running the 'top'
>>> command under Linux, so we are sure it is caused by "upsert select" in
>>> Phoenix and not something else, and we can't explain why memory usage is
>>> so high on the client/gateway machine when all operations are supposed to
>>> happen on the server side.
>>>
>>> Thanks,
>>> Shawn
>>>
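[Editor's note: the UpsertBatchSize knob from the documentation quoted above
can also be set through the HBase config property it names. A sketch of the
client-side hbase-site.xml entry; the 10000-row default is from the quote
above, and the 5000 here is just an illustrative value.]

```xml
<!-- Client-side hbase-site.xml: rows committed per batch for
     client-driven UPSERT SELECT (defaults to 10000) -->
<property>
  <name>phoenix.mutate.upsertBatchSize</name>
  <value>5000</value>
</property>
```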
>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.poster@gmail.com>
>>> wrote:
>>>
>>>> Shawn,
>>>>
>>>> The UPSERT SELECT will run in a coprocessor only if it has no limit
>>>> clause, queries just one table, performs no aggregation, uses no
>>>> sequences, and auto commit is on. Please check your SQL, and you can
>>>> also check whether some resources have not been released.
>>>>
>>>> ----------------------------------------
>>>>    Jaanai Zhang
>>>>    Best regards!
>>>>
>>>>
>>>>
>>>> Shawn Li <shawnlijob@gmail.com> wrote on Thu, Dec 13, 2018 at 12:10 PM:
>>>>
>>>>> Hi Jaanai,
>>>>>
>>>>> Thanks for sharing your thoughts. The behavior you describe is correct
>>>>> on the HBase region server side: the memory usage for the block cache
>>>>> and memstore will be high under such high throughput. But our Phoenix
>>>>> client is on a gateway machine (no HBase region server or any other
>>>>> HBase service sitting on it), so I'm not sure how to explain such high
>>>>> memory usage for upsert select without the "limit" clause. The high
>>>>> memory usage looks as if all select results are sent to the client
>>>>> machine, cached in its memory, and then inserted back into the target
>>>>> table, which is not the behavior that should happen; all of this should
>>>>> be done on the server side since the table schemas are exactly the same.
>>>>> By the way, this happens on both Phoenix 4.7 and Phoenix 4.14.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>>
>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.poster@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Shawn,
>>>>>>
>>>>>>
>>>>>> The upsert without limit will read the source table and write the
>>>>>> target table on the server side. I think the higher memory usage is
>>>>>> caused by the scan cache and memstore under the higher throughput.
>>>>>>
>>>>>> ----------------------------------------
>>>>>>    Jaanai Zhang
>>>>>>    Best regards!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Shawn Li <shawnlijob@gmail.com> wrote on Thu, Dec 13, 2018 at 10:13 AM:
>>>>>>
>>>>>>> Hi Vincent,
>>>>>>>
>>>>>>> So you're describing that with limit, results are sent to the client
>>>>>>> side and then written back to the server; that might explain why
>>>>>>> upsert with limit is slower than without limit. But it doesn't seem
>>>>>>> to explain the memory usage. The memory usage on the client machine
>>>>>>> is 8 GB (without "limit") vs 2 GB (with "limit"), and upsert without
>>>>>>> "limit" can even reach 20 GB for a big table.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shawn
>>>>>>>
>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentpoon@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think it's done client-side if you have LIMIT.  If you have e.g.
>>>>>>>> LIMIT 1000, it would be incorrect for each regionserver to upsert
>>>>>>>> 1000 if you have more than one regionserver.  So instead results are
>>>>>>>> sent back to the client, where the LIMIT is applied and then written
>>>>>>>> back to the server in the UPSERT.
>>>>>>>>
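[Editor's note: a toy sketch of the reasoning above, with hypothetical data
rather than Phoenix code. If each regionserver applied the LIMIT
independently, the union could far exceed the limit, so the results must be
merged in one place (the client) before the LIMIT is applied.]

```python
# Toy model: rows spread across three hypothetical regionservers.
regionservers = [
    ["row%d" % i for i in range(0, 1000)],
    ["row%d" % i for i in range(1000, 2000)],
    ["row%d" % i for i in range(2000, 3000)],
]

LIMIT = 1500

# Incorrect: each regionserver applies the LIMIT on its own. Every server
# holds fewer than 1500 rows, so each keeps everything it has.
per_server_counts = [len(rows[:LIMIT]) for rows in regionservers]
wrong_total = sum(per_server_counts)  # 3000 rows, double the limit

# Correct: merge the results in one place, then apply the LIMIT once.
merged = [row for rows in regionservers for row in rows]
limited = merged[:LIMIT]  # exactly 1500 rows

print(wrong_total, len(limited))
```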
>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnlijob@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Vincent,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The table creation statement is similar to the one below. We have
>>>>>>>>> about 200 fields. The table is mutable and doesn't have any index
>>>>>>>>> on it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>
>>>>>>>>>       state CHAR(2) NOT NULL,
>>>>>>>>>
>>>>>>>>>       city VARCHAR,
>>>>>>>>>
>>>>>>>>>       population BIGINT,
>>>>>>>>>
>>>>>>>>>       …
>>>>>>>>>
>>>>>>>>>       CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Shawn
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentpoon@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> For #2, can you provide the table definition and the statement
>>>>>>>>>> used?  e.g. Is the table immutable, or does it have indexes?
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li
>>>>>>>>>> <shawnlijob@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 1. Want to check what is underlying the "limit" clause used in
>>>>>>>>>>> the following upsert statement (is it involving any coprocessor
>>>>>>>>>>> working behind it?):
>>>>>>>>>>>
>>>>>>>>>>> upsert into table2 select * from table1 limit 3000000;
>>>>>>>>>>> (table1 and table2 have the same schema)
>>>>>>>>>>>
>>>>>>>>>>> The above statement runs a lot slower than the version without
>>>>>>>>>>> the "limit" clause shown below, even though it upserts less data:
>>>>>>>>>>>
>>>>>>>>>>> upsert into table2 select * from table1;
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2. We also observe that memory usage is pretty high without the
>>>>>>>>>>> limit clause (8 GB vs 2 GB); for a big table it can sometimes
>>>>>>>>>>> reach 20 GB without the limit clause. According to the Phoenix
>>>>>>>>>>> website's description of upsert select: "If auto commit is on,
>>>>>>>>>>> and both a) the target table matches the source table, and b)
>>>>>>>>>>> the select performs no aggregation, then the population of the
>>>>>>>>>>> target table will be done completely on the server-side (with
>>>>>>>>>>> constraint violations logged, but otherwise ignored)."
>>>>>>>>>>>
>>>>>>>>>>> My question is: if everything is done on the server side, how
>>>>>>>>>>> come we have such high memory usage on the client machine?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Shawn
>>>>>>>>>>>
>>>>>>>>>>
