phoenix-user mailing list archives

From Jonathan Leech <jonat...@gmail.com>
Subject Re: "upsert select" with "limit" clause
Date Tue, 18 Dec 2018 02:25:32 GMT
My guess is that, in order to enforce the limit, it's effectively single-threaded in either
the select or the upsert.

> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnlijob@gmail.com> wrote:
> 
> Hi Vincent,
> 
> Thanks for explaining. That makes much more sense now, and it explains the high memory
> usage without the "limit" clause. Because the upsert proceeds much more quickly when using
> "upsert select" without "limit", the memory usage on the client machine is much higher than
> with "upsert select" with "limit".
> 
> So back to the other question. Can you explain the underlying Phoenix implementation of
> "upsert select" with "limit"? Why is it slower than without "limit" when upserting a huge
> number of rows (2M), as in "upsert into table2 select * from table1 limit 2000000;"? This
> is much slower than inserting the whole table (upsert into table2 select * from table1;).
> 
> Thanks,
> Xiang 
> 
> 
>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentpoon@apache.org> wrote:
>> Shawn,
>> Your query "upsert into table2 select * from table1;" would not be run on the server,
>> because the source and target tables are different.  It would have to be something like:
>> upsert into table1 select * from table1;
>> 
>> If you want to run a server-side upsert select with a target table that is different
>> from the source table, you need to set "phoenix.client.enable.server.upsert.select" to
>> true on your client.
>> There are some other restrictions: the table can't have any global indexes, and the
>> statement can't have a join or a where subquery.  We need to update the documentation
>> with this information.
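>> 
>> A minimal sketch of setting that property on the client (the ZooKeeper host and table
>> names here are placeholders):
>> 
>>     import java.sql.Connection;
>>     import java.sql.DriverManager;
>>     import java.sql.Statement;
>>     import java.util.Properties;
>> 
>>     public class ServerSideUpsertSelect {
>>         public static void main(String[] args) throws Exception {
>>             Properties props = new Properties();
>>             // Opt in to server-side upsert select across different tables.
>>             props.setProperty("phoenix.client.enable.server.upsert.select", "true");
>>             try (Connection conn =
>>                     DriverManager.getConnection("jdbc:phoenix:zk-host", props)) {
>>                 conn.setAutoCommit(true); // the server-side path requires auto commit
>>                 try (Statement stmt = conn.createStatement()) {
>>                     stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1");
>>                 }
>>             }
>>         }
>>     }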
>> 
>> The reason for all these hurdles is that it's generally not recommended to do server-side
>> upsert select across different tables, because that means you're doing cross-regionserver
>> RPCs (e.g. reading data from a region of the source table and writing to a region of the
>> target table on a different regionserver), potentially tying up handlers in an
>> unpredictable way.
>> 
>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnlijob@gmail.com> wrote:
>>> Hi Jaanai,
>>> 
>>> According to the Phoenix website, "If auto commit is on, and both a) the target table
>>> matches the source table, and b) the select performs no aggregation, then the population
>>> of the target table will be done completely on the server-side (with constraint
>>> violations logged, but otherwise ignored). Otherwise, data is buffered on the client
>>> and, if auto commit is on, committed in row batches as specified by the UpsertBatchSize
>>> connection property (or the phoenix.mutate.upsertBatchSize HBase config property which
>>> defaults to 10000 rows)."
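>>> 
>>> As an illustration of that second case, here is a minimal sketch of setting the
>>> UpsertBatchSize connection property named in the quote (a fragment; it assumes the
>>> standard java.sql imports and a placeholder ZooKeeper host):
>>> 
>>>     Properties props = new Properties();
>>>     props.setProperty("UpsertBatchSize", "1000"); // default is 10000 rows
>>>     try (Connection conn =
>>>             DriverManager.getConnection("jdbc:phoenix:zk-host", props)) {
>>>         conn.setAutoCommit(true); // commits happen in UpsertBatchSize row batches
>>>         conn.createStatement().executeUpdate(
>>>             "UPSERT INTO table2 SELECT * FROM table1");
>>>     }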
>>> 
>>> And our sql statement is just: upsert into table2 select * from table1; which should
>>> match the first case, so all operations should be on the server side. But the memory
>>> usage on the client machine is higher than with "upsert select" with "limit". The memory
>>> usage is checked by running the 'top' command under Linux, so we are sure it is caused
>>> by the "upsert select" in Phoenix and not by something else, and we can't explain why
>>> there is such high memory usage on the client/gateway machine when all operations are
>>> supposed to happen on the server side.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.poster@gmail.com> wrote:
>>>> Shawn,
>>>> 
>>>> The UPSERT SELECT will run in a coprocessor only if it has no limit clause, queries
>>>> just one table, performs no aggregation, uses no sequences, and auto commit is on.
>>>> Please check your SQL ... and you can also check whether some resources have not been
>>>> released.
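>>>> 
>>>> For example, a sketch based on the conditions above (conn is a java.sql.Connection
>>>> obtained from the Phoenix JDBC driver; the table name is a placeholder):
>>>> 
>>>>     conn.setAutoCommit(true);
>>>>     Statement stmt = conn.createStatement();
>>>>     // Same source and target table, no limit, no aggregation, no sequences,
>>>>     // and auto commit on: eligible to run in the coprocessor.
>>>>     stmt.executeUpdate("UPSERT INTO table1 SELECT * FROM table1");
>>>>     // The limit clause disqualifies this one, so it goes through the client.
>>>>     stmt.executeUpdate("UPSERT INTO table1 SELECT * FROM table1 LIMIT 1000");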
>>>> 
>>>> ----------------------------------------
>>>>    Jaanai Zhang
>>>>    Best regards!
>>>> 
>>>> 
>>>> 
>>>> Shawn Li <shawnlijob@gmail.com> wrote on Thursday, December 13, 2018 at 12:10 PM:
>>>>> Hi Jaanai,
>>>>> 
>>>>> Thanks for sharing your thoughts. The behavior you describe is correct on the HBase
>>>>> region server side: the memory usage for the blockcache and memstore will be high
>>>>> under such high throughput. But our Phoenix client is on a gateway machine (with no
>>>>> HBase region server or any other HBase service on it), so we are not sure how to
>>>>> explain such high memory usage for upsert select without the "limit" clause. The high
>>>>> memory usage looks as if all select results are sent to the client machine, cached in
>>>>> its memory, and then inserted back into the target table, which is not the behavior
>>>>> that should happen; all of this should be done on the server side since the table
>>>>> schemas are exactly the same. By the way, this happens on both Phoenix 4.7 and
>>>>> Phoenix 4.14.
>>>>> 
>>>>> Thanks,
>>>>> Shawn 
>>>>> 
>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.poster@gmail.com> wrote:
>>>>>> Shawn,
>>>>>> 
>>>>>> 
>>>>>> The upsert without limit will read the source table and write the target table on
>>>>>> the server side.  I think the higher memory usage is caused by the scan cache and
>>>>>> the memstore under the higher throughput.
>>>>>> 
>>>>>> ----------------------------------------
>>>>>>    Jaanai Zhang
>>>>>>    Best regards!
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Shawn Li <shawnlijob@gmail.com> wrote on Thursday, December 13, 2018 at 10:13 AM:
>>>>>>> Hi Vincent,
>>>>>>> 
>>>>>>> So you are describing that with "limit" the results are sent to the client side and
>>>>>>> then written back to the server; this might explain why upsert with "limit" is
>>>>>>> slower than without it. But it doesn't seem to explain the memory usage: the memory
>>>>>>> usage on the client machine is 8 GB (without "limit") vs 2 GB (with "limit"), and
>>>>>>> for a big table the upsert without "limit" can even reach 20 GB.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Shawn 
>>>>>>> 
>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentpoon@apache.org> wrote:
>>>>>>>> I think it's done client-side if you have LIMIT.  If you have e.g. LIMIT 1000, it
>>>>>>>> would be incorrect for each regionserver to upsert 1000 rows if you have more than
>>>>>>>> one regionserver.  So instead results are sent back to the client, where the LIMIT
>>>>>>>> is applied, and then written back to the server in the UPSERT.
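>>>>>>>> 
>>>>>>>> You can look at the plan Phoenix chooses with EXPLAIN (EXPLAIN on an upsert select
>>>>>>>> is documented Phoenix SQL, though the exact plan text varies by version). A sketch,
>>>>>>>> where conn is a Phoenix JDBC connection:
>>>>>>>> 
>>>>>>>>     ResultSet rs = conn.createStatement().executeQuery(
>>>>>>>>         "EXPLAIN UPSERT INTO table2 SELECT * FROM table1 LIMIT 3000000");
>>>>>>>>     while (rs.next()) {
>>>>>>>>         System.out.println(rs.getString(1)); // one plan step per row
>>>>>>>>     }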
>>>>>>>> 
>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnlijob@gmail.com> wrote:
>>>>>>>>> Hi Vincent,
>>>>>>>>> 
>>>>>>>>> The table creation statement is similar to the one below. We have about 200
>>>>>>>>> fields. The table is mutable and doesn't have any index on it.
>>>>>>>>> 
>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>       state CHAR(2) NOT NULL,
>>>>>>>>>       city VARCHAR,
>>>>>>>>>       population BIGINT,
>>>>>>>>>       …
>>>>>>>>>       CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Shawn
>>>>>>>>> 
>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentpoon@apache.org> wrote:
>>>>>>>>>> For #2, can you provide the table definition and the statement used?  e.g. Is
>>>>>>>>>> the table immutable, or does it have indexes?
>>>>>>>>>> 
>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li <shawnlijob@gmail.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> 1.  I want to check what runs under the hood for the limit clause used in the
>>>>>>>>>>> following upsert statement (does it involve any coprocessor working behind the
>>>>>>>>>>> scenes?):
>>>>>>>>>>> 
>>>>>>>>>>>       upsert into table2 select * from table1 limit 3000000;
>>>>>>>>>>>       (table1 and table2 have the same schema)
>>>>>>>>>>> 
>>>>>>>>>>> The above statement runs a lot slower than the version without the "limit"
>>>>>>>>>>> clause shown below, even though it upserts less data:
>>>>>>>>>>> 
>>>>>>>>>>>       upsert into table2 select * from table1;
>>>>>>>>>>> 
>>>>>>>>>>> 2.  We also observe that memory usage is pretty high without the limit clause
>>>>>>>>>>> (8 GB vs 2 GB); sometimes for a big table it can reach 20 GB without the limit
>>>>>>>>>>> clause.  According to the Phoenix website's description of upsert select, "If
>>>>>>>>>>> auto commit is on, and both a) the target table matches the source table, and
>>>>>>>>>>> b) the select performs no aggregation, then the population of the target table
>>>>>>>>>>> will be done completely on the server-side (with constraint violations logged,
>>>>>>>>>>> but otherwise ignored)."
>>>>>>>>>>> 
>>>>>>>>>>> My question is: if everything is done on the server side, how come we have such
>>>>>>>>>>> high memory usage on the client machine?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Shawn
