phoenix-user mailing list archives

From Vincent Poon <vincentp...@apache.org>
Subject Re: "upsert select" with "limit" clause
Date Tue, 18 Dec 2018 22:31:13 GMT
Shawn, that sounds like a bug, I would file a JIRA.

On Tue, Dec 18, 2018 at 12:33 PM Shawn Li <shawnlijob@gmail.com> wrote:

> Hi Vincent & William,
>
>
>
> Below is the explain plan; both are executed in PARALLEL according to the plan:
>
>
>
> explain upsert into table1 select * from table2;
>
>
>
> UPSERT SELECT
> CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN
> FULL SCAN OVER table2
>
>
>
> explain upsert into table1 select * from table2 limit 2000000;
>
>
>
> UPSERT SELECT
> CLIENT 27-CHUNK 36000000 ROWS 481140000000 BYTES PARALLEL 18-WAY ROUND
> ROBIN FULL SCAN OVER table2
>     SERVER 2000000 ROW LIMIT
> CLIENT 2000000 ROW LIMIT
>
>
>
>
>
>
>
> Thanks,
>
> Shawn
>
> On Tue, Dec 18, 2018, 13:30 Vincent Poon <vincentpoon@apache.org> wrote:
>
>> Shawn,
>>
>> Can you do an "explain" to show what your two statements are doing?  That
>> might give some clues.  Perhaps one is able to be run on the server for
>> some reason and the other is not.
>> Otherwise, I don't see why one would be substantially slower than the
>> other.
>>
>> Vincent
>>
>> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li <shawnlijob@gmail.com> wrote:
>>
>>> Hi Jonathan,
>>>
>>> The single threaded on one side sounds logical to me. Hopefully Vincent
>>> can confirm it.
>>>
>>> Thanks,
>>> Shawn
>>>
>>> On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonathaz@gmail.com>
>>> wrote:
>>>
>>>> My guess is that in order to enforce the limit that it’s effectively
>>>> single threaded in either the select or the upsert.
>>>>
>>>> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnlijob@gmail.com> wrote:
>>>>
>>>> Hi Vincent,
>>>>
>>>> Thanks for explaining. That makes much more sense now, and it explains
>>>> the high memory usage without the "limit" clause. Because "upsert
>>>> select" without "limit" upserts much more quickly, the memory usage on
>>>> the client machine is much higher than with "limit".
>>>>
>>>> So back to the other question: can you explain Phoenix's underlying
>>>> implementation of "upsert select" with "limit"? Why is it slower than
>>>> without "limit" even when inserting a huge number of rows (2M), as in
>>>> "upsert into table2 select * from table1 limit 2,000,000;"? This is
>>>> much slower than inserting the whole table (upsert into table2 select
>>>> * from table1;).
>>>>
>>>> Thanks,
>>>> Xiang
>>>>
>>>>
>>>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentpoon@apache.org>
>>>> wrote:
>>>>
>>>>> Shawn,
>>>>> Your query *upsert into table2 select * from table1;  *would not be
>>>>> run on the server - the source and target table are different.  It would
>>>>> have to be something like:
>>>>> *upsert into table1 select * from table1;*
>>>>>
>>>>> If you want to run server-side upsert select on a target table that is
>>>>> different from the source table, you need to set
>>>>> "phoenix.client.enable.server.upsert.select" to true on your client.
>>>>> There are some other restrictions: the table can't have any global
>>>>> indexes, and the statement can't have a join or a where subquery.  We
>>>>> need to update the documentation with this information.
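[Editor's note: to make the conditions Vincent lists concrete, here is a rough Python sketch of when an UPSERT SELECT can run server-side, combining the restrictions described in this thread with the Phoenix docs quoted later. This is a simplified illustration, not Phoenix's actual planner code; the function name and flags are hypothetical.]

```python
def runs_server_side(auto_commit, same_table, has_aggregation, has_limit,
                     has_sequences, server_upsert_enabled,
                     has_global_index, has_join_or_subquery):
    """Hypothetical predicate (not Phoenix source): can this UPSERT SELECT
    be pushed into a server-side coprocessor?"""
    # Baseline requirements mentioned in the thread: auto commit must be on,
    # and no LIMIT, aggregation, or sequences in the select.
    if not auto_commit or has_aggregation or has_limit or has_sequences:
        return False
    # Same source and target table: server-side per the Phoenix docs.
    if same_table:
        return True
    # Cross-table upsert select needs the explicit client opt-in
    # ("phoenix.client.enable.server.upsert.select") and must avoid
    # global indexes, joins, and where subqueries.
    return (server_upsert_enabled
            and not has_global_index
            and not has_join_or_subquery)
```

Under this reading, Shawn's `upsert into table2 select * from table1;` is a cross-table case, so it stays client-side unless the opt-in flag is set.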
>>>>>
>>>>> The reason there are all these hurdles is that it's generally not
>>>>> recommended to do server-side upsert select across different tables,
>>>>> because that means you're doing cross-regionserver RPCs (e.g. reading
>>>>> data from a region of sourcetable and writing to a region of
>>>>> targettable on a different regionserver), potentially tying up
>>>>> handlers in an unpredictable way.
>>>>>
>>>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnlijob@gmail.com> wrote:
>>>>>
>>>>>> Hi Jaanai,
>>>>>>
>>>>>> According to the Phoenix website: "If auto commit is on, and both a)
>>>>>> the target table matches the source table, and b) the select performs
>>>>>> no aggregation, then the population of the target table will be done
>>>>>> completely on the server-side (with constraint violations logged, but
>>>>>> otherwise ignored). Otherwise, data is buffered on the client and, if
>>>>>> auto commit is on, committed in row batches as specified by the
>>>>>> UpsertBatchSize connection property (or the
>>>>>> phoenix.mutate.upsertBatchSize HBase config property which defaults
>>>>>> to 10000 rows)"
>>>>>>
>>>>>> And our SQL statement is just *upsert into table2 select * from
>>>>>> table1;* which should match the first case, so all operations should
>>>>>> happen on the server side. But the memory usage on the client machine
>>>>>> is higher than with the "upsert select limit" clause. The memory
>>>>>> usage was checked by running the 'top' command under Linux, so we are
>>>>>> sure it is caused by "upsert select" in Phoenix and nothing else, and
>>>>>> we can't explain why memory usage is so high on the client/gateway
>>>>>> machine when all operations are supposed to happen on the server
>>>>>> side.
>>>>>>
>>>>>> Thanks,
>>>>>> Shawn
>>>>>>
>>>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.poster@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Shawn,
>>>>>>>
>>>>>>> The UPSERT SELECT will run in a coprocessor only if it has no limit
>>>>>>> clause, queries just one table, performs no aggregation, uses no
>>>>>>> sequences, and auto commit is on. Please check your SQL, and you
>>>>>>> can also check whether some resources have not been released.
>>>>>>>
>>>>>>> ----------------------------------------
>>>>>>>    Jaanai Zhang
>>>>>>>    Best regards!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Shawn Li <shawnlijob@gmail.com> wrote on Thu, Dec 13, 2018, at
>>>>>>> 12:10 PM:
>>>>>>>
>>>>>>>> Hi Jaanai,
>>>>>>>>
>>>>>>>> Thanks for sharing your thoughts. The behavior you describe is
>>>>>>>> correct on the HBase region server side: memory usage for the
>>>>>>>> blockcache and memstore will be high under such high throughput.
>>>>>>>> But our Phoenix client is on a gateway machine (with no HBase
>>>>>>>> region server or any other HBase service on it), so I'm not sure
>>>>>>>> how to explain such high memory usage for upsert select without the
>>>>>>>> "limit" clause. It looks as if all select results are sent to the
>>>>>>>> client machine, cached in its memory, and then inserted back into
>>>>>>>> the target table, which is not the behavior that should happen; all
>>>>>>>> of this should be done on the server side since the table schemas
>>>>>>>> are exactly the same. By the way, this happens on both Phoenix 4.7
>>>>>>>> and Phoenix 4.14.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shawn
>>>>>>>>
>>>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.poster@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Shawn,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For the upsert without limit, the source table is read and the
>>>>>>>>> target table written on the server side.  I think the higher
>>>>>>>>> memory usage is caused by the scan cache and memstore under the
>>>>>>>>> higher throughput.
>>>>>>>>>
>>>>>>>>> ----------------------------------------
>>>>>>>>>    Jaanai Zhang
>>>>>>>>>    Best regards!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Shawn Li <shawnlijob@gmail.com> wrote on Thu, Dec 13, 2018, at
>>>>>>>>> 10:13 AM:
>>>>>>>>>
>>>>>>>>>> Hi Vincent,
>>>>>>>>>>
>>>>>>>>>> So you're saying "limit" sends results to the client side and
>>>>>>>>>> then writes them back to the server; that might explain why
>>>>>>>>>> upsert with limit is slower than without it. But it doesn't seem
>>>>>>>>>> to explain the memory usage: the memory usage on the client
>>>>>>>>>> machine is 8 GB (without "limit") vs 2 GB (with "limit"), and
>>>>>>>>>> upsert without "limit" can sometimes even reach 20 GB for a big
>>>>>>>>>> table.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shawn
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentpoon@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think it's done client-side if you have LIMIT.  If you have
>>>>>>>>>>> e.g. LIMIT 1000, it would be incorrect for each regionserver to
>>>>>>>>>>> upsert 1000 rows if you have more than one regionserver.  So
>>>>>>>>>>> instead, results are sent back to the client, where the LIMIT is
>>>>>>>>>>> applied, and then written back to the server in the UPSERT.
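[Editor's note: Vincent's point about why a global LIMIT must be applied on the client can be sketched like this. The function is illustrative only; Phoenix's actual parallel iterators are more involved.]

```python
def client_side_limit(scanner_results, limit):
    """Hypothetical sketch: merge rows from several parallel region scans
    on the client and stop once `limit` rows have been collected, so the
    LIMIT is global rather than per-regionserver."""
    out = []
    for rows in scanner_results:  # one stream per parallel scan
        for row in rows:
            if len(out) == limit:
                return out
            out.append(row)
    return out

# Two regionservers each holding 3 rows: applying LIMIT 3 per server
# would keep all 6 rows, but the correct global result has only 3.
servers = [[1, 2, 3], [4, 5, 6]]
limited = client_side_limit(servers, 3)
```

The per-server total (6 rows) versus the merged result (3 rows) is exactly the inconsistency Vincent describes, which is why the rows round-trip through the client when LIMIT is present.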
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnlijob@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Vincent,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The table creation statement is similar to the one below. We
>>>>>>>>>>>> have about 200 fields. The table is mutable and doesn't have
>>>>>>>>>>>> any index on it.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>>>>       state CHAR(2) NOT NULL,
>>>>>>>>>>>>       city VARCHAR,
>>>>>>>>>>>>       population BIGINT,
>>>>>>>>>>>>       …
>>>>>>>>>>>>       CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Shawn
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentpoon@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> For #2, can you provide the table definition and the statement
>>>>>>>>>>>>> used?  e.g. Is the table immutable, or does it have indexes?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li <shawnlijob@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1.       Want to check what is running underneath the "limit"
>>>>>>>>>>>>>> clause used in the following upsert statement (is any
>>>>>>>>>>>>>> coprocessor working behind it?):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *upsert into table2 select * from table1 limit 3000000;*
>>>>>>>>>>>>>> (table1 and table2 have the same schema)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The above statement runs a lot slower than the version
>>>>>>>>>>>>>> without the "limit" clause shown below, even though it
>>>>>>>>>>>>>> upserts less data:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *upsert into table2 select * from table1;*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2.       We also observe that memory usage is pretty high
>>>>>>>>>>>>>> without the limit clause (8 GB vs 2 GB); sometimes for a big
>>>>>>>>>>>>>> table it can reach 20 GB without the limit clause.  According
>>>>>>>>>>>>>> to the Phoenix website's description of upsert select: "If
>>>>>>>>>>>>>> auto commit is on, and both a) the target table matches the
>>>>>>>>>>>>>> source table, and b) the select performs no aggregation, then
>>>>>>>>>>>>>> the population of the target table will be done completely on
>>>>>>>>>>>>>> the server-side (with constraint violations logged, but
>>>>>>>>>>>>>> otherwise ignored)."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My question is: if everything is done on the server side, how
>>>>>>>>>>>>>> come we have such high memory usage on the client machine?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
