phoenix-user mailing list archives

From Jonathan Leech <jonat...@gmail.com>
Subject Re: Best strategy for UPSERT SELECT in large table
Date Sun, 18 Jun 2017 17:50:39 GMT
Also, if you're updating that many values and not doing it in bulk / MapReduce / straight to
HFiles, you'll want to give the region servers as much heap as possible, set the store-file
and blocking-store-file limits astronomically high, and make the memstore flush size for the
table as large as possible. This avoids compactions slowing you down and causing timeouts.
You can also break the UPSERT SELECTs into smaller chunks and manually compact in between to
mitigate. The same strategy applies to other large updates through the regular HBase write
path, such as building or rebuilding indexes.
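[Editor's note: the tuning above maps onto a few standard hbase-site.xml properties. A minimal sketch follows; the values are illustrative starting points, not recommendations, and should be sized to your heap and workload.]

```xml
<!-- Illustrative hbase-site.xml overrides for a heavy write burst.
     Exact values depend on region server heap and workload. -->
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>100</value> <!-- store files per store before a minor compaction kicks in -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>1000</value> <!-- store files per store before writes are blocked -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>536870912</value> <!-- 512 MB in the memstore before flushing to disk -->
</property>
```

For the manual compaction between chunks, `major_compact 'TABLE_NAME'` in the HBase shell forces a major compaction of the table.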

> On Jun 18, 2017, at 11:41 AM, Jonathan Leech <jonathaz@gmail.com> wrote:
> 
> Another thing to consider, but only if your 1:1 mapping keeps the primary keys the same,
> is to snapshot the table and restore it with the new name, using a schema that is the union
> of the old and new schemas. I would put the new columns in a new column family. Then use
> UPSERT SELECT, MapReduce, or Spark to transform the data, and finally drop the columns from
> the old schema. This strategy could cut the amount of work roughly in half and avoids
> sending data over the network.
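[Editor's note: a minimal sketch of the snapshot-then-transform approach described above. All table, snapshot, column-family, and column names are hypothetical, and the transformation is a placeholder.]

```sql
-- Prerequisite, in the HBase shell (names illustrative):
--   snapshot 'OLD_TABLE', 'old_table_snap'
--   clone_snapshot 'old_table_snap', 'NEW_TABLE'

-- Declare the clone in Phoenix with the union of both schemas,
-- placing the new columns in their own column family:
CREATE TABLE NEW_TABLE (
  pk              VARCHAR PRIMARY KEY,
  old_cf.old_col  VARCHAR,
  new_cf.new_col  VARCHAR
);

-- Transform in place (hypothetical transformation):
UPSERT INTO NEW_TABLE (pk, new_cf.new_col)
SELECT pk, UPPER(old_col) FROM NEW_TABLE;

-- Once verified, drop the legacy columns:
ALTER TABLE NEW_TABLE DROP COLUMN old_cf.old_col;
```

Because the clone shares HFiles with the snapshot, the restore itself copies no data; only the transformed column family is written.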
> 
>> On Jun 17, 2017, at 5:06 PM, Randy Hu <ruweih@gmail.com> wrote:
>> 
>> If I count the number of trailing zeros correctly, that's 15 billion records,
>> so any solution based on the HBase PUT path (UPSERT SELECT) would probably
>> take far more time than you expect. It would be better to use the
>> MapReduce-based bulk importer provided by Phoenix:
>> 
>> https://phoenix.apache.org/bulk_dataload.html
>> 
>> The importer uses HBase bulk-load mode to write all the data directly into
>> HBase store files (HFiles) and hands them over to HBase in the final stage,
>> thus avoiding the network and random-disk-access cost of going through the
>> HBase region servers.
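[Editor's note: per the page linked above, the MapReduce bulk load is driven by `CsvBulkLoadTool`. A sketch of the invocation follows; the jar version, table name, and input path are illustrative.]

```
# CSV bulk load via MapReduce (names and paths illustrative)
hadoop jar phoenix-<version>-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table MY_TABLE \
    --input /data/my_table.csv
```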
>> 
>> Randy
>> 
>> On Fri, Jun 16, 2017 at 9:51 AM, Pedro Boado [via Apache Phoenix User List]
>> <ml+s1124778n3675h74@n5.nabble.com> wrote:
>> 
>>> Hi guys,
>>> 
>>> We are trying to populate a Phoenix table based on a 1:1 projection of
>>> another table with around 15.000.000.000 records via an UPSERT SELECT in
>>> Phoenix client. We've noticed very poor performance (I suspect the
>>> client is using a single-threaded approach) and lots of client
>>> timeouts.
>>> 
>>> Is there a better way of approaching this problem?
>>> 
>>> Cheers!
>>> Pedro
>>> 
