phoenix-user mailing list archives

From rajeshbabu chintaguntla <rajeshbabu.chintagun...@huawei.com>
Subject RE: Re: Local index related data bulkload
Date Thu, 11 Sep 2014 15:45:35 GMT
Hi Sun,
The code snippet (PhoenixIndexBuilder#batchStarted) you have pointed out is not specific to
local indexing; it is generic for any index. The main idea of the method is to keep the rows
to be indexed in the block cache, so that when we later scan those rows while preparing index
updates we can get them from the cache.
        // The entire purpose of this method impl is to get the existing rows for the
        // table rows being indexed into the block cache, as the index maintenance code
        // does a point scan per row

This gives good performance when a table has more than one index. One more thing: with the psql
tool we do upserts in batches, and each batch has 1000 updates by default (if you don't specify
any value for phoenix.mutate.batchSize). If all the rows are different, we scan the region until
we have cached all 1000 records. That's why
  hasMore = scanner.nextRaw(results);     //Here....
might be taking more time.
Can you tell me how many indexes you have created? One improvement we can make here is to skip
the scan in PhoenixIndexBuilder#batchStarted when there is only one index.
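
For what it's worth, the batch size can also be tuned from the client side. Here is a minimal
sketch of setting it on a Phoenix JDBC connection (the ZooKeeper quorum "zk-host:2181", the value
500 and the class name are placeholders, not taken from this thread):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.util.Properties;

  public class BatchSizeExample {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          // phoenix.mutate.batchSize: number of updates per batch (default 1000, as noted above)
          props.setProperty("phoenix.mutate.batchSize", "500");
          try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181", props)) {
              conn.setAutoCommit(false);
              // ... run UPSERT statements here; mutations are flushed in batches on commit()
              conn.commit();
          }
      }
  }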

@James, currently we are scanning the data region while preparing index updates. Why don't
we prepare them without scanning the data region, if we can get all the index column data
from the hooks?


bq. If someone has successfully loaded data through CsvBulkload using Spark and HDFS, please
kindly give us suggestions.
Please refer to "http://phoenix.apache.org/bulk_dataload.html#Loading via MapReduce" to run the
bulkload from HDFS. There we can pass the index table to build via the --index-table parameter.
But currently there is a problem with local indexing; I will raise an issue and work on it.
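
For reference, an invocation along the lines of the one shown on that page would look roughly like
this (the jar name, table names and HDFS path below are placeholders to adjust for your environment;
--index-table is the parameter mentioned above):

  hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      --table EXAMPLE \
      --index-table EXAMPLE_IDX \
      --input /data/example.csv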


Thanks,
Rajeshbabu.

This e-mail and its attachments contain confidential information from HUAWEI, which
is intended only for the person or entity whose address is listed above. Any use of the
information contained herein in any way (including, but not limited to, total or partial
disclosure, reproduction, or dissemination) by persons other than the intended
recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by
phone or email immediately and delete it!
________________________________
From: sunfl@certusnet.com.cn [sunfl@certusnet.com.cn]
Sent: Thursday, September 11, 2014 6:34 AM
To: user
Subject: Re: Re: Local index related data bulkload

Many thanks.

________________________________


From: rajesh babu Chintaguntla<mailto:chrajeshbabu32@gmail.com>
Date: 2014-09-10 21:09
To: user@phoenix.apache.org<mailto:user@phoenix.apache.org>
Subject: Re: Local index related data bulkload
Hi Sun, I don't have access to the code right now. Tomorrow morning I will check and let you know.

Thanks,
Rajeshbabu

On Wednesday, September 10, 2014, sunfl@certusnet.com.cn <mailto:sunfl@certusnet.com.cn> wrote:
Any available suggestion?

________________________________

From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:24
To: user
Subject: Re: Local index related data bulkload
BTW.
The stack trace info shows that our job's performance bottleneck mainly lies in the following
code:
     region.startRegionOperation();
     try {
         boolean hasMore;
         do {
             List<Cell> results = Lists.newArrayList();
             // Results are potentially returned even when the return value of s.next is false
             // since this is an indication of whether or not there are more values after the
             // ones returned
             hasMore = scanner.nextRaw(results);     //Here....
         } while (hasMore);
     } finally {
         try {
             scanner.close();
         } finally {
             region.closeRegionOperation();
         }
     }

________________________________

From: sunfl@certusnet.com.cn
Sent: 2014-09-09 14:18
To: user
Cc: rajeshbabu chintaguntla
Subject: Local index related data bulkload
Hi all and rajeshbabu,
   Recently our job has encountered severe problems when trying to load data with local indexes
into Phoenix. The data load performance looks very bad compared with our previous data loading
with global indexes. That seems quite absurd, because Phoenix local indexes target heavy-write,
space-constrained use cases, which is exactly our application.
   Observing the stack trace while our job was running, we found the following info:
   [inline screenshot of the stack trace omitted from the archive]

We then looked at org.apache.phoenix.index.PhoenixIndexBuilder and commented out the batchStarted
method. After recompiling Phoenix and restarting the cluster, our job's loading performance
improved significantly. The code of the batchStarted method is appended at the end of this message.
Here are my questions:
1. Can the committers of this code explain the concrete functionality of this method, especially
as it concerns local index data loading?
2. If we modify this code (e.g. comment out the method as we did), is there any potential impact
on how Phoenix works?
3. Can anyone share code showing how to do a data bulkload with local indexes while the data
files are stored in HDFS?
I know that CsvBulkload can do index-related data upserting, while the map-reduce bulkload does
not support that. Maybe our job is better suited to the map-reduce bulkload? So, if someone
has successfully loaded data through CsvBulkload using Spark and HDFS, please kindly give us
suggestions.

Best Regards,
Sun

/**
 * Index builder for covered-columns index that ties into phoenix for faster use.
 */
public class PhoenixIndexBuilder extends CoveredColumnsIndexBuilder {

    @Override
    public void batchStarted(MiniBatchOperationInProgress<Mutation> miniBatchOp) throws IOException {
        // The entire purpose of this method impl is to get the existing rows for the
        // table rows being indexed into the block cache, as the index maintenance code
        // does a point scan per row
        List<KeyRange> keys = Lists.newArrayListWithExpectedSize(miniBatchOp.size());
        List<IndexMaintainer> maintainers = new ArrayList<IndexMaintainer>();
        for (int i = 0; i < miniBatchOp.size(); i++) {
            Mutation m = miniBatchOp.getOperation(i);
            keys.add(PDataType.VARBINARY.getKeyRange(m.getRow()));
            maintainers.addAll(getCodec().getIndexMaintainers(m.getAttributesMap()));
        }
        Scan scan = IndexManagementUtil.newLocalStateScan(maintainers);
        ScanRanges scanRanges = ScanRanges.create(Collections.singletonList(keys), SchemaUtil.VAR_BINARY_SCHEMA);
        scanRanges.setScanStartStopRow(scan);
        scan.setFilter(scanRanges.getSkipScanFilter());
        HRegion region = this.env.getRegion();
        RegionScanner scanner = region.getScanner(scan);
        // Run through the scanner using internal nextRaw method
        region.startRegionOperation();
        try {
            boolean hasMore;
            do {
                List<Cell> results = Lists.newArrayList();
                // Results are potentially returned even when the return value of s.next is false
                // since this is an indication of whether or not there are more values after the
                // ones returned
                hasMore = scanner.nextRaw(results);
            } while (hasMore);
        } finally {
            try {
                scanner.close();
            } finally {
                region.closeRegionOperation();
            }
        }
    }
}
________________________________
