phoenix-user mailing list archives

From Ciureanu Constantin <ciureanu.constan...@gmail.com>
Subject Re: PHOENIX SPARK - Load Table as DataFrame
Date Wed, 18 May 2016 19:24:17 GMT
Hello Mohan,

Since you haven't mentioned using any tricks for the join, I would assume
the entire table is streamed through Spark and joined there with your file.

Some tricks to improve the speed; pick one that works in your case, or
devise something of your own, since you know your use case best:
- use a join inside Phoenix: insert the file data into a new temp table and
join the two tables there (see the sketch after this list)
- join on the HBase table keys and/or use scans with filters, which
coprocessors apply in parallel
- use some form of bucketing, so that repeated scans touch only those areas
of the HBase table where data from the file could produce results
- add an index to the table in case it improves the speed
- etc.
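
A minimal sketch of the first option over Phoenix JDBC; FILE_TEMP,
TARGET_TABLE, the ID and NAME columns, the ZooKeeper quorum and the
idsFromFile collection are all placeholders, not names from your setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
conn.setAutoCommit(false);
Statement ddl = conn.createStatement();
ddl.execute("CREATE TABLE IF NOT EXISTS FILE_TEMP (ID VARCHAR PRIMARY KEY)");
// Upsert the file's join keys into the temp table.
PreparedStatement ps =
    conn.prepareStatement("UPSERT INTO FILE_TEMP (ID) VALUES (?)");
for (String id : idsFromFile) { // idsFromFile: however you read the file
    ps.setString(1, id);
    ps.executeUpdate();
}
conn.commit();
// The join runs inside Phoenix; only matching rows come back to the client.
ResultSet rs = conn.createStatement().executeQuery(
    "SELECT t.ID, t.NAME FROM TARGET_TABLE t JOIN FILE_TEMP f ON t.ID = f.ID");
while (rs.next()) {
    System.out.println(rs.getString(1) + " -> " + rs.getString(2));
}
conn.close();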

Good luck,
Constantin
On 18 May 2016 at 15:28, "Josh Mahonin" <jmahonin@gmail.com> wrote:

> Hi Mohan,
>
> Generally speaking, you can treat the Phoenix RDD / DataFrames the same as
> any other type, regardless of the source. If you look at the Spark
> programming guide [1], they have great documentation on how and when data
> is loaded into memory.
>
> Josh
>
> [1]
> http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
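>
> For instance, a minimal sketch, assuming the DataFrame from your snippet
> (the IDs below are just illustrative values):
>
> // Transformations are lazy: this only builds a plan, nothing is read yet.
> DataFrame matches = df.filter(df.col("ID").isin("id1", "id2"));
> // The action is what actually triggers the scan / load.
> long n = matches.count();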
>
> On Tue, May 17, 2016 at 6:21 AM, Mohanraj Ragupathiraj <
> mohanaugust@gmail.com> wrote:
>
>> I have created a DataFrame from an HBase table (via Phoenix) which has 500
>> million rows. From the DataFrame I created an RDD of JavaBeans and used it
>> for joining with data from a file.
>>
>> import org.apache.spark.api.java.JavaPairRDD;
>> import org.apache.spark.api.java.JavaRDD;
>> import org.apache.spark.api.java.function.PairFunction;
>> import org.apache.spark.sql.DataFrame;
>> import org.apache.spark.sql.Row;
>> import scala.Tuple2;
>>
>> Map<String, String> phoenixInfoMap = new HashMap<String, String>();
>> phoenixInfoMap.put("table", tableName);
>> phoenixInfoMap.put("zkUrl", zkURL);
>> // Define the Phoenix source; no rows are read until an action runs.
>> DataFrame df = sqlContext.read()
>>     .format("org.apache.phoenix.spark")
>>     .options(phoenixInfoMap)
>>     .load();
>> JavaRDD<Row> tableRows = df.toJavaRDD();
>> // Key the rows by ID so they can be joined with the file data.
>> JavaPairRDD<String, String> dbData = tableRows.mapToPair(
>>     new PairFunction<Row, String, String>()
>>     {
>>         @Override
>>         public Tuple2<String, String> call(Row row) throws Exception
>>         {
>>             return new Tuple2<String, String>(row.getAs("ID"), row.getAs("NAME"));
>>         }
>>     });
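>>
>> For the join step, a minimal sketch (assuming the file is a plain text
>> file whose first comma-separated column is the ID; "filePath" and the
>> layout are illustrative):
>>
>> JavaPairRDD<String, String> fileData = sc.textFile(filePath).mapToPair(
>>     new PairFunction<String, String, String>()
>>     {
>>         @Override
>>         public Tuple2<String, String> call(String line) throws Exception
>>         {
>>             String[] parts = line.split(",");
>>             return new Tuple2<String, String>(parts[0], line);
>>         }
>>     });
>> // Inner join: only keys present on both sides survive.
>> JavaPairRDD<String, Tuple2<String, String>> joined = dbData.join(fileData);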
>>
>> Now my question: let's say the file has 2 million unique entries matching
>> the table. Is the entire table loaded into memory as an RDD, or will only
>> the matching 2 million records from the table be loaded into memory as an
>> RDD?
>>
>> --
>> Thanks and Regards
>> Mohan
>>
>
>
