Hi Mohan,

Generally speaking, you can treat Phoenix RDDs / DataFrames the same as those from any other source. The Spark programming guide [1] has good documentation on how and when data is loaded into memory.
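
As a rough illustration of the "how and when" part, here is a small sketch in plain Java streams. It is only an analogy, not Spark or Phoenix code, and the class and method names (LazyScanDemo, tableRows, matching) are made up for the example. Spark transformations are lazy in a similar way: nothing is read until an action runs. Once it does, a join on a plain RDD cannot tell the source which keys it needs, so I'd expect every table row to be read even though only the matching ones are kept.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class LazyScanDemo {
    // Counts how many "table rows" were actually read from the source.
    static final AtomicInteger rowsRead = new AtomicInteger();

    // Hypothetical stand-in for the table: ids ID0..ID(tableSize-1), produced on demand.
    static Stream<String> tableRows(int tableSize) {
        return IntStream.range(0, tableSize)
                .mapToObj(i -> "ID" + i)
                .peek(id -> rowsRead.incrementAndGet());
    }

    // Keeps only rows whose id is in the file's key set (a simple stand-in for the join).
    static List<String> matching(int tableSize, Set<String> fileKeys) {
        return tableRows(tableSize)
                .filter(fileKeys::contains)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> fileKeys = new HashSet<>(Arrays.asList("ID3", "ID7"));

        rowsRead.set(0);
        // Building the pipeline reads nothing yet.
        Stream<String> pipeline = tableRows(10).filter(fileKeys::contains);
        System.out.println("rows read before terminal op: " + rowsRead.get());

        // The terminal operation pulls every row through the filter.
        List<String> result = pipeline.collect(Collectors.toList());
        System.out.println("rows read after terminal op: " + rowsRead.get());
        System.out.println("rows kept: " + result);
    }
}
```

In this toy run all 10 rows are read but only 2 are kept. Whether the Phoenix data source can avoid the full scan in your case depends on whether a filter is applied on the DataFrame before converting it to an RDD, which is worth checking against the Phoenix Spark documentation.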


[1] http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations

On Tue, May 17, 2016 at 6:21 AM, Mohanraj Ragupathiraj <mohanaugust@gmail.com> wrote:
I have created a DataFrame from an HBase table (Phoenix) which has 500 million rows. From the DataFrame I created an RDD of JavaBeans, which I use for joining with data from a file.

// Connection options for the Phoenix data source.
Map<String, String> phoenixInfoMap = new HashMap<String, String>();
phoenixInfoMap.put("table", tableName);
phoenixInfoMap.put("zkUrl", zkURL);
DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(phoenixInfoMap).load();
JavaRDD<Row> tableRows = df.toJavaRDD();
// Key each row by ID, with NAME as the value.
JavaPairRDD<String, String> dbData = tableRows.mapToPair(
    new PairFunction<Row, String, String>() {
        @Override
        public Tuple2<String, String> call(Row row) throws Exception {
            return new Tuple2<String, String>(row.<String>getAs("ID"), row.<String>getAs("NAME"));
        }
    });

Now my question: let's say the file has 2 million unique entries that match rows in the table. Is the entire table loaded into memory as an RDD, or are only the matching 2 million records loaded into memory as an RDD?

Thanks and Regards