phoenix-user mailing list archives

From Paul Jones <pajo...@adobe.com>
Subject Re: Phoenix spark and dynamic columns
Date Wed, 27 Jul 2016 16:32:11 GMT
Josh,

Thank you for your reply. I will take a look at your suggestions.

Thanks,
Paul



Hi Paul,

Unfortunately, out of the box, the Spark integration doesn't support saving to dynamic columns.
It's worth filing a JIRA enhancement for, and if you're interested in contributing a patch,
here are the spots I think would need changes:

The saving code derives the column names to use with Phoenix from the DataFrame itself here
[1] as `fieldArray`. We would likely need a new parameter to pass in the full column
list (with dynamic columns included) here [2]; a rough sketch follows the links below.

[1] https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala#L32-L35
[2] https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DefaultSource.scala#L38
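
To make that concrete, here's a rough sketch of what I mean, reusing the write call from your
example below. Today the integration takes the column list straight from the DataFrame schema;
the "phoenixColumns" option here is made up, just to show where a dynamic column list could be
passed in:

    // Sketch only -- the "phoenixColumns" option does not exist today.
    // DataFrameFunctions currently builds the column list from the schema, roughly:
    //   val fieldArray = data.schema.fieldNames
    // A patched integration could instead accept the full list, dynamic columns included:
    df.write
        .format("org.apache.phoenix.spark")
        .mode("overwrite")
        .option("table", "mytable")
        .option("zkUrl", "server:2181/hbase")
        // hypothetical parameter carrying static and dynamic columns with their types
        .option("phoenixColumns", "KEY, CAT1 VARCHAR, CAT2 VARCHAR")
        .save()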
The output configuration, which takes care of getting the MapReduce bits ready for saving,
would also need to be updated to carry the dynamic column definitions here [3], and the
'UPSERT' statement construction would then need to be adjusted to support them as well here
[4]; a sketch of the statement shape follows the links below.

[3] https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/ConfigurationUtil.scala#L25-L38
[4] https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/util/PhoenixConfigurationUtil.java#L259
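
For [4], note that Phoenix's dynamic-column syntax declares each dynamic column inline with its
type in the UPSERT column list, so the generated statement would need to take roughly this shape
(a sketch of the idea, not the actual patch; the helper and names here are mine):

    // Sketch: the kind of UPSERT the statement builder would need to emit once
    // dynamic column definitions are available from the configuration.
    def buildDynamicUpsert(table: String,
                           staticCols: Seq[String],
                           dynamicCols: Seq[(String, String)]): String = {
      // dynamic columns carry their SQL type inline; static columns are plain names
      val cols = staticCols ++ dynamicCols.map { case (name, sqlType) => s"$name $sqlType" }
      val placeholders = Seq.fill(cols.size)("?")
      s"UPSERT INTO $table (${cols.mkString(", ")}) VALUES (${placeholders.mkString(", ")})"
    }

    // buildDynamicUpsert("MYTABLE", Seq("KEY"), Seq("CAT1" -> "VARCHAR", "CAT2" -> "VARCHAR"))
    // yields: UPSERT INTO MYTABLE (KEY, CAT1 VARCHAR, CAT2 VARCHAR) VALUES (?, ?, ?)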

Thanks,

Josh

On Mon, Jul 25, 2016 at 5:49 PM, Paul Jones <pajones@adobe.com> wrote:
Is it possible to save a dataframe into a table where the columns are dynamic?

For instance, I have loaded a CSV file with the header (key, cat1, cat2) into a DataFrame. All
values are strings. I created a table like this: create table mytable ("KEY" varchar not null
primary key); The code is as follows:

    val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("delimiter", "\t")
        .load("saint.tsv")

    df.write
        .format("org.apache.phoenix.spark")
        .mode("overwrite")
        .option("table", "mytable")
        .option("zkUrl", "servier:2181/hbase")
        .save()

The CSV files I process always have a key column, but I don't know what the other columns
will be until I start processing. The code above fails for my example unless I create static
columns named cat1 and cat2. Can I change the save somehow to run an upsert specifying the
column names and types, thus saving into dynamic columns?
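
To be clear, what I have in mind by "an upsert specifying the column names and types" is the
plain-JDBC equivalent below (just a sketch with my example columns; the host is a placeholder):

    // Hand-rolled dynamic-column upsert over the Phoenix JDBC driver (sketch only).
    import java.sql.DriverManager

    val conn = DriverManager.getConnection("jdbc:phoenix:server:2181")
    val stmt = conn.prepareStatement(
      """UPSERT INTO mytable ("KEY", CAT1 VARCHAR, CAT2 VARCHAR) VALUES (?, ?, ?)""")
    stmt.setString(1, "row1")
    stmt.setString(2, "a")
    stmt.setString(3, "b")
    stmt.executeUpdate()
    conn.commit()
    conn.close()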

Thanks in advance,
Paul
