phoenix-user mailing list archives

From Josh Mahonin <jmaho...@gmail.com>
Subject Re: Phoenix spark and dynamic columns
Date Wed, 27 Jul 2016 14:35:20 GMT
Hi Paul,

Unfortunately, out of the box the Spark integration doesn't support saving
to dynamic columns. It's worth filing a JIRA enhancement request for, and if
you're interested in contributing a patch, here are the spots I think
would need enhancing:

The saving code derives the column names to use with Phoenix from the
DataFrame itself here [1], as `fieldArray`. We would likely need a new
DataFrame parameter here [2] to pass in the column list (with dynamic
columns included).

[1]
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala#L32-L35
[2]
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DefaultSource.scala#L38

The output configuration, which takes care of getting the MapReduce bits
ready for saving, would also need to be updated to support the dynamic
column definitions here [3], and then the 'UPSERT' statement construction
would need to be adjusted to support those as well here [4].

[3]
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/ConfigurationUtil.scala#L25-L38
[4]
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/util/PhoenixConfigurationUtil.java#L259
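
To make the last point concrete, here is a rough sketch of what the
adjusted 'UPSERT' construction might produce. Phoenix declares dynamic
columns inline in the column list as "name type". The object and method
names below (DynamicUpsertSketch, buildUpsert) are hypothetical and are
not part of phoenix-spark or PhoenixConfigurationUtil; identifier quoting
and case handling are left out on purpose.

```scala
// Hypothetical helper for illustration only; not an existing Phoenix API.
object DynamicUpsertSketch {

  // staticCols:  columns that already exist in the table definition
  // dynamicCols: (name, SQL type) pairs to declare inline as dynamic columns
  def buildUpsert(table: String,
                  staticCols: Seq[String],
                  dynamicCols: Seq[(String, String)]): String = {
    // Dynamic columns carry their type in the column list: "cat1 VARCHAR"
    val dynamicPart = dynamicCols.map { case (name, sqlType) => s"$name $sqlType" }
    val columns = (staticCols ++ dynamicPart).mkString(", ")
    // One bind placeholder per column, static or dynamic
    val placeholders =
      Seq.fill(staticCols.size + dynamicCols.size)("?").mkString(", ")
    s"UPSERT INTO $table ($columns) VALUES ($placeholders)"
  }
}

object DynamicUpsertExample extends App {
  // Column names are passed through verbatim, so "KEY" is quoted by the caller.
  val sql = DynamicUpsertSketch.buildUpsert(
    "mytable",
    Seq("\"KEY\""),
    Seq("cat1" -> "VARCHAR", "cat2" -> "VARCHAR"))
  println(sql)
}
```

For the table in your example, the column list would need to read
("KEY", cat1 VARCHAR, cat2 VARCHAR), with three bind placeholders in the
VALUES clause.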

Thanks,

Josh


On Mon, Jul 25, 2016 at 5:49 PM, Paul Jones <pajones@adobe.com> wrote:

> Is it possible to save a dataframe into a table where the columns are
> dynamic?
>
> For instance, I have loaded a CSV file with header (key, cat1, cat2)
> into a dataframe. All values are strings. I created a table like this:
> create table mytable ("KEY" varchar not null primary key); The code is as
> follows:
>
>     val df = sqlContext.read
>         .format("com.databricks.spark.csv")
>         .option("header", "true")
>         .option("inferSchema", "true")
>         .option("delimiter", "\t")
>         .load("saint.tsv")
>
>     df.write
>         .format("org.apache.phoenix.spark")
>         .mode("overwrite")
>         .option("table", "mytable")
>         .option("zkUrl", "servier:2181/hbase")
>         .save()
>
> The CSV files I process always have a key column, but I don't know what the
> other columns will be until I start processing. The code above fails for my
> example unless I create static columns named cat1 and cat2. Can I change
> the save somehow to run an upsert specifying the column names and types,
> thus saving into dynamic columns?
>
> Thanks in advance,
> Paul
>
>
