livy-user mailing list archives

From Wandong Wu <wu...@husky.neu.edu>
Subject Re: Some questions about cached data in Livy
Date Fri, 13 Jul 2018 08:22:24 GMT
Hi:

I’m sorry that I’m still confused about Livy's shared object mechanism. In our product
design, the shared object is a DataFrame or table created from some parquet files
in AWS S3. When I use the Python programmatic API to submit a job like this:
(1) I don’t cache df or tmpTable in Spark memory:

def job1(context):
    spark = context.spark_session
    # Read the parquet files from S3; registerTempTable returns None,
    # so keep the DataFrame and the registration as separate statements.
    df = spark.read.parquet('s3://stg-elasticlog-backup-ex4/type=facet_bandwidth_total/ts_hour=41784*/*/*.parquet') \
             .select('in_byte', 'out_byte', 'user_name')
    df.registerTempTable('tmpTable')


(2) Then I reuse “tmpTable” for some SQL queries, like:
def job4(context):
    spark = context.spark_session
    tmpRes3 = spark.sql("select sum(in_byte) as sum_in, user_name from tmpTable \
                           group by user_name \
                           having sum_in > 1000000000 \
                           order by sum_in").toJSON().collect()
    return tmpRes3

(3) Then, I do another query:
def job5(context):
    session = context.spark_session
    tmpRes3 = session.sql("select sum(out_byte) as sum_out, user_name from tmpTable \
                           group by user_name \
                           having sum_out > 1000000000 \
                           order by sum_out").toJSON().collect()
    return tmpRes3
My understanding is: for (1), Spark will find all the parquet file paths, create a DataFrame,
and then register the temp table, but nothing is cached in Spark memory. For (2) and (3), every
time we use “tmpTable”, Spark knows what “tmpTable” refers to, but it re-scans all the files
to rebuild the table each time. Is that right?
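
If that is right, I guess Spark's table cache could avoid the re-scan. Here is a minimal
sketch of what I mean, assuming “tmpTable” from (1) is already registered in the same
session (the job name cache_table_job is just a placeholder I made up):

def cache_table_job(context):
    spark = context.spark_session
    # cacheTable is lazy: this only marks the table for caching.
    spark.catalog.cacheTable('tmpTable')
    # The first action scans S3 once and materializes the cache; later
    # queries on tmpTable should then read from memory instead of S3.
    spark.sql("select count(*) from tmpTable").collect()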


As for Livy's shared object mechanism: you said “user could store this object Foo into JobContext
with a provided name”. Does it store the data itself in memory, or just a reference
to the data? Maybe my question confuses you…
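
To make the question concrete, here is roughly what I imagine. This is only a sketch:
the Java API stores shared objects in the JobContext by name (setSharedObject /
getSharedObject, if I read the docs correctly), and I am assuming, not confirming,
that the Python JobContext exposes something similar; set_shared_object and
get_shared_object below are my hypothetical names:

def job_store(context):
    spark = context.spark_session
    df = spark.table('tmpTable')
    # Hypothetical call: keep the DataFrame under a name in the JobContext.
    # If this only stores a reference (the logical plan), the S3 files
    # would still be re-read by each query unless df is also cached.
    context.set_shared_object('sharedDf', df)

def job_use(context):
    # Hypothetical call: fetch the object stored by the previous job.
    df = context.get_shared_object('sharedDf')
    return df.count()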

In our previous design, we used AWS S3 and Athena. For one report, about 20 SQL queries read the
same files in S3, but every query had to re-load the files, so it was very slow. We
want to speed up these queries.

Because these queries use the same files, we actually only need to load the files once.
Can Livy's shared object mechanism help us, or do we need Spark's cache mechanism?
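
For example, with Spark's cache mechanism our report flow might look like the sketch
below. It is only a sketch, assuming the HttpClient from Livy's python-api, a server at
http://localhost:8998 (the URL and job names are placeholders), that job handles behave
like futures (.result()), and that all jobs run in the same interactive session so they
see the same cached table:

from livy.client import HttpClient

client = HttpClient('http://localhost:8998')

def load_and_cache(context):
    spark = context.spark_session
    # Read the parquet files once, then register and cache the table.
    spark.read.parquet('s3://stg-elasticlog-backup-ex4/type=facet_bandwidth_total/ts_hour=41784*/*/*.parquet') \
         .select('in_byte', 'out_byte', 'user_name') \
         .registerTempTable('tmpTable')
    spark.catalog.cacheTable('tmpTable')
    spark.sql('select count(*) from tmpTable').collect()  # materialize the cache

def sum_in_query(context):
    spark = context.spark_session
    return spark.sql("select sum(in_byte) as sum_in, user_name from tmpTable "
                     "group by user_name having sum_in > 1000000000 "
                     "order by sum_in").toJSON().collect()

client.submit(load_and_cache).result()           # S3 is scanned once here
report_part1 = client.submit(sum_in_query).result()
# ... the other ~19 report queries would reuse the cached tmpTable ...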


Thank you so much!


Yours
Wandong


> On Jul 11, 2018, at 20:17, Saisai Shao <sai.sai.shao@gmail.com> wrote:
> 
> Hi Wandong,
> 
> Livy's shared object mechanism is mainly used to share objects between different Livy jobs;
> it is mainly intended for the Job API. For example, if job A creates an object Foo that should
> be accessed by job B, the user can store this object Foo into the JobContext with a provided
> name; after that, job B can get this object by the name.
> 
> This is different from Spark's cache mechanism. What you mentioned above (tmp table)
> is a Spark-provided table cache mechanism, which is unrelated to Livy.
> 
> 
> 
> Wandong Wu <wu.wa@husky.neu.edu> wrote on Wed, Jul 11, 2018 at 5:46 PM:
> Dear Sir or Madam:
>  
>       I am a Livy beginner. I use Livy because, within an interactive session, different
> Spark jobs can share cached RDDs or DataFrames.
>  
>       When I read some parquet files and create a table called “TmpTable”, the following
> queries will use this table. Does it mean this table has been cached?
>       If cached, where is the table cached: in Livy or in the Spark cluster?
>  
>       Spark also supports a cache function. When I read some parquet files and create
> a table called “TmpTable2”, I add this code: sql_ctx.cacheTable('tmpTable2').
>       On the next query using this table, it will be cached in the Spark cluster, and the
> following queries can use this cached table.
>  
>       What is the difference between caching in Livy and caching in the Spark cluster?
>  
> Thanks!
>  
> Yours
> Wandong
>  

