phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jins George <jins.geo...@aeris.net>
Subject Re: Efficient way to get the row count of a table
Date Wed, 20 Dec 2017 00:27:37 GMT
Thank you James and Mujtaba  for the responses.

I am OK with an estimate count . So using SYSTEM.STATS table looks good 
in my case. But when I tried this, it gave me inconsistent results.  for 
example.

Time 1:
select count(*) from myschema.mytable  => 3474085
select sum(guide_posts_row_count) from system.stats where physical_name 
= 'myschema.device_data'  => 3348090

Time 2 : ( Time1 + ~10 mins)
select count(*) from myschema.mytable  => 3474106
select sum(guide_posts_row_count) from system.stats where physical_name 
= 'myschema.device_data'  => 3348080

So I was expecting the stats count to go up but surprisingly, the count 
went down.  Is there a specific configuration or something else that I 
am missing?

I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.

Thanks,
Jins George


On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
> Another alternate outside Phoenix is to use 
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html 
> M/R.
>
> On Tue, Dec 19, 2017 at 3:18 PM, James Taylor <jamestaylor@apache.org 
> <mailto:jamestaylor@apache.org>> wrote:
>
>     If it needs to be 100% accurate, then count(*) is the only way. If
>     your data is write-once data, you might be able to track the row
>     count at the application level through some kind of atomic counter
>     in a different table (but this will likely be brittle). If you can
>     live with an estimate, you could enable statistics [1], optionally
>     configuring Phoenix not to use stats for parallelization [2], and
>     query the SYSTEM.STATS table to get an estimate [3].
>
>     Another interesting alternative if you want the approximate row
>     count when you have a where clause would be to use the new table
>     sampling feature [4]. You'd also want stats enabled for this to be
>     more accurate too.
>
>     Thanks,
>     James
>
>
>     [1] https://phoenix.apache.org/update_statistics.html
>     <https://phoenix.apache.org/update_statistics.html>
>     [2] phoenix.use.stats.parallelization=false
>     [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
>     physical_name='my_schema.my_table'
>          and COLUMN_FAMILY='my_first_column_family' -- necessary only
>     if you have multiple column families
>     [4] https://phoenix.apache.org/tablesample.html
>     <https://phoenix.apache.org/tablesample.html>
>
>     On Tue, Dec 19, 2017 at 2:57 PM, Jins George
>     <jins.george@aeris.net <mailto:jins.george@aeris.net>> wrote:
>
>         Hi,
>
>         Is there a way to get the total row count of a phoenix table
>         without running select count(*) from table ?
>         my use case is to monitor the record count in a table every x
>         minutes, so didn't want to put load on the system by running a
>         select count(*) query.
>
>         Thanks,
>         Jins George
>
>
>


Mime
View raw message