phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: how to design my table?
Date Thu, 16 Jan 2014 08:19:17 GMT
Hi,
That's some really good information about your use case.  Here's an idea I
had for your similar question on the HBase mailing list:
http://mail-archives.apache.org/mod_mbox/hbase-user/201401.mbox/%3CCAG_TOPDpp%2BTWm0wZ3es6dPFwtn5x1grzrrXA-YsSHdXzRnVf%2BQ%40mail.gmail.com%3E

Thanks,
James


On Wed, Jan 15, 2014 at 3:37 AM, Li Li <fancyerii@gmail.com> wrote:

> 1. goal
>     I want to implement a distributed vertical (topical) crawler. it
> will only store webpages on a certain topic; I will have a classifier
> to do this.
>     I estimate the number of webpages that need to be stored at about
> tens of millions (maybe hundreds of millions as time goes on).
>     a vertical crawler should crawl the pages most likely related to
> my target topics, so I need a frontier that can dispatch tasks by
> priority.
>     for now, the priority is simple, but we hope it can handle
> complicated priority algorithms.
>     1.1 host priority
>           we should crawl many hosts at the same time rather than a
> single host. initially, each host should be crawled equally, but
> over time we can calculate host priorities dynamically
>           e.g. we can control the crawl speed for a certain host based
> on its crawl history (some sites will ban our crawler if we use too
> many concurrent threads against them), or we can adjust a host's
> priority by whether it is relevant to our topic (we can calculate
> the relevance of its crawled pages).
>     1.2 enqueue time
>           webpages enqueued earlier should get higher priority
>     1.3 depth
>           webpages at smaller depth should get higher priority
> (something like a BFS traversal)
>     1.4 other page priorities
>           e.g. PageRank, list page vs. detail page, ...
>
> 2. architecture
>    see picture:   http://www.flickr.com/photos/114261973@N07/
>     2.1 Seed Discover
>           use Google or other websites to find some seed urls
>     2.2 Url DB
>           a distributed DB to store all metadata about urls (this is
> the most HBase-related part)
>     2.3 Task Scheduler
>          as described before, the task scheduler selects the top N
> priority webpages and dispatches them to the fetcher cluster
>     2.4 Message Queues
>          we use ActiveMQ to decouple the different modules and also
> for load balancing
>     2.5 Fetchers
>           Download webpages
>     2.6 WebPageDB
>           stores crawled webpages and the metadata extracted from them
> (such as title, content, pub_time, author, etc.). we are considering
> HBase for this too.
>     2.7 Extractors
>           use the classifier to judge whether a page is related to our
> topics, extract metadata from it, and store the results in WebPageDB
>
>
> 3. main challenges
>     3.1 Url DB
>        as described before, this store (maybe HBase) should support
> sophisticated priority algorithms. we also use it to avoid crawling
> a webpage more than once.
>     3.2 task scheduler
>        how to achieve the scheduling goals described above
>
> 4. current solution
>     4.1 use HBase (maybe together with Phoenix) to store urls (we have
> not done the schema design yet and hope to get some advice here)
>     4.2 scheduler algorithm
>           int batchSize = 10000;
>           // dispatch batchSize tasks across hosts according to host priorities
>           Map<String, Integer> hostCount = ...
>           // select the top-priority urls from each host
>           List<String> toBeCrawledUrls = new ArrayList<String>(batchSize);
>           for (Entry<String, Integer> entry : hostCount.entrySet()) {
>               // select the top N priority urls from the given host
>               List<String> urls = selectTopNUrlsFromHost(entry.getKey(),
>                                                          entry.getValue());
>               toBeCrawledUrls.addAll(urls);
>           }
>           // dispatch these urls to the message queue
>           // monitor the message queue status
>           // when the queue is fully (or 3/4) consumed, go back to the top
>           // and dispatch another batch of urls
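>
>           selectTopNUrlsFromHost would presumably run a Phoenix query
>           like the one in 6.3 below, e.g. (just a sketch, assuming the
>           columns listed in section 5):
>
>               select url from url_db
>               where host = ? and status = 0   -- 0: not_crawled
>               order by priority, add_time
>               limit 100;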
>
> 5. table columns
> 5.1 url  varchar  e.g. http://www.google.com/index.html
> 5.2 status tinyint 0: not_crawled  1: crawling   2: success  3: fail
> 5.3 host  varchar  www.google.com
> 5.4 depth tinyint
> 5.5 crawl_time date (what's the difference between date, time and timestamp?)
> 5.6 add_time date
> 5.7 priority int
>
> 6. table schema 1
>     primary key: url
>     index (host, status) including (priority, add_time, url)
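>
>     in Phoenix DDL this might look roughly like the following (just a
>     sketch; the index name is a placeholder, and url, being the primary
>     key, should be carried in the index rows automatically):
>
>         create table url_db (
>             url        varchar not null primary key,
>             status     tinyint,
>             host       varchar,
>             depth      tinyint,
>             crawl_time date,
>             add_time   date,
>             priority   integer
>         );
>         -- covered index for the per-host scheduling query in 6.3
>         create index idx_host_status on url_db (host, status)
>             include (priority, add_time);
>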
> 6.1 host
>       select distinct host from url_db;
>
> 6.2 dedup
>       select url from url_db where url='http://www.google.com/index.html'
>       btw, how can I use upsert without overwriting the previous
> version? something like the mysql statement:
>       insert ... on duplicate key ignore
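>
>       one pattern I can think of (just a sketch, and not atomic, since
>       Phoenix's upsert always writes) is to check first and only upsert
>       when no row comes back:
>
>           -- check whether the url is already known
>           select 1 from url_db where url='http://www.google.com/index.html';
>           -- only if the select returns nothing:
>           upsert into url_db (url, status, host, depth, add_time, priority)
>           values ('http://www.google.com/index.html', 0, 'www.google.com',
>                   0, current_date(), 0);
>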
> 6.3  select
>        select url from url_db where (host, status) = ('www.google.com', 0)
>        order by priority, add_time limit 100;
> 6.4  update status
>        update url_db set status=1 where url='http://www.google.com/index.html'
>        performance analysis:
>            1. can "select host" use a rowkey scan?
>            2. dedup by url is very quick because url is the rowkey
>            3. the select is quick because of the secondary index
>            4. an update will not remove a row (compare with schema 2)
>               but just Put new values into the main table
>            5. but I guess the update will cause a row deletion and
>               insertion in the index table (the indexed status changes)
>
> 7. table schema 2
>          primary key: url
>          index(host) column_family(status) including(priority, add_time,
> url)
>          is this valid in phoenix? I mean, create an index on host and
> a column family per status value?
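>
>          as far as I know, Phoenix supports static column families by
>          prefixing a column with its family name, but not a new family
>          per status value. the static form might look like this (a
>          sketch; the family name 'm' is just a placeholder):
>
>              create table url_db (
>                  url        varchar not null primary key,
>                  host       varchar,
>                  m.status   tinyint,
>                  m.priority integer,
>                  m.add_time date
>              );
>              create index idx_host on url_db (host)
>                  include (m.status, m.priority, m.add_time);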
>
> 8. table schema 3
>          primary key: (host,status,priority,add_time)
>          index(url)
>      all sql statements are the same as above.
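>
>        roughly, as Phoenix DDL (a sketch; url may need to be appended
>        to the primary key as well, since (host, status, priority,
>        add_time) alone may not be unique per url):
>
>            create table url_db (
>                host       varchar not null,
>                status     tinyint not null,
>                priority   integer not null,
>                add_time   date not null,
>                url        varchar,
>                depth      tinyint,
>                crawl_time date,
>                constraint pk primary key (host, status, priority, add_time)
>            );
>            create index idx_url on url_db (url);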
>
>        performance analysis:
>            1. can "select host" use a rowkey scan?
>            2. the select is fast because it is a rowkey scan
>            3. dedup goes through the secondary index
>            4. an update will cause a deletion and insertion in the main
>               table (status is part of the rowkey)
>            5. I guess an update will also cause an index table update
>
> 9. schema 4
>       using a dedicated host table to store only hosts; the host count
> is far smaller than the url count.
>       but using another table will require client code to keep the two
> consistent.
>       and also, maybe in the future we will need a query like:
>       select host, count(host) as cc from url_db group by host
>       order by cc desc;
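>
>       the host table itself might be as simple as (a sketch; the extra
>       column names are placeholders for whatever priority state we keep):
>
>           create table host_db (
>               host          varchar not null primary key,
>               host_priority integer,
>               crawl_count   bigint
>           );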
>
> hope you can give me some advice, thanks.
>
