phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Li Li <>
Subject how to desgin my table?
Date Wed, 15 Jan 2014 11:37:18 GMT
1. goal
    I want to implement a distributed vertical(topical) crawler. it
will only store webpages of a certan topic. I will have a classifier
to do this.
    I estimated the amount of  webpages that need be store is about
tens of millions(maybe hundreds of millions as time goes).
    for vertical crawler, it should crawl the pages most likely
related to my target topics. So I need a frontier that can dispatch
task by priorities.
    for now, the priority is simple but we hope it can deal with
complicated priority algorithms.
    1.1 host priority
          we should crawl many hosts rather than only one single host
at the same time. initally, each hosts should be equally crawled. but
after time, we can calculate the priority of host dynamically
          e.g. we can control the speed of a certain host by it's
crawl history(some site will ban our crawler if we use too many
concurrent thread to it). or we can adjust the priority of a host by
whether it
          is relevant to our topic(we can calculate the relevance of
crawled page).
    1.2 enqueue time
          first enqueued webpages should get higher priority
    1.3 depth
          webpages with small depth will get higher priority(something
like BFS traverse)
    1.4 other page priorities
          e.g. page rank, list page/detail page ...

2. archeitecture
   see picture:
    2.1 Seed Discover
          use google or other website to find some seed urls
    2.2 Url DB
          a distributed DB to store all metadata about urls(that's the
most hbase related)
    2.3 Task Scheduler
         as described before, the task scheduler select top N priority
webpages and dispatch them to fetcher clusters
    2.4 Message Queues
         we use ActiveMQ to decouple different modules and also load balance
    2.5 Fetchers
          Download webpages
    2.6 WebPageDB
          store webpages crawled and extracted metadata(such as
title,content, pub_time, author, etc ....) of this webpage. we
consider using hbase too.
    2.7 Extractors
          Using classifier to judge whether this page is related to
our topics and extracting metadata from it and store them to WebPageDB

3. main challenges
    3.1 Url DB
       as described before, this store(maybe hbase) should support
sophisticated pirority algorithms. and also we use it to avoid
crawling a webpage more than once.
    3.2 task scheduler
       how to achieve our goal

4. current solution
    4.1 use hbase(maybe together with phoenix) to store urls(we now
have not done the schema design, hoping get some advice here)
    4.2 scheduler algorithm
          int batchSize=10000;
          //dispatch batchSize tasks to different hosts by host priorities;
          Map<String,Integer> hostCount=...
          //select top priority urls from each host
          List<String> toBeCrawledUrls=new ArrayList<String>(batchSize);
          for(Entry<String,Integer> entry:hostCount.entrySet()){
               //select top priority N urls from a given host
entry.getKey(), entry.getValue());
          //dispatch this urls to message queue
          //monitor the message queue status
          //if the queue is all(or 3/4) consumed, goto top  and
dispatch another batch urls

5. table colums
5.1 url  varchar  e.g.
5.2 status tinyint 0: not_crawled  1: crawling   2: success  3: fail
5.3 host  varchar
5.4 depth tinyint
5.5 crawl_time date(what's the difference of date, time and timestamp?)
5.6 add_time date
5.7 priority int

6. table schema 1
    primary key: url
    index (host,status) including(pirority, add_time, url)
6.1 host
      select distinct host from url_db;

6.2 dedup
      select url from url_db where url=''
      btw, how to use upsert to avoid update previous version?
something like mysql statement:
      insert into on duplicate key ignore...
6.3  select
       select url from url_db where (host,
status)=('',0) order by priority,add_time limit 100;
6.4  update status
       update url_db set status=1 where url=''
       performance analysis:
           1. select host can use rowkey scan?
           2. dedup by url very quick because it's rowkey
           3. select is quick because secondary index
           4. update will not remove a row(compare with schema2) but
Put new values in main table
           5. but I guess update will cause row deletion and insertion

7. table schema 2
         primary key: url
         index(host) column_family(status) including(priority, add_time, url)
         is this valid in phoenix? I mean create index on host and
create column_family by status?

8. table schema 3
         primary key: (host,status,priority,add_time)
     all sql statements are the same as above.

       performance analysis:
           1. select host can use rowkey scan?
           2. select fast because of rowkey
           3. dedup by secondary index
           4. update will cause main table deletion and insertion
           5. I guess update will also cause index table update

9. schema 4
      using an dedicated host table to store only host, host count is
far less than url
      but using another table will client code to keep consistency.
      and also may be in future we need some query like:
      select host,count(host) as cc from url_db group by host order by cc desc;

hope you give me some advices, thanks.

View raw message