phoenix-user mailing list archives

From Li Li <fancye...@gmail.com>
Subject Re: how to design my table?
Date Thu, 16 Jan 2014 08:54:43 GMT
I have read this. But my question is slightly different from the previous
one. I have four table operations:
1. check whether a url already exists: if it exists, do nothing,
otherwise insert it. I can implement this in HBase with a get followed
by a checkAndPut.
2. select distinct hosts (maybe in the future group them and count them)
3. select urls where host=? and status=? order by priority,add_time limit ?
4. update table set status=1 where url=?

If I follow your advice, I should create the table and index like this:

CREATE TABLE url_db (
    status TINYINT,
    priority INTEGER NOT NULL,
    add_time DATE,
    host VARCHAR NOT NULL,
    url VARCHAR NOT NULL
    CONSTRAINT pk PRIMARY KEY (host, status, priority, add_time, url));

CREATE INDEX my_index ON url_db(url);
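
For reference, here is roughly how I expect the first three operations to
look as Phoenix statements against this schema (the literal values are just
examples and I have not tested these); operation 4 is the part I am unsure
about, see my question below:

-- 1. dedup check through the secondary index, then insert if absent
SELECT url FROM url_db WHERE url = 'http://www.google.com/index.html';
UPSERT INTO url_db(host, status, priority, add_time, url)
VALUES('www.google.com', 0, 100, TO_DATE('2014-01-16 00:00:00'),
       'http://www.google.com/index.html');

-- 2. distinct hosts (host is the leading rowkey column)
SELECT DISTINCT host FROM url_db;

-- 3. top-N urls for one host and status, ordered by priority and add_time
SELECT url FROM url_db
WHERE host = 'www.google.com' AND status = 0
ORDER BY priority, add_time LIMIT 100;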

performance analysis:
1. checking whether a url exists uses the secondary index, so it's fast
2. select distinct host is fast because it's a rowkey range scan?
3. selecting urls by host and status, ordered by priority and
add_time, is fast because of the carefully designed rowkey range scan
4. updating the status goes through the secondary index
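
I assume I can verify these guesses with EXPLAIN, which should show whether
a query is planned as a range scan over the table rowkey or over the index,
for example:

EXPLAIN SELECT url FROM url_db
WHERE host = 'www.google.com' AND status = 0
ORDER BY priority, add_time LIMIT 100;

EXPLAIN SELECT url FROM url_db
WHERE url = 'http://www.google.com/index.html';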

My question is: what will Phoenix actually do?
My guess is that updating the status will first look up the rowkey in the
secondary index table to find the original rowkey in url_db, then do a
Delete and a Put on url_db, and update the secondary index with another Put.
Am I right?
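
Spelled out as statements, the sequence I have in mind would look roughly
like this (just a sketch, with made-up literal values standing in for the
old row's columns, not tested):

-- read the old row through the url index to recover the full rowkey
SELECT host, status, priority, add_time FROM url_db
WHERE url = 'http://www.google.com/index.html';

-- write the row back with the new status (a new rowkey, since status is
-- part of the primary key)
UPSERT INTO url_db(host, status, priority, add_time, url)
VALUES('www.google.com', 1, 100, TO_DATE('2014-01-16 00:00:00'),
       'http://www.google.com/index.html');

-- remove the old row; I assume Phoenix keeps my_index in sync for both
-- the upsert and the delete
DELETE FROM url_db
WHERE host = 'www.google.com' AND status = 0 AND priority = 100
  AND add_time = TO_DATE('2014-01-16 00:00:00')
  AND url = 'http://www.google.com/index.html';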


On Thu, Jan 16, 2014 at 4:19 PM, James Taylor <jamestaylor@apache.org> wrote:
> Hi,
> That's some really good information about your use case.  Here's an idea I
> had for your similar question on the HBase mailing list:
> http://mail-archives.apache.org/mod_mbox/hbase-user/201401.mbox/%3CCAG_TOPDpp%2BTWm0wZ3es6dPFwtn5x1grzrrXA-YsSHdXzRnVf%2BQ%40mail.gmail.com%3E
>
> Thanks,
> James
>
>
> On Wed, Jan 15, 2014 at 3:37 AM, Li Li <fancyerii@gmail.com> wrote:
>>
>> 1. goal
>>     I want to implement a distributed vertical (topical) crawler. It
>> will only store webpages on a certain topic; I will have a classifier
>> to do this.
>>     I estimate the number of webpages that need to be stored is about
>> tens of millions (maybe hundreds of millions as time goes on).
>>     A vertical crawler should crawl the pages most likely related to
>> my target topics, so I need a frontier that can dispatch tasks by
>> priority.
>>     For now the priority is simple, but we hope it can deal with more
>> complicated priority algorithms.
>>     1.1 host priority
>>           we should crawl many hosts at the same time rather than only
>> a single host. Initially, each host should be crawled equally, but over
>> time we can calculate host priorities dynamically.
>>           e.g. we can control the crawl speed for a certain host based
>> on its crawl history (some sites will ban our crawler if we use too
>> many concurrent threads against them), or we can adjust the priority
>> of a host by whether it is relevant to our topic (we can calculate the
>> relevance of its crawled pages).
>>     1.2 enqueue time
>>           webpages enqueued earlier should get higher priority
>>     1.3 depth
>>           webpages with a small depth will get higher priority
>> (something like a BFS traversal)
>>     1.4 other page priorities
>>           e.g. PageRank, list page vs. detail page ...
>>
>> 2. architecture
>>    see picture:   http://www.flickr.com/photos/114261973@N07/
>>     2.1 Seed Discover
>>           use google or other websites to find some seed urls
>>     2.2 Url DB
>>           a distributed DB to store all metadata about urls (this is
>> the most HBase-related part)
>>     2.3 Task Scheduler
>>          as described before, the task scheduler selects the top N
>> priority webpages and dispatches them to the fetcher clusters
>>     2.4 Message Queues
>>          we use ActiveMQ to decouple the different modules and also
>> for load balancing
>>     2.5 Fetchers
>>           download webpages
>>     2.6 WebPageDB
>>           stores crawled webpages and the extracted metadata (such as
>> title, content, pub_time, author, etc.) of each webpage. we are
>> considering using HBase for this too.
>>     2.7 Extractors
>>           use the classifier to judge whether a page is related to our
>> topics, extract metadata from it, and store them to WebPageDB
>>
>>
>> 3. main challenges
>>     3.1 Url DB
>>        as described before, this store (maybe HBase) should support
>> sophisticated priority algorithms, and we also use it to avoid
>> crawling a webpage more than once.
>>     3.2 task scheduler
>>        how to achieve our goal
>>
>> 4. current solution
>>     4.1 use HBase (maybe together with Phoenix) to store urls (we have
>> not done the schema design yet and hope to get some advice here)
>>     4.2 scheduler algorithm
>>           int batchSize=10000;
>>           //dispatch batchSize tasks to different hosts by host priority
>>           Map<String,Integer> hostCount=...
>>           //select the top priority urls from each host
>>           List<String> toBeCrawledUrls=new ArrayList<String>(batchSize);
>>           for(Entry<String,Integer> entry:hostCount.entrySet()){
>>                //select the top N priority urls from a given host
>>                List<String> urls=selectTopNUrlsFromHost(
>>                        entry.getKey(), entry.getValue());
>>                toBeCrawledUrls.addAll(urls);
>>           }
>>           //dispatch these urls to the message queue
>>           //monitor the message queue status
>>           //if the queue is fully (or 3/4) consumed, go back to the top
>>           //and dispatch another batch of urls
>>
>> 5. table columns
>> 5.1 url  varchar  e.g. http://www.google.com/index.html
>> 5.2 status tinyint 0: not_crawled  1: crawling   2: success  3: fail
>> 5.3 host  varchar  www.google.com
>> 5.4 depth tinyint
>> 5.5 crawl_time date (what's the difference between date, time and timestamp?)
>> 5.6 add_time date
>> 5.7 priority int
>>
>> 6. table schema 1
>>     primary key: url
>>     index (host,status) including(priority, add_time, url)
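>>
>>     roughly, I imagine the DDL for this would be something like the
>> following (not tested, and the index name is just a placeholder):
>>
>>       CREATE TABLE url_db (
>>           url VARCHAR NOT NULL PRIMARY KEY,
>>           status TINYINT,
>>           host VARCHAR,
>>           depth TINYINT,
>>           crawl_time DATE,
>>           add_time DATE,
>>           priority INTEGER);
>>
>>       -- url is the rowkey, so I assume the index covers it automatically
>>       CREATE INDEX host_status_idx ON url_db(host, status)
>>           INCLUDE (priority, add_time);
>>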
>> 6.1 host
>>       select distinct host from url_db;
>>
>> 6.2 dedup
>>       select url from url_db where url='http://www.google.com/index.html'
>>       btw, how can I use upsert without overwriting the previous
>> version? something like the mysql statement:
>>       insert into on duplicate key ignore...
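>>
>>       for now I suppose a (non-atomic) workaround is to select first
>> and only upsert when nothing comes back (the values below are made up):
>>
>>       SELECT 1 FROM url_db WHERE url='http://www.google.com/index.html';
>>       -- only if the select above returns no row:
>>       UPSERT INTO url_db(url, status, host, depth, add_time, priority)
>>       VALUES('http://www.google.com/index.html', 0, 'www.google.com', 1,
>>              TO_DATE('2014-01-15 00:00:00'), 100);
>>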
>> 6.3  select
>>        select url from url_db where (host, status)=('www.google.com', 0)
>>        order by priority, add_time limit 100;
>> 6.4  update status
>>        update url_db set status=1 where url='http://www.google.com/index.html'
>>        performance analysis:
>>            1. can select distinct host use a rowkey scan?
>>            2. dedup by url is very quick because url is the rowkey
>>            3. the select is quick because of the secondary index
>>            4. the update will not remove a row (compare with schema 2) but
>>               Put new values into the main table
>>            5. but I guess the update will still cause a row deletion and
>>               insertion in the index table
>>
>> 7. table schema 2
>>          primary key: url
>>          index(host) column_family(status) including(priority, add_time, url)
>>          is this valid in phoenix? I mean, create an index on host and
>> create a column family based on status?
>>
>> 8. table schema 3
>>          primary key: (host, status, priority, add_time)
>>          index(url)
>>      all sql statements are the same as above.
>>
>>        performance analysis:
>>            1. can select distinct host use a rowkey scan?
>>            2. the select is fast because of the rowkey
>>            3. dedup goes through the secondary index
>>            4. the update will cause a deletion and insertion in the main table
>>            5. I guess the update will also cause an index table update
>>
>> 9. schema 4
>>       use a dedicated host table that stores only hosts; the number of
>> hosts is far smaller than the number of urls.
>>       but using another table will require client code to keep the two
>> tables consistent.
>>       and also maybe in the future we will need some query like:
>>       select host, count(host) as cc from url_db group by host
>>       order by cc desc;
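>>
>>       the dedicated host table might look roughly like this (names and
>> values are just placeholders; client code would have to upsert into it
>> whenever a new url is inserted into url_db):
>>
>>       CREATE TABLE host_db (
>>           host VARCHAR NOT NULL PRIMARY KEY,
>>           url_count BIGINT,
>>           priority INTEGER);
>>
>>       UPSERT INTO host_db(host, url_count, priority)
>>       VALUES('www.google.com', 1, 100);
>>       SELECT host FROM host_db ORDER BY priority DESC;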
>>
>> Hope you can give me some advice, thanks.
>
>
