phoenix-user mailing list archives

From Li Li <fancye...@gmail.com>
Subject Re: how to design my table?
Date Thu, 16 Jan 2014 08:56:16 GMT
btw, how can I implement checkAndPut semantics in Phoenix with SQL?
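
For reference, the raw HBase-level call I mean (insert the url row only when
it does not exist yet) is roughly the sketch below; the column family "f" and
the qualifier "status" are just placeholders for whatever the real schema uses:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // sketch: returns true only when the url row was newly inserted
    boolean insertIfAbsent(HTableInterface table, String url) throws IOException {
        byte[] row = Bytes.toBytes(url);
        Put put = new Put(row);
        put.add(Bytes.toBytes("f"), Bytes.toBytes("status"), new byte[] { 0 });
        // a null expected value makes checkAndPut apply the Put only if the
        // checked column does not exist yet
        return table.checkAndPut(row, Bytes.toBytes("f"), Bytes.toBytes("status"), null, put);
    }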

On Thu, Jan 16, 2014 at 4:54 PM, Li Li <fancyerii@gmail.com> wrote:
> I have read this. But my question is slightly different from the previous
> one. I have four table operations:
> 1. check whether a url already exists; if it exists, do nothing,
> else insert it. In HBase I can implement this with a get followed by a
> checkAndPut.
> 2. select distinct hosts (maybe in the future, group them and count them)
> 3. select urls where host=? and status=? order by priority,add_time limit ?
> 4. update table set status=1 where url=?
>
> If I follow your advice, I should create the table and index like this:
>
> CREATE TABLE url_db (
>     status TINYINT,
>     priority INTEGER NOT NULL,
>     added_time DATE,
>     host VARCHAR NOT NULL,
>     url VARCHAR NOT NULL,
>     CONSTRAINT pk PRIMARY KEY (host, status, priority, added_time, url));
>
> create index my_index on url_db(url);
>
> performance analysis:
> 1. checking whether a url exists uses the secondary index, so it's fast
> 2. selecting the distinct hosts is fast via a rowkey range scan?
> 3. selecting urls by host and status, ordered by priority and
> added_time, is fast thanks to the carefully designed rowkey
> 4. updating status goes through the secondary index
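>
> One way to double-check these guesses would be to run EXPLAIN on the two
> queries once the table and index exist, e.g.:
>
>     EXPLAIN SELECT url FROM url_db WHERE url = 'http://www.google.com/index.html';
>
>     EXPLAIN SELECT url FROM url_db
>     WHERE host = 'www.google.com' AND status = 0
>     ORDER BY priority, added_time LIMIT 100;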
>
> my question is: what will Phoenix actually do?
> My guess is that updating status will first look up the rowkey in the
> secondary index table to find the original rowkey in url_db, then do a
> Delete and a Put on url_db, and finally update the secondary index with
> another Put. Am I right?
>
>
> On Thu, Jan 16, 2014 at 4:19 PM, James Taylor <jamestaylor@apache.org> wrote:
>> Hi,
>> That's some really good information about your use case.  Here's an idea I
>> had for your similar question on the HBase mailing list:
>> http://mail-archives.apache.org/mod_mbox/hbase-user/201401.mbox/%3CCAG_TOPDpp%2BTWm0wZ3es6dPFwtn5x1grzrrXA-YsSHdXzRnVf%2BQ%40mail.gmail.com%3E
>>
>> Thanks,
>> James
>>
>>
>> On Wed, Jan 15, 2014 at 3:37 AM, Li Li <fancyerii@gmail.com> wrote:
>>>
>>> 1. goal
>>>     I want to implement a distributed vertical (topical) crawler. It
>>> will only store webpages about a certain topic; I will have a classifier
>>> to decide this.
>>>     I estimate the number of webpages that need to be stored is about
>>> tens of millions (maybe hundreds of millions as time goes on).
>>>     A vertical crawler should crawl the pages most likely to be
>>> related to my target topics, so I need a frontier that can dispatch
>>> tasks by priority.
>>>     For now the priority is simple, but we hope the design can handle
>>> more complicated priority algorithms later.
>>>     1.1 host priority
>>>           we should crawl many hosts at the same time rather than a
>>> single host. Initially each host should be crawled equally, but over
>>> time we can calculate the priority of a host dynamically.
>>>           e.g. we can throttle a certain host based on its crawl
>>> history (some sites will ban our crawler if we hit them with too many
>>> concurrent threads), or we can adjust the priority of a host by
>>> whether it is relevant to our topic (we can calculate the relevance of
>>> the crawled pages).
>>>     1.2 enqueue time
>>>           webpages enqueued earlier should get higher priority
>>>     1.3 depth
>>>           webpages with a small depth will get higher priority (something
>>> like a BFS traversal)
>>>     1.4 other page priorities
>>>           e.g. page rank, list page/detail page ...
>>>
>>> 2. architecture
>>>    see picture:   http://www.flickr.com/photos/114261973@N07/
>>>     2.1 Seed Discover
>>>           use Google or other websites to find some seed urls
>>>     2.2 Url DB
>>>           a distributed DB to store all metadata about urls (this is the
>>> most hbase-related part)
>>>     2.3 Task Scheduler
>>>          as described before, the task scheduler selects the top N priority
>>> webpages and dispatches them to the fetcher cluster
>>>     2.4 Message Queues
>>>          we use ActiveMQ to decouple the different modules and also for
>>> load balancing
>>>     2.5 Fetchers
>>>           Download webpages
>>>     2.6 WebPageDB
>>>           stores crawled webpages and the metadata extracted from them
>>> (such as title, content, pub_time, author, etc.). we are considering
>>> using hbase for this too.
>>>     2.7 Extractors
>>>           use the classifier to judge whether a page is related to our
>>> topics, extract metadata from it, and store the results in WebPageDB
>>>
>>>
>>> 3. main challenges
>>>     3.1 Url DB
>>>        as described before, this store (maybe hbase) should support
>>> sophisticated priority algorithms. we also use it to avoid
>>> crawling a webpage more than once.
>>>     3.2 task scheduler
>>>        how to achieve our goal
>>>
>>> 4. current solution
>>>     4.1 use hbase (maybe together with phoenix) to store urls (we have
>>> not done the schema design yet and hope to get some advice here)
>>>     4.2 scheduler algorithm
>>>           int batchSize = 10000;
>>>           // split batchSize tasks across hosts according to the host priorities
>>>           Map<String,Integer> hostCount = ...
>>>           // select the top-priority urls from each host
>>>           List<String> toBeCrawledUrls = new ArrayList<String>(batchSize);
>>>           for (Entry<String,Integer> entry : hostCount.entrySet()) {
>>>                // select the top N priority urls from a given host
>>>                // (see the JDBC sketch below)
>>>                List<String> urls = selectTopNUrlsFromHost(
>>>                        entry.getKey(), entry.getValue());
>>>                toBeCrawledUrls.addAll(urls);
>>>           }
>>>           // dispatch these urls to the message queue
>>>           // monitor the message queue status
>>>           // if the queue is fully (or 3/4) consumed, go back to the top
>>>           // and dispatch another batch of urls
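>>>
>>>           The selectTopNUrlsFromHost helper above would presumably be a
>>> plain Phoenix JDBC query. A rough sketch (the zookeeper quorum string is a
>>> placeholder and the java.sql imports are omitted):
>>>
>>>           List<String> selectTopNUrlsFromHost(String host, int n) throws SQLException {
>>>               // the Phoenix JDBC url is "jdbc:phoenix:<zookeeper quorum>"
>>>               Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
>>>               PreparedStatement ps = conn.prepareStatement(
>>>                   "SELECT url FROM url_db WHERE host = ? AND status = 0 " +
>>>                   "ORDER BY priority, add_time LIMIT " + n);
>>>               ps.setString(1, host);
>>>               ResultSet rs = ps.executeQuery();
>>>               List<String> urls = new ArrayList<String>(n);
>>>               while (rs.next()) {
>>>                   urls.add(rs.getString(1));
>>>               }
>>>               conn.close();
>>>               return urls;
>>>           }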
>>>
>>> 5. table columns
>>> 5.1 url  varchar  e.g. http://www.google.com/index.html
>>> 5.2 status tinyint 0: not_crawled  1: crawling   2: success  3: fail
>>> 5.3 host  varchar  www.google.com
>>> 5.4 depth tinyint
>>> 5.5 crawl_time date (what's the difference between date, time and timestamp?)
>>> 5.6 add_time date
>>> 5.7 priority int
>>>
>>> 6. table schema 1
>>>     primary key: url
>>>     index (host,status) including(priority, add_time, url)
>>> 6.1 host
>>>       select distinct host from url_db;
>>>
>>> 6.2 dedup
>>>       select url from url_db where url='http://www.google.com/index.html'
>>>       btw, how can I use upsert without overwriting the previous
>>> version? something like the mysql statement:
>>>       insert into on duplicate key ignore...
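>>>       Until something like that exists, I assume the check has to be done
>>> on the client side, which is not atomic: first
>>>       select url from url_db where url='http://www.google.com/index.html';
>>>       and only when that returns no row, do the upsert (the non-key column
>>> values below are just example values):
>>>       upsert into url_db (url, status, host, depth, add_time, priority)
>>>       values ('http://www.google.com/index.html', 0, 'www.google.com', 0,
>>>               current_date(), 0);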
>>> 6.3  select
>>>        select url from url_db where (host,
>>> status)=('www.google.com',0) order by priority,add_time limit 100;
>>> 6.4  update status
>>>        update url_db set status=1 where
>>> url='http://www.google.com/index.html'
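>>>        As far as I can tell Phoenix has no UPDATE statement, so I assume
>>> this would really be written as an upsert of the key plus the changed
>>> column (which only writes the listed columns), e.g.:
>>>        upsert into url_db (url, status)
>>>        values ('http://www.google.com/index.html', 1);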
>>>        performance analysis:
>>>            1. selecting the distinct hosts can use a rowkey scan?
>>>            2. dedup by url is very quick because url is the rowkey
>>>            3. the select is quick because of the secondary index
>>>            4. an update will not remove a row from the main table (compare
>>> with schema 2) but just Put the new values
>>>            5. but I guess the update will still cause a row deletion and
>>> insertion in the index table, since status is part of the index key
>>>
>>> 7. table schema 2
>>>          primary key: url
>>>          index(host) column_family(status) including(priority, add_time,
>>> url)
>>>          is this valid in phoenix? I mean, can I create an index on host
>>> and put status in its own column_family?
>>>
>>> 8. table schema 3
>>>          primary key: (host,status,priority,add_time)
>>>          index(url)
>>>      all sql statements are the same as above.
>>>
>>>        performance analysis:
>>>            1. selecting the distinct hosts can use a rowkey scan?
>>>            2. the select is fast because of the rowkey
>>>            3. dedup goes through the secondary index
>>>            4. an update will cause a deletion and insertion in the main
>>> table (sketched below)
>>>            5. I guess the update will also cause an index table update
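>>>            To make 4. and 5. concrete: with this rowkey I guess "setting
>>> status=1" actually means writing the row under its new key and then
>>> deleting the old row from the client, roughly like this (the ?s are the
>>> old row's priority and add_time values, read back first):
>>>            upsert into url_db (host, status, priority, add_time, url)
>>>            values ('www.google.com', 1, ?, ?, 'http://www.google.com/index.html');
>>>            delete from url_db where host='www.google.com' and status=0
>>>            and priority=? and add_time=? and url='http://www.google.com/index.html';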
>>>
>>> 9. schema 4
>>>       use a dedicated host table to store only the hosts; the number of
>>> hosts is far smaller than the number of urls
>>>       but using another table requires client code to keep the two tables
>>> consistent.
>>>       also, maybe in the future we will need a query like:
>>>       select host,count(host) as cc from url_db group by host order by cc
>>> desc;
>>>
>>> hope you can give me some advice, thanks.
>>
>>
