incubator-general mailing list archives

From "fp" ...@lucene.cn>
Subject Re: [Proposal] lxdb - proposal for Apache Incubation
Date Sat, 27 Feb 2021 13:55:37 GMT
Hi 吴晟,
Thank you for your reply. My answers to your questions are below. (My English is not very good, so please bear with me.)


1. Since you are proposing a new project to a global foundation, you should at
least keep your documentation in English.
> Of course. If Apache accepts this project, I will complete all the documents and translate
them into English. Although my English is not very good, many of my colleagues have returned
from Australia, so this should not be a problem.
2. Your provided links are in Chinese, which most IPMC people cannot read.
> Besides the source code, what other documents are needed? Would you like me to
provide a basic usage guide or an introduction to the project first?
3. And since this project is closed source, please provide the dependencies.
> The version to be open-sourced is a 100% rewrite. It relies on Hadoop, HBase, Spark, and
ZooKeeper, and does not rely on any code from my previous company.
4. And as you repeatedly mentioned the original projects: is this project created 100% on your own,
or does it include something from Alibaba/Tencent?
> The current version of lxdb is 100% my own work. It does not include anything
from Alibaba/Tencent.
> The previous version of lxdb relied on Alibaba's mdrill. I am the author of the mdrill
project, and mdrill is an open source project.
> Hermes was my work at Tencent, but after I started my business I did not
use the Hermes source code, and I informed Tencent before I started my business.
5. As it is not open source, I can't verify.
> If you are interested, I can provide the source code to PMC members separately for
auditing.
6. Because this is closed source, we also need you to be clear about whether you
are going to submit an SGA and open the source to the public.
> I have not open-sourced the project yet, mainly because I want to see whether the PMC is
interested in it. If so, I will open-source it; that way I can persuade my investors. If the PMC
is not interested, I may consider open-sourcing it later. At present the project has about 100,000
lines of code, which can be provided to the PMC for review.
7. Most importantly, `lucene` is an Apache trademark and an Apache project, so I have
concerns about a branding violation.
> I just like Lucene. If the name offends the PMC, I can change it to a suitable name.
8. Lastly, we (the Incubator) typically expect you to have open-sourced the project and to
have at least a small community and first adoption outside your company.
> Our company is a commercial company, so the community around our previous projects may differ
from what you describe. We run a QQ discussion group with about 1,000 people. Many
members have been our users for many years, and they are looking forward to the development
of our project.
9. To join the Incubator, you also need at least 3 IPMC members and 1 Champion (an Apache member
or officer) to help you understand the Incubator.
> Can you help me? I really do have language difficulties, and I have had little communication in
this area, although I have given many talks in China before. I hope you can help me if possible.
If you like this project, you can also join us. It is a very good opportunity in China's database market.
My phone number is 17099831107.

yannian mu 母延年
luxin,muyannian




------------------ Original Message ------------------
From: "general" <wu.sheng.841108@gmail.com>;
Sent: Saturday, February 27, 2021, 9:06 PM
To: "Incubator" <general@incubator.apache.org>;

Subject: Re: [Proposal] lxdb - proposal for Apache Incubation



Hi

Since you are proposing a new project to a global foundation, you should at
least keep your documentation in English. Your provided links are in Chinese,
which most IPMC people cannot read.
And since this project is closed source, please provide the dependencies.
And as you repeatedly mentioned the original projects: is this project created
100% on your own, or does it include something from Alibaba/Tencent? As it
is not open source, I can't verify.
Because this is closed source, we also need you to be clear about whether you
are going to submit an SGA and open the source to the public.

Most importantly, `lucene` is an Apache trademark and an Apache project,
so I have concerns about a branding violation.

Lastly, we (the Incubator) typically expect you to have open-sourced the
project and to have at least a small community and first adoption outside your
company.

To join the Incubator, you also need at least 3 IPMC members and 1
Champion (an Apache member or officer) to help you understand the Incubator.

Sheng Wu 吴晟
Twitter, wusheng1108


fp <fp@lucene.cn> wrote on Saturday, February 27, 2021 at 6:40 PM:

> Dear Apache Incubator Community,
>
> Please accept the following proposal for presentation and discussion:
> https://github.com/lucene-cn/lxdb/wiki
>
> LXDB is a high-performance OLAP and full-text search database. It is based on
> HBase, but replaces HFile with a Lucene index to support more efficient
> secondary indexes. It is also based on Spark SQL, so you can use a SQL API
> to access data and run OLAP computations. The Lucene index is stored on
> HDFS (not on local disk).
>
> In our production systems, LXDB supports 200+ clusters, some of which have
> 1000+ nodes in a single cluster and insert 200 billion rows per day (20,000
> billion rows in total). One of the biggest single tables has 200 million
> Lucene indexes on LXDB.
>
> Doug Cutting, the father of Hadoop, split Nutch into HBase, MapReduce (Hive), HDFS,
> and Lucene. We have merged these separate projects again: LXDB equals Spark
> SQL + HBase + Lucene + Parquet + HDFS. It is a super database. It took me 10 years to
> complete this merging. The purpose, however, is no longer a search
> engine but a database.
>
> Best regards
>   yannian mu
>
> LXDB Proposal
> == Abstract ==
> LXDB is a high-performance OLAP and full-text search database.
>
> === It is based on HBase, but replaces HFile with a Lucene index to support more efficient secondary indexes ===
> We modified the HBase region server and replaced HFile with Lucene: on a put,
> we write a document to Lucene instead of writing data to an HFile.
> The Lucene index is stored on the region server (it is not stored in a separate
> cluster as with Elasticsearch + HBase, which requires two copies of the data).
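The put path described above can be pictured with a toy model. The following is a minimal Python sketch, not LXDB or HBase code: `ToyRegion`, `put`, and `search` are hypothetical names, and a plain dictionary of postings stands in for the Lucene index. It only illustrates the idea that a write updates an index held by the region itself, so no second copy of the data lives in a separate search cluster.

```python
from collections import defaultdict


class ToyRegion:
    """Hypothetical stand-in for one region's store; not LXDB code."""

    def __init__(self):
        self.rows = {}                    # row key -> latest document
        self.postings = defaultdict(set)  # (field, term) -> row keys

    def _terms(self, doc):
        # Naive whitespace tokenizer, assumed only for this sketch.
        for field, value in doc.items():
            for term in str(value).lower().split():
                yield field, term

    def put(self, row_key, doc):
        # A key-value style update first drops the old postings for the row.
        old = self.rows.get(row_key)
        if old is not None:
            for posting in self._terms(old):
                self.postings[posting].discard(row_key)
        self.rows[row_key] = doc
        for posting in self._terms(doc):
            self.postings[posting].add(row_key)

    def search(self, field, term):
        return sorted(self.postings.get((field, term.lower()), set()))


region = ToyRegion()
region.put("r1", {"title": "lucene on hdfs"})
region.put("r2", {"title": "hbase region server"})
region.put("r1", {"title": "lucene on hbase"})  # overwrite by row key
```

Here `region.search("title", "hbase")` finds both rows, while the overwritten term `hdfs` no longer matches anything.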
>
> === It is based on Spark SQL for OLAP ===
> We integrated Spark and HBase together. Usage looks like this:
> 1. Unpack lxdb.tar.gz.
> 2. Configure the hadoop_config path.
> 3. Run start-all.sh to start the cluster.
> lxdb starts Spark through Hadoop YARN, and each Spark executor
> process then starts an embedded HBase region server service.
>
> You can operate the lxdb database through the Spark SQL API (Hive) or the MySQL API.
> 1. The SQL layer uses Spark RDDs plus the HBase scanner to access HBase.
> 2. The SQL conditions (filters or GROUP BY aggregations) are pushed down to HBase.
> 3. HBase uses the Lucene index to filter data in the region server.
> Spark, HBase, and Lucene are all embedded and integrated together rather than
> run as separate clusters. That is the difference from a Solr/ES +
> HBase + Spark solution.
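The three query steps above amount to pushing a filter and a partial aggregation down to each region. Below is a rough Python sketch of that flow under stated assumptions: `build_index`, `region_query`, and `driver_query` are hypothetical helpers, plain dictionaries stand in for regions and for the Lucene index, and nothing here uses the real Spark or HBase APIs. It mimics `SELECT city, COUNT(*) FROM t WHERE level = 'error' GROUP BY city`.

```python
from collections import Counter, defaultdict


def build_index(rows, field):
    """Map each value of `field` to the row keys that hold it."""
    index = defaultdict(list)
    for key, row in rows.items():
        index[row[field]].append(key)
    return index


def region_query(rows, index, value, group_field):
    """Runs inside one region: filter through the index, then pre-aggregate."""
    matched = (rows[key] for key in index.get(value, []))
    return Counter(row[group_field] for row in matched)


def driver_query(regions, filter_field, value, group_field):
    """Runs on the SQL side: push the predicate down, merge partial counts."""
    total = Counter()
    for rows in regions:
        # The index is rebuilt per query only to keep the sketch short;
        # a real system would maintain it at write time.
        index = build_index(rows, filter_field)
        total += region_query(rows, index, value, group_field)
    return dict(total)


regions = [
    {"a1": {"city": "beijing", "level": "error"},
     "a2": {"city": "shanghai", "level": "info"}},
    {"b1": {"city": "beijing", "level": "error"},
     "b2": {"city": "hangzhou", "level": "error"}},
]
result = driver_query(regions, "level", "error", "city")
```

Only the matching rows are touched in each region, and only small per-region counts travel back to the driver.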
>
> == Background ==
> === Multiple copies of data ===
> Apache HBase + Elasticsearch is the most popular solution for full-text
> search, but it is weak at online analytical processing (OLAP).
> As a result, production systems usually run Spark (or Hive, Impala, or
> Presto), HBase, and Solr/ES at the same time. Multiple copies of the data are stored
> in multiple systems, each system has a different API, and data consistency is
> difficult to guarantee. For these reasons we merged the roles of Spark, HBase, and Elasticsearch
> into one project. Its goal is to use one copy of the data, one cluster, and one API
> to cover OLAP, KV, full-text search, and other database scenarios.
>
> === Merging and splitting of Lucene indexes (HStore) across different machines on HDFS ===
> As we all know, Solr/ES store index files in the local file system, so the number of shards must
> be fixed. If we store the index on HDFS, the index becomes splittable like
> an HBase HStore: it can be split or merged across machine nodes. This is very
> useful for a distributed database, because it determines how many resources are allocated to a
> table. The number of records in a table usually changes over time, so
> the number of shards constantly needs adjusting. An index stored locally cannot be split
> across different machines, but a Lucene index stored on HDFS can.
> Whether the number of shards can be flexibly adjusted, that is, whether the system
> scales elastically, is particularly important in a distributed database.
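As a rough illustration of the elastic sharding argument above, the Python sketch below splits an oversized shard at its median row key and then places the halves on different nodes. It is a simplified model, not LXDB's algorithm: `split_shard` and `assign` are hypothetical names, a shard is just a `(start_key, end_key, sorted_row_keys)` tuple with distinct keys, and round-robin placement is an assumption. Reassignment is cheap in this picture only because the index files are assumed to live on shared storage such as HDFS.

```python
def split_shard(shard, max_rows):
    """Split a (start, end, sorted keys) shard at its median key if too large."""
    start, end, keys = shard
    if len(keys) <= max_rows:
        return [shard]
    mid = keys[len(keys) // 2]
    left = (start, mid, [k for k in keys if k < mid])
    right = (mid, end, [k for k in keys if k >= mid])
    return [left, right]


def assign(shards, nodes):
    """Round-robin shards over nodes, keyed by each shard's key range."""
    return {(s[0], s[1]): nodes[i % len(nodes)] for i, s in enumerate(shards)}


shard = ("", "~", ["k01", "k02", "k03", "k04", "k05", "k06"])
parts = split_shard(shard, max_rows=4)
placement = assign(parts, ["node-a", "node-b"])
```

With a fixed shard count and local files, the same growth would instead require reindexing or copying data between machines.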
>
> === Solving the shortcomings of secondary indexes ===
> Some people use HBase secondary indexes, as in the Phoenix project. But those
> schemes are based on the HBase row key and carry a lot of redundancy. They cannot create
> too many indexes because the data inflation rate is too high, so using a Lucene index
> instead of secondary indexes is the best choice.
>
> === We add a Lucene index for Spark OLAP ===
> Most OLAP systems, such as Hive, Spark SQL, Impala, and some MPP databases,
> suffer from brute-force scanning and poor data freshness.
> 1. They use brute-force scans to compute results. Another option is to add
> an index to the big data. In some cases an index can greatly improve on the
> performance of the original brute-force scan. I think that, just
> as in a traditional database, indexing can greatly improve
> database performance.
> 2. Most of those databases or systems are
> offline or batch systems. lxdb's target is real-time append and real-time
> KV update, just like HBase.
>
> == Future ==
> === Lucene on Parquet ===
> In the near future I will change Lucene's tim and tip (inverted index), dvd, and dvm files
> to a format like Parquet or ORC. The goals are:
> * to solve the performance problem of traversing a Lucene index;
> * to avoid loading files such as tip into memory when opening a Lucene file,
>   which makes opening a Lucene index slow;
> * to let Lucene store multi-column joint indexes by column, for logic such as
>   multi-table joins, materialized views, and multi-field GROUP BY through the inverted index.
> The current Lucene index has many problems because of
> too many file pointers and its single-column design. We want to modify Lucene
> to make it more suitable for HDFS and good not only at full-text retrieval but
> also at statistical analysis, so that it is a real database-level
> index. We also want Lucene to be splittable, so that storage can be separated from
> computation.
>
> === Supporting all kinds of predicate pushdown computation ===
> We find that if we combine the computation method closely with the data,
> we can get more performance out of the database. An index is
> only one form of computation pushdown. For example, with storage pushdown we can
> store the index on SSD devices and the data on SATA devices. We
> can store data that is often grouped together in advance instead of
> computing row by row. We can give important tables or columns
> dedicated devices and resources. HBase still lacks these capabilities, which
> we need to improve further.
>
> === Intervening in data distribution ===
> We can use the row key to direct data to different nodes, which enables
> many interesting things.
>
> === Resource control, resource isolation ===
> Lucene currently does not support resource isolation, but on HDFS we
> can do it. I can control the priority of SQL statements so that Lucene work with higher
> priority gets faster IO resources.
>
> == Status ==
> In 2011 I released the first open source version at Alibaba. At
> that time, mdrill used 10 nodes with 48 GB machines to support 400 billion records.
> The first index on HDFS dates from this version; it was one year ahead of the
> community. https://github.com/alibaba/mdrill
>
> In 2014 I stopped updating the mdrill project because I joined
> Tencent. In our team we developed the Hermes project, where we also built
> Lucene on HDFS. Hermes now imports 1000 billion rows of data per
> day in real time. It is the largest database I have ever developed.
> https://plus.tencent.com/bigdata/hermes
>
> In 2018 I set up my own company, called Luxin. "Lu Xin" is the Chinese
> pronunciation of Lucene. As a fan of Lucene, I chose lucene.xin as the company's domain
> and lucene.cn as its mail domain.
> Luxin's first version of lxdb was called lsql, which means Lucene SQL.
> It uses lucene(2.5.3)+hdfs+spark(1.6.3) and is stable. About 200+ clusters
> use lsql. It processes about 200 billion rows per day, with a total of 20,000 billion
> rows in one single cluster (1000 nodes).
>
> In 2020, during COVID-19, our team decided to develop the next
> generation of lsql, called lxdb (lx = the pronunciation of Lucene). We added HBase to
> lsql to solve the update problem. We have now finished the first version
> of lxdb. https://github.com/lucene-cn/lxdb/wiki
>
> == Known Risks ==
> === Meritocracy ===
> lxdb has been deployed in production and serves more than 200 lines
> of business. It has demonstrated great performance benefits and has proved
> to be a better way to do reporting and analysis on big data. Still, we look
> forward to growing a rich user and developer community.
>
> === Orphaned products ===
> The core developers currently work full-time for Luxin.
> lxdb is widely adopted by many companies and individuals. There is no
> realistic chance of it becoming orphaned. We also have a Tencent QQ
> instant messaging group of about 1,000 people.
>
> === Inexperience with Open Source ===
> The core developers are all active users and followers of open source.
> They are already committers and contributors to the lxdb project.
> The developer yannian mu has ten years of experience with open source projects, including jstorm
> https://github.com/alibaba/jstorm and mdrill
> https://github.com/alibaba/mdrill
>
> === Homogenous Developers ===
> Most of the core developers are from Luxin because the product has been closed source,
> but once lxdb is open-sourced we expect it to receive many bug
> fixes and enhancements from developers who do not work at Luxin. What
> we learned from the community, we return to the community.
>
> === Reliance on Salaried Developers ===
> Luxin invested in lxdb as the solution, and some of its key engineers
> are working full time on the project. In addition, since there is a growing
> Big Data need for scalable solutions, we look forward to other Apache
> developers and researchers contributing to the project. Key to
> addressing the risk of relying on salaried developers from a
> single entity is to increase the diversity of the contributors and to actively
> lobby for it. Apache lxdb intends to do this.
>
> === An Excessive Fascination with the Apache Brand ===
> Lxdb is proposing to enter incubation at Apache in order to help efforts
> to diversify the committer base, not so much to capitalize on the Apache
> brand. The Lxdb project is in production use already inside lxdb, but is
> not expected to be an lxdb product for external customers. As such, the
> lxdb project is not seeking to use the Apache brand as a marketing tool.
>
> === Documentation ===
> Information about lxdb can be found at https://github.com/lucene-cn/lxdb.
> The following links provide more information about lxdb in open source:
>
> * Wiki site: https://github.com/lucene-cn/lxdb/wiki
> * Issue tracking: https://github.com/lucene-cn/lxdb/issues
> * Overview: https://github.com/lucene-cn/lxdb/wiki/intro
> * Luxin home page: http://www.lucene.xin
> * lsql documentation: http://docs.lucene.xin/lsql/v21/
>
> === Initial Source ===
> lxdb source code will be developed under an Apache license at
> https://github.com/lucene-cn/lxdb.
>
> === Core Developers ===
> Currently most of the core developers of LXDB work in the research
> team of Luxin.
>
> - yannian mu (dev)
> - yu chen (dev)
> - guangshi hao (dev)
> - wei sun (dev)
> - qihua zheng (dev)
> - xin wang (dev)
> - qingsong liu (dev)
> - anxing zhou (Tester)
> - jiajun duan (Tester)
>
> == External Dependencies ==
> All dependencies are managed using Apache Maven.
>
> Dependency   License              Optional?
> lucene       Apache License 2.0   true
> zookeeper    Apache License 2.0   true
> hbase        Apache License 2.0   true
> spark        Apache License 2.0   true
> hadoop       Apache License 2.0   true
> hive         Apache License 2.0   true
>
> == Required Resources ==
>
> === Mailing lists ===
>
>  * lxdb-private (PMC discussion)
>  * lxdb-dev (developer discussion)
>  * lxdb-user (user discussion)
>  * lxdb-commits (SCM commits)
>  * lxdb-issues (JIRA issue feed)
>
> === Subversion Directory ===
>
> Instead of Subversion, LXDB prefers Git as its source control
> management system: git://git.apache.org/lxdb