From: "Alan D. Cabrera"
To: general@incubator.apache.org
Subject: Re: [VOTE] Incubate Lucene Connector Framework
Date: Sat, 9 Jan 2010 06:29:09 -0800
In-Reply-To: <9D040A2C-EC44-4936-94E7-A00E805FF0FD@apache.org>

+1

Regards,
Alan

On Jan 8, 2010, at 5:51 AM, Grant Ingersoll wrote:

> Hi,
>
> Given the lack of response on the proposal, I'll assume lazy consensus
> and call a vote.
>
> On behalf of the Lucene PMC, I'd like to propose incubation for a new
> Lucene subproject called the Lucene Connector Framework (LCF). I think we
> have all the necessary bits in place for the proposal to go forward.
>
> Proposal: http://wiki.apache.org/incubator/LuceneConnectorFrameworkProposal
>
> [] +1. Accept LCF into the Incubator.
> [] 0. Don't care.
> [] -1. Do not accept (and why.)
>
> Here's my +1.
>
> Thanks,
> Grant Ingersoll
>
> ------ Wiki Text Copied Below -----
>
> Lucene Connector Framework
>
> Abstract
>
> Many, many search engines, as well as other applications, have a need to
> connect with content repositories (SharePoint, CMS, Documentum, etc.) in a
> standard manner. The Lucene Connector Framework (LCF) is a project aimed
> at building out these connectors in open source under the Apache brand.
>
> Proposal
>
> The goal of LCF is to create a viable Lucene subproject aimed at
> delivering a best of breed connector framework under the Apache Lucene
> name. As a framework, the project will not only provide a way to connect
> to individual repositories, but also a mechanism for plugging in new
> connectors or custom connectors in a straightforward manner.
>
> A connector framework is vital for search engines and other tools that
> need to access data located in corporate repositories. By abstracting the
> problem into a framework, applications can code to a set of well-defined
> interfaces instead of having to use a different interface for each
> connector.
>
> Connector Framework is an extensible incremental crawler, which uses a
> database to manage configuration and crawl history, and provides
> reasonably high performance in accessing content in multiple repositories
> for the main purpose of search engine indexing. Connector Framework also
> establishes a repository-specific security model which can be used to
> limit search user access to repository content based on a user's identity.
> Connector Framework also includes existing connectors and authorities for:
>
> • File system
> • Windows shares
> • JDBC-supported databases
> • RSS feeds
> • General websites
> • LiveLink [from OpenText]
> • Documentum [from EMC]
> • SharePoint [from Microsoft]
> • Meridio [from Meridio]
> • Memex [from Memex]
> • FileNet [from IBM]
>
> Key design points for Connector Framework are as follows:
>
> • Extensibility - you can add new connectors for new repositories, and new
>   authorities for specific repository security models
> • Incrementality - the ability to process only what changed between
>   crawls, in a repository-specific manner
> • Restartability - using a database with ACID properties to ensure that
>   crawls are safe against process interruption or machine shutdown
> • Security - establishing a model of security tokens that allows a search
>   engine to enforce a repository's security model
> • Limited footprint - ability to operate reliably within a fixed amount of
>   process memory, regardless of configuration
> • Performance - management of connector-specific resources to maximize
>   overall throughput
> • Transparency - ability to generate reports on the activity of all crawls
>   and repository connections
>
> Background
>
> MetaCarta originally approached Grant Ingersoll from the Lucene PMC about
> donating their existing connector framework to the Lucene PMC. After some
> discussion about accepting it as a software grant, the PMC decided it
> would be best to incubate the project first.
>
> Rationale
>
> The Connector Framework fills an often significant gap in the Lucene
> experience, namely, how to get content locked away in a content repository
> into Lucene/Solr/Nutch/Mahout/Tika. Naturally, many other tools (search
> engines and others) will also have this same problem. A Connector
> Framework would also be useful for someone wishing to migrate between
> content repositories.
>
> Current Status
>
> Connector Framework has been under development and in use in the field for
> close to five years, deployed on a MetaCarta search appliance. Almost all
> development of the project has been done by Karl Wright
> (kwri...@metacarta.com). Some individual connectors were developed
> initially by contractors hired by MetaCarta, Inc., but maintenance and
> further development are currently handled by the MetaCarta team.
>
> Development of Connector Framework can therefore be viewed as core
> framework development, plus development of individual connectors. Core
> framework development is currently not a terribly collaborative process,
> as there are no maintainers of the core functionality other than Mr.
> Wright. Development of new connectors has been done in the past in a much
> more collaborative way by supplying a developer with a "development kit",
> and then integrating the resulting connector (with whatever changes might
> have been necessary) into the source tree.
>
> Reasonable efforts have been made to maintain the generality of the code
> base during the time that MetaCarta has owned it. Nevertheless, certain
> MetaCarta-specific changes have been made which may require review and
> modification. The following areas probably need to be addressed in the
> code before graduation can occur:
>
> • Branding. The UI brands it as a MetaCarta project.
> • Package names. Package names would have to be changed.
> • How Connector Framework handles document delivery needs to be
>   generalized, at least for a single, configurable target output
>   connector, and perhaps for multiple, independently-configurable targets.
>   Simple example output connectors need to be written. Work in this
>   direction is currently underway at MetaCarta and may or may not be
>   complete at the time of the code handover.
> • Connector Framework-specific dependent package modifications need to be
>   addressed somehow. For instance, the following projects that Connector
>   Framework depends upon have been modified, but the modifications have
>   not been accepted upstream: commons-httpclient NTLMv2 and NTLM2 support
>   [RSS, Web, SharePoint, Meridio, and Livelink connectors];
>   commons-httpclient custom HTTPS protocol factory support [Web,
>   SharePoint, Meridio, and Livelink connectors]; xerces ability to handle
>   non-legal RSS feeds [RSS and Web connectors]
> • MetaCarta-specific features, like document templates, are explicitly
>   handled by the UI and the infrastructure. These features should be
>   generalized so that they are controlled by the choice of output
>   connector.
> • Some specific hooks, namely support for configuration change
>   notification, and for database maintenance notification, may need to be
>   made more generic.
> • Share Connector has a "fingerprinting" feature, which prefilters
>   documents based on a document type it surmises using a document
>   inspection technique. This feature is only viable at the moment for very
>   basic document types. It should either be removed, or generalized
>   significantly to be much more flexible.
> • Documentation needs to be fleshed out, including javadoc and overall
>   usage documents.
> • Tests need to be written and/or ported from MetaCarta's test suite.
>
> Longer term, the project will likely grow into a more distributed crawler,
> where multiple machines might well be involved in coordinated crawling
> activity.
>
> Meritocracy
>
> Building the community using a meritocratic approach is very important to
> the success of LCF. We know many, many people in the search space (and
> otherwise) have either written their own connectors or are in need of
> connectors. Thus, we expect a meritocratic community will lead to
> widespread participation.
>
> Community
>
> Our hope is that our existing code, features and capabilities will attract
> a large community of both developers and users. We also believe that other
> organizations will find this project interesting and relevant, and
> contribute resources.
>
> The user community of LCF would be similar to that of the other Lucene
> projects, and in many cases they would overlap.
>
> Core Developers
>
> See the initial committer list below.
>
> Alignment
>
> We expect LCF will align quite well with the existing Lucene community and
> will also provide significant value to other ASF and non-ASF projects as
> well as many companies and individuals looking to access their content
> repositories in a programmatic fashion.
>
> Known Risks
>
> Orphaned Products
>
> The Connector Framework is an important piece of any search engine,
> including MetaCarta's, as it provides the primary mechanism for getting
> content out of a repository and into the search engine's index. Thus, we
> don't expect it will be orphaned anytime soon. Once the project is
> established and the code is available, we expect to attract not only other
> search companies, but others with similar needs.
>
> Inexperience with Open Source
>
> Grant Ingersoll, Ryan McKinley and Simon Willnauer provide the majority of
> the experience with Open Source at the ASF, but all of the initial
> committers are familiar with Open Source and have contributed to other
> open source projects.
>
> Homogeneous Developers
>
> Most of the current committers are members of either the MetaCarta or
> Lucid Imagination developer team, but several are not. Additionally, we
> are actively recruiting other developers.
>
> Reliance on Salaried Developers
>
> We have a variety of committers represented. Some are being paid to work
> on the project and some are not.
>
> Cryptography
>
> Connector Framework itself has no real cryptography component, although it
> does currently obfuscate passwords it saves to the database or to a
> configuration file using a proprietary algorithm. The algorithm is present
> simply to avoid using cleartext and is not secure in any sense other than
> by obscurity.
>
> Various connectors, such as Share Connector, Web Connector, RSS Connector,
> SharePoint Connector, LiveLink Connector, and Meridio Connector make use
> of cryptographic principles via secondary libraries. Specifically, these
> connectors support NTLM, NTLMv2, and NTLM2 Session authentication via
> commons-httpclient and jCIFS. The changes to commons-httpclient necessary
> to support these varieties of Windows protocols have not yet been accepted
> upstream by the Apache httpclient project.
>
> It is unknown at this time exactly to what degree the Oracle JDBC driver,
> the jTDS JDBC driver, or the PostgreSQL JDBC driver uses cryptography.
> Also, the FileNet API class, the Memex API classes, the OpenText LAPI API
> classes, and the Documentum DFC classes all may or may not use
> cryptography.
>
> Legal Concerns
>
> Some of the connectors in the existing framework require paid licenses to
> use. We will need to evaluate each connector to see what can be
> appropriately included. For those connectors that require a paid license,
> we will need to determine a plan for including the wrapper code without
> the underlying bindings in a legal manner. We expect we can provide the
> wrapper code without the binding and that the code will thus only be
> compilable by someone who has access to the binding. (This is what Google
> has done for their individual connectors.) Longer term, we expect to
> demonstrate to the companies with proprietary connectors why it is more
> valuable for them to open up their specific connector pieces to give
> broader access to people looking to leverage their content in the
> repository.
>
> Trademark
>
> The project is being rebranded from a MetaCarta internal name to the
> Lucene Connector Framework, which will be an ASF mark.
>
> Relationships with Other Apache Products
>
> We expect almost all of the Apache Lucene ecosystem will benefit from
> having a standard way of connecting to content repositories. Additionally,
> users of UIMA should also benefit. We also see an especially tight
> connection with Tika, as much of the content in these types of
> repositories is in "rich" document types which will then need their
> content extracted.
>
> An Excessive Fascination with the Apache Brand
>
> All of us are familiar with the value that Apache brings to a project in
> building out a community. We also are all significant users of Apache
> Lucene and related tools (Solr, Nutch, Mahout, Tika) and expect a close
> relationship with those projects will help significantly grow the LCF
> community.
>
> Documentation
>
> MetaCarta has end-user documentation for Lucene Connector Framework, which
> might function as the core of the open-source end-user documentation. The
> documentation is in LaTeX form, and thus usable sources can readily be
> extracted. Any ownership issues for the documentation as it stands still
> need to be researched.
>
> The existing javadoc of the code, while fairly extensive, needs review and
> perhaps augmentation to ensure it meets the needs of an ASF project.
> Significant attention was paid to maintaining its accuracy during
> MetaCarta's ownership of the code base.
>
> Initial Source
>
> All initial sources will be coming from MetaCarta, Inc., with the goal of
> folding in changes from others shortly thereafter.
>
> Source and Intellectual Property Submission Plan
>
> Code IP grants need to be made from MetaCarta, Inc. But, in addition,
> several connectors (notably Documentum, LiveLink, Memex, and FileNet) rely
> directly on client APIs in order to be compiled. Another connector (JDBC)
> relies on the existence of the Oracle JDBC Driver in the classpath in
> order to enable crawls against Oracle databases.
>
> It is unlikely that EMC, OpenText, Memex, or IBM would grant
> Apache-license-compatible use of these client libraries. Thus, the
> expectation is that users of these connectors obtain the necessary client
> libraries from the owners prior to building or using the corresponding
> connector. An alternative would be to undertake a clean-room
> implementation of the client APIs, which may well yield suitable results
> in some cases (LiveLink, Memex, FileNet), while being out of reach in
> others (Documentum). Conditional compilation, for the short term, is thus
> likely to be a necessity.
>
> Other external dependencies, such as jCIFS for the Share Connector, are
> licensed under the LGPL, and thus may need to be treated in a manner
> similar to the closed APIs even though they are open source. These include
> the PostgreSQL JDBC driver and jTDS.
>
> The Lucene Connector Framework core and individual connectors are
> completely separable, and many of the connectors require no third party
> licenses. Therefore, there is significant utility for this project even in
> the absence of any third-party software grants, or clean-room engineering.
>
> The software grant will be faxed to the Apache Software Foundation if and
> when the proposal herein described is accepted. MetaCarta patents are not
> infringed by this grant. Also, MetaCarta trademarks are not included in
> this grant.
>
> External Dependencies
>
> The project dependencies, other than on other Apache projects, are as
> follows:
>
> The ConnectorFramework core currently uses the Bitmechanic JDBC pool
> driver, which is BSD licensed, and the PostgreSQL JDBC driver, which is
> also BSD licensed.
>
> The LiveLink Connector relies on LAPI, which is privately licensed by
> OpenText. The Documentum Connector relies on DFC, which is privately
> licensed by EMC. The Share Connector relies on jCIFS, which is LGPL. The
> Memex Connector relies on privately licensed Java libraries from Memex.
> The FileNet Connector relies on privately licensed Java libraries from
> IBM.
>
> Required Resources
>
> • Mailing lists
>   • connectors-private (with moderated subscriptions)
>   • connectors-user@
>   • connectors-dev@
>   • connectors-commit@
> • Subversion directory
>   • https://svn.apache.org/repos/asf/incubator/connectors
> • Website
>   • Confluence (CONNECTORS)
> • Issue Tracking
>   • JIRA (CONNECTORS)
>
> Initial Committers
>
> Names of initial committers with affiliation and current ASF status:
>
> • Karl Wright (kwright at metacarta)
> • Josiah Strandberg (jstrandberg at metacarta)
> • Ken Baker (bakerkj at metacarta)
> • Marc Meadows (mam at metacarta)
> • Grant Ingersoll (gsingers@a.o, Lucid Imagination, ASF Member)
> • Brian Pinkerton (brian.pinkerton at Lucid Imagination)
> • Simon Willnauer (simonw at apache org, Committer on Lucene Java and
>   Lucene Open Relevance Project)
> • Ryan McKinley (ryan at apache org, Committer on Lucene and Solr)
> • Robert Muir (rmuir at apache org, Committer on Lucene and Open
>   Relevance)
> • Sami Siren (siren@a.o, Committer on Nutch and Tika)
> • Otis Gospodnetic (otis@a.o, Committer on Lucene, Solr, Nutch, Mahout,
>   and Open Relevance Project)
> • Shalin Shekhar Mangar (shalin@a.o, AOL, Committer on Apache Solr)
> • Noble Paul (noble@a.o, AOL, Committer on Apache Solr)
> • George Aroush (george at aroush.net, Committer on Lucene.Net)
>
> Sponsors
>
> Champion
>
> • Grant Ingersoll
>
> Nominated Mentors
>
> • Grant Ingersoll
> • Jukka Zitting
> • Gianugo Rabellino
>
> Sponsoring Entity
>
> • Apache Lucene PMC: Message ID: AF7E...@gmail.com in private@lucene.a.o