flume-user mailing list archives

From "Anatharaman, Srinatha (Contractor)" <Srinatha_Ananthara...@comcast.com>
Subject RE: Ingestion to Solr is very slow
Date Thu, 23 Feb 2017 18:43:46 GMT
Denes,

Please find my Morphline config file below. I tried the memory channel, but found that ingestion
runs faster with the file channel.

solrLocator: {
  collection : esearch
  zkHost : "codesolr-as-r2p:2181"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      { detectMimeType { includeDefaultMimeTypes : true } }

      {
        solrCell {
          solrLocator : ${solrLocator}
          captureAttr : true
          lowernames : true
          capture : [_attachment_body, _attachment_mimetype, basename, content, content_encoding, content_type, file, meta, text]
          parsers : [
            { parser : org.apache.tika.parser.txt.TXTParser }
          ]
          fmap : { content : text }
        }
      }

      { generateUUID { field : id } }

      { sanitizeUnknownSolrFields { solrLocator : ${solrLocator} } }

      { logDebug { format : "output record: {}", args : ["@{}"] } }

      { loadSolr: { solrLocator : ${solrLocator} } }
    ]
  }
]


A sample text file looks like this:

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Received: from abc.net ([11.222.333.444])
        by abc.abc.net with bizsmtp
        id djfAJSD*jKDHJKD; Sun, 01 Jan 2010 12:31:51 +0000
Received: from xya.xyz.net ([99.888.777.666])
        by xyz.xyz.net with SMTP
        id jhcfhchABHDJHDD*HDJhsdjcfjh; Sun, 01 Jan 2019 02:31:50 +0000
Received: from smtp.abccbc.abcbcbcb.com ([11.111.22.34])
        by pqrs.pqrs.net with SMTP
        id JHDJHJDHJHD*USDHCFJNHSD*; Sun, 01 Jan 2010 02:31:51 +0000
X-Xfinity-Message-Heuristics: IPv6:N;TLS=0;SPF=1;DMARC=
Received: from portalmail (unknown [777.33.2.90])
        by smtp.ajhjhdjjdfh-ajhdjkjsd.com (Postfix) with ESMTP id HDJHDJDSJKS
        for <PQRS@abc.net>; Sat, 31 Dec 2010 18:31:49 -0800 (PST)
From: "abc_abc@abc.com"
To: qqqq@abc.net
Message-ID: <999999999.888.3449859489586.JavaMail.VV@mortalmail>
Subject: 111-2343444434  You got a email, LLC ("abc")
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-CMAE-Envelope: kjsjdsjdjdjvf9jd/12djhfjhd83hjnr38/jfjjvgf95kjg905j95ygjmt59ytjmgh95ijmhjkt6h
9085jghty89jhn596ijyiuh96ijmhj90t5ui9kjio6i5uy096i5jki650ui6o7kuoki

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

ABC of jdjdhjhjdhjvfh Use of fkjkfjo9r5nmfkf90trmbklgftob
ABC ID: 111-34345454545
Action Date: 01 Jan 2010 02:31:33 GMT

ABC Corporation

Dear Sir or Madam:

dhjdhjfddsfjufnkdfjkjdjnhjfdjk832nhjkfg8nsdvjvnhjvjkffdjkvjhdfhjfjbhjfhnb
jchjvhfjvjhjjnxj4328uiwejf3uivcnj3490uncvrgu890jvkfjviujrfig94uvnjfvvgjhg89
hdfg9urvnjfijuhvirjsgu9rjdnvidj9ujbvgbi9rbdfgjbi9tujfbvkrniujv bnrtbjiuj
jdfjvb9utrjgnbg90ujrjmf043ikvjkfjvfrjopfr0gjvkfdjvfjovgfdovofdodopigif04jvkerj
ibjhidfjbikjfdbjibr9gikfdjgvr905jfkjgvgvj9ufkjbvfiugtjgkjb90tvbjkjfdjbffkjjfb
kjffkjbkfjkjff9g4rjdf044jn v90dfjvgr0irkjkvjfb09ua[vbjksoohfrijugb9jkvjkjkfjf


Regards,

XYZ

*pgp public key is available on the key server at http://xyz.git.edu

Note: The information transmitted in this Notice is intended only for the p=
erson or entity to which it is addressed and may contain confidential and/o=
r privileged material.  Any review, reproduction, retransmission, dissemina=
tion or other use of, or taking of any action in reliance upon, this inform=
ation by persons or entities other than the intended recipient is prohibite=
d.  If you received this in error, please contact the sender and delete the=
material from all computers.

This infringement notice contains an XML tag that can be used to automate t=
he processing of this data.  If you would like more information on how to u=
se this tag please contact XYZ.


- - ---Start ACNS XML
<?xml version=3D"1.0" encoding=3D"UTF-8"?>
<Infringement xmlns=3D"http://www.acns.net/ACNS" xmlns:xsi=3D"http://www.w3=
.org/2001/XMLSchema-instance" xsi:schemaLocation=3D"http://www.acns.net/ACN=
S http://www.acns.net/v1.2/ACNS2v1_2.xsd">
    <Case>
        <ID>00000000</ID>
        <Status>Open</Status>
    </Case>
    <Complainant>
        <Entity>XYZ USA, Inc</Entity>
        <Contact>XYZ</Contact>
        <Address>P.O. Box 000, North XYZ, KA 00000</Address>
        <Phone>999999999</Phone>
        <Email>abc@abc.com</Email>
    </Complainant>
   <Service_Provider>
        <Entity>ABC Corporation</Entity>
        <Email>abc@abc.net</Email>
    </Service_Provider>
    <Source>
        <TimeStamp>2016-12-31T23:15:40.000Z</TimeStamp>
        <IP_Address>11.22.33.444</IP_Address>
        <Port>55555</Port>
        <Type>BitTorrent</Type>
        <Number_Files>1</Number_Files>
        <Deja_Vu>No</Deja_Vu>
    </Source>
    <Content>
        <Item>
            <TimeStamp>2016-12-31T23:15:40.000Z</TimeStamp>
            <Title>Power</Title>
            <FileName>Power </FileName>
            <FileSize>000000000</FileSize>
            <URL>dht</URL>
        </Item>
    </Content>
</Infringement>
- - ---End ACNS XML
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (MingW32)

xjsdh78h23e7u2he3279y3hjdhe7823jhd3783gddey373hyfu37ru3rh892rhf2
23897EBHCA8ENHD         q0jc39ujdkjd9rj8287hcd833hrnj390unce90ru3jrifj9r
930jh3ier390hnd9d23ujf3249u9uifoje9frjfij90fvu394ujfjc0f9u9vjfv9

-----END PGP SIGNATURE-----

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

I will try profiling it.

Regards,
~Sri

From: Denes Arvay [mailto:denes@cloudera.com]
Sent: Thursday, February 23, 2017 10:40 AM
To: user@flume.apache.org
Subject: Re: Ingestion to Solr is very slow

Hi,

The Flume config seems OK to me. One minor thing: I'd suggest trying the memory channel,
which can speed things up a little.
The morphline part might be a bottleneck; could you please share its config as well?
Some sample input files would also be useful to help with debugging.
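
For illustration, a minimal sketch of what the memory channel suggestion could look like in the agent
config. It reuses the agent, source, and sink names from the configuration quoted below; the channel
name MemChannel and the capacity values are assumptions chosen for the example, not tuned
recommendations.

```properties
# Hypothetical sketch: swap the file channel for a memory channel.
# MemChannel is an illustrative name; capacity values are placeholders.
agent.channels = MemChannel
agent.channels.MemChannel.type = memory
# Maximum number of events held in the channel
agent.channels.MemChannel.capacity = 10000
# Maximum number of events per put/take transaction
agent.channels.MemChannel.transactionCapacity = 1000

# Point the existing source and sink at the new channel
agent.sources.SpoolDirSrc.channels = MemChannel
agent.sinks.SolrSink.channel = MemChannel
```

Note that a memory channel keeps events only in RAM, so any events still buffered are lost if the
agent stops, whereas the file channel persists them to disk.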

Besides these, I'd recommend profiling it with a Java profiler (e.g. jvisualvm).

Regards,
Denes


On Fri, Feb 17, 2017 at 12:00 AM Anatharaman, Srinatha (Contractor) <Srinatha_Anantharaman@comcast.com>
wrote:
Hi,

I have a large set of small files; each file is around 7–10 KB in size.
In total I have 350K files, around 6 GB.

I have changed my Flume configuration in many ways, but whatever I change, Solr
takes 2 sec to ingest each file.


agent.sources = SpoolDirSrc
agent.channels = FileChannel
agent.sinks = SolrSink

# Configure Source

agent.sources.SpoolDirSrc.channels = FileChannel
agent.sources.SpoolDirSrc.type = spooldir
agent.sources.SpoolDirSrc.spoolDir = /app/home/solr/final
agent.sources.SpoolDirSrc.basenameHeader = true
#agent.sources.SpoolDirSrc.batchSize = 100000

agent.sources.SpoolDirSrc.fileHeader = true
agent.sources.SpoolDirSrc.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder


# Use a channel that buffers events on disk
agent.channels.FileChannel.type = file
agent.channels.FileChannel.capacity = 1000
agent.channels.FileChannel.transactionCapacity = 1000

#agent.channels.FileChannel.transactionCapacity = 10000

# Configure Solr Sink

agent.sinks.SolrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.SolrSink.morphlineFile = /etc/flume/conf/morphline.conf
#agent.sinks.SolrSink.batchsize = 100000
#agent.sinks.SolrSink.batchDurationMillis = 5000
agent.sinks.SolrSink.channel = FileChannel
agent.sinks.SolrSink.morphlineId = morphline1
agent.sinks.SolrSink.tika.config = tikaConfig.xml
agent.sinks.SolrSink.rollCount = 0
agent.sinks.SolrSink.rollInterval = 0
agent.sinks.SolrSink.rollsize = 100000000
agent.sinks.SolrSink.idleTimeout = 0
agent.sinks.SolrSink.batchSize = 100000
agent.sinks.SolrSink.txnEventMax = 10000000

agent.sources.SpoolDirSrc.channels = FileChannel
agent.sinks.SolrSink.channel = FileChannel

My collection has 2 shards and a replication factor of 1.

Kindly let me know how I can make this better.

Regards,
~Sri