groovy-users mailing list archives

From Merlin Beedell <MBeed...@cryoserver.com>
Subject Help! - GPARS with eachFile & database concurrency
Date Fri, 24 Jan 2020 13:10:58 GMT
Hi wonderful Groovy team,

I am really struggling to find a straightforward Groovy way to convert a simple linear
script into one that uses some level of concurrency.  I cannot find suitable examples for the
task at hand, namely:

I use a Derby database to collect data from email files on a server.  So:

  1.  I walk the directory tree using a mix of eachDirRecurse and eachFileMatch.
  2.  High-level directories are names of mailboxes, so I add these to the database, first
checking that the mailbox is not already in the db.



    int addUser(def Username)
    {
        def res = sql.firstRow("select id from user_info where user_name = ?", [Username])
        if (!res) {
            def keys = sql.executeInsert("insert into user_info (user_name) " +
                                         "VALUES (?)", [Username])
            return keys[0][0]  // return the auto-generated row id number from the db
        } else {
            return res.id
        }
    }



I guess I could just insert the data, and if it errors with 'duplicate key' then I know it
already exists, but then I would still need to obtain the row ID to return to the caller.


  3.  When I find a file that is an email type, I read it line-by-line until I obtain
the required header details (date: / subject: / from: / message-id:) or reach a blank line
(end of headers).  I add these details to the database (again, checking that the item is not
already there).
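
A minimal sketch of the header-reading step in item 3, for illustration only. The helper
name readHeaders and its Map return shape are hypothetical, not taken from the original
script. It assumes one header per line and skips folded (continuation) lines rather than
joining them.

```groovy
// Hypothetical helper: read only the header block of an email and
// return the four wanted headers (lower-cased name -> value).
Map<String, String> readHeaders(Reader source) {
    def wanted = ['date', 'subject', 'from', 'message-id'] as Set
    def found = [:]
    def reader = new BufferedReader(source)
    String line
    while ((line = reader.readLine()) != null) {
        if (line.trim().isEmpty()) break                  // blank line: end of headers
        int colon = line.indexOf(':')
        // Skip folded continuation lines and anything without a header name.
        if (colon <= 0 || Character.isWhitespace(line.charAt(0))) continue
        String name = line.substring(0, colon).trim().toLowerCase()
        if (name in wanted && !found.containsKey(name)) {
            found[name] = line.substring(colon + 1).trim()
        }
        if (found.size() == wanted.size()) break          // all wanted headers seen
    }
    return found
}
```

With a file on disk this could be called as new File(path).withReader { readHeaders(it) },
so that the reader is closed as soon as the headers have been read.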

So I currently use a single SQL connection and a simple loop over the directory and files
- it's simple and works well.  But as there are several million files, I really need to use
multiple threads.

I read that a "DataSource" is a way to pool database connections.  I just can't see how this
works - does it just dynamically create connections on demand [def sql = new Sql(mydatasource)],
and when the 'sql' variable is garbage collected, the connection is returned to the 'pool'?
Are the sql instances "thread safe" with respect to each other?
And are prepared statements also in the 'pool' - so that the sql statements are not parsed
every time, regardless of the connection used?
As for concurrency...
I have previously used threads, in a basic sense.  And then I looked at GPars, which seems
to be the appropriate way to go. So how might the 'eachDirRecurse' and 'eachFileMatch' be
altered to a GPars "withPool" collection loop?  How should each loop call the sql routines
so they are thread safe - presumably by creating an sql connection from the datasource (pool) >
do sql > done.  The withPool will create up to cpu-count + 1 threads - but should I use more
with this type of process logic?  I assume that I could use a "withPool" within another
"withPool", so that I can process several mailboxes [up to the pool count] concurrently and
also the files within each mailbox in parallel.
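
A minimal sketch of the walk-and-dispatch shape described above, for illustration only. It
deliberately uses a plain java.util.concurrent fixed thread pool instead of GPars, so that it
has no dependencies beyond Groovy itself. The names processTree and handleFile and the '.eml'
suffix are hypothetical. The handleFile closure stands in for the parse-and-insert work, and
is assumed to obtain its own database connection on each call rather than share one Sql
instance across threads.

```groovy
import groovy.io.FileType
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical driver: walk the tree once on the calling thread and hand
// each matching file to a fixed-size worker pool. Returns the number of
// files submitted.
int processTree(File root, int threads, Closure handleFile) {
    def pool = Executors.newFixedThreadPool(threads)
    def submitted = new AtomicInteger()
    try {
        root.eachFileRecurse(FileType.FILES) { File f ->
            if (f.name.endsWith('.eml')) {               // assumed email file suffix
                pool.submit({ handleFile(f) } as Runnable)
                submitted.incrementAndGet()
            }
        }
    } finally {
        pool.shutdown()                                  // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS)         // wait for queued work to finish
    }
    return submitted.get()
}
```

A single pool fed from one directory walk keeps the thread count fixed without nesting one
pool inside another. Two caveats: queued tasks are held in memory until a worker picks them
up, and an exception thrown inside handleFile is captured in the discarded Future, so real
code would want to log or collect failures.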

Is there some metric that determines how effective concurrent disk actions (just reading in
this case) can be, e.g. so I could determine a sensible limit on the number of [email] files
being read at the same time?  What monitoring method would help?

I don't think I need to use "actors" here, nor the "dataflow" feature.
Even after reading Groovy in Action (2nd ed.), it is still not really clear how to proceed.  I
have googled a lot, but still cannot map my ideas into a GPars solution.  So I thought I should
ask the experts - the Groovy community - for some suggestions or appropriate reading material.

The nearest I have found to a useful template on this topic is https://stackoverflow.com/questions/35702351/concurrent-parallel-database-queries-using-groovy
But I just cannot see how or why the db connection pool interacts with GPars so that the
same connection is not grabbed by each concurrent process.

Yours, hopefully,

Merlin Beedell

