community-commits mailing list archives

From "Sebb (JIRA)" <j...@apache.org>
Subject [jira] [Created] (COMDEV-163) mailglomper.py takes ages to run
Date Sat, 26 Sep 2015 10:35:04 GMT
Sebb created COMDEV-163:
---------------------------

             Summary: mailglomper.py takes ages to run
                 Key: COMDEV-163
                 URL: https://issues.apache.org/jira/browse/COMDEV-163
             Project: Community Development
          Issue Type: Bug
            Reporter: Sebb


mailglomper.py takes a very long time to run (several hours).

This is mainly because it has to download the last 7 mailboxes for each mailing list; some
of these mailboxes can be quite large.

Most of this is wasted processing because only the mailbox for the current month is ever updated;
once a new month starts, emails are added to the new mailbox only and the earlier mailboxes
are not updated further.

It would be more efficient to cache the counts/times for the previous months and use those
instead of re-reading them. If the cache entry is missing, then the file is read.
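
The lookup described above might be sketched as follows. This is only an illustration: the key format, the 'YYYYMM' month strings, and the read_mailbox callable (standing in for the existing download-and-count step) are assumptions, not names taken from mailglomper.py.

```python
import json
import os


def load_cache(path):
    """Return the cached per-mailbox counts, or an empty dict if no cache file exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def save_cache(path, cache):
    """Write the cache back to disk as JSON."""
    with open(path, "w") as f:
        json.dump(cache, f)


def get_counts(cache, list_name, month, current_month, read_mailbox):
    """Return the daily counts for one monthly mailbox.

    month and current_month are 'YYYYMM' strings (an assumed format).
    read_mailbox is a hypothetical callable (list_name, month) -> {'YYYY-MM-DD': count}.
    """
    key = "%s/%s" % (list_name, month)
    if month != current_month and key in cache:
        return cache[key]  # past month: reuse the cached counts
    counts = read_mailbox(list_name, month)  # cache miss, or the current month
    if month != current_month:
        cache[key] = counts  # only mailboxes that no longer change are cached
    return counts
```

The current month is deliberately never cached here, since it is the only mailbox still receiving mail.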

How much information needs to be cached for each mailbox?
For exact compatibility with the current code, it would be necessary to store the counts for
each day, but if this results in too much storage, then it would be possible to store just
the weekly counts. This would not affect the historic weekly stats.

However, the running quarterly stats currently allocate each email to the quarterly buckets on
a daily rather than weekly basis, so some precision would be lost if only the merged weekly
counts were available for past months.
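
To make the precision loss concrete, here is a small sketch comparing quarterly totals built from daily counts with totals built from weekly counts. It assumes weeks start on Monday and that a whole week is credited to the quarter containing its Monday; the existing script may bucket differently, and the function names are illustrative only.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(d):
    """Monday of the week containing d (assumed week convention)."""
    return d - timedelta(days=d.weekday())


def quarter(d):
    """Return (year, quarter number 1..4) for a date."""
    return (d.year, (d.month - 1) // 3 + 1)


def quarterly_from_daily(daily):
    """Allocate each day's count to the quarter that day falls in."""
    totals = Counter()
    for d, n in daily.items():
        totals[quarter(d)] += n
    return totals


def quarterly_from_weekly(daily):
    """Merge to weekly counts first, then allocate each whole week to one quarter."""
    weekly = Counter()
    for d, n in daily.items():
        weekly[week_start(d)] += n
    totals = Counter()
    for w, n in weekly.items():
        totals[quarter(w)] += n  # the whole week goes to the quarter of its Monday
    return totals


# Wednesday 30 Sep and Thursday 1 Oct 2015 share a week but not a quarter.
daily = {date(2015, 9, 30): 5, date(2015, 10, 1): 7}
by_day = quarterly_from_daily(daily)    # Q3 gets 5, Q4 gets 7
by_week = quarterly_from_weekly(daily)  # Q3 gets all 12
```

Only weeks that straddle a quarter boundary are affected, so the error is limited to at most one week of mail per boundary.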

The cache itself would need managing to ensure that the oldest entries were dropped, otherwise
it would grow very large.
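
One possible way to drop the oldest entries, assuming cache keys of the hypothetical form 'listname/YYYYMM' and a fixed number of months to retain:

```python
def prune_cache(cache, keep_months):
    """Delete, in place, every entry that is not in the newest keep_months months.

    Keys are assumed to look like 'listname/YYYYMM'; 'YYYYMM' strings sort
    chronologically, so a plain sort finds the newest months.
    """
    months = sorted({key.rsplit("/", 1)[1] for key in cache})
    keep = set(months[-keep_months:]) if keep_months > 0 else set()
    for key in list(cache):  # copy the keys, since the dict is modified in the loop
        if key.rsplit("/", 1)[1] not in keep:
            del cache[key]
    return cache
```

Calling this once per run, after the cache has been updated, would keep its size bounded by the number of lists times keep_months.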

Note: since contributions to the weekly buckets may come from more than one month, it's likely
not feasible to use the existing data. This is because the current month is processed multiple
times, so its data needs to be replaced each time. If its first week overlaps with the last
week of the previous month, replacing that week would lose the previous month's contribution.
This problem might even affect daily accumulations; it depends on exactly when the mailboxes are
flipped. Having separate cache entries for each monthly mailbox would also make it easier to
manage the cache. The downside is that it would require more storage, but the cost of re-reading
the historic mailboxes every day is relatively large.
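
A sketch of why separate per-month entries avoid the overlap problem: the weekly totals are rebuilt from all the per-month entries on each run, so replacing the current month's entry cannot disturb what the previous month contributed to a shared week. The data layout and names are assumptions made for illustration, with weeks again taken to start on Monday.

```python
from collections import Counter
from datetime import date, timedelta


def weekly_totals(per_month):
    """Rebuild the weekly totals from scratch.

    per_month maps 'YYYYMM' -> {date: count}, one entry per monthly mailbox.
    """
    weekly = Counter()
    for daily in per_month.values():
        for d, n in daily.items():
            weekly[d - timedelta(days=d.weekday())] += n  # key: Monday of the week
    return weekly


per_month = {
    "201508": {date(2015, 8, 31): 4},  # Monday; closed mailbox, taken from the cache
    "201509": {date(2015, 9, 1): 2},   # Tuesday of the same week; current mailbox
}
first = weekly_totals(per_month)

# A later run re-reads the current month and replaces only that entry.
per_month["201509"] = {date(2015, 9, 1): 2, date(2015, 9, 2): 6}
second = weekly_totals(per_month)
# The week of 31 Aug still includes the 4 emails from the August mailbox.
```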



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
