Procrastinative Ninja

what I do, when I don't do, what I should do.

iOS Mail Rules and Spam Filter

Jan 12, 2021

Okay, heads up: this isn't really on iOS but on a Raspberry Pi. It does affect the iOS Mail app though, so does that count?

I’m using a plain old IMAP mail provider and was long frustrated by the fact that macOS Mail.app rules are only applied while it is running. Whenever it isn’t and I check my inbox on iOS, it’s a pure mess. Let’s not even talk about the state of Apple’s spam filtering. I finally got around to setting up a combination of imapfilter and bogofilter on a Raspberry Pi as an always-on and always-connected IMAP client. It’s my new centralized mail sorting and spam management solution.

Imapfilter

The centerpiece is imapfilter, a headless IMAP client with a simple but powerful rule engine in Lua. I wasn’t completely satisfied with the way it handles password entry and wanted a way to edit the rules conveniently in the browser, so I developed a web UI. The installation is (hopefully) described sufficiently in the readme. If it’s not, please open an issue.

My configuration is basically a slightly customized version of the example in the repository.

Bogofilter

bogofilter is a mature Bayesian spam filter which can neatly be integrated into the imapfilter setup.

Preparation

To train bogofilter a corpus of mails is needed, the bigger the better. Luckily I’ve been collecting mails for more than a decade - including spam, because why not?

However, it’s no good on the server; it’s needed locally and in the mbox format. Apple Mail can export mailboxes as mbox, but it failed on my multi-GB Archive mailbox.

Buuut I’m using mbsync to periodically create local backups of my mails on my Mac. You can install it with brew install isync (yes, isync, not mbsync, because “isync is the project name, mbsync is the current executable name”…). I have the following configuration in ~/.mbsyncrc and the program can be run with mbsync -a.

# Remote IMAP account
IMAPAccount me@domain.com
Host my.mailserver.com
Port 993
User <username>
PassCmd "security find-generic-password -s mbsync -a me@domain.com -w"
SSLType IMAPS
SSLVersions TLSv1.2

IMAPStore me@domain.com-remote
Account me@domain.com

# This section describes the local storage
MaildirStore me@domain.com-backup
Path "/path/to/local/backup/me@domain.com/"
Inbox "/path/to/local/backup/me@domain.com/INBOX"
# The SubFolders option allows to represent all
# IMAP subfolders as local subfolders
SubFolders Verbatim

# This section defines a channel, a connection between remote and local
Channel me@domain.com
Master :me@domain.com-remote:
Slave :me@domain.com-backup:
Patterns *
CopyArrivalDate yes
Sync All
Create Slave
Expunge Slave
SyncState *

However, mbsync stores the mails in the Maildir format. I used maildir2mbox to convert my Archive and Junk mailboxes into the mbox format.

pip3 install maildir2mbox
python3 -m maildir2mbox /path/to/local/backup/me@domain.com/Archive Archive.mbox
python3 -m maildir2mbox /path/to/local/backup/me@domain.com/Junk Junk.mbox
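
A quick sanity check after the conversion can be sketched like this: count the message files in one Maildir folder and the "From " separator lines in the resulting mbox, and see whether the two numbers roughly match. The function names are made up for this example. It assumes the Maildir keeps its mails in cur/ and new/, and that the mbox writer escapes body lines that start with "From ", otherwise the mbox count will come out too high.

```shell
# Hypothetical helpers, not part of maildir2mbox or mbsync.

# Count message files in the cur/ and new/ subdirectories of one Maildir folder.
count_maildir() {
  find "$1/cur" "$1/new" -type f 2>/dev/null | wc -l | tr -d ' '
}

# Count messages in an mbox file by counting "From " separator lines.
# grep exits non-zero when nothing matches, hence the "|| true".
count_mbox() {
  grep -c '^From ' "$1" || true
}

# Usage (paths are the placeholders used above):
#   count_maildir /path/to/local/backup/me@domain.com/Archive
#   count_mbox Archive.mbox
```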

Installation

It could have been so easy…

sudo apt-get install bogofilter

but that only installs version 1.2.4 instead of the latest 1.2.5. Now, does that one patch version make such a big difference? Well, .4 is six years older than .5. In the meantime a bunch of security and memory leak fixes accumulated and I wanted to have those. So instead of a single apt-get install I built bogofilter from source.

After downloading 1.2.5 from SourceForge and sending it over to the Pi

scp Downloads/bogofilter-1.2.5.tar.xz pi@<Pi IP>:

building was pretty straightforward

sudo apt-get install sqlite3 libsqlite3-dev
tar -xf bogofilter-1.2.5.tar.xz
cd bogofilter-1.2.5
./configure --with-database=sqlite
make all check
sudo make install

Configuration

# Create the directory where the wordlist will be stored
sudo mkdir /var/spool/bogofilter
sudo chgrp pi /var/spool/bogofilter/
sudo chmod g+w /var/spool/bogofilter/
# Update the configuration
sudo cp /etc/bogofilter.cf.example /etc/bogofilter.cf
sudo nano /etc/bogofilter.cf

In the example configuration the following values were modified

...
bogofilter_dir = /var/spool/bogofilter
...
ham_cutoff = 0.6
spam_cutoff = 0.85
...

Bogofilter scores every mail on its spamminess and can operate in two-state mode (spam, not spam) or tri-state mode (spam, not spam, and unsure). I want to use tri-state mode, and the thresholds above create three intervals

  • [0, 0.6] - not spam
  • (0.6, 0.85) - unsure
  • [0.85, 1] - spam

A cutoff value of 0.6 for good mails is very conservative and I intend to tighten it up in the future.
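
To make the three intervals concrete, here is a small sketch that maps a score to a class. The function name classify_score is made up for this example, and it only imitates the thresholds in plain awk rather than calling bogofilter. It also assumes that a score exactly on a cutoff counts as not spam (at ham_cutoff) or spam (at spam_cutoff).

```shell
# Hypothetical helper mirroring ham_cutoff and spam_cutoff from the config.
# Usage: classify_score SCORE [HAM_CUTOFF] [SPAM_CUTOFF]
# Assumption: a score equal to a cutoff is treated as ham or spam respectively.
classify_score() {
  awk -v s="$1" -v ham="${2:-0.6}" -v spam="${3:-0.85}" 'BEGIN {
    if (s + 0 >= spam + 0)     print "spam"
    else if (s + 0 <= ham + 0) print "ham"
    else                       print "unsure"
  }'
}

# Usage:
#   classify_score 0.70            # uses the 0.6 / 0.85 defaults from above
#   classify_score 0.70 0.3 0.95   # explicit cutoffs
```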

Training

The Raspberry Pi could not handle my mailbox. It failed with:

bzcat -f ./Archive.mbox
bzcat: Can't open input file ./Archive.mbox: Value too large for defined data type.

To work around this I decided to train on my Mac and transfer the final database to the Pi. However, the Homebrew version of bogofilter uses the default Berkeley DB backend, in a version far ahead of what was available via apt-get. I didn’t want to go down that rabbit hole and decided to use SQLite instead because it seemed simpler and I knew the versions would be compatible. And I like SQLite.

So, on to building bogofilter from source on my Mac:

brew install sqlite
export LDFLAGS="$LDFLAGS -L/usr/local/opt/sqlite/lib"
export CPPFLAGS="$CPPFLAGS -I/usr/local/opt/sqlite/include"

tar -xf bogofilter-1.2.5.tar.xz
cd bogofilter-1.2.5
./configure --with-database=sqlite

The first attempt failed with

clang: error: '-I-' not supported, please use -iquote instead

This is probably because I have Xcode installed and clang 12.0.0 fails the “is GCC4?” check in configure.ac. To fix this I modified src/Makefile

#AM_CPPFLAGS = -I$(top_srcdir)/gnugetopt -I$(top_srcdir)/trio -I- -I. \
#       -I$(srcdir)  -I$(top_srcdir)/gsl/specfunc -I$(top_srcdir)

AM_CPPFLAGS = -iquote$(top_srcdir)/gnugetopt -iquote$(top_srcdir)/trio \
        -I$(srcdir)  -I$(top_srcdir)/gsl/specfunc -I$(top_srcdir)

I also had to specify LC_CTYPE to make the tests pass

LC_CTYPE=C make all check

I trained with stricter thresholds than the ones configured for production to gain some leeway. I used the script bogominitrain.pl and targeted a spam threshold of 0.95 (instead of 0.85) and a non-spam threshold of 0.3 (instead of 0.6).

cd src
export PATH=.:$PATH
curl https://gitlab.com/bogofilter/bogofilter/-/raw/main/bogofilter/contrib/bogominitrain.pl -o bogominitrain.pl
chmod a+x bogominitrain.pl
./bogominitrain.pl -fv ./ /path/to/Archive.mbox /path/to/Junk.mbox '-o 0.95,0.3'

This ran for a while and after it finished I validated the results by sampling a few mails from other mailboxes which were not part of the training set

bogofilter -v < /path/to/mbsync/backup/mailbox/cur/something
echo $? # 0 means spam, 1 is not spam, 2 is unsure
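
Checking mails one by one gets tedious, so the sampling can be wrapped in a small loop. This is only a sketch: label_for_status and classify_files are names made up for this example, and any exit status other than 0, 1 or 2 is simply reported as an error.

```shell
# Hypothetical helper: translate a bogofilter exit status into a label.
label_for_status() {
  case "$1" in
    0) echo "spam" ;;
    1) echo "ham" ;;
    2) echo "unsure" ;;
    *) echo "error" ;;
  esac
}

# Classify every file passed as an argument and print "<label> <file>".
classify_files() {
  for f in "$@"; do
    status=0
    bogofilter < "$f" || status=$?
    echo "$(label_for_status "$status") $f"
  done
}

# Usage (the path is the placeholder from above), with a tally per label:
#   classify_files /path/to/mbsync/backup/mailbox/cur/* | cut -d' ' -f1 | sort | uniq -c
```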

Not all of them were classified correctly, but I was happy enough. Accuracy will increase over time.

Finally the database needed to be transferred to the Pi

scp wordlist.db pi@<Pi IP>:/var/spool/bogofilter/

Spam Filtering

The algorithm for spam filtering which needs to be expressed in imapfilter rules is

  • Let bogofilter evaluate every newly arrived message in the inbox:
    • If it is not spam, leave it alone and let the user handle it as usual.
    • If it is spam, mark it as Junk and bogofilter-junk and move it into the Junk mailbox.
    • If unsure, mark it as Junk and bogofilter-unsure but leave it in the inbox for the user to review.
    • In any case, mark the message as evaluated so it’s only processed once.
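
Written down as a lookup table, the steps above look roughly like this. It is a plain shell sketch for illustration only, not the imapfilter rules themselves, and the function name actions_for and its output format are made up for this example. The keyword names are the ones used in this post.

```shell
# Hypothetical helper: print the keywords to add and the target mailbox
# for a classification label ("spam", "ham" or "unsure").
actions_for() {
  case "$1" in
    spam)   echo "keywords=Junk,bogofilter-junk,bogofilter-evaluated mailbox=Junk" ;;
    unsure) echo "keywords=Junk,bogofilter-unsure,bogofilter-evaluated mailbox=INBOX" ;;
    ham)    echo "keywords=bogofilter-evaluated mailbox=INBOX" ;;
    *)      echo "unknown label: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   actions_for unsure
```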

The Junk label causes macOS Mail to display the message in yellow and show the “Mail thinks this message is Junk Mail” header.


Unfortunately iOS Mail does not have such an indication for spam mails. Or any indication at all. But it displays flags, so unsure messages in the inbox will also get a yellow flag.

Now, nobody is perfect and neither is bogofilter. All three classification results can be wrong, “unsure” can even be wrong both ways. For bogofilter to improve it’s important to provide feedback so it learns.

When designing the feedback loop I thought about it from a user perspective and how I want to deal with it in Mail.app.

  • Good mail misclassified as spam (false positive)
    • Has been moved to the Junk folder.
    • Will be moved back into the Inbox by clicking the “Move to Inbox” button.
    • This will remove the macOS Junk label.
    • Each message in the Inbox without the Junk label but with the bogofilter-junk label needs to be un-learned as spam and learned as good.
  • Spam mail misclassified as good (false negative)
    • Has been left in the Inbox.
    • Will be moved to the Junk mailbox by clicking the junk-mail button in the toolbar.
    • This will add the macOS Junk label.
    • Each message in the Junk mailbox without the bogofilter-junk label needs to be un-learned as good and learned as spam.
  • Good mail with an unsure result (unsure negative)
    • Has been left in the Inbox but marked as macOS Junk.
    • Will be marked as good by clicking the “Not Junk” button.
    • This will remove the macOS Junk label.
    • Each message in the Inbox with the bogofilter-unsure label but without the macOS Junk label needs to be learned as good.
  • Spam mail with an unsure result (unsure positive)
    • Has been left in the Inbox but marked as macOS Junk.
    • Will be moved to the Junk mailbox by clicking the “Move to Junk” button.
    • OR will directly be deleted.
    • Each message in the Junk or Trash mailboxes with the bogofilter-unsure label will be learned as spam.
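
The four cases boil down to a small decision table from "where is the message now and which labels does it carry" to the bogofilter training flags. The sketch below shows just that table. The function name feedback_flags and its three arguments are made up for this example; the flags are -n (learn as good), -s (learn as spam), -N (un-learn as good) and -S (un-learn as spam).

```shell
# Hypothetical helper.
# $1: mailbox the message is in now (INBOX, Junk or Trash)
# $2: label bogofilter gave it earlier (junk, unsure or none)
# $3: whether it still carries the macOS Junk label (yes or no)
# Prints the bogofilter training flags, or nothing if no feedback is needed.
feedback_flags() {
  case "$1:$2:$3" in
    INBOX:junk:no)                printf '%s\n' "-nS" ;; # false positive
    Junk:none:*)                  printf '%s\n' "-Ns" ;; # false negative
    INBOX:unsure:no)              printf '%s\n' "-n"  ;; # unsure negative
    Junk:unsure:*|Trash:unsure:*) printf '%s\n' "-s"  ;; # unsure positive
  esac
}

# Usage:
#   feedback_flags INBOX junk no
```

printf is used instead of echo because some shells would swallow a bare "-n" as an echo option.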

In code this looks like

BOGOFILTER_EVALUATED = "bogofilter-evaluated"
BOGOFILTER_UNSURE = "bogofilter-unsure"
BOGOFILTER_JUNK = "bogofilter-junk"
YELLOW_FLAG = "$MailFlagBit1"

inbox = my_account.INBOX
junk_mailbox = my_account["Junk"]
trash_mailbox = my_account["Trash"]

-- mark as spam so macOS Mail recognizes it as such
function mark_as_junk(messages)
  messages:remove_flags({'NotJunk', '$NotJunk'})
  messages:add_flags({'Junk', '$Junk'})
end
function mark_as_good(messages)
  messages:remove_flags({'Junk', '$Junk'})
  messages:add_flags({'NotJunk', '$NotJunk'})
end

function junk(messages)
  messages:mark_seen()
  mark_as_junk(messages)
  messages:move_messages(junk_mailbox)
end

-- based on https://gist.github.com/sthalik/344d3a0db54c4c9051e4
function filter_junk()
  -- only evaluate mails smaller than 1 MB (spam with embedded images can be surprisingly large...)
  MIN_SIZE = 1024 * 1024
  -- Lua 5.2+ moved unpack to table.unpack; support both
  local unpack = table.unpack or unpack

  inbox_messages = inbox:is_smaller(MIN_SIZE)

  -- false positives
  -- messages which bogofilter previously classified as junk but since have
  -- been marked as clean in macOS Mail
  false_positives = inbox_messages:has_keyword(BOGOFILTER_JUNK):has_unkeyword("Junk")
  for _, mesg in ipairs(false_positives) do
    mbox, uid = unpack(mesg)
    message = mbox[uid]
    -- unlearn that it was spam (-S) and learn that it was okay (-n)
    pipe_to('bogofilter -nS', message:fetch_message())
  end
  false_positives:remove_flags({ BOGOFILTER_JUNK })

  -- unsure negatives
  -- messages which bogofilter classified as unsure but since have
  -- been marked as clean in macOS Mail or iOS Mail
  inbox_unsure = inbox_messages:has_keyword(BOGOFILTER_UNSURE)
  unsure_negatives = inbox_unsure:has_unkeyword("Junk") + inbox_unsure:has_unkeyword(YELLOW_FLAG)
  for _, mesg in ipairs(unsure_negatives) do
    mbox, uid = unpack(mesg)
    message = mbox[uid]
    -- learn that it was not spam (-n)
    pipe_to('bogofilter -n', message:fetch_message())
  end
  mark_as_good(unsure_negatives)
  unsure_negatives:remove_flags({ BOGOFILTER_UNSURE })
  unsure_negatives:unmark_flagged()
  unsure_negatives:remove_flags({ YELLOW_FLAG })

  -- false negatives
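  -- messages which have _not_ been classified as junk by bogofilter
  -- but were moved there by macOS Mail; unsure messages are excluded here
  -- because the "unsure positives" block below learns them, and they would
  -- otherwise be learned as spam twice
  false_negatives = junk_mailbox:has_unkeyword(BOGOFILTER_JUNK):has_unkeyword(BOGOFILTER_UNSURE):is_smaller(MIN_SIZE)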
  for _, mesg in ipairs(false_negatives) do
    mbox, uid = unpack(mesg)
    message = mbox[uid]
    -- unlearn that it was okay (-N) and learn that it was spam (-s)
    pipe_to('bogofilter -Ns', message:fetch_message())
  end
  false_negatives:add_flags({ BOGOFILTER_JUNK })

  -- unsure positives
  -- messages which have been classified as unsure by bogofilter
  -- but were either moved to Junk or deleted
  unsure_positives = junk_mailbox:has_keyword(BOGOFILTER_UNSURE) + trash_mailbox:has_keyword(BOGOFILTER_UNSURE)
  for _, mesg in ipairs(unsure_positives) do
    mbox, uid = unpack(mesg)
    message = mbox[uid]
    -- learn that it was spam (-s)
    pipe_to('bogofilter -s', message:fetch_message())
  end
  unsure_positives:remove_flags({ BOGOFILTER_UNSURE })
  unsure_positives:add_flags({ BOGOFILTER_JUNK })

  -- new messages
  new_messages = inbox_messages:has_unkeyword(BOGOFILTER_EVALUATED)
  for _, mesg in ipairs(new_messages) do
    mbox, uid = unpack(mesg)
    message = mbox[uid]
    text = message:fetch_message()
    if type(text) == 'string' then
      -- 0 for spam; 1 for non-spam; 2 for unsure
      classification = pipe_to('bogofilter -u', text)
      s = Set {mesg}
      if classification == 0 then -- spam
        s:add_flags({ BOGOFILTER_JUNK })
        junk(s)
      elseif classification == 2 then -- unsure
        s:add_flags({ BOGOFILTER_UNSURE })
        mark_as_junk(s)
        -- also add a yellow flag so it's identifiable in iOS Mail
        s:mark_flagged()
        s:add_flags({ YELLOW_FLAG })
      end
    end
  end
  new_messages:add_flags({ BOGOFILTER_EVALUATED })
end

Final Thoughts

Starting directly with the above code would learn every mail in the Junk mailbox as spam again, because each one matches the false negatives search, even though bogofilter already learned them in the training phase. To avoid this, every message there needs the bogofilter-junk label. I ran the following helper function once before turning on junk filtering.

function setup_junk_filtering()
  -- Optional: mark everything in the inbox as evaluated
  -- inbox:select_all():add_flags({ BOGOFILTER_EVALUATED })
  -- mark everything in Junk as evaluated and junk
  junk_mailbox:select_all():add_flags({ BOGOFILTER_EVALUATED, BOGOFILTER_JUNK })
end

With all this in place it’s time to disable junk mail filtering in the macOS Mail.app preferences.

And because I’m chicken I didn’t start with the real Junk and Trash folders. I created imapfilter-Junk and imapfilter-Trash and used them for a while to keep an eye on what imapfilter and bogofilter are up to. So far they are doing great!
