Marko is a simple set of tools for calculating the probability that a body of text belongs to one group or another. In a nutshell, say you have two collections of text, such as speeches by Noam Chomsky and the Federalist Papers. Given a new piece of text, marko can be used to determine the probability that it belongs in one of the two collections.

Another example, and the one that inspired this work, is spam filtering. One body of text is your 'desirable, previously received email' and the other is your 'undesirable, previously read email'. Note that it doesn't necessarily have to be spam; it just has to be what you want and what you don't. Marko, when given a new message, will determine which of the two previously existing sets it *most likely* belongs to. It isn't 100% accurate, but it has very promising results.

What makes marko, marko, is that the tokenizer can be set to produce couplets, or tuples, for tokens. These, if used in the 'forward' direction, are the input to a markov chainer. We're using it backwards to determine which markov chainer most likely produced the text in question.

At the moment, marko is a mess of different perl files that were created as a proof of concept, and only the hardiest of explorers ought to play with it. Eventually it will become easier to use, as I am doing this work to help manage my own mailbox.

INSTALLATION:

For most of the tools to work, they need to reside together and be locatable via the environment variable $MARKO_HOME. Marko is not ready for production use, so installation is meant to be done in a user's directory, specifically '~/.marko'. For easiest experimentation, that is the best place to put the bin/ and db/ directories.

USING:

Here is the series of commands needed to create a couple of marko databases and then compare an unknown text to them.
The output of marko-db in -c (compare) mode is the probability of the unknown text belonging to the database on the "right", that is, the second database.

    mstrip file1 | mstream2mtext > file1.mtext
    mstrip file2 | mstream2mtext > file2.mtext
    mstrip file3 | mstream2mtext > file3.mtext
    marko-db -a file3.mtext -a file2.mtext -a file1.mtext -d foo

    mstrip fileA | mstream2mtext > fileA.mtext
    mstrip fileB | mstream2mtext > fileB.mtext
    mstrip fileC | mstream2mtext > fileC.mtext
    marko-db -a fileA.mtext -a fileB.mtext -a fileC.mtext -d bar

    mstrip unknown | mstream2mtext > unknown.mtext
    marko-db -c unknown.mtext -d foo -d bar

There are a couple of very crude scripts that (at the expense of large numbers of processes) make creation of the db easier on linux boxes or any other OS that supports the /proc/self/fd/0 mechanism. These are madd and mwhich. madd takes a database and a file to add; mwhich reads a file from standard input and compares it to the two databases on the command line.

There is nothing preventing you from having many different databases; the comparison of a file can be made between any two. The databases themselves, however, can grow rather large. My experimental ones take up over 50MB, contain only about 2,500 emails and took several hours to build. Note that the build process involves at least four forks, three opens and two writes to the databases. The current implementation is a proof of concept and the focus was on calculation of the correct values, not speed.

USING ON EMAIL:

If you want to experiment with marko on a body of email, you can use several new commands added in 1.1.

maildir2mdb -- this program will take a maildir and add every email in it to a given mdb.

    $ maildir2mdb ~/.Maildir/.ReadMail read-mail

This will add every email in the .ReadMail maildir to the read-mail mdb that resides in $MARKO_HOME/db/.

maildirSort -- This program may simply delete all the email on your system, so use with care.
It works with my email and I use it to experiment with my mdbs, but I make no guarantees or warranties. That said, its command line can be a little complex:

    $ maildirSort ~/.Maildir/ foo:2:0.1:fooMail bar:1:0.9:barMail

This command says that all the email in ~/.Maildir is to be sorted by comparing it with the foo and bar mdb databases in $MARKO_HOME/db/. It will weight entries in foo twice as much as entries in bar, and if the resulting probability of a message being in 'bar' (the right-hand db) is less than 0.1 it will move the message to the maildir fooMail. If the probability is greater than 0.9 it will move it into barMail; otherwise it will leave the mail where it found it.

Right now it does MUA-style manipulation of the maildir, where it assumes that no other programs are operating on the maildir. This should NOT be run at the same time as your mail reader. Not to mention that you shouldn't be working on your actual mailbox at all.

qmwhich -- this program does nothing yet, but it will turn into a qmail-command compatible version of maildirSort so that qmail can handle the actual deliveries, thus eliminating the risk of manipulating maildirs directly.

USING MARKO-DB:

Marko-db is the general purpose program that interacts with the marko databases. Marko databases (mdbs) are simply a runtime-usable version of an MTEXT -- currently a pair of GDBM files. An MTEXT is the only kind of input that marko-db accepts, and it is created using the mstrip and mstream2mtext utilities.

Once you have an mtext, you can do one of two things with it. The first is to add it to one or more mdbs.

    $ mstrip source.txt | mstream2mtext > file.mtext
    $ marko-db -a file.mtext -d foo

You can specify more than one mdb to add it to at the same time by giving multiple -d options.

    $ marko-db -a file.mtext -d foo -d bar

Having done that a few times, you end up with two different mdbs that you can then use with the compare parameter.
    $ marko-db -c file.mtext -d foo -d bar

This will output something like the following:

    $ marko-db -c file.mtext -d foo -d bar
    P(right) = 0.634207169138555

This means that the probability of the given text having come from the body of text on the "right" (the second -d command line parameter) is 0.6342 or 63.42%. The probability of it coming from the "left" database is simply 1 minus that, or approximately 36.58%. Certainly not a clear winner either way, but this was a contrived example. A recent copy of the "Nigerian Scam", when compared to my spam mdb and my read-mail mdb, had a probability of 9.48517594772734e-283 of belonging to read-mail. Certainly not something I'd be likely to want to read.

The last thing you can do with marko-db is export an mdb you have created. This essentially turns it back into an MTEXT, which is convenient if you want to use the data in some other application. This is what an export command looks like:

    $ marko-db -x -d foo

For every additional -x you add to the command line, you will increase the 'export threshold'. Essentially this is the minimum value that an entry must have for it to be exported.

THE MSTREAM and MTEXT FORMATS:

Mstream and mtext exist as a means to decouple the general storage and comparison activities of marko-db from the tokenizing activities, which are highly application-specific. This is easy to see if you consider the difference between tokenizing a family of documents written in latex and tokenizing a collection of plain text files.

Both formats have the same general layout. Here is an mtext of a file consisting of the single line "hello world".

    MARKOTEXT/1.0
    Format: mtext
    Token-Count: 3
    Source-Id: 6f5902ac237024bdd0c176cb93063dc4

    hello 1
    hello world 1
    world 1

As you can see, there is an opening line, a header and a body. The opening line signals to readers what format the following data is in, the header contains information about the data, and a blank line separates the data from the header.
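
To make the opening line / header / body layout concrete, here is a minimal Python sketch of a reader for the "hello world" example. It is not part of marko (which is written in perl). The function name parse_mtext is hypothetical, and the sketch assumes that the count is the last whitespace-separated field on each body line, which this README does not state.

```python
def parse_mtext(text):
    """Split an mtext document into (opening_line, headers, counts)."""
    lines = text.splitlines()
    opening = lines[0]

    # Header lines run until the first blank line. Keep them as a list of
    # pairs because a header name such as Source-Id may appear more than once.
    headers = []
    i = 1
    while i < len(lines) and lines[i].strip():
        name, _, value = lines[i].partition(":")
        headers.append((name.strip(), value.strip()))
        i += 1

    # Body lines: a token followed by its count.
    # ASSUMPTION: the count is the last whitespace-separated field, so a
    # token may itself contain spaces ("hello world").
    counts = {}
    for line in lines[i + 1:]:
        if not line.strip():
            continue
        token, count = line.rsplit(None, 1)
        # An mtext may repeat a token, so accumulate rather than overwrite.
        counts[token] = counts.get(token, 0) + int(count)
    return opening, headers, counts


SAMPLE = (
    "MARKOTEXT/1.0\n"
    "Format: mtext\n"
    "Token-Count: 3\n"
    "Source-Id: 6f5902ac237024bdd0c176cb93063dc4\n"
    "\n"
    "hello\t1\n"
    "hello world\t1\n"
    "world\t1\n"
)

if __name__ == "__main__":
    opening, headers, counts = parse_mtext(SAMPLE)
    print(opening)
    print(headers)
    print(counts)
```

Summing counts for repeated tokens is a design choice of this sketch; it is one reasonable way for a reader to cope with an mtext that lists the same token more than once.
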
The data itself varies depending upon whether the stream is mtext, mstream or mdb. In an mstream, the data is one token per line without any count information, the only restriction being that the newline and tab characters may not be part of any token. The mstream2mtext utility condenses this information by collecting and counting identical tokens to shorten the data stream. Note that an mtext may still contain replicated tokens. The mstream2mtext utility also condenses multiple different sources into a single file, with a separate Source-Id header entry for each source encountered.

The mdb format is nearly identical to the mtext format except for a few additional headers, namely the corpus-count and the date/time that the mdb was exported. The date is in unix time format (the number of seconds since Jan 1, 1970 UTC), optionally followed by a human-readable time display.
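
The condensing step described above can be illustrated with a short Python sketch. This is not the mstream2mtext implementation. The function names are hypothetical, the couplet tokenizer is only one guess that happens to reproduce the three tokens of the "hello world" example, and the tab between token and count is an assumption based on tabs being forbidden inside tokens.

```python
from collections import Counter


def couplet_tokens(words):
    """Yield each word, then the couplet it forms with the next word.

    Hypothetical tokenizer: consistent with the "hello world" example,
    but the real mstrip may tokenize differently.
    """
    for i, word in enumerate(words):
        yield word
        if i + 1 < len(words):
            yield word + " " + words[i + 1]


def condense(tokens):
    """Collect and count identical tokens, one mstream token per item."""
    counts = Counter()
    for token in tokens:
        # The formats forbid newline and tab characters inside a token.
        if "\t" in token or "\n" in token:
            raise ValueError("token contains a tab or newline: %r" % token)
        counts[token] += 1
    return counts


def format_body(counts):
    """Render counts as body lines.

    ASSUMPTION: token and count are separated by a tab.
    """
    return "".join(
        "%s\t%d\n" % (token, count) for token, count in sorted(counts.items())
    )


if __name__ == "__main__":
    tokens = list(couplet_tokens("hello world".split()))
    print(tokens)
    print(format_body(condense(tokens)), end="")
```

Headers such as Token-Count and Source-Id are deliberately left out, since this README does not say how they are computed.
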