Introducing fmmap

I recently polished some code I had lying around and can now introduce fmmap. It is a Python module that can be used instead of the built-in mmap module and offers better performance. My own interest was specifically in a faster .find() method. The “f” in fmmap might refer to “find”, “fast”, or someone’s name.

Memory mapping is a way of accessing a file as if it were just an array in memory. No explicit file reading or writing is required. As you access this area of memory, the operating system manages the input and output to the underlying file as necessary. In some circumstances this can result in better performance.
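To make that concrete, here is a minimal sketch using only the built-in mmap module. The temporary file and the function name demo_mmap are just for illustration.

```python
import mmap
import os
import tempfile


def demo_mmap():
    """Map a small temporary file and treat it like an array of bytes."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.write(b"hello memory-mapped world\n")
            f.flush()
            # A length of 0 maps the whole file.
            with mmap.mmap(f.fileno(), 0) as m:
                first = m[:5]                # slicing returns bytes
                position = m.find(b"world")  # search without an explicit read()
                m[0:5] = b"HELLO"            # same-length write through the mapping
                m.flush()
        # Read the file back the ordinary way to see the change.
        with open(path, "rb") as f:
            contents = f.read()
    finally:
        os.remove(path)
    return first, position, contents
```

Calling demo_mmap() returns the first five bytes, the offset of b"world", and the file contents after the in-place write.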

A few years ago I tried mmap in a toy program and got some performance gains. Then I noticed that the .find() method in CPython, while implemented in C, used a naive algorithm, and I wanted to see if I could improve performance further. First I tried to implement some of glibc’s algorithms in a Cython module, but eventually I got the best performance by calling the optimised functions in glibc directly. The whole process also refreshed and improved some of my knowledge of the C APIs for strings and memory.
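For readers unfamiliar with the term, a naive search simply tries every possible starting offset in turn. The following is a sketch of that idea in Python. It is not CPython's actual C code, and naive_find is a name made up for this illustration.

```python
def naive_find(haystack: bytes, needle: bytes) -> int:
    """Return the lowest index of needle in haystack, or -1 if absent."""
    n, m = len(haystack), len(needle)
    # Try each offset where the needle could still fit.
    for i in range(n - m + 1):
        if haystack[i:i + m] == needle:
            return i
    return -1
```

In the worst case this does work proportional to len(haystack) * len(needle), which is the kind of behaviour that more sophisticated search routines are designed to avoid.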

Now I have decided to take this code out into a project of its own, and that providing a drop-in replacement for the built-in mmap module would be the most pleasant way to expose it to other people. This of course brings all of the advantages, and the not so pleasant overhead, of project infrastructure, tests, CI setup, and rediscovering how to package Python packages in the ever-changing environment. However, I’m subclassing the built-in class, so I only had to implement the parts I’m interested in, or so I thought. In an attempt to do this well, I took the tests from the standard library (CPython) to test my implementation against. The tests from the PyPy project are not exactly the same, so I decided to drop those in as well. That ended up being a good decision, as they caught a few bugs that the tests in the standard library did not catch.
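The subclassing approach can be sketched roughly as follows. This is not fmmap's code: SliceFindMmap is a hypothetical class that overrides only .find(), and it does so by copying the mapping into a bytes object, which would defeat the purpose for a large file. It only shows the mechanics, including one detail that is easy to miss: the built-in .find() starts searching at the current file position when no start is given.

```python
import mmap


class SliceFindMmap(mmap.mmap):
    """Hypothetical subclass: everything except find() is inherited."""

    def find(self, sub, start=None, end=None):
        # Match the built-in default: search from the current position.
        if start is None:
            start = self.tell()
        if end is None:
            end = len(self)
        # self[:] copies the mapping into bytes (illustration only);
        # bytes.find interprets start and end as in slice notation.
        return self[:].find(sub, start, end)
```

An anonymous mapping is enough to try it out: SliceFindMmap(-1, 32) creates one, and methods such as .write(), .seek() and .rfind() still come from the built-in class.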

The test suite in the standard library develops over time, and the current version in git is aimed at the upcoming release (Python 3.9). In an attempt to be a good citizen of the Python world, I try to support as many Python versions as is feasible. However, each Python version added tests that don’t work on previous versions, due to bug fixes, new features and API changes. So I can’t test older Python versions (not even Python 3.8) with this single test suite without some difficulty. And so my project that wanted to expose one function to the world became a project to backport all the latest features. According to the test suite, it now works in Python 3.5 – 3.8, and I expect to fix the one remaining failure on Python 3.4 easily as part of providing an improved version of .rfind().

I haven’t yet done proper benchmarking, but you might find .find() substantially faster with fmmap. The exact performance is heavily dependent on your C library. Let me know how it works for you!
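If you want to try a quick comparison yourself, something like the sketch below could be a starting point. It assumes that fmmap is installed and exposes an mmap class with the same constructor as the built-in module, and it falls back to the built-in module otherwise. The helper time_find and its parameters are invented for this example, and a buffer of zero bytes is far from realistic data, so treat any numbers as rough.

```python
import mmap
import timeit

try:
    import fmmap as candidate  # assumption: installed, with a compatible mmap class
except ImportError:
    candidate = mmap  # fall back so that the sketch still runs


def time_find(module, size=1 << 20, repeats=20):
    """Return the average seconds per .find() call on an anonymous mapping."""
    m = module.mmap(-1, size)  # anonymous mapping, initially all zero bytes
    try:
        # Put the needle at the very end so the search scans the whole buffer.
        m.seek(size - 6)
        m.write(b"needle")
        m.seek(0)
        assert m.find(b"needle") == size - 6
        total = timeit.timeit(lambda: m.find(b"needle", 0), number=repeats)
        return total / repeats
    finally:
        m.close()
```

Comparing time_find(mmap) with time_find(candidate) on your own machine gives a first impression. Running it against data that resembles your real files will tell you more.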