This article describes a new online substring search algorithm that allows faster traversal of the searched string. The implementation presented here is, in the average case, substantially faster than other online substring search implementations.

Algorithm

A substring (needle) SS of length M is sought in a source (haystack) string S of length N. The algorithm steps sequentially through S and at each position probes a word W (2 or more bytes), testing whether it occurs in SS. The step length is    SS_size - W_size + 1    — the largest step for which at least one probed word is guaranteed to lie entirely inside any occurrence of SS (for example, with SS_size = 8 and W_size = 2 the step is 7). W_size is a small number (typically 2, 4 or 8 bytes).

The test whether a probed word occurs in SS is done with a structure similar to a hash table with open addressing, which is generated in a preprocessing step.

An empty hash cell indicates that W is not in SS. A non-empty cell tells how far W is from the head of SS.

For multi-string matching, the cell value is a pair: the substring index and the distance from that substring's head.
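
One possible cell layout for the multi-string case (a sketch only; the field names and widths are assumptions, not part of the reference implementation):

#include <stdint.h>

struct H_cell {
        uint16_t needle;        // index of the substring this word came from
        uint8_t  offset;        // distance of the word from that substring's head, + 1  (0 = empty cell)
};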

Once the probed word is known to occur in SS, the candidate occurrence starts at    probed_position - hash_cell_value;    those SS_size bytes are compared with SS, and if they are equal, the substring is found. For example, if SS = "abcdef" and the probed word "cd" (stored with distance 2 from the head of SS) is found at position P of S, the candidate occurrence starts at P - 2.

Complexity

The preprocessing phase takes O(M) space and O(M) average time; its worst-case time complexity is O(M²). The searching phase has O(N) average time complexity and O(NM) worst-case complexity.

Reference C++ implementation

The reference implementation works only on little-endian platforms that allow access to W at a misaligned address (such as x86/x86_64). The API is the same as std::search(): each string is defined by two pointers, the address of its first byte and the address of its last byte + 1. The reference implementation aims for code simplicity, not speed or genericity.

#include <stdint.h>
#include <algorithm>
#include <limits>


const char*  volnitsky(const char*  S,  const char*  Se,  const char*  SS,  const char*  SSe) {
        // only for little-endian platforms and where access to misaligned W is allowed

                typedef         uint16_t        W_t;
                typedef         uint8_t         offset_t;

        const size_t    W_size   = sizeof(W_t);
        const size_t    S_size   = Se - S;
        const size_t    SS_size  = SSe - SS;
        const size_t    step     = SS_size - W_size + 1;

        // args sizes check
        if  (
                    S_size   <   20000       //  see startup costs;  algo limit: 2*SS_size
                ||  SS_size  <   2*W_size-1
                ||  SS_size  >=  std::numeric_limits<offset_t>::max()
        )
                return  std::search(S,Se,SS,SSe);       // fall-back to std::search

        // make hash
        const size_t    H_size     = 64*1024;
        offset_t        H[H_size]  = {0};       // {0} - initializes ALL buckets with 0
        for  ( int  i = SS_size-W_size;    i >= 0;   i-- )   {
                size_t  h  =  *(W_t*) (SS+i)  % H_size;
                while (H[h])    h = (h+1) % H_size;             // find free cell
                H[h] = i+1;
        }

        // step through text
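        //   first probe at offset SS_size - W_size: the position of the last
        //   W-word of a hypothetical occurrence of SS starting at S[0]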
        const char*   P = S + SS_size - W_size;
        for  (;   P <= Se-SS_size;   P += step)  {
                for  (size_t  h = *(W_t*) P  %  H_size;    H[h];    h = (h+1) % H_size )  {
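                        // H[h]-1 is the offset of the probed word within SS,
                        // so R is the address where this occurrence of SS would start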
                        const char*    R  =  P - (H[h] - 1);
                        for  ( size_t i=0;  i<SS_size;  i++)  {
                                if  (R[i] != SS[i])   goto next_hash_cell;
                        }
                        return R;                       // found
                        next_hash_cell:;
                }
        }

        // check tail
        return  std::search(P-step+1,Se,SS,SSe);
}
Tip: MS Visual C++ does not ship stdint.h; MSVC++ 2010 has the equivalent cstdint, or you can get a standalone stdint.h from here
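
A minimal usage sketch (not part of the reference implementation). The haystack contents, its size, and the planted offset are arbitrary; volnitsky(), like std::search(), returns Se when nothing is found.

#include <cstdio>
#include <string>

// volnitsky() as defined above is assumed to be in the same translation unit

int main() {
        std::string haystack(50000, 'a');               // big enough to avoid the std::search() fall-back
        haystack.replace(30000, 6, "needle");           // plant the substring
        const std::string needle = "needle";

        const char* r = volnitsky(haystack.data(), haystack.data() + haystack.size(),
                                  needle.data(),   needle.data()   + needle.size());

        if (r != haystack.data() + haystack.size())
                std::printf("found at offset %ld\n", (long)(r - haystack.data()));
        else
                std::printf("not found\n");
}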

Endianness and Alignment

An example of endianness and alignment effects:

char      S[] =  "ABC";
char*     P   =  & S[1];                // points at B
uint16_t  W   =  * (uint16_t*) P;       // big endian - "AB";   little endian - "BC"
                                        // on platforms that allow only aligned access - hardware fault

Modifications to reference implementation needed for big-endian platforms are trivial.

For platforms that don't allow access to W at a misaligned address, there are several ways to work around the restriction. One is to use only aligned words: stepping starts at an aligned address and the step is rounded to a multiple of the platform word size. Another is to extract the unaligned word from two aligned words, doing in software what the CPU does in microcode, as in the sketch below.
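
A sketch of the second approach for a 16-bit W, assuming a little-endian layout; boundary handling is deliberately omitted.

#include <stdint.h>

// Read a 16-bit word at a possibly misaligned address using only aligned loads.
// Little-endian layout assumed; the load at a+2 may read one byte past the probed
// word, which a real implementation would have to guard against at the buffer end.
uint16_t load16_unaligned(const char* p) {
        uintptr_t  a  = (uintptr_t)p & ~(uintptr_t)1;   // round p down to 2-byte alignment
        uint16_t   lo = *(const uint16_t*)a;            // aligned word covering p[0]
        if (((uintptr_t)p & 1) == 0)
                return lo;                              // p was aligned to begin with
        uint16_t   hi = *(const uint16_t*)(a + 2);      // aligned word covering p[1]
        return (uint16_t)((lo >> 8) | (hi << 8));       // stitch the two bytes back together
}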

Table 1. Endianness and misaligned access

Platform              Endianness   Can access misaligned
x86, x86_64           Little       yes
Itanium/IA-64         Bi           no
ARM                   Bi           ?
iPhone ARM            Little       ?
Xbox 1                Little       ?
Xbox 360              Big          ?
SPARC V9              Bi           no
PowerPC               Bi           yes
IBM POWER             Big          ?
IBM z/Architecture    Big          ?
PlayStation 3         Big          ?
Nintendo Wii          Big          ?
Java VM               Big          ?

Algorithm and Implementation Variations

  • For the general case (binary data search), the hash table should fit into the L1 cache (L2 or L3 for long SS).

  • The main factor in the construction cost is zeroing out the hash table, so the construction cost is proportional to H_size. For faster preprocessing, the hash table can be made smaller.

  • For small-alphabet search (such as online DNA search), the W_t type should be the biggest native platform type (uint64_t on x86_64).

  • For a long SS or a big multi-string set (as in virus scanning), it might be faster to do the in-SS test against a bitmap kept in parallel with the hash table rather than against the hash table itself. Hash construction might also be faster, since the full table then does not need to be zeroed out.

  • For Unicode and case-insensitive search, hash construction should use all possible word variants. For example, for a case-insensitive search for a string containing the word "ab", we need to add all possible case combinations to the hash: "ab", "Ab", "aB" and "AB" (see the sketch after this list). Though not yet tested, it seems that this algorithm should offer an even bigger speed advantage over other algorithms in these cases.
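
A sketch of the case-insensitive idea for a 2-byte ASCII word. The helper name and exact cell encoding are assumptions, not part of the reference implementation; the hash and open addressing mirror the reference code above.

#include <stddef.h>
#include <stdint.h>
#include <ctype.h>

typedef uint16_t W_t;
typedef uint8_t  offset_t;
static const size_t H_size = 64*1024;

void add_case_variants(offset_t H[], const char* w, offset_t value) {
        unsigned char c0[2] = { (unsigned char)tolower((unsigned char)w[0]),
                                (unsigned char)toupper((unsigned char)w[0]) };
        unsigned char c1[2] = { (unsigned char)tolower((unsigned char)w[1]),
                                (unsigned char)toupper((unsigned char)w[1]) };
        for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
                if (i == 1 && c0[1] == c0[0])  continue;        // skip duplicate variants (non-letters)
                if (j == 1 && c1[1] == c1[0])  continue;
                W_t    word = (W_t)(c0[i] | (c1[j] << 8));      // little-endian byte order, as in the reference code
                size_t h    = word % H_size;
                while (H[h])  h = (h+1) % H_size;               // open addressing: find a free cell
                H[h] = value;                                   // offset of the word within SS, + 1
        }
}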

Benchmark

Table 2 shows how many CPU ticks are spent per byte of S. A tick is one CPU cycle, about 0.45 nanoseconds on a 2200 MHz CPU. Tick counts were read with the RDTSC instruction, roughly as in the sketch below.
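
A rough sketch of the measurement, not the actual benchmark harness; __rdtsc() is the compiler intrinsic for the RDTSC instruction.

#include <x86intrin.h>
#include <stdio.h>

const char* volnitsky(const char*, const char*, const char*, const char*);     // from above

void benchmark(const char* S, const char* Se, const char* SS, const char* SSe) {
        unsigned long long best = ~0ull;
        for (int run = 0; run < 8; run++) {                     // best of 8 runs, as in the article
                unsigned long long t0 = __rdtsc();
                volatile const char* r = volnitsky(S, Se, SS, SSe);
                (void)r;
                unsigned long long t1 = __rdtsc();
                if (t1 - t0 < best)  best = t1 - t0;
        }
        printf("%.3f ticks/byte\n", (double)best / (double)(Se - S));
}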

Times are shown for different implementations and different substring lengths. Benchmarking was done on a 50 MB text corpus taken from a Wikipedia text dump. The substrings are the first M bytes of the following string:

8'E . It consists of a number of low-lying, largely mangrove covered islands
covering an area of around 665 km².  The population of Bakassi is the subject
of some dispute, but is generally put at between 150,000 and 300,000 people

OS power management was turned off. Each test was run 8 times and the best time was taken. Everything was compiled with GCC-4.5.1 (-O3 -march=native) and run on x86_64 Linux on a Core 2 Duo CPU at 2200 MHz. The comparison was done mostly against the Boyer-Moore family of algorithms, as they are believed to be the fastest.

Table 2. CPU ticks per haystack byte, N=50MB; columns are M, the substring length

Implementation              M=3     M=4     M=6     M=8     M=13    M=30    M=120   M=230
GLIBC memmem()              3.85    6.49    7.89    4.53    6.12    4.47    0.772   0.777
naive memmem()              1.45    1.45    1.45    1.45    1.45    1.45    1.45    1.45
std::search()               1.31    1.31    1.31    1.31    1.31    1.31    1.31    1.31
bm_horspool()               2.99    3.02    2.24    1.51    1.12    1.01    0.78    0.774
bm_tuned()                  3.12    4.98    3.62    2.04    1.44    1.52    0.877   0.876
bm_horspool_yliluoma()      3.22    4.71    3.43    1.95    1.35    1.42    0.843   0.848
bm_hal()                    4.7     3.17    1.97    1.41    0.963   0.606   0.75    0.771
volnitsky()                 1.05    0.802   0.582   0.517   0.504   0.493   0.538   0.496
volnitsky_x86_64()          1.05    0.802   0.591   0.525   0.434   0.423   0.282   0.132

Compared Implementations
  • GLIBC memmem() — GNU libc 2.11.2. See [6],[7] about its poor performance. Uses the Two-Way(?) algorithm.

  • naive memmem() — naive implementation; source in naive.cc.

  • std::search() — STL algorithm from libstdc++ 3.3.6 (bundled with GCC-4.5.1).

  • bm_horspool() — Boyer-Moore-Horspool C implementation from [17].

  • bm_tuned() — Tuned Boyer-Moore C implementation from [18].

  • bm_horspool_yliluoma() — Boyer-Moore-Horspool implementation by Joel Yliluoma with Hongli Lai's optimization for Intel [20].

  • bm_hal() — Boyer-Moore Hashed Accelerated Linear algorithm, C++ implementation [21].

  • volnitsky() — reference implementation of the algorithm described here.

  • volnitsky_x86_64() — closed-source implementation optimized for Intel. There is still room for improvement: no assembler or SSE is used so far.

Several other implementations were tested, but their results are not shown; they were either slow or buggy.

Tested but not shown
  • boost::find_first() v1.41 — about 2x slower than std::search().

  • klibc memmem()/strstr() — slow.

  • boyer_moore() — from [3] (Wikipedia); if compiled with optimization, it returns incorrect results.

  • strstr() — from [12]; according to its author, it was the fastest implementation of strstr() in 2007. In my tests it is only slightly faster than naive_strstr().

  • GLIBC strstr() — GNU libc 2.11.2. Very slow, about 20x slower than the algorithm described here.

  • xc_boyer_moore() — from [19]. Sometimes an incorrect result was returned (not found).

Note: C-style zero-terminated string functions are at a disadvantage, because checking for the end of the string is more expensive.

Construction cost overhead

Below are the results of testing startup costs for different N with fixed M=6. bm_hal() was chosen for comparison because it has the fastest construction of all the BM algorithms tested. For this benchmark, the fall-back to std::search() was disabled in volnitsky().

Table 3. Construction costs: CPU ticks per haystack byte, M=6

N          std::search()   bm_hal()   volnitsky()
1000000    1.24            1.91       0.48
100000     1.22            1.90       0.59
50000      1.23            1.91       0.71
20000      1.34            1.92       1.15
10000      1.39            2.13       1.47
5000       1.44            2.02       2.44

In the reference implementation, if S_size is smaller than 20000, the algorithm falls back to std::search().

License

License is Public Domain.

Source

To compile the benchmark, the missing include files can be found at http://github.com/lvv/lvvlib

Change Log

  • 12/15/2010: Fixed: SS not found if SS starts at byte 0 of S.

  • 2/1/2011: Fixed: SS not found if S is too short; out-of-bounds access.

  • 2/12/2011: Corrected worst-case preprocessing time complexity.

  • 2/14/2011: Fixed: minor error in naive_memmem().

  • 2/14/2011: Changed S_size threshold to 20000.

  • 2/14/2011: Added startup-costs benchmark.

  • 3/3/2011: Fixed: in some cases the occurrence of SS that was found was not the first one.

  • 3/6/2011: Step-loop simplification.

Caution: this algorithm has not been peer-reviewed. Any feedback would be appreciated.

References