Results for simplistic startcode scan micro benchmark:
Core 2 Duo:
Found 13437824 matches in 2.1235 seconds three byte
Found 13437824 matches in 3.0758 seconds memchr()
Found 13437824 matches in 6.1899 seconds one byte (current)
Found 13437824 matches in 5.1797 seconds one byte bit shift
Found 13437824 matches in 2.1536 seconds ignore this one
The one byte bit shift is slightly better than current on the Core 2 Duo (5.18 s vs 6.19 s).
Celeron:
Found 13437824 matches in 42.4303 seconds three byte
Found 13437824 matches in 54.9332 seconds memchr()
Found 13437824 matches in 96.5829 seconds one byte (current)
Found 13437824 matches in 109.1286 seconds one byte bit shift
Found 13437824 matches in 42.4092 seconds ignore this one
The one byte bit shift is worse than current on the Celeron (109.13 s vs 96.58 s).
Core 2 Duo:
Found 13437824 matches in 2.1787 seconds three byte
Found 13437824 matches in 3.1294 seconds memchr()
Found 13437824 matches in 6.1963 seconds one byte (current)
Found 13437824 matches in 5.1827 seconds one byte bit shift
Found 13437824 matches in 5.5786 seconds ((_header_pos & 0xffffff00) == 0x00000100)
Celeron:
Found 13437824 matches in 42.4076 seconds three byte
Found 13437824 matches in 54.8853 seconds memchr()
Found 13437824 matches in 96.5479 seconds one byte (current)
Found 13437824 matches in 109.0918 seconds one byte bit shift
Found 13437824 matches in 99.1737 seconds ((_header_pos & 0xffffff00) == 0x00000100)
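The ((_header_pos & 0xffffff00) == 0x00000100) test does appear to be a bit quicker than the one byte bit shift on the Celeron (99.17 s vs 109.09 s), though it is still slower than current (96.55 s).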
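
For reference, here is a minimal sketch of what two of the variants above might look like. The benchmark source is not included in these notes, so the function names, the header_pos variable and the buffer handling are all assumptions made for illustration, not the code that produced the timings. Both functions count 00 00 01 prefixes that are followed by at least one more byte, so they should agree on any input.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One byte at a time: shift each byte into a 32-bit window and test
 * whether the upper three bytes are 00 00 01.
 * header_pos is a hypothetical stand-in for _header_pos. */
static size_t count_startcodes_mask(const uint8_t *buf, size_t len)
{
    uint32_t header_pos = 0xffffffffu; /* cannot match until 4 bytes are read */
    size_t matches = 0;

    for (size_t i = 0; i < len; i++) {
        header_pos = (header_pos << 8) | buf[i];
        if ((header_pos & 0xffffff00u) == 0x00000100u)
            matches++;
    }
    return matches;
}

/* memchr() variant: jump to each 0x01 byte and check that the two
 * bytes before it are zero. The last byte of the buffer is excluded
 * from the search so that a matched prefix always has a byte after it,
 * which keeps the count identical to the mask variant. */
static size_t count_startcodes_memchr(const uint8_t *buf, size_t len)
{
    size_t matches = 0;

    if (len < 4)
        return 0;

    const uint8_t *p = buf + 2;         /* first position a 0x01 can match */
    const uint8_t *end = buf + len - 1; /* exclusive search limit */

    while (p < end) {
        p = memchr(p, 0x01, (size_t)(end - p));
        if (p == NULL)
            break;
        if (p[-1] == 0x00 && p[-2] == 0x00)
            matches++;
        p++;
    }
    return matches;
}
```

The memchr() variant skips most bytes without touching the 32-bit window, which is one plausible reason it lands between the three byte and one byte scans in the timings above. That explanation is a guess, not something these measurements establish.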