I've got a 170 MB file I want to search for lines that look like:
INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one
This code runs in 1.3 seconds:
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1
print count
If I add a pre-filter before the regex, it runs in 0.78 seconds (about twice as fast):
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    if 'ENQ' not in line:
        continue
    m = pattern.search(line)
    if m:
        count += 1
print count
Every line that contains 'ENQ' also matches the full regex (61,425 matching lines out of 2.1 million total). I don't understand why the first version is so much slower.
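In case it matters, here's roughly the sanity check behind that claim (same error.log as above):

import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
with_enq = 0
with_enq_and_match = 0
for line in open('error.log'):
    if 'ENQ' in line:
        with_enq += 1
        if pattern.search(line):
            with_enq_and_match += 1
print with_enq, with_enq_and_match  # both print 61425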
Once the regex is compiled, it should be a state-machine pattern matcher, and it should take O(n) in the length of the input to discover that a line doesn't even contain "ENQ". That's exactly how long "if 'ENQ' not in line" should take to run as well. Why is doing twice the work also twice the speed?
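To isolate the per-line cost of the two checks, here's a minimal timeit comparison I ran (the sample line is invented for illustration and doesn't match; absolute numbers will vary by machine):

import re
import timeit

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
# A made-up non-matching line, similar in length to the real log lines.
line = 'INFO (6): songza.amie.history - some other event entirely'

# Cost of the plain substring check, repeated a million times.
t_in = timeit.timeit(lambda: 'ENQ' in line, number=1000000)
# Cost of the compiled regex search on the same line.
t_re = timeit.timeit(lambda: pattern.search(line), number=1000000)

print 'substring check: %.3f s per 1M calls' % t_in
print 'regex search:    %.3f s per 1M calls' % t_re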
I'm running Python 2.7.3 on Ubuntu Precise, x86_64.