top button
Flag Notify
    Connect to us
      Site Registration

Site Registration

Why is regex so slow on python?

0 votes
770 views

I've got a 170 MB file I want to search for lines that look like:

INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one

This code runs in 1.3 seconds:

import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
 m = pattern.search(line)
 if m:
 count += 1

print count

If I add a pre-filter before the regex, it runs in 0.78 seconds (about twice the speed!)

import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
 if 'ENQ' not in line:
 continue
 m = pattern.search(line)
 if m:
 count += 1

print count

Every line which contains 'ENQ' also matches the full regex (61425 lines match, out of 2.1 million total). I don't understand why the first way is so much slower.

Once the regex is compiled, you should have a state machine pattern matcher. It should be O(n) in the length of the input to figure out that it doesn't match as far as "ENQ". And that's exactly how long it should take for "if 'ENQ' not in line" to run as well. Why is doing twice the work also twice the speed?

I'm running Python 2.7.3 on Ubuntu Precise, x86_64.

posted Jun 18, 2013 by anonymous

Share this question
Facebook Share Button Twitter Share Button LinkedIn Share Button

1 Answer

0 votes

I have no obvious answers, but a couple suggestions:

  1. Can you anchor the pattern at the beginning of the line? (use match() instead of search())
  2. Does it get faster it you eliminate the "(.*)" part of the pattern? It seems that if you find a line matching the first part of the pattern, you could just as easily split the line yourself instead of creating a group.
answer Jun 18, 2013 by anonymous
Similar Questions
+1 vote

I want to do the Boolean search over various sentences or documents. I do not want to use special programs like Whoosh, etc.

May I use any other parser? If anybody may kindly let me know.

+2 votes

I need to search through a directory of text files for a string. Here is a short program I made in the past to search through a single text file for a line of text.

How can I modify the code to search through a directory of files that have different filenames, but the same extension?

fname = raw_input("Enter file name: ") #"*.txt"
fh = open(fname)
lst = list()
biglst=[]
for line in fh:
 line=line.rstrip()
 line=line.split()
 biglst+=line
final=[]
for out in biglst:
 if out not in final:
 final.append(out)
final.sort()
print (final)
+1 vote

I have about 500 search queries, and about 52000 files in which I have to find all matches for each of the 500 queries.

How should I approach this? Seems like the straightforward way to do it would be to loop through each of the files and go line by line comparing all the terms to the query, but this seems like it would take too long.

Can someone give me a suggestion as to how to minimize the search time?

+4 votes

Probably I'm turning the use of regular expressions upside down with this question. I don't want to write a regex that matches prefixes of other strings, I know how to do that. I want to generate a regex -- given another regex --, that matches all possible strings that are a prefix of a string that matches the given regex.

E.g. You have the regex ^[a-z]*4R$ then the strings "a", "ab", "A4" "ab4" are prefixes of this regex (because there is a way of adding characters that causes the regex to match), but "4a" or "a44" or not.
How do I programmatically create a regex that matches "a", "ab", "A4", etc.. but not "4a", "a44", etc..

Logically, I'd think it should be possible by running the input string against the state machine that the given regex describes, and if at some point all the input characters are consumed, it's a match. (We don't have to run the regex until the end.) But I cannot find any library that does it...

+2 votes

I am trying to search the phrase of numbers in a html page in the
sentence below:

(253 items)

I used this piece of code, but it does not work,

limit= page.search("div[class=Results]").search("div").gsub("items","")

 begin
 Integer(limit)
 rescue
 return 0
 end

Would you give me any suggestion on this?

...