
diacritical insensitive search in python

0 votes
477 views

One feature that seems to be missing from the re module (or any text-searching tool I know of) is "diacritic-insensitive search". I would like to have a match for something like this:

re.match("franc", "français")

in much the same way we can have a case-insensitive search:

re.match("(?i)fran", "Français")

Another related and more general problem (in the sense that it could easily be used to solve the first one) would be to translate a string, removing any diacritical marks:

nodiac("Français") -> "Francais"

The algorithm for such a function is trivial, but there are a lot of marks that can be put on a letter. It would be necessary to have the list of every "a" with something on it, i.e. "à, á, ã", etc., and this for every letter. Building such a list by hand would be tedious and would inevitably leave some symbols out.
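For reference, a minimal sketch of such a nodiac function using the standard unicodedata module (the name nodiac is taken from the question; decomposing with NFD and dropping combining marks is one common approach, not the only one):

import unicodedata

def nodiac(s):
    # Decompose each character into its base character plus combining
    # marks (NFD), then drop the marks (Unicode category 'Mn').
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

nodiac("Français")  # -> 'Francais'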

posted May 17, 2013 by anonymous


1 Answer

0 votes

The handling of diacriticals is a particularly nice case study. One can use it to play with some specific features of Unicode: normalisation, decomposition, ...

... and also to show how Unicode can be badly implemented.

A first, quick example that came to mind (Python 3.2.5 and 3.3.2):

>>> timeit.repeat("ud.normalize('NFKC', ud.normalize('NFKD', 'ᶑḗḖḕḹ'))", "import unicodedata as ud")
[2.929404406789672, 2.923327801150208, 2.923659417064755]
>>> timeit.repeat("ud.normalize('NFKC', ud.normalize('NFKD', 'ᶑḗḖḕḹ'))", "import unicodedata as ud")
[3.8437222586746884, 3.829490737203514, 3.819266963414293]
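To tie this back to the question: once a string is decomposed, a diacritic-insensitive match can be had by stripping the combining marks from the text before searching. A sketch, reusing the nodiac idea from the question:

import re
import unicodedata

def nodiac(s):
    # NFD decomposition, then drop combining marks (category 'Mn').
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

re.match("franc", nodiac("français"))      # matches
re.match("(?i)franc", nodiac("Français"))  # case- and diacritic-insensitive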
answer May 17, 2013 by anonymous
Similar Questions
+1 vote

I want to do Boolean search over various sentences or documents, and I do not want to use specialized packages like Whoosh.

May I use any other parser? If anybody could kindly let me know, I would appreciate it.
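One way to do this with only the standard library (a sketch; the crude tokenizer and the hard-coded AND query are assumptions): index each document as a set of lowercased words and evaluate Boolean operators as set membership tests.

import re

def tokens(text):
    # Crude word tokenizer; adjust to your notion of a term.
    return set(re.findall(r"\w+", text.lower()))

docs = ["Python is dynamically typed", "Java is statically typed"]
indexed = [(d, tokens(d)) for d in docs]

# Example query: python AND typed
hits = [d for d, t in indexed if "python" in t and "typed" in t]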

0 votes

This is my dilemma: I'm trying to get the generated JSON file using the Bing API search.

This is the code I'm executing from inside the shell:
http://bin.cakephp.org/view/460660617

The port doesn't matter to me. Thoughts?

+1 vote

I have about 500 search queries, and about 52000 files in which I have to find all matches for each of the 500 queries.

How should I approach this? The straightforward way would be to loop through each file line by line, comparing every query against every line, but that seems like it would take too long.

Can someone give me a suggestion as to how to minimize the search time?
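One common trick (a sketch; the file-reading details are assumptions): combine all the queries into a single alternation pattern, so each line is scanned once instead of 500 times.

import re

queries = ["foo", "bar baz", "qux"]  # stand-ins for the 500 queries
# re.escape protects queries that contain regex metacharacters.
pattern = re.compile("|".join(re.escape(q) for q in queries))

def matches_in(path):
    # Yield (line number, matched query) pairs for one file.
    with open(path, encoding="utf-8", errors="ignore") as f:
        for lineno, line in enumerate(f, 1):
            for m in pattern.finditer(line):
                yield lineno, m.group(0)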

0 votes

I am designing an abstraction layer over a select few NoSQL and SQL databases.

Specifically:

  • Redis, Neo4j, MongoDB, CouchDB
  • PostgreSQL

Being inexperienced, I find it hard to know a nice way of abstracting search. For conciseness, think of Table as meaning table, object, entity or key; and name as meaning name or type.

Maybe res = Table.name.search()

Or on multiple Tables:

res = AbstractDB().AbstractSearch()

Then: res.paginate(limit=25, offset=5)

Or if you want all: res.all()

And additionally borrow/alias from a relevant subset of PEP 249, e.g. fetchone and fetchmany.

I will open-source this once it has sufficient functionality. Any suggestions?
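A minimal sketch of what the result object could look like (all names follow the pseudocode above; none of this is a real library):

class ResultSet:
    def __init__(self, rows):
        self._rows = list(rows)

    def paginate(self, limit=25, offset=0):
        # Slice-based pagination, as in res.paginate(limit=25, offset=5).
        return self._rows[offset:offset + limit]

    def all(self):
        return self._rows

    # PEP 249-style aliases, as suggested above.
    def fetchone(self):
        return self._rows[0] if self._rows else None

    def fetchmany(self, size=25):
        return self._rows[:size]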

...