top button
Flag Notify
    Connect to us
      Site Registration

Site Registration

Joining two tables on string-likeness

+1 vote
331 views

I wish to join two tables on likeness, not equality, of character strings. Soundex does not work. I am using the Levenstein edit distance, written in SQL, a very costly test, and I am in no position to write it in C and link it to MySQL--and joining on equality takes a fraction of a second, and this takes hours. Any good ideas?

posted Jun 3, 2013 by anonymous

Share this question
Facebook Share Button Twitter Share Button LinkedIn Share Button

2 Answers

+1 vote
 
Best answer

Equality checks have a linear cost of O(min(len1,len2)) and can make use of indexes, too, while Levenshtein cost is is almost quadratic O(len1*len2) and can't make any good use of indexes ... even using a C UDF would help only so far with this kind of complexity. It will increase performance by a constant factor, but given long enough input strings the len1*len2 factor will still account for the majority
of the run time increase over simple equality comparions

there are a few possible points of optimization though, first of all you can cut off equal start and end sequences (linear complexity for that part instead of quadratic). You can also add a few more tricks
if you are only interested in matches below a certain distance threshold:

  • if string lengths differ by more than the threshold value you can rule out this pair of strings as being "similar" right away

  • while iterating over the distance array keep track of the min. distance value of the current row ... if at the end of a row is larger than the threshold distance you can terminate right away

  • only calculate operation cost, not operation type

  • do not maintain a full len1*len2 array, having only the previous and current row in two one dimensional arrays is sufficient (this esp. helps in C implementation as the functions working set is more likely to fit into CPU caches)

answer Jun 3, 2013 by anonymous
+1 vote

Looks like Sphinx might be a helpful addition to your stack. Way too much to explain how to do so here, but it's a proper fulltext engine, and has a MySQL-compatible interface.

answer Jun 3, 2013 by anonymous
Similar Questions
0 votes

How can I insert data into one table from two other tables where i have three tables namely users, role and userrole.
Now I want to insert the data into userrole table from users table and role table with a single statement.

+1 vote

I have a test cluster of two machines, on both of them hadoop is installed. I have configured the hadoop cluster but on admin UI (as in the below picture) I see that two nodes are running on the same master machine, and that the other machine has no Hadoop node.

On master machine following services are running:

~$ jps 26310 ResourceManager 27593 Jps 26216 DataNode 26135 NameNode 26557 NodeManager 26701 JobHistoryServer 

On the slave machine:

~$ jps 2614 DataNode 2920 Jps 2707 NodeManager 

I don't why the slave is not joining the cluster (It was before). I tried to shutdown all servers on both machines and format HDFS then restarting everything but that didnot help. Any help to figure whats causing that behavior is appreciated.

...