Convert a string to only have regular alphabet characters, from any language


This morning someone broke our school search (which uses a MySQL MATCH ... AGAINST full-text query) by typing '𝓵𝓲𝓵𝓵𝔂𝓬𝓻𝓸𝓯𝓽' into the search box. This is on a Ruby on Rails site. This is the error I got in the logs:

Mysql::Error: syntax error, unexpected $end: SELECT record_id, alternative_name_string, address_and_urn_string, id_string, name, ((((MATCH (name) AGAINST ("+𝓵𝓲𝓵𝓵𝔂𝓬𝓻𝓸𝓯𝓽" IN BOOLEAN MODE))*20.0) + ((MATCH (address_and_urn_string,alternative_name_string,id_string) AGAINST ("+𝓵𝓲𝓵𝓵𝔂𝓬𝓻𝓸𝓯𝓽" IN BOOLEAN MODE))*5.0))*wanted_school_type_rating) AS score FROM squirrel_schools WHERE MATCH (alternative_name_string,address_and_urn_string,id_string,name) AGAINST ("+𝓵𝓲𝓵𝓵𝔂𝓬𝓻𝓸𝓯𝓽" IN BOOLEAN MODE) > 0 ORDER BY score DESC LIMIT 25000

I do want to allow searches for things like "trường đại học", which is Vietnamese, so it's not as simple as "limit to ASCII". It's more like "limit to the normal set of language characters". But then what counts as 'normal', and how can we define the boundaries? Is there some kind of cutoff where we can say "anything higher than this value is a symbol, not an actual letter"?
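
One candidate for such a cutoff, offered as an assumption rather than a confirmed diagnosis: the styled letters in the failing search belong to Unicode's Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF), which sits above U+FFFF, outside the Basic Multilingual Plane. The Vietnamese letters all sit inside it. Characters above U+FFFF take four bytes in UTF-8, which MySQL's older 3-byte `utf8` character set cannot represent. A small Ruby sketch for checking where a string's characters fall (the method name is made up for illustration, and the input is assumed to be a UTF-8 string):

```ruby
# Hypothetical helper: list the code points (in hex) of every character
# that lies above U+FFFF, i.e. outside the Basic Multilingual Plane.
def chars_outside_bmp(str)
  str.each_char
     .select { |c| c.ord > 0xFFFF }
     .map    { |c| format("U+%X", c.ord) }
end

# Vietnamese text stays within the BMP, so nothing is reported.
chars_outside_bmp("trường đại học")

# Mathematical script letters are reported, one entry per character.
chars_outside_bmp("\u{1D4F5}\u{1D4F2}")
```

This only tells you which side of the U+FFFF boundary each character is on. It does not judge whether a character is a "real" letter.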

Note that I'm not trying to convert 𝓵𝓲𝓵𝓵𝔂𝓬𝓻𝓸𝓯𝓽 into "lillycroft" here. If someone enters that in the search, I don't need them to get anything back. I just need it to not crash the search engine.

Is there something I can do that's specific to MySQL, to sanitize the string before passing it to the search? I'm fine if that function just throws up its hands and converts 𝓵𝓲𝓵𝓵𝔂𝓬𝓻𝓸𝓯𝓽 into an empty string. But it does need to still be able to search for "trường đại học".
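
A minimal sketch of that kind of sanitizer, done on the Ruby side before the query is built rather than in MySQL itself. It assumes the failure comes from characters above U+FFFF and that the input is a UTF-8 string. The method names are hypothetical, not part of Rails or any MySQL adapter:

```ruby
# Hypothetical sanitizer: remove every character above U+FFFF (outside the
# Basic Multilingual Plane), then trim surrounding whitespace.
def sanitize_search_term(term)
  term.to_s.gsub(/[\u{10000}-\u{10FFFF}]/, "").strip
end

# Hypothetical guard: true only when something searchable is left,
# so the MATCH ... AGAINST query can be skipped for empty terms.
def searchable_term?(term)
  !sanitize_search_term(term).empty?
end

sanitize_search_term("trường đại học")      # Vietnamese text is left as it is
sanitize_search_term("\u{1D4F5}\u{1D4F2}")  # script letters collapse to ""
```

Skipping the query when the sanitized term is empty avoids sending an empty boolean-mode expression to MySQL. If the tables and connection can be moved to `utf8mb4`, which supports four-byte characters, the stripping step may not be needed at all, though that is a larger change than filtering the input.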
