Mysql Fulltext Index Htmlentities

Mysql fulltext indexes and html entities

Using the builtin myisam fulltext index to search for text in a table with htmlencoded text gives bit of a strange results. Perhaps it could be expected, but for convenience, it would have been nice to have it working…but apparently its not…

The following table

CREATE TABLE `texts` (
  `id` int(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  `content` varchar(255) NOT NULL DEFAULT '',
  PRIMARY KEY  (`id`),
  FULLTEXT KEY `content` (`content`)
) ENGINE=MyISAM AUTO_INCREMENT=4 DEFAULT CHARSET=latin1

…with the following data…

INSERT INTO `texts` VALUES (1,'Har en bra ledare medfödda egenskaper ell');
INSERT INTO `texts` VALUES (2,'frågan är hur det blir');
INSERT INTO `texts` VALUES (3,'o vilken olidlig smärta');

and the following words in the stopwords file(removing those later made me realize it kind of works, but only most of the times, and in this simple example).

smärta
ä

I would expect those queries to return at least a line each

mysql> SELECT * FROM texts WHERE match(content) against('smärta');
Empty SET (0.00 sec)
 
mysql> SELECT * FROM texts WHERE match(content) against('medfödda');
+----+------------------------------------------------+
| id | content                                        |
+----+------------------------------------------------+
|  1 | Har en bra ledare medfödda egenskaper ell |
+----+------------------------------------------------+
1 row IN SET (0.00 sec)
 
mysql> SELECT * FROM texts WHERE match(content) against('frågan');
+----+-----------------------------------+
| id | content                           |
+----+-----------------------------------+
|  2 | frågan är hur det blir |
+----+-----------------------------------+
1 row IN SET (0.00 sec)
 

It appears in the output above that the word “medfödda” returns a proper match(the same applies to “frågan”), but looking at the results from myisam_ftdump below indicates that the actual words matched are really “medf” & “ouml” & “dda” which gives somewhat of a improper match.

Changing ft_min_word_len to 1 character doesnt solve the problem, it gives a bit more output from myisam_ftdump but not the actual words that I expected unfortunately.

ptm:/home/eric# myisam_ftdump -d /var/lib/mysql/fulltext/texts 1
       38            0.9254975 aring
       38            0.9254975 blir
        0            0.9062075 bra
        0            0.9062075 dda
       38            0.9254975 det
        0            0.9062075 egenskaper
        0            0.9062075 ell
        0            0.9062075 en
       38            0.9254975 fr
       38            0.9254975 gan
        0            0.9062075 har
       38            0.9254975 hur
        0            0.9062075 ledare
        0            0.9062075 medf
       64            0.9666505 o
       64            0.9666505 olidlig
        0            0.9062075 ouml
       38            0.9254975 r
       64            0.9666505 vilken
 

Between changes theres a good idea to rebuild the index with

mysql> repair TABLE texts quick;
+----------------+--------+----------+----------+
| TABLE          | Op     | Msg_type | Msg_text |
+----------------+--------+----------+----------+
| fulltext.texts | repair | STATUS   | OK       |
+----------------+--------+----------+----------+
1 row IN SET (0.00 sec)

The workaround seems to be to store the non-encoded text in a table used for just searching, which adds a bit of work on the application side, if the index should be up to date all of the time, which is the preferred way. ;) Another alternative is also to build some sort of own indexing tables, i tried it before, and its good, but its hard to determine the relevance in ordering the results.

Conclusion

As a conclusion I would say the results are very unpredictable and not very reliable and searching in html encoded text should be avoided by all means and instead focus on considering alternatives to this approach.



blog comments powered by Disqus