What are Cloudant Analyzers?
From the official documentation:

> Analyzers are settings which define how to recognize terms within text. This can be helpful if you need to index multiple languages.

I decided to try some of the analyzers and record my results here. The test text mixes two completely different languages (English and Japanese) plus some other symbols, which I think is enough to see how the different analyzers behave.
How to test Analyzers
The `_search_analyze` endpoint exists for exactly this purpose:
```sh
FILE=analyze_classic.json
echo '{"analyzer": "classic", "text": "Evening of the seventh"}' > $FILE

# Use _search_analyze
CLOUDANT='https://<username>:<password>@<username>.cloudant.com'
curl -X POST ${CLOUDANT}/_search_analyze -H 'Content-Type: application/json' --data @${FILE}
```
```json
{"tokens":["evening","seventh"]}
```
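If you plan to test several analyzers against the same text, a small helper that builds the request body saves some repetition. A minimal sketch (`payload` is my own name for it, and the `curl` call still needs your real credentials in `CLOUDANT`):

```shell
# Hypothetical helper: build the _search_analyze request body for any analyzer.
payload() {
  printf '{"analyzer": "%s", "text": "%s"}' "$1" "$2"
}

# Build a body:
payload classic "Evening of the seventh"

# And send it, e.g.:
# curl -X POST ${CLOUDANT}/_search_analyze \
#      -H 'Content-Type: application/json' \
#      -d "$(payload classic 'Evening of the seventh')"
```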
Experiment
Today is the Japanese festival Tanabata, so I will use a piece of a romantic song I like very much, "Suspicious Minds" by Elvis Presley, together with part of a Japanese translation of it I found here:

> We're caught in a trap\n I can't walk out\n Because I love you too much baby♬.\n Sent by nacho@email.com at 2016-07-07 18:00:29 +0900 ★\n はまった罠から\n 出られないんだ\n ほんとに君に首ったけなんだ♬.\n 2016年7月7日18時0分29秒にnacho@email.jpより ★

I added newlines (`\n`), the ♬ and ★ marks, an email address and a timestamp to make it more interesting ;-]

Results
Analyzer | Result tokens | Length |
---|---|---|
classic | ["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho@email.com", "2016-07-07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nacho@email.jp", "よ", "り"] | 64 |
email | ["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho@email.com", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nacho@email.jp", "よ", "り"] | 66 |
keyword | ["We're caught in a trap\nI can't walk out\nBecause I love you too much baby♬.\nSent by nacho@email.com at 2016-07-07 18:00:29 +0900 ★\nはまった罠から\n出られないんだ\nほんとに君に首ったけなんだ♬.\n2016年7月7日18時0分29秒にnacho@email.jpより ★"] | 1 |
simple | ["we", "re", "caught", "in", "a", "trap", "i", "can", "t", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "by", "nacho", "email", "com", "at", "はまった罠から", "出られないんだ", "ほんとに君に首ったけなんだ", "年", "月", "日", "時", "分", "秒にnacho", "email", "jpより"] | 35 |
standard | ["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho", "email.com", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nacho", "email.jp", "よ", "り"] | 68 |
whitespace | ["We're", "caught", "in", "a", "trap", "I", "can't", "walk", "out", "Because", "I", "love", "you", "too", "much", "baby♬.", "Sent", "by", "nacho@email.com", "at", "2016-07-07", "18:00:29", "+0900", "★", "はまった罠から", "出られないんだ", "ほんとに君に首ったけなんだ♬.", "2016年7月7日18時0分29秒にnacho@email.jpより", "★"] | 29 |
english | ["we'r", "caught", "trap", "i", "can't", "walk", "out", "becaus", "i", "love", "you", "too", "much", "babi", "sent", "nacho", "email.com", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nacho", "email.jp", "よ", "り"] | 68 |
spanish | ["we'r", "caught", "in", "trap", "i", "can't", "walk", "out", "becaus", "i", "love", "you", "too", "much", "baby", "sent", "by", "nach", "email.com", "at", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nach", "email.jp", "よ", "り"] | 71 |
japanese | ["we", "re", "caught", "in", "a", "trap", "i", "can", "t", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "by", "nacho", "email", "com", "at", "2016", "07", "07", "18", "00", "29", "0900", "はまる", "罠", "出る", "ほんとに", "君", "首ったけ", "2", "0", "1", "6", "年", "7月", "7", "日", "1", "8", "時", "0", "分", "2", "9", "秒", "nacho", "email", "jp"] | 56 |
cjk | ["we're", "caught", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "nacho", "email.com", "2016", "07", "07", "18", "00", "29", "0900", "はま", "まっ", "った", "た罠", "罠か", "から", "出ら", "られ", "れな", "ない", "いん", "んだ", "ほん", "んと", "とに", "に君", "君に", "に首", "首っ", "った", "たけ", "けな", "なん", "んだ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒に", "nacho", "email.jp", "より"] | 63 |
arabic | ["we're", "caught", "in", "a", "trap", "i", "can't", "walk", "out", "because", "i", "love", "you", "too", "much", "baby", "sent", "by", "nacho", "email.com", "at", "2016", "07", "07", "18", "00", "29", "0900", "は", "ま", "っ", "た", "罠", "か", "ら", "出", "ら", "れ", "な", "い", "ん", "だ", "ほ", "ん", "と", "に", "君", "に", "首", "っ", "た", "け", "な", "ん", "だ", "2016", "年", "7", "月", "7", "日", "18", "時", "0", "分", "29", "秒", "に", "nacho", "email.jp", "よ", "り"] | 72 |
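For the record, the "Length" column is just the number of tokens in each response. If you have `jq` installed (my own tooling choice, not something Cloudant requires), you can count them straight from the output:

```shell
# Count the tokens in a _search_analyze response with jq (jq assumed installed).
# Here the response is pasted in; in practice you would pipe the curl output.
echo '{"tokens":["evening","seventh"]}' | jq '.tokens | length'
```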
Commentaries
Some interesting things to note:

- The only notable difference between `standard` and `classic` is the treatment of "2016-07-07" and the email addresses. `standard` got the emails wrong.
- `email` looks like `standard`, but with the emails right.
- `whitespace` looks like a good option when the text has symbols (e.g. ♬, ★): they are still there as tokens!
- `keyword`: just one token. I think it would be useful for exact-match searches.
- None of the analyzers, not even `keyword`, preserved the newline character `\n`.
- Several occurrences of the same string are valid (e.g. "i" and "★").
- I find it funny that "we're" became "we'r" when `english` was used. Also "baby" became "babi".
- Japanese words were tokenized as individual characters by `english`. That makes it really hard (if possible at all) to do a useful search in Japanese.
- I am depressed that `spanish` didn't get my name "nacho" right. For some reason it became "nach". I guess it is trying to find the root of the word, since there is also nacha, nachito, nachita, nachos, nachas, etc.
- `cjk` (Chinese/Japanese/Korean) completely broke the Japanese words. I know Japanese is hard to parse, but `cjk` is not helping at all here.
- I have no idea why I tried `arabic`; I have literally zero knowledge of the language, so I can't comment on it.
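Once you have picked an analyzer, you set it in a search index definition inside a design document. A sketch of what that could look like (the design document, index, and field names are my own; the `perfield` analyzer lets you mix languages per field, which would fit bilingual text like the test song):

```json
{
  "_id": "_design/songs",
  "indexes": {
    "lyrics": {
      "analyzer": {
        "name": "perfield",
        "default": "english",
        "fields": {
          "lyrics_ja": "japanese"
        }
      },
      "index": "function (doc) { index(\"lyrics_en\", doc.lyrics_en); index(\"lyrics_ja\", doc.lyrics_ja); }"
    }
  }
}
```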
More on "nacho"
```sh
curl -X POST ${CLOUDANT}/_search_analyze -H 'Content-Type: application/json' -d '{"analyzer":"spanish", "text":"nacho"}'
{"tokens":["nach"]}

curl -X POST ${CLOUDANT}/_search_analyze -H 'Content-Type: application/json' -d '{"analyzer":"spanish", "text":"nachos"}'
{"tokens":["nach"]}

curl -X POST ${CLOUDANT}/_search_analyze -H 'Content-Type: application/json' -d '{"analyzer":"spanish", "text":"Nachos"}'
{"tokens":["nach"]}
```

🇵🇪 I think I am going to complain to the Cloudant or Apache people so that my name comes out right! Right now!! 🇵🇪 (Or could it be that it is just extracting the root of the word?)
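An analyzer with no stemming would leave "nacho" alone. This is not the real Lucene implementation, just a rough local stand-in for what `whitespace` does (split on whitespace, nothing else):

```shell
# Rough local approximation of the whitespace analyzer: split on whitespace only.
# No lowercasing, no stemming, so "nacho" survives intact.
echo 'Sent by nacho at noon' | tr -s ' ' '\n'
```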