Word2Vec in examples
By a twist of fate, a Word2Vec model trained on search queries ended up in my hands. Below are examples with explanations.
Article based on information from habrahabr.ru
What is Word2Vec?
Word2Vec is a technology from Google geared toward statistical processing of large volumes of text. W2V collects statistics on the co-occurrence of words in sentences, then uses neural-network methods to reduce the dimensionality and produce a compact vector representation for each word, one that reflects, as far as possible, the relations between words in the processed text. I advise reading the original paper rather than relying on my muddled retelling of it.
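All the examples below report word proximity as a cosine measure between these vectors. As a reminder of what that means, here is a minimal numpy sketch; the vectors are made-up stand-ins, not taken from the model:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two word vectors: values near 1.0
    # mean the vectors point in almost the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional stand-ins; the real model uses 256 dimensions.
coffee = np.array([0.9, 0.1, 0.3, 0.2])
tea = np.array([0.8, 0.2, 0.4, 0.1])
print(cosine_similarity(coffee, tea))  # close to 1.0 for related words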
What was Word2Vec trained on?
The training data were queries to a domestic Internet search engine, so the bulk of them are in Russian. The vector length is 256 elements, and models are available for both the skip-gram and the bag-of-words (CBOW) algorithms. The vocabulary totals more than 2.6 million entries: Russian words, many words from other languages, typos, names, and codes; in short, everything people search for.
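The post does not say which toolkit was used for training. As a hedged illustration only, here is how the same configuration (256-dimensional vectors, skip-gram and bag-of-words variants) would look in gensim; queries.txt is a hypothetical file with one search query per line:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

corpus = LineSentence('queries.txt')  # hypothetical corpus: one query per line

# sg=1 trains the skip-gram variant, sg=0 the bag-of-words (CBOW) variant.
skipgram = Word2Vec(corpus, vector_size=256, sg=1, min_count=5, workers=4)
cbow = Word2Vec(corpus, vector_size=256, sg=0, min_count=5, workers=4)

# Save in the binary format the C tools below understand; judging by the
# transcripts, vectors.bin is the skip-gram model and vectors2.bin the CBOW one.
skipgram.wv.save_word2vec_format('vectors.bin', binary=True)
cbow.wv.save_word2vec_format('vectors2.bin', binary=True)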
Search queries differ markedly in their characteristics from ordinary texts, and this yields some new results.
Typos
./distance vectors.bin
Enter word or sentence (EXIT to break): adventures (a misspelled Russian form in the original)
Word position in vocabulary: 124515

Word            Cosine distance
adventure       0.748698
prikluchenia    0.726111
adventure       0.692828
prikluchenia    0.670168
priklucenie     0.666706
prikluchenia    0.663286
prklyucheniya   0.660438
adventures      0.659609
Here we get not only a correction for the typo in the query, but a complete list of the typos people actually make. All the typos gather into a single cluster, which is convenient. The difference from spell-check systems is fundamental: a typo is resolved not by computing the Levenshtein distance (the minimum number of edits required to turn the incorrect form into the correct one), but from statistics of real mistakes made by real users.
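For reference, the Levenshtein distance mentioned above is computed with classic dynamic programming; a minimal sketch:

def levenshtein(a, b):
    # Minimum number of insertions, deletions and substitutions turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein('avito', 'awito'))  # 1: one substitution away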
Transliteration, wrong layout
./distance vectors.bin
Enter word or sentence (EXIT to break): avito
Word: avito  Position in vocabulary: 1999

Word            Cosine distance
awito           0.693721
avito           0.675299
fvito           0.661414
avita           0.659454
irr             0.642429
ovit            0.606189
аviто           0.598056
./distance vectors.bin
Enter word or sentence (EXIT to break): pwmpgu

Word            Cosine distance
PSPU            0.723194
PSPD            0.721070
PSPD            0.712373
PSPgo           0.704579
psweb           0.695897
pssdk           0.694641
PSPgo           0.692646
psdu            0.681183
pssp            0.660203
pgpd            0.653649
Google          0.649897
pooplo          0.647420
PSPgo           0.643923
pmspl           0.641619
ptuc            0.640587
psptouch        0.631423
pssu            0.620105
gogle           0.616396
ISTP            0.612234
google          0.608240
Although transliteration and wrong-layout correction are solved by simpler and faster methods, it is still nice that Word2Vec does not disappoint here either.
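Those simpler and faster methods usually amount to a lookup table between the two keyboard layouts. A sketch for the QWERTY/ЙЦУКЕН pair:

# The same physical keys carry both layouts, so a query typed with the
# wrong layout switched on is undone by a one-to-one character map.
EN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU = "йцукенгшщзхъфывапролджэячсмитьбю"
EN_TO_RU = str.maketrans(EN, RU)

print("ghbdtn".translate(EN_TO_RU))  # -> "привет" ("hello" typed in the wrong layout)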
Website names and analogues
./distance vectors.bin
Enter word or sentence (EXIT to break): Google

Word            Cosine distance
googol          0.850174
Google          0.809912
GOP             0.786360
Google          0.760508
Google          0.734248
GUG             0.731465
Google          0.726011
Google          0.725497
gcgl            0.724901
Guguli          0.722874
Goglev          0.719596
GPD             0.719277
Gugel           0.715329
gugal           0.713950
Yandex          0.695366
google          0.690433
googl           0.669867
./distance vectors.bin
Enter word or sentence (EXIT to break): mail

Word            Cosine distance
rambler         0.777771
meil            0.765292
inbox           0.745602
maill           0.741604
yandex          0.696301
maii            0.675455
myrambler       0.674704
zmail           0.657099
mefr            0.655842
jandex          0.655119
gmail           0.652458
вкmail          0.639919
Clustering of words is the main function of Word2Vec, and as you can see, it works well.
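The ./distance utility ships with Google's original C word2vec. Assuming vectors.bin is in the standard word2vec binary format, the same nearest-neighbour lookup works from Python with gensim:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

# Nearest neighbours by cosine similarity, as ./distance prints them.
for word, score in kv.most_similar('avito', topn=10):
    print(f'{word:15s} {score:.6f}')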
Semantically similar words
./distance vectors.bin
Enter word or sentence (EXIT to break): coffee

Word            Cosine distance
Coffey          0.734483
tea             0.690234
tea             0.688656
cappuccino      0.666638
Corr            0.636362
cocoa           0.619801
espresso        0.599390
coffee          0.595211
chicory         0.594247
Kofe            0.593993
capuchino       0.587324
chocolate       0.585655
cappuccino      0.580286
cardamom        0.566781
latte           0.563224
./distance vectors2.bin
Enter word or sentence (EXIT to break): coffee

Word            Cosine distance
bean            0.757635
soluble         0.709936
Coffey          0.704036
mellanrost      0.694822
sublimirovanny  0.694553
ground          0.690066
coffee          0.680409
tea             0.679867
decaffeinated   0.678563
cappuccino      0.677856
monoarabica     0.676757
fresh           0.676544
decaf           0.674104
Gevalia         0.673163
rastvorimy      0.659948
etiopia         0.657329
electrotorque   0.652837
The first results come from Word2Vec in skip-gram mode, that is, selecting words by their environment, while the second come from Word2Vec in bag-of-words mode, selecting words together with their surroundings. The first list contains words interchangeable with "coffee"; the second, words that characterize coffee. The second kind of result becomes especially useful once we start thinking about how to evaluate the importance of words in a query: which word is the main one, and which merely makes the query more specific.
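That contrast is easy to inspect by loading the two models side by side; a sketch reusing the file names above:

from gensim.models import KeyedVectors

skipgram = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
cbow = KeyedVectors.load_word2vec_format('vectors2.bin', binary=True)

word = 'coffee'
# Skip-gram neighbours tend to be substitutes for the word,
# bag-of-words neighbours tend to be its companions.
print([w for w, _ in skipgram.most_similar(word, topn=5)])
print([w for w, _ in cbow.most_similar(word, topn=5)])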
Clustering
./distance vectors2.bin
Enter word or sentence (EXIT to break): mobile phone

Word            Cosine distance
cell            0.811114
phone           0.776416
smartphone      0.730191
telfon          0.719766
mobile          0.717972
cell phone      0.706131
phone           0.698894
the phone       0.695520
phone           0.693121
Mobilny         0.692854
teleon          0.688251
phone           0.685480
Telefon         0.674768
cell            0.673612
A query of several words can be reduced to a single word, the most characteristic one. Moreover, that word need not appear in the original query at all. Verbose queries can also be compared with one another directly, without the intermediate reduction to a single word. Here we see query rewriting and query expansion in action.
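In gensim terms this reduction is exactly what most_similar does when given several positive words: their vectors are averaged into one query vector before the nearest-neighbour search. A sketch:

from gensim.models import KeyedVectors

cbow = KeyedVectors.load_word2vec_format('vectors2.bin', binary=True)

# 'mobile' and 'phone' are averaged into a single query vector, and the
# answer ('cell') need not occur in the original query at all.
print(cbow.most_similar(positive=['mobile', 'phone'], topn=5))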
Semantic relations between words
The most interesting section of Google's description of Word2Vec is how they turned a king into a queen using simple arithmetic operations on vectors. With search queries this trick did not come off (few people search for kings and queens these days), but some semantic relations do stand out clearly.
Suppose you want to find the word that relates to Germany the way Paris relates to France.
./word-analogy vectors2.bin
Enter three words (EXIT to break): France Paris Germany

Munich          0.716158
Berlin          0.671514
Dusseldorf      0.665014
Hamburg         0.661027
Cologne         0.646897
Amsterdam       0.641764
Frankfurt       0.638686
Prague          0.612585
Aschaffenburg   0.609068
Dresden         0.607926
Nuremberg       0.604550
lüdenscheid     0.604543
Gmunden         0.590301
./word-analogy vectors2.bin
Enter three words (EXIT to break): us dollar Ukraine

UAH             0.622719
dolar           0.607078
hryvnia         0.597969
ruble           0.596636
dollar          0.588882
the hryvnia     0.584129
ruble           0.578501
the ruble       0.574094
the dollar      0.565995
tenge           0.561814
dolar           0.561768
currency        0.556239
dollar          0.548859
the hryvnia     0.544302
Impressive...
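For reference, ./word-analogy evaluates vector(B) - vector(A) + vector(C) for the three input words A B C; the same query looks like this in gensim (English tokens here are stand-ins for the model's actual Russian vocabulary):

from gensim.models import KeyedVectors

cbow = KeyedVectors.load_word2vec_format('vectors2.bin', binary=True)

# vector('paris') - vector('france') + vector('germany') lands near German cities.
for word, score in cbow.most_similar(positive=['paris', 'germany'],
                                     negative=['france'], topn=5):
    print(f'{word:15s} {score:.6f}')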
Assessment of the importance of words in the query
The principle is simple: determine which cluster the query as a whole gravitates toward, then pick the words that are farthest from the center of that cluster. Those are the main words; the rest are qualifiers.
./importance vectors.bin
Enter word or sentence (EXIT to break): buy a pizza in Moscow
Importance buy = 0.159387
Importance pizza = 1
Importance in = 0.403579
Importance Moscow = 0.455351

Enter word or sentence (EXIT to break): download twilight
Importance download = 0.311702
Importance twilight = 1

Enter word or sentence (EXIT to break): Vladimir Putin
Importance Vladimir = 0.28982
Importance Putin = 1

Enter word or sentence (EXIT to break): Nikita Putin
Importance Nikita = 0.793377
Importance Putin = 0.529835
In "Vladimir Putin" the word "Vladimir" is almost unimportant: nearly every Putin found on the Internet is a Vladimir. In "Nikita Putin" it is the other way around: "Nikita" is the more important word, because it singles this Nikita out among all the Putins on the Internet.
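The ./importance tool is the author's own, so what follows is only a hedged sketch of the principle just described: take the mean vector of the query words as the cluster center, score each word by its distance from that center, and normalize so that the top word gets 1.

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

def importance(query):
    # Score each word by its distance from the query's mean vector,
    # scaled so that the most distant (most important) word gets 1.
    words = [w for w in query.split() if w in kv]
    center = np.mean([kv[w] for w in words], axis=0)
    dist = {w: float(np.linalg.norm(kv[w] - center)) for w in words}
    top = max(dist.values())
    return {w: d / top for w, d in dist.items()}

print(importance('buy pizza in moscow'))  # pizza should score highest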
Conclusions
There are only a few conclusions as such: the technology works, and works well. I hope this post will serve as a Russian-language illustration of the possibilities hidden in Word2Vec.