Teach the bot! Annotating the emotions and semantics of the Russian language

We are showered from all sides with prospects of a bright robotic future. Or a not so bright one, in the spirit of The Matrix and The Terminator. Indeed, machines already cope confidently with translation, recognize faces and objects in the surrounding world no worse than humans and much faster, and are learning to understand and synthesize speech. Cool? That's putting it mildly!


But curiosity pushes us toward riskier steps: trying to acquaint the computer with our inner world as well, with feelings, emotions and experiences.

Read on to find out how we plan to upgrade machine consciousness and teach machines about emotions, feelings and evaluative judgments, and where you can download the annotated data for free.


Don't want to read? Show me the result!


You can start teaching the bot right away by following the link: Teach the bot!

If you enjoy answering, create your own Map and your results will be saved.


Limitations of distributional semantics


[Image: meme about distributional semantics, word2vec, a robot and coffee]

What, exactly, is the problem with computers understanding text? After all, a machine can study the entire written cultural heritage and learn everything from it. The output of word2vec tells the story.

For the token "male":
woman 0.650
married 0.594
elderly 0.542
antimycin 0.538
...
pregnant 0.519
nulliparous 0.516
girl 0.498
...

Or for the word "hot":
warm 0.510
...
cold 0.498
cool 0.486
hot 0.467
...

Or for the strongly positive emotion "delight":
admiration 0.715
...
resentment 0.609
rage 0.597
horror 0.586
despair 0.584
...
awe 0.531
confusion 0.523
confusion 0.522
...
rabies 0.472
...

Or for the broad concept of "technology":
...
technology 0.569
art 0.451
skill 0.410
...
aircraft 0.393
industry 0.392
medicine 0.379
craft 0.375
...
industry 0.370
...
knowledge 0.360
science 0.358
...

These examples clearly show how much information context provides. Quite a lot, but clearly not enough to pick out antonyms, part-whole and general-specific relations, or to distinguish vertical relationships from horizontal ones.

It is therefore reasonable that many researchers combine distributional-semantics approaches (read: word2vec) with thesauri. For English, that resource is WordNet; for Russian, RuThes and Wiktionary.
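To see why context alone struggles with antonyms, here is a minimal sketch. It is not word2vec itself: it builds plain co-occurrence count vectors over a tiny made-up corpus (the sentences, the window size and the function names are all assumptions chosen for illustration) and compares words by cosine similarity. Because "hot" and "cold" appear in the same surroundings, their vectors come out nearly identical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: the antonyms "hot" and "cold" share their contexts.
CORPUS = [
    "the tea is hot today",
    "the tea is cold today",
    "the soup is hot again",
    "the soup is cold again",
    "the robot reads books",
]


def cooccurrence_vectors(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = cooccurrence_vectors(CORPUS)
hot_vs_cold = cosine(vectors["hot"], vectors["cold"])    # shared contexts
hot_vs_robot = cosine(vectors["hot"], vectors["robot"])  # no shared contexts
print(f"hot ~ cold:  {hot_vs_cold:.3f}")
print(f"hot ~ robot: {hot_vs_robot:.3f}")
```

The similarity score says that two words are used alike, not how they relate to each other, which is exactly the information a thesaurus adds.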


The obvious is not so obvious


[Image: meme about semantics, The Lion King]

Every researcher who dares to try explaining meaning to a machine sooner or later runs into the fact that the most seemingly trivial things are completely unclear to a computer. What is more, not a single word is written about them even in children's books. We come to know the world in several ways through our senses: sight, hearing, smell, touch, taste and others.

We then pass each other only a very compressed, brief context of a situation, which unfolds into a detailed picture inside each individual head. For every person the situation unfolds differently, depending on personal experience, cultural background, character and worldview.


Emotions, feelings, experiences


Words and phrases carry much more meaning than dictionaries record. First of all, this has to do with properties as shaky and hard to formalize as evaluation and the emotional coloring that accompanies it. For example, the phrase "grievous torments" carries a strong negative emotion, and the phrase "elation" a strong positive one. To be "absent" is somewhat negative, but not strongly so. And "virtuoso", for example, has a fairly strong positive evaluation.

The difficulty in recording these characteristics is that they are extremely subjective and hard to formalize. For example, is the word "strategy" positive or neutral? The only thing everyone can agree on is that it is not negative.

Nevertheless, emotional and evaluative attributes are an integral part of language units and play a very important role in human communication. So if we want to make the machine more humane and pleasant, it also needs to sense these subtle matters.




What to do?


Creating such a dictionary by hand would be extremely laborious, because not only words but also phrases have to be annotated. In addition, every rating would be strongly tied to the subjective opinion of the researcher.

Good news: we live in 2017 and have access to wonderful technologies such as the Internet and crowdsourcing. The latter lets us deal with both the labor cost and the subjectivity of the ratings at once. Of course, this produces the "average temperature across the hospital" effect, where averaging hides individual differences, but as a first approximation we allow ourselves to turn a blind eye to such roughness.
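A minimal sketch of how crowd ratings might be aggregated, and of where the averaging effect shows up. Everything here is an assumption made for illustration rather than a description of the project's pipeline: the rating scale from -2 to +2, the vote values, the neutrality threshold, and the word "ambitious", which is not from the article.

```python
from statistics import mean, pstdev

# Hypothetical votes: each annotator rates a word from -2 (strongly negative)
# to +2 (strongly positive). The values are invented for illustration.
VOTES = {
    "virtuoso": [2, 1, 2, 1],
    "strategy": [0, 1, 0, 0],
    "absent": [-1, 0, -1, -1],
    "ambitious": [2, -2, 2, -2],  # annotators split into two camps
}


def aggregate(votes):
    """Return {word: (mean rating, spread)}; spread is the population std dev."""
    return {word: (mean(r), pstdev(r)) for word, r in votes.items()}


def polarity(score, threshold=0.5):
    """Map a mean rating to a coarse label; the threshold is an assumption."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


results = aggregate(VOTES)
for word, (score, spread) in results.items():
    print(f"{word:10s} mean={score:+.2f} spread={spread:.2f} -> {polarity(score)}")
```

In this toy data "ambitious" averages out to neutral even though no single annotator rated it neutral. Keeping a spread value next to the mean is one simple way to flag such cases instead of hiding them.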


Teach the bot! Annotating the emotions and semantics of the Russian language


The idea is being implemented within the Kartaslov (Map of Words) language project. The work will proceed in several directions:


    Evaluative annotation. The task is to label Russian words and expressions as positive, neutral or negative, and by the intensity of that attribute.

    Emotional annotation. The task is to label emotive words and expressions by the polarity and strength of their emotional coloring.

    Thesaurus annotation. The task is to mark vertical and horizontal relations between words and to assign semantic tags to words and expressions.

    Experimental annotation of relations from the "Meaning ⇔ Text" theory proposed by I. A. Mel'čuk: MAGN(tea) = strong tea, MAGN(feeling) = strong feeling, etc.
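The four kinds of annotation listed above could be stored together in one record per word or phrase. The sketch below is only a guess at such a structure: the class, the field names and the 0 to 3 strength scale are hypothetical and do not describe the project's actual data format. The example values reuse phrases mentioned in the article.

```python
from dataclasses import dataclass, field
from enum import Enum


class Polarity(Enum):
    NEGATIVE = -1
    NEUTRAL = 0
    POSITIVE = 1


@dataclass
class LexicalEntry:
    """One annotated word or phrase (hypothetical schema)."""
    text: str
    # Evaluative annotation: polarity plus intensity (assumed scale 0..3).
    evaluation: Polarity = Polarity.NEUTRAL
    evaluation_strength: int = 0
    # Emotional annotation: polarity plus strength (assumed scale 0..3).
    emotion: Polarity = Polarity.NEUTRAL
    emotion_strength: int = 0
    # Thesaurus annotation: semantic tags, vertical and horizontal relations.
    semantic_tags: set = field(default_factory=set)
    hypernyms: set = field(default_factory=set)   # vertical (general-specific)
    antonyms: set = field(default_factory=set)    # horizontal
    # "Meaning <=> Text" lexical functions, e.g. {"MAGN": ["strong feeling"]}.
    lexical_functions: dict = field(default_factory=dict)


torments = LexicalEntry(
    text="grievous torments",
    emotion=Polarity.NEGATIVE,
    emotion_strength=3,
)
feeling = LexicalEntry(
    text="feeling",
    lexical_functions={"MAGN": ["strong feeling"]},
)

print(torments.emotion.name, torments.emotion_strength)
print(feeling.lexical_functions["MAGN"])
```

Keeping the lexical functions as an open dictionary rather than fixed fields leaves room for relations beyond MAGN, which suits an annotation track the article itself calls experimental.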


To get the most out of human effort and keep the task interesting for annotators, we apply distributional semantics and machine learning approaches. As the basis for the system of semantic categories, we took the classification used in the Russian National Corpus (RNC).
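The article does not say how exactly these approaches are combined with human annotation. One common possibility, shown here purely as an assumption, is to predict a rating for each unlabeled word from its already labeled neighbors in the embedding space and to ask people first about the words where the prediction is least certain. The two-dimensional vectors, the labels and the function names below are invented for the sketch.

```python
import math

# Hypothetical 2-d embeddings; real word2vec vectors have hundreds of dimensions.
EMBEDDINGS = {
    "delight": (0.9, 0.8),
    "admiration": (0.8, 0.9),
    "horror": (-0.9, 0.7),
    "despair": (-0.8, 0.6),
    "elation": (0.85, 0.75),
    "awe": (0.0, 1.0),
    "strategy": (0.05, -0.9),
}

# Words already rated by annotators: +1 positive, -1 negative (invented values).
LABELS = {"delight": 1, "admiration": 1, "horror": -1, "despair": -1}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def predicted_score(word):
    """Similarity-weighted average of neighbor labels, in the range [-1, 1]."""
    weights = {
        w: max(0.0, cosine(EMBEDDINGS[word], EMBEDDINGS[w])) for w in LABELS
    }
    total = sum(weights.values())
    if total == 0:
        return 0.0  # no similar labeled word: nothing can be predicted
    return sum(weights[w] * LABELS[w] for w in LABELS) / total


def annotation_queue(words):
    """Order words so that the least confident predictions come first."""
    return sorted(words, key=lambda w: abs(predicted_score(w)))


queue = annotation_queue(["elation", "awe", "strategy"])
print(queue)
```

With these toy vectors, "strategy" has no similar labeled neighbor and "awe" sits between the positive and negative groups, so both would go to human annotators before "elation", whose neighbors all agree.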

How to participate?




An important goal of our initiative is to fill in the missing linguistic resources for Russian that are open for use by researchers, linguists and practicing engineers. We expect that the annotation data will give rise to interesting research, scientific papers and articles on Habr, as well as engineering products and open technologies.

You can help the project in the following ways:


    Take part in teaching the bot. It is easy and fun, and it lets you sharpen your feel for language and notice interesting features of Russian.

    Like, share, Alisher! Share links on social networks, and write about the project on your blog or website.

    Constructive criticism helps us develop and keeps us from sinking into a swamp of illusions. Discussion is very important for correcting course and creating a truly useful resource. One request: if you criticize, propose an alternative.

    Semantics and cognitive linguistics. We are trying to improve our understanding of modern approaches to semantics and to building such resources. We would be glad of advice or recommendations on what to read, what to study and whom to consult.

    Spreading the word. We need your advice on where else to talk about the project: it could be your favorite tech blog, an online technology magazine, a group on VKontakte/Slack/Telegram, or something else.

Open data


The aggregated annotation results will be open for download and available under the CC BY-NC 4.0 license.

We expect to obtain and publish the first results by mid to late July; everything depends on how active the annotators are. So don't miss it: star the repository and follow us on GitHub:

Kartaslov (Map of Words) open data


Where's the money, Zin?


It is great to try combining crowdsourcing and crowdfunding in a single project, which is what we did by launching a fundraising campaign on Planeta.ru:

Teach a computer to understand our world and emotions



Important. We will see the project through to a result in any case, on our own and with the resources available. The collected data will, as promised, be open and available to everyone. The only question is the timing and the volume of annotation. At present we expect to get a baseline result (the 10,000 most frequent lemmas) within three months; annotating the full volume will take about two years.

Additional resources will help us reach results significantly faster. We need the help of developers to build and improve the annotation system, add new semantic categories and conduct research. Funds are also needed to promote the project and run contests.

By donating any amount to the campaign, you will know that your contribution is part of the overall success, and every ruble invested will be spent on something cool and useful.

Don't forget that you can also support the initiative without money. Liking and sharing the project on social networks is a very simple, completely free, yet very effective way to promote it.

And remember...
The choice is always yours.


Corporate sponsorship


Do you represent an established business with an interest in the development of open linguistic data in Russia? Become a corporate sponsor of the project! You get a permanent graphical link from the project page, additional exposure to an audience of thousands, and boundless respect from the community.

We will spend every dollar invested with incredible efficiency: for a few months' salary of one programmer at a large company we will carry out the entire project, whose results will benefit thousands of researchers, scientists and engineers.


Commercial use


For commercial use or business-specific annotation, write to kartaslov@mail.ru or send a private message to the author.


Acknowledgements


A big thank-you goes to the organizers and participants of Dialogue 2017, the 23rd International Conference on Computational Linguistics and Intellectual Technologies.

It was in informal discussions at the event that a clear need for such annotation emerged, and a group of like-minded people came together to discuss experimental annotation of relations according to the "Meaning ⇔ Text" theory. I hope that next year the collected data will make it possible to launch an interesting new competition within Dialogue Evaluation.


References


1. Teach the bot! Kartaslov (Map of Words)
2. RusVectōrēs: ready-made word2vec models for Russian
3. RuThes, a thesaurus of Russian (RuWordNet)
4. Wiktionary
5. Lexical and semantic information in the RNC
Article based on information from habrahabr.ru
