First Steps

Prerequisites

We assume you have successfully installed polyglot2 (Yay!). Here are a few quick steps to get you started.

Training a model

To get you started, we provide a sample script, polyglot2_trainer.py, in the scripts directory. It reads in a text corpus and trains a simple model::

polyglot2_trainer.py --files <text files> --output <model file>
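
For example, assuming two plain-text corpus files named corpus1.txt and corpus2.txt (placeholder names for illustration, and assuming --files accepts one or more paths), a full invocation might look like:

polyglot2_trainer.py --files corpus1.txt corpus2.txt --output test.model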

Run the following to get the full list of command-line options:

polyglot2_trainer.py --help

Examining the embeddings

It's very easy to load the trained model and examine the embeddings that have been learnt. If you are familiar with the word2vec module in gensim, the interface will feel very similar.

To illustrate, assume we have a model named test.model trained on some text. We can load the model as follows:

In [1]: from polyglot2 import Polyglot
In [2]: model = Polyglot.load_word2vec_format('test.model')
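
Because the interface mirrors gensim's word2vec module, you can presumably also look up the raw embedding vector for a word by indexing the model, as in gensim. Note that this is an assumption based on the stated similarity to gensim, not a call documented for polyglot2:

In [3]: vec = model['king']  # gensim-style vector lookup (assumed, not verified)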

After loading the model, one can easily query the nearest words to a given word (based on their Euclidean distance in embedding space):

In [4]: model.most_similar('king')
Out[4]:
        [(u'king', 0.0),
         (u'prince', 0.58161491620762695),
         (u'queen', 0.61713733694058359),
         (u'emperor', 0.61844666306850182),
         (u'lord', 0.64116868440576313),
         (u'president', 0.66686299825558359),
         (u'captain', 0.702852998721334),
         (u'prophet', 0.72744270206467843),
         (u'pope', 0.73201129536853193),
         (u'governor', 0.74257922097558959)]

As you can see, we get the words most similar to 'king', listed in increasing order of their distance from it.
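
To make the ranking concrete, here is a minimal, self-contained sketch of this kind of nearest-neighbour lookup by Euclidean distance, using plain numpy. The vocabulary and vectors below are made up for illustration; this mirrors what most_similar does conceptually and is not polyglot2's actual implementation:

import numpy as np

# Toy vocabulary with 4-dimensional embeddings (made-up values;
# a real model has a much larger vocabulary and dimensionality).
vocab = ['king', 'queen', 'prince', 'apple']
embeddings = np.array([
    [0.10, 0.30, -0.20, 0.70],   # king
    [0.20, 0.25, -0.10, 0.60],   # queen
    [0.15, 0.20, -0.25, 0.65],   # prince
    [-0.50, 0.90, 0.40, -0.30],  # apple
])

def most_similar(word, topn=3):
    # Rank every vocabulary word by its Euclidean distance to `word`;
    # the query word itself comes first with distance 0.0.
    query = embeddings[vocab.index(word)]
    distances = np.linalg.norm(embeddings - query, axis=1)
    order = np.argsort(distances)[:topn]
    return [(vocab[i], float(distances[i])) for i in order]

print(most_similar('king'))
# [('king', 0.0), ('prince', 0.1322...), ('queen', 0.1802...)]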

For more details, please take a look at the source here: https://bitbucket.org/aboSamoor/polyglot2

Have fun training and exploring your own word embeddings!