First Steps
===========
Pre-requisites
--------------
We assume you have successfully installed polyglot2 (yay!). Here are some quick steps to get you started.
Training a model
----------------
To get you started on training a model, we provide a sample script that reads in a text corpus and trains a model. The script ``polyglot2_trainer.py`` in the ``scripts`` directory allows you to train a simple model as below::
    polyglot2_trainer.py --files <text files> --output <model file>
Run the following to get a list of command-line options::
    polyglot2_trainer.py --help
Examining the embeddings
------------------------
It is very easy to load the trained model and examine the embeddings that have been learned. If you are familiar with the word2vec module in gensim, the interface is very similar.
To illustrate, assume we have a model named ``test.model`` trained on some text. We can easily load the model as below::
    In [1]: from polyglot2 import Polyglot

    In [2]: model = Polyglot.load_word2vec_format('test.model')
After loading the model, one can easily query the nearest words to a given word (based on their Euclidean distance in the embedding space)::
    In [3]: model.most_similar('king')
    Out[3]:
    [(u'king', 0.0),
     (u'prince', 0.58161491620762695),
     (u'queen', 0.61713733694058359),
     (u'emperor', 0.61844666306850182),
     (u'lord', 0.64116868440576313),
     (u'president', 0.66686299825558359),
     (u'captain', 0.702852998721334),
     (u'prophet', 0.72744270206467843),
     (u'pope', 0.73201129536853193),
     (u'governor', 0.74257922097558959)]
As you can see, we get the words most similar to *king*, listed in increasing order of their distance from it.
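Under the hood, ``most_similar`` amounts to a nearest-neighbour search over the embedding matrix. The following is a minimal sketch of that computation in plain Python; the vocabulary and vector values here are made up for illustration and do not come from a real model:

```python
import math

# Toy embeddings: one vector per word (values are made up for illustration).
embeddings = {
    'king':   [0.90, 0.80, 0.10],
    'prince': [0.88, 0.78, 0.12],
    'queen':  [0.80, 0.90, 0.10],
    'apple':  [0.10, 0.10, 0.90],
}

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def most_similar(word):
    """Rank every word in the vocabulary by its Euclidean distance
    to `word`, closest first."""
    query = embeddings[word]
    pairs = ((w, euclidean(query, v)) for w, v in embeddings.items())
    return sorted(pairs, key=lambda pair: pair[1])

print(most_similar('king'))
```

Note that, just as in the session above, the queried word itself always comes back first with a distance of 0.0.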
For more details, please take a look at the source here: https://bitbucket.org/aboSamoor/polyglot2
Have fun training and exploring your own word embeddings!