Module embeddingsprep.models
Models
Submodules containing the models.
Quick overview
For now, models only contains a simple wrapper for gensim's implementation of word2vec.
Its main advantage is to be directly compatible with our implementation of the preprocessing.
The Word2Vec model takes as an input :
-
A bunch of preprocessed files, represented as a list of path, or the path of a directory containing plain, raw text files.
-
The parameters of the model
The model has 3 main outputs :
-
An
embeddings.tsv
file, containing one embedding per line, with each coordinates separated by a ' ' token. -
A metadata.tsv
file, containing 1 word per line, each line corresponding to the embedding on the same line inembeddings.tsv
. Warning : the first line of this file has a header, that theembeddings.tsv
file doesn't have, resulting in a shift of one line in theembeddings.tsv
-A metadata.tsv
correspondence. -
A word2vec.model file with the trained gensim checkpoints to restore the model from.
Usage example
from embeddingsprep.models.word2vec import Word2Vec
model = Word2Vec(emb_size=300, window=5, epochs=3)
model.train('./my-preprocessed-data/')
model.save('./my-output-dir')
Expand source code
"""
# Models
Submodules containing the models.
## Quick overview
For now, models only contains a simple wrapper for gensim's implementation of word2vec.
Its main advantage is to be directly compatible with our implementation of the preprocessing.
The Word2Vec model takes as an input :
- A bunch of preprocessed files, represented as a list of path, or the path of a directory containing plain, raw text files.
- The parameters of the model
The model has 3 main outputs :
- An ```embeddings.tsv``` file, containing one embedding per line, with each coordinates separated by a '\t' token.
- ```A metadata.tsv``` file, containing 1 word per line, each line corresponding to the embedding on the same line in ```embeddings.tsv```. Warning : the first line of this file has a header, that the ```embeddings.tsv``` file doesn't have, resulting in a shift of one line in the ```embeddings.tsv``` - ```A metadata.tsv``` correspondence.
- A word2vec.model file with the trained gensim checkpoints to restore the model from.
## Usage example
```python
from embeddingsprep.models.word2vec import Word2Vec
model = Word2Vec(emb_size=300, window=5, epochs=3)
model.train('./my-preprocessed-data/')
model.save('./my-output-dir')
```
"""
Sub-modules
embeddingsprep.models.utils
-
Utils for Word2Vec wrapper
embeddingsprep.models.word2vec
-
Word2Vec …