# Module embeddingsprep

## Embeddings
This package provides easy-to-use Python class and CLI interfaces to:
- clean corpora efficiently in terms of computation time
- generate word2vec embeddings (based on gensim) and write them directly to a format compatible with [TensorFlow Projector](http://projector.tensorflow.org/)
Thus, with two classes, or two commands, anyone should be able to clean a corpus and generate embeddings that can be uploaded to and visualized with TensorFlow Projector.
## Getting started

### Installation
To install this package, simply run:

```bash
pip install embeddingsprep
```
Future versions might include conda builds, but that is currently not the case.
### Requirements
This package requires `gensim`, `nltk`, and `docopt` to run. If pip doesn't install these dependencies automatically, you can install them by running:

```bash
pip install nltk docopt gensim
```
## Main features

### Preprocessing
For Word2Vec, we want a soft yet meaningful preprocessing: denoise the text while keeping as much variety and information as possible. A detailed description of what is done during preprocessing is available [here](./preprocessing/index.html).
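To make the idea concrete, here is a purely illustrative sketch of this kind of soft cleaning. It is not the package's actual pipeline (which is documented at the link above); the `soft_clean` helper and its choices are assumptions for illustration only:

```python
# Illustrative only: NOT embeddingsprep's actual pipeline. A "soft" cleaning
# keeps word variety while removing noise such as case and stray punctuation.
import re
from nltk.tokenize import word_tokenize  # needs the nltk 'punkt' tokenizer data

def soft_clean(text):
    """Lowercase, collapse whitespace, tokenize, and drop pure-punctuation tokens."""
    text = re.sub(r'\s+', ' ', text.lower())
    return [tok for tok in word_tokenize(text) if re.search(r'\w', tok)]

print(soft_clean("The  Quick  Brown Fox -- jumps!"))
# -> ['the', 'quick', 'brown', 'fox', 'jumps']
```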
**IMPORTANT WARNING**: Your text data should be represented in one of the two following ways:

- A single .txt file containing all your plain text data. This is not recommended, as the preprocessing will then not be multithreaded (see the splitting sketch after this list).
- A directory containing many .txt files. The Preprocessor will read all the files in the directory, and the preprocessing will be multithreaded.
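If your corpus currently lives in one big file, a small helper can split it into many .txt files so that the multithreaded path applies. This helper is hypothetical (not part of embeddingsprep); the chunking scheme is just one reasonable choice:

```python
# Hypothetical helper, not part of embeddingsprep: split one large .txt file
# into roughly n_chunks smaller files so the Preprocessor can parallelize.
import os

def split_corpus(big_file, out_dir, n_chunks=8):
    os.makedirs(out_dir, exist_ok=True)
    with open(big_file, encoding='utf-8') as f:
        lines = f.readlines()
    chunk_size = max(1, len(lines) // n_chunks + 1)
    for i in range(0, len(lines), chunk_size):
        part_path = os.path.join(out_dir, 'part_{:03d}.txt'.format(i // chunk_size))
        with open(part_path, 'w', encoding='utf-8') as out:
            out.writelines(lines[i:i + chunk_size])

split_corpus('corpus.txt', 'corpus_parts/', n_chunks=16)
```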
#### Usage example

First, you need to create and save a Preprocessor configuration file:
```python
from embeddingsprep.preprocessing.preprocessor import PreprocessorConfig, Preprocessor

config = PreprocessorConfig('~/logdir')
config.set_config(writing_dir='~/outputs')  # You can additionally change other preprocessing params.
config.save_config()
```
Here, `'~/logdir'` should be replaced by the path where you want the preprocessing summary files to be logged. After fitting the preprocessor, the summary files will consist of:

- `vocabulary.json`, the saved final vocabulary, after word-phrase gathering and frequency subsampling.
- `WordPhrases.json`, the word-phrase vocabulary.
- `summary.txt`, a summary containing information on the preprocessing fitting.
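After fitting, it can be useful to peek at these files before filtering further. A small sketch, assuming `vocabulary.json` is a plain JSON object (its exact structure may differ; adapt the access accordingly):

```python
# Assumes vocabulary.json is a plain JSON object; adjust to its real structure.
import json
import os

logdir = os.path.expanduser('~/logdir')
with open(os.path.join(logdir, 'vocabulary.json'), encoding='utf-8') as f:
    vocab = json.load(f)

print(len(vocab), 'vocabulary entries')
print(list(vocab)[:10])  # peek at the first few entries
```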
The `writing_dir='~/outputs'` argument indicates where the Preprocessor should write the processed files while transforming the data.
```python
prep = Preprocessor('~/logdir')  # Loads the config object from ~/logdir if it exists
prep.fit('~/mydata/')            # Fits the unigram & bigram occurrences - must be done once
prep.filter()                    # Filters with the config parameters - can be repeated until the best parameters are found
prep.transform('~/mydata')       # Transforms the texts with the filtered vocab
```
You can also redefine preprocessor parameters after fitting the data, as many times as needed, by accessing `prep.params`. For instance, for the frequency threshold:

```python
prep.params['freq_threshold'] = 0.3
```
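Since `filter()` can be repeated without refitting, a simple sweep over candidate thresholds is one way to settle on a value. This sketch reuses only the calls shown above; the candidate values are arbitrary:

```python
# Sketch of a parameter sweep using only the calls shown above. fit() is done
# once; filter() is cheap to repeat with different thresholds.
for threshold in (0.1, 0.2, 0.3):
    prep.params['freq_threshold'] = threshold
    prep.filter()
    # ...inspect the summary files in the log dir for each candidate value

prep.params['freq_threshold'] = 0.2  # keep the value that worked best
prep.filter()
prep.transform('~/mydata')
```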
### Word2Vec

For the Word2Vec step, we wrote a simple wrapper that takes the preprocessed files as input, trains a Word2Vec model with gensim, and writes the vocabulary and embedding .tsv files that can be visualized with [TensorFlow Projector](http://projector.tensorflow.org/).
#### Usage example

```python
from embeddingsprep.models.word2vec import Word2Vec

model = Word2Vec(emb_size=300, window=5, epochs=3)
model.train('./my-preprocessed-data/')
model.save('./my-output-dir')
```
`'./my-preprocessed-data/'` is the directory where the preprocessed files are stored, and `'./my-output-dir'` is the directory where the embeddings and the model will be stored.
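The Projector format itself is just two tab-separated files: one vector per line, plus a metadata file with one word per line. As a point of reference, here is a minimal sketch of writing that format straight from a gensim model (assuming the gensim 4.x API; the model path is hypothetical):

```python
# Minimal sketch of the TensorFlow Projector .tsv format, written directly
# from a gensim model (gensim 4.x API). The model path is hypothetical.
from gensim.models import Word2Vec as GensimWord2Vec

gmodel = GensimWord2Vec.load('./my-output-dir/model')  # hypothetical path

with open('embeddings.tsv', 'w', encoding='utf-8') as vecs, \
     open('vocab.tsv', 'w', encoding='utf-8') as meta:
    for word in gmodel.wv.index_to_key:
        vecs.write('\t'.join(str(x) for x in gmodel.wv[word]) + '\n')
        meta.write(word + '\n')
```

Both files can then be uploaded on http://projector.tensorflow.org/ via its "Load" button.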
## Future work

Future work will include:

- Creation of a command-line interface
- Embedding alignment methods
- More tutorials in the documentation
## Contributing

Any GitHub issue, contribution, or suggestion is welcome! You can open issues on the [GitHub repository](https://github.com/sally14/embeddings).
## Sub-modules

- `embeddingsprep.cli`
- `embeddingsprep.models`: Models …
- `embeddingsprep.preprocessing`: Preprocessing …