As mentioned on Slack, I propose adding a generalized numericalizer interface that would let users easily plug in more advanced numericalization methods such as word2vec embeddings, TF-IDF and so on. The existing Vocab class fits perfectly into this interface, so no big changes would be required. The interface would look like this:
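    from abc import ABC, abstractmethod

    class SmartNumericalizer(ABC):
        @abstractmethod
        def update(self, tokens):
            """Record the tokens of one example while the dataset is read."""

        @abstractmethod
        def finalize(self):
            """Called once after all updates, before any numericalization."""

        @abstractmethod
        def numericalize(self, tokens):
            """Convert tokens to their numeric representation."""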
The name is just a placeholder for now.
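To illustrate how a vocabulary could fit this interface, here is a minimal sketch. `SimpleVocab` is a hypothetical stand-in and not the existing Vocab class; its `unk_token` parameter and its frequency-based index order are assumptions made only for this example.

```python
from abc import ABC, abstractmethod
from collections import Counter


class SmartNumericalizer(ABC):  # placeholder name from the proposal
    @abstractmethod
    def update(self, tokens): ...

    @abstractmethod
    def finalize(self): ...

    @abstractmethod
    def numericalize(self, tokens): ...


class SimpleVocab(SmartNumericalizer):
    """Hypothetical stand-in for Vocab: maps each token to an integer index."""

    def __init__(self, unk_token="<unk>"):
        self.unk_token = unk_token
        self._counts = Counter()
        self._stoi = None  # built in finalize()

    def update(self, tokens):
        if self._stoi is not None:
            raise RuntimeError("update() called after finalize()")
        self._counts.update(tokens)

    def finalize(self):
        # Most frequent tokens get the lowest indices; ties are broken
        # alphabetically so the mapping is deterministic.
        ordered = sorted(self._counts, key=lambda t: (-self._counts[t], t))
        itos = [self.unk_token] + [t for t in ordered if t != self.unk_token]
        self._stoi = {token: index for index, token in enumerate(itos)}

    def numericalize(self, tokens):
        if self._stoi is None:
            raise RuntimeError("finalize() must be called first")
        unk_index = self._stoi[self.unk_token]
        return [self._stoi.get(token, unk_index) for token in tokens]


vocab = SimpleVocab()
vocab.update(["a", "b", "a"])
vocab.update(["c", "a"])
vocab.finalize()
indices = vocab.numericalize(["a", "b", "zzz"])
```

Tokens that were never passed to `update` map to the index of the unknown token.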
One implementation could be a Word2Vec numericalizer that remembers which tokens appeared in the dataset (through the update method, similarly to what Vocab does now) and loads the vectors for those tokens when finalize is called. I assume TF-IDF could be implemented in a similar fashion.
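A rough sketch of the embedding case, under a few assumptions: `EmbeddingNumericalizer`, the `load_vectors` callable and the zero-vector fallback for unknown tokens are all hypothetical, and a small in-memory table stands in for a real word2vec file. The class is duck-typed here to keep the example self-contained; it would subclass the proposed interface in practice.

```python
class EmbeddingNumericalizer:
    """Hypothetical numericalizer that loads vectors only for seen tokens."""

    def __init__(self, load_vectors, dim):
        # load_vectors: callable taking a set of tokens and returning a
        # dict {token: vector} for those it can find (an assumption).
        self._load_vectors = load_vectors
        self._dim = dim
        self._seen = set()
        self._vectors = None  # filled in finalize()

    def update(self, tokens):
        # Only remember which tokens occur; nothing is loaded yet.
        self._seen.update(tokens)

    def finalize(self):
        # Load vectors for the tokens collected during update().
        self._vectors = self._load_vectors(self._seen)

    def numericalize(self, tokens):
        if self._vectors is None:
            raise RuntimeError("finalize() must be called first")
        zero = [0.0] * self._dim
        # Tokens without a loaded vector fall back to a zero vector.
        return [list(self._vectors.get(token, zero)) for token in tokens]


# Toy stand-in for a pretrained embedding file.
PRETRAINED = {
    "cat": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "fish": [0.5, 0.5],
}


def load_from_table(wanted):
    return {t: PRETRAINED[t] for t in wanted if t in PRETRAINED}


numericalizer = EmbeddingNumericalizer(load_from_table, dim=2)
numericalizer.update(["cat", "bird"])
numericalizer.finalize()
vectors = numericalizer.numericalize(["cat", "dog", "bird"])
```

Here "dog" exists in the table but was never seen during `update`, so its vector is not loaded and it falls back to zeros, just like "bird", which is missing from the table altogether.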
The main reason for implementing this would be to make more advanced numericalization straightforward and avoid user intervention after batching (as is required now).