WordTokenizers.jl: Basic tools for tokenizing natural language in Julia

WordTokenizers.jl is a tool to help users of the Julia programming language (Bezanson, Edelman, Karpinski, & Shah, 2014) work with natural language. In natural language processing (NLP), tokenization refers to breaking text up into its constituent parts, the tokens; typically, this means splitting a sentence into words and other tokens such as punctuation. Complementary to word tokenization is sentence segmentation or sentence splitting (occasionally also called sentence tokenization), in which a document is broken up into sentences that can then be tokenized into words. Tokenization and sentence segmentation are among the most fundamental operations performed before applying most NLP or information retrieval algorithms.
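The two operations compose naturally: segment first, then tokenize each sentence. A minimal sketch of this workflow is shown below, assuming the `tokenize` and `split_sentences` functions exported by WordTokenizers.jl; the sample text and expected output are illustrative.

```julia
using WordTokenizers  # install with: using Pkg; Pkg.add("WordTokenizers")

text = "The quick brown fox jumped over the lazy dog. The dog did not notice!"

# Sentence segmentation: break the document into sentences,
# then word tokenization: break each sentence into tokens.
for sentence in split_sentences(text)
    println(tokenize(sentence))
end
# Expected output along the lines of:
# ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog", "."]
# ["The", "dog", "did", "not", "notice", "!"]
```

Note that punctuation marks come out as tokens in their own right rather than being attached to adjacent words, which is the behavior most downstream NLP algorithms expect.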

WordTokenizers.jl is currently used by packages such as TextAnalysis.jl, Transformers.jl, and CorpusLoaders.jl for tokenizing text.
