nlptools

txttk.nlptools.sent_tokenize(context)[source]

Cut the given context into sentences, avoiding linebreaks between paired symbols, inside floating-point numbers, and after some abbreviations. Nothing is discarded by sent_tokenize: every whitespace character, tab, and linebreak is kept, so a simple ''.join(sents) reconstructs the original context.

>>> context = "I love you. Please don't leave."
>>> sent_tokenize(context)
["I love you. ", "Please don't leave."]
txttk.nlptools.sent_count(context)[source]

Return the sentence count for the given context

>>> context = "I love you. Please don't leave."
>>> sent_count(context)
2
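
This is presumably equivalent to counting the output of sent_tokenize (an assumption about the implementation; only the result is documented):

>>> len(sent_tokenize(context))
2
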
txttk.nlptools.clause_tokenize(sentence)[source]

Split the sentence on commas and parentheses, but only if each resulting clause has more than three words

>>> context = 'While I was walking home, this bird fell down in front of me.'
>>> clause_tokenize(context)
['While I was walking home,', ' this bird fell down in front of me.']
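
A minimal sketch of the described rule (a hypothetical reimplementation, not the library's actual code): split after commas and around parentheses, and keep the split only when every resulting clause has more than three words.

>>> import re
>>> def clause_tokenize_sketch(sentence):
...     # Zero-width splits keep the delimiters; needs Python 3.7+.
...     parts = re.split(r'(?<=,)|(?=\()|(?<=\))', sentence)
...     if all(len(part.split()) > 3 for part in parts):
...         return parts
...     return [sentence]
>>> clause_tokenize_sketch('While I was walking home, this bird fell down in front of me.')
['While I was walking home,', ' this bird fell down in front of me.']
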
txttk.nlptools.word_tokenize(sentence)[source]

A generator which yields tokens from the given sentence without deleting anything.

>>> context = "I love you. Please don't leave."
>>> list(word_tokenize(context))
['I', ' ', 'love', ' ', 'you', '.', ' ', 'Please', ' ', 'don', "'", 't', ' ', 'leave', '.']
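
As with sent_tokenize, no characters are deleted, so joining the tokens reproduces the sentence:

>>> ''.join(word_tokenize(context)) == context
True
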
txttk.nlptools.slim_stem(token)[source]

A very simple stemmer, for stemming GO entity names.

>>> token = 'interaction'
>>> slim_stem(token)
'interact'
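
A sketch of the idea behind such a minimal stemmer; the suffix list here is hypothetical, and the real slim_stem may apply different rules:

>>> def slim_stem_sketch(token):
...     # Hypothetical suffix list, for illustration only.
...     for suffix in ('ation', 'ion', 'ing', 's'):
...         if token.endswith(suffix) and len(token) - len(suffix) >= 3:
...             return token[:-len(suffix)]
...     return token
>>> slim_stem_sketch('interaction')
'interact'
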
txttk.nlptools.powerset(iterable)[source]

powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)
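
The documented output matches the standard itertools powerset recipe; a sketch assuming the implementation follows it:

>>> from itertools import chain, combinations
>>> def powerset_sketch(iterable):
...     s = list(iterable)
...     return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
>>> list(powerset_sketch([1, 2, 3]))
[(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
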
txttk.nlptools.ngram(n, iter_tokens)[source]

Return a generator of n-grams from an iterable of tokens
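
A minimal sliding-window sketch of n-gram generation (hypothetical; the actual output type of ngram is not documented here):

>>> def ngram_sketch(n, iter_tokens):
...     # Materialize the iterable, then slide a window of width n.
...     tokens = list(iter_tokens)
...     return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
>>> list(ngram_sketch(2, ['I', 'love', 'you']))
[('I', 'love'), ('love', 'you')]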

txttk.nlptools.power_ngram(iter_tokens)[source]

Generate unigrams, bigrams, trigrams ... up to the max-gram. Unlike powerset(), this function will not generate skipped combinations such as (1,3)
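
A self-contained sketch of the described behavior, chaining n-grams of every order (hypothetical implementation):

>>> from itertools import chain
>>> def power_ngram_sketch(iter_tokens):
...     tokens = list(iter_tokens)
...     return chain.from_iterable(
...         (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
...         for n in range(1, len(tokens) + 1))
>>> list(power_ngram_sketch([1, 2, 3]))
[(1,), (2,), (3,), (1, 2), (2, 3), (1, 2, 3)]

The skipped combination (1, 3) does not appear, unlike in powerset().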