nlptools¶
- txttk.nlptools.sent_tokenize(context)[source]¶
  Cut the given context into sentences. Avoid a linebreak inside paired symbols, floating-point numbers, and some abbreviations. Nothing is discarded by sent_tokenize: ''.join(sents) reproduces the original context. Every whitespace, tab, and linebreak is kept.
  >>> context = "I love you. Please don't leave."
  >>> sent_tokenize(context)
  ["I love you. ", "Please don't leave."]
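The round-trip guarantee above (splitting without losing a single character) can be sketched with a naive re-implementation. This is not the actual txttk code; the real function also protects paired symbols, floats, and abbreviations, which this sketch does not:

```python
import re

def sent_tokenize_sketch(context):
    # Split at the zero-width boundary after ". ", "! " or "? " when the
    # next character is a capital letter. Because the split point has zero
    # width, every character of the input survives in some piece, so
    # ''.join(sents) == context holds by construction.
    return re.split(r'(?<=[.!?] )(?=[A-Z])', context)

context = "I love you. Please don't leave."
sents = sent_tokenize_sketch(context)
# → ["I love you. ", "Please don't leave."]
# ''.join(sents) == context
```

Note the trailing space stays attached to the preceding sentence, matching the doctest output above.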
- txttk.nlptools.sent_count(context)[source]¶
  Return the sentence count for the given context.
  >>> context = "I love you. Please don't leave."
  >>> sent_count(context)
  2
- txttk.nlptools.clause_tokenize(sentence)[source]¶
  Split on a comma or parenthesis, provided each resulting clause has more than three words.
  >>> context = 'While I was walking home, this bird fell down in front of me.'
  >>> clause_tokenize(context)
  ['While I was walking home,', ' this bird fell down in front of me.']
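A minimal sketch of the comma case, assuming the "more than three words" rule means a split is only kept when both sides of the comma exceed three words (this interpretation, and the function name, are guesses; the real txttk implementation also handles parentheses):

```python
import re

def clause_tokenize_sketch(sentence):
    # Split after each comma (zero-width boundary keeps the comma attached
    # to the left piece), then merge a piece back into its neighbour when
    # either side of the split has three words or fewer.
    pieces = re.split(r'(?<=,)', sentence)
    clauses = [pieces[0]]
    for piece in pieces[1:]:
        if len(clauses[-1].split()) > 3 and len(piece.split()) > 3:
            clauses.append(piece)
        else:
            clauses[-1] += piece
    return clauses

clause_tokenize_sketch('While I was walking home, this bird fell down in front of me.')
# → ['While I was walking home,', ' this bird fell down in front of me.']
```

As with sentence splitting, no characters are dropped, so joining the clauses reproduces the input sentence.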
- txttk.nlptools.word_tokenize(sentence)[source]¶
  A generator which yields tokens from the given sentence without deleting anything.
  >>> context = "I love you. Please don't leave."
  >>> list(word_tokenize(context))
  ['I', ' ', 'love', ' ', 'you', '.', ' ', 'Please', ' ', 'don', "'", 't', ' ', 'leave', '.']
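A lossless tokenizer like this can be sketched with a single regular expression that classifies every character into a word, whitespace, or punctuation run; this is an illustration, not the actual txttk implementation:

```python
import re

def word_tokenize_sketch(sentence):
    # \w+ matches word runs, \s+ whitespace runs, and [^\w\s] single
    # punctuation characters. Together they cover every character, so
    # nothing is deleted and ''.join(tokens) reproduces the input.
    for match in re.finditer(r"\w+|\s+|[^\w\s]", sentence):
        yield match.group()

list(word_tokenize_sketch("I love you. Please don't leave."))
# → ['I', ' ', 'love', ' ', 'you', '.', ' ', 'Please', ' ', 'don', "'", 't', ' ', 'leave', '.']
```

Because punctuation matches one character at a time, "don't" splits into 'don', "'", 't', matching the doctest output above.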