Removing stop words with NLTK in Python

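When computers process natural language, some extremely common words contribute almost nothing to selecting the documents that match a user's need, so they are dropped from the vocabulary entirely. These words are called stop words.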

For example, given the input sentence:

John is a person who takes care of the people around him.

removing the stop words produces the output:

['John', 'person', 'takes', 'care', 'people', 'around', '.']

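NLTK ships with stop word lists for several languages, available as stopwords in the nltk.corpus module. The lists belong to the NLTK data package named stopwords, which has to be downloaded once before use. You can use them to filter the stop words out of a sentence.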

Example

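import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the stop word lists and the tokenizer model.
# Newer NLTK releases may need 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords')
nltk.download('punkt')

my_sent = "John is a person who takes care of people around him."
tokens = word_tokenize(my_sent)

# Use only the English list; stopwords.words() with no argument returns
# the lists for every language combined. A set makes lookups fast.
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in tokens if w not in stop_words]

print(filtered_sentence)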

Output

This gives the output:

['John', 'person', 'takes', 'care', 'people', 'around', '.']
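
Note that NLTK's English stop words are all lowercase, so a capitalized stop word such as "The" at the start of a sentence is not removed by a plain membership test. The following is a minimal sketch of case-insensitive filtering, written in pure Python so it runs without NLTK. The names SAMPLE_STOP_WORDS and remove_stop_words are hypothetical, the stop word set is a tiny hand-picked subset rather than NLTK's list, and the regular-expression tokenizer is a simple stand-in for word_tokenize.

```python
import re

# Hypothetical, hand-picked subset of English stop words for illustration only;
# NLTK's own English list is much longer.
SAMPLE_STOP_WORDS = {"a", "is", "of", "the", "who", "him"}


def remove_stop_words(text, stop_words):
    """Return the tokens of `text` that are not stop words (case-insensitive)."""
    # Normalize the stop words once and keep them in a set for fast lookups.
    stop_set = {w.lower() for w in stop_words}
    # Simple tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Compare lowercased tokens, but keep the original casing in the result.
    return [t for t in tokens if t.lower() not in stop_set]


result = remove_stop_words("The person who takes care of him is John.", SAMPLE_STOP_WORDS)
print(result)  # ['person', 'takes', 'care', 'John', '.']
```

The same function should work with NLTK's list by passing stopwords.words('english') as the second argument, since it accepts any iterable of words.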