Part of Speech Tagging with Stop Words Using NLTK in Python?


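The main idea behind Natural Language Processing is that a machine can carry out some form of analysis or processing without human intervention, at least to the point of understanding part of what a text means or is trying to say.

While processing text, a computer needs to filter out useless or less important data (words). In NLTK, these low-value words are called stop words.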

Installing the Required Library

First, you need the nltk library. Just run the command below in your terminal.

$ pip install nltk

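We will remove these stop words so that they neither take up space in a database nor consume valuable processing time.

You can build your own list of words to treat as stop words. NLTK also ships with a default set of words it considers stop words. You can access it through the NLTK corpus as follows.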

>>> import nltk
>>> from nltk.corpus import stopwords

Below is the list of NLTK stop words.

>>> set(stopwords.words('english'))
{'not', 'other', 'shan', "hadn't", 'she', 'did', 'through', 'and', 'does', "that'll", "weren't", 'your', "should've", "hasn't", 'myself', 'should', 'because', 'wasn', 'what', 'to', 'this', 'was', 'more', 'y', 'again', "needn't", 'into', 'above', 'themselves', 'd', "won't", 'during', 'haven', 'both', "shan't", 'their', 'on', 'hadn', 'up', 'once', 'its', 'against', 'before', 't', 'while', 'needn', 'doing', "don't", 'yourselves', 'until', 'is', 'all', 's', 'will', "you've", 'being', 'under', 'they', 'ours', 'wouldn', 'of', 'didn', 'below', 'just', 'ma', 'yours', "you'll", 'mightn', 'where', 'are', 'that', 'those', 'most', 'them', 'if', 'you', "shouldn't", 'off', 'for', 'her', 'such', 'now', 'than', 're', 'no', 'm', 'or', "aren't", 'further', 'here', "wasn't", 'after', "haven't", 'my', 'himself', 'at', 'had', 'yourself', 'by', 'weren', 'only', 'have', 'we', 'do', 'same', "isn't", 'herself', 'll', 'down', 'then', 'why', 'own', 'him', 'so', 'having', 'nor', 'isn', 'few', 'how', 'each', 'there', 'with', 'couldn', 'about', 'very', 'am', 'me', "didn't", "doesn't", 'which', "she's", 'doesn', 'were', 'he', 'in', "mightn't", 'when', 'our', 'who', 'his', "couldn't", 'the', "you'd", 'be', 'hers', 'hasn', 'between', 'it', 'mustn', 'but', 'out', 'can', "wouldn't", 'ourselves', 'whom', 'been', 'these', 'aren', 'over', 'itself', 'a', 'i', 'too', 'theirs', 'some', "you're", 'as', 'won', "it's", 'from', 'o', 'don', 'any', 've', 'ain', 'has', 'an', "mustn't", 'shouldn'}
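
Since you can also define your own stop words, here is a minimal sketch of that idea in plain Python. The word list, the sample sentence, and the helper name remove_stop_words are assumptions made for illustration and are not part of NLTK. The sentence is split on whitespace with str.split() instead of an NLTK tokenizer to keep the example self-contained.

```python
# Hypothetical custom stop word list; the words are chosen only for illustration.
custom_stop_words = {"a", "an", "the", "is", "of", "to"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving their order."""
    return [t for t in tokens if t not in stop_words]


# Plain whitespace split stands in for a real tokenizer here.
tokens = "the quick brown fox is a friend of the dog".split()
filtered = remove_stop_words(tokens, custom_stop_words)
print(filtered)  # ['quick', 'brown', 'fox', 'friend', 'dog']
```

To extend the built-in list instead of replacing it, one option is a set union such as set(stopwords.words('english')) | custom_stop_words.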

Below is a complete program that shows how to use the stop word list to remove stop words from a piece of text.

Example code

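# The stop word list and tokenizer models are separate NLTK data packages.
# Fetch them once with nltk.download('stopwords') and nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab' for the tokenizer).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize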

example_sent = "Python is a powerful high-level, object-oriented programming language created by Guido van Rossum."\
"It has simple easy-to-use syntax, making it the perfect language for someone trying to learn computer programming for the first time."\
"This is a comprehensive guide on how to get started in Python, why you should learn it and how you can learn it. However, if you knowledge "\
"of other programming languages and want to quickly get started with Python."

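# Built-in English stop words, stored in a set for fast membership tests
stop_words = set(stopwords.words('english'))

# Split the text into word and punctuation tokens
word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not stop words
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# The same filter written as an explicit loop
filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)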

print(word_tokens)
print(filtered_sentence)

Output

Text output: without filter (stop words included)

['Python', 'is', 'a', 'powerful', 'high-level', ',', 'object-oriented', 'programming', 'language', 'created', 'by', 'Guido', 'van', 'Rossum.It', 'has', 'simple', 'easy-to-use', 'syntax', ',', 'making', 'it', 'the', 'perfect', 'language', 'for', 'someone', 'trying', 'to', 'learn', 'computer', 'programming', 'for', 'the', 'first', 'time.This', 'is', 'a', 'comprehensive', 'guide', 'on', 'how', 'to', 'get', 'started', 'in', 'Python', ',', 'why', 'you', 'should', 'learn', 'it', 'and', 'how', 'you', 'can', 'learn', 'it', '.', 'However', ',', 'if', 'you', 'knowledge', 'of', 'other', 'programming', 'languages', 'and', 'want', 'to', 'quickly', 'get', 'started', 'with', 'Python', '.']
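
Tokens such as 'Rossum.It' and 'time.This' in the list above appear because Python joins adjacent string literals with nothing in between, so the sentences in example_sent run together. The sketch below shows the effect on a shortened version of the text. It uses str.split() rather than word_tokenize, so the exact tokens differ from NLTK's, but the fused word is the same.

```python
# Adjacent string literals are concatenated with no separator.
without_space = ("created by Guido van Rossum."
                 "It has simple easy-to-use syntax.")

# A trailing space at the end of the first literal keeps the sentences apart.
with_space = ("created by Guido van Rossum. "
              "It has simple easy-to-use syntax.")

print(without_space.split())
print(with_space.split())
```

In the first list 'Rossum.It' is a single token, while in the second 'Rossum.' and 'It' are separate.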

Text output: with filter (stop words removed)

['Python', 'powerful', 'high-level', ',', 'object-oriented', 'programming', 'language', 'created', 'Guido', 'van', 'Rossum.It', 'simple', 'easy-to-use', 'syntax', ',', 'making', 'perfect', 'language', 'someone', 'trying', 'learn', 'computer', 'programming', 'first', 'time.This', 'comprehensive', 'guide', 'get', 'started', 'Python', ',', 'learn', 'learn', '.', 'However', ',', 'knowledge', 'programming', 'languages', 'want', 'quickly', 'get', 'started', 'Python', '.']
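
One thing to keep in mind is that the NLTK stop words shown earlier are all lowercase, and the membership test in the program is case-sensitive, so a capitalized stop word at the start of a sentence would pass through. The following is a minimal sketch of a case-insensitive variant. The token list is made up for illustration, and the small stop word set is a hand-picked subset of the NLTK list, not the full corpus.

```python
# Hand-picked subset of the NLTK English stop words, for illustration only.
stop_words = {"the", "is", "a", "it", "this"}

# Hypothetical token list, as a tokenizer might produce it.
tokens = ["This", "is", "a", "guide", ".", "It", "covers", "the", "basics", "."]

# Case-sensitive check: capitalized stop words are kept.
case_sensitive = [w for w in tokens if w not in stop_words]

# Lowercase each token only for the comparison; the kept tokens are unchanged.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)    # ['This', 'guide', '.', 'It', 'covers', 'basics', '.']
print(case_insensitive)  # ['guide', '.', 'covers', 'basics', '.']
```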