Text Analysis in Python 3


In this assignment we work with files. Files are an essential part of every computer system, and an operating system itself consists of many files.

Python distinguishes two types of files: text files and binary files.
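As a minimal sketch of that difference, the snippet below writes a small temporary file and reads it back in both modes. The file name sample.txt and its contents are made up for illustration and are not part of the original example.

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("h1 { color: red; }")

# Text mode ("r"): the bytes are decoded and read() returns a str
with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()

# Binary mode ("rb"): read() returns the raw bytes
with open(path, "rb") as f:
    as_bytes = f.read()

print(type(as_text).__name__, type(as_bytes).__name__)
```

All string methods used in the rest of this article (split, isdigit, isupper) operate on the str returned in text mode.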

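This article covers text files only.

We focus on a few important measurements of a text file:

Number of words
Number of characters
Average word length
Number of stop words
Number of special characters
Number of numerics
Number of uppercase words

We have a test file, "css3.txt", and every example below works on that file.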

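Number of words

To count the words in a sentence, we use the split() function, which is the easiest approach. Here we apply split() to the whole contents of the file.

Example code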

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   # Read the whole file into a single string
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, the file " + filename + " was not found."
   print(message)
else:
   # split() with no argument splits on any run of whitespace
   words = contents.split()
   number_words = len(words)
   print("Total words of " + filename, "is", str(number_words))

Output

Total words of C:/Users/TP/Desktop/css3.txt is 3574
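To see why split() is called without an argument, here is a small sketch on a made-up sample string (not taken from css3.txt). It compares split() with split(" "), which treats every single space as a separator.

```python
# Made-up sample text with a double space and a newline
sample = "font-size:  12px;\ncolor: red;"

# split() with no argument splits on any run of whitespace
words = sample.split()

# split(" ") splits on each single space, so it can produce empty
# strings and leaves the newline inside a token
naive = sample.split(" ")

print(words)
print(naive)
```

With split(" "), the double space yields an empty string and the newline stays inside one token, so the resulting list does not correspond to the words of the text.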

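Number of characters

Here we count the characters in the file by summing the length of every word. If the length of a word is 5, that word has 5 characters. Because split() discards whitespace, spaces and newlines are not included in the total.

Example code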

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, the file " + filename + " was not found."
   print(message)
else:
   wordslist = contents.split()
   # Sum the lengths of all words; whitespace is not counted
   characters = sum(len(word) for word in wordslist)
   print("TOTAL CHARACTERS IN A TEXT FILE =", characters)

Output

TOTAL CHARACTERS IN A TEXT FILE = 17783
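The count above excludes whitespace. As a rough sketch on a made-up string (an assumption for illustration, not the contents of css3.txt), the two ways of counting can be compared like this:

```python
contents = "h1 {\n  color: red;\n}"  # made-up sample text

# Characters excluding whitespace, as in the example above
non_space = sum(len(word) for word in contents.split())

# All characters, including spaces and newlines
total = len(contents)

print(non_space, total)
```

Which of the two numbers is the right one depends on whether whitespace should count as a character for the analysis at hand.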

Average word length

Here we compute the sum of the lengths of all words and divide it by the number of words.

Example code

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, the file " + filename + " was not found."
   print(message)
else:
   wordslist = contents.split()
   words = len(wordslist)
   # Total length of all words divided by the word count (0 for an empty file)
   average = sum(len(word) for word in wordslist) / words if words else 0
   print("Average=", average)

Output

Average= 4.97
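The same calculation can be wrapped in a small function. This is only a sketch: the name average_word_length is hypothetical, and returning 0.0 for text without words is one possible choice for avoiding a division by zero.

```python
def average_word_length(text):
    """Return the mean word length of text, or 0.0 if it has no words."""
    words = text.split()
    if not words:  # avoid ZeroDivisionError on empty input
        return 0.0
    return sum(len(word) for word in words) / len(words)

# Made-up sample inputs
print(round(average_word_length("color red blue"), 2))
print(average_word_length(""))
```

round(value, 2) limits the printed result to two decimal places when a shorter figure is preferred.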

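Number of stop words

Stop words are very common words, such as "is" and "a", that usually carry little meaning on their own. To find them, we use NLTK, an NLP library for Python.

Example code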

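from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# The stop word list and the tokenizer data must be downloaded once, e.g.
# nltk.download('stopwords') and nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead)
my_example_sent = "This is a sample sentence"
mystop_words = set(stopwords.words('english'))
my_word_tokens = word_tokenize(my_example_sent)
# Keep only the tokens that are not stop words; compare in lowercase
# because the NLTK stop word list is lowercase
my_filtered_sentence = [w for w in my_word_tokens if w.lower() not in mystop_words]
print(my_word_tokens)
print(my_filtered_sentence)
# The stop word count is the number of tokens that were filtered out
print("Total stop words =", len(my_word_tokens) - len(my_filtered_sentence))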
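The same counting idea can be sketched without NLTK. The stop word set below is a tiny hand-picked assumption for illustration, far shorter than NLTK's English list, and count_stop_words is a hypothetical helper name.

```python
# Tiny illustrative stop word set (an assumption, not the NLTK list)
STOP_WORDS = {"this", "is", "a", "the", "of", "and", "to", "in"}

def count_stop_words(text, stop_words=STOP_WORDS):
    """Count whitespace-separated words found in stop_words, ignoring case."""
    return sum(1 for word in text.split() if word.lower() in stop_words)

print(count_stop_words("This is a sample sentence"))
```

Because this version splits on whitespace only, a word followed by punctuation, such as "is,", would not match. A tokenizer like word_tokenize separates the punctuation first.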

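Number of special characters

Here we can count the hashtags or mentions in a text, which helps to extract additional information from text data. The example below counts the words that consist of the special character "#" on its own.

Example code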

import collections as ct

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, the file " + filename + " was not found."
   print(message)
else:
   words = contents.split()
   special_chars = "#"
   # Add up the occurrences of every word that is exactly the special character
   new = sum(v for k, v in ct.Counter(words).items() if k in special_chars)
   print("Total Special Characters", new)

Output

Total Special Characters 0
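To count hashtags and mentions themselves, that is, words that begin with "#" or "@", one possible sketch uses str.startswith with a tuple of prefixes. The function name count_tagged_words and the sample sentence are made up for illustration.

```python
def count_tagged_words(text, prefixes=("#", "@")):
    """Count words longer than one character that start with a prefix."""
    return sum(1 for word in text.split()
               if len(word) > 1 and word.startswith(prefixes))

# Made-up sample: two tagged words and one standalone "#"
sample = "Learning #python with @alice is fun # not counted"
print(count_tagged_words(sample))
```

In a CSS file, "#" also starts id selectors and hexadecimal colors, so this count would include those as well.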

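Number of numerics

Here we count the numeric items in the text file, that is, the words that consist entirely of digits. The approach is similar to counting the characters in the words.

Example code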

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, the file " + filename + " was not found."
   print(message)
else:
   # str.isdigit is True only for words made up entirely of digits
   numerics = sum(map(str.isdigit, contents.split()))
   print("TOTAL NUMERIC IN A TEXT FILE =", numerics)

Output

TOTAL NUMERIC IN A TEXT FILE = 2
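str.isdigit does not recognize decimals or negative numbers. As a rough sketch, a hypothetical helper is_number can try to parse each token with float() instead. The sample tokens are made up.

```python
def is_number(token):
    """Return True if token parses as a float (hypothetical helper)."""
    try:
        float(token)
        return True
    except ValueError:
        return False

tokens = "width 100 opacity 0.5 margin -2 size 12px".split()

digit_only = sum(map(str.isdigit, tokens))  # tokens made up only of digits
parsed = sum(map(is_number, tokens))        # tokens that float() accepts

print(digit_only, parsed)
```

Note that float() also accepts strings such as "inf" and "nan", so this check is broader than "looks like a number" in everyday text.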

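Number of uppercase words

Using the isupper() function, we can count the uppercase words in the text, that is, the words whose letters are all capitals.

Example code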

filename = "C:/Users/TP/Desktop/css3.txt"
try:
   with open(filename) as file_object:
      contents = file_object.read()
except FileNotFoundError:
   message = "Sorry, the file " + filename + " was not found."
   print(message)
else:
   # str.isupper is True for words whose letters are all uppercase
   uppercase_words = sum(map(str.isupper, contents.split()))
   print("TOTAL UPPERCASE WORDS IN A TEXT FILE =", uppercase_words)

Output

TOTAL UPPERCASE WORDS IN A TEXT FILE = 121
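To make clear which words isupper() accepts, here is a short sketch on a few made-up tokens. Digits and punctuation are ignored by the check, but a token needs at least one letter to count.

```python
tokens = ["CSS3", "HTML,", "Css", "css", "123", "A"]

# str.isupper() is True when the token has at least one cased character
# and all of its cased characters are uppercase
flags = {token: token.isupper() for token in tokens}

print(flags)
print(sum(flags.values()))
```

Single capital letters such as "A" are counted too, which may or may not be desired depending on the analysis.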