Getting Text from a Wikipedia Infobox in Python


In this article, we will scrape text from a Wikipedia infobox using Python's BeautifulSoup and requests libraries. It is simple and takes about 10 minutes.

You need to install bs4 and requests. Run the commands below to install them.

pip install bs4
pip install requests

Follow the steps below to write code that gets the text you want from the infobox.

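Import the bs4 and requests modules.
Send an HTTP request to the page you want to fetch data from using the requests.get() method.
Parse the response text with the bs4.BeautifulSoup class and store the result in a variable.
Open the Wikipedia page in a browser and inspect the element you want.
Find the element using a suitable method provided by bs4.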
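The steps above can be sketched as two small functions. This is a minimal sketch, not a definitive implementation: the names fetch_html, parse_infobox, HEADERS, and SAMPLE_HTML are hypothetical, the User-Agent string is a placeholder, and the sample markup is a stand-in so that the parsing part runs without network access.

```python
import requests
import bs4

# Hypothetical User-Agent string; replace the contact details with your own.
HEADERS = {"User-Agent": "infobox-tutorial/0.1 (contact: you@example.com)"}


def fetch_html(url):
    """Send the HTTP request and return the response body as text."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def parse_infobox(html):
    """Parse the HTML and return the infobox table, or None if absent."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup.find("table", class_="infobox")


# Tiny stand-in page (an assumption, not real Wikipedia markup).
SAMPLE_HTML = """
<html><body>
<table class="infobox vcard">
  <tr><th>Capital</th><td>New Delhi</td></tr>
</table>
</body></html>
"""

infobox = parse_infobox(SAMPLE_HTML)
print(infobox.td.text)  # New Delhi
```

For a live page, parse_infobox(fetch_html(URL)) would combine both steps. Checking the result for None before using it avoids an AttributeError when a page has no infobox.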

Let's look at the example code below.

Example

# importing the modules
import requests
import bs4

# URL
URL = "https://en.wikipedia.org/wiki/India"

# sending the request
response = requests.get(URL)

# parsing the response
soup = bs4.BeautifulSoup(response.text, 'html.parser')

# Now we have the parsed HTML. We want to get the motto from the Wikipedia page.
# Elements structure
# table - class="infobox"
# 3rd tr to get motto

# getting infobox
infobox = soup.find('table', {'class': 'infobox'})

# getting 3rd row element tr
third_tr = infobox.find_all('tr')[2]

# from third_tr we have to find first 'a' element and 'div' element to get required data
first_a = third_tr.div.find('a')
div = third_tr.div.div

# motto
motto = f"{first_a.text} {div.text[:len(div.text) - 3]}"

# printing the motto
print(motto)

Running the above program produces the following result.

Output

Satyameva Jayate "Truth Alone Triumphs"
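Picking a row by its position depends on the page layout at the time of writing and may stop working if the infobox changes. As an alternative sketch, a row can be looked up by its header label instead. The helper infobox_value and the sample markup below are hypothetical, and the approach assumes the row has a th header cell and a td value cell, which is not true of every infobox row.

```python
import bs4

# Hypothetical markup that imitates an infobox; real pages are more complex.
SAMPLE_HTML = """
<table class="infobox">
  <tr><th>Capital</th><td>New Delhi</td></tr>
  <tr><th>Currency</th><td>Indian rupee</td></tr>
</table>
"""


def infobox_value(soup, label):
    """Return the text of the cell whose row header contains `label`, or None."""
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return None
    for row in infobox.find_all("tr"):
        header = row.find("th")
        cell = row.find("td")
        if header is None or cell is None:
            continue  # skip rows without a header/value pair
        if label.lower() in header.get_text().lower():
            return cell.get_text(" ", strip=True)
    return None


soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
print(infobox_value(soup, "Currency"))  # Indian rupee
print(infobox_value(soup, "Anthem"))    # None
```

Returning None for a missing label lets the caller decide how to handle pages where the field is absent.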

Conclusion

By inspecting and locating elements on a Wikipedia page, you can get the data you want. If you have any questions about the tutorial, mention them in the comment section.