How to Save HTML Table Data to CSV in Python


Problem:

One of the hardest tasks for a data scientist is collecting data. Yet the web holds plenty of data that can be extracted automatically.

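Introduction

I wanted to extract the basic operators data held in the HTML tables at https://www.tutorialspoint.com/python/python_basic_operators.htm.

Hmm, the data is scattered across several HTML tables. With only a single HTML table, I could obviously just copy and paste it into a .csv file.

But with five or more tables on one page, that quickly becomes painful, doesn't it?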

How to do it

1. If you just want to create a csv file, here is a quick look at how easy that is.

import csv

# Open the file in write mode; it is created if it does not exist.
# newline='' lets the csv module handle line endings itself.
with open('test.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)

    # Header row
    writer.writerow(('Column1', 'Column2', 'Column3'))

    # Data rows
    for i in range(20):
        writer.writerow((i, i + 1, i + 2))

# The file is closed automatically when the with block ends

Output

Running the code above creates a test.csv file in the current working directory, which is normally the directory you run the script from.
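To check what was actually written, the file can be read back with csv.reader. The following is a minimal sketch rather than part of the original recipe: the helper names write_demo_csv and read_csv are made up for illustration, and the file is written to a temporary directory so nothing in the working directory is overwritten.

```python
import csv
import os
import tempfile


def write_demo_csv(path):
    """Write a header row plus 20 data rows, mirroring the example above."""
    with open(path, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(('Column1', 'Column2', 'Column3'))
        for i in range(20):
            writer.writerow((i, i + 1, i + 2))


def read_csv(path):
    """Return every row of the CSV file as a list of strings."""
    with open(path, newline='') as csv_file:
        return list(csv.reader(csv_file))


# Hypothetical location: a fresh temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'test.csv')
write_demo_csv(demo_path)
demo_rows = read_csv(demo_path)

print(len(demo_rows))   # header row + 20 data rows
print(demo_rows[0])
print(demo_rows[1])
```

Note that csv.reader returns every value as a string, so the numbers written above come back as '0', '1', '2' and need converting if they are to be used as integers.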


2. Now let's fetch an HTML table from https://www.tutorialspoint.com/python/python_dictionary.htm and write it to a CSV file.

The first step is the imports.

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.tutorialspoint.com/python/python_dictionary.htm'

Open the HTML page with urlopen, store the response in the html object, and parse it with BeautifulSoup.

Example

html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

Next, find the tables in the HTML and get the table rows. For demonstration purposes we extract only the first table, index [0].

Example

table = soup.find_all('table')[0]
rows = table.find_all('tr')
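The same lookup can be tried without a network connection, because BeautifulSoup accepts a plain string as well as the object returned by urlopen. The sketch below uses a small made-up HTML snippet (SAMPLE_HTML is an assumption, not the real page) to show that find_all('table') returns a list, which is why the [0] index selects the first table.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page that contains two tables
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Sr.No.</th><th>Function</th></tr>
  <tr><td>1</td><td>len(dict)</td></tr>
</table>
<table>
  <tr><th>Operator</th></tr>
  <tr><td>+</td></tr>
</table>
</body></html>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# find_all returns every matching element, in document order
sample_tables = sample_soup.find_all('table')
table_count = len(sample_tables)

# Index 0 is the first table; its rows are the <tr> elements inside it
first_rows = sample_tables[0].find_all('tr')
first_header = [th.get_text() for th in first_rows[0].find_all('th')]

print(table_count)
print(first_header)
```

If a page has no table at all, find_all returns an empty list and the [0] index raises an IndexError, so checking the length first is a reasonable precaution.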

Example

print(rows)

Output

[<tr>
<th style='text-align:center;width:5%'>Sr.No.</th>
<th style='text-align:center;width:95%'>Function with Description</th>
</tr>, 
<tr>
<td class='ts'>1</td>
<td><a href='/python/dictionary_cmp.htm'>cmp(dict1, dict2)</a>
<p>Compares elements of both dict.</p></td>
</tr>, <tr>
<td class='ts'>2</td>
<td><a href='/python/dictionary_len.htm'>len(dict)</a>
<p>Gives the total length of the dictionary. This would be equal to the number of items in the dictionary.</p></td>
</tr>, 
<tr>
<td class='ts'>3</td>
<td><a href='/python/dictionary_str.htm'>str(dict)</a>
<p>Produces a printable string representation of a dictionary</p></td>
</tr>, 
<tr>
<td class='ts'>4</td>
<td><a href='/python/dictionary_type.htm'>type(variable)</a>
<p>Returns the type of the passed variable. If passed variable is dictionary, then it would return a dictionary type.</p></td>
</tr>]
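As the output shows, a cell can contain nested tags (a link followed by a paragraph) separated by a line break. A possible way to get one clean string per cell is to pass a separator and strip=True to get_text. This is a sketch under that assumption; ROW_HTML is copied from the first data row of the output above and wrapped in a table tag so it parses on its own.

```python
from bs4 import BeautifulSoup

# One row taken from the printed output, wrapped in <table> for parsing
ROW_HTML = """<table><tr>
<td class='ts'>1</td>
<td><a href='/python/dictionary_cmp.htm'>cmp(dict1, dict2)</a>
<p>Compares elements of both dict.</p></td>
</tr></table>"""

sample_row = BeautifulSoup(ROW_HTML, 'html.parser').find('tr')
cells = sample_row.find_all(['td', 'th'])

# Plain get_text() keeps the line break between the link and the paragraph
raw_cells = [cell.get_text() for cell in cells]

# A separator plus strip=True joins the text pieces with single spaces
clean_cells = [cell.get_text(' ', strip=True) for cell in cells]

print(raw_cells)
print(clean_cells)
```

Either form can be written to CSV. The cleaned form simply avoids multi-line cells in the resulting file.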

5. Now write the data to a csv file.

Example

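# newline='' lets the csv module handle line endings itself
with open('my_html_data_to_csv.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    for row in rows:
        # Collect the text of every header and data cell in this row
        filtered_row = []
        for cell in row.find_all(['td', 'th']):
            filtered_row.append(cell.get_text())
        writer.writerow(filtered_row)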

6. The result is now saved in the my_html_data_to_csv.csv file.
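Some cells in this table contain commas, for example cmp(dict1, dict2). csv.writer deals with that by quoting such cells. The sketch below writes to an in-memory buffer instead of a file so the raw CSV text can be inspected. The helper name rows_to_csv_text and the sample rows are assumptions made for illustration.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Serialise a list of rows to CSV text using an in-memory buffer."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows shaped like the cell text extracted from the table
sample_rows = [
    ['Sr.No.', 'Function with Description'],
    ['1', 'cmp(dict1, dict2) Compares elements of both dict.'],
]

csv_text = rows_to_csv_text(sample_rows)
print(csv_text)

# Reading the text back recovers the original cells, comma included
round_trip = list(csv.reader(io.StringIO(csv_text)))
print(round_trip == sample_rows)
```

The second line of the printed text has its last field wrapped in double quotes, which is how a reader knows the comma belongs to the cell rather than separating two columns.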

Putting together everything explained above:

Example

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

# set the url..
url = 'https://www.tutorialspoint.com/python/python_basic_syntax.htm'

# Open the url and parse the html
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# extract the first table
table = soup.find_all('table')[0]
rows = table.find_all('tr')

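# write the content to the file
# newline='' lets the csv module handle line endings itself
with open('my_html_data_to_csv.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    for row in rows:
        # Collect the text of every header and data cell in this row
        filtered_row = []
        for cell in row.find_all(['td', 'th']):
            filtered_row.append(cell.get_text())
        writer.writerow(filtered_row)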

The resulting file contains the table from the HTML page.
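The introduction pointed out that the real pain starts when a page holds five or more tables. One way to extend the recipe, sketched here as an assumption rather than as part of the original walkthrough, is to loop over every table and write each one to its own CSV file. The function name save_all_tables, the file name prefix, and the SAMPLE_PAGE snippet are all hypothetical. For a live page, the html argument could be the object returned by urlopen(url).

```python
import csv
import os
import tempfile

from bs4 import BeautifulSoup


def save_all_tables(html, out_dir, prefix='table'):
    """Write every <table> in html to its own CSV file; return the paths."""
    soup = BeautifulSoup(html, 'html.parser')
    paths = []
    for index, table in enumerate(soup.find_all('table'), start=1):
        path = os.path.join(out_dir, f'{prefix}_{index}.csv')
        with open(path, 'w', newline='', encoding='utf-8') as csv_file:
            writer = csv.writer(csv_file)
            for row in table.find_all('tr'):
                cells = row.find_all(['td', 'th'])
                writer.writerow([c.get_text(' ', strip=True) for c in cells])
        paths.append(path)
    return paths


# Hypothetical stand-in for a downloaded page with two tables
SAMPLE_PAGE = """
<table><tr><th>Sr.No.</th><th>Function</th></tr>
<tr><td>1</td><td>len(dict)</td></tr></table>
<table><tr><th>Operator</th><th>Description</th></tr>
<tr><td>+</td><td>Addition</td></tr></table>
"""

out_dir = tempfile.mkdtemp()
saved_paths = save_all_tables(SAMPLE_PAGE, out_dir)
print([os.path.basename(p) for p in saved_paths])

with open(saved_paths[0], newline='', encoding='utf-8') as f:
    first_table_rows = list(csv.reader(f))
print(first_table_rows)
```

One caveat: if tables are nested inside each other, table.find_all('tr') also returns the rows of the inner tables, so pages with nested tables would need extra filtering.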
