How can the scikit-learn library be used to split a dataset for training and testing in Python?


Scikit-learn, commonly known as sklearn, is a Python library used to implement machine learning algorithms. It is powerful and robust because it provides a wide variety of tools for statistical modeling.

These tools cover classification, regression, clustering, dimensionality reduction, and more, all through a consistent Python interface. The library is built on NumPy, SciPy, and Matplotlib.

Before input data is passed to a machine learning algorithm, it has to be split into a training dataset and a test dataset.

Fitting the chosen model means training it on the training dataset. During training, the model learns from that data.

It also learns to generalize to new data. The test dataset is not used while the model is being trained.

Once all hyperparameters have been tuned and the optimal weights have been set, the test dataset is given to the machine learning algorithm.

The test dataset is the data used to check how well the algorithm generalizes to new data. Let's look at how to split data using the scikit-learn library.

Example

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset and separate the features from the target values
my_data = load_iris()
X = my_data.data
y = my_data.target

# Hold out 20% of the samples for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Print the shape of each resulting array
print("The dimensions of the features of training data ")
print(X_train.shape)
print("The dimensions of the features of test data ")
print(X_test.shape)
print("The dimensions of the target values of training data ")
print(y_train.shape)
print("The dimensions of the target values of test data ")
print(y_test.shape)

Output

The dimensions of the features of training data
(120, 4)
The dimensions of the features of test data
(30, 4)
The dimensions of the target values of training data
(120,)
The dimensions of the target values of test data
(30,)
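
The shapes above follow directly from test_size=0.2 applied to the 150 samples in the iris dataset. As a rough check, the sketch below recomputes the proportion and also shows that calling train_test_split twice with the same random_state yields the same split. The variable names split_a, split_b, n_samples, and same are illustrative only and do not appear in the original example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and targets in one call (150 samples, 4 features)
X, y = load_iris(return_X_y=True)

# Two splits with identical arguments, matching the example above
split_a = train_test_split(X, y, test_size=0.2, random_state=2)
split_b = train_test_split(X, y, test_size=0.2, random_state=2)

X_train, X_test, y_train, y_test = split_a
n_samples = X.shape[0]

# Sample counts: total, training, test
print(n_samples, len(X_train), len(X_test))

# Fraction of the data that ended up in the test set
print(len(X_test) / n_samples)

# A fixed random_state should give the same arrays on every call
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

Changing or omitting random_state would generally produce a different assignment of rows to each set, while the sizes stay the same.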

Explanation

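The required packages are imported.
The dataset needed for this task is loaded into the environment.
The features and target values are separated from the dataset.
The data is split into training and test sets in an 80% to 20% ratio.
In other words, 20% of the data is used to check how well the model generalizes to new data.
The shapes of these splits are printed to the console.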
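
To connect the split to the workflow described earlier (train, tune hyperparameters, then evaluate on the test set), here is a minimal sketch. It rests on several assumptions that are not part of the original example: a KNeighborsClassifier as the model, a second train_test_split call that carves a validation set out of the training data, and a small hand-picked list of n_neighbors values. The names X_fit, X_val, best_k, and final_model are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% as the test set, as in the example above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Assumption: set aside 25% of the training data as a validation set for tuning
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=2
)

# Tune one hyperparameter (number of neighbors) without touching the test set
best_k, best_score = None, -1.0
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
    score = model.score(X_val, y_val)  # accuracy on the validation set
    if score > best_score:
        best_k, best_score = k, score

# Refit on the full training set, then evaluate once on the held-out test set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"chosen n_neighbors={best_k}, test accuracy={test_accuracy:.3f}")
```

Keeping the test set out of the tuning loop is what lets the final accuracy serve as an estimate of how the model handles data it has not seen.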