
How can the IMDB dataset be prepared for training in Python using TensorFlow?


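TensorFlow is a machine learning framework developed by Google. It is an open-source framework used with Python to implement algorithms, deep learning applications, and more. It is used for both research and production. It includes optimization techniques that help perform complex mathematical operations quickly.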

The 'tensorflow' package can be installed on Windows using the following command -

pip install tensorflow

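A tensor is the data structure used in TensorFlow. Tensors form the edges of a flow diagram, carrying data between its operations. This flow diagram is known as the 'data flow graph'. A tensor is essentially a multidimensional array or list.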
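
To make the "multidimensional array" description concrete, here is a minimal sketch that builds tensors of rank 0, 1, and 2 and prints their shapes. The variable names are illustrative only and are not part of the tutorial code.

```python
import tensorflow as tf

# A rank-0 tensor (scalar), a rank-1 tensor (vector), and a rank-2 tensor (matrix).
scalar = tf.constant(3.0)
vector = tf.constant([1, 2, 3])
matrix = tf.constant([[1, 2, 3], [4, 5, 6]])

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    # t.shape lists the size of each dimension; its length is the rank.
    print(name, "shape:", t.shape.as_list(), "rank:", len(t.shape), "dtype:", t.dtype.name)
```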

The 'IMDB' dataset contains 50,000 movie reviews. It is commonly used for natural language processing tasks.

We are using Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.
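
A quick way to confirm the environment is ready, whether in Colab or on a local machine, is to print the TensorFlow version and the number of GPUs it can see. This is only a sanity check; a count of 0 simply means the code will run on the CPU.

```python
import tensorflow as tf

# List the GPU devices that TensorFlow can use in the current runtime.
gpus = tf.config.list_physical_devices("GPU")
print("TensorFlow version:", tf.__version__)
print("GPUs visible to TensorFlow:", len(gpus))
```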

The following code snippet prepares the IMDB dataset -

Example

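import tensorflow as tf

# Assumes raw_train_ds, raw_val_ds, raw_test_ds and vectorize_layer were
# created in the earlier steps of the tutorial credited below.
def vectorize_text(text, label):
  # Add a trailing axis so the layer receives one review per row.
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

# Take one batch from the training set and inspect its first review.
text_batch, label_batch = next(iter(raw_train_ds))
first_review, first_label = text_batch[0], label_batch[0]
print("Review is ", first_review)
print("Label is ", raw_train_ds.class_names[first_label])
print("Vectorized review is ", vectorize_text(first_review, first_label))

# Look up which words two of the integer indices stand for.
print("1222 ---> ",vectorize_layer.get_vocabulary()[1222])
print(" 451 ---> ",vectorize_layer.get_vocabulary()[451])
print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))

# Apply the vectorization to the training, validation, and test sets.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)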

Code credit - https://www.tensorflow.org/tutorials/keras/text_classification
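
The snippet relies on 'raw_train_ds' and 'vectorize_layer', which are created in earlier steps of the credited tutorial. The following is a minimal sketch of how such prerequisites might be set up. It assumes a tiny in-memory dataset (the two sample reviews and their labels are made up for illustration) in place of the real IMDB files, so it runs without downloading anything. Unlike a dataset loaded from a directory, this stand-in has no 'class_names' attribute.

```python
import re
import string
import tensorflow as tf

# Hypothetical stand-in for the IMDB training split: two tiny labeled reviews.
texts = ["A great movie.<br />Loved it!", "A dull, boring movie."]
labels = [1, 0]
raw_train_ds = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(2)

def custom_standardization(input_data):
    # Lowercase, replace HTML line breaks with spaces, then strip punctuation.
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), "")

# Map each review to a fixed-length sequence of integer word indices.
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=10000,
    output_mode="int",
    output_sequence_length=250)

# Build the vocabulary from the review text only (labels are dropped).
vectorize_layer.adapt(raw_train_ds.map(lambda text, label: text))

def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label

train_ds = raw_train_ds.map(vectorize_text)
vectors, batch_labels = next(iter(train_ds))
print("Batch shape:", vectors.shape)
print("First vocabulary entries:", vectorize_layer.get_vocabulary()[:5])
```

The settings 'max_tokens=10000' and 'output_sequence_length=250' are chosen to match the vocabulary size and the (1, 250) tensor shape reported in the output of this article.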

Output

Review is tf.Tensor(b'Silent Night, Deadly Night 5 is the very last of the series, and like part 4, it\'s unrelated to the first three except by title and the fact that it\'s a Christmas-themed horror flick.<br /><br />Except to the oblivious, there\'s some obvious things going on here...Mickey Rooney plays a toymaker named Joe Petto and his creepy son\'s name is Pino. Ring a bell, anyone? Now, a little boy named Derek heard a knock at the door one evening, and opened it to find a present on the doorstep for him. Even though it said "don\'t open till Christmas", he begins to open it anyway but is stopped by his dad, who scolds him and sends him to bed, and opens the gift himself. Inside is a little red ball that sprouts Santa arms and a head, and proceeds to kill dad. Oops, maybe he should have left well-enough alone. Of course Derek is then traumatized by the incident since he watched it from the stairs, but he doesn\'t grow up to be some killer Santa, he just stops talking.<br /><br />There\'s a mysterious stranger lurking around, who seems very interested in the toys that Joe Petto makes. We even see him buying a bunch when Derek\'s mom takes him to the store to find a gift for him to bring him out of his trauma. And what exactly is this guy doing? Well, we\'re not sure but he does seem to be taking these toys apart to see what makes them tick. He does keep his landlord from evicting him by promising him to pay him in cash the next day and presents him with a "Larry the Larvae" toy for his kid, but of course "Larry" is not a good toy and gets out of the box in the car and of course, well, things aren\'t pretty.<br /><br />Anyway, eventually what\'s going on with Joe Petto and Pino is of course revealed, and as with the old story, Pino is not a "real boy". Pino is probably even more agitated and naughty because he suffers from "Kenitalia" (a smooth plastic crotch) so that could account for his evil ways. 
And the identity of the lurking stranger is revealed too, and there\'s even kind of a happy ending of sorts. Whee.<br /><br />A step up from part 4, but not much of one. Again, Brian Yuzna is involved, and Screaming Mad George, so some decent special effects, but not enough to make this great. A few leftovers from part 4 are hanging around too, like Clint Howard and Neith Hunter, but that doesn\'t really make any difference. Anyway, I now have seeing the whole series out of my system. Now if I could get some of it out of my brain. 4 out of 5.', shape=(), dtype=string)
Label is neg
Vectorized review is (<tf.Tensor: shape=(1, 250), dtype=int64, numpy=
array([[1287, 313, 2380, 313, 661, 7, 2, 52, 229, 5, 2,
200, 3, 38, 170, 669, 29, 5492, 6, 2, 83, 297,
549, 32, 410, 3, 2, 186, 12, 29, 4, 1, 191,
510, 549, 6, 2, 8229, 212, 46, 576, 175, 168, 20,
1, 5361, 290, 4, 1, 761, 969, 1, 3, 24, 935,
2271, 393, 7, 1, 1675, 4, 3747, 250, 148, 4, 112,
436, 761, 3529, 548, 4, 3633, 31, 2, 1331, 28, 2096,
3, 2912, 9, 6, 163, 4, 1006, 20, 2, 1, 15,
85, 53, 147, 9, 292, 89, 959, 2314, 984, 27, 762,
6, 959, 9, 564, 18, 7, 2140, 32, 24, 1254, 36,
1, 85, 3, 3298, 85, 6, 1410, 3, 1936, 2, 3408,
301, 965, 7, 4, 112, 740, 1977, 12, 1, 2014, 2772,
3, 4, 428, 3, 5177, 6, 512, 1254, 1, 278, 27,
139, 25, 308, 1, 579, 5, 259, 3529, 7, 92, 8981,
32, 2, 3842, 230, 27, 289, 9, 35, 2, 5712, 18,
27, 144, 2166, 56, 6, 26, 46, 466, 2014, 27, 40,
2745, 657, 212, 4, 1376, 3002, 7080, 183, 36, 180, 52,
920, 8, 2, 4028, 12, 969, 1, 158, 71, 53, 67,
85, 2754, 4, 734, 51, 1, 1611, 294, 85, 6, 2,
1164, 6, 163, 4, 3408, 15, 85, 6, 717, 85, 44,
5, 24, 7158, 3, 48, 604, 7, 11, 225, 384, 73,
65, 21, 242, 18, 27, 120, 295, 6, 26, 667, 129,
4028, 948, 6, 67, 48, 158, 93, 1]])>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
1222 ---> stick
451 ---> already
Vocabulary size: 10000

Explanation

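  • A function named 'vectorize_text' is defined; it converts the given text into numbers so that the model can process it.

  • The IMDB dataset is used to train the model.

  • A sample review, its label, and its vectorized form are displayed on the console.

  • The training, validation, and test data are all vectorized.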
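
To illustrate what "converting text into numbers" means, here is a simplified pure-Python sketch of integer vectorization. It is not TensorFlow's implementation: the function names are hypothetical and ties in word frequency are broken alphabetically here, which TensorFlow may do differently. It follows the same conventions as the output shown earlier, where index 0 is reserved for padding and index 1 marks words outside the vocabulary, which is why the vectorized review contains many 1s.

```python
import re
import string
from collections import Counter

PAD, UNK = 0, 1  # index 0 pads short reviews; index 1 marks unknown words

def standardize(text):
    # Lowercase, replace HTML line breaks with spaces, strip punctuation.
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def build_vocab(texts, max_tokens):
    # More frequent words get smaller indices, after the two reserved slots.
    counts = Counter(word for t in texts for word in standardize(t).split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [w for w, _ in ranked[:max_tokens - 2]]
    return {w: i + 2 for i, w in enumerate(words)}

def vectorize(text, vocab, sequence_length):
    # Look up each word, truncate long reviews, and pad short ones.
    ids = [vocab.get(w, UNK) for w in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [PAD] * (sequence_length - len(ids))

vocab = build_vocab(["the movie was great", "the movie was dull"], max_tokens=10)
result = vectorize("The movie was surprisingly great!", vocab, sequence_length=8)
print(result)  # [3, 2, 4, 1, 6, 0, 0, 0]
```

In this example "surprisingly" is not in the vocabulary, so it maps to 1, and the three trailing zeros pad the sequence to the requested length of 8.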