Python을 사용하여 stackoverflow 질문 데이터 세트와 관련된 텍스트 데이터를 벡터화하는 데 Tensorflow를 어떻게 사용할 수 있습니까?

<시간/>

Tensorflow는 Google에서 제공하는 기계 학습 프레임워크입니다. 알고리즘, 딥 러닝 애플리케이션 등을 구현하기 위해 Python과 함께 사용되는 오픈 소스 프레임워크입니다. 연구 및 생산 목적으로 사용됩니다. 복잡한 수학 연산을 빠르게 수행하는 데 도움이 되는 최적화 기술이 있습니다.

NumPy와 다차원 배열을 사용하기 때문입니다. 이러한 다차원 배열은 '텐서'라고도 합니다. 이 프레임워크는 심층 신경망 작업을 지원합니다. 확장성이 뛰어나고 많은 인기 있는 데이터 세트와 함께 제공됩니다. GPU 계산을 사용하고 리소스 관리를 자동화합니다. 수많은 기계 학습 라이브러리와 함께 제공되며 잘 지원되고 문서화되어 있습니다. 이 프레임워크는 심층 신경망 모델을 실행하고 훈련하며 각 데이터 세트의 관련 특성을 예측하는 애플리케이션을 생성하는 기능을 가지고 있습니다.

'tensorflow' 패키지는 아래 코드 줄을 사용하여 Windows에 설치할 수 있습니다 -

pip install tensorflow

Tensor는 TensorFlow에서 사용되는 데이터 구조입니다. 흐름도에서 가장자리를 연결하는 데 도움이 됩니다. 이 흐름도를 '데이터 흐름 그래프'라고 합니다. 텐서는 다차원 배열 또는 목록에 불과합니다.

Google Colaboratory를 사용하여 아래 코드를 실행하고 있습니다. Google Colab 또는 Colaboratory는 브라우저를 통해 Python 코드를 실행하는 데 도움이 되며 구성이 필요 없고 GPU(그래픽 처리 장치)에 대한 무료 액세스가 필요합니다. Colaboratory는 Jupyter Notebook을 기반으로 구축되었습니다.

예시

다음은 텍스트 데이터를 벡터화하는 코드 스니펫입니다. -

print("The vectorize function is defined")
def int_vectorize_text(text, label):
   text = tf.expand_dims(text, -1)
   return int_vectorize_layer(text), label
print(" A batch of the dataset is retrieved")
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question is : ", first_question)
print("Label is : ", first_label)

print("'binary' vectorized question is :",
   binary_vectorize_text(first_question, first_label)[0])
print("'int' vectorized question is :",

   int_vectorize_text(first_question, first_label)[0])

코드 크레딧 - https://www.tensorflow.org/tutorials/load_data/text

출력

The vectorize function is defined
A batch of the dataset is retrieved
Question is : tf.Tensor(b'"function expected error in blank for dynamically created check box
when it is clicked i want to grab the attribute value.it is working in ie 8,9,10 but not working in ie
11,chrome shows function expected error..<input type=checkbox checked=\'checked\'
id=\'symptomfailurecodeid\' tabindex=\'54\' style=\'cursor:pointer;\' onclick=chkclickevt(this);
failurecodeid=""1"" >...function chkclickevt(obj) { .
alert(obj.attributes(""failurecodeid""));.}"\n', shape=(), dtype=string)
Label is : tf.Tensor(2, shape=(), dtype=int32)
'binary' vectorized question is : tf.Tensor([[1. 1. 1. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)
'int' vectorized question is : tf.Tensor(
[[ 37 464 65 7  16 12 879 262 181 448 44 10 6  700
   3  46  4 2085 2 473 1   6  156  7  478 1 25 20
  156 7  478 1  499 37 464 1 1846 1666 1  1  1  1
   1  1   1  1    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0 0 0
   0  0   0  0    0 0    0 0    0    0 0  0]], shape=(1, 250), dtype=int64)

설명

바이너리 모드는 토큰의 존재를 나타내는 배열을 반환합니다.
int 모드에서는 모든 토큰이 정수로 바뀝니다.
이렇게 하면 순서가 유지됩니다.
벡터화 기능이 정의되어 있습니다.
데이터 샘플이 벡터화되고 벡터화의 'binary' 및 'int' 모드가 콘솔에 표시됩니다.
해당 특정 레이어에서 'get_vocabulary' 메서드를 사용하여 문자열을 조회할 수 있습니다.