Tensorflow에서 유니코드 문자열을 어떻게 표현하고 조작할 수 있습니까?

<시간/>

유니코드 문자열은 기본적으로 utf-8로 인코딩됩니다. 유니코드 문자열은 Tensorflow 모듈의 '상수' 메서드를 사용하여 UTF-8로 인코딩된 스칼라 값으로 나타낼 수 있습니다. 유니코드 문자열은 Tensorflow 모듈에 있는 'encode' 메소드를 사용하여 UTF-16으로 인코딩된 스칼라로 표현할 수 있습니다.

더 읽어보기:TensorFlow란 무엇이며 Keras가 TensorFlow와 함께 신경망을 생성하는 방법은 무엇입니까?

자연어를 처리하는 모델은 다른 문자 집합을 가진 다른 언어를 처리합니다. 유니코드는 거의 모든 언어의 문자를 나타내는 데 사용되는 표준 인코딩 시스템으로 간주됩니다. 모든 문자는 0에서 0x10FFFF 사이의 고유 정수 코드 포인트를 사용하여 인코딩됩니다. 유니코드 문자열은 0개 이상의 코드 값 시퀀스입니다.

파이썬을 사용하여 유니코드 문자열을 표현하는 방법과 이에 상응하는 유니코드를 사용하여 조작하는 방법을 이해합시다. 먼저, 표준 문자열 연산에 해당하는 유니코드를 사용하여 스크립트 감지를 기반으로 유니코드 문자열을 토큰으로 분리합니다.

Google Colaboratory를 사용하여 아래 코드를 실행하고 있습니다. Google Colab 또는 Colaboratory는 브라우저를 통해 Python 코드를 실행하는 데 도움이 되며 구성이 필요 없고 GPU(그래픽 처리 장치)에 대한 무료 액세스가 필요합니다. Colaboratory는 Jupyter Notebook을 기반으로 구축되었습니다.

import tensorflow as tf
print("A constant is defined")
tf.constant(u"Thanks 😊")
print("The shape of the tensor is")
tf.constant([u"You are", u"welcome!"]).shape
print("Unicode string represented as UTF-8 encoded scalar")
text_utf8 = tf.constant(u"语言处理")
print(text_utf8)
print("Unicode string represented as UTF-16 encoded scalar")
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
print(text_utf16be)
print("Unicode string represented as a vector of Unicode code points")
text_chars = tf.constant([ord(char) for char in u"语言处理"])
print(text_chars)

코드 크레딧:https://www.tensorflow.org/tutorials/load_data/unicode

출력

A constant is defined
The shape of the tensor is
Unicode string represented as UTF-8 encoded scalar
tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)
Unicode string represented as UTF-16 encoded scalar
tf.Tensor(b'\x8b\xed\x8a\x00Y\x04t\x06', shape=(), dtype=string)
Unicode string represented as a vector of Unicode code points
tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)

설명

TensorFlow tf.string은 기본 dtype입니다.
사용자가 바이트 문자열의 텐서를 만들 수 있습니다.
유니코드 문자열은 기본적으로 utf-8로 인코딩됩니다.
tf.string 텐서는 바이트 문자열이 원자 단위로 처리되기 때문에 길이가 다른 바이트 문자열을 보유할 수 있습니다.
문자열 길이는 텐서 차원에 포함되지 않습니다.
Python을 사용하여 문자열을 구성하면 v2와 v3 간에 유니코드 처리가 변경됩니다. v2에서 유니코드 문자열은 "u" 접두사로 표시됩니다.
v3에서 문자열은 기본적으로 유니코드로 인코딩됩니다.
TensorFlow에서 유니코드 문자열을 나타내는 두 가지 표준 방법이 있습니다.
문자열 스칼라:일련의 코드 포인트가 알려진 문자 인코딩으로 인코딩됩니다.
int32 벡터:모든 위치에 단일 코드 포인트가 포함된 메서드입니다.