How can TensorFlow and Python be used to split Unicode strings and get byte offsets?

A Unicode string can be split into characters with the 'unicode_split' function, and the byte offset of each character can be obtained with the 'unicode_decode_with_offsets' function. Both functions are in the 'tf.strings' module of TensorFlow.

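To begin, represent Unicode strings in Python and manipulate them with the Unicode equivalents of standard string operations. With those operations, Unicode strings can be separated into tokens based on script detection.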

The code below is run in Google Colaboratory. Google Colab, or Colaboratory, runs Python code in the browser, requires no configuration, and provides free access to GPUs (Graphical Processing Units). Colaboratory is built on top of Jupyter Notebook.

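import tensorflow as tf
# 'thanks' is not defined in the original snippet; this value is assumed from the credited TensorFlow tutorial.
thanks = u'Thanks 😊'.encode('UTF-8')
print("Split unicode strings")
# Split into one substring per character. The result is not printed here, so it does not appear in the output.
tf.strings.unicode_split(thanks, 'UTF-8').numpy()
# Decode to codepoints and also get the byte offset at which each character starts.
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')
print("Printing byte offset for characters")
for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
    print("At byte offset {}: codepoint {}".format(offset, codepoint))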

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

Split unicode strings
Printing byte offset for characters
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882

Explanation

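The tf.strings.unicode_split operation splits a Unicode string into substrings, one per character.
To align the character tensor produced by tf.strings.unicode_decode with the original string, the offset at which each character begins must be known.
The tf.strings.unicode_decode_with_offsets method is similar to unicode_decode, except that it also returns a second tensor containing the start offset of each character.
In the output above the offsets are 0, 4 and 8 because each of the three emoji occupies 4 bytes in UTF-8.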
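
The offsets can be cross-checked without TensorFlow. The following is a minimal pure-Python sketch, not the TensorFlow implementation: it walks a string character by character, encodes each one to UTF-8, and records where its bytes start. The helper name utf8_chars_with_offsets is hypothetical and chosen only for this illustration.

```python
def utf8_chars_with_offsets(text):
    """Return a list of (byte_offset, codepoint, utf8_bytes), one per character."""
    result = []
    offset = 0
    for ch in text:
        encoded = ch.encode("utf-8")
        result.append((offset, ord(ch), encoded))
        # The next character starts right after this character's bytes.
        offset += len(encoded)
    return result


text = "🎈🎉🎊"
raw = text.encode("utf-8")
for offset, codepoint, encoded in utf8_chars_with_offsets(text):
    # Each offset points at the first byte of that character in the encoded string.
    assert raw[offset:offset + len(encoded)] == encoded
    print("At byte offset {}: codepoint {}".format(offset, codepoint))
```

The third element of each tuple plays the role of the per-character substrings that unicode_split returns, while the first two correspond to the offsets and codepoints returned by unicode_decode_with_offsets.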