How can TensorFlow and Python be used to build a ragged tensor from a list of words?


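A RaggedTensor can be built from the start offsets of the words in the sentences. The code points of all characters are held as one flat list of values, and the word start offsets are passed as row_starts to tf.RaggedTensor.from_row_starts, which produces one row of code points per word. The result is then displayed on the console. Finally, the number of words in each sentence is computed by summing the per-character word-start flags.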
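
To make the idea of start offsets concrete, here is a minimal pure-Python sketch of how a flat list of values is partitioned into rows. It does not use TensorFlow; `split_by_row_starts` is a hypothetical helper written for illustration only, and the offsets are assumed rather than computed.

```python
def split_by_row_starts(values, row_starts):
    """Partition a flat list into rows that begin at the given offsets."""
    # Each row runs from its own start offset to the next row's start;
    # the last row runs to the end of the flat list.
    row_ends = list(row_starts[1:]) + [len(values)]
    return [values[start:end] for start, end in zip(row_starts, row_ends)]


# Code points of "Hello, there." stored as one flat list.
flat_codepoints = [ord(ch) for ch in "Hello, there."]
# Offsets at which each word starts (assumed here for illustration).
word_starts = [0, 5, 7, 12]

words = split_by_row_starts(flat_codepoints, word_starts)
print(words)
```

Each row holds the code points of one word, which is the same shape of result that the ragged tensor in this article holds for the first sentence.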


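We represent Unicode strings in Python and manipulate them with the Unicode equivalents of the standard string ops. The strings are first separated into tokens based on script detection: a new word begins wherever the script of a character differs from the script of the character before it.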
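
The script-based splitting can be sketched in plain Python as follows. This is only an approximation under stated assumptions: `rough_script` is a hypothetical stand-in that guesses the script from the first word of the Unicode character name and treats everything outside a small set as "COMMON". It is not how TensorFlow detects scripts.

```python
import unicodedata

# Hypothetical, very rough script lookup: the first word of the Unicode
# character name for a handful of scripts; everything else is "COMMON".
KNOWN_SCRIPTS = {"LATIN", "CJK", "HIRAGANA", "KATAKANA", "CYRILLIC", "GREEK"}


def rough_script(ch):
    first_word = unicodedata.name(ch, "UNKNOWN").split()[0]
    return first_word if first_word in KNOWN_SCRIPTS else "COMMON"


def script_word_starts(text):
    """Return the offsets where the script differs from the previous character."""
    scripts = [rough_script(ch) for ch in text]
    return [i for i, s in enumerate(scripts) if i == 0 or s != scripts[i - 1]]


def split_on_script(text):
    """Cut the text at every script boundary."""
    starts = script_word_starts(text)
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends)]


print(script_word_starts("Hello, there."))
print(split_on_script("世界こんにちは"))
```

Punctuation and spaces fall into the same "COMMON" bucket here, so ", " stays together as a single token.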

We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphical Processing Units). Colaboratory is built on top of Jupyter Notebook.

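import tensorflow as tf

# sentence_char_codepoint, word_starts and sentence_char_starts_word are
# assumed to be defined by the earlier steps of the tutorial credited below.
print("Get the code point of every character in every word")
# Split the flat code point values into one row per word.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)
print(word_char_codepoint)
print("Get the number of words in the specific sentence")
# Count the word-start flags in each sentence.
sentence_num_words = tf.reduce_sum(tf.cast(sentence_char_starts_word, tf.int64), axis=1)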

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

Get the code point of every character in every word
<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [116, 104, 101, 114, 101], [46], [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>
Get the number of words in the specific sentence
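
As a quick sanity check on the output, the rows of code points can be decoded back into text. The sketch below copies the rows from the printed RaggedTensor above into a plain Python list (no TensorFlow needed) and joins each row into a string.

```python
# Rows copied from the printed RaggedTensor in the output above.
word_char_codepoint = [
    [72, 101, 108, 108, 111], [44, 32], [116, 104, 101, 114, 101], [46],
    [19990, 30028], [12371, 12435, 12395, 12385, 12399],
]

# chr() maps a code point back to its character.
decoded_words = ["".join(chr(cp) for cp in row) for row in word_char_codepoint]
print(decoded_words)
```

The first four rows decode to the tokens of an English sentence and the last two to the tokens of a Japanese one.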

Explanation

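The code points of every character in every word are built into a RaggedTensor with one row per word.
The RaggedTensor is displayed on the console.
The number of words in each sentence is computed and stored in sentence_num_words. The snippet does not print this value, so only its label appears in the output.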
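
The word count in the last step amounts to summing boolean flags per sentence. A minimal pure-Python sketch, assuming the word start offsets shown in the comments (`start_flags` is a hypothetical helper, not a TensorFlow function):

```python
def start_flags(length, starts):
    """Return one boolean per character: True where a word starts."""
    return [i in starts for i in range(length)]


# One list of flags per sentence; the offsets are assumed for illustration.
sentence_char_starts_word = [
    start_flags(13, {0, 5, 7, 12}),  # "Hello, there."
    start_flags(7, {0, 2}),          # "世界こんにちは"
]

# True counts as 1, so summing a row gives that sentence's word count.
sentence_num_words = [sum(flags) for flags in sentence_char_starts_word]
print(sentence_num_words)
```

This mirrors the cast to an integer type followed by a sum along axis 1 in the TensorFlow snippet.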