How can I get the code points of every word in a sentence using TensorFlow and Python?


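To get the code points of every word in a sentence, first check, for each character in the sentence, whether it starts a word. Next, find the index at which each word starts in the flattened list of characters from all sentences. With those indices, the code points of every character in every word can be obtained using the method below.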
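The two steps above can be sketched in plain Python before turning to TensorFlow. This is only an illustration of the boundary logic, not the tutorial's code: the helper names `word_starts` and `split_by_starts` are hypothetical, and the script ids are hand-labeled for the example (25 for Latin, 0 for Common, 17 for Han) rather than looked up by a library.

```python
def word_starts(script_ids):
    """Return the indices where a new word begins.

    A word starts at index 0 and wherever the script id differs
    from the script id of the previous character.
    """
    starts = []
    for i, script in enumerate(script_ids):
        if i == 0 or script != script_ids[i - 1]:
            starts.append(i)
    return starts


def split_by_starts(values, starts):
    """Slice a flat list into rows that begin at the given start offsets."""
    ends = starts[1:] + [len(values)]
    return [values[s:e] for s, e in zip(starts, ends)]


text = "Hi, 世界"
codepoints = [ord(ch) for ch in text]
# Hand-labeled script ids for each character (assumption for this sketch).
script_ids = [25, 25, 0, 0, 17, 17]

starts = word_starts(script_ids)
words = split_by_starts(codepoints, starts)
print(starts)  # [0, 2, 4]
print(words)   # [[72, 105], [44, 32], [19990, 30028]]
```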

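Script identifiers help determine where word boundaries should be added. A word boundary is added at the start of each sentence and at every character whose script differs from that of the previous character. The resulting start offsets can then be used to build a RaggedTensor that holds the list of words from every sentence in the batch.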
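To make the role of the start offsets concrete, here is a minimal sketch of `tf.RaggedTensor.from_row_starts` on toy data. The values and offsets are made up for illustration; each offset marks where a new row (word) begins in the flat list of values.

```python
import tensorflow as tf

# Flat list of code points for the characters 'H', 'i', '!', '世', '界'.
values = tf.constant([72, 105, 33, 19990, 30028])
# Each entry is the index in `values` where a new row begins.
row_starts = tf.constant([0, 2, 3], dtype=tf.int64)

toy_words = tf.RaggedTensor.from_row_starts(values=values, row_starts=row_starts)
toy_word_list = toy_words.to_list()
print(toy_word_list)  # [[72, 105], [33], [19990, 30028]]
```

Rows may have different lengths, which is why a RaggedTensor is used here instead of a regular dense tensor.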


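Let us understand how to represent Unicode strings in Python and manipulate them with the Unicode equivalents of standard string operations. Here, those operations are used to split Unicode strings into tokens based on script detection.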
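As a small illustration of these Unicode operations, the sketch below decodes a string into code points, looks up the script of each character, and encodes the code points back into a string. The sample text is an assumption chosen for the example; `tf.strings.unicode_decode`, `tf.strings.unicode_script`, and `tf.strings.unicode_encode` are the operations used.

```python
import tensorflow as tf

sample = tf.constant("Hi世界")

# Decode the UTF-8 string into a vector of Unicode code points.
sample_codepoints = tf.strings.unicode_decode(sample, input_encoding="UTF-8")
print(sample_codepoints.numpy())  # [   72   105 19990 30028]

# Look up a script id for each code point (ids follow ICU's UScriptCode values).
sample_scripts = tf.strings.unicode_script(sample_codepoints)
print(sample_scripts.numpy())

# Encode the code points back into a UTF-8 string.
roundtrip = tf.strings.unicode_encode(sample_codepoints, output_encoding="UTF-8")
print(roundtrip.numpy().decode("utf-8"))  # Hi世界
```

The two Latin letters share one script id and the two Han characters share another, which is the signal used to place word boundaries.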

We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser, requires no configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.

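import tensorflow as tf

# Setup (assumed; the sentences are reconstructed from the code points in the
# output): decode each sentence and look up the script of every character.
sentence_texts = [u'Hello, there.', u'世界こんにちは']
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)

print("Check if sentence is the start of the word")
# True at the first character of each sentence and wherever the script changes.
sentence_char_starts_word = tf.concat(
   [tf.fill([sentence_char_script.nrows(), 1], True),
    tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
   axis=1)
print("Check if index of character starts from specific index of word in flattened list of characters from all sentences")
# Indices of the word starts in the flattened character list.
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)
print("Get the code point of every character in every word")
# Slice the flat code points into one row per word.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
   values=sentence_char_codepoint.values,
   row_starts=word_starts)
print(word_char_codepoint)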

Code credit: https://www.tensorflow.org/tutorials/load_data/unicode

Output

Check if sentence is the start of the word
Check if index of character starts from specific index of word in flattened list of characters from all sentences
tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)
Get the code point of every character in every word
<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [116, 104, 101, 114, 101], [46], [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>

Explanation

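Script identifiers help determine where word boundaries should be added.
A word boundary is added at the start of every sentence and at each character whose script differs from that of the previous character.
These start offsets are then used to build a RaggedTensor.
This RaggedTensor contains the list of words from all sentences in the batch, with one row of code points per word.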
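One way to sanity-check the word boundaries is to encode each row of code points back into text. The sketch below is self-contained and repeats the steps from the code above; the input sentences are an assumption inferred from the code points in the output, and the final `tf.strings.unicode_encode` step is an addition for verification rather than part of the original code.

```python
import tensorflow as tf

# Assumed input sentences, inferred from the code points in the output.
sentence_texts = ["Hello, there.", "世界こんにちは"]
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, "UTF-8")
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)

# A word starts at the beginning of a sentence or where the script changes.
starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
    axis=1)
word_starts = tf.squeeze(tf.where(starts_word.values), axis=1)

# One row of code points per word, across all sentences.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values, row_starts=word_starts)

# Encode each row back to a UTF-8 string to inspect the resulting words.
encoded_words = tf.strings.unicode_encode(word_char_codepoint, "UTF-8")
word_list = [w.decode("utf-8") for w in encoded_words.numpy()]
print(word_list)  # ['Hello', ', ', 'there', '.', '世界', 'こんにちは']
```

Note that punctuation and spaces form their own "words" because they belong to a different script category than the surrounding letters.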