LLM1.1 Tokenization
Everyone's first lesson in LLMs - Fine
BPE (Byte Pair Encoding)
Greedily merge the most frequent adjacent character or symbol pairs. https://chatgpt.com/share/6867a488-3e1c-8006-84bf-11336501ee4b
🔧 Suppose during training, the most frequent merges were:
- "u" + "n" → "un"
- "h" + "a" → "ha"
- "p" + "p" → "pp"
- "iness" is also a common suffix
👉 Tokenization result for "unhappiness" might be:
["un", "ha", "pp", "iness"]
✅ Efficient.
❌ Rigid: merges are chosen greedily and are fixed once learned; the merge decisions can’t easily be undone.
Steps
1. Start from the training data and a target vocabulary size.
   a. Build the base vocabulary (e.g. the 26 letters plus other symbols) and assign each entry an ID.
   b. Using the base vocabulary, split the training data into these minimal units.
   c. Count the frequencies of adjacent pairs, pick the most frequent pair, and merge it into a new token.
   d. Repeat until the target vocabulary size is reached or the most frequent remaining pair occurs only once.
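A minimal sketch of this training loop (the function name `train_bpe` and the toy corpus are made up for illustration, not taken from any particular library):

```python
from collections import Counter

def train_bpe(words, target_vocab_size):
    # Represent each training word as a tuple of characters (the minimal units).
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for w in corpus for ch in w}    # base vocabulary: single characters
    merges = []

    while len(vocab) < target_vocab_size:
        # Count how often each adjacent pair of symbols appears in the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (left, right), best_freq = pairs.most_common(1)[0]
        if best_freq <= 1:                      # stop when the best remaining pair occurs only once
            break
        merges.append((left, right))
        vocab.add(left + right)

        # Apply the chosen merge everywhere in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (left, right):
                    merged.append(left + right)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return vocab, merges

vocab, merges = train_bpe(["unhappiness", "unhappy", "happiness", "happy"] * 10, 30)
print(merges)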
BBPE (Byte-Level BPE)
Runs BPE merges over UTF-8 bytes instead of Unicode characters, so the base vocabulary is just the 256 possible byte values (used by GPT-2 / RoBERTa).

| Pros | Cons (especially compared to BPE) |
| --- | --- |
| No out-of-vocabulary tokens: any language, emoji, or typo can be encoded | Non-ASCII characters expand to several bytes, so sequences for non-Latin scripts get longer |
| Tiny, language-agnostic base vocabulary shared across scripts | Individual byte tokens are hard to interpret on their own |
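A tiny sketch of the byte-level view, just to show why the base vocabulary only needs 256 entries (BBPE then runs the same merge loop as BPE over these byte IDs):

```python
# Any string maps to a sequence of UTF-8 byte values in 0..255, so nothing is ever "unknown".
text = "unhappiness 😊"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)                     # 16 values: the emoji alone expands to 4 bytes
print(len(text), len(byte_ids))     # 13 characters vs 16 bytes: sequences get longer
```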
WordPiece
Merges are chosen based on language-model likelihood rather than raw frequency, balancing frequency against semantic meaning.
Suppose ["un", "##happi", "##ness"] gives the best language model likelihood (like predicting “unhappiness” well in context).
👉 Tokenization result:
["un", "##happi", "##ness"]
✅ More semantically aware.
❌ Slightly slower to train.
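A small sketch of the usual WordPiece-style merge criterion: instead of raw pair frequency, score each candidate pair by freq(ab) / (freq(a) × freq(b)), which favors pairs whose parts rarely appear apart. The helper name and toy corpus below are made up for illustration:

```python
from collections import Counter

def best_wordpiece_merge(corpus_symbols):
    """corpus_symbols: list of symbol sequences, e.g. [['u','n','h','a','p','p','y'], ...]"""
    unigrams, pairs = Counter(), Counter()
    for symbols in corpus_symbols:
        unigrams.update(symbols)
        pairs.update(zip(symbols, symbols[1:]))
    # A pair whose parts rarely occur outside the pair scores higher than a pair of two
    # individually very common symbols, even if the latter pair is more frequent overall.
    return max(pairs, key=lambda p: pairs[p] / (unigrams[p[0]] * unigrams[p[1]]))

print(best_wordpiece_merge([list("unhappy"), list("unhappiness"), list("happy")]))
```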
Unigram Language Model
Training goal: Start with a big vocabulary (like 100k subwords), then prune it down based on which combinations maximize the likelihood of the corpus.
- 🧪 Example:
"internationalization"
Let's say your learned vocabulary includes ["i", "inter", "nation", "national", "al", "iz", "iza", "ation", "tion", "zation", "ization"]
Unigram LM tries all valid segmentations like:
- ["inter", "nation", "al", "ization"]
- ["inter", "national", "ization"]
- ["inter", "national", "iz", "ation"]
- ["inter", "nation", "al", "iza", "tion"]
Then it computes each segmentation's total likelihood from the learned subword probabilities and picks the highest-scoring one.
So the result might be: ["inter", "national", "ization"]
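A sketch of that scoring step; the vocabulary probabilities below are made-up numbers, just to make the example runnable:

```python
import math

# Hypothetical learned log-probabilities for the subwords listed above.
vocab_logp = {
    "i": math.log(0.05), "inter": math.log(0.02), "nation": math.log(0.015),
    "national": math.log(0.01), "al": math.log(0.03), "iz": math.log(0.01),
    "iza": math.log(0.002), "ation": math.log(0.01), "tion": math.log(0.02),
    "zation": math.log(0.004), "ization": math.log(0.008),
}

def segmentations(word):
    # Recursively enumerate every way to split `word` into in-vocabulary subwords.
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab_logp:
            for rest in segmentations(word[i:]):
                yield [piece] + rest

word = "internationalization"
best = max(segmentations(word), key=lambda seg: sum(vocab_logp[p] for p in seg))
print(best)   # with these numbers: ['inter', 'national', 'ization']
```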
EM Algorithm (Expectation - Maximization)
Used to train the Unigram LM's subword probabilities. https://chatgpt.com/share/6867cb5b-5a80-8006-b509-591256ea3040
1. Initialization: start with a vocabulary of all possible subwords and assign them initial probabilities (uniform is a common choice).
2. E-Step (Expectation): compute the probability of each possible segmentation, e.g.
   $$ P(S_1) = P(\text{inter}) \cdot P(\text{national}) \cdot P(\text{ization}) $$
   $$ P(S_2) = \dots $$
   Then normalize them:
   $$ P(S_1 \mid \text{internationalization}) = \frac{P(S_1)}{P(S_1) + P(S_2) + P(S_3) + \dots} $$
   Now you know how likely each segmentation is.
3. M-Step (Maximization): for every subword in every segmentation, accumulate expected counts weighted by those posteriors:
   expected_count("inter") += P(S_1 | word)
   expected_count("tern") += P(S_2 | word)
   Then normalize:
   P(subword) = expected_count[subword] / total_expected_counts
4. Repeat steps 2 and 3 until convergence.
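Putting the E-step and M-step together in a minimal sketch; the toy words, candidate vocabulary, and probabilities are invented for illustration, and real implementations (e.g. SentencePiece) also prune low-probability pieces between iterations:

```python
from collections import defaultdict
from math import prod

def segmentations(word, vocab):
    # Enumerate every way to split `word` into pieces that are in the vocabulary.
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        if word[:i] in vocab:
            for rest in segmentations(word[i:], vocab):
                yield [word[:i]] + rest

def em_step(words, probs):
    expected = defaultdict(float)
    for word in words:
        segs = list(segmentations(word, probs))
        seg_probs = [prod(probs[p] for p in seg) for seg in segs]   # P(S_i)
        total = sum(seg_probs)
        for seg, sp in zip(segs, seg_probs):
            weight = sp / total                                     # E-step: P(S_i | word)
            for piece in seg:
                expected[piece] += weight                           # accumulate expected counts
    z = sum(expected.values())
    return {piece: c / z for piece, c in expected.items()}          # M-step: normalize

# Toy setup: uniform initialization over a hand-picked candidate vocabulary.
words = ["internationalization", "nationalization", "international"]
candidates = ["inter", "nation", "national", "international", "al", "iz", "ization",
              "ation", "a", "e", "i", "l", "n", "o", "r", "t", "z"]
probs = {p: 1 / len(candidates) for p in candidates}
for _ in range(5):
    probs = em_step(words, probs)
print(sorted(probs, key=probs.get, reverse=True)[:5])   # most probable subwords so far
```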
Viterbi Algorithm
A dynamic-programming algorithm for finding the best path. In the Unigram LM it is used at inference time to find the best segmentation (path).
TODO: to be expanded.
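Until that's filled in, here is a hedged sketch of how Viterbi decoding picks the best segmentation under a unigram model: dynamic programming over prefixes of the word instead of enumerating every path. The log-probabilities are made up to match the example above:

```python
import math

def viterbi_segment(word, logp):
    # best[i] = (best total log-prob of word[:i], start index of the last piece used)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start][0] + logp[piece] > best[end][0]:
                best[end] = (best[start][0] + logp[piece], start)
    if best[-1][0] == -math.inf:
        return None                      # the word cannot be segmented with this vocabulary
    # Backtrack from the end of the word to recover the winning pieces.
    pieces, end = [], len(word)
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return list(reversed(pieces))

# Made-up log-probabilities for illustration.
logp = {w: math.log(p) for w, p in
        {"inter": 0.02, "nation": 0.015, "national": 0.01, "al": 0.03, "ization": 0.008}.items()}
print(viterbi_segment("internationalization", logp))   # → ['inter', 'national', 'ization']
```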
🤖 Model
Google-bert
"BERT" -> Bidirectional Encoder Representations from Transformers. -> It excels at understanding the context of the words in text by examining the words that come before and after it. https://huggingface.co/google-bert/bert-base-uncased
Bert-base-uncased
- No difference between English and english.
Pre-Trained
- MLM (Masked Language Modelling): during training, some percentage of the input tokens are randomly masked (hidden), and the model's objective is to predict these masked tokens based on the surrounding context. This is what allows BERT to learn deep bidirectional representations of language.
- NSP (Next Sentence Prediction): the model is given two sentences and must predict whether the second sentence is the one that actually follows the first in the original text. This helps the model understand sentence relationships.
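A quick way to poke at the MLM objective using the pretrained checkpoint above (assumes the `transformers` library and a backend such as PyTorch are installed; the exact predictions and scores will vary):

```python
from transformers import pipeline

# The fill-mask pipeline loads bert-base-uncased together with its pretrained MLM head.
unmasker = pipeline("fill-mask", model="google-bert/bert-base-uncased")

# The model predicts the [MASK] token from both the left and the right context.
for pred in unmasker("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```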