
The primary function of natural language processing (NLP) is to enable computers to understand, interpret, and generate human languages, facilitating seamless interaction between humans and machines [4]. NLP is used in various applications, such as machine translation, sentiment analysis, and intelligent conversational agents, making it a crucial technology in the field of artificial intelligence.

Traditional tokenization methods, such as Byte Pair Encoding (BPE) and Unigram, have limitations including substantial computational requirements, large and inefficient vocabularies, and difficulty handling underrepresented languages [2]. These methods often produce many near-duplicate tokens and struggle with languages that have complex structure and morphology.

T-FREE is a tokenizer-free method that embeds words directly through sparse activation patterns over character triplets, eliminating the need for traditional subword tokens. In contrast, BPE builds its vocabulary by repeatedly merging the most frequent adjacent token pairs, and Unigram starts from a large candidate vocabulary and iteratively removes the least influential tokens. T-FREE reduces vocabulary size and addresses the inefficiencies of traditional tokenizers, while maintaining near-optimal performance across languages without requiring a pre-trained vocabulary [4].
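To make "sparse activation patterns over character triplets" concrete, here is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not the T-FREE reference implementation: the function names `trigrams` and `active_indices`, the table size of 8000, the two hashes per trigram, and the use of SHA-256 are all hypothetical choices made for this example.

```python
import hashlib


def trigrams(word: str) -> list[str]:
    """Split a word into overlapping character triplets.

    The word is padded with one space on each side so that its first and
    last characters also appear in a triplet (an assumption of this sketch).
    """
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def active_indices(word: str, table_size: int = 8000,
                   hashes_per_trigram: int = 2) -> set[int]:
    """Map a word to a sparse set of active row indices.

    Each triplet is hashed `hashes_per_trigram` times into the range
    [0, table_size). A model could then embed the word by summing the
    embedding rows at these indices. All parameter values are illustrative.
    """
    indices: set[int] = set()
    for tri in trigrams(word):
        for seed in range(hashes_per_trigram):
            digest = hashlib.sha256(f"{seed}:{tri}".encode("utf-8")).digest()
            indices.add(int.from_bytes(digest[:8], "big") % table_size)
    return indices


if __name__ == "__main__":
    print(trigrams("hello"))
    shared = active_indices("hello") & active_indices("hellos")
    print(f"indices shared by 'hello' and 'hellos': {len(shared)}")
```

Because "hello" and "hellos" share most of their triplets, they also share most of their active indices. This is how a scheme of this kind can relate morphologically similar words without learning a subword vocabulary from a corpus.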