NewsLab
Apr 29 00:41 UTC

Show HN: 1gbps Tokenizer written in Assembly. 20x faster than HuggingFace (github.com)

3 points | by dogmaticdev | 2 comments

Comments (2)

  1. dogmaticdev
    I wrote this tokenizer using SSE2 SIMD instructions. It takes text, removes whitespace, and separates the resulting strings with null terminators.

    I didn't bother making it multithreaded, since it is already very fast. Maybe I will one day.

    Stats: 10448 bytes in 11302 nanoseconds: 10448 ÷ 0.000011302 = 923620933.5 bytes/s, or 923 MB/s.

    31346 bytes in 32241 nanoseconds: 31346 ÷ 0.000032241 = 972240315.1 bytes/s, or 972 MB/s.

    As you can see, it approaches 1 byte per nanosecond as more text is parsed.
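
For anyone wanting to reproduce the arithmetic in the stats above, here is a minimal sketch of the throughput conversion. The function name `bytes_per_second` is made up for illustration and is not part of the linked project. With the byte counts and timings quoted above, it gives roughly 0.92 and 0.97 bytes per nanosecond.

```c
#include <stdint.h>

/* Hypothetical helper: throughput in bytes per second for `bytes`
 * processed in `nanos` nanoseconds (1 second = 1e9 nanoseconds). */
double bytes_per_second(uint64_t bytes, uint64_t nanos)
{
    return (double)bytes * 1e9 / (double)nanos;
}
```

Dividing the result by 1e6 gives MB/s in the decimal sense used in the comment.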

  2. aetherspawn
    You should have a go at writing it with SSE intrinsics. You might find that letting the compiler's optimiser have a crack at it makes it even faster. Or at least it will be easier to call.
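
To make the intrinsics suggestion concrete, here is a rough sketch in C of what a tokenizer along the lines described in the thread might look like with SSE2 intrinsics. It is not the linked project's code. The names `whitespace_mask16` and `tokenize`, the choice of space, tab, LF and CR as whitespace, and the output layout are all assumptions for illustration. Only the classification step is vectorised; chunks containing whitespace are compacted with a plain scalar loop, so it should not be expected to match hand-tuned assembly.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Returns a 16-bit mask: bit i is set when chunk[i] is a space, tab,
 * LF or CR (an assumed definition of whitespace). Reads 16 bytes. */
static int whitespace_mask16(const char *chunk)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)chunk);
    __m128i ws = _mm_cmpeq_epi8(v, _mm_set1_epi8(' '));
    ws = _mm_or_si128(ws, _mm_cmpeq_epi8(v, _mm_set1_epi8('\t')));
    ws = _mm_or_si128(ws, _mm_cmpeq_epi8(v, _mm_set1_epi8('\n')));
    ws = _mm_or_si128(ws, _mm_cmpeq_epi8(v, _mm_set1_epi8('\r')));
    return _mm_movemask_epi8(ws);
}

/* Hypothetical tokenizer: copies src to dst, dropping whitespace and
 * writing one '\0' after each token. dst must hold at least len + 1
 * bytes. Returns the number of bytes written to dst. */
size_t tokenize(const char *src, size_t len, char *dst)
{
    size_t out = 0, i = 0;
    int in_token = 0;

    /* Process full 16-byte chunks using the SIMD whitespace mask. */
    for (; i + 16 <= len; i += 16) {
        int mask = whitespace_mask16(src + i);
        if (mask == 0) {
            /* Fast path: no whitespace, copy the whole chunk. */
            memcpy(dst + out, src + i, 16);
            out += 16;
            in_token = 1;
            continue;
        }
        for (int j = 0; j < 16; j++) {
            if (mask & (1 << j)) {
                if (in_token) { dst[out++] = '\0'; in_token = 0; }
            } else {
                dst[out++] = src[i + j];
                in_token = 1;
            }
        }
    }

    /* Scalar tail for the remaining 0 to 15 bytes. */
    for (; i < len; i++) {
        char c = src[i];
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
            if (in_token) { dst[out++] = '\0'; in_token = 0; }
        } else {
            dst[out++] = c;
            in_token = 1;
        }
    }

    /* Terminate the last token if the input did not end in whitespace. */
    if (in_token) dst[out++] = '\0';
    return out;
}
```

Because this is an ordinary C function, it can be called from any C or C++ program built with SSE2 enabled (the default on x86-64), which is the "easier to call" point made above.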