Hacker News
Show HN: Tokenkit – Convert LLMs to new tokenizers (incl byte-level Llama/Gemma)
Hi HN!
I’ve been working on transferring LLMs across tokenizers using a new method called ALM (details in our paper [1]). It distills a model trained with one tokenizer into a version using another, enabling things like converting subword models into byte-level ones much more effectively than what was possible before.
To make this easy to use, I released tokenkit, a library implementing ALM and other tokenizer transfer methods: https://github.com/bminixhofer/tokenkit.
As a demo, I used ALM to create two byte-level instruction-tuned models:
- https://huggingface.co/benjamin/Gemma2-2B-IT-Byte
- https://huggingface.co/benjamin/Llama3-2-3B-IT-Byte
Even though the distillation phase is very short (just 1.2B bytes ≈ 330M subword tokens), the models perform competitively (for example 57.0% MMLU of the byte-level Llama vs. 62.4% MMLU of the original Llama3-3B-Instruct).
This approach opens up an interesting direction: we can potentially keep subword tokenization for pretraining (to still squeeze as much text into the model in as little time as possible), but then change to a more user-friendly tokenization afterwards.
These models aren’t yet optimized for efficiency, but if you added self-speculative decoding plus a BLT/DTP-style hierarchical architecture and/or linearized attention, they might also be able to replace subword-based models when speed matters.
If you're interested in training your own models, the guide on tokenizer transfer via tokenkit [2] should make it easy. The model cards of the transfers above also contain the exact command used to train them. I’ve been training on fairly limited hardware, so effective transfer is possible even in a (near) consumer-grade setup.
[1]: https://arxiv.org/abs/2503.20083
[2]: https://github.com/bminixhofer/tokenkit/blob/main/docs/token...
Comments URL: https://news.ycombinator.com/item?id=43772015
Points: 1
# Comments: 0
Berlin DeepTech Landscape
Article URL: https://startup-map.berlin/lists/53190
Comments URL: https://news.ycombinator.com/item?id=43772012
Points: 3
# Comments: 0
A Biofilm's Shape Emerges from Cellular Geometry
Article URL: https://www.quantamagazine.org/how-a-biofilms-strange-shape-emerges-from-cellular-geometry-20250421/
Comments URL: https://news.ycombinator.com/item?id=43772000
Points: 1
# Comments: 0
Tesla profits drop 71% on weak sales and anti-Elon Musk sentiment
Article URL: https://techcrunch.com/2025/04/22/tesla-profits-drop-71-on-weak-sales-and-anti-elon-musk-sentiment/
Comments URL: https://news.ycombinator.com/item?id=43771965
Points: 4
# Comments: 0
Benefits of Strawberries for the Skin: Beauty and Vitamins
Article URL: https://sabordocampo.com/beneficios-do-morango-para-a-pele/
Comments URL: https://news.ycombinator.com/item?id=43771962
Points: 1
# Comments: 1
Why Is Everyone in My Meeting?
Article URL: https://robert.winter.ink/why-is-everyone-in-my-meeting/
Comments URL: https://news.ycombinator.com/item?id=43771939
Points: 2
# Comments: 0
Show HN: Traditional Japanese colors in a browser extension
Hi HN. I made a Chrome extension[1] that generates color palettes using traditional Japanese colors[2] based on any color you input.
These colors have been a bit of a recent obsession of mine ever since, while out on a walk, I came across a company[3] that publishes a free color reference including traditional color sets for Japan, China, and France.
Since then, I’ve acquired several books on the topic, had conversations with people who have been interested in this for far longer than I have, and received a grant from Adobe[4] — which I recommend applying for if you’re working on something creative.
The colors were historically used in textiles, and different sources list different colors. Some include modern additions while others are more conservative.
I’ve been using the app daily for user interfaces and various creative projects. My hope is that you might get some use out of it too.
Here’s what the output looks like for Y Combinator’s orange:
https://drive.google.com/drive/folders/1VICRkO_7Z1B05KnnzF0c...
It’s nighttime for me so if you have any questions, I’ll be answering in the morning.
[1] https://irorgb.com/
[2] https://en.wikipedia.org/wiki/Traditional_colors_of_Japan
[3] https://www.dic-graphics.co.jp/en/products/dcguide/index.htm...
[4] https://developer.adobe.com/fund-for-design/
Comments URL: https://news.ycombinator.com/item?id=43771937
Points: 1
# Comments: 0
I couldn't sleep over bad grammar at work–so I built this Chrome extension
Article URL: https://chromewebstore.google.com/detail/typemate-fix-grammar-ever/ehgcmamfghkhiinhnhhhgjilpfokaokh
Comments URL: https://news.ycombinator.com/item?id=43771934
Points: 1
# Comments: 1
Three ways to cool Earth by pulling carbon from the sky
Article URL: https://www.nature.com/articles/d41586-025-01233-6
Comments URL: https://news.ycombinator.com/item?id=43771920
Points: 2
# Comments: 0
Hydrogen generation and organic synthesis explored through new database
Article URL: https://phys.org/news/2025-04-harnessing-photocatalysis-hydrogen-generation-synthesis.html
Comments URL: https://news.ycombinator.com/item?id=43771901
Points: 1
# Comments: 0
A Marketplace for Made in America CNC machining, PCB fabrication and 3D printing
Article URL: https://vendra-marketplace.io/
Comments URL: https://news.ycombinator.com/item?id=43771889
Points: 1
# Comments: 1
5th Circuit Lets AT&T Off the Hook for Major Location Data Privacy Violations
Crypto ally Paul Atkins sworn in as SEC Chair
Article URL: https://www.coindesk.com/policy/2025/04/22/crypto-ally-paul-atkins-sworn-in-to-replace-gary-gensler-atop-u-s-sec
Comments URL: https://news.ycombinator.com/item?id=43771416
Points: 1
# Comments: 0
Kworkflow tool talk from FOSDEM 2025
Article URL: https://melissawen.github.io/blog/2025/04/22/fosdem-2025-kernel-devroom-kworkflow-talk
Comments URL: https://news.ycombinator.com/item?id=43771413
Points: 1
# Comments: 0
Max is going to make you pay for people you share your password with
Article URL: https://www.theverge.com/news/653778/max-password-sharing-extra-member-add-on
Comments URL: https://news.ycombinator.com/item?id=43771400
Points: 1
# Comments: 0
Ask HN: What is going on with Cloudflare?
I had hoped that after the shitstorm Cloudflare received a few months ago over their sales tactics and their treatment of their sales team (the viral firing TikTok video, extortion attempts against customers big enough to buy enterprise plans), things would improve, but I don't see any significant changes and am increasingly concerned about the state of the company.
While the core engineering side still seems to be taking the platform to new heights and has found a very convincing AI and general computing strategy (e.g. the upcoming container platform), their billing platform has been faulty for 6 months now (https://www.cloudflarestatus.com/incidents/gnl2dp153485), and the Salesforce-based ticket system has been broken for weeks, sending most filed tickets into a black void.
Suspiciously, plan upgrades seem to be less affected than plan downgrades and cancellations. Is this a pure evil play of scale?
Comments URL: https://news.ycombinator.com/item?id=43771392
Points: 1
# Comments: 0
OpenAI tells judge it would buy Chrome from Google
Article URL: https://www.theverge.com/news/653882/openai-chrome-google-us-judge
Comments URL: https://news.ycombinator.com/item?id=43771388
Points: 2
# Comments: 0
Floating-Point Numbers in Residue Number Systems
Article URL: https://leetarxiv.substack.com/p/floating-point-numbers-in-residue
Comments URL: https://news.ycombinator.com/item?id=43771386
Points: 1
# Comments: 0
Agentic Integration Tests
Article URL: https://magnitude.run
Comments URL: https://news.ycombinator.com/item?id=43771385
Points: 1
# Comments: 0
Show HN: Convenient Containers – ergonomic generics in C
Hello Hacker News :)
I'd like to share my C generic data-structure library Convenient Containers (CC). The library's main advantages are summarized in the Rationale section of its README.[1] In short, using some novel techniques, the library is able to provide a range of fully typesafe data structures with a generic API agnostic to both container type and data types, without requiring the user to make any boilerplate pre-declarations for every container/data type combination. In other words, CC containers look and function much like containers in languages with native support for generics:

  #include <stdio.h>
  #include "cc.h"

  int main( void )
  {
    vec( int ) our_vec;
    init( &our_vec );
    push( &our_vec, 5 );
    printf( "%d\n", *get( &our_vec, 0 ) );
    cleanup( &our_vec );

    map( int, float ) our_map;
    init( &our_map );
    insert( &our_map, 5, 0.5f );
    printf( "%f\n", *get( &our_map, 5 ) );
    cleanup( &our_map );
  }

CC is the only C data-structure library that offers such ergonomics, to the best of my knowledge.
Of course, performance is also important. To this end, CC's hash tables (i.e. its maps and sets) have performed well in benchmarks both by me[2] and by others.[3] Its red-black trees (i.e. its ordered maps and sets) have also been benchmarked and proven to perform on par with their C++ Standard Library counterparts.[4]
Some of the techniques upon which CC relies are explained briefly in its FAQ[5] and more thoroughly in a series of Reddit comments that I made back when the library was first released.[6] I am working on a series of articles to describe these techniques more systematically, the first of which I published earlier.[7]
Thanks for reading!
[1] https://github.com/JacksonAllan/CC#rationale
[2] https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/
[3] https://gist.github.com/attractivechaos/6815764c213f38802227...
[4] https://github.com/JacksonAllan/CC/releases/tag/v1.3.0
[5] https://github.com/JacksonAllan/CC?tab=readme-ov-file#how-do...
[6] https://www.reddit.com/r/C_Programming/comments/zvubfb/comme...
[7] https://github.com/JacksonAllan/CC/blob/main/articles/Better...
Comments URL: https://news.ycombinator.com/item?id=43771383
Points: 1
# Comments: 0