Hacker News

Show HN: Tokenkit – Convert LLMs to new tokenizers (incl byte-level Llama/Gemma)

Hacker News - Wed, 04/23/2025 - 9:35am

Hi HN!

I’ve been working on transferring LLMs across tokenizers using a new method called ALM (details in our paper [1]). It distills a model trained with one tokenizer into a version using another, enabling things like converting subword models into byte-level ones much more effectively than what was possible before.

To make this easy to use, I released tokenkit, a library implementing ALM and other tokenizer transfer methods: https://github.com/bminixhofer/tokenkit.

As a demo, I used ALM to create two byte-level instruction-tuned models:

- https://huggingface.co/benjamin/Gemma2-2B-IT-Byte

- https://huggingface.co/benjamin/Llama3-2-3B-IT-Byte

Even though the distillation phase is very short (just 1.2B bytes ≈ 330M subword tokens), the models perform competitively (for example 57.0% MMLU of the byte-level Llama vs. 62.4% MMLU of the original Llama3-3B-Instruct).
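
As a rough illustration of what "byte-level" means here, and of the bytes-to-subword ratio implied by the figures above, here is a minimal Python sketch. It is not tokenkit's API: the function names are hypothetical, and it assumes the simplest possible scheme where each UTF-8 byte is its own token ID (the released models may add special tokens or offsets).

```python
# Illustrative sketch only; function names are hypothetical, not tokenkit's API.
# Assumption: one token ID per raw UTF-8 byte (IDs 0..255), no special tokens.

def byte_tokenize(text: str) -> list[int]:
    """Map text to one token ID per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    """Invert byte_tokenize by decoding the byte sequence as UTF-8."""
    return bytes(ids).decode("utf-8")


# Figures quoted in the post: 1.2B bytes is roughly 330M subword tokens.
DISTILL_BYTES = 1.2e9
DISTILL_SUBWORD_TOKENS = 330e6
bytes_per_subword_token = DISTILL_BYTES / DISTILL_SUBWORD_TOKENS

if __name__ == "__main__":
    ids = byte_tokenize("héllo")
    print(ids)  # "é" takes two bytes, so 5 characters become 6 tokens
    print(byte_detokenize(ids))
    print(f"{bytes_per_subword_token:.2f} bytes per subword token")
```

The ratio works out to about 3.6 bytes per subword token, which is why a byte-level model has to process sequences several times longer for the same text.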

This approach opens up an interesting direction: we can potentially keep subword tokenization for pretraining (to still squeeze as much text into the model in as little time as possible), but then change to a more user-friendly tokenization afterwards.

These models aren’t yet optimized for efficiency, but if you added self-speculative decoding plus a BLT/DTP-style hierarchical architecture and/or linearized attention, they might also be able to replace subword-based models when speed matters.

If you're interested in training your own models, the guide on tokenizer transfer via tokenkit [2] should make it easy. The model cards of the transfers above also contain the exact command used to train them. I’ve been training on fairly limited hardware, so effective transfer is possible even in a (near) consumer-grade setup.

[1]: https://arxiv.org/abs/2503.20083

[2]: https://github.com/bminixhofer/tokenkit/blob/main/docs/token...

Comments URL: https://news.ycombinator.com/item?id=43772015

Points: 1

# Comments: 0

Categories: Hacker News

Berlin DeepTech Landscape

Hacker News - Wed, 04/23/2025 - 9:35am
Categories: Hacker News

Show HN: Traditional Japanese colors in a browser extension

Hacker News - Wed, 04/23/2025 - 9:29am

Hi HN. I made a Chrome extension[1] that generates color palettes using traditional Japanese colors[2] based on any color you input.

These colors have been a bit of a recent obsession of mine, ever since, while on a walk, I came across a company[3] that publishes a free color reference including traditional sets for Japan, China, and France.

Since then, I’ve acquired several books on the topic, had conversations with people who have been interested in this for far longer than I have, and received a grant from Adobe[4] — which I recommend applying for if you’re working on something creative.

The colors were historically used in textiles and different sources list different colors. Some include modern additions while others are more conservative.

I’ve been using the app daily for user interfaces and various creative projects. My hope is that you might get some use out of it too.

Here’s what the output looks like for Y Combinator’s orange:

https://drive.google.com/drive/folders/1VICRkO_7Z1B05KnnzF0c...

It’s nighttime for me so if you have any questions, I’ll be answering in the morning.

[1] https://irorgb.com/

[2] https://en.wikipedia.org/wiki/Traditional_colors_of_Japan

[3] https://www.dic-graphics.co.jp/en/products/dcguide/index.htm...

[4] https://developer.adobe.com/fund-for-design/

Comments URL: https://news.ycombinator.com/item?id=43771937

Points: 1

# Comments: 0

Categories: Hacker News

Ask HN: What is going on with Cloudflare?

Hacker News - Wed, 04/23/2025 - 8:38am

I hoped that after the shitstorm Cloudflare received a few months ago over its sales tactics and its treatment of its sales team (the viral firing TikTok video, extortion attempts against customers big enough to buy enterprise plans), things would improve. But I don't see any significant changes and am increasingly concerned about the state of the company.

While the core engineering side still seems to be taking the platform to new heights and has found a very convincing AI and general computing strategy (e.g. the upcoming container platform), their billing platform has been faulty for six months now (https://www.cloudflarestatus.com/incidents/gnl2dp153485), and the Salesforce-based ticket system has been broken for weeks, sending most filed tickets into a black void.

Suspiciously, plan upgrades seem to be less affected than plan downgrades and cancellations. Is this a pure evil play at scale?

Comments URL: https://news.ycombinator.com/item?id=43771392

Points: 1

# Comments: 0

Categories: Hacker News

Agentic Integration Tests

Hacker News - Wed, 04/23/2025 - 8:37am

Article URL: https://magnitude.run

Comments URL: https://news.ycombinator.com/item?id=43771385

Points: 1

# Comments: 0

Categories: Hacker News

Show HN: Convenient Containers – ergonomic generics in C

Hacker News - Wed, 04/23/2025 - 8:37am

Hello Hacker News :)

I'd like to share my C generic data-structure library Convenient Containers (CC). The library's main advantages are summarized in the Rationale section of its README.[1] In short, using some novel techniques, the library is able to provide a range of fully typesafe data structures with a generic API agnostic to both container type and data types, without requiring the user to make any boilerplate pre-declarations for every container/data type combination. In other words, CC containers look and function much like containers in languages with native support for generics:

  #include <stdio.h>
  #include "cc.h"

  int main( void )
  {
    vec( int ) our_vec;
    init( &our_vec );
    push( &our_vec, 5 );
    printf( "%d\n", *get( &our_vec, 0 ) );
    cleanup( &our_vec );

    map( int, float ) our_map;
    init( &our_map );
    insert( &our_map, 5, 0.5f );
    printf( "%f\n", *get( &our_map, 5 ) );
    cleanup( &our_map );
  }

CC is the only C data-structure library that offers such ergonomics, to the best of my knowledge.

Of course, performance is also important. To this end, CC's hash tables (i.e. its maps and sets) have performed well in benchmarks both by me[2] and by others.[3] Its red-black trees (i.e. its ordered maps and sets) have also been benchmarked and proven to perform on par with their C++ Standard Library counterparts.[4]

Some of the techniques upon which CC relies are explained briefly in its FAQ[5] and more thoroughly in a series of Reddit comments that I made back when the library was first released.[6] I am working on a series of articles to describe these techniques more systematically, the first of which I published earlier.[7]

Thanks for reading!

[1] https://github.com/JacksonAllan/CC#rationale

[2] https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/

[3] https://gist.github.com/attractivechaos/6815764c213f38802227...

[4] https://github.com/JacksonAllan/CC/releases/tag/v1.3.0

[5] https://github.com/JacksonAllan/CC?tab=readme-ov-file#how-do...

[6] https://www.reddit.com/r/C_Programming/comments/zvubfb/comme...

[7] https://github.com/JacksonAllan/CC/blob/main/articles/Better...

Comments URL: https://news.ycombinator.com/item?id=43771383

Points: 1

# Comments: 0

Categories: Hacker News
