Hacker News

Show HN: Tokenkit – Convert LLMs to new tokenizers (incl byte-level Llama/Gemma)

Hacker News - Wed, 04/23/2025 - 9:35am

Hi HN!

I’ve been working on transferring LLMs across tokenizers using a new method called ALM (details in our paper [1]). It distills a model trained with one tokenizer into a version using another, enabling things like converting subword models into byte-level ones much more effectively than what was possible before.

To make this easy to use, I released tokenkit, a library implementing ALM and other tokenizer transfer methods: https://github.com/bminixhofer/tokenkit.

As a demo, I used ALM to create two byte-level instruction-tuned models:

- https://huggingface.co/benjamin/Gemma2-2B-IT-Byte

- https://huggingface.co/benjamin/Llama3-2-3B-IT-Byte

Even though the distillation phase is very short (just 1.2B bytes ≈ 330M subword tokens), the models perform competitively (for example 57.0% MMLU of the byte-level Llama vs. 62.4% MMLU of the original Llama3-3B-Instruct).

This approach opens up an interesting direction: we can potentially keep subword tokenization for pretraining (to still squeeze as much text into the model in as little time as possible), but then change to a more user-friendly tokenization afterwards.

These models aren’t yet optimized for efficiency, but if you added self-speculative decoding plus a BLT/DTP-style hierarchical architecture and/or linearized attention, they might also be able to replace subword-based models when speed matters.

If you're interested in training your own models, the guide on tokenizer transfer via tokenkit [2] should make it easy. The model cards of the transfers above also contain the exact command used to train them. I’ve been training on fairly limited hardware, so effective transfer is possible even in a (near) consumer-grade setup.

[1]: https://arxiv.org/abs/2503.20083

[2]: https://github.com/bminixhofer/tokenkit/blob/main/docs/token...

Comments URL: https://news.ycombinator.com/item?id=43772015

Points: 1

# Comments: 0

Categories: Hacker News

Berlin DeepTech Landscape

Hacker News - Wed, 04/23/2025 - 9:35am

Article URL: https://startup-map.berlin/lists/53190

Comments URL: https://news.ycombinator.com/item?id=43772012

Points: 3

# Comments: 0

Categories: Hacker News

A Biofilm's Shape Emerges from Cellular Geometry

Hacker News - Wed, 04/23/2025 - 9:34am

Article URL: https://www.quantamagazine.org/how-a-biofilms-strange-shape-emerges-from-cellular-geometry-20250421/

Comments URL: https://news.ycombinator.com/item?id=43772000

Points: 1

# Comments: 0

Categories: Hacker News

Tesla profits drop 71% on weak sales and anti-Elon Musk sentiment

Hacker News - Wed, 04/23/2025 - 9:32am

Article URL: https://techcrunch.com/2025/04/22/tesla-profits-drop-71-on-weak-sales-and-anti-elon-musk-sentiment/

Comments URL: https://news.ycombinator.com/item?id=43771965

Points: 4

# Comments: 0

Categories: Hacker News

Benefits of Strawberries for the Skin: Beauty and Vitamins

Hacker News - Wed, 04/23/2025 - 9:32am

Article URL: https://sabordocampo.com/beneficios-do-morango-para-a-pele/

Comments URL: https://news.ycombinator.com/item?id=43771962

Points: 1

# Comments: 1

Categories: Hacker News

Why Is Everyone in My Meeting?

Hacker News - Wed, 04/23/2025 - 9:29am

Article URL: https://robert.winter.ink/why-is-everyone-in-my-meeting/

Comments URL: https://news.ycombinator.com/item?id=43771939

Points: 2

# Comments: 0

Categories: Hacker News

Show HN: Traditional Japanese colors in a browser extension

Hacker News - Wed, 04/23/2025 - 9:29am

Hi HN. I made a Chrome extension[1] that generates color palettes using traditional Japanese colors[2] based on any color you input.

These colors have been a bit of a recent obsession of mine, ever since I came across, while on a walk, a company[3] that publishes a free color reference including traditional sets for Japan, China, and France.

Since then, I’ve acquired several books on the topic, had conversations with people who have been interested in this for far longer than I have, and received a grant from Adobe[4] — which I recommend applying for if you’re working on something creative.

The colors were historically used in textiles, and different sources list different colors. Some include modern additions while others are more conservative.

I’ve been using the app daily for user interfaces and various creative projects. My hope is that you might get some use out of it too.

Here’s what the output looks like for Y Combinator’s orange:

https://drive.google.com/drive/folders/1VICRkO_7Z1B05KnnzF0c...

It’s nighttime for me so if you have any questions, I’ll be answering in the morning.

[1] https://irorgb.com/

[2] https://en.wikipedia.org/wiki/Traditional_colors_of_Japan

[3] https://www.dic-graphics.co.jp/en/products/dcguide/index.htm...

[4] https://developer.adobe.com/fund-for-design/

Comments URL: https://news.ycombinator.com/item?id=43771937

Points: 1

# Comments: 0

Categories: Hacker News

I couldn't sleep over bad grammar at work–so I built this Chrome extension

Hacker News - Wed, 04/23/2025 - 9:29am

Article URL: https://chromewebstore.google.com/detail/typemate-fix-grammar-ever/ehgcmamfghkhiinhnhhhgjilpfokaokh

Comments URL: https://news.ycombinator.com/item?id=43771934

Points: 1

# Comments: 1

Categories: Hacker News

Three ways to cool Earth by pulling carbon from the sky

Hacker News - Wed, 04/23/2025 - 9:28am

Article URL: https://www.nature.com/articles/d41586-025-01233-6

Comments URL: https://news.ycombinator.com/item?id=43771920

Points: 2

# Comments: 0

Categories: Hacker News

Hydrogen generation and organic synthesis explored through new database

Hacker News - Wed, 04/23/2025 - 9:25am

Article URL: https://phys.org/news/2025-04-harnessing-photocatalysis-hydrogen-generation-synthesis.html

Comments URL: https://news.ycombinator.com/item?id=43771901

Points: 1

# Comments: 0

Categories: Hacker News

A Marketplace for Made in America CNC machining, PCB fabrication and 3D printing

Hacker News - Wed, 04/23/2025 - 9:25am

Article URL: https://vendra-marketplace.io/

Comments URL: https://news.ycombinator.com/item?id=43771889

Points: 1

# Comments: 1

Categories: Hacker News

5th Circuit Lets AT&T Off the Hook for Major Location Data Privacy Violations

Hacker News - Wed, 04/23/2025 - 9:23am

Article URL: https://www.techdirt.com/2025/04/23/5th-circuit-obediently-lets-att-off-the-hook-for-major-location-data-privacy-violations/

Comments URL: https://news.ycombinator.com/item?id=43771882

Points: 1

# Comments: 1

Categories: Hacker News

Crypto ally Paul Atkins sworn in as SEC Chair

Hacker News - Wed, 04/23/2025 - 8:41am

Article URL: https://www.coindesk.com/policy/2025/04/22/crypto-ally-paul-atkins-sworn-in-to-replace-gary-gensler-atop-u-s-sec

Comments URL: https://news.ycombinator.com/item?id=43771416

Points: 1

# Comments: 0

Categories: Hacker News

Kworkflow tool talk from FOSDEM 2025

Hacker News - Wed, 04/23/2025 - 8:41am

Article URL: https://melissawen.github.io/blog/2025/04/22/fosdem-2025-kernel-devroom-kworkflow-talk

Comments URL: https://news.ycombinator.com/item?id=43771413

Points: 1

# Comments: 0

Categories: Hacker News

Max is going to make you pay for people you share your password with

Hacker News - Wed, 04/23/2025 - 8:39am

Article URL: https://www.theverge.com/news/653778/max-password-sharing-extra-member-add-on

Comments URL: https://news.ycombinator.com/item?id=43771400

Points: 1

# Comments: 0

Categories: Hacker News

Ask HN: What is going on with cloudflare?

Hacker News - Wed, 04/23/2025 - 8:38am

I hoped that after the shitstorm Cloudflare received a few months ago over its sales tactics and its treatment of its sales team (the viral firing TikTok video, extortion attempts against customers big enough to buy enterprise plans), things would improve, but I don't see any significant changes and am increasingly concerned about the state of the company.

While the core engineering side still seems to be taking the platform to new heights and has found a very convincing AI and general computing strategy (e.g. the upcoming container platform), their billing platform has been faulty for 6 months now (https://www.cloudflarestatus.com/incidents/gnl2dp153485), and the Salesforce-based ticket system has been broken for weeks, sending most filed tickets into a black void.

Suspiciously, plan upgrades seem to be less affected than plan downgrades and cancellations. Is this a pure evil play of scale?

Comments URL: https://news.ycombinator.com/item?id=43771392

Points: 1

# Comments: 0

Categories: Hacker News

OpenAI tells judge it would buy Chrome from Google

Hacker News - Wed, 04/23/2025 - 8:38am

Article URL: https://www.theverge.com/news/653882/openai-chrome-google-us-judge

Comments URL: https://news.ycombinator.com/item?id=43771388

Points: 2

# Comments: 0

Categories: Hacker News

Floating-Point Numbers in Residue Number Systems

Hacker News - Wed, 04/23/2025 - 8:38am

Article URL: https://leetarxiv.substack.com/p/floating-point-numbers-in-residue

Comments URL: https://news.ycombinator.com/item?id=43771386

Points: 1

# Comments: 0

Categories: Hacker News

Agentic Integration Tests

Hacker News - Wed, 04/23/2025 - 8:37am

Article URL: https://magnitude.run

Comments URL: https://news.ycombinator.com/item?id=43771385

Points: 1

# Comments: 0

Categories: Hacker News

Show HN: Convenient Containers – ergonomic generics in C

Hacker News - Wed, 04/23/2025 - 8:37am

Hello Hacker News :)

I'd like to share my C generic data-structure library Convenient Containers (CC). The library's main advantages are summarized in the Rationale section of its README.[1] In short, using some novel techniques, the library is able to provide a range of fully typesafe data structures with a generic API agnostic to both container type and data types, without requiring the user to make any boilerplate pre-declarations for every container/data type combination. In other words, CC containers look and function much like containers in languages with native support for generics:

  #include <stdio.h>
  #include "cc.h"

  int main( void )
  {
    // A vector of ints.
    vec( int ) our_vec;
    init( &our_vec );
    push( &our_vec, 5 );
    printf( "%d\n", *get( &our_vec, 0 ) );
    cleanup( &our_vec );

    // A map from int keys to float values, using the same API names.
    map( int, float ) our_map;
    init( &our_map );
    insert( &our_map, 5, 0.5f );
    printf( "%f\n", *get( &our_map, 5 ) );
    cleanup( &our_map );
  }

CC is the only C data-structure library that offers such ergonomics, to the best of my knowledge.

Of course, performance is also important. To this end, CC's hash tables (i.e. its maps and sets) have performed well in benchmarks both by me[2] and by others.[3] Its red-black trees (i.e. its ordered maps and sets) have also been benchmarked and proven to perform on par with their C++ Standard Library counterparts.[4]

Some of the techniques upon which CC relies are explained briefly in its FAQ[5] and more thoroughly in a series of Reddit comments that I made back when the library was first released.[6] I am working on a series of articles to describe these techniques more systematically, the first of which I published earlier.[7]

Thanks for reading!

[1] https://github.com/JacksonAllan/CC#rationale

[2] https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/

[3] https://gist.github.com/attractivechaos/6815764c213f38802227...

[4] https://github.com/JacksonAllan/CC/releases/tag/v1.3.0

[5] https://github.com/JacksonAllan/CC?tab=readme-ov-file#how-do...

[6] https://www.reddit.com/r/C_Programming/comments/zvubfb/comme...

[7] https://github.com/JacksonAllan/CC/blob/main/articles/Better...

Comments URL: https://news.ycombinator.com/item?id=43771383

Points: 1

# Comments: 0

Categories: Hacker News



Welcome to Joe Pearce's Home Page.

Web page offered by Joe Pearce © 2004 - 2025 - All rights reserved.

Thanks to the ETSU Computer and Information Sciences Department.

Thanks to the NSTCC Computer and Information Sciences and Computer Engineering Technologies Department.

This is my Favicon.
