While much of the AI world strives to create ever larger language models like OpenAI's GPT-5 and Anthropic's Claude Sonnet 4.5, the Israeli artificial intelligence startup AI21 is going the other way.
AI21 has just introduced Jamba Reasoning 3B, a model with 3 billion parameters. This compact, open source model can handle a huge context window of 250,000 tokens (meaning it can "remember" and parse much more text than conventional language models) and can run at high speed, even on consumer devices. The launch highlights a growing shift: smaller, more efficient models could shape the future of artificial intelligence as much as scale does.
"We believe in a more decentralized future of AI, in which not everything will run in huge data centers," says Ori Goshen, co-CEO of AI21, in an interview with IEEE Spectrum. "Large models will still have a role, but small, powerful models running on devices will have a significant impact" on both the future and the economics of AI, he says. Jamba is designed for developers who want to build edge AI apps and systems that run efficiently on-device.
AI21's Jamba Reasoning 3B is designed to handle long sequences of text and complex tasks like math, coding, and reasoning, all at impressive speeds on everyday devices like laptops and mobile phones. Jamba Reasoning 3B can also run in a hybrid configuration, with simple jobs running locally on the device while larger problems are sent to more powerful cloud servers. According to AI21, this smarter routing could significantly reduce AI infrastructure costs for certain workloads, potentially by an order of magnitude.
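A minimal sketch of what such local-first routing might look like, assuming a crude length-and-keyword difficulty heuristic. The heuristic, threshold, and runtime calls below are illustrative assumptions, not AI21's actual routing logic:

```python
def estimate_difficulty(prompt: str) -> float:
    """Crude proxy: longer prompts and reasoning-heavy keywords score higher."""
    score = min(len(prompt) / 50_000, 1.0)
    if any(kw in prompt.lower() for kw in ("prove", "refactor", "derive")):
        score += 0.5
    return score

def run_local(prompt: str) -> str:
    # Placeholder: in practice, call a local runtime (e.g., an LM Studio
    # local server) hosting Jamba Reasoning 3B.
    return f"[on-device answer to: {prompt[:40]}...]"

def run_cloud(prompt: str) -> str:
    # Placeholder: in practice, POST to a hosted endpoint running a larger model.
    return f"[cloud answer to: {prompt[:40]}...]"

def route(prompt: str, threshold: float = 0.7) -> str:
    """Send easy requests to the on-device model, hard ones to the cloud."""
    if estimate_difficulty(prompt) < threshold:
        return run_local(prompt)
    return run_cloud(prompt)

print(route("What is the capital of France?"))  # stays on-device
print(route("Refactor this codebase: " + "x" * 60_000))  # goes to cloud
```

The cost savings AI21 describes come from the fact that most everyday requests fall below the threshold and never touch a paid cloud GPU.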
Small but powerful LLM
With 3 billion parameters, Jamba Reasoning 3B is tiny by today's standards. Models such as GPT-5 or Claude have over 100 billion parameters, and even smaller models such as Llama 3 (8B) or Mistral (7B) are more than twice the size of AI21's model, Goshen notes.
This compact size makes it even more remarkable that the AI21 model can handle a context window of 250,000 tokens on consumer devices. Some proprietary models, such as GPT-5, offer even longer context windows, but Jamba sets a new high-water mark among open source models. The previous open-model record of 128,000 tokens was shared by Meta's Llama 3.2 (3B), Microsoft's Phi-4 Mini, and DeepSeek R1, some of which are much larger models. Jamba Reasoning 3B can process over 17 tokens per second even when running at full capacity, that is, with extremely long inputs that fill the entire 250,000-token context window. Many other models slow down or struggle when their input length exceeds 100,000 tokens.
Goshen explains that the model is built on an architecture called Jamba, which combines two types of neural network designs: transformer layers, familiar from other large language models, and Mamba layers, which are designed to use memory more efficiently. This hybrid design allows the model to process long documents, large codebases, and other large inputs directly on a laptop or phone, using about one-tenth the memory of traditional transformers. Goshen says the model is much faster than traditional transformers because it relies less on a memory component called the KV cache, which can slow down processing as the length of the input grows.
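To see why the KV cache matters at this scale, here is a back-of-the-envelope sketch. The layer counts and head dimensions are illustrative assumptions, not Jamba's published configuration; the point is that attention layers store key and value vectors for every token, so their memory grows linearly with context, while Mamba layers keep a fixed-size state:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    # Each attention layer stores one key and one value vector per token
    # (the factor of 2), in 16-bit precision by default.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

seq_len = 250_000  # Jamba Reasoning 3B's full context window

# Hypothetical pure transformer: every one of 32 layers carries a KV cache.
full = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=seq_len)

# Jamba-style hybrid: suppose only 4 of the 32 layers use attention;
# the Mamba layers' state stays constant regardless of input length.
hybrid = kv_cache_bytes(layers=4, kv_heads=8, head_dim=128, seq_len=seq_len)

print(f"pure transformer KV cache: {full / 1e9:.1f} GB")   # ~32.8 GB
print(f"hybrid (4 attention layers): {hybrid / 1e9:.1f} GB")  # ~4.1 GB
```

Under these assumed dimensions, the hybrid needs roughly an eighth of the cache memory at full context, which is in the same ballpark as the "one-tenth the memory" figure Goshen cites.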
Why are small LLMs needed?
The model's hybrid architecture gives it an advantage in both speed and memory efficiency, even with very long inputs, confirms an LLM software engineer, who wished to remain anonymous because he is not authorized to comment on other companies' models. As more users run generative AI locally on laptops, models must process long contexts quickly without consuming too much memory. According to the engineer, Jamba's 3 billion parameters meet these requirements, making it a model optimized for on-device use.
Jamba Reasoning 3B is open source under the permissive Apache 2.0 license and is available on popular platforms such as Hugging Face and LM Studio. The release also includes instructions for fine-tuning the model using an open-source reinforcement learning framework called Experience, which makes it easier for developers to adapt the model to their own tasks.
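For developers who want to try the model, a minimal sketch of loading it with Hugging Face's transformers library follows. The repo ID below is an assumption based on AI21's naming conventions; check the Hugging Face hub for the actual identifier:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo ID; verify on huggingface.co before use.
model_id = "ai21labs/AI21-Jamba-Reasoning-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the weights on GPU, Apple silicon, or CPU
# as available (requires the accelerate package).
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize the trade-offs between Mamba and attention layers."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))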
"Jamba Reasoning 3B marks the beginning of a family of small, powerful reasoning models," Goshen said. "Going small enables decentralization, personalization, and cost efficiency. Instead of relying on GPUs in data centers, individuals and businesses can run their own models on devices. This opens up new economics and greater accessibility."