Small Language Models (SLMs)

January 07, 2025

Profile picture

why

Efficient to deploy, fast, and actually really good!

For people who can actually architect systems of agents, it's striking me more and more as a better alternative to calling the heavy-weights when you can get more targeted, faster, context-aware predictions with smaller and specialist models. My latest explorations with DeepSeek Coder 8B and Phi 3 left me with a really good impression of the current state of the short kings.

Some reasons I'm excited.

  1. modular development and simplified audits

You can actually apply system design thinking and you're not betting your business on some miracle moody API. The overreliance on these APIs for literally anything these days reminds me of God objects. The smaller size of SLMs also lowers the barrier for conducting audits, verification, and customization to meet regulations. It’s easier to understand how the model processes data, and implement your own encryption or logging.

  1. running on isolated and low-end hardware

SLMs can operate almost anywhere: from a local server in a private network to a doctor’s or inspector’s device. The edge is here. Robots, cars, drones.

  1. distributed security architecture

Unlike the monolithic architecture of LLMs, where all security components are “baked” into one large model, SLMs enable the creation of a distributed security system.

  1. overindexing on llms can lead to unexpected results

They're non-deterministic machines after all, and the larger the task we give them, the worse they perform. Add to that the current providers are not reliable at all. Check OpenAI's or Anthropic's status pages and not a week goes by without downtime.

  1. money $$$

This definitely should not be last but, in the age of VC-fueled bonanza we live in, it's easy to forget. Most AI startups' business models simply will not survive LLM unit-economics long-term. The bet is on some massive efficiency improvements, which although likely to some degree, might not happen in the scale these businesses expect. Let's not forget OpenAI itself is losing money on their most expensive plan and Uber is still not profitable after 15 years. At some point the bill arrives. Meanwhile connect a high-pressure pipe from your wallets straight to NVIDIA's headquarters.

how

Setting up a runtime like vLLM on AWS INF2 has shown some promising results in my testing. I absolutely love the work AWS is doing with the Inferentia chips. More realistically, for production workloads, you'd get started with an offering from Google Vertex AI or AWS Bedrock suite of products. Deploying them on the edge can be done with Microsoft's ONNX or even WASM.

what

Probably* you are not going to be doing image/video-generation with those models but CV, NLP, and function calling are all doable.
*I don't know honestly, maybe soon.

Some real-life examples:

  • Summarizing domain-specific documents like regulations. #phi-3.5
  • Smart summarization where you split documents in logical sections instead of dumping everything in a mega prompt. #phi-3.5
  • Generating marketing collateral, snippets, personalized customer support responses. #phi-3.5
  • Digitization of images and handwritten text. #MiniCPM-Llama3-V2.5
  • Data extraction for both structured and unstructured data. #MiniCPM-Llama3-V2.5
  • Content Moderation #LLaMA3.18B
  • Retail demand forecasting train your own with AutoML
  • Helpdesk support and routing #LLaMA3.18B
  • Diabetes tests! #Diabetica-7B

interesting

models

I only put open-source models above but Gemini Flash reserves an honorable mention.

leaderboards

resources

papers