| Model Name | Size / Type | Description / Key Strengths | Context Window (approx.) | Notes / Variants |
|---|---|---|---|---|
| all-minilm-l6-v2-vllm | Small embedding | Sentence-transformers model mapping sentences & paragraphs to 384-dim vectors | – | Embedding model |
| deepseek-r1-distill-llama | Distilled LLaMA | Reasoning capabilities distilled from DeepSeek-R1 into a LLaMA base; fast and practical for real-world tasks | Medium-High | Distilled version |
| deepseek-v3.2-vllm | Large | Improved efficiency, reasoning, DeepSeek Sparse Attention (DSA), agentic capabilities | High | – |
| devstral-small | 24B | Agentic coding LLM fine-tuned from Mistral-Small-3.1 | 128K | Original version |
| devstral-small-2 | 24B (FP8) | Agentic software-engineering (SWE) tasks, codebase tooling, strong SWE-bench results; supports vision | 256K | Top recommendation for coding |
| embeddinggemma | Embedding | State-of-the-art text embedding from Google DeepMind | – | – |
| functiongemma | 270M | Offline function-calling agents on small devices | Low-Medium | Very lightweight |
| gemma3 | Small / Medium | Google’s Gemma 3: small yet strong for chat & generation | Medium-High | Includes QAT (quantization-aware training) variant |
| gemma3n | Efficient multimodal | Text, image, audio, video on low-resource devices | Medium | Multimodal edge |
| gemma4 | 2B–31B (incl. 26B MoE) | Multimodal, optimized for reasoning, coding, long context | High | Multiple sizes (E2B, E4B, 31B, 26B A4B) |
| glm-4.7-flash | ~30B-A3B MoE | Balances strong performance with efficient deployment | High | Flash variant |
| glm-5-safetensors | 744B MoE (40B active) | Reasoning, coding, agentic tasks (FP8) | High | Very large MoE |
| gpt-oss | Varies | OpenAI’s open-weight models for powerful reasoning & agentic tasks | High | Includes safeguard variant |
| granite-4.0-h-micro | 3B | Long-context instruct with RL, tool calling, enterprise readiness | Long | Hybrid (Mamba-2/Transformer) micro variant |
| granite-4.0-h-nano | Lightweight | Lightweight instruct via SFT, RL, and model merging | Medium | Hybrid nano variant |
| granite-4.0-h-small | 32B | Long-context instruct with RL, tool calling, enterprise optimization | Long | Hybrid small variant |
| granite-4.0-h-tiny | 7B | Long-context instruct with RL, tool calling, enterprise optimization | Long | Hybrid tiny variant |
| granite-4.0-micro | 3B | Long-context instruct with RL, tool calling, enterprise optimization | Long | Non-hybrid counterpart of h-micro |
| granite-4.0-nano | Lightweight | Lightweight instruct via SFT, RL, and model merging | Medium | Non-hybrid counterpart of h-nano |
| granite-docling | 258M multimodal | Efficient document conversion | Medium | Document-focused |
| granite-embedding-multilingual | 278M | Encoder-only multilingual embedding (XLM-RoBERTa style) | – | Multilingual embedding |
| kimi-k2 | Varies | Open-source agentic model with deep reasoning, stable tool use, and fast INT4 inference | 256K | Thinking model |
| llama3.2 | Medium | Reliable for coding, chat, Q&A | High | – |
| llama3.3 | Medium-Large | Improved reasoning and generation quality | High | Newest LLaMA 3 |
| magistral-small-3.2 | 24B multimodal | Tuned for accuracy, tool use, and reduced repetition | High | Mistral AI |
| ministral3 | Compact | Vision-enabled model optimized for local edge use; performance approaching ~24B-class models | Medium-High | Vision-enabled |
| mistral | Efficient | Top-tier performance and fast inference | High | General purpose |
| moondream2 | Small VLM | Fast visual language model for image interpretation via text prompts | Medium | Vision-focused |
| nomic-embed-text-v1.5 | Embedding | Open-source, fully auditable text embedding model | – | Auditable embedding |
| phi4 | 14B | Surprisingly capable at reasoning and code | Medium | Microsoft compact model |
| qwen3 | Large | Top-tier coding, math, reasoning, language tasks | Very High | Qwen 3 base generation |
| qwen3-coder | Coding series | Dedicated coding agent models | High | – |
| qwen3-coder-next | 80B MoE (3B active) | Advanced coding agent for code generation, debugging, agentic tasks | 256K | Highly efficient MoE |
| qwen3-embedding | Embedding | Multilingual for retrieval, ranking, clustering | – | Text embedding |
| qwen3-reranker | Reranker | Multilingual reranking for text retrieval (119 languages) | – | Reranking model |
| qwen3-vl | Advanced multimodal | Major gains in text, vision, video, reasoning | High | Vision-language |
| qwen3.5 | 397B MoE (17B active) | Multimodal with 262K context, 201 languages, reasoning/coding/agents | 262K | Flagship MoE |
| qwq | 32B | Experimental reasoning-focused model from the Qwen team | Medium-High | Experimental |
| seed-oss | Varies | Reasoning, agent, general capabilities, developer-friendly | High | – |
| smollm2 | Tiny | Built for speed, edge devices, and local development | Medium | Lightweight; multimodal sibling listed as smolvlm |
| smollm3 | 3.1B | Efficient on-device use with strong chat performance | Medium | On-device |
| smolvlm | Lightweight multimodal | Video, image, text analysis optimized for devices | Medium | Multimodal edge |
| snowflake-arctic-embed-l-v2-vllm | Embedding | Boosts multilingual retrieval and efficiency | – | Multilingual embedding |
| stable-diffusion | Diffusion | Image generation (base latent diffusion + refiner) | – | Image gen (not LLM) |
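
The embedding rows above (all-minilm-l6-v2-vllm, embeddinggemma, qwen3-embedding, nomic-embed-text-v1.5, snowflake-arctic-embed-l-v2-vllm, granite-embedding-multilingual) return vectors rather than text. A minimal sketch of fetching embeddings, assuming the runtime exposes an OpenAI-compatible API at `http://localhost:8000/v1` — the host, port, and model tag are assumptions and will differ per deployment:

```python
# Sketch only: the endpoint URL and model tag are assumptions, not part
# of the catalog above. Adjust both to match your deployment.
import requests

BASE_URL = "http://localhost:8000/v1"  # assumed OpenAI-compatible server

resp = requests.post(
    f"{BASE_URL}/embeddings",
    json={
        "model": "all-minilm-l6-v2-vllm",
        "input": ["The quick brown fox", "A fast auburn canine"],
    },
    timeout=30,
)
resp.raise_for_status()
vectors = [item["embedding"] for item in resp.json()["data"]]
print(len(vectors), "vectors of dim", len(vectors[0]))  # MiniLM-L6-v2: 384 dims
```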

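For the chat and coding models, the same server convention applies via the chat-completions route. A hedged example against devstral-small-2 (the coding recommendation in the table); again, the base URL and exact model tag are assumptions:

```python
# Sketch only: assumes an OpenAI-compatible server at localhost:8000 and
# that the model is served under the tag "devstral-small-2".
import requests

BASE_URL = "http://localhost:8000/v1"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "devstral-small-2",
        "messages": [
            {"role": "system", "content": "You are a concise coding assistant."},
            {"role": "user", "content": "Write a Python one-liner that reverses a string."},
        ],
        "temperature": 0.2,
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```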

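Once you have vectors, retrieval is just cosine similarity between the query embedding and each document embedding. A self-contained sketch; the toy 3-dim vectors stand in for real model output (e.g. the 384-dim MiniLM vectors):

```python
# Rank documents by cosine similarity to a query embedding.
import numpy as np

def rank_by_similarity(query_vec, doc_vecs):
    """Return (indices sorted best-first, similarity scores)."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    q = q / np.linalg.norm(q)                         # unit-normalize the query
    d = d / np.linalg.norm(d, axis=1, keepdims=True)  # unit-normalize each doc
    scores = d @ q                                    # dot of unit vectors = cosine
    return np.argsort(-scores), scores

order, scores = rank_by_similarity(
    [1.0, 0.0, 0.0],
    [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]],
)
print(order, scores)  # first index is the closest document
```

For higher-precision ordering, a dedicated reranker such as qwen3-reranker can rescore the top cosine-similarity hits.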