How to Write Good Prompts for Small Language Models
Small language models — such as Gemma 1B, Gemma 4B, or LLaMA 7B — are capable of generating text and answering questions, but their capabilities are significantly more limited than those of massive models like GPT-4 or Gemini 1.5. This means that when writing prompts for small models, we must be far more precise and deliberate to get satisfactory results. Large models often “figure out” our intentions with minimal effort, while smaller models require carefully planned instructions and context. In this post, we discuss good and bad practices in prompt engineering for small LLMs, methods for evaluating prompt quality, and techniques for diagnosing problems. This article is aimed at experienced developers and CTOs, so it includes technical details while remaining accessible.
Small vs. Large Models — Why Size Matters
The larger the model (in terms of parameters and training data), the more advanced its language abilities and “intelligence.” Small models (with 1–7 billion parameters) have limited knowledge and weaker reasoning and instruction-following capabilities compared to 30B+ or 100B+ models. In practice, this results in:
- Following instructions: The smallest models often struggle to follow instructions. For example, a 1B LLaMA model may get “lost” when given more complex tasks and may need fine-tuning to produce structured responses. Some developers note that below a certain model size, it becomes hard to get useful instruction-following results without further training.
- Hallucinations and answer quality: Small models are more prone to incorrect or fabricated responses. Reports suggest that 4B models “hallucinate” heavily and are mainly suitable for basic tasks or experimentation, whereas ~12B models provide more reliable and accurate answers.
- Task complexity: Small models can only handle relatively simple tasks. They often lack multi-step reasoning or advanced coding abilities. A rough rule of thumb from practitioners: 1B is weak, 4B becomes useful, 12B is clearly better, and 27B is slightly smarter. Larger models like 70B can solve problems that 7B can’t handle.
- Domain knowledge: Bigger models have more general knowledge from their training data, so they’re better at handling specialized terms or rare languages. Small models may miss context or provide inaccurate answers due to limited internal knowledge.
- Context length: Typically, larger models support larger input contexts (though not always). Leading large LLMs can handle huge contexts (e.g., 1 million tokens), while many small models are limited to a few thousand tokens (e.g., Gemma supports 8K tokens). When writing prompts for small models, always account for this constraint to avoid truncation.
To summarize: Large models are more forgiving of vague or messy prompts. Small models are not. With small LLMs, a good prompt is essential.
Best Practices for Prompting Small LLMs
Here are proven practices for improving the reliability of small LLM responses:
- Clear, specific instructions: Be unambiguous and detailed. Instead of “Do something with these numbers,” say “Calculate the average of this list and round to two decimal places.” Always specify the desired output format (e.g., “Respond with a valid JSON object with fields X and Y only.”).
- Provide context and assumptions: Don’t assume the model knows what you know. If your task needs specific knowledge, include it in the prompt. For example, if the current date is needed, state it explicitly.
- Use formatting and structure: Organize prompts into sections with labels or delimiters such as “### Task:” or “Input:” markers. This helps the model distinguish between task descriptions, inputs, and expected outputs (the first sketch after this list shows one way to lay this out).
- One task at a time: Avoid asking for multiple actions in a single prompt. For example, asking for translation, tone analysis, and a summary in one go may confuse a small model. Instead, split the work into sequential steps.
- Assign a role: Giving the model a persona helps. Start with “You are a legal assistant…” or “You’re an experienced Python developer…” to guide its tone and focus.
- Use examples (few-shot prompting): If the model struggles to understand the task, show it examples. A few well-structured input-output pairs can establish the desired pattern (see the second sketch after this list).
- Use chain-of-thought prompting: Ask the model to “think step by step” before giving the answer. This improves reasoning on complex tasks.
- Control output format: Specify exactly how the output should be formatted. If the model generates extra text, clarify that only the desired format is acceptable (e.g., “Respond with YES or NO only.”).
- Token limit awareness: Keep prompt length in check. Remember that the context window covers both the prompt and the expected output. Use a tokenizer to estimate length, and consider pre-processing data to save space (see the token-budget sketch after this list).
- Iterate and document: Treat prompts like code: version them, test changes, and document what works. Testing and refinement are essential.
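To make several of these practices concrete, here is a minimal sketch of a structured prompt that combines a role, delimited sections, and an explicit output contract. The task, the JSON field names, and the `run_model` call are illustrative placeholders, not part of any particular library.

```python
# A minimal structured prompt: role, delimited sections, explicit output format.
# `run_model` is a placeholder for your own inference call (llama.cpp, Ollama, etc.).

def build_prompt(ticket_text: str) -> str:
    return (
        "You are a support triage assistant.\n\n"
        "### Task:\n"
        "Classify the customer ticket below by urgency.\n\n"
        "### Input:\n"
        f"{ticket_text}\n\n"
        "### Output format:\n"
        'Respond with a valid JSON object with fields "urgency" '
        '("low", "medium", or "high") and "reason" (one short sentence). '
        "Do not add any text outside the JSON object."
    )

prompt = build_prompt("The app crashes every time I try to pay. I need this fixed today.")
# response = run_model(prompt)  # e.g., a local Gemma or LLaMA endpoint
```

The same skeleton works for most extraction and classification tasks; only the Task and Output format sections need to change.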
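Few-shot and chain-of-thought prompting can be layered onto the same skeleton. The sketch below is one illustrative layout, reusing the hypothetical urgency labels from above; the example pairs should come from your real data.

```python
# Few-shot examples plus a step-by-step instruction for a small model.
# The example pairs below are illustrative; use pairs drawn from your real task.

FEW_SHOT_EXAMPLES = [
    ("Order #123 arrived broken and support never replied.", "high"),
    ("Can you add a dark mode some day?", "low"),
]

def build_few_shot_prompt(ticket_text: str) -> str:
    shots = "\n\n".join(
        f"Ticket: {text}\nUrgency: {label}" for text, label in FEW_SHOT_EXAMPLES
    )
    return (
        "You are a support triage assistant. Classify ticket urgency as "
        "low, medium, or high.\n\n"
        f"{shots}\n\n"
        f"Ticket: {ticket_text}\n"
        "Think step by step about the customer's impact, then give the final "
        "answer on its own line as: Urgency: <low|medium|high>"
    )
```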
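For token-limit awareness, measure the prompt with the model's own tokenizer before sending it. The sketch below assumes the Hugging Face transformers library, a Gemma tokenizer as an example checkpoint, and an 8K context window; substitute whatever tokenizer and limit match your deployment.

```python
# Rough prompt budget check, assuming a Hugging Face tokenizer and an 8K context.
from transformers import AutoTokenizer

CONTEXT_LIMIT = 8192          # model's context window (prompt + completion)
RESERVED_FOR_OUTPUT = 512     # leave room for the answer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")  # example checkpoint

def fits_in_context(prompt: str) -> bool:
    n_tokens = len(tokenizer.encode(prompt))
    return n_tokens + RESERVED_FOR_OUTPUT <= CONTEXT_LIMIT
```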
Bad Prompting Habits
Avoid the following pitfalls:
- Vague or ambiguous prompts: These lead to inconsistent or irrelevant responses. Be as specific as possible.
- Multi-instruction prompts: Combining unrelated tasks makes it harder for small models to follow.
- Contradictory or confusing wording: Inconsistent instructions can derail the response.
- Poor formatting: A wall of unstructured text will confuse both humans and models. Use clear formatting.
- Assuming hidden knowledge: Small models won’t “figure it out” if you don’t include key context.
- Ignoring context/token limits: Sending too much data causes truncation and loss of relevant information.
- Lack of testing: One example is not enough. Always test across multiple cases.
How to Evaluate a Prompt
- Try different inputs: Test your prompt on a variety of inputs and evaluate consistency.
- Check output formatting: Use scripts to validate the output format, e.g., parse JSON and check numeric ranges (see the validation sketch after this list).
- Compare to larger models: If a large model succeeds where your small one fails, you may need to improve the prompt or switch models.
- Use standard metrics: For summarization or translation, try BLEU or ROUGE scores. For classification, use accuracy or F1 (see the metrics sketch after this list).
- Consider latency and cost: Efficient prompts can reduce runtime and resource usage.
- Get user feedback: For production use, gather human feedback to improve your prompts.
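A small validation script catches formatting drift early. The sketch below assumes the model was asked for a JSON object with `urgency` and `reason` fields, as in the earlier prompt sketch; adapt the expected fields and ranges to your own schema.

```python
# Validate that a model response is the JSON object we asked for.
import json

ALLOWED_URGENCY = {"low", "medium", "high"}

def validate_response(raw: str) -> bool:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # extra prose, markdown fences, or truncated output
    return (
        isinstance(data, dict)
        and data.get("urgency") in ALLOWED_URGENCY
        and isinstance(data.get("reason"), str)
    )
```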
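For classification-style prompts, standard metrics take only a few lines once you have a labeled test set. This sketch assumes scikit-learn is available and that `run_prompt` is your own wrapper that sends one input through the prompt and returns the predicted label.

```python
# Accuracy and macro F1 over a labeled test set; run_prompt() is your own
# wrapper that sends one input through the prompt and returns a label.
from sklearn.metrics import accuracy_score, f1_score

test_cases = [
    ("The app crashes every time I try to pay.", "high"),
    ("Can you add a dark mode some day?", "low"),
    # ... more labeled examples
]

def evaluate(run_prompt) -> None:
    y_true = [label for _, label in test_cases]
    y_pred = [run_prompt(text) for text, _ in test_cases]
    print("accuracy:", accuracy_score(y_true, y_pred))
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```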
Debugging Prompt Failures
If a prompt doesn’t work:
- Review it line by line: Look for ambiguity, missing context, or contradictions.
- Simplify: Strip the prompt down to the minimum required to perform the task, then reintroduce elements step by step.
- Pre-process input: If the model struggles with complex logic or calculations, do the heavy lifting beforehand in regular code.
- Test other models: A different small model might handle the task better.
- Highlight priorities: Use explicit rules or bullet points to reinforce critical instructions.
- Split the prompt: Break the task into separate steps or prompt stages (see the sketch after this list).
- Use a larger model to analyze: If you’re stuck, ask a bigger LLM to critique your prompt or suggest improvements.
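One way to split a task is to chain two simpler prompts, optionally doing any heavy lifting in plain code between the stages. The sketch below is illustrative; `run_model` stands in for whatever inference call you use and must be defined for your own backend.

```python
# Two-stage prompting: extract facts first, then decide.
# `run_model(prompt) -> str` is a placeholder for your own inference call.

def stage_1_extract(report: str) -> str:
    prompt = (
        "### Task:\nList the concrete problems mentioned in the report below, "
        "one per line. Output only the list.\n\n### Report:\n" + report
    )
    return run_model(prompt)

def stage_2_prioritize(problem_list: str) -> str:
    prompt = (
        "### Task:\nPick the single most urgent problem from the list below "
        "and answer with that line only.\n\n### Problems:\n" + problem_list
    )
    return run_model(prompt)

# priority = stage_2_prioritize(stage_1_extract(bug_report))
```

Each stage stays simple enough for a small model to follow, and the intermediate output can be validated or edited before the next step runs.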
Conclusion
Prompting small LLMs is an art and a science. These models can be surprisingly capable if we adapt to their limitations. Use precise instructions, context, examples, and structured formatting. Avoid ambiguity and overloading the prompt. Test thoroughly, iterate often, and treat prompt engineering as a development discipline. When all else fails, consider upgrading to a bigger model or fine-tuning your own. Small models won’t guess what you mean — but with the right prompt, they can be sharp and effective tools.