Build a Local Writing Assistant on an Old Computer with Bonsai and Ollama
Introduction
Most “run this model locally” tutorials stop the moment the model produces any output at all. They download a file, start a server, send one test prompt, and call it done. They rarely cover what happens when that output is technically present but practically useless, because the model spent its entire response budget thinking instead of answering.
In this tutorial we build a local writing assistant on top of Bonsai, PrismML’s 1-bit quantized language model, served through Ollama. You paste a rough, informal draft into a small local web form and choose a tone (tighten, formalize, casual, or clarify). A Python layer then sends the draft to the model running entirely on your own machine and returns a rewritten version. Along the way we discover that this particular model is a reasoning model, something nobody warned us about, work out why its answers were coming back empty, and fix it without sending a single byte to a cloud API.

What We Are Building
A CLI tool and a local web form, both backed by the same logic. The workflow:
Load a rough draft, either from a file or pasted into a web form
Send it to Bonsai, running locally through Ollama, with one of four tone instructions
Separate the model’s final answer from its internal reasoning, since Bonsai produces both
Save the rewritten draft and the reasoning trace as two distinct files
Track token usage and response time across every request, with no cost to calculate, since everything runs locally for free
Tech Stack
Component | Tool |
Model | Bonsai 8B (PrismML’s 1-bit quantized LLM) |
Model runtime | Ollama, via its OpenAI-compatible local endpoint |
Web form | Python’s built-in http.server |
Dependencies | None beyond the Python standard library |
Project Structure
bonsai_writing_assistant/
├── src/
│ ├── draft_polisher.py # DraftPolisher class: loads a draft, calls Bonsai, saves the result
│ └── polish_server.py # local web form, calls DraftPolisher directly, no subprocess involved
├── examples/
│ └── sample_draft.txt # a rambling, informal draft for testing every tone
├── requirements.txt # empty on purpose, stdlib only
├── usage_log.json # token counts and latency, accumulated across every request
└── results/ # one polished_draft.md and reasoning_trace.md per CLI run
Setting Up Bonsai Through Ollama
This project does not install or run Bonsai itself. Pull and run it through Ollama first:
ollama run hf.co/prism-ml/Bonsai-8B-gguf:Q1_0
That pulls the quantized GGUF directly from its Hugging Face repository and starts serving it. Leave it running. Ollama’s default port is 11434, and everything below assumes http://127.0.0.1:11434 is already answering requests.
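Before moving on, it can help to confirm that something is actually answering on that port. The following is a minimal sketch, not part of the original project: the helper name ollama_is_reachable is hypothetical, and it only assumes that a running server returns some HTTP response to a plain GET on its base URL.

```python
import urllib.error
import urllib.request

OLLAMA_BASE_URL = "http://127.0.0.1:11434"  # default address assumed throughout this tutorial


def ollama_is_reachable(base_url: str = OLLAMA_BASE_URL, timeout: float = 3.0) -> bool:
    """Return True if any HTTP response comes back from base_url (hypothetical helper)."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout):
            return True  # the server answered with a success status
    except urllib.error.HTTPError:
        return True  # the server answered, just not with a success status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timed out, or host unreachable
```

Calling ollama_is_reachable() before the first rewrite request gives a quick yes or no, which is easier to read than a stack trace from a refused connection several minutes into a run.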
Building the Draft Polisher
Create a file named draft_polisher.py inside a src folder.
import argparse  # parse CLI args for standalone use
import json  # build the request body and parse the response
import time  # measure how long the local model takes to respond
import urllib.request  # talk to Bonsai's local server without extra dependencies
from datetime import datetime  # timestamp each usage entry in usage_log.json
from pathlib import Path  # filesystem paths for the draft and the polished output
from typing import Any, Dict, List  # type hints for usage records and aggregate summaries

PROJECT_ROOT = Path(__file__).resolve().parent.parent  # repo root, one level above src/
BONSAI_SERVER_URL = "http://127.0.0.1:11434/v1/chat/completions"  # Ollama's OpenAI-compatible endpoint
BONSAI_MODEL_NAME = "hf.co/prism-ml/Bonsai-8B-gguf:Q1_0"  # matches `ollama run hf.co/prism-ml/Bonsai-8B-gguf:Q1_0`

REWRITE_INSTRUCTIONS = {  # one rewrite instruction per supported tone, keyed by the --tone value
    "tighten": "Rewrite the following text to be more concise. Remove filler words and redundant "
               "phrasing, but keep the original meaning intact.",
    "formalize": "Rewrite the following text in a more formal, professional tone, suitable for a "
                 "business email.",
    "casual": "Rewrite the following text in a more relaxed, friendly, conversational tone.",
    "clarify": "Rewrite the following text to fix awkward phrasing and improve clarity, without "
               "changing its meaning.",
}


class DraftPolisher:
    # Wraps every call to the local Bonsai server, so both the CLI entry point below and the
    # web server in polish_server.py share one implementation instead of duplicating the HTTP call.
    def __init__(self, server_url: str = BONSAI_SERVER_URL, model_name: str = BONSAI_MODEL_NAME):
        self.server_url = server_url  # where Ollama is listening
        self.model_name = model_name  # the model id to send in every request

    def load_draft(self, draft_path: Path) -> str:
        return draft_path.read_text(encoding="utf-8").strip()  # raw draft text, whitespace trimmed
BONSAI_SERVER_URL points at Ollama’s port, not at any cloud address, and BONSAI_MODEL_NAME matches the exact identifier the ollama run command pulled, since Ollama looks up loaded models by this string. REWRITE_INSTRUCTIONS keeps the instruction for every supported tone in one place, so adding a fifth tone later means adding one dictionary entry, without touching the request logic at all.
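To make the "one dictionary entry" claim concrete, here is a small sketch. The dictionary below is a shortened stand-in for the real REWRITE_INSTRUCTIONS so the snippet runs on its own, and the "summarize" tone and the build_user_message helper are hypothetical, added purely for illustration.

```python
# Shortened stand-in for the article's REWRITE_INSTRUCTIONS, so this snippet runs on its own.
REWRITE_INSTRUCTIONS = {
    "tighten": "Rewrite the following text to be more concise.",
    "clarify": "Rewrite the following text to fix awkward phrasing and improve clarity.",
}

# Hypothetical fifth tone: one new entry, nothing else changes.
REWRITE_INSTRUCTIONS["summarize"] = (
    "Summarize the following text in two sentences, keeping the key request intact."
)


def build_user_message(tone: str, draft_text: str) -> str:
    """Join instruction and draft the same way request_rewrite builds its user message."""
    return f"{REWRITE_INSTRUCTIONS[tone]}\n\n{draft_text}"


print(sorted(REWRITE_INSTRUCTIONS))
```

Because both the CLI's --tone choices and the web form's option list are generated from the dictionary keys, a new entry like this should show up in both places without further edits.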
Calling Bonsai and Discovering It’s a Reasoning Model
    def request_rewrite(self, draft_text: str, tone: str) -> Dict[str, object]:
        instruction = REWRITE_INSTRUCTIONS[tone]  # raises KeyError on an unsupported tone
        payload = json.dumps({  # OpenAI-shaped body; Bonsai never sees OpenAI at all
            "model": self.model_name,
            "messages": [
                {"role": "system", "content": "You are a careful, concise writing editor."},
                {"role": "user", "content": f"{instruction}\n\n{draft_text}"},
            ],
            "temperature": 0.4,  # low, since this is editing rather than free writing
            "max_tokens": 2000,  # Bonsai reasons at length before writing the final
                                 # answer (confirmed: 300 tokens wasn't even enough
                                 # for one short sentence), so this needs real headroom.
                                 # "think": false was tested and does not suppress this.
            "stream": False,  # wait for the full response, no chunked streaming
        }).encode("utf-8")
        request = urllib.request.Request(
            self.server_url, data=payload,
            headers={"Content-Type": "application/json"},  # no Authorization header, nothing to authenticate
        )
        start = time.monotonic()  # local inference is slow here, expect 1-3+ minutes
        with urllib.request.urlopen(request, timeout=300) as response:
            body = json.loads(response.read().decode("utf-8"))  # same {"choices": [...]} shape OpenAI uses
        elapsed_seconds = round(time.monotonic() - start, 3)

        message = body["choices"][0]["message"]  # same nesting OpenAI uses: choices[0].message
        rewritten_text = (message.get("content") or "").strip()  # the final answer, may be empty or null; see below
        reasoning_text = (message.get("reasoning") or "").strip()  # Bonsai's chain-of-thought, kept separately
        finish_reason = body["choices"][0].get("finish_reason", "")  # "length" means the token budget ran out
        if not rewritten_text and finish_reason == "length":
            # The token budget ran out while Bonsai was still in its "reasoning" field and
            # never reached the final answer. Say so plainly instead of saving a blank result.
            rewritten_text = (
                "[Bonsai ran out of tokens while reasoning and never wrote a final answer. "
                "Try a shorter draft or raise max_tokens in request_rewrite.]"
            )
        token_usage = dict(body.get("usage", {}))  # Ollama's OpenAI shim doesn't always include this
        return {
            "rewritten_text": rewritten_text,
            "reasoning_text": reasoning_text,
            "elapsed_seconds": elapsed_seconds,
            "token_usage": token_usage,
        }
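The placeholder message above suggests raising max_tokens by hand. One possible alternative, sketched here under assumptions, is to retry with a larger budget automatically. The function rewrite_with_growing_budget, the send callable, and the budget values are all hypothetical; in this project send would wrap the urlopen call with a given max_tokens and return the parsed JSON body.

```python
from typing import Callable, Dict, List, Sequence


def rewrite_with_growing_budget(
    send: Callable[[int], Dict],
    budgets: Sequence[int] = (2000, 4000, 8000),  # illustrative values, not tuned
) -> Dict:
    """Call send(max_tokens) with growing budgets until the answer is non-empty."""
    body: Dict = {}
    for max_tokens in budgets:
        body = send(max_tokens)
        choice = body["choices"][0]
        content = (choice["message"].get("content") or "").strip()
        truncated = choice.get("finish_reason") == "length"
        if content or not truncated:
            break  # either an answer arrived, or the failure is not a budget problem
    return body


# Fake transport for demonstration: pretends the model needs at least 4000 tokens.
attempted_budgets: List[int] = []


def fake_send(max_tokens: int) -> Dict:
    attempted_budgets.append(max_tokens)
    if max_tokens < 4000:
        return {"choices": [{"message": {"content": "", "reasoning": "Okay, the user..."},
                             "finish_reason": "length"}]}
    return {"choices": [{"message": {"content": "Polished text.", "reasoning": "..."},
                         "finish_reason": "stop"}]}


final_body = rewrite_with_growing_budget(fake_send)
print(attempted_budgets, final_body["choices"][0]["message"]["content"])
```

The trade-off is time: every retry repeats the full inference, so on slow hardware a single generous budget may still be the better choice.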
Finding out what was actually happening required inspecting the raw HTTP response directly. In Windows PowerShell, curl is an alias for Invoke-WebRequest and does not accept curl’s usual flags, so we sent the same request with Invoke-RestMethod and printed the full response with ConvertTo-Json -Depth 10, which showed the real shape:
{
"message": {
"role": "assistant",
"content": "",
"reasoning": "Okay, the user wants me to rewrite a text that's a bit awkward. Let me start by reading the original message..."
},
"finish_reason": "length"
}
The content field was empty, but reasoning held several hundred words of genuine chain-of-thought, and finish_reason: "length" confirmed the model had hit its token ceiling mid-thought, before ever writing a final answer into content. Bonsai is a reasoning model: it always writes out its thinking first, in a separate field, the same pattern used by other “thinking” models, and only writes the actual answer afterward, if there is any token budget left to do so.
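The three fields in that response are enough to tell the possible outcomes apart. As a minimal sketch, the hypothetical helper classify_choice below takes one element of body["choices"], shaped like the sample above, and labels it; the status strings are made up for this illustration.

```python
from typing import Dict, Tuple


def classify_choice(choice: Dict) -> Tuple[str, str, str]:
    """Return (status, answer, reasoning) for one element of body["choices"]."""
    message = choice.get("message", {})
    answer = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning") or "").strip()
    finish_reason = choice.get("finish_reason", "")
    if answer:
        status = "answered"
    elif finish_reason == "length":
        status = "truncated_while_reasoning"  # budget ran out before any answer was written
    else:
        status = "empty"
    return status, answer, reasoning


sample_choice = {
    "message": {
        "role": "assistant",
        "content": "",
        "reasoning": "Okay, the user wants me to rewrite a text that's a bit awkward.",
    },
    "finish_reason": "length",
}
status, answer, reasoning = classify_choice(sample_choice)
print(status)  # truncated_while_reasoning
```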
Saving the Result and the Reasoning Separately
    def save_result(self, output_dir: Path, draft_text: str, tone: str, result: Dict[str, object]) -> Path:
        output_dir.mkdir(parents=True, exist_ok=True)  # ensure the destination folder exists
        output_path = output_dir / "polished_draft.md"  # the deliverable: original, rewrite, and timing
        content = (
            f"## Original (tone requested: {tone})\n\n{draft_text}\n\n"  # what was actually sent in
            f"## Rewritten\n\n{result['rewritten_text']}\n\n"  # Bonsai's final answer
            f"## Timing\n\nBonsai responded in {result['elapsed_seconds']} seconds.\n"  # how slow local inference was
        )
        output_path.write_text(content, encoding="utf-8")  # one write, the file never exists half-finished

        # reasoning_trace.md is kept separate from the polished draft on purpose, since this is
        # Bonsai's internal chain-of-thought, not part of the deliverable text.
        reasoning_text = result.get("reasoning_text") or "(no reasoning text returned for this request)"
        reasoning_path = output_dir / "reasoning_trace.md"
        reasoning_path.write_text(
            f"## Bonsai's Reasoning (tone requested: {tone})\n\n{reasoning_text}\n",
            encoding="utf-8",
        )
        return output_path  # caller only needs the main file's path
reasoning_trace.md exists as its own file rather than a section bolted onto polished_draft.md, since the chain-of-thought is internal to the model, not something whoever requested the rewrite actually asked to read. Keeping it separate means polished_draft.md stays a clean, shareable deliverable, while the reasoning is still available on disk for anyone curious how Bonsai got to its answer.
Tracking Token Usage
def make_usage_entry(model_name: str, tone: str, result: Dict[str, Any]) -> Dict[str, Any]:
    # No cost fields here on purpose: Bonsai runs locally through Ollama, so there is no
    # per-token price to compute, unlike the OpenAI-backed projects earlier in this series.
    usage = result.get("token_usage") or {}  # may be empty; Ollama doesn't always report it
    return {
        "timestamp": datetime.now().isoformat(),  # when this specific request was logged
        "model": model_name,  # which model id actually served this request
        "tone": tone,  # which rewrite instruction was requested
        "prompt_tokens": usage.get("prompt_tokens", 0),  # input tokens, straight from Ollama's usage field
        "completion_tokens": usage.get("completion_tokens", 0),  # output tokens, includes reasoning plus answer
        "total_tokens": usage.get("total_tokens", 0),  # prompt_tokens + completion_tokens
        "elapsed_seconds": result.get("elapsed_seconds", 0),  # wall-clock time set by request_rewrite
    }


def aggregate_usage(entries: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Re-derived from the full entry list every time, rather than kept as a running counter,
    # so this function alone is the single source of truth for what the totals mean.
    return {
        "total_requests": len(entries),  # how many requests this list covers
        "total_prompt_tokens": sum(e["prompt_tokens"] for e in entries),  # summed across every request
        "total_completion_tokens": sum(e["completion_tokens"] for e in entries),
        "total_tokens": sum(e["total_tokens"] for e in entries),
        "total_elapsed_seconds": round(sum(e["elapsed_seconds"] for e in entries), 3),  # total wall-clock time spent
    }


def log_usage(stats_path: Path, entry: Dict[str, Any]) -> None:
    try:  # load history written by previous runs
        existing = json.loads(stats_path.read_text(encoding="utf-8"))
        entries = existing.get("requests", [])  # every request recorded so far
    except (FileNotFoundError, json.JSONDecodeError):
        entries = []  # first run, start with empty history
    entries.append(entry)  # this request now joins the lifetime history
    output = {
        "summary": {
            "timestamp": datetime.now().isoformat(),  # when usage_log.json was last written
            **aggregate_usage(entries),  # lifetime totals across every request ever made
        },
        "requests": entries,  # every individual request ever recorded
    }
    stats_path.parent.mkdir(parents=True, exist_ok=True)  # in case usage_log.json lives in a new folder
    stats_path.write_text(json.dumps(output, indent=2), encoding="utf-8")  # overwrite with the updated lifetime history
This mirrors the accumulate-across-runs pattern used for cost tracking in the OpenAI-backed projects earlier in this series, minus every cost field, since there is genuinely no per-token price for a model running on your own hardware. What is left (prompt tokens, completion tokens, and elapsed time) is still useful: it is the only way to see, across many requests, how many output tokens Bonsai spends per request on reasoning and answer combined, and how slow local inference actually is on your specific machine.
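As a sketch of what can be read back out of that log, the hypothetical helper summarize_throughput below turns a list of entries shaped like the ones make_usage_entry produces into per-request averages and a tokens-per-second figure. The sample entries are made-up numbers for illustration, not measurements.

```python
from typing import Any, Dict, List


def summarize_throughput(entries: List[Dict[str, Any]]) -> Dict[str, float]:
    """Average output size and speed across logged requests (hypothetical helper)."""
    count = len(entries)
    total_completion = sum(e.get("completion_tokens", 0) for e in entries)
    total_seconds = sum(e.get("elapsed_seconds", 0) for e in entries)
    return {
        "avg_completion_tokens": total_completion / count if count else 0.0,
        "avg_seconds": total_seconds / count if count else 0.0,
        "completion_tokens_per_second": total_completion / total_seconds if total_seconds else 0.0,
    }


# Made-up entries, for illustration only.
sample_entries = [
    {"completion_tokens": 1200, "elapsed_seconds": 120.0},
    {"completion_tokens": 800, "elapsed_seconds": 80.0},
]
summary = summarize_throughput(sample_entries)
print(summary)
```

In the project itself, the entries would come from the "requests" list inside usage_log.json. Note that completion_tokens counts reasoning and answer together, so this figure shows total output cost, not the split between the two.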
Wiring It Together and the CLI Entry Point
def polish_file(draft_path: Path, tone: str, output_dir: Path, stats_path: Path = None,
                polisher: DraftPolisher = None) -> Path:
    polisher = polisher or DraftPolisher()  # allow a caller (or a test) to inject a different instance
    draft_text = polisher.load_draft(draft_path)  # raw draft text from disk
    result = polisher.request_rewrite(draft_text, tone)  # the model call, plus reasoning and usage data
    if stats_path is not None:
        log_usage(stats_path, make_usage_entry(polisher.model_name, tone, result))
    return polisher.save_result(output_dir, draft_text, tone, result)


def main() -> None:
    parser = argparse.ArgumentParser(description="Local writing assistant powered by Bonsai")  # CLI entry point
    parser.add_argument("--draft", required=True, type=Path)  # path to the rough draft file
    parser.add_argument("--tone", choices=list(REWRITE_INSTRUCTIONS), default="clarify")  # which rewrite to request
    parser.add_argument("--output-dir", type=Path, default=PROJECT_ROOT / "results")  # where polished_draft.md lands
    parser.add_argument("--stats-path", type=Path, default=PROJECT_ROOT / "usage_log.json")  # token/latency history
    args = parser.parse_args()  # parses sys.argv, exits with usage text on error
    output_path = polish_file(args.draft, args.tone, args.output_dir, args.stats_path)
    print(f"Wrote {output_path}")  # explicit success signal, same reason the last project needed one


if __name__ == "__main__":
    main()
polish_file accepts an optional polisher argument purely so a test can inject a DraftPolisher instance with request_rewrite mocked out, without touching any of the surrounding orchestration logic. stats_path is optional too, since the function is just as useful for a quick one-off polish where logging usage does not matter.
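A test along those lines might look like the sketch below. To keep it runnable on its own, it uses simplified stand-ins for DraftPolisher and polish_file that copy only the shape needed here; in the real project these would be imported from draft_polisher instead. The run_injection_demo function and the fake result are hypothetical.

```python
import tempfile
from pathlib import Path
from unittest import mock


class DraftPolisher:  # simplified stand-in for the class in draft_polisher.py
    model_name = "stand-in-model"

    def load_draft(self, draft_path: Path) -> str:
        return draft_path.read_text(encoding="utf-8").strip()

    def request_rewrite(self, draft_text: str, tone: str) -> dict:
        raise RuntimeError("would call the local model; a test should never reach this")

    def save_result(self, output_dir: Path, draft_text: str, tone: str, result: dict) -> Path:
        output_dir.mkdir(parents=True, exist_ok=True)
        output_path = output_dir / "polished_draft.md"
        output_path.write_text(result["rewritten_text"], encoding="utf-8")
        return output_path


def polish_file(draft_path: Path, tone: str, output_dir: Path, polisher=None) -> Path:
    polisher = polisher or DraftPolisher()  # same injection point as the article's version
    draft_text = polisher.load_draft(draft_path)
    result = polisher.request_rewrite(draft_text, tone)
    return polisher.save_result(output_dir, draft_text, tone, result)


def run_injection_demo() -> str:
    fake_result = {"rewritten_text": "Short and clear.", "reasoning_text": "",
                   "elapsed_seconds": 0.0, "token_usage": {}}
    polisher = DraftPolisher()
    polisher.request_rewrite = mock.Mock(return_value=fake_result)  # no HTTP call happens
    with tempfile.TemporaryDirectory() as tmp:
        draft_path = Path(tmp) / "draft.txt"
        draft_path.write_text("a long rambling draft\n", encoding="utf-8")
        output_path = polish_file(draft_path, "tighten", Path(tmp) / "results", polisher=polisher)
        polisher.request_rewrite.assert_called_once_with("a long rambling draft", "tighten")
        return output_path.read_text(encoding="utf-8")


print(run_injection_demo())
```

The point of the pattern is that the orchestration (load, call, save) is exercised for real, while the multi-minute model call is replaced by a value the test controls.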
The Web Form
Create a file named polish_server.py, also inside src.
import html  # escape rewritten text before embedding it in the response page
import http.server  # minimal stdlib HTTP server, no extra dependencies needed
import socketserver  # TCP server base used to host PolishRequestHandler
import sys  # extend sys.path so draft_polisher can be imported by file path
from pathlib import Path  # resolve this script's own directory for the sys.path insert
from urllib.parse import parse_qs  # decode the form-encoded POST body

sys.path.insert(0, str(Path(__file__).resolve().parent))
from draft_polisher import (  # reuse the same Bonsai call and usage logging as the CLI tool
    DraftPolisher, REWRITE_INSTRUCTIONS, PROJECT_ROOT, log_usage, make_usage_entry,
)

HOST, PORT = "127.0.0.1", 8090  # local-only, never bound to a public interface
STATS_PATH = PROJECT_ROOT / "usage_log.json"  # same file the CLI tool logs to, so usage accumulates either way
TONE_CHOICES_HTML = "".join(f'<option value="{tone}">{tone}</option>' for tone in REWRITE_INSTRUCTIONS)  # one <option> per tone
DRAFT_FORM_PAGE = f"""
<html><body>
<h1>Local Writing Assistant (Bonsai)</h1>
<form method="POST" action="/polish">
<textarea name="draft_text" rows="12" cols="70" placeholder="Paste your rough draft here..."></textarea><br>
<select name="tone">{TONE_CHOICES_HTML}</select>
<button type="submit">Polish</button>
</form>
</body></html>
""".strip()


def extract_submitted_draft(body: bytes) -> dict:
    # Decodes a standard application/x-www-form-urlencoded POST body, the default encoding
    # a plain HTML <form> without enctype="multipart/form-data" sends, so no multipart parsing
    # is needed here, unlike the file-upload form in the previous project.
    parsed = parse_qs(body.decode("utf-8"))  # {"draft_text": ["..."], "tone": ["..."]}
    return {
        "draft_text": parsed.get("draft_text", [""])[0],  # the textarea's contents, empty string if missing
        "tone": parsed.get("tone", ["clarify"])[0],  # the selected <option>, defaults to "clarify"
    }


class PolishRequestHandler(http.server.BaseHTTPRequestHandler):  # one instance per incoming HTTP request
    def do_GET(self) -> None:  # serves the draft form on any GET request
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")  # body is UTF-8 encoded below
        self.end_headers()
        self.wfile.write(DRAFT_FORM_PAGE.encode("utf-8"))

    def do_POST(self) -> None:  # runs the rewrite and returns it inline
        content_length = int(self.headers.get("Content-Length", 0))  # size of the incoming form body, 0 if absent
        body = self.rfile.read(content_length)  # read the full request body
        submitted = extract_submitted_draft(body)  # {"draft_text": ..., "tone": ...}
        polisher = DraftPolisher()  # talks directly to Ollama, no subprocess involved
        result = polisher.request_rewrite(submitted["draft_text"], submitted["tone"])  # may take 1-3+ minutes
        log_usage(STATS_PATH, make_usage_entry(polisher.model_name, submitted["tone"], result))  # same log as the CLI
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")  # so non-ASCII punctuation renders correctly
        self.end_headers()
        escaped = html.escape(result["rewritten_text"])  # avoid breaking the page on stray HTML chars
        self.wfile.write(f"<pre>{escaped}</pre>".encode("utf-8"))


if __name__ == "__main__":
    with socketserver.TCPServer((HOST, PORT), PolishRequestHandler) as httpd:  # bind to localhost only
        print(f"Serving on http://{HOST}:{PORT}")
        httpd.serve_forever()
This server is considerably simpler than a comparable form in an agent-orchestrated project, since there is no dispatch layer in between: do_POST calls DraftPolisher directly, in the same process, the same way the CLI tool does. extract_submitted_draft only needs urllib.parse.parse_qs, since a plain <form> without enctype="multipart/form-data" sends a simple URL-encoded body, not the multipart format a file upload would need.
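To see what that URL-encoded body looks like and how parse_qs decodes it, here is a short standalone sketch. The parse_polish_form helper and its fall-back-to-clarify behavior for unknown tones are assumptions added for illustration; the article's extract_submitted_draft does not validate the tone.

```python
from urllib.parse import parse_qs, urlencode

SUPPORTED_TONES = ("tighten", "formalize", "casual", "clarify")  # the four tones from the article


def parse_polish_form(body: bytes) -> dict:
    """Decode a form body and fall back to "clarify" for unknown tones (assumed policy)."""
    parsed = parse_qs(body.decode("utf-8"))  # values arrive as lists, one item per repeated key
    tone = parsed.get("tone", ["clarify"])[0]
    if tone not in SUPPORTED_TONES:
        tone = "clarify"
    return {"draft_text": parsed.get("draft_text", [""])[0].strip(), "tone": tone}


# urlencode builds the same kind of body a browser sends for a plain <form> POST.
encoded_body = urlencode({"draft_text": "Hey so, can we talk?", "tone": "tighten"}).encode("utf-8")
print(encoded_body)
print(parse_polish_form(encoded_body))
```

Validating the tone at this layer would keep a hand-crafted request from reaching REWRITE_INSTRUCTIONS[tone] with an unsupported key, which otherwise raises KeyError inside the handler.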
Running the Application
Create and activate a virtual environment (Windows commands shown); no packages need installing:
python -m venv venv
venv\Scripts\activate
Run the CLI tool directly against the sample draft:
python src\draft_polisher.py --draft examples\sample_draft.txt --tone clarify
Expect this to take one to three minutes given Bonsai’s reasoning overhead. The result lands in results\polished_draft.md and results\reasoning_trace.md. Or run the web form instead:
python src\polish_server.py
Open http://127.0.0.1:8090, paste a draft, choose a tone, and submit. The rewritten text comes back directly in the response.
Installation steps and Output



sample_draft.txt file
Hey so I just wanted to like reach out and see if maybe you had some time later this week or even next week to possibly hop on a call about the project we talked about last time, no rush or anything just whenever works for you really, let me know whenever you get a chance thanks so much!
Then run this command in a terminal:
python src\draft_polisher.py --draft examples\sample_draft.txt --tone clarify
polished_draft.md file
## Original (tone requested: clarify)
Hey so I just wanted to like reach out and see if maybe you had some time later this week or even next week to possibly hop on a call about the project we talked about last time, no rush or anything just whenever works for you really, let me know whenever you get a chance thanks so much!
## Rewritten
Hi, I wanted to reach out and see if you have some time this week or next to schedule a call about the project we talked about last time. No rush at all—just let me know whenever works for you. Thank you for your time!
## Timing
Bonsai responded in 143.511 seconds.
Who Can Benefit
Students who want to see how a reasoning model’s response actually differs from a plain chat model at the API level
Developers running their first local or open-weight model through Ollama instead of a cloud API
Anyone working with a reasoning-capable model through an OpenAI-shaped endpoint, where reasoning and content arrive as separate fields
Writers and small teams who want a private, offline rewriting tool with no per-request cost
Researchers experimenting with extreme quantization, like Bonsai’s 1-bit weights, who need honest latency numbers rather than a vendor’s best case
How Codersarts Can Help
If you want to take this further, Codersarts offers hands-on support at every stage.
For learners: Live 1-to-1 sessions with an AI engineer who can walk through local model deployment, reasoning-model response formats, and debugging strategies for self-hosted inference in detail.
For teams: End-to-end development of tooling built on local or open-weight models, including reliability testing across model updates and usage tracking.
For enterprises: Architecture consulting for on-premises or air-gapped AI deployments, including model capability evaluation for hardware-constrained environments.
Reach out at contact@codersarts.com or visit www.codersarts.com to get started.