Production-Ready Text Embeddings with WebAssembly: WasmEdge + GGML
Building production ML inference services that run anywhere, from a Raspberry Pi to the cloud edge, calls for a lighter stack than the usual Python runtime. This article walks through a complete implementation of a text embedding API using WasmEdge, GGML, and Rust: a 136KB WASM module paired with a 1.8MB async HTTP server that returns an embedding in roughly 100-200ms per request.
Full implementation: github.com/porameht/wasmedge-ggml-llama-embedding
What We're Building
A production-ready embedding API that transforms text into 384-dimensional vectors using:
- WebAssembly runtime - Cross-platform portability (ARM, x86, RISC-V)
- GGML quantization - Efficient model inference (44MB model)
- Async HTTP server - High-throughput request handling with Tokio
- Dual-binary architecture - Flexible deployment (WASM-only or full server)
- Zero-config deployment - Environment variable configuration
Use cases:
- Semantic search - Index and query millions of documents by meaning
- RAG pipelines - Retrieval-augmented generation for LLMs
- Recommendation engines - Find similar products, content, or users
- Duplicate detection - Identify semantically similar items
- Edge computing - Run inference locally on IoT devices
The Challenge: Traditional ML Deployment
Traditional ML deployment approaches come with significant overhead:
| Approach | Binary Size | Dependencies | Portability |
|---|---|---|---|
| Python + PyTorch | N/A | Python runtime + packages | Platform-dependent |
| ONNX Runtime | ~20MB | Platform-specific builds | Good |
| TensorFlow Serving | ~500MB | Heavy dependencies | Platform-dependent |
| WasmEdge + GGML | 136KB | WasmEdge only | True cross-platform |
Why WebAssembly for edge computing:
- Truly cross-platform - Same binary runs on ARM, x86, RISC-V
- Minimal footprint - 136KB WASM module + 1.8MB server
- Sandboxed execution - Built-in isolation for multi-tenant environments
- No runtime dependencies - Just WasmEdge, no Python/Node.js required
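System architecture: a client sends POST /embed to the HTTP server, the server spawns a WasmEdge process, and the WASM module calls the GGML backend through WASI-NN and prints the embedding as JSON.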
Why WebAssembly for Edge ML?
WebAssembly (Wasm) suits ML inference at the edge for three reasons:
1. True Cross-Platform Deployment
Deploy the same binary to any edge device:
# Build once for WebAssembly
cargo build --target wasm32-wasip1 --release
# Run on ANY edge device - no recompilation needed
# Raspberry Pi (ARM)
wasmedge model.wasm
# Edge server (x86)
wasmedge model.wasm
# IoT gateway (RISC-V)
wasmedge model.wasm
2. Edge-First Security
Critical for multi-tenant edge environments:
# Sandboxed execution: the module only sees the directory mapped with --dir
wasmedge --dir .:. model.wasm
# Explicit model access: the host preloads the GGUF file and exposes it under the name "default"
wasmedge --nn-preload default:GGML:AUTO:model.gguf model.wasm
3. Production Performance Metrics
Verified specifications from the implementation:
| Metric | Value | Notes |
|---|---|---|
| WASM Binary | 136KB | Portable inference module |
| Server Binary | 1.8MB | Async HTTP API wrapper |
| Model Size | 44MB | Quantized GGUF format |
| Cold Start | 2-3 seconds | Model load + initialization |
| Inference Latency | 100-200ms | Per embedding request |
The Implementation
Dual-Binary Architecture - Two specialized components working together:
Component 1: WASM Embedding Module (136KB)
src/wasm.rs - The core inference engine
Responsibilities:
- Load GGUF models via WASI-NN
- Process text through GGML backend
- Output pure JSON embeddings
- Handle context management
- CLI interface for direct usage
Build command:
cargo build --bin wasm-embedding --target wasm32-wasip1 --release \
--features wasm --no-default-features
Component 2: HTTP Server (1.8MB)
src/server.rs - Production API wrapper
Responsibilities:
- Async HTTP handling (Warp + Tokio)
- Process management (spawn WasmEdge)
- CORS support for web integration
- Health monitoring
- Error handling and logging
Build command:
cargo build --bin embedding-api-server --release
Why this architecture?
- Flexibility - Use WASM alone or full server
- Performance - Async server handles concurrency
- Portability - WASM runs anywhere
- Maintainability - Clear separation of concerns
Let's dive into the implementation details.
1. Environment Configuration
Load runtime settings from environment variables for zero-recompilation deployment:
use serde_json::{json, Value};
use std::env;

fn get_options_from_env() -> Value {
    let mut options = json!({});
    // Map each environment variable to the option key the backend expects
    for (var, key) in [
        ("enable_log", "enable-log"),
        ("ctx_size", "ctx-size"),
        ("batch_size", "batch-size"),
        ("threads", "threads"),
    ] {
        if let Ok(val) = env::var(var) {
            // Values are JSON literals (true, 512); skip malformed ones instead of panicking
            if let Ok(parsed) = serde_json::from_str::<Value>(&val) {
                options[key] = parsed;
            }
        }
    }
    options
}
Available options:
- enable_log - Detailed logging (token counts, versions)
- ctx_size - Context window size (default: 512)
- batch_size - Batch processing size (default: 512)
- threads - CPU threads for inference (default: 4)
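To make the variable-to-option mapping concrete, here is a minimal Python sketch of the same idea. It is not the project's code: the helper name `options_from_env` is hypothetical, and skipping malformed values (rather than failing) is a choice made for this sketch. It reads the four variables from a dictionary such as `os.environ` and returns the options that would be serialized and handed to the graph builder.

```python
import json
import os

# Environment variable name -> option key, as listed above.
ENV_TO_OPTION = {
    "enable_log": "enable-log",
    "ctx_size": "ctx-size",
    "batch_size": "batch-size",
    "threads": "threads",
}


def options_from_env(environ=os.environ):
    """Build the options dict from environment variables (hypothetical helper)."""
    options = {}
    for var, key in ENV_TO_OPTION.items():
        raw = environ.get(var)
        if raw is None:
            continue
        try:
            # Values are JSON literals such as `true` or `1024`.
            options[key] = json.loads(raw)
        except json.JSONDecodeError:
            # Assumption of this sketch: ignore malformed values instead of failing.
            continue
    return options


if __name__ == "__main__":
    # Prints the JSON config for whatever is exported in the current shell.
    print(json.dumps(options_from_env()))
```

Unset variables are simply left out, so the backend falls back to its own defaults.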
2. WASI-NN Graph Initialization
The core of WASI-NN integration - loading the GGML model:
let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
.config(options.to_string())
.build_from_cache(model_name)
.expect("Create GraphBuilder Failed");
let mut context = graph
.init_execution_context()
.expect("Init Context Failed");
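The WASI-NN flow: build a graph from the preloaded model, create an execution context, set the input tensor, call compute, then read the output buffer.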
3. HTTP Server Implementation
The production API wraps the WASM module with an async HTTP server:
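// Assumes tokio::process::Command and anyhow::Result, as the .await calls imply
async fn generate_embedding(&self, text: &str) -> Result<EmbedResponse> {
    // Run WasmEdge to completion and capture stdout and stderr
    let output = Command::new("wasmedge")
        .arg("--dir").arg(".:.")
        .arg("--nn-preload")
        .arg(format!("default:GGML:AUTO:{}", self.model_path.display()))
        .arg(&self.wasm_path)
        .arg("default")
        .arg(text)
        .output()
        .await?;
    // Surface a failed run instead of trying to parse empty output
    if !output.status.success() {
        anyhow::bail!("wasmedge failed: {}", String::from_utf8_lossy(&output.stderr));
    }
    let stdout = String::from_utf8_lossy(&output.stdout);
    let parsed: serde_json::Value = serde_json::from_str(stdout.trim())?;
    // Return an error rather than panicking on unexpected JSON
    let n_embedding = parsed["n_embedding"].as_u64()
        .ok_or_else(|| anyhow::anyhow!("missing n_embedding"))? as usize;
    let embedding = parsed["embedding"].as_array()
        .ok_or_else(|| anyhow::anyhow!("missing embedding"))?
        .iter().filter_map(|v| v.as_f64()).collect();
    Ok(EmbedResponse { n_embedding, embedding })
}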
API Endpoints:
# Generate embedding
curl -X POST http://localhost:3000/embed \
-H "Content-Type: application/json" \
-d '{"text":"What is the capital of France?"}'
# Health check
curl http://localhost:3000/health
# API info
curl http://localhost:3000/
4. Tensor Processing
Proper tensor dimension handling for embeddings:
fn set_data_to_context(context: &mut GraphExecutionContext, data: Vec<u8>) -> Result<(), Error> {
context.set_input(0, TensorType::U8, &[data.len()], &data)
}
fn get_data_from_context(context: &GraphExecutionContext, index: usize) -> String {
const MAX_OUTPUT_BUFFER_SIZE: usize = 4096 * 20 + 128;
let mut output_buffer = vec![0u8; MAX_OUTPUT_BUFFER_SIZE];
let mut output_size = context.get_output(index, &mut output_buffer).unwrap();
output_size = std::cmp::min(MAX_OUTPUT_BUFFER_SIZE, output_size);
String::from_utf8_lossy(&output_buffer[..output_size]).to_string()
}
Why 4096 * 20 + 128?
- Most embedding models output ≤ 4096 dimensions
- Each float printed as string: ~20 bytes
- 128 bytes for JSON structure ({"n_embedding":...})
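The arithmetic behind that constant can be checked in a few lines. This is only a worked calculation under the assumptions listed above (about 20 bytes per printed float, 128 bytes of JSON framing); the function name is made up for the example.

```python
def output_buffer_size(dims=4096, bytes_per_float=20, json_overhead=128):
    """Upper bound, in bytes, for the JSON text holding one embedding."""
    return dims * bytes_per_float + json_overhead


print(output_buffer_size())     # 82048, the budget used in the Rust constant
print(output_buffer_size(384))  # 7808, what a 384-dimensional output needs
```

Under these assumptions, a 384-dimensional embedding uses less than a tenth of the 82048-byte buffer, so the fixed size leaves ample headroom for this model.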
5. Output Format
HTTP Response:
{
"n_embedding": 384,
"embedding": [0.5426, -0.0384, -0.0364, ..., 0.1234]
}
WASM CLI Output:
$ wasmedge --dir .:. --nn-preload default:GGML:AUTO:model.gguf \
wasmedge-ggml-llama-embedding.wasm default "Hello world"
{"n_embedding":384,"embedding":[0.5426,-0.0384,-0.0364,...]}
The WASM module outputs pure JSON (no extra text), making it easy to parse in any language or integrate with shell pipelines.
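As an illustration of how little glue that requires, the following Python sketch assembles the command line shown above, runs it, and validates the result. The function names are hypothetical, and the consistency check between `n_embedding` and the vector length is an extra safeguard added for this example, not something the module itself enforces.

```python
import json
import subprocess


def build_wasmedge_argv(wasm_path, model_path, text):
    """Assemble the same command line used in the CLI example."""
    return [
        "wasmedge", "--dir", ".:.",
        "--nn-preload", f"default:GGML:AUTO:{model_path}",
        wasm_path, "default", text,
    ]


def parse_embedding_output(raw):
    """Parse the module's JSON and check the declared size against the vector."""
    data = json.loads(raw.strip())
    embedding = [float(x) for x in data["embedding"]]
    if data["n_embedding"] != len(embedding):
        raise ValueError("n_embedding does not match the vector length")
    return embedding


def embed_via_cli(wasm_path, model_path, text):
    """Run WasmEdge once and return the embedding (needs wasmedge on PATH)."""
    result = subprocess.run(
        build_wasmedge_argv(wasm_path, model_path, text),
        capture_output=True, text=True, check=True,
    )
    return parse_embedding_output(result.stdout)
```

Passing the text as a separate argv element, rather than through a shell string, avoids quoting problems with user input.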
The Model: All-MiniLM-L6-v2
We use the all-MiniLM-L6-v2 model in GGUF format:
| Specification | Value |
|---|---|
| Output Dimensions | 384 |
| Model Size | 44MB (f16 GGUF) |
| Max Sequence Length | 256 tokens |
| Performance | ~10ms per inference |
| Use Case | General-purpose embeddings |
Download from HuggingFace:
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf
Model comparison:
| Model | Dimensions | Size | Inference | Quality |
|---|---|---|---|---|
| MiniLM-L6 | 384 | 44MB | ~10ms | Good |
| BERT-Base | 768 | 440MB | ~30ms | Better |
| MPNet | 768 | 440MB | ~35ms | Better |
| E5-Large | 1024 | 1.3GB | ~100ms | Best |
Building and Running
Quick Start
# 1. Install WasmEdge with WASI-NN plugin
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasi_nn-ggml
source $HOME/.wasmedge/env
# 2. Install Rust target
rustup target add wasm32-wasip1
# 3. Download model
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf
# 4. Build both binaries
./build-wasm.sh # Builds WASM module (136KB)
cargo build --bin embedding-api-server --release # Builds HTTP server (1.8MB)
# 5. Run the server
./target/release/embedding-api-server
Build time: ~10 seconds (both binaries)
Output sizes: 136KB (WASM) + 1.8MB (server)
Build Configuration
The Cargo.toml uses feature flags for optimal binary sizes:
[features]
default = ["server"]
wasm = ["wasmedge-wasi-nn"]
server = ["warp", "tokio", "serde", "anyhow"]
[profile.release]
opt-level = 3 # Maximum optimization
lto = true # Link-time optimization
strip = true # Strip debug symbols
Build targets:
# WASM module only (edge devices)
cargo build --bin wasm-embedding --target wasm32-wasip1 --release \
--features wasm --no-default-features
# HTTP server only (cloud/production)
cargo build --bin embedding-api-server --release
Size comparison:
WASM debug: 450KB
WASM release: 136KB (70% reduction)
Server release: 1.8MB
Edge Computing Use Cases
1. IoT Device Local Search
Run semantic search on IoT devices without cloud dependency:
Option A: HTTP API (Recommended)
import requests
import numpy as np
# Run API server on Raspberry Pi / edge gateway
def get_embedding(text):
    response = requests.post('http://localhost:3000/embed',
                             json={'text': text})
    return response.json()['embedding']
# Index sensor documentation locally
docs = [
"Temperature sensor calibration procedure",
"Pressure sensor fault codes",
"Humidity sensor maintenance"
]
doc_embeddings = [get_embedding(doc) for doc in docs]
# Search locally - no internet required
query = "How to fix pressure errors?"
query_emb = get_embedding(query)
# Compute similarity on-device
similarities = [
np.dot(query_emb, doc_emb) / (np.linalg.norm(query_emb) * np.linalg.norm(doc_emb))
for doc_emb in doc_embeddings
]
print(docs[np.argmax(similarities)]) # "Pressure sensor fault codes"
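On devices where installing NumPy is not practical, the same ranking can be done with the standard library. The sketch below is one possible way to return the top k matches instead of only the best one; `cosine_similarity` and `top_k` are names chosen for this example, and treating a zero vector as similarity 0.0 is an assumption of the sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_emb, doc_embeddings, k=3):
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [
        (cosine_similarity(query_emb, emb), index)
        for index, emb in enumerate(doc_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(index, score) for score, index in scored[:k]]
```

A linear scan like this is adequate for a few thousand documents; larger collections are better served by a vector index.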
2. Privacy-First Edge RAG
Process sensitive data locally - never send to cloud:
// Healthcare edge device - HIPAA compliant local processing
import { QdrantClient } from '@qdrant/js-client-rest';
async function getEmbedding(text) {
// HTTP API runs LOCALLY on edge device
const response = await fetch('http://localhost:3000/embed', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text })
});
return (await response.json()).embedding;
}
// Local vector database on edge device
const client = new QdrantClient({ url: 'http://localhost:6333' });
// Index patient data LOCALLY - never leaves device
async function indexPatientRecord(id, record) {
const embedding = await getEmbedding(record);
await client.upsert('patient_records', {
points: [{ id, vector: embedding, payload: { text: record } }]
});
}
// Search patient records LOCALLY - HIPAA compliant
async function searchRecords(query, limit = 5) {
const embedding = await getEmbedding(query);
return await client.search('patient_records', {
vector: embedding,
limit
});
}
3. Offline Edge Gateway Processing
Process data streams on edge gateway with no cloud connectivity:
#!/bin/bash
# Factory floor edge gateway - process sensor logs locally
# Process production line data offline
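# Read one log line at a time; -r keeps backslashes intact
while IFS= read -r line; do
  # Redirect stdin so WasmEdge cannot consume the rest of sensor_logs.txt
  wasmedge --dir .:. \
    --nn-preload default:GGML:AUTO:model.gguf \
    embedding.wasm default "$line" < /dev/null >> production_embeddings.jsonl
done < sensor_logs.txt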
# Data stays on factory network - never goes to internet
# Perfect for manufacturing, oil & gas, remote locations
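Once the JSONL file exists, the embeddings usually need to be matched back to the log lines they came from. A minimal Python sketch follows, assuming every input line produced exactly one output line (a failed WasmEdge run would break that alignment); the helper name is hypothetical.

```python
import json


def pair_logs_with_embeddings(log_lines, jsonl_lines):
    """Zip each log line with the embedding written for it, in order."""
    pairs = []
    for text, raw in zip(log_lines, jsonl_lines):
        record = json.loads(raw)
        pairs.append((text.rstrip("\n"), record["embedding"]))
    return pairs


if __name__ == "__main__":
    # File names match the shell script above.
    with open("sensor_logs.txt") as logs, open("production_embeddings.jsonl") as embs:
        for text, embedding in pair_logs_with_embeddings(logs, embs):
            print(len(embedding), text)
```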
Performance Characteristics
Expected Latency
Based on the architecture, each embedding request involves:
# Test direct WASM inference
time wasmedge --dir .:. \
--nn-preload default:GGML:AUTO:model.gguf \
embedding.wasm default "What is the capital of France?"
Typical request timeline:
- Cold start: 2-3 seconds (initial model load)
- Inference: 100-200ms per request
- Model size: 44MB in memory
- Output: 384-dimensional vector
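Those latency figures translate directly into a rough throughput ceiling. The calculation below is a back-of-the-envelope sketch that assumes each worker handles requests one after another and ignores the cold start; the function name is illustrative.

```python
def requests_per_second(latency_ms, workers=1):
    """Steady-state throughput if each worker serves requests sequentially."""
    return workers * 1000.0 / latency_ms


for latency in (100, 200):
    print(f"{latency} ms -> {requests_per_second(latency):.0f} req/s per worker")
```

So the 100-200ms range corresponds to roughly 5-10 requests per second per worker, which is a useful number to have in mind when sizing a deployment.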
Performance Considerations
WebAssembly overhead: The WASM sandbox adds safety at a small performance cost:
- Boundary crossing between WASM and native code
- Memory isolation for security
- Portable binary format trade-offs
Trade-off: Prioritizing portability, security, and minimal dependencies over raw speed. Suitable for edge computing where these factors matter more than peak throughput.
Edge Deployment Strategies
1. Edge Serverless (CDN Edge)
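Deploy to a CDN edge platform such as Cloudflare Workers or Fastly Compute@Edge. Treat the snippet below as a conceptual sketch: the module built in this article imports WASI-NN host functions, so it only runs on a runtime that provides them, and get_embedding is a hypothetical export.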
// Run ML inference at CDN edge - closest to users
export default {
async fetch(request) {
const { text } = await request.json();
const wasm = await WebAssembly.instantiateStreaming(
fetch('/embedding.wasm')
);
// ML inference at edge location (not central cloud)
const embedding = wasm.instance.exports.get_embedding(text);
return new Response(JSON.stringify({ embedding }));
}
}
Edge benefits:
- Sub-50ms latency worldwide
- No cold start (always warm at edge)
- Reduced egress costs
2. IoT Edge Gateway
Deploy the HTTP server on Raspberry Pi or industrial edge hardware:
# On Raspberry Pi / edge gateway
export WASM_PATH="./wasmedge-ggml-llama-embedding.wasm"
export MODEL_PATH="./all-MiniLM-L6-v2-ggml-model-f16.gguf"
export PORT=3000
./target/release/embedding-api-server
Server features:
- CORS enabled - Browser integration ready
- Async Rust - Handles concurrent requests efficiently
- Health checks - Monitor edge device status
- JSON-only API - Easy integration with any language
Perfect for:
- Smart buildings - Local occupancy analytics
- Industrial IoT - Real-time equipment monitoring
- Retail edge - In-store customer insights
- Vehicle computing - Offline navigation assistance
3. Edge Container Deployment
Production-ready container for edge devices:
# The wasm32-wasip1 target requires Rust 1.78 or newer
FROM rust:1.78 AS builder
WORKDIR /build
COPY Cargo.toml Cargo.lock ./
COPY src ./src
# Build both binaries
RUN rustup target add wasm32-wasip1 && \
cargo build --bin wasm-embedding --target wasm32-wasip1 --release \
--features wasm --no-default-features && \
cargo build --bin embedding-api-server --release
FROM wasmedge/slim:latest
WORKDIR /app
COPY --from=builder /build/target/wasm32-wasip1/release/wasm-embedding.wasm ./wasmedge-ggml-llama-embedding.wasm
COPY --from=builder /build/target/release/embedding-api-server .
COPY all-MiniLM-L6-v2-ggml-model-f16.gguf .
ENV WASM_PATH=/app/wasmedge-ggml-llama-embedding.wasm
ENV MODEL_PATH=/app/all-MiniLM-L6-v2-ggml-model-f16.gguf
EXPOSE 3000
CMD ["./embedding-api-server"]
Edge container footprint:
WasmEdge runtime: ~50MB
WASM module: 136KB
Server binary: 1.8MB
Model file: 44MB
Total: ~96MB (Edge-optimized)
Compare to typical Python ML deployments requiring 2-3GB+ for runtime, dependencies, and models.
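The total in the footprint breakdown can be reproduced with a quick sum. This sketch simply adds the figures quoted above, treating 136KB as 0.136MB (a decimal-unit assumption made for the example).

```python
# Component sizes in MB, taken from the footprint breakdown above.
COMPONENTS_MB = {
    "WasmEdge runtime": 50.0,
    "WASM module": 0.136,   # 136KB expressed in MB (decimal units assumed)
    "Server binary": 1.8,
    "Model file": 44.0,
}

total_mb = sum(COMPONENTS_MB.values())
print(f"Total: ~{total_mb:.0f}MB")
```

The runtime and the model account for nearly all of the image; the two binaries built in this article contribute under 2MB.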
Advanced Features
1. Error Handling
The implementation includes robust error detection:
match context.compute() {
Ok(_) => (),
Err(Error::BackendError(BackendError::ContextFull)) => {
println!("\n[INFO] Context full");
// Could implement context rotation here
}
Err(Error::BackendError(BackendError::PromptTooLong)) => {
println!("\n[INFO] Prompt too long");
// Could implement chunking here
}
Err(err) => {
println!("\n[ERROR] {}", err);
}
}
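The "could implement chunking here" note can be handled on the client side as well. The following Python sketch splits long text into overlapping word windows and averages the per-chunk embeddings. It is one possible approach, not the project's: the names are hypothetical, words are only a rough stand-in for tokens, and the default of 128 words is an assumed safety margin for a model with a 256-token limit.

```python
def chunk_words(text, max_words=128, overlap=16):
    """Split text into windows of at most max_words words that overlap by `overlap`."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def mean_vector(vectors):
    """Element-wise average of equal-length vectors (empty input gives [])."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]
```

With a `get_embedding` function such as the one in the IoT example, a long document could then be embedded as `mean_vector([get_embedding(c) for c in chunk_words(text)])`.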
2. Metadata Extraction
Extract model information and token counts:
let metadata = get_metadata_from_context(&context);
println!("[INFO] llama_commit: {}", metadata["llama_commit"]);
println!("[INFO] llama_build_number: {}", metadata["llama_build_number"]);
println!("[INFO] Number of input tokens: {}", metadata["input_tokens"]);
println!("[INFO] Number of output tokens: {}", metadata["output_tokens"]);
3. Performance Tuning
Adjust runtime parameters for your workload:
# High throughput (batch processing)
export ctx_size=2048
export batch_size=512
export threads=8
# Low latency (real-time)
export ctx_size=512
export batch_size=128
export threads=2
# Debug mode
export enable_log=true
Troubleshooting
Issue 1: "unknown option: nn-preload"
Symptom: WasmEdge doesn't recognize --nn-preload
Solution:
# Reinstall with WASI-NN plugin
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasi_nn-ggml
source $HOME/.wasmedge/env
# Verify plugin is installed
wasmedge --version
# Should show: (plugin "wasi_nn") version X.X.X
Issue 2: "Context full" errors
Symptom: Frequent "Context full" errors
Solution:
# Increase context size
export ctx_size=4096
# Or implement context rotation
# (truncate old tokens when context fills)
Issue 3: Slow inference
Symptom: Inference takes noticeably longer than the expected 100-200ms
Check:
# Increase thread count
export threads=8
# Verify CPU isn't throttled
cat /proc/cpuinfo | grep MHz
# Check for memory swapping
free -h
Future Enhancements
1. Batch Processing API
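// Embed several texts; this sketch runs them one after another.
// A true batch API would pass all texts to a single inference call.
fn batch_embed(texts: Vec<String>) -> Vec<Vec<f32>> {
    texts.iter().map(|text| {
        // get_embedding is a placeholder for the single-text inference path
        get_embedding(text)
    }).collect()
}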
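Until a batch endpoint exists, a client can approximate one by sending requests concurrently. The Python sketch below shows that pattern with a thread pool; `batch_embed` here is a hypothetical client-side helper, and `embed_fn` stands for any single-text function such as the `get_embedding` client shown earlier.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_embed(texts, embed_fn, max_workers=4):
    """Embed many texts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `texts` even if calls finish out of order.
        return list(pool.map(embed_fn, texts))
```

Because each request starts its own WasmEdge process on the server, `max_workers` is best kept close to the number of CPU cores available there.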
2. Model Caching
// Cache loaded models for reuse
lazy_static! {
static ref MODEL_CACHE: Mutex<HashMap<String, Graph>> =
Mutex::new(HashMap::new());
}
3. Streaming Output
// Stream embeddings as they're computed
async fn stream_embedding(text: &str) -> impl Stream<Item = f32> {
// Yield each dimension as computed
}
Comparison with Alternatives
WasmEdge vs Python
Python approach:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("Hello world")
WasmEdge approach:
wasmedge --dir .:. --nn-preload default:GGML:AUTO:model.gguf \
embedding.wasm default "Hello world"
| Aspect | WasmEdge | Python |
|---|---|---|
| Startup | 2-3 seconds (cold start, including model load) | 2-3 seconds |
| Memory | 100MB | 2-3GB |
| Binary | 136KB | N/A (interpreted) |
| Portability | One binary | Python + deps |
| Sandboxing | Native | None |
WasmEdge vs ONNX Runtime
| Feature | WasmEdge | ONNX |
|---|---|---|
| Setup | One command | Multi-step |
| Model format | GGUF | ONNX |
| Performance | Good | Better |
| Portability | Excellent | Good |
| Sandboxing | Yes | No |
Real-World Production Insights
After building this production embedding API, here are the key learnings:
What Works Exceptionally Well
1. Dual-Binary Architecture
- WASM module: Perfect for batch processing and edge devices
- HTTP server: Ideal for cloud deployments and microservices
- Flexibility to choose deployment strategy per environment
2. GGML Quantization
- 44MB at f16, about half the size of the same weights at f32
- Minimal accuracy loss for embeddings
- Fast inference without GPU requirements
3. Async Rust Server
- Tokio handles concurrent requests efficiently
- Scales with available CPU cores
- Low memory overhead compared to Python frameworks
Production Deployment Strategies
Strategy 1: Kubernetes Microservice
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-api
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: embedding-api
  template:
    metadata:
      labels:
        app: embedding-api
    spec:
      containers:
        - name: embeddings
          image: wasmedge-embedding:latest
          env:
            - name: WASM_PATH
              value: "/app/wasmedge-ggml-llama-embedding.wasm"
            - name: MODEL_PATH
              value: "/app/all-MiniLM-L6-v2-ggml-model-f16.gguf"
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"
            limits:
              memory: "512Mi"
              cpu: "1000m"
Strategy 2: Serverless (AWS Lambda / Cloud Run)
- Cold start: 2-3 seconds (acceptable for many use cases)
- Memory: 512MB allocation recommended
- Timeout: Set to 30 seconds for safety
- Cost-effective compared to GPU-based alternatives
Strategy 3: Edge Devices (Raspberry Pi / IoT Gateways)
- Runs on devices with 512MB+ RAM
- Inference latency: 100-200ms
- Perfect for local semantic search
- Zero cloud egress costs
Performance Optimization Tips
1. Model Caching
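# Each request starts a new WasmEdge process, so the model file is read every time.
# One warm-up inference at boot keeps the file in the OS page cache.
export WASMEDGE_PLUGIN_PATH=/usr/local/lib/wasmedge
wasmedge --dir .:. --nn-preload default:GGML:AUTO:model.gguf \
embedding.wasm default "warm-up"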
2. Concurrent Processing
- Use tokio::spawn for parallel requests
- Configure thread pool based on available resources
- Each request spawns independent WasmEdge process
3. Connection Optimization
- Reuse HTTP connections
- Enable keep-alive for repeated requests
- Consider batching embeddings for bulk operations
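Connection reuse is straightforward with Python's standard library. The class below is a minimal sketch, assuming the server keeps HTTP/1.1 connections open between requests; `EmbeddingClient` is a hypothetical name, and the host, port, and /embed path follow the examples in this article.

```python
import http.client
import json


class EmbeddingClient:
    """Sends every request over one persistent connection (hypothetical helper)."""

    def __init__(self, host="localhost", port=3000, timeout=30):
        # No socket is opened yet; http.client connects on the first request.
        self.conn = http.client.HTTPConnection(host, port, timeout=timeout)

    def embed(self, text):
        body = json.dumps({"text": text})
        self.conn.request("POST", "/embed", body=body,
                          headers={"Content-Type": "application/json"})
        response = self.conn.getresponse()
        payload = response.read()  # read fully so the connection can be reused
        if response.status != 200:
            raise RuntimeError(f"embed failed with HTTP {response.status}")
        return json.loads(payload)["embedding"]

    def close(self):
        self.conn.close()
```

Reading the whole response body before the next call matters: `http.client` refuses to send a new request on a connection that still has unread data.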
When to Use This vs. Alternatives
Use WasmEdge + GGML when:
- Cross-platform deployment required
- Resource constraints (memory, disk)
- No GPU available
- Edge/IoT deployment
- Privacy-first requirements (local processing)
- Serverless/cold start optimization needed
Use alternatives when:
- GPU acceleration available and ultra-low latency required
- Training required (not just inference)
- Very high throughput batch processing needed
- Working with very large unquantized models
The Future: Where This Technology Shines
Immediate Applications:
- Semantic Search as a Service - Index millions of documents affordably
- Edge RAG Systems - Run retrieval locally, LLM in cloud
- Privacy-Compliant ML - Process medical/financial data on-premise
- IoT Intelligence - Smart devices with local understanding
- Cost Optimization - Replace expensive GPU APIs
Emerging Opportunities:
- WebAssembly Component Model for language interop
- WASI-NN GPU support for hybrid acceleration
- Larger models (1B+ params) with better quantization
- Multi-modal embeddings (text + image)
- Fine-tuning at the edge
The lightweight binary advantage: This project demonstrates that production ML doesn't require massive infrastructure. WebAssembly + GGML makes AI inference accessible with minimal dependencies and true cross-platform portability.
As WASI-NN matures and more models get quantized to GGUF, WebAssembly is becoming increasingly viable for edge ML deployments where portability, security, and minimal footprint are priorities.
Try It Yourself
Get started in 5 minutes:
# 1. Install WasmEdge
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasi_nn-ggml
source $HOME/.wasmedge/env
# 2. Clone and setup
git clone https://github.com/porameht/wasmedge-ggml-llama-embedding
cd wasmedge-ggml-llama-embedding
rustup target add wasm32-wasip1
# 3. Download model
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf
# 4. Build
./build-wasm.sh
cargo build --bin embedding-api-server --release
# 5. Run HTTP server
./target/release/embedding-api-server
Test the API:
# Generate embedding
curl -X POST http://localhost:3000/embed \
-H "Content-Type: application/json" \
-d '{"text":"What is the capital of France?"}'
# Health check
curl http://localhost:3000/health
What will you build with portable ML?
