Edge AI Inference
The execution of AI model inference directly on end-user devices or local hardware, rather than sending data to centralised cloud servers. Edge AI inference eliminates network round-trip latency, reduces inference costs, and ensures data sovereignty by keeping prompts and outputs on the device.
Benchmark
Cloud inference averages $0.002–$0.06 per 1K tokens; edge inference reduces this to near-zero marginal cost after device deployment — Loc.ai customers see up to 95% lower inference spend.
On-Device LLM
A large language model that runs entirely on the user's hardware (laptop, phone, workstation, or edge server) without any required cloud connection. On-device LLMs use quantised weights and optimised runtimes such as ONNX, GGUF, or WebGPU to fit within consumer-grade memory budgets.
Benchmark
Quantised 7B–8B parameter models now run at 20–60 tokens/second on a typical Apple Silicon laptop, matching cloud-hosted GPT-3.5-class quality for most product workloads.
Sovereign AI Infrastructure
AI compute infrastructure where data processing occurs entirely within the user's device or an organisation's controlled environment, with no data transmitted to third-party cloud providers. Sovereign AI is a requirement for UK GDPR, EU AI Act, and sector regulations in healthcare, legal, and finance.
Benchmark
Over 60% of enterprise AI buyers in regulated industries cite data sovereignty as a top-three blocker to cloud-LLM adoption (Gartner, 2025).
Inference Cost Per Token
The marginal monetary cost of generating or processing a single token through an AI model. For cloud APIs this is a per-call charge to the vendor; for edge inference it is the amortised hardware and energy cost of the local computation.
Benchmark
Cloud frontier models charge $0.50–$60 per million tokens; edge inference on existing user hardware has a marginal cost approaching zero after the model is downloaded.
Token Throughput
The number of tokens an AI model can process or generate per second on a given hardware target. Throughput determines perceived latency and how many concurrent users a single device can serve.
Benchmark
A Llama-3 8B model quantised to Q4 reaches ~45 tokens/second on an M2 MacBook Air and ~120 tokens/second on an M3 Max, comfortably above the ~10 tokens/second perceived as 'real-time' by users.
Model Quantisation
The process of reducing the numerical precision of a model's weights (for example from 16-bit floats to 4-bit integers) so the model occupies less memory and runs faster on commodity hardware, with only a small quality trade-off.
Benchmark
4-bit quantisation typically shrinks model size by 4× and increases inference speed by 2–3× with under 2% degradation on standard benchmarks (MMLU, HellaSwag).
Federated Inference
An architectural pattern where inference is distributed across many end-user devices rather than centralised in a vendor's data centre. Each device handles its own user's requests locally, and only aggregate, anonymised telemetry is shared.
Benchmark
Federated inference scales compute linearly with users at zero incremental vendor cost — a 10× user base requires 0× additional cloud GPU spend.
WebGPU Inference
Running AI model inference inside a browser using the WebGPU API, which exposes the user's GPU directly to JavaScript and WebAssembly. WebGPU inference allows SaaS products to deploy on-device AI with no install step.
Benchmark
WebGPU delivers 5–10× the performance of WebGL for tensor workloads and is supported in 80%+ of installed desktop browsers as of 2026.
ONNX Runtime
A cross-platform inference engine for models exported to the Open Neural Network Exchange (ONNX) format. ONNX Runtime provides accelerated execution on CPU, GPU, NPU and mobile hardware from a single model artefact.
Benchmark
ONNX Runtime ships in Windows 11, Office, and the Edge browser, giving on-device inference a path to over 1 billion installed endpoints.
Edge-Cloud Hybrid Inference
A routing pattern where most requests are served by an on-device model and a small fraction — typically long-context, multimodal, or highly specialised queries — are transparently forwarded to a cloud model as fallback.
Benchmark
Loc.ai's hybrid routing typically keeps 90–98% of requests on-device, cutting cloud API spend by an order of magnitude while preserving full capability for edge cases.
Data Sovereignty
The principle that data is subject only to the laws and governance structures of the jurisdiction in which it is collected and processed. For AI, this typically means data must never leave the user's device or the organisation's controlled environment.
Benchmark
The EU AI Act, UK Data Protection Act, and US state privacy laws (CCPA, CPRA) all impose stricter requirements on data leaving user-controlled environments — on-device inference removes this category of risk entirely.
Private AI
Any AI deployment in which prompts, completions, and intermediate state are never visible to a third-party model provider. Private AI is typically implemented via on-device inference, self-hosted models, or trusted execution environments.
Benchmark
Loc.ai's SafeChat keeps 100% of prompts on the user's device — including in regulated workflows such as legal review, clinical notes, and confidential M&A.
OpenAI-Compatible API
A local or self-hosted HTTP endpoint that implements the OpenAI REST schema (chat/completions, embeddings, models). Existing OpenAI SDK code can be pointed at the endpoint with only a base-URL change, so apps can swap in on-device or on-prem inference with no code rewrite.
Benchmark
Loc.ai:Control exposes an OpenAI-compatible endpoint on localhost — typical migration from cloud OpenAI to local inference is a one-line base URL change.
Air-Gapped LLM Deployment
An LLM deployment that runs on hardware with no outbound network access — model artefacts are loaded from an internal registry and no telemetry, prompts, or completions ever leave the network boundary. Standard requirement in regulated finance, defence, and segregated trading environments.
Benchmark
Air-gapped Loc.ai deployments require zero outbound calls at runtime, making them auditable against FCA SYSC 8, ISO 27001 A.13, and similar control sets.
Shadow AI
Unsanctioned use of public AI tools (ChatGPT, Copilot, Claude, etc.) by employees for work tasks, typically pasting confidential data into third-party systems outside IT's visibility or DPA coverage.
Benchmark
Surveys consistently find 50–70% of knowledge workers have used shadow AI tools with company data — providing a sanctioned, on-device alternative such as SafeChat is the most effective mitigation.