Making Inference Requests to AI Models
This guide shows you how to make inference requests to AI models deployed on the Formation network. Formation provides OpenAI-compatible APIs, making it easy to integrate with existing applications and tools.
Overview
Formation AI models serve inference requests through OpenAI-compatible endpoints, allowing you to:
- Use Existing Tools: Reuse OpenAI client libraries and applications without modification (see the sketch after this list)
- Switch Seamlessly: Drop-in replacement for OpenAI API calls
- Access Specialized Models: Use custom models not available elsewhere
- Pay Per Use: Only pay for the inference requests you make
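Because the API is OpenAI-compatible, you can often point an existing OpenAI client at Formation directly. Below is a minimal sketch using the official `openai` Python package; the `base_url` targets the OpenAI-compatible `/v1/chat/completions` endpoint described later in this guide, and the header values are placeholders for the ECDSA credentials covered in the authentication section.

```python
# A minimal sketch, not an official integration: the openai client lets you
# override the base URL and inject custom headers. The header values below
# are placeholders for real ECDSA credentials (see the authentication section).
from openai import OpenAI

client = OpenAI(
    base_url="https://formation.ai/v1",
    api_key="unused",  # the client requires a key; authentication happens via headers
    default_headers={
        "X-Formation-Address": "0x1234567890abcdef...",
        "X-Formation-Signature": "0xabcdef...",
        "X-Formation-Message": "Formation authentication request",
    },
)

response = client.chat.completions.create(
    model="llama2-7b-chat",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```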
Formation vs OpenAI API
| Feature | Formation | OpenAI |
|---|---|---|
| API Format | OpenAI-compatible | Native OpenAI |
| Authentication | ECDSA signatures | API keys |
| Model Selection | Custom + standard models | OpenAI models only |
| Billing | Credit-based, transparent | Usage-based |
| Availability | Decentralized network | Centralized service |
OpenAI API Compatibility Overview
Formation models implement the OpenAI API v1 specification, ensuring compatibility with existing tools and libraries.
Supported Endpoints
Core Inference Endpoints
```
POST /v1/models/{model_id}/inference   # Formation-specific endpoint
POST /v1/chat/completions              # OpenAI-compatible chat
POST /v1/completions                   # OpenAI-compatible text completion
GET  /v1/models                        # List available models
```
Optional Endpoints (model-dependent)
```
POST /v1/embeddings            # Text embeddings
POST /v1/images/generations    # Image generation
POST /v1/audio/transcriptions  # Speech-to-text
```
Request/Response Format
Formation models accept and return data in the exact same format as OpenAI's API:
{ "model": "formation-model-id", "messages": [ {"role": "user", "content": "Hello, how are you?"} ], "max_tokens": 1000, "temperature": 0.7 }
Formation Endpoint Configuration
1. Base URL Structure
Formation uses a different base URL structure than OpenAI:
```
OpenAI:    https://api.openai.com/v1/chat/completions
Formation: https://formation.ai/v1/models/{model_id}/inference
```
2. Model-Specific Endpoints
Each model on Formation has its own inference endpoint:
```
# General format
https://formation.ai/v1/models/{model_id}/inference

# Examples
https://formation.ai/v1/models/llama2-7b-chat/inference
https://formation.ai/v1/models/stable-diffusion-xl/inference
https://formation.ai/v1/models/whisper-large-v3/inference
```
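If you script against several models, a tiny helper keeps the endpoint construction in one place. This is purely a convenience sketch, not part of the Formation API:

```python
# Convenience helper (illustrative, not part of the Formation API)
BASE_URL = "https://formation.ai/v1"

def inference_url(model_id: str) -> str:
    """Build the inference endpoint URL for a given model."""
    return f"{BASE_URL}/models/{model_id}/inference"

print(inference_url("llama2-7b-chat"))
# -> https://formation.ai/v1/models/llama2-7b-chat/inference
```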
3. Discovering Available Models
List all available models:
```bash
curl -X GET "https://formation.ai/v1/models" \
  -H "X-Formation-Address: 0x1234567890abcdef..." \
  -H "X-Formation-Signature: 0xabcdef..." \
  -H "X-Formation-Message: Formation authentication request"
```
Response:
{ "success": true, "models": [ { "id": "llama2-7b-chat", "name": "Llama 2 7B Chat", "description": "Meta's Llama 2 7B parameter chat model", "type": "text_generation", "owner_id": "0x9876543210fedcba...", "is_private": false, "pricing": { "model": "per_token", "input_rate": 0.5, "output_rate": 1.0, "currency": "credits_per_1k_tokens" }, "capabilities": ["chat", "text_generation"], "max_tokens": 4096, "context_length": 4096 } ], "total": 1 }
Authentication with ECDSA Signatures
Formation uses ECDSA signature authentication instead of API keys for enhanced security.
1. Understanding ECDSA Authentication
- More Secure: No API keys to leak or manage
- Wallet Compatible: Works with Ethereum wallets
- Cryptographically Secure: Uses the same security as blockchain transactions
- Decentralized: No central authority manages keys
2. Required Headers
Every request must include these headers:
```
X-Formation-Address: 0x1234567890abcdef1234567890abcdef12345678
X-Formation-Signature: 0xabcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890ab
X-Formation-Message: Formation authentication request
Content-Type: application/json
```
3. Generating Signatures
Python Example
```python
import os

from eth_account import Account
from eth_account.messages import encode_defunct

# Your private key (keep this secure! Prefer an environment variable.)
private_key = os.environ.get(
    "FORMATION_PRIVATE_KEY",
    "0x1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef",
)

# Create account from private key
account = Account.from_key(private_key)

# Sign the authentication message using standard EIP-191 personal-message
# signing (the wallet-compatible scheme; confirm this matches what your
# Formation deployment expects)
message = "Formation authentication request"
signed = account.sign_message(encode_defunct(text=message))

# Headers for API requests
headers = {
    "X-Formation-Address": account.address,
    "X-Formation-Signature": signed.signature.hex(),
    "X-Formation-Message": message,
    "Content-Type": "application/json",
}
```
JavaScript Example
```javascript
const { ethers } = require('ethers');

// Your private key (keep this secure! Prefer an environment variable.)
const privateKey = process.env.FORMATION_PRIVATE_KEY ||
  '0x1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef';

// Create wallet
const wallet = new ethers.Wallet(privateKey);

async function buildHeaders() {
  // Sign the authentication message using EIP-191 personal-message signing,
  // matching the Python example above (note: signMessage is asynchronous)
  const message = 'Formation authentication request';
  const signature = await wallet.signMessage(message);

  // Headers for API requests
  return {
    'X-Formation-Address': wallet.address,
    'X-Formation-Signature': signature,
    'X-Formation-Message': message,
    'Content-Type': 'application/json',
  };
}
```
4. Security Best Practices
- Protect Private Keys: Never share or expose your private keys
- Use Environment Variables: Store keys securely in environment variables (see the sketch after this list)
- Rotate Keys: Periodically generate new key pairs
- Monitor Usage: Track your API usage and costs
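A minimal sketch of the environment-variable approach; the FORMATION_PRIVATE_KEY name is illustrative, not a convention Formation defines:

```python
import os

from eth_account import Account

# Fail fast if the key is missing instead of falling back to a hardcoded value
private_key = os.environ.get("FORMATION_PRIVATE_KEY")
if private_key is None:
    raise RuntimeError("Set FORMATION_PRIVATE_KEY before running this script")

account = Account.from_key(private_key)
print(f"Authenticating as {account.address}")
```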
Making Your First Inference Request
1. Simple Text Generation
```bash
curl -X POST "https://formation.ai/v1/models/llama2-7b-chat/inference" \
  -H "X-Formation-Address: 0x1234567890abcdef..." \
  -H "X-Formation-Signature: 0xabcdef..." \
  -H "X-Formation-Message: Formation authentication request" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    "max_tokens": 500,
    "temperature": 0.7
  }'
```
Response:
{ "id": "chatcmpl-abc123", "object": "chat.completion", "created": 1640995200, "model": "llama2-7b-chat", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "Quantum computing is like having a super-powered computer that can solve certain problems much faster than regular computers. Instead of using regular bits that are either 0 or 1, quantum computers use quantum bits (qubits) that can be both 0 and 1 at the same time..." }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 12, "completion_tokens": 156, "total_tokens": 168 } }
2. Chat Conversation
```python
import requests

# Authentication headers (generated as shown above)
headers = {
    "X-Formation-Address": "0x1234567890abcdef...",
    "X-Formation-Signature": "0xabcdef...",
    "X-Formation-Message": "Formation authentication request",
    "Content-Type": "application/json"
}

# Start a conversation
conversation = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What's the weather like today?"}
]

response = requests.post(
    "https://formation.ai/v1/models/llama2-7b-chat/inference",
    headers=headers,
    json={
        "messages": conversation,
        "max_tokens": 200,
        "temperature": 0.7
    }
)

result = response.json()
assistant_message = result["choices"][0]["message"]["content"]

# Continue the conversation
conversation.append({"role": "assistant", "content": assistant_message})
conversation.append({"role": "user", "content": "What about tomorrow?"})

# Make another request
response = requests.post(
    "https://formation.ai/v1/models/llama2-7b-chat/inference",
    headers=headers,
    json={
        "messages": conversation,
        "max_tokens": 200,
        "temperature": 0.7
    }
)
print(response.json()["choices"][0]["message"]["content"])
```
3. Streaming Responses
For real-time response streaming:
```python
import json

import requests

def stream_inference(model_id, messages, headers):
    """Stream inference responses in real-time"""
    response = requests.post(
        f"https://formation.ai/v1/models/{model_id}/inference",
        headers={**headers, "Accept": "text/event-stream"},
        json={
            "messages": messages,
            "max_tokens": 500,
            "temperature": 0.7,
            "stream": True
        },
        stream=True
    )

    for line in response.iter_lines():
        if line:
            line = line.decode('utf-8')
            if line.startswith('data: '):
                data = line[6:]  # Remove 'data: ' prefix
                if data == '[DONE]':
                    break
                try:
                    chunk = json.loads(data)
                    if 'choices' in chunk and len(chunk['choices']) > 0:
                        delta = chunk['choices'][0].get('delta', {})
                        content = delta.get('content', '')
                        if content:
                            print(content, end='', flush=True)
                except json.JSONDecodeError:
                    continue

# Usage
messages = [{"role": "user", "content": "Write a short story about a robot"}]
stream_inference("llama2-7b-chat", messages, headers)
```
Rate Limiting and Billing Considerations
1. Understanding Formation's Billing
Formation uses a credit-based billing system:
- Credits: Internal currency for paying for inference
- Token-Based: Most models charge per input/output token
- Transparent Pricing: See exact costs before making requests
- No Surprise Bills: Credits are pre-purchased
2. Rate Limits
Formation implements rate limiting to ensure fair usage:
Default Rate Limits
- Free Tier: 100 requests/hour, 10,000 tokens/hour
- Pro Tier: 1,000 requests/hour, 100,000 tokens/hour
- Pro Plus Tier: 5,000 requests/hour, 500,000 tokens/hour
- Power Tier: 10,000 requests/hour, 1,000,000 tokens/hour
Rate Limit Headers
Responses include rate limit information:
```
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1640995800
X-RateLimit-Type: requests_per_hour
```
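You can read these headers to pace a client. A minimal sketch, assuming X-RateLimit-Reset is a Unix timestamp (consistent with the example value above):

```python
import time

import requests

def wait_if_rate_limited(response: requests.Response) -> None:
    """Sleep until the window resets when no requests remain in the quota."""
    remaining = int(response.headers.get("X-RateLimit-Remaining", "1"))
    reset_at = int(response.headers.get("X-RateLimit-Reset", "0"))
    if remaining == 0 and reset_at:
        wait_seconds = max(0, reset_at - int(time.time()))
        print(f"Rate limit reached; sleeping {wait_seconds}s")
        time.sleep(wait_seconds)
```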
3. Cost Estimation
Before making requests, estimate costs:
```python
import requests

def estimate_inference_cost(model_id, input_text, max_tokens=1000):
    """Estimate the cost of an inference request"""
    # Get model pricing information
    response = requests.get(
        f"https://formation.ai/v1/models/{model_id}",
        headers=headers
    )
    model_info = response.json()["model"]
    pricing = model_info["pricing"]

    if pricing["model"] == "per_token":
        # Estimate input tokens (rough approximation: 1 token ≈ 4 characters)
        estimated_input_tokens = len(input_text) // 4
        estimated_output_tokens = max_tokens

        input_cost = (estimated_input_tokens / 1000) * pricing["input_rate"]
        output_cost = (estimated_output_tokens / 1000) * pricing["output_rate"]
        total_cost = input_cost + output_cost

        return {
            "estimated_input_tokens": estimated_input_tokens,
            "estimated_output_tokens": estimated_output_tokens,
            "estimated_cost_credits": total_cost,
            "pricing_model": "per_token"
        }
    elif pricing["model"] == "per_request":
        return {
            "estimated_cost_credits": pricing["base_cost"],
            "pricing_model": "per_request"
        }

# Example usage
cost_estimate = estimate_inference_cost(
    "llama2-7b-chat",
    "Explain the theory of relativity",
    max_tokens=500
)
print(f"Estimated cost: {cost_estimate['estimated_cost_credits']} credits")
```
4. Monitoring Usage and Costs
Track your usage to avoid unexpected charges:
```python
import requests

def get_usage_summary(time_period="today"):
    """Get usage summary for a time period"""
    response = requests.get(
        f"https://formation.ai/v1/account/usage?period={time_period}",
        headers=headers
    )
    usage_data = response.json()

    return {
        "total_requests": usage_data["total_requests"],
        "total_tokens": usage_data["total_tokens"],
        "total_cost_credits": usage_data["total_cost_credits"],
        "models_used": usage_data["models_used"],
        "period": time_period
    }

# Check today's usage
today_usage = get_usage_summary("today")
print(f"Today's usage: {today_usage['total_requests']} requests, "
      f"{today_usage['total_cost_credits']} credits")

# Check this month's usage
month_usage = get_usage_summary("this_month")
print(f"This month: {month_usage['total_requests']} requests, "
      f"{month_usage['total_cost_credits']} credits")
```
5. Budget Management
Set up budget alerts and limits:
```python
import requests

def set_budget_limit(monthly_limit_credits):
    """Set a monthly budget limit"""
    response = requests.post(
        "https://formation.ai/v1/account/budget",
        headers=headers,
        json={
            "monthly_limit_credits": monthly_limit_credits,
            "alert_thresholds": [50, 75, 90],  # Alert at 50%, 75%, 90% of budget
            "hard_limit": True  # Stop requests when budget is exceeded
        }
    )
    return response.json()

# Set a $100 equivalent budget (assuming 1 credit = $0.01)
budget_result = set_budget_limit(10000)
print(f"Budget set: {budget_result}")
```
Working with Different Model Types
1. Text Generation Models
Chat Models
Best for conversational AI and instruction-following:
```python
import requests

def chat_with_model(model_id, system_prompt, user_message):
    """Chat with a conversational model"""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]

    response = requests.post(
        f"https://formation.ai/v1/models/{model_id}/inference",
        headers=headers,
        json={
            "messages": messages,
            "max_tokens": 1000,
            "temperature": 0.7
        }
    )

    return response.json()["choices"][0]["message"]["content"]

# Example usage
response = chat_with_model(
    "llama2-7b-chat",
    "You are a helpful coding assistant.",
    "Write a Python function to calculate fibonacci numbers"
)
```
Completion Models
Best for text completion and generation:
```python
import requests

def complete_text(model_id, prompt, max_tokens=500):
    """Complete text using a completion model"""
    response = requests.post(
        f"https://formation.ai/v1/models/{model_id}/inference",
        headers=headers,
        json={
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.8,
            "top_p": 0.9
        }
    )
    return response.json()["choices"][0]["text"]

# Example usage
completed_text = complete_text(
    "gpt-3.5-turbo-instruct",
    "The future of artificial intelligence is"
)
```
2. Image Generation Models
```python
import requests

def generate_image(model_id, prompt, size="1024x1024"):
    """Generate an image from a text prompt"""
    response = requests.post(
        f"https://formation.ai/v1/models/{model_id}/inference",
        headers=headers,
        json={
            "prompt": prompt,
            "size": size,
            "n": 1,
            "response_format": "url"
        }
    )
    return response.json()["data"][0]["url"]

# Example usage
image_url = generate_image(
    "stable-diffusion-xl",
    "A beautiful sunset over a mountain lake, photorealistic"
)
print(f"Generated image: {image_url}")
```
3. Embedding Models
```python
import requests

def get_embeddings(model_id, texts):
    """Get embeddings for text inputs"""
    response = requests.post(
        f"https://formation.ai/v1/models/{model_id}/inference",
        headers=headers,
        json={
            "input": texts,
            "encoding_format": "float"
        }
    )
    return [item["embedding"] for item in response.json()["data"]]

# Example usage
embeddings = get_embeddings(
    "text-embedding-ada-002",
    ["Hello world", "How are you?", "Goodbye"]
)
```
Error Handling and Troubleshooting
1. Common Error Codes
Authentication Errors
{ "error": { "code": "AUTHENTICATION_FAILED", "message": "Invalid signature or address", "type": "authentication_error" } }
Solution: Verify your signature generation and ensure the X-Formation-Message header matches the exact text you signed.
Model Not Found
{ "error": { "code": "MODEL_NOT_FOUND", "message": "Model 'invalid-model-id' not found", "type": "invalid_request_error" } }
Solution: Check the model ID and ensure it exists and is accessible.
Rate Limit Exceeded
{ "error": { "code": "RATE_LIMIT_EXCEEDED", "message": "Rate limit exceeded. Try again in 3600 seconds.", "type": "rate_limit_error", "retry_after": 3600 } }
Solution: Wait for the specified time or upgrade your subscription tier.
Insufficient Credits
{ "error": { "code": "INSUFFICIENT_CREDITS", "message": "Insufficient credits for this request. Required: 50, Available: 25", "type": "billing_error", "required_credits": 50, "available_credits": 25 } }
Solution: Add credits to your account or upgrade your subscription.
2. Implementing Retry Logic
```python
import random
import time
from typing import Any, Dict

import requests

def make_inference_request_with_retry(
    model_id: str,
    payload: Dict[str, Any],
    max_retries: int = 3,
    base_delay: float = 1.0
) -> Dict[str, Any]:
    """Make inference request with exponential backoff retry logic"""
    for attempt in range(max_retries + 1):
        try:
            response = requests.post(
                f"https://formation.ai/v1/models/{model_id}/inference",
                headers=headers,
                json=payload,
                timeout=30
            )

            if response.status_code == 200:
                return response.json()
            elif response.status_code == 429:  # Rate limited
                error_data = response.json()
                retry_after = error_data.get("retry_after", base_delay * (2 ** attempt))
                if attempt < max_retries:
                    print(f"Rate limited. Retrying in {retry_after} seconds...")
                    time.sleep(retry_after)
                    continue
                else:
                    raise Exception(f"Rate limit exceeded after {max_retries} retries")
            elif response.status_code in [500, 502, 503, 504]:  # Server errors
                if attempt < max_retries:
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                    print(f"Server error. Retrying in {delay:.2f} seconds...")
                    time.sleep(delay)
                    continue
                else:
                    raise Exception(f"Server error after {max_retries} retries")
            else:
                # Non-retryable error
                response.raise_for_status()

        except requests.exceptions.RequestException as e:
            if attempt < max_retries:
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                print(f"Request failed: {e}. Retrying in {delay:.2f} seconds...")
                time.sleep(delay)
                continue
            else:
                raise

    raise Exception("All retry attempts failed")

# Example usage
try:
    result = make_inference_request_with_retry(
        "llama2-7b-chat",
        {
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 100
        }
    )
    print(result["choices"][0]["message"]["content"])
except Exception as e:
    print(f"Request failed: {e}")
```
3. Debugging Connection Issues
```python
import requests

def debug_connection(model_id):
    """Debug connection issues with the Formation API"""
    print("🔍 Debugging Formation API connection...")

    # Test 1: Check if Formation API is reachable
    try:
        response = requests.get("https://formation.ai/health", timeout=10)
        print(f"✅ Formation API reachable: {response.status_code}")
    except Exception as e:
        print(f"❌ Formation API unreachable: {e}")
        return

    # Test 2: Check authentication
    try:
        response = requests.get(
            "https://formation.ai/v1/models",
            headers=headers,
            timeout=10
        )
        if response.status_code == 200:
            print("✅ Authentication successful")
        else:
            print(f"❌ Authentication failed: {response.status_code}")
            print(f"Response: {response.text}")
    except Exception as e:
        print(f"❌ Authentication test failed: {e}")

    # Test 3: Check if model exists
    try:
        response = requests.get(
            f"https://formation.ai/v1/models/{model_id}",
            headers=headers,
            timeout=10
        )
        if response.status_code == 200:
            print(f"✅ Model {model_id} exists and is accessible")
        else:
            print(f"❌ Model {model_id} not accessible: {response.status_code}")
    except Exception as e:
        print(f"❌ Model check failed: {e}")

    # Test 4: Simple inference test
    try:
        response = requests.post(
            f"https://formation.ai/v1/models/{model_id}/inference",
            headers=headers,
            json={
                "messages": [{"role": "user", "content": "test"}],
                "max_tokens": 10
            },
            timeout=30
        )
        if response.status_code == 200:
            print("✅ Inference test successful")
        else:
            print(f"❌ Inference test failed: {response.status_code}")
            print(f"Response: {response.text}")
    except Exception as e:
        print(f"❌ Inference test failed: {e}")

# Run diagnostics
debug_connection("llama2-7b-chat")
```
Best Practices
1. Efficient API Usage
- Batch Requests: Group multiple requests when possible
- Optimize Prompts: Use clear, concise prompts to reduce token usage
- Set Appropriate Limits: Use `max_tokens` to control response length
- Cache Results: Store frequently used responses to avoid repeated requests (see the sketch after this list)
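Caching can be as simple as keying responses on the full request payload. A minimal in-memory sketch (a production setup would likely add persistence and expiry); `headers` is the authentication dictionary from earlier:

```python
import hashlib
import json

import requests

_cache = {}  # request-hash -> response JSON

def cached_inference(model_id, payload):
    """Return a cached response when an identical request was made before."""
    key = hashlib.sha256(
        json.dumps({"model": model_id, **payload}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        response = requests.post(
            f"https://formation.ai/v1/models/{model_id}/inference",
            headers=headers,
            json=payload,
            timeout=30,
        )
        response.raise_for_status()
        _cache[key] = response.json()
    return _cache[key]
```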
2. Cost Optimization
- Monitor Usage: Regularly check your usage and costs
- Choose Right Models: Use smaller models for simpler tasks
- Optimize Parameters: Adjust temperature and other parameters for efficiency
- Set Budgets: Use budget limits to prevent unexpected charges
3. Security
- Protect Private Keys: Never expose private keys in code or logs
- Use Environment Variables: Store credentials securely
- Rotate Keys: Periodically generate new key pairs
- Monitor Access: Track API usage for unusual patterns
4. Performance
- Handle Timeouts: Set appropriate timeout values
- Implement Retry Logic: Handle transient failures gracefully
- Use Streaming: For long responses, use streaming to improve user experience
- Parallel Requests: Make concurrent requests when appropriate (see the sketch after this list)
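Since inference calls are I/O-bound, a small thread pool is usually enough for independent prompts. A sketch reusing make_inference_request_with_retry from the error-handling section; keep the pool small to stay within your tier's rate limits:

```python
from concurrent.futures import ThreadPoolExecutor

prompts = [
    "Summarize the theory of relativity",
    "Explain photosynthesis",
    "Describe how DNS resolution works",
]

def run_prompt(prompt):
    """Send one prompt through the retry helper and return the reply text."""
    result = make_inference_request_with_retry(
        "llama2-7b-chat",
        {"messages": [{"role": "user", "content": prompt}], "max_tokens": 200},
    )
    return result["choices"][0]["message"]["content"]

# A small pool keeps concurrency within rate limits
with ThreadPoolExecutor(max_workers=3) as pool:
    for prompt, reply in zip(prompts, pool.map(run_prompt, prompts)):
        print(f"{prompt} -> {reply[:60]}...")
```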
Next Steps
- API Reference - Complete API documentation and schemas
- Code Examples - Working examples in multiple programming languages
- Agent Usage Guide - Learn about using AI agents
Ready to start making inference requests? Begin with simple text generation and gradually explore more advanced features! 🚀