Deploying AI Models on Formation

This guide walks you through the complete process of deploying your AI model to the Formation network, from containerization to production deployment.

Deployment Overview

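Formation model deployment follows a structured pipeline: containerize the model, specify its resource requirements, register it with form-state, and test its inference endpoints before deploying.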

Deployment Components

Docker Container: Packages your model and dependencies
OpenAI-Compatible API: Provides standard inference endpoints
Formfile: Defines deployment configuration and resources
form-state: Manages model registration and metadata
form-vmm: Handles VM instance deployment and lifecycle
formnet: Provides secure networking between instances

Prerequisites

Before deploying, ensure you have:

✅ Trained Model: Working AI model ready for inference
✅ API Wrapper: OpenAI-compatible HTTP endpoints (see Model Requirements)
✅ Formation Access: Active node on Formation network
✅ ECDSA Keys: For signing deployment requests
✅ Docker: For containerization and testing

Step 1: Containerization Best Practices

1.1 Container Architecture

Formation models should follow a layered container architecture:

┌─────────────────────────────────────────┐
│            Formation Model              │
├─────────────────────────────────────────┤
│  HTTP API Layer (Port 8080)             │
│  ├── /v1/chat/completions               │
│  ├── /v1/completions                    │
│  ├── /v1/models                         │
│  └── /health                            │
├─────────────────────────────────────────┤
│  Model Inference Engine                 │
│  ├── Model Loading                      │
│  ├── Request Processing                 │
│  ├── Response Generation                │
│  └── Usage Tracking                     │
├─────────────────────────────────────────┤
│  Base Container Environment             │
│  ├── Python/Runtime                     │
│  ├── ML Framework (PyTorch/TF)          │
│  ├── System Dependencies                │
│  └── Formation Integration              │
└─────────────────────────────────────────┘
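
The HTTP API layer in the diagram has to return OpenAI-style JSON bodies. The sketch below shows one way to build the bodies for /v1/chat/completions and /v1/models, including the usage counts produced by the usage-tracking box. The helper names (count_tokens, build_chat_completion, build_models_list) are illustrative and not part of Formation, and the whitespace token count is a stand-in for your model's real tokenizer.

```python
import time
import uuid


def count_tokens(text: str) -> int:
    """Rough token count by whitespace split; a real server would use the model's tokenizer."""
    return len(text.split())


def build_chat_completion(model_id: str, messages: list, reply: str) -> dict:
    """Wrap generated text in an OpenAI-style chat completion body with usage counts."""
    prompt_tokens = sum(count_tokens(m["content"]) for m in messages)
    completion_tokens = count_tokens(reply)
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model_id,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": reply},
                "finish_reason": "stop",
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }


def build_models_list(model_id: str) -> dict:
    """Body for GET /v1/models: a list object containing one model entry."""
    return {"object": "list", "data": [{"id": model_id, "object": "model"}]}


if __name__ == "__main__":
    messages = [{"role": "user", "content": "Hello, how are you?"}]
    print(build_chat_completion("test-model", messages, "I am fine"))
    print(build_models_list("test-model"))
```

Keeping the body construction in plain functions makes it easy to reuse from whichever web framework serves port 8080.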

1.2 Dockerfile Best Practices

Multi-Stage Build Example

# Multi-stage Dockerfile for efficient model deployment
FROM python:3.9-slim AS base

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    wget \
    git \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Create app directory
WORKDIR /app

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Model preparation stage
FROM base AS model-prep

# Download or copy model weights
COPY model/ ./model/
# Alternative: Download from remote
# RUN wget -O model/weights.bin https://example.com/model-weights.bin

# Copy the optimization script into this stage, then optimize the model (quantization, etc.)
COPY scripts/ ./scripts/
RUN python scripts/optimize_model.py

# Production stage
FROM base AS production

# Copy optimized model from prep stage
COPY --from=model-prep /app/model/ ./model/

# Copy application code
COPY src/ ./src/
COPY config/ ./config/

# Create non-root user for security
RUN useradd --create-home --shell /bin/bash modeluser
RUN chown -R modeluser:modeluser /app
USER modeluser

# Expose port
EXPOSE 8080

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

# Start model server
CMD ["python", "src/server.py"]

Optimized Single-Stage Example

# Single-stage Dockerfile for simpler models
FROM python:3.9-slim

# Install system dependencies in one layer
RUN apt-get update && apt-get install -y \
    curl \
    && rm -rf /var/lib/apt/lists/* \
    && apt-get clean

# Set working directory
WORKDIR /app

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and application code
COPY model/ ./model/
COPY src/ ./src/

# Create non-root user
RUN useradd --create-home modeluser && \
    chown -R modeluser:modeluser /app
USER modeluser

# Expose port and add health check
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

# Start server
CMD ["python", "src/server.py"]

1.3 Container Optimization Techniques

Memory Optimization

# Use slim base images; name the build stage so the final stage can copy from it
FROM python:3.9-slim AS builder

# Install only required packages
RUN pip install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cpu

# Use multi-stage builds to reduce final image size
FROM python:3.9-slim AS final
COPY --from=builder /usr/local/lib/python3.9/site-packages /usr/local/lib/python3.9/site-packages

# Clean up unnecessary files
RUN rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

Startup Time Optimization

# src/server.py - Optimized model loading
import os
import time
import logging
from flask import Flask
import torch

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ModelServer:
    def __init__(self):
        self.model = None
        self.tokenizer = None
        self.model_loaded = False
        
    def load_model(self):
        """Load model with optimization for faster startup"""
        start_time = time.time()
        logger.info("Loading model...")
        
        try:
            # Load model with optimizations
            self.model = torch.jit.load('model/optimized_model.pt')
            self.model.eval()
            
            # Warm up model with dummy input
            dummy_input = torch.randn(1, 512)
            with torch.no_grad():
                _ = self.model(dummy_input)
            
            self.model_loaded = True
            load_time = time.time() - start_time
            logger.info(f"Model loaded successfully in {load_time:.2f}s")
            
        except Exception as e:
            logger.error(f"Failed to load model: {e}")
            raise

# Initialize model server
model_server = ModelServer()
model_server.load_model()

app = Flask(__name__)

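@app.route('/health')
def health():
    # Return 503 until the model is loaded so that `curl -f` in HEALTHCHECK fails on an unready server
    body = {
        "status": "healthy" if model_server.model_loaded else "unhealthy",
        "model_loaded": model_server.model_loaded
    }
    return body, (200 if model_server.model_loaded else 503)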

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
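
The server above loads the model before it starts listening, so /health cannot answer until loading finishes. One alternative, sketched below with the standard library only, is to load on a worker thread and let /health report 503 until the model is ready. BackgroundModelLoader is a hypothetical helper rather than a Formation or Flask API, and load_fn stands in for whatever function loads your model.

```python
import threading


class BackgroundModelLoader:
    """Load a model on a worker thread so the HTTP server can start immediately."""

    def __init__(self, load_fn):
        self._load_fn = load_fn          # callable that returns the loaded model
        self._ready = threading.Event()  # set once loading finishes successfully
        self.model = None
        self.error = None
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        try:
            self.model = self._load_fn()
            self._ready.set()
        except Exception as exc:  # keep the error so /health can report the failure
            self.error = exc

    def wait(self, timeout=None) -> bool:
        """Block until the model is ready; returns False on timeout."""
        return self._ready.wait(timeout)

    def health(self):
        """Return (body, http_status) for a /health handler."""
        if self._ready.is_set():
            return {"status": "healthy", "model_loaded": True}, 200
        detail = "load failed" if self.error is not None else "loading"
        return {"status": "unhealthy", "model_loaded": False, "detail": detail}, 503


if __name__ == "__main__":
    loader = BackgroundModelLoader(lambda: "fake-model")  # replace with a real loader
    loader.start()
    loader.wait(timeout=5)
    print(loader.health())
```

With this pattern the HEALTHCHECK start period still needs to cover the model's load time, because the container reports unhealthy until the worker thread finishes.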

1.4 Security Best Practices

Secure Container Configuration

# Security-hardened Dockerfile
FROM python:3.9-slim

# Update packages and install security updates
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
    curl \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user with minimal privileges
RUN groupadd -r modelgroup && \
    useradd -r -g modelgroup -d /app -s /bin/bash modeluser

# Set up application directory with proper permissions
WORKDIR /app
COPY --chown=modeluser:modelgroup . .

# Install dependencies as root, then switch to non-root
RUN pip install --no-cache-dir -r requirements.txt

# Switch to non-root user
USER modeluser

# Use specific port and disable debug mode
EXPOSE 8080
ENV FLASK_ENV=production
ENV FLASK_DEBUG=0

# Health check with timeout
HEALTHCHECK --interval=30s --timeout=5s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

CMD ["python", "src/server.py"]

1.5 Framework-Specific Examples

Hugging Face Transformers

FROM python:3.9-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY model/ ./model/
COPY src/ ./src/

# Download model if not included
# RUN python -c "from transformers import AutoModel; AutoModel.from_pretrained('microsoft/DialoGPT-medium')"

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

CMD ["python", "src/hf_server.py"]

vLLM High-Performance Inference

FROM nvidia/cuda:11.8.0-devel-ubuntu20.04

# Install Python and system dependencies
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install vLLM and dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy model configuration
COPY src/ ./src/
COPY model_config.json .

EXPOSE 8080

# Health check for GPU-enabled container
HEALTHCHECK --interval=30s --timeout=15s --start-period=120s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

CMD ["python3", "src/vllm_server.py"]

Ollama Integration

FROM ollama/ollama:latest

# Install additional dependencies
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python API wrapper
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy API wrapper and configuration
COPY src/ ./src/
COPY models/ ./models/

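# Pull model at build time; `ollama pull` needs a running server, so start one temporarily
RUN ollama serve & sleep 5 && ollama pull llama2:7b

EXPOSE 8080

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1

# The base image sets ENTRYPOINT to the ollama binary; override it so the Ollama server
# runs in the background alongside the API wrapper
ENTRYPOINT ["sh", "-c", "ollama serve & exec python3 src/ollama_wrapper.py"]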

Step 2: Resource Requirements Specification

2.1 Understanding Resource Requirements

Formation uses detailed resource specifications to ensure optimal model deployment and performance:

// Resource requirements structure (from form-state/src/model.rs)
pub struct ModelResourceRequirements {
    pub min_vcpus: u8,                    // Minimum CPU cores
    pub recommended_vcpus: u8,            // Optimal CPU cores
    pub min_memory_mb: u64,               // Minimum RAM (MB)
    pub recommended_memory_mb: u64,       // Optimal RAM (MB)
    pub min_disk_gb: u64,                 // Minimum storage (GB)
    pub recommended_disk_gb: u64,         // Optimal storage (GB)
    pub requires_gpu: bool,               // GPU requirement
    pub min_vram_gb: Option<u64>,         // Minimum VRAM (GB)
    pub recommended_vram_gb: Option<u64>, // Optimal VRAM (GB)
    pub cuda_cores: Option<u32>,          // Required CUDA cores
    pub tensor_cores: Option<u32>,        // Required Tensor cores
    pub required_cpu_extensions: Vec<String>, // CPU extensions (AVX, etc.)
    pub required_cuda_version: Option<String>, // CUDA version
}
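
Before submitting a specification it can help to check it for internal consistency. The sketch below applies a few rules that follow from the struct above: vCPU counts must fit in a u8, each minimum should not exceed its recommended value, and the VRAM fields should agree with requires_gpu. The function name and the rules themselves are assumptions for illustration; form-state may enforce different or additional constraints.

```python
def validate_resource_requirements(req: dict) -> list:
    """Return a list of problems; an empty list means the spec looks consistent."""
    problems = []

    # Each min_* value should not exceed the matching recommended_* value
    for name in ("vcpus", "memory_mb", "disk_gb"):
        low, high = req[f"min_{name}"], req[f"recommended_{name}"]
        if low > high:
            problems.append(f"min_{name} ({low}) exceeds recommended_{name} ({high})")

    # vCPU counts are u8 in the Rust struct, so they must fit in 0..=255
    for key in ("min_vcpus", "recommended_vcpus"):
        if not 0 <= req[key] <= 255:
            problems.append(f"{key} ({req[key]}) does not fit in a u8")

    # VRAM fields are Option<u64>: expected when a GPU is required, null otherwise
    min_vram = req.get("min_vram_gb")
    rec_vram = req.get("recommended_vram_gb")
    if req["requires_gpu"]:
        if min_vram is None:
            problems.append("requires_gpu is true but min_vram_gb is null")
        elif rec_vram is not None and min_vram > rec_vram:
            problems.append(f"min_vram_gb ({min_vram}) exceeds recommended_vram_gb ({rec_vram})")
    elif min_vram is not None or rec_vram is not None:
        problems.append("VRAM is specified but requires_gpu is false")

    return problems


if __name__ == "__main__":
    spec = {
        "min_vcpus": 2, "recommended_vcpus": 4,
        "min_memory_mb": 4096, "recommended_memory_mb": 8192,
        "min_disk_gb": 10, "recommended_disk_gb": 20,
        "requires_gpu": False, "min_vram_gb": None, "recommended_vram_gb": None,
    }
    print(validate_resource_requirements(spec))
```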

2.2 Resource Calculation Guidelines

CPU Requirements

# Example resource calculation for different model sizes
def calculate_cpu_requirements(model_size_params):
    """Calculate CPU requirements based on model parameters"""
    
    if model_size_params <= 1e9:  # <= 1B parameters
        return {
            "min_vcpus": 2,
            "recommended_vcpus": 4,
            "description": "Small model - CPU inference suitable"
        }
    elif model_size_params <= 7e9:  # 1B - 7B parameters
        return {
            "min_vcpus": 4,
            "recommended_vcpus": 8,
            "description": "Medium model - Multi-core CPU recommended"
        }
    elif model_size_params <= 13e9:  # 7B - 13B parameters
        return {
            "min_vcpus": 8,
            "recommended_vcpus": 16,
            "description": "Large model - High-core CPU or GPU recommended"
        }
    else:  # > 13B parameters
        return {
            "min_vcpus": 16,
            "recommended_vcpus": 32,
            "description": "Very large model - GPU strongly recommended"
        }

Memory Requirements

def calculate_memory_requirements(model_size_params, precision="fp16"):
    """Calculate memory requirements based on model size and precision"""
    
    # Base memory calculation (parameters * bytes per parameter)
    precision_multiplier = {
        "fp32": 4,  # 4 bytes per parameter
        "fp16": 2,  # 2 bytes per parameter
        "int8": 1,  # 1 byte per parameter
        "int4": 0.5 # 0.5 bytes per parameter
    }
    
    model_memory_gb = (model_size_params * precision_multiplier[precision]) / (1024**3)
    
    # Add overhead for inference (activations, KV cache, etc.)
    overhead_multiplier = 2.5  # 2.5x overhead is typical
    total_memory_gb = model_memory_gb * overhead_multiplier
    
    # Convert to MB and add system overhead
    min_memory_mb = int(total_memory_gb * 1024) + 1024  # +1GB system overhead
    recommended_memory_mb = int(min_memory_mb * 1.5)    # 50% buffer
    
    return {
        "min_memory_mb": min_memory_mb,
        "recommended_memory_mb": recommended_memory_mb,
        "model_memory_gb": round(model_memory_gb, 2),
        "total_memory_gb": round(total_memory_gb, 2)
    }
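
To make the arithmetic concrete, the compact restatement below applies the same formula (weights × 2.5 inference overhead, plus 1 GB of system overhead, then a 50% buffer) to a 7B-parameter model at three precisions. It is a condensed sketch of the function above for illustration, and memory_estimate_mb is a name introduced here.

```python
GIB = 1024 ** 3
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def memory_estimate_mb(params: float, precision: str = "fp16") -> dict:
    """Same formula as calculate_memory_requirements, reduced to the key outputs."""
    weights_gb = params * BYTES_PER_PARAM[precision] / GIB
    total_gb = weights_gb * 2.5                      # activations, KV cache, etc.
    min_memory_mb = int(total_gb * 1024) + 1024      # +1 GB system overhead
    return {
        "weights_gb": round(weights_gb, 2),
        "min_memory_mb": min_memory_mb,
        "recommended_memory_mb": int(min_memory_mb * 1.5),
    }


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        print(precision, memory_estimate_mb(7e9, precision))
    # fp16: 13.04 GB of weights -> min 34402 MB, recommended 51603 MB
    # int8:  6.52 GB of weights -> min 17713 MB, recommended 26569 MB
    # int4:  3.26 GB of weights -> min  9368 MB, recommended 14052 MB
```

Halving the precision roughly halves the estimate, which is why quantization is usually the first lever to pull when a model does not fit the available memory.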

GPU Requirements

def calculate_gpu_requirements(model_size_params, precision="fp16"):
    """Calculate GPU requirements for model inference"""
    
    memory_req = calculate_memory_requirements(model_size_params, precision)
    model_memory_gb = memory_req["model_memory_gb"]
    
    if model_size_params <= 1e9:  # <= 1B parameters
        return {
            "requires_gpu": False,
            "min_vram_gb": None,
            "recommended_vram_gb": None,
            "description": "CPU inference sufficient"
        }
    elif model_size_params <= 7e9:  # 1B - 7B parameters
        return {
            "requires_gpu": True,
            "min_vram_gb": max(8, int(model_memory_gb * 1.5)),
            "recommended_vram_gb": max(16, int(model_memory_gb * 2)),
            "cuda_cores": 2048,
            "description": "Mid-range GPU recommended"
        }
    elif model_size_params <= 13e9:  # 7B - 13B parameters
        return {
            "requires_gpu": True,
            "min_vram_gb": max(16, int(model_memory_gb * 1.5)),
            "recommended_vram_gb": max(24, int(model_memory_gb * 2)),
            "cuda_cores": 4096,
            "tensor_cores": 320,
            "description": "High-end GPU required"
        }
    else:  # > 13B parameters
        return {
            "requires_gpu": True,
            "min_vram_gb": max(24, int(model_memory_gb * 1.5)),
            "recommended_vram_gb": max(48, int(model_memory_gb * 2)),
            "cuda_cores": 8192,
            "tensor_cores": 640,
            "description": "Enterprise GPU or multi-GPU setup required"
        }

2.3 Resource Specification Examples

Small Language Model (1B parameters)

{
  "resource_requirements": {
    "min_vcpus": 2,
    "recommended_vcpus": 4,
    "min_memory_mb": 4096,
    "recommended_memory_mb": 8192,
    "min_disk_gb": 10,
    "recommended_disk_gb": 20,
    "requires_gpu": false,
    "min_vram_gb": null,
    "recommended_vram_gb": null,
    "cuda_cores": null,
    "tensor_cores": null,
    "required_cpu_extensions": ["AVX2"],
    "required_cuda_version": null
  }
}

Medium Language Model (7B parameters)

{
  "resource_requirements": {
    "min_vcpus": 4,
    "recommended_vcpus": 8,
    "min_memory_mb": 16384,
    "recommended_memory_mb": 32768,
    "min_disk_gb": 20,
    "recommended_disk_gb": 50,
    "requires_gpu": true,
    "min_vram_gb": 16,
    "recommended_vram_gb": 24,
    "cuda_cores": 2048,
    "tensor_cores": 320,
    "required_cpu_extensions": ["AVX2", "FMA"],
    "required_cuda_version": "11.8"
  }
}

Large Language Model (13B+ parameters)

{
  "resource_requirements": {
    "min_vcpus": 8,
    "recommended_vcpus": 16,
    "min_memory_mb": 32768,
    "recommended_memory_mb": 65536,
    "min_disk_gb": 50,
    "recommended_disk_gb": 100,
    "requires_gpu": true,
    "min_vram_gb": 24,
    "recommended_vram_gb": 48,
    "cuda_cores": 4096,
    "tensor_cores": 640,
    "required_cpu_extensions": ["AVX2", "FMA", "AVX512"],
    "required_cuda_version": "12.0"
  }
}

Image Generation Model

{
  "resource_requirements": {
    "min_vcpus": 4,
    "recommended_vcpus": 8,
    "min_memory_mb": 8192,
    "recommended_memory_mb": 16384,
    "min_disk_gb": 30,
    "recommended_disk_gb": 60,
    "requires_gpu": true,
    "min_vram_gb": 12,
    "recommended_vram_gb": 24,
    "cuda_cores": 2048,
    "tensor_cores": 320,
    "required_cpu_extensions": ["AVX2"],
    "required_cuda_version": "11.8"
  }
}

2.4 Dynamic Resource Calculation

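# src/resource_calculator.py
import json
from typing import Dict, Any

# The calculate_* helpers are the functions from section 2.2. This import assumes they
# were saved alongside this file as resource_guidelines.py (a hypothetical module name).
from resource_guidelines import (
    calculate_cpu_requirements,
    calculate_memory_requirements,
    calculate_gpu_requirements,
)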

class ResourceCalculator:
    """Calculate optimal resource requirements for models"""
    
    def __init__(self, model_path: str):
        self.model_path = model_path
        self.model_info = self._analyze_model()
    
    def _analyze_model(self) -> Dict[str, Any]:
        """Analyze model to determine characteristics"""
        # This would analyze your specific model
        # For example, counting parameters, checking precision, etc.
        
        return {
            "parameters": 7e9,  # 7B parameters
            "precision": "fp16",
            "model_type": "language_model",
            "framework": "pytorch"
        }
    
    def calculate_requirements(self) -> Dict[str, Any]:
        """Calculate complete resource requirements"""
        
        params = self.model_info["parameters"]
        precision = self.model_info["precision"]
        
        cpu_req = calculate_cpu_requirements(params)
        memory_req = calculate_memory_requirements(params, precision)
        gpu_req = calculate_gpu_requirements(params, precision)
        
        return {
            "min_vcpus": cpu_req["min_vcpus"],
            "recommended_vcpus": cpu_req["recommended_vcpus"],
            "min_memory_mb": memory_req["min_memory_mb"],
            "recommended_memory_mb": memory_req["recommended_memory_mb"],
            "min_disk_gb": 20,  # Base disk requirement
            "recommended_disk_gb": 50,
            "requires_gpu": gpu_req["requires_gpu"],
            "min_vram_gb": gpu_req.get("min_vram_gb"),
            "recommended_vram_gb": gpu_req.get("recommended_vram_gb"),
            "cuda_cores": gpu_req.get("cuda_cores"),
            "tensor_cores": gpu_req.get("tensor_cores"),
            "required_cpu_extensions": ["AVX2", "FMA"],
            "required_cuda_version": "11.8" if gpu_req["requires_gpu"] else None
        }
    
    def save_requirements(self, output_path: str):
        """Save requirements to JSON file"""
        requirements = self.calculate_requirements()
        
        with open(output_path, 'w') as f:
            json.dump({"resource_requirements": requirements}, f, indent=2)
        
        print(f"Resource requirements saved to {output_path}")

# Usage example
if __name__ == "__main__":
    calculator = ResourceCalculator("./model")
    calculator.save_requirements("resource_requirements.json")
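
The _analyze_model method above returns hardcoded values. One rough way to fill in the parameter count, sketched below, is to divide the size of the weights on disk by the bytes per parameter for the stored precision. estimate_parameters is a hypothetical helper, and the result is only an estimate: a model directory that also holds tokenizer or config files will be slightly overcounted.

```python
import os

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def estimate_parameters(weights_path: str, precision: str = "fp16") -> int:
    """Estimate parameter count as total weight bytes divided by bytes per parameter."""
    if os.path.isdir(weights_path):
        # Sum every file under the directory (sharded weights are common)
        total_bytes = sum(
            os.path.getsize(os.path.join(root, name))
            for root, _dirs, files in os.walk(weights_path)
            for name in files
        )
    else:
        total_bytes = os.path.getsize(weights_path)
    return int(total_bytes / BYTES_PER_PARAM[precision])


if __name__ == "__main__":
    path = "./model"  # same path the ResourceCalculator example uses
    if os.path.exists(path):
        print(f"Estimated parameters: {estimate_parameters(path):,}")
```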

Step 3: Registration with form-state

3.1 Model Registration Overview

Formation uses the form-state service to manage model registration and metadata. Models are registered via the /models/create endpoint with comprehensive metadata.

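Registration Flow

The client signs a registration message with its ECDSA key, sends the model metadata to /models/create with the signature in the Authorization header, and receives a 201 response on success.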

3.2 Model Metadata Structure

{
  "model_id": "my-llm-model-v1",
  "name": "My Custom Language Model",
  "owner_id": "0x1234567890abcdef...",
  "version": "1.0.0",
  "description": "A fine-tuned language model for specific domain tasks",
  "documentation": "# Model Documentation\n\nThis model is trained for...",
  "license": "MIT",
  "model_type": "LLM",
  "framework": "PyTorch",
  "io_modes": ["TextToText"],
  "parameters": 7000000000,
  "quantization": "FP16",
  "context_length": 4096,
  "input_tokens_per_second": 150,
  "output_tokens_per_second": 50,
  "tags": ["language-model", "chat", "instruction-following"],
  "created_at": 1703123456,
  "updated_at": 1703123456,
  "formfile_template": "base64_encoded_formfile",
  "weights_url": "https://example.com/model-weights.bin",
  "weights_checksum": "sha256:abc123...",
  "weights_size_bytes": 14000000000,
  "resource_requirements": {
    "min_vcpus": 4,
    "recommended_vcpus": 8,
    "min_memory_mb": 16384,
    "recommended_memory_mb": 32768,
    "min_disk_gb": 20,
    "recommended_disk_gb": 50,
    "requires_gpu": true,
    "min_vram_gb": 16,
    "recommended_vram_gb": 24,
    "cuda_cores": 2048,
    "tensor_cores": 320,
    "required_cpu_extensions": ["AVX2", "FMA"],
    "required_cuda_version": "11.8"
  },
  "capabilities": [
    "text_generation",
    "chat_completion",
    "instruction_following",
    "code_generation"
  ],
  "average_rating": null,
  "deployment_count": 0,
  "usage_count": 0,
  "is_featured": false,
  "is_private": false,
  "metadata": {
    "training_data": "Custom dataset",
    "fine_tuning_method": "LoRA",
    "base_model": "llama-2-7b"
  },
  "repository_url": "https://github.com/user/my-model",
  "demo_url": "https://demo.example.com",
  "paper_url": "https://arxiv.org/abs/2301.00000",
  "price_per_1m_tokens": 1000,
  "usage_tracking": {
    "track_tokens": true,
    "track_requests": true,
    "enable_royalties": true,
    "royalty_percentage": 10,
    "custom_metrics": ["latency", "quality_score"]
  }
}
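
The weights_checksum field uses an algorithm-prefixed form ("sha256:" followed by the hex digest). The sketch below shows how a consumer of this metadata might verify a downloaded weights file against that value. verify_weights_checksum is a name introduced here, and treating sha256 as the only accepted algorithm is an assumption based on the example above.

```python
import hashlib
import hmac


def verify_weights_checksum(path: str, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Compare a file's SHA-256 digest with a 'sha256:<hex>' checksum string."""
    algorithm, _, expected_hex = expected.partition(":")
    if algorithm != "sha256" or not expected_hex:
        raise ValueError(f"unsupported checksum format: {expected!r}")

    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte weight files never sit fully in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)

    # compare_digest avoids leaking where the strings differ through timing
    return hmac.compare_digest(digest.hexdigest(), expected_hex.lower())
```

A typical call would be verify_weights_checksum("model/weights.bin", metadata["weights_checksum"]) after the download completes and before the model is loaded.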

3.3 Automatic Registration with form-pack

When you build with form-pack, your model can be automatically registered:

Formfile with Auto-Registration

NAME my-llm-model
DESCRIPTION "Custom language model for domain-specific tasks"

# System Resources
VCPU 8
MEM 32768
DISK 50

# User Configuration
USER username:modeluser passwd:securepass sudo:true ssh_authorized_keys:"ssh-rsa AAAAB3NzaC1yc2E..."

# Base System
FROM ubuntu:22.04

# Install Docker
RUN apt-get update && apt-get install -y docker.io docker-compose curl
RUN systemctl enable docker

# Copy model container and configuration
COPY ./model-container /app/model-container
COPY ./docker-compose.yml /app/docker-compose.yml
COPY ./model-metadata.json /app/model-metadata.json

# Set working directory
WORKDIR /app

# Register model with form-state (automatic)
ENV AUTO_REGISTER_MODEL=true
ENV MODEL_METADATA_PATH=/app/model-metadata.json

# Start Docker service and run model container
ENTRYPOINT ["sh", "-c", "service docker start && docker-compose up -d"]

3.4 Manual Registration

For custom registration scenarios or when you need more control:

Registration Script

#!/usr/bin/env python3
"""
Model registration script for Formation
"""

import json
import time
import base64
import hashlib
import requests
from eth_account import Account
from eth_account.messages import encode_defunct

class ModelRegistrar:
    def __init__(self, private_key: str, form_state_url: str = "http://localhost:3004"):
        self.private_key = private_key
        self.form_state_url = form_state_url
        self.account = Account.from_key(private_key)
        
    def create_signature(self, message: str) -> tuple:
        """Create ECDSA signature for authentication"""
        message_hash = encode_defunct(text=message)
        signed_message = self.account.sign_message(message_hash)
        
        return (
            signed_message.signature.hex(),
            signed_message.v - 27,  # Recovery ID
            message.encode().hex()
        )
    
    def register_model(self, model_metadata: dict) -> dict:
        """Register model with form-state"""
        
        # Ensure owner_id matches the signing account
        model_metadata["owner_id"] = self.account.address.lower()
        model_metadata["created_at"] = int(time.time())
        model_metadata["updated_at"] = int(time.time())
        
        # Create authentication message
        message = f"register_model_{model_metadata['model_id']}_{int(time.time())}"
        signature, recovery_id, message_hex = self.create_signature(message)
        
        # Prepare request
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Signature {signature}.{recovery_id}.{message_hex}"
        }
        
        # Send registration request
        response = requests.post(
            f"{self.form_state_url}/models/create",
            headers=headers,
            json=model_metadata
        )
        
        if response.status_code == 201:
            print(f"✅ Model '{model_metadata['model_id']}' registered successfully")
            return response.json()
        else:
            print(f"❌ Registration failed: {response.status_code}")
            print(f"Error: {response.text}")
            raise Exception(f"Registration failed: {response.text}")
    
    def update_model(self, model_metadata: dict) -> dict:
        """Update existing model registration"""
        
        model_metadata["updated_at"] = int(time.time())
        
        # Create authentication message
        message = f"update_model_{model_metadata['model_id']}_{int(time.time())}"
        signature, recovery_id, message_hex = self.create_signature(message)
        
        # Prepare request
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Signature {signature}.{recovery_id}.{message_hex}"
        }
        
        # Send update request
        response = requests.post(
            f"{self.form_state_url}/models/update",
            headers=headers,
            json=model_metadata
        )
        
        if response.status_code == 200:
            print(f"✅ Model '{model_metadata['model_id']}' updated successfully")
            return response.json()
        else:
            print(f"❌ Update failed: {response.status_code}")
            print(f"Error: {response.text}")
            raise Exception(f"Update failed: {response.text}")

def load_formfile_template(formfile_path: str) -> str:
    """Load and encode Formfile template"""
    with open(formfile_path, 'r') as f:
        formfile_content = f.read()
    
    return base64.b64encode(formfile_content.encode()).decode()

def calculate_file_checksum(file_path: str) -> str:
    """Calculate SHA-256 checksum of file"""
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            sha256_hash.update(chunk)
    return f"sha256:{sha256_hash.hexdigest()}"

def main():
    """Main registration function"""
    
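    # Configuration
    # Read the signing key from the environment rather than hardcoding it in source
    # (FORMATION_PRIVATE_KEY is an assumed variable name)
    PRIVATE_KEY = os.environ["FORMATION_PRIVATE_KEY"]  # Your ECDSA private key
    FORM_STATE_URL = "http://localhost:3004"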
    
    # Initialize registrar
    registrar = ModelRegistrar(PRIVATE_KEY, FORM_STATE_URL)
    
    # Load Formfile template
    formfile_template = load_formfile_template("Formfile")
    
    # Calculate model weights checksum (if applicable)
    weights_checksum = None
    weights_size = None
    if os.path.exists("model/weights.bin"):
        weights_checksum = calculate_file_checksum("model/weights.bin")
        weights_size = os.path.getsize("model/weights.bin")
    
    # Model metadata
    model_metadata = {
        "model_id": "my-custom-llm-v1",
        "name": "My Custom Language Model",
        "version": "1.0.0",
        "description": "A fine-tuned language model for specific domain tasks",
        "documentation": "# Model Documentation\n\nThis model is trained for domain-specific tasks...",
        "license": "MIT",
        "model_type": "LLM",
        "framework": "PyTorch",
        "io_modes": ["TextToText"],
        "parameters": 7000000000,
        "quantization": "FP16",
        "context_length": 4096,
        "input_tokens_per_second": 150,
        "output_tokens_per_second": 50,
        "tags": ["language-model", "chat", "instruction-following"],
        "formfile_template": formfile_template,
        "weights_url": "https://example.com/model-weights.bin",
        "weights_checksum": weights_checksum,
        "weights_size_bytes": weights_size,
        "resource_requirements": {
            "min_vcpus": 4,
            "recommended_vcpus": 8,
            "min_memory_mb": 16384,
            "recommended_memory_mb": 32768,
            "min_disk_gb": 20,
            "recommended_disk_gb": 50,
            "requires_gpu": True,
            "min_vram_gb": 16,
            "recommended_vram_gb": 24,
            "cuda_cores": 2048,
            "tensor_cores": 320,
            "required_cpu_extensions": ["AVX2", "FMA"],
            "required_cuda_version": "11.8"
        },
        "capabilities": [
            "text_generation",
            "chat_completion",
            "instruction_following"
        ],
        "is_private": False,
        "metadata": {
            "training_data": "Custom dataset",
            "fine_tuning_method": "LoRA",
            "base_model": "llama-2-7b"
        },
        "repository_url": "https://github.com/user/my-model",
        "price_per_1m_tokens": 1000,
        "usage_tracking": {
            "track_tokens": True,
            "track_requests": True,
            "enable_royalties": True,
            "royalty_percentage": 10,
            "custom_metrics": ["latency", "quality_score"]
        }
    }
    
    try:
        # Register the model
        result = registrar.register_model(model_metadata)
        print(f"Registration successful: {result}")
        
    except Exception as e:
        print(f"Registration failed: {e}")
        return 1
    
    return 0

if __name__ == "__main__":
    import os
    import sys
    sys.exit(main())

3.5 Registration Verification

After registration, verify your model was registered correctly:

# Check model registration
curl -H "Authorization: Signature <signature>.<recovery_id>.<message_hex>" \
     http://localhost:3004/models/my-custom-llm-v1

# List all your models
curl -H "Authorization: Signature <signature>.<recovery_id>.<message_hex>" \
     http://localhost:3004/models/list

Verification Script

def verify_registration(registrar: ModelRegistrar, model_id: str):
    """Verify model registration (reuses ModelRegistrar, time, and requests from the registration script)"""
    
    # Create authentication for GET request
    message = f"get_model_{model_id}_{int(time.time())}"
    signature, recovery_id, message_hex = registrar.create_signature(message)
    
    headers = {
        "Authorization": f"Signature {signature}.{recovery_id}.{message_hex}"
    }
    
    response = requests.get(
        f"{registrar.form_state_url}/models/{model_id}",
        headers=headers
    )
    
    if response.status_code == 200:
        model_data = response.json()
        print(f"✅ Model verification successful")
        print(f"Model ID: {model_data['model']['model_id']}")
        print(f"Name: {model_data['model']['name']}")
        print(f"Owner: {model_data['model']['owner_id']}")
        print(f"Status: Registered")
        return True
    else:
        print(f"❌ Model verification failed: {response.status_code}")
        print(f"Error: {response.text}")
        return False

Step 4: Testing Inference Endpoints

4.1 Local Testing Setup

Before deploying to Formation, thoroughly test your model locally:

Test Environment Setup

# Build and run your model container locally (detached, so the following commands can run)
docker build -t my-model:latest .
docker run -d --name my-model -p 8080:8080 my-model:latest

# Wait for model to load
sleep 30

# Run test suite
python test_model_api.py
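
A fixed sleep is either too short for large models or wastes time for small ones. As an alternative, the sketch below polls /health until it reports "healthy" or a deadline passes. It uses only the standard library; wait_until_healthy and fetch_health are names introduced here, and the fetch and sleep parameters exist so the loop can be exercised without a running container.

```python
import json
import time
import urllib.request


def fetch_health(url: str, timeout: float = 5.0) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(url: str, max_wait: float = 120.0, interval: float = 2.0,
                       fetch=fetch_health, sleep=time.sleep) -> bool:
    """Poll until the endpoint reports status 'healthy'; return False on timeout."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        try:
            if fetch(url).get("status") == "healthy":
                return True
        except (OSError, ValueError):
            # OSError covers refused connections and HTTP errors (URLError, HTTPError);
            # ValueError covers a body that is not valid JSON yet
            pass
        sleep(interval)
    return False


if __name__ == "__main__":
    ok = wait_until_healthy("http://localhost:8080/health", max_wait=10)
    print("model ready" if ok else "model did not become healthy in time")
```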

4.2 Comprehensive API Testing

Test Suite Implementation

#!/usr/bin/env python3
"""
Comprehensive test suite for Formation model API
"""

import json
import time
import requests
import threading
from typing import Dict, Any, List
import concurrent.futures

class ModelAPITester:
    def __init__(self, base_url: str = "http://localhost:8080"):
        self.base_url = base_url
        self.session = requests.Session()
        self.test_results = []
    
    def test_health_endpoint(self) -> bool:
        """Test health check endpoint"""
        print("🔍 Testing health endpoint...")
        
        try:
            response = self.session.get(f"{self.base_url}/health", timeout=10)
            
            assert response.status_code == 200, f"Expected 200, got {response.status_code}"
            
            data = response.json()
            assert "status" in data, "Health response missing 'status' field"
            assert data["status"] in ["healthy", "degraded", "unhealthy"], f"Invalid status: {data['status']}"
            
            print(f"✅ Health check passed - Status: {data['status']}")
            return True
            
        except Exception as e:
            print(f"❌ Health check failed: {e}")
            return False
    
    def test_models_endpoint(self) -> bool:
        """Test models list endpoint"""
        print("🔍 Testing models endpoint...")
        
        try:
            response = self.session.get(f"{self.base_url}/v1/models", timeout=10)
            
            assert response.status_code == 200, f"Expected 200, got {response.status_code}"
            
            data = response.json()
            assert "object" in data, "Models response missing 'object' field"
            assert data["object"] == "list", f"Expected 'list', got {data['object']}"
            assert "data" in data, "Models response missing 'data' field"
            assert isinstance(data["data"], list), "Models data should be a list"
            assert len(data["data"]) > 0, "No models found"
            
            model = data["data"][0]
            assert "id" in model, "Model missing 'id' field"
            assert "object" in model, "Model missing 'object' field"
            assert model["object"] == "model", f"Expected 'model', got {model['object']}"
            
            print(f"✅ Models endpoint passed - Found {len(data['data'])} model(s)")
            return True
            
        except Exception as e:
            print(f"❌ Models endpoint failed: {e}")
            return False
    
    def test_chat_completions(self) -> bool:
        """Test chat completions endpoint"""
        print("🔍 Testing chat completions endpoint...")
        
        test_payload = {
            "model": "test-model",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Hello, how are you?"}
            ],
            "max_tokens": 100,
            "temperature": 0.7
        }
        
        try:
            start_time = time.time()
            response = self.session.post(
                f"{self.base_url}/v1/chat/completions",
                json=test_payload,
                timeout=30
            )
            response_time = time.time() - start_time
            
            assert response.status_code == 200, f"Expected 200, got {response.status_code}"
            
            data = response.json()
            
            # Validate response structure
            assert "id" in data, "Response missing 'id' field"
            assert "object" in data, "Response missing 'object' field"
            assert data["object"] == "chat.completion", f"Expected 'chat.completion', got {data['object']}"
            assert "choices" in data, "Response missing 'choices' field"
            assert "usage" in data, "Response missing 'usage' field"
            
            # Validate choices
            choices = data["choices"]
            assert len(choices) > 0, "No choices in response"
            
            choice = choices[0]
            assert "index" in choice, "Choice missing 'index' field"
            assert "message" in choice, "Choice missing 'message' field"
            assert "finish_reason" in choice, "Choice missing 'finish_reason' field"
            
            message = choice["message"]
            assert "role" in message, "Message missing 'role' field"
            assert "content" in message, "Message missing 'content' field"
            assert message["role"] == "assistant", f"Expected 'assistant', got {message['role']}"
            assert len(message["content"]) > 0, "Empty message content"
            
            # Validate usage
            usage = data["usage"]
            assert "prompt_tokens" in usage, "Usage missing 'prompt_tokens' field"
            assert "completion_tokens" in usage, "Usage missing 'completion_tokens' field"
            assert "total_tokens" in usage, "Usage missing 'total_tokens' field"
            assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"], "Token count mismatch"
            
            print(f"✅ Chat completions passed - Response time: {response_time:.2f}s")
            print(f"   Tokens: {usage['prompt_tokens']} + {usage['completion_tokens']} = {usage['total_tokens']}")
            return True
            
        except Exception as e:
            print(f"❌ Chat completions failed: {e}")
            return False
    
    def test_streaming_completions(self) -> bool:
        """Test streaming chat completions"""
        print("🔍 Testing streaming completions...")
        
        test_payload = {
            "model": "test-model",
            "messages": [
                {"role": "user", "content": "Count from 1 to 5"}
            ],
            "max_tokens": 50,
            "stream": True
        }
        
        try:
            response = self.session.post(
                f"{self.base_url}/v1/chat/completions",
                json=test_payload,
                stream=True,
                timeout=30
            )
            
            assert response.status_code == 200, f"Expected 200, got {response.status_code}"
            
            chunks_received = 0
            content_received = ""
            
            for line in response.iter_lines():
                if line:
                    line_str = line.decode('utf-8')
                    if line_str.startswith('data: '):
                        data_str = line_str[6:]  # Remove 'data: ' prefix
                        
                        if data_str.strip() == '[DONE]':
                            break
                        
                        try:
                            chunk_data = json.loads(data_str)
                            chunks_received += 1
                            
                            # Validate chunk structure
                            assert "choices" in chunk_data, "Chunk missing 'choices' field"
                            if chunk_data["choices"] and "delta" in chunk_data["choices"][0]:
                                delta = chunk_data["choices"][0]["delta"]
                                if "content" in delta:
                                    content_received += delta["content"]
                                    
                        except json.JSONDecodeError:
                            continue
            
            assert chunks_received > 0, "No chunks received"
            assert len(content_received) > 0, "No content received"
            
            print(f"✅ Streaming completions passed - Received {chunks_received} chunks")
            return True
            
        except Exception as e:
            print(f"❌ Streaming completions failed: {e}")
            return False
    
    def test_error_handling(self) -> bool:
        """Test error handling scenarios"""
        print("🔍 Testing error handling...")
        
        test_cases = [
            {
                "name": "Empty request",
                "payload": {},
                "expected_status": 400
            },
            {
                "name": "Missing messages",
                "payload": {"model": "test-model"},
                "expected_status": 400
            },
            {
                "name": "Invalid messages format",
                "payload": {"model": "test-model", "messages": "invalid"},
                "expected_status": 400
            },
            {
                "name": "Empty messages array",
                "payload": {"model": "test-model", "messages": []},
                "expected_status": 400
            }
        ]
        
        for test_case in test_cases:
            try:
                response = self.session.post(
                    f"{self.base_url}/v1/chat/completions",
                    json=test_case["payload"],
                    timeout=10
                )
                
                assert response.status_code == test_case["expected_status"], \
                    f"Test '{test_case['name']}': Expected {test_case['expected_status']}, got {response.status_code}"
                
                # Validate error response format
                if response.status_code >= 400:
                    data = response.json()
                    assert "error" in data, f"Error response missing 'error' field for test '{test_case['name']}'"
                    error = data["error"]
                    assert "message" in error, f"Error missing 'message' field for test '{test_case['name']}'"
                
                print(f"   ✅ {test_case['name']}: Correct error handling")
                
            except Exception as e:
                print(f"   ❌ {test_case['name']}: {e}")
                return False
        
        print("✅ Error handling tests passed")
        return True
    
    def test_concurrent_requests(self, num_requests: int = 10) -> bool:
        """Test concurrent request handling"""
        print(f"🔍 Testing concurrent requests ({num_requests} requests)...")
        
        def make_request():
            payload = {
                "model": "test-model",
                "messages": [{"role": "user", "content": "Hello"}],
                "max_tokens": 50
            }
            
            try:
                response = requests.post(
                    f"{self.base_url}/v1/chat/completions",
                    json=payload,
                    timeout=30
                )
                return response.status_code == 200
            except requests.RequestException:
                return False
        
        try:
            start_time = time.time()
            
            with concurrent.futures.ThreadPoolExecutor(max_workers=num_requests) as executor:
                futures = [executor.submit(make_request) for _ in range(num_requests)]
                results = [future.result() for future in concurrent.futures.as_completed(futures)]
            
            end_time = time.time()
            
            successful_requests = sum(results)
            success_rate = successful_requests / num_requests * 100
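            # Requests run in parallel, so this is total wall-clock time spread
            # over all requests, not the latency of an individual request
            avg_time = (end_time - start_time) / num_requests
            
            assert success_rate >= 80, f"Success rate too low: {success_rate}%"
            
            print(f"✅ Concurrent requests passed - {successful_requests}/{num_requests} successful ({success_rate:.1f}%)")
            print(f"   Wall-clock time per request (amortized): {avg_time:.2f}s")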
            return True
            
        except Exception as e:
            print(f"❌ Concurrent requests failed: {e}")
            return False
    
    def test_performance_benchmarks(self) -> bool:
        """Test performance benchmarks"""
        print("🔍 Testing performance benchmarks...")
        
        # Test response time requirements
        test_cases = [
            {
                "endpoint": "/health",
                "method": "GET",
                "max_time": 0.1,  # 100ms
                "payload": None
            },
            {
                "endpoint": "/v1/models",
                "method": "GET", 
                "max_time": 0.2,  # 200ms
                "payload": None
            },
            {
                "endpoint": "/v1/chat/completions",
                "method": "POST",
                "max_time": 30.0,  # 30 seconds
                "payload": {
                    "model": "test-model",
                    "messages": [{"role": "user", "content": "Hello"}],
                    "max_tokens": 100
                }
            }
        ]
        
        for test_case in test_cases:
            try:
                start_time = time.time()
                
                if test_case["method"] == "GET":
                    response = self.session.get(f"{self.base_url}{test_case['endpoint']}", timeout=test_case["max_time"] + 5)
                else:
                    response = self.session.post(f"{self.base_url}{test_case['endpoint']}", json=test_case["payload"], timeout=test_case["max_time"] + 5)
                
                response_time = time.time() - start_time
                
                assert response.status_code == 200, f"Request failed with status {response.status_code}"
                assert response_time <= test_case["max_time"], f"Response time {response_time:.2f}s exceeds limit {test_case['max_time']}s"
                
                print(f"   ✅ {test_case['endpoint']}: {response_time:.2f}s (limit: {test_case['max_time']}s)")
                
            except Exception as e:
                print(f"   ❌ {test_case['endpoint']}: {e}")
                return False
        
        print("✅ Performance benchmarks passed")
        return True
    
    def run_all_tests(self) -> bool:
        """Run complete test suite"""
        print("🚀 Starting comprehensive model API tests...\n")
        
        tests = [
            ("Health Endpoint", self.test_health_endpoint),
            ("Models Endpoint", self.test_models_endpoint),
            ("Chat Completions", self.test_chat_completions),
            ("Streaming Completions", self.test_streaming_completions),
            ("Error Handling", self.test_error_handling),
            ("Concurrent Requests", self.test_concurrent_requests),
            ("Performance Benchmarks", self.test_performance_benchmarks)
        ]
        
        passed_tests = 0
        total_tests = len(tests)
        
        for test_name, test_func in tests:
            print(f"\n{'='*50}")
            print(f"Running: {test_name}")
            print('='*50)
            
            try:
                if test_func():
                    passed_tests += 1
                    print(f"✅ {test_name} PASSED")
                else:
                    print(f"❌ {test_name} FAILED")
            except Exception as e:
                print(f"❌ {test_name} FAILED with exception: {e}")
        
        print(f"\n{'='*50}")
        print(f"TEST SUMMARY")
        print('='*50)
        print(f"Passed: {passed_tests}/{total_tests}")
        print(f"Success Rate: {passed_tests/total_tests*100:.1f}%")
        
        if passed_tests == total_tests:
            print("🎉 All tests passed! Model is ready for deployment.")
            return True
        else:
            print("⚠️  Some tests failed. Please fix issues before deployment.")
            return False

def main():
    """Main test function"""
    import argparse
    
    parser = argparse.ArgumentParser(description="Test Formation model API")
    parser.add_argument("--url", default="http://localhost:8080", help="Model API base URL")
    parser.add_argument("--concurrent", type=int, default=10, help="Number of concurrent requests to test")
    
    args = parser.parse_args()
    
    tester = ModelAPITester(args.url)
    
    # Wait for model to be ready
    print("Waiting for model to be ready...")
    for i in range(30):  # Up to 30 attempts, about 1 second apart
        try:
            response = requests.get(f"{args.url}/health", timeout=5)
            if response.status_code == 200:
                data = response.json()
                if data.get("status") == "healthy":
                    print("✅ Model is ready!")
                    break
        except (requests.RequestException, ValueError):
            # Connection refused or a non-JSON body: the server is still starting
            pass
        
        time.sleep(1)
        print(f"   Waiting... ({i+1}/30)")
    else:
        print("❌ Model failed to become ready after 30 attempts")
        return 1
    
    # Run tests
    success = tester.run_all_tests()
    return 0 if success else 1

if __name__ == "__main__":
    import sys
    sys.exit(main())
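
Before pointing the suite at a real model, it can help to confirm that the suite itself behaves as expected. The sketch below is a hypothetical stub built on Python's standard `http.server`; it returns canned responses in the shapes the tests above assert on. `StubModelHandler` and `start_stub_server` are illustrative names, the token counts are whitespace word counts rather than real tokenization, and nothing here reflects Formation's own server code.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StubModelHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a model server; it always gives a canned reply."""

    def log_message(self, format, *args):
        pass  # keep test output quiet

    def _send_json(self, status, body):
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def _send_stream(self, reply):
        # Server-sent events; the HTTP/1.0 default closes the connection at the end
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.end_headers()
        for word in reply.split():
            chunk = {"object": "chat.completion.chunk",
                     "choices": [{"index": 0, "delta": {"content": word + " "}}]}
            self.wfile.write(f"data: {json.dumps(chunk)}\n\n".encode("utf-8"))
        self.wfile.write(b"data: [DONE]\n\n")

    def do_GET(self):
        if self.path == "/health":
            self._send_json(200, {"status": "healthy"})
        elif self.path == "/v1/models":
            self._send_json(200, {"object": "list",
                                  "data": [{"id": "test-model", "object": "model"}]})
        else:
            self._send_json(404, {"error": {"message": "not found"}})

    def do_POST(self):
        if self.path != "/v1/chat/completions":
            self._send_json(404, {"error": {"message": "not found"}})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            request = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self._send_json(400, {"error": {"message": "invalid JSON"}})
            return
        messages = request.get("messages") if isinstance(request, dict) else None
        if not isinstance(messages, list) or not messages:
            self._send_json(400, {"error": {"message": "'messages' must be a non-empty list"}})
            return
        reply = "Hello from the stub."
        if request.get("stream"):
            self._send_stream(reply)
            return
        prompt_tokens = sum(len(str(m.get("content", "")).split())
                            for m in messages if isinstance(m, dict))
        completion_tokens = len(reply.split())
        self._send_json(200, {
            "id": "chatcmpl-stub",
            "object": "chat.completion",
            "choices": [{"index": 0, "finish_reason": "stop",
                         "message": {"role": "assistant", "content": reply}}],
            "usage": {"prompt_tokens": prompt_tokens,
                      "completion_tokens": completion_tokens,
                      "total_tokens": prompt_tokens + completion_tokens},
        })


def start_stub_server(port=0):
    """Start the stub on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), StubModelHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

One way to use it is to call `start_stub_server(8080)` and then run `ModelAPITester("http://localhost:8080").run_all_tests()` in the same process. A suite that fails against this stub most likely has a problem in the test code or environment rather than in the model.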

4.3 Load Testing

Load Test Script

#!/usr/bin/env python3
"""
Load testing script for Formation models
"""

import time
import json
import statistics
import concurrent.futures
import requests
from typing import List, Dict, Any

class LoadTester:
    def __init__(self, base_url: str):
        self.base_url = base_url
        self.results = []
    
    def single_request(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        """Make a single request and measure performance"""
        start_time = time.time()
        
        try:
            response = requests.post(
                f"{self.base_url}/v1/chat/completions",
                json=payload,
                timeout=60
            )
            
            end_time = time.time()
            response_time = end_time - start_time
            
            success = response.status_code == 200
            
            if success:
                data = response.json()
                tokens = data.get("usage", {}).get("total_tokens", 0)
            else:
                tokens = 0
            
            return {
                "success": success,
                "response_time": response_time,
                "status_code": response.status_code,
                "tokens": tokens,
                "timestamp": start_time
            }
            
        except Exception as e:
            end_time = time.time()
            return {
                "success": False,
                "response_time": end_time - start_time,
                "status_code": 0,
                "tokens": 0,
                "error": str(e),
                "timestamp": start_time
            }
    
    def run_load_test(self, 
                     num_requests: int = 100,
                     concurrent_users: int = 10,
                     payload: Dict[str, Any] = None) -> Dict[str, Any]:
        """Run load test with specified parameters"""
        
        if payload is None:
            payload = {
                "model": "test-model",
                "messages": [
                    {"role": "user", "content": "Generate a short response about AI."}
                ],
                "max_tokens": 100
            }
        
        print(f"🚀 Starting load test:")
        print(f"   Requests: {num_requests}")
        print(f"   Concurrent users: {concurrent_users}")
        print(f"   Payload: {json.dumps(payload, indent=2)}")
        
        start_time = time.time()
        
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrent_users) as executor:
            futures = [executor.submit(self.single_request, payload) for _ in range(num_requests)]
            results = [future.result() for future in concurrent.futures.as_completed(futures)]
        
        end_time = time.time()
        total_time = end_time - start_time
        
        # Analyze results
        successful_requests = [r for r in results if r["success"]]
        failed_requests = [r for r in results if not r["success"]]
        
        response_times = [r["response_time"] for r in successful_requests]
        total_tokens = sum(r["tokens"] for r in successful_requests)
        
        # Calculate statistics
        stats = {
            "total_requests": num_requests,
            "successful_requests": len(successful_requests),
            "failed_requests": len(failed_requests),
            "success_rate": len(successful_requests) / num_requests * 100,
            "total_time": total_time,
            "requests_per_second": num_requests / total_time,
            "total_tokens": total_tokens,
            "tokens_per_second": total_tokens / total_time if total_time > 0 else 0,
        }
        
        if response_times:
            stats.update({
                "avg_response_time": statistics.mean(response_times),
                "min_response_time": min(response_times),
                "max_response_time": max(response_times),
                "median_response_time": statistics.median(response_times),
                "p95_response_time": self.percentile(response_times, 95),
                "p99_response_time": self.percentile(response_times, 99),
            })
        
        return stats
    
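    def percentile(self, data: List[float], percentile: float) -> float:
        """Calculate a percentile of data using the nearest-rank method"""
        sorted_data = sorted(data)
        # Nearest rank = ceil(n * p / 100); the double negation is a ceiling
        # that avoids importing math
        rank = -int(-len(sorted_data) * percentile // 100)
        return sorted_data[min(max(rank, 1), len(sorted_data)) - 1]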
    
    def print_results(self, stats: Dict[str, Any]):
        """Print formatted test results"""
        print("\n" + "="*60)
        print("LOAD TEST RESULTS")
        print("="*60)
        
        print(f"Total Requests:      {stats['total_requests']}")
        print(f"Successful:          {stats['successful_requests']}")
        print(f"Failed:              {stats['failed_requests']}")
        print(f"Success Rate:        {stats['success_rate']:.1f}%")
        print(f"Total Time:          {stats['total_time']:.2f}s")
        print(f"Requests/Second:     {stats['requests_per_second']:.2f}")
        print(f"Total Tokens:        {stats['total_tokens']}")
        print(f"Tokens/Second:       {stats['tokens_per_second']:.2f}")
        
        if "avg_response_time" in stats:
            print("\nResponse Time Statistics:")
            print(f"Average:             {stats['avg_response_time']:.3f}s")
            print(f"Minimum:             {stats['min_response_time']:.3f}s")
            print(f"Maximum:             {stats['max_response_time']:.3f}s")
            print(f"Median:              {stats['median_response_time']:.3f}s")
            print(f"95th Percentile:     {stats['p95_response_time']:.3f}s")
            print(f"99th Percentile:     {stats['p99_response_time']:.3f}s")
        
        # Performance assessment
        print("\nPerformance Assessment:")
        if stats['success_rate'] >= 99:
            print("✅ Excellent reliability")
        elif stats['success_rate'] >= 95:
            print("✅ Good reliability")
        elif stats['success_rate'] >= 90:
            print("⚠️  Acceptable reliability")
        else:
            print("❌ Poor reliability - needs improvement")
        
        if "avg_response_time" in stats:
            if stats['avg_response_time'] <= 1.0:
                print("✅ Excellent response time")
            elif stats['avg_response_time'] <= 5.0:
                print("✅ Good response time")
            elif stats['avg_response_time'] <= 15.0:
                print("⚠️  Acceptable response time")
            else:
                print("❌ Poor response time - needs optimization")

def main():
    """Main load testing function"""
    import argparse
    
    parser = argparse.ArgumentParser(description="Load test Formation model API")
    parser.add_argument("--url", default="http://localhost:8080", help="Model API base URL")
    parser.add_argument("--requests", type=int, default=100, help="Total number of requests")
    parser.add_argument("--concurrent", type=int, default=10, help="Number of concurrent users")
    parser.add_argument("--max-tokens", type=int, default=100, help="Max tokens per request")
    
    args = parser.parse_args()
    
    # Test payload
    payload = {
        "model": "test-model",
        "messages": [
            {"role": "user", "content": "Generate a short response about artificial intelligence and its applications."}
        ],
        "max_tokens": args.max_tokens,
        "temperature": 0.7
    }
    
    # Initialize load tester
    tester = LoadTester(args.url)
    
    # Run load test
    stats = tester.run_load_test(
        num_requests=args.requests,
        concurrent_users=args.concurrent,
        payload=payload
    )
    
    # Print results
    tester.print_results(stats)
    
    # Return exit code based on performance
    if stats['success_rate'] >= 95 and stats.get('avg_response_time', 0) <= 30:
        print("\n🎉 Load test passed! Model is ready for production.")
        return 0
    else:
        print("\n⚠️  Load test indicates performance issues. Consider optimization.")
        return 1

if __name__ == "__main__":
    import sys
    sys.exit(main())
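
The load test reports p95 and p99 alongside the mean because a few slow requests can hide behind a healthy-looking average. As a worked example, the sketch below uses the nearest-rank definition (the smallest sample with at least p percent of samples at or below it) on ten made-up latencies. The function name and the sample values are illustrative assumptions, not measurements from a Formation deployment.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Return the nearest-rank percentile of samples for 0 < pct <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Ten hypothetical response times in seconds; one request is a slow outlier
latencies = [0.8, 1.1, 0.9, 4.0, 1.0, 1.2, 0.7, 1.3, 1.0, 0.9]

p50 = nearest_rank_percentile(latencies, 50)  # rank 5 of 10
p90 = nearest_rank_percentile(latencies, 90)  # rank 9 of 10
p95 = nearest_rank_percentile(latencies, 95)  # rank ceil(9.5) = 10 of 10
print(p50, p90, p95)
```

With only ten samples, p95 lands on the slowest request, so percentile figures from short runs should be read with caution; larger `--requests` values give more stable tail estimates.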

4.4 Integration Testing with Formation Network

Formation Integration Test

#!/bin/bash
# integration_test.sh - Test model deployment on Formation network

set -e

echo "🚀 Starting Formation model integration test..."

# Configuration
MODEL_ID="my-test-model-v1"
FORM_STATE_URL="http://localhost:3004"
BUILD_DIR="./build"

# Step 1: Build model container
echo "📦 Building model container..."
docker build -t "$MODEL_ID:latest" .

# Step 2: Test container locally
echo "🧪 Testing container locally..."
docker run -d --name test-model -p 8080:8080 "$MODEL_ID:latest"

# Remove the test container on exit, even if a later step fails under `set -e`
trap 'docker rm -f test-model >/dev/null 2>&1 || true' EXIT

# Wait for model to start
sleep 30

# Run API tests
python test_model_api.py --url http://localhost:8080

# Stop test container
docker stop test-model
docker rm test-model

# Step 3: Build with form-pack
echo "📦 Building with form-pack..."
form pack build

# Step 4: Deploy to Formation
echo "🚀 Deploying to Formation..."
form pack ship

# Step 5: Test deployed model
echo "🧪 Testing deployed model..."
# Get deployment IP from form-state
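# `// empty` turns a missing or null IP into an empty string, and -f makes
# curl produce no body on HTTP errors, so both cases fall through to the else branch
DEPLOYMENT_IP=$(curl -sf "$FORM_STATE_URL/models/$MODEL_ID" | jq -r '.deployment.ip // empty')

if [ -n "$DEPLOYMENT_IP" ]; then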
    echo "Testing deployed model at $DEPLOYMENT_IP..."
    python test_model_api.py --url http://$DEPLOYMENT_IP:8080
    
    # Run load test on deployed model
    python load_test.py --url http://$DEPLOYMENT_IP:8080 --requests 50 --concurrent 5
    
    echo "✅ Integration test completed successfully!"
else
    echo "❌ Failed to get deployment IP"
    exit 1
fi
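
The script above reads the deployment address from the form-state response with `jq`. When the same lookup is done from Python, it is worth validating the value before building a URL from it. The sketch below assumes the response has the `deployment.ip` shape implied by the script's `jq` path; that shape is not verified against the form-state API here, and `extract_deployment_ip` is a hypothetical helper name.

```python
import ipaddress
import json


def extract_deployment_ip(response_text):
    """Return the deployment IP as a string, or None if absent or malformed."""
    try:
        data = json.loads(response_text)
    except (json.JSONDecodeError, TypeError):
        return None
    # Assumed shape: {"deployment": {"ip": "<address>"}}
    deployment = data.get("deployment") if isinstance(data, dict) else None
    ip = deployment.get("ip") if isinstance(deployment, dict) else None
    if not isinstance(ip, str):
        return None
    try:
        ipaddress.ip_address(ip)  # rejects empty strings and non-addresses
    except ValueError:
        return None
    return ip


print(extract_deployment_ip('{"deployment": {"ip": "10.0.0.12"}}'))
print(extract_deployment_ip('{"deployment": {"ip": null}}'))
```

Returning `None` for every failure mode keeps the caller's check to a single `if ip is None` branch, which mirrors the script's single failure path.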

Step 5: Deployment Monitoring and Optimization

5.1 Monitoring Setup

Health Monitoring Script

#!/usr/bin/env python3
"""
Model deployment monitoring script
"""

import time
import json
import requests
import logging
from typing import Dict, Any, List

class ModelMonitor:
    def __init__(self, model_endpoints: List[str]):
        self.endpoints = model_endpoints
        self.logger = self._setup_logging()
        
    def _setup_logging(self):
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler('model_monitor.log'),
                logging.StreamHandler()
            ]
        )
        return logging.getLogger(__name__)
    
    def check_health(self, endpoint: str) -> Dict[str, Any]:
        """Check health of a single endpoint"""
        try:
            start_time = time.time()
            response = requests.get(f"{endpoint}/health", timeout=10)
            response_time = time.time() - start_time
            
            if response.status_code == 200:
                data = response.json()
                return {
                    "endpoint": endpoint,
                    "status": "healthy",
                    "response_time": response_time,
                    "model_status": data.get("status", "unknown"),
                    "model_loaded": data.get("model_loaded", False),
                    "memory_usage": data.get("memory_usage", {}),
                    "timestamp": time.time()
                }
            else:
                return {
                    "endpoint": endpoint,
                    "status": "unhealthy",
                    "response_time": response_time,
                    "error": f"HTTP {response.status_code}",
                    "timestamp": time.time()
                }
                
        except Exception as e:
            return {
                "endpoint": endpoint,
                "status": "error",
                "error": str(e),
                "timestamp": time.time()
            }
    
    def monitor_continuously(self, interval: int = 60):
        """Monitor endpoints continuously"""
        self.logger.info(f"Starting continuous monitoring of {len(self.endpoints)} endpoints")
        
        while True:
            try:
                for endpoint in self.endpoints:
                    health = self.check_health(endpoint)
                    
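                    if health["status"] == "healthy" and health["model_status"] == "healthy":
                        self.logger.info(f"✅ {endpoint}: {health['model_status']} ({health['response_time']:.2f}s)")
                    elif health["status"] == "healthy":
                        # HTTP 200, but the model itself reports a degraded or unhealthy state
                        self.logger.warning(f"⚠️  {endpoint}: model reports {health['model_status']} ({health['response_time']:.2f}s)")
                    else:
                        self.logger.error(f"❌ {endpoint}: {health.get('error', 'unhealthy')}")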
                
                time.sleep(interval)
                
            except KeyboardInterrupt:
                self.logger.info("Monitoring stopped by user")
                break
            except Exception as e:
                self.logger.error(f"Monitoring error: {e}")
                time.sleep(interval)

if __name__ == "__main__":
    endpoints = [
        "http://model-instance-1:8080",
        "http://model-instance-2:8080",
        # Add your model endpoints here
    ]
    
    monitor = ModelMonitor(endpoints)
    monitor.monitor_continuously(interval=30)
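
The monitor above logs every failed check, which can be noisy when a single request times out. One common refinement is to alert only after several consecutive failures. The sketch below shows that bookkeeping in isolation; `FailureTracker` and the threshold of 3 are illustrative assumptions, and how an alert is delivered is left out.

```python
from collections import defaultdict


class FailureTracker:
    """Track consecutive failed health checks per endpoint."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self._streaks = defaultdict(int)

    def record(self, endpoint, healthy):
        """Record one check; return True exactly when the streak hits the threshold."""
        if healthy:
            self._streaks[endpoint] = 0  # any success resets the streak
            return False
        self._streaks[endpoint] += 1
        return self._streaks[endpoint] == self.threshold


tracker = FailureTracker(threshold=3)
checks = [False, False, True, False, False, False, False]
alerts = [tracker.record("http://model-instance-1:8080", ok) for ok in checks]
print(alerts)
```

Inside the monitoring loop, a call like `tracker.record(endpoint, health["status"] == "healthy")` would return `True` once per outage, so an alert fires when the streak reaches the threshold and not again on each later failure.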

5.2 Performance Optimization

Model Optimization Checklist

#!/usr/bin/env python3
"""
Model optimization analyzer and recommendations
"""

import psutil
import torch
import time
import json
from typing import Dict, Any, List

class ModelOptimizer:
    def __init__(self):
        self.recommendations = []
    
    def analyze_memory_usage(self) -> Dict[str, Any]:
        """Analyze current memory usage"""
        memory = psutil.virtual_memory()
        
        analysis = {
            "total_memory_gb": memory.total / (1024**3),
            "used_memory_gb": memory.used / (1024**3),
            "available_memory_gb": memory.available / (1024**3),
            "memory_percent": memory.percent
        }
        
        if memory.percent > 90:
            self.recommendations.append({
                "type": "memory",
                "severity": "high",
                "message": "Memory usage is very high (>90%). Consider model quantization or increasing instance memory."
            })
        elif memory.percent > 75:
            self.recommendations.append({
                "type": "memory", 
                "severity": "medium",
                "message": "Memory usage is high (>75%). Monitor for potential issues."
            })
        
        return analysis
    
    def analyze_gpu_usage(self) -> Dict[str, Any]:
        """Analyze GPU usage if available"""
        if not torch.cuda.is_available():
            return {"gpu_available": False}
        
        gpu_memory = torch.cuda.get_device_properties(0).total_memory
        gpu_memory_allocated = torch.cuda.memory_allocated(0)
        gpu_memory_cached = torch.cuda.memory_reserved(0)
        
        analysis = {
            "gpu_available": True,
            "gpu_name": torch.cuda.get_device_name(0),
            "total_vram_gb": gpu_memory / (1024**3),
            "allocated_vram_gb": gpu_memory_allocated / (1024**3),
            "cached_vram_gb": gpu_memory_cached / (1024**3),
            "vram_utilization": gpu_memory_allocated / gpu_memory * 100
        }
        
        if analysis["vram_utilization"] > 90:
            self.recommendations.append({
                "type": "gpu",
                "severity": "high", 
                "message": "GPU memory usage is very high (>90%). Consider model quantization or batch size reduction."
            })
        
        return analysis
    
    def analyze_model_performance(self, model_path: str = None) -> Dict[str, Any]:
        """Analyze model performance characteristics"""
        # This would analyze your specific model
        # For demonstration, we'll use placeholder values
        
        analysis = {
            "model_size_gb": 14.0,  # Example: 7B parameter model in FP16
            "estimated_parameters": 7e9,
            "precision": "fp16",
            "quantization_possible": True,
            "batch_processing_capable": True
        }
        
        # Recommendations based on model characteristics
        if analysis["model_size_gb"] > 10:
            self.recommendations.append({
                "type": "model",
                "severity": "medium",
                "message": "Large model detected. Consider quantization (INT8/INT4) to reduce memory usage."
            })
        
        if not analysis["batch_processing_capable"]:
            self.recommendations.append({
                "type": "performance",
                "severity": "medium",
                "message": "Model doesn't support batch processing. Consider implementing batching for better throughput."
            })
        
        return analysis
    
    def benchmark_inference_speed(self, endpoint: str = "http://localhost:8080") -> Dict[str, Any]:
        """Benchmark inference speed"""
        import requests
        
        test_payload = {
            "model": "test-model",
            "messages": [{"role": "user", "content": "Hello"}],
            "max_tokens": 100
        }
        
        response_times = []
        
        # Warm up; request errors are ignored here
        for _ in range(3):
            try:
                requests.post(f"{endpoint}/v1/chat/completions", json=test_payload, timeout=30)
            except requests.RequestException:
                pass
        
        # Benchmark; perf_counter() is a monotonic clock suited to measuring intervals
        for _ in range(10):
            try:
                start_time = time.perf_counter()
                response = requests.post(f"{endpoint}/v1/chat/completions", json=test_payload, timeout=30)
                end_time = time.perf_counter()
                
                if response.status_code == 200:
                    response_times.append(end_time - start_time)
            except requests.RequestException:
                continue
        
        if response_times:
            avg_time = sum(response_times) / len(response_times)
            analysis = {
                "avg_response_time": avg_time,
                "min_response_time": min(response_times),
                "max_response_time": max(response_times),
                "samples": len(response_times)
            }
            
            if avg_time > 10:
                self.recommendations.append({
                    "type": "performance",
                    "severity": "high",
                    "message": f"Average response time is high ({avg_time:.2f}s). Consider model optimization or hardware upgrade."
                })
            elif avg_time > 5:
                self.recommendations.append({
                    "type": "performance",
                    "severity": "medium", 
                    "message": f"Response time could be improved ({avg_time:.2f}s). Consider optimization techniques."
                })
            
            return analysis
        else:
            return {"error": "Could not benchmark inference speed"}
    
    def generate_optimization_report(self) -> Dict[str, Any]:
        """Generate comprehensive optimization report"""
        print("🔍 Analyzing model deployment for optimization opportunities...")
        
        report = {
            "timestamp": time.time(),
            "memory_analysis": self.analyze_memory_usage(),
            "gpu_analysis": self.analyze_gpu_usage(),
            "model_analysis": self.analyze_model_performance(),
            "performance_benchmark": self.benchmark_inference_speed(),
            "recommendations": self.recommendations
        }
        
        return report
    
    def print_report(self, report: Dict[str, Any]):
        """Print formatted optimization report"""
        print("\n" + "="*60)
        print("MODEL OPTIMIZATION REPORT")
        print("="*60)
        
        # Memory Analysis
        memory = report["memory_analysis"]
        print(f"\n📊 Memory Analysis:")
        print(f"   Total Memory:     {memory['total_memory_gb']:.1f} GB")
        print(f"   Used Memory:      {memory['used_memory_gb']:.1f} GB ({memory['memory_percent']:.1f}%)")
        print(f"   Available Memory: {memory['available_memory_gb']:.1f} GB")
        
        # GPU Analysis
        gpu = report["gpu_analysis"]
        if gpu["gpu_available"]:
            print(f"\n🎮 GPU Analysis:")
            print(f"   GPU:              {gpu['gpu_name']}")
            print(f"   Total VRAM:       {gpu['total_vram_gb']:.1f} GB")
            print(f"   Allocated VRAM:   {gpu['allocated_vram_gb']:.1f} GB ({gpu['vram_utilization']:.1f}%)")
            print(f"   Cached VRAM:      {gpu['cached_vram_gb']:.1f} GB")
        else:
            print(f"\n🎮 GPU Analysis: No GPU available")
        
        # Model Analysis
        model = report["model_analysis"]
        print(f"\n🤖 Model Analysis:")
        print(f"   Model Size:       {model['model_size_gb']:.1f} GB")
        print(f"   Parameters:       {model['estimated_parameters']:.0e}")
        print(f"   Precision:        {model['precision']}")
        print(f"   Quantization:     {'Possible' if model['quantization_possible'] else 'Not applicable'}")
        
        # Performance Benchmark
        perf = report["performance_benchmark"]
        if "avg_response_time" in perf:
            print(f"\n⚡ Performance Benchmark:")
            print(f"   Avg Response Time: {perf['avg_response_time']:.2f}s")
            print(f"   Min Response Time: {perf['min_response_time']:.2f}s")
            print(f"   Max Response Time: {perf['max_response_time']:.2f}s")
            print(f"   Samples:          {perf['samples']}")
        
        # Recommendations
        recommendations = report["recommendations"]
        if recommendations:
            print(f"\n💡 Optimization Recommendations:")
            for i, rec in enumerate(recommendations, 1):
                severity_icon = {"high": "🔴", "medium": "🟡", "low": "🟢"}.get(rec["severity"], "ℹ️")
                print(f"   {i}. {severity_icon} [{rec['type'].upper()}] {rec['message']}")
        else:
            print(f"\n✅ No optimization recommendations - model is well optimized!")

def main():
    """Main optimization analysis function"""
    optimizer = ModelOptimizer()
    report = optimizer.generate_optimization_report()
    optimizer.print_report(report)
    
    # Save report to file
    with open("optimization_report.json", "w") as f:
        json.dump(report, f, indent=2)
    
    print(f"\n📄 Full report saved to optimization_report.json")

if __name__ == "__main__":
    main()
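
The `model_size_gb` placeholder in `analyze_model_performance` (14 GB for a 7B-parameter FP16 model) is simply parameter count times bytes per parameter. The sketch below shows that arithmetic and how quantization changes it. It assumes weights dominate memory and ignores activations, KV cache, and runtime overhead; `estimate_model_size_gb` and `BYTES_PER_PARAM` are hypothetical names used only for illustration.

```python
# Hypothetical helper: weight-only size estimate.
# Ignores activations, KV cache, and framework overhead.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_model_size_gb(num_parameters: float, precision: str) -> float:
    """Return approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"Unknown precision: {precision}")
    return num_parameters * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        size = estimate_model_size_gb(7e9, precision)
        print(f"7B parameters @ {precision}: {size:.1f} GB")
```

Treat the result as a lower bound when sizing VRAM: real deployments need headroom on top of the weights.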

Deployment Checklist

Pre-Deployment Validation

Model Container Built: Docker container builds successfully
API Endpoints Working: All required endpoints respond correctly
Resource Requirements Calculated: Accurate resource specifications defined
Local Testing Passed: Comprehensive API tests pass locally
Load Testing Completed: Model handles expected concurrent load
Error Handling Verified: Proper error responses for all failure cases
Health Checks Working: Health endpoint returns accurate status
Usage Metrics Implemented: Accurate token counting and resource tracking
Security Configured: Non-root user, minimal privileges, secure defaults

Registration Validation

Model Metadata Complete: All required fields populated accurately
Formfile Created: Valid Formfile with correct resource specifications
ECDSA Keys Ready: Private key available for signing requests
form-state Accessible: Can connect to form-state service
Registration Successful: Model registered without errors
Registration Verified: Can retrieve model details from form-state

Deployment Validation

form-pack Build Successful: Package builds without errors
VM Deployment Working: Instance deploys and starts correctly
Network Connectivity: Model accessible via formnet
Health Monitoring Active: Health checks passing consistently
Performance Acceptable: Response times meet requirements
Billing Integration Working: Usage metrics being tracked correctly

Post-Deployment Monitoring

Continuous Health Monitoring: Automated health checks running
Performance Monitoring: Response time and throughput tracking
Resource Monitoring: Memory and CPU usage tracking
Error Rate Monitoring: Error rates within acceptable limits
Usage Analytics: Request patterns and user behavior tracking
Cost Monitoring: Resource costs and revenue tracking
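
As one way to approach the error-rate and response-time items above, the following sketch keeps a rolling window of recent requests and summarizes them. It is an illustration only: `RollingStats` is a hypothetical class, the window size is arbitrary, and a production setup would more likely export these numbers to a dedicated monitoring system.

```python
from collections import deque
from statistics import quantiles


class RollingStats:
    """Keep the most recent `window` request outcomes and summarize them."""

    def __init__(self, window: int = 1000):
        # Each sample is a (latency_seconds, ok) pair; old samples fall off automatically.
        self.samples = deque(maxlen=window)

    def record(self, latency_seconds: float, ok: bool) -> None:
        self.samples.append((latency_seconds, ok))

    def error_rate(self) -> float:
        """Fraction of requests in the window that failed (0.0 if empty)."""
        if not self.samples:
            return 0.0
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples)

    def p95_latency(self) -> float:
        """Approximate 95th-percentile latency over the window."""
        latencies = [latency for latency, _ in self.samples]
        if len(latencies) < 2:
            return latencies[0] if latencies else 0.0
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        return quantiles(latencies, n=20)[-1]


if __name__ == "__main__":
    stats = RollingStats(window=100)
    for latency in (0.4, 0.5, 0.6, 3.0):
        stats.record(latency, ok=latency < 2.0)
    print(f"error rate: {stats.error_rate():.2%}, p95: {stats.p95_latency():.2f}s")
```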

Troubleshooting Common Issues

Container Issues

Problem: Container fails to start

# Check container logs
docker logs <container-id>

# Check resource usage
docker stats <container-id>

# Open a shell in the image to inspect it locally
docker run -it --entrypoint /bin/bash <image-name>

Problem: Model loading fails

# Add detailed logging to model loading
import logging
logging.basicConfig(level=logging.DEBUG)

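# Check model file integrity
import hashlib
def verify_model_checksum(file_path, expected_checksum):
    # Hash in 1 MiB chunks so multi-gigabyte weight files are not read into memory at once
    sha256 = hashlib.sha256()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            sha256.update(chunk)
    return sha256.hexdigest() == expected_checksum.lower()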

API Issues

Problem: Endpoints not responding

# Check if the service is listening on the correct port (use `ss -tlnp` if netstat is not installed)
netstat -tlnp | grep 8080

# Test endpoints directly
curl -v http://localhost:8080/health
curl -v http://localhost:8080/v1/models

Problem: Invalid response format

# Validate response against OpenAI schema
import jsonschema

openai_chat_schema = {
    "type": "object",
    "required": ["id", "object", "created", "model", "choices", "usage"],
    "properties": {
        "id": {"type": "string"},
        "object": {"type": "string", "enum": ["chat.completion"]},
        "created": {"type": "integer"},
        "model": {"type": "string"},
        "choices": {"type": "array"},
        "usage": {"type": "object"}
    }
}

# Validate your response
jsonschema.validate(response_data, openai_chat_schema)
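
The schema above only checks top-level fields. The sketch below goes one level deeper using only the standard library: it checks that each choice carries a `message` with `role` and `content` plus a `finish_reason`, and that the `usage` token counts are integers that add up. `check_chat_completion` is a hypothetical helper, and the checks cover only the commonly used fields of a non-streaming chat completion, not the full OpenAI specification.

```python
from typing import Any, Dict, List


def check_chat_completion(resp: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems: List[str] = []

    choices = resp.get("choices")
    if not isinstance(choices, list) or not choices:
        problems.append("choices must be a non-empty list")
    else:
        for i, choice in enumerate(choices):
            if not isinstance(choice, dict):
                problems.append(f"choices[{i}] must be an object")
                continue
            message = choice.get("message")
            if not isinstance(message, dict) or "role" not in message or "content" not in message:
                problems.append(f"choices[{i}].message needs 'role' and 'content'")
            if "finish_reason" not in choice:
                problems.append(f"choices[{i}] is missing finish_reason")

    usage = resp.get("usage")
    token_keys = ("prompt_tokens", "completion_tokens", "total_tokens")
    if not isinstance(usage, dict) or not all(isinstance(usage.get(k), int) for k in token_keys):
        problems.append("usage needs integer prompt_tokens, completion_tokens, total_tokens")
    elif usage["total_tokens"] != usage["prompt_tokens"] + usage["completion_tokens"]:
        problems.append("usage.total_tokens != prompt_tokens + completion_tokens")

    return problems


if __name__ == "__main__":
    sample = {
        "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hi"},
                     "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 5, "completion_tokens": 1, "total_tokens": 6},
    }
    print(check_chat_completion(sample) or "response looks well-formed")
```

Returning a list of problems rather than raising on the first one makes it easier to fix several formatting issues in a single pass.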

Performance Issues

Problem: High response times

# Profile your inference code
import cProfile
import pstats

def profile_inference():
    # Your inference code here
    pass

cProfile.run('profile_inference()', 'profile_stats')
stats = pstats.Stats('profile_stats')
stats.sort_stats('cumulative').print_stats(10)
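
`cProfile.run` with a string works at module level, but inside a running server it is often easier to wrap a single call. The following sketch shows one way to do that with `cProfile.Profile`; `profile_call` and `fake_inference` are hypothetical names, and `fake_inference` merely stands in for your real inference function.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top_n: int = 10, **kwargs):
    """Run func under cProfile and return (result, text report of the top_n entries)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()

    # Write the report to a string so it can be logged instead of printed.
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(top_n)
    return result, buffer.getvalue()


def fake_inference(n: int) -> int:
    """Stand-in workload; replace with your model's inference call."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(fake_inference, 100_000)
    print(report)
```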

Problem: Memory leaks

# Monitor memory usage (tracemalloc tracks Python-level allocations only, not GPU memory)
import tracemalloc
import gc

tracemalloc.start()

# Your code here

current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 1024 / 1024:.1f} MB")
print(f"Peak memory usage: {peak / 1024 / 1024:.1f} MB")

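# Force garbage collection, then re-check: memory that stays allocated may indicate a leak
gc.collect()
current, _ = tracemalloc.get_traced_memory()
print(f"After gc.collect(): {current / 1024 / 1024:.1f} MB")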
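
Totals show that memory is growing but not where. One common approach, sketched below, is to compare two `tracemalloc` snapshots taken before and after a workload and list the source lines with the largest growth. `top_allocation_growth`, `leaky_request`, and `_leaky_cache` are hypothetical names; the "leak" is simulated with a module-level list that is never cleared.

```python
import tracemalloc


def top_allocation_growth(workload, top_n: int = 5):
    """Run workload() and return the top_n source lines by allocation growth."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to returns StatisticDiff entries sorted by largest absolute size change.
    return after.compare_to(before, "lineno")[:top_n]


_leaky_cache = []  # simulated leak: grows on every request and is never cleared


def leaky_request() -> None:
    _leaky_cache.append(bytearray(1024 * 1024))  # retains about 1 MiB per call


if __name__ == "__main__":
    for stat in top_allocation_growth(lambda: [leaky_request() for _ in range(10)]):
        print(stat)
```

In a real service, the workload would be a batch of representative inference requests; lines that keep growing across repeated runs are the ones to inspect.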

Registration Issues

Problem: Authentication failures

# Verify ECDSA signature generation
from eth_account import Account
from eth_account.messages import encode_defunct

def test_signature():
    private_key = "0x..."
    account = Account.from_key(private_key)
    
    message = "test_message"
    message_hash = encode_defunct(text=message)
    signed_message = account.sign_message(message_hash)
    
    print(f"Address: {account.address}")
    print(f"Signature: {signed_message.signature.hex()}")
    print(f"Recovery ID: {signed_message.v - 27}")

Problem: Model registration rejected

# Check that form-state is reachable and healthy
curl http://localhost:3004/health

# Verify model metadata format
python -m json.tool model_metadata.json

# Test with minimal metadata first
curl -X POST http://localhost:3004/models/create \
  -H "Content-Type: application/json" \
  -H "Authorization: Signature <sig>.<recovery>.<message>" \
  -d '{"model_id": "test", "name": "Test Model", "owner_id": "0x..."}'
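
Before sending the request, it can help to check the metadata file locally. The sketch below verifies only the three fields used in the minimal request above (`model_id`, `name`, `owner_id`); whether form-state requires additional fields is not covered here, and `missing_metadata_fields` is a hypothetical helper rather than part of any Formation tool.

```python
import json
from typing import List

# Fields taken from the minimal registration request shown above.
MINIMAL_FIELDS = ("model_id", "name", "owner_id")


def missing_metadata_fields(raw_json: str) -> List[str]:
    """Return the minimal fields that are absent or not non-empty strings."""
    data = json.loads(raw_json)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("metadata must be a JSON object")
    return [key for key in MINIMAL_FIELDS
            if not isinstance(data.get(key), str) or not data[key]]


if __name__ == "__main__":
    example = '{"model_id": "test", "name": "Test Model"}'
    print("missing:", missing_metadata_fields(example))
```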

Next Steps

Once your model is successfully deployed:

Monitor Performance: Set up comprehensive monitoring and alerting
Configure Pricing: Set up pricing and billing for your model
Scale Deployment: Learn how to scale your model across multiple instances
Optimize Performance: Advanced optimization techniques for better performance

Support and Resources

Documentation

Model Requirements: Technical specifications and API requirements
Model Examples: Complete working deployment examples
Troubleshooting Guide: Common issues and solutions

Tools and APIs

form-pack: Container building and packaging
form-state API: Model registration and management
form-vmm API: Instance deployment and monitoring

Community

Formation Discord: Get help from the community
GitHub Issues: Report bugs and request features
Developer Forum: Share experiences and best practices

Ready to monetize? Continue to the Monetization Guide to configure pricing and billing for your deployed model! 💰