AWS SageMaker
Amazon SageMaker is a fully managed machine learning service that enables you to deploy custom models on dedicated infrastructure with full control over the deployment environment.
Key Features
- ✓ Custom Model Support: Deploy any model from HuggingFace, local files, or S3
- ✓ Dedicated Infrastructure: Predictable performance with dedicated GPU/CPU instances
- ✓ Full Control: Complete control over instance type, scaling, and configuration
- ✓ JumpStart Catalog: Access to pre-trained models and solutions
- ✓ Auto-Scaling: Automatic scaling based on traffic patterns
Available Models
SageMaker JumpStart
Access hundreds of pre-trained models from the JumpStart catalog:
- Llama Models: Meta's Llama 2 (7B, 13B, 70B) and Llama 3 (8B, 70B)
- Falcon Models: TII's Falcon models optimized for inference
- Mistral Models: Mistral AI's open-source models
- BLOOM: BigScience's multilingual model
- Stable Diffusion: Image generation models
- Domain-Specific: Financial, healthcare, and legal models
Custom Models
Deploy your own models by providing:
- Model artifacts from S3
- Inference container image (PyTorch, TensorFlow, custom)
- Inference script for request handling
How It Works in Staque IO
1. Model Selection (JumpStart or custom)
   ↓
2. Infrastructure Configuration (instance type, VPC, IAM)
   ↓
3. Deployment (5-10 minutes provisioning time)
   ↓
4. Endpoint Creation (Model → Config → Endpoint)
   ↓
5. Ready for Inference
   ↓
6. Ongoing: Hourly charges apply while endpoint is active
Deployment Process
When you deploy a SageMaker model through Staque IO, the platform performs the following steps (a boto3 sketch of the equivalent AWS calls appears after the list):
- Creates a Model: Registers the model with SageMaker
- Creates Endpoint Configuration: Defines instance type and scaling
- Creates Endpoint: Provisions the infrastructure
- Waits for InService: Typically takes 5-10 minutes
- Enables Inference: Endpoint is ready to receive requests
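For reference, here is the same Model → Config → Endpoint sequence expressed as direct AWS calls. This is a minimal boto3 sketch, not Staque IO's actual implementation; the resource names, image URI, and S3 path are placeholders, and error handling is omitted.

```python
import boto3

sm = boto3.client("sagemaker", region_name="eu-north-1")

# 1. Register the model (container image + artifacts + execution role)
sm.create_model(
    ModelName="my-llama-model",
    PrimaryContainer={
        "Image": "<inference-image-uri>",  # placeholder
        "ModelDataUrl": "s3://my-bucket/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
)

# 2. Define the instance type and count
sm.create_endpoint_config(
    EndpointConfigName="my-llama-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-llama-model",
        "InstanceType": "ml.g4dn.xlarge",
        "InitialInstanceCount": 1,
    }],
)

# 3. Provision the endpoint, then block until it reaches InService
sm.create_endpoint(EndpointName="my-llama-endpoint",
                   EndpointConfigName="my-llama-config")
sm.get_waiter("endpoint_in_service").wait(EndpointName="my-llama-endpoint")
```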
Automatic Resource Management
Staque IO automatically handles:
- Name sanitization (AWS resource naming requirements; see the sketch after this list)
- Region-specific inference image selection
- VPC configuration for secure deployments
- IAM role assignment for execution
- Status monitoring and health checks
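On the first point: SageMaker endpoint names may only contain alphanumerics and hyphens and are capped at 63 characters, so arbitrary model names need cleaning before deployment. A hypothetical sanitizer along these lines (not Staque IO's actual code):

```python
import re

def sanitize_sagemaker_name(name: str, max_len: int = 63) -> str:
    """Coerce an arbitrary string into a valid SageMaker resource name:
    alphanumerics and hyphens only, no leading/trailing hyphen, <= 63 chars."""
    name = re.sub(r"[^a-zA-Z0-9-]", "-", name)  # replace invalid characters
    name = re.sub(r"-{2,}", "-", name)          # collapse runs of hyphens
    return name.strip("-")[:max_len].rstrip("-")

print(sanitize_sagemaker_name("meta/llama_2 7B (chat)"))  # meta-llama-2-7B-chat
```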
Instance Types
SageMaker offers various instance types optimized for different workloads:
GPU Instances (Recommended for LLMs)
| Instance Type | GPU | Memory | Best For | Est. Cost/hr |
|---|---|---|---|---|
| ml.g4dn.xlarge | 1x NVIDIA T4 | 16 GB | Small models (7B) | $0.85 |
| ml.g5.xlarge | 1x NVIDIA A10G | 24 GB | Medium models (13B) | $1.41 |
| ml.g5.12xlarge | 4x NVIDIA A10G | 96 GB | Large models (70B) | $7.09 |
| ml.p4d.24xlarge | 8x NVIDIA A100 | 320 GB | Very large models | $32.77 |
CPU Instances
For smaller models or non-LLM workloads:
- ml.m5.xlarge - General purpose, $0.23/hr
- ml.c5.2xlarge - Compute optimized, $0.41/hr
- ml.r5.xlarge - Memory optimized, $0.30/hr
Pricing
Hourly Instance Costs
SageMaker charges are based on:
- Instance Hours: Charged per second, minimum 60 seconds
- Always Running: Costs accrue whether the endpoint is used or not
- Data Transfer: Minimal charges for requests/responses
- Storage: S3 storage for model artifacts
⚠️ Important: Unlike Bedrock, SageMaker incurs costs even when idle. Delete endpoints when not in use to avoid unnecessary charges.
Cost Example
ml.g4dn.xlarge @ $0.85/hour:
- Daily cost: $20.40 (24 hours)
- Monthly cost: $612.00 (30 days)
- With 50% uptime: $306.00/month

ml.g5.xlarge @ $1.41/hour:
- Daily cost: $33.84
- Monthly cost: $1,015.20 (30 days)
- With 50% uptime: $507.60/month
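The same arithmetic as a small helper, for estimating other instance/uptime combinations:

```python
def monthly_cost(hourly_rate: float, uptime_fraction: float = 1.0, days: int = 30) -> float:
    """Estimated monthly endpoint cost: hourly rate x 24 hours x days x uptime."""
    return hourly_rate * 24 * days * uptime_fraction

print(monthly_cost(0.85))       # ml.g4dn.xlarge, always on -> 612.0
print(monthly_cost(1.41, 0.5))  # ml.g5.xlarge at 50% uptime -> 507.6
```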
Configuration
Required Environment Variables
# AWS Credentials
STAQUE_AWS_REGION=eu-north-1
STAQUE_AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
STAQUE_AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# SageMaker Configuration
SAGEMAKER_EXECUTION_ROLE_ARN=arn:aws:iam::123456789012:role/SageMakerRole
SAGEMAKER_SUBNET_IDS=subnet-12345678,subnet-87654321
SAGEMAKER_SECURITY_GROUP_IDS=sg-12345678

# Optional: JWT Secret
JWT_SECRET=your-secret-key-here
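How these variables are consumed is platform-internal, but a minimal sketch of turning them into an AWS session with boto3 (assuming the variable names above) looks like:

```python
import os
import boto3

# Build an AWS session from the STAQUE_-prefixed variables defined above
session = boto3.Session(
    region_name=os.environ["STAQUE_AWS_REGION"],
    aws_access_key_id=os.environ["STAQUE_AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["STAQUE_AWS_SECRET_ACCESS_KEY"],
)
sm = session.client("sagemaker")
role_arn = os.environ["SAGEMAKER_EXECUTION_ROLE_ARN"]  # used as ExecutionRoleArn
```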
IAM Role Requirements
The SageMaker execution role needs these permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateModel",
"sagemaker:CreateEndpointConfig",
"sagemaker:CreateEndpoint",
"sagemaker:DeleteEndpoint",
"sagemaker:DescribeEndpoint",
"sagemaker:InvokeEndpoint",
"s3:GetObject",
"s3:ListBucket",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "*"
}
]
}
VPC Configuration
For security, deploy SageMaker endpoints in a VPC (see the VpcConfig sketch after this list):
- Subnets: At least 2 subnets in different availability zones
- Security Groups: Configure ingress/egress rules for model access
- VPC Endpoints: S3 and ECR endpoints for private connectivity
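In the SageMaker API these settings map to the VpcConfig block passed when creating the model. A sketch using the environment variables defined earlier:

```python
import os

# VpcConfig as accepted by sagemaker.create_model(..., VpcConfig=vpc_config)
vpc_config = {
    "Subnets": os.environ["SAGEMAKER_SUBNET_IDS"].split(","),  # >= 2 AZs
    "SecurityGroupIds": os.environ["SAGEMAKER_SECURITY_GROUP_IDS"].split(","),
}
```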
Usage Examples
Deploying a JumpStart Model
POST /api/deploy/sagemaker
{
"endpointName": "my-llama-endpoint",
"instanceType": "ml.g4dn.xlarge",
"modelPackageArn": "arn:aws:sagemaker:eu-north-1:...:model-package/llama-2-7b",
"dryRun": false
}
// Response (deployment takes 5-10 minutes)
{
"success": true,
"message": "Endpoint creation started",
"endpointName": "my-llama-endpoint",
"endpoint": "https://runtime.sagemaker.eu-north-1.amazonaws.com/..."
}
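Once the endpoint is InService, it can also be invoked directly via the SageMaker runtime. A minimal boto3 sketch; note that the request payload format depends on the model's inference container, so the JSON body here is an assumption:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="eu-north-1")

response = runtime.invoke_endpoint(
    EndpointName="my-llama-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "Explain SageMaker endpoints in one sentence."}),
)
print(response["Body"].read().decode())  # container-specific response payload
```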
Deploying a Custom Model
POST /api/deploy/sagemaker
{
"endpointName": "my-custom-model",
"instanceType": "ml.g5.xlarge",
"inferenceImage": "763104351884.dkr.ecr.eu-north-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310",
"modelDataUrl": "s3://my-bucket/model.tar.gz",
"executionRoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
"dryRun": false
}
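The modelDataUrl must point at a model.tar.gz archive. For the stock PyTorch inference containers, the archive conventionally contains the weights plus a code/inference.py handler script (defining model_fn, input_fn, predict_fn, and output_fn). A sketch of packaging it, with placeholder file names:

```python
import tarfile

# Expected layout for the PyTorch inference container (file names are placeholders):
#   model.tar.gz
#   +-- model.pth           <- model weights
#   +-- code/inference.py   <- request/response handlers
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pth")
    tar.add("code/inference.py")
```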
Checking Endpoint Status
GET /api/resources/resource-uuid/status
// Response
{
"success": true,
"resource": {
"type": "sagemaker",
"status": "InService", // Creating, Updating, InService, etc.
"health": "healthy",
"instance_type": "ml.g4dn.xlarge",
"instance_count": 1
},
"metrics": {
"response_time_ms": 342,
"throughput_per_minute": 87,
"cpu_utilization": 35,
"memory_utilization": 48
},
"costs": {
"hourly_cost": 0.85,
"daily_cost": 20.40,
"monthly_estimate": 612.00
}
}
Deleting an Endpoint
POST /api/resources/resource-uuid/control
{
"action": "delete",
"confirm": true // Required to prevent accidental deletion
}
// Response
{
"success": true,
"message": "Endpoint deletion initiated",
"action": "delete",
"status": "deleting"
}
Best Practices
Instance Selection
- 7B Models: ml.g4dn.xlarge or ml.g5.xlarge
- 13B Models: ml.g5.2xlarge or ml.g5.4xlarge
- 70B Models: ml.g5.12xlarge or ml.p4d.24xlarge
- Test First: Always start with smaller instances and scale up as needed
Cost Optimization
- Delete When Idle: Delete endpoints that aren't actively used
- Use Auto-Scaling: Scale instance count based on traffic
- Spot Instances: Consider managed spot for training workloads (spot capacity is not offered for real-time inference endpoints)
- Right-Size: Monitor utilization and downgrade if under-utilized
- Schedule Deletion: Use automation to delete non-production endpoints after hours (see the sketch below)
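For the last point, a scheduled job (for example a nightly EventBridge-triggered Lambda) can sweep up non-production endpoints. A hypothetical sketch that assumes development endpoints share a dev- name prefix:

```python
import boto3

sm = boto3.client("sagemaker", region_name="eu-north-1")

def delete_dev_endpoints(prefix: str = "dev-") -> None:
    """Delete every InService endpoint whose name starts with `prefix`.
    Intended to run on a schedule (e.g. a nightly EventBridge rule)."""
    paginator = sm.get_paginator("list_endpoints")
    for page in paginator.paginate(StatusEquals="InService"):
        for ep in page["Endpoints"]:
            if ep["EndpointName"].startswith(prefix):
                sm.delete_endpoint(EndpointName=ep["EndpointName"])
```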
Performance Optimization
- Model Optimization: Use quantization (INT8, FP16) to reduce memory
- Batch Requests: Process multiple requests together when possible
- Enable Multi-Model Endpoints: Host multiple models on one endpoint
- Use TensorRT: For NVIDIA GPU acceleration
- Monitor Metrics: Track latency, throughput, and resource utilization
Security Best Practices
- Always deploy in a VPC with private subnets
- Use VPC endpoints for S3 and ECR access
- Implement least-privilege IAM policies
- Enable encryption at rest and in transit
- Use security groups to restrict network access
Troubleshooting
Endpoint Stuck in "Creating" Status
Problem: Endpoint takes longer than 15 minutes to deploy
Solution:
- Check VPC configuration (subnets, security groups, route tables)
- Verify ECR and S3 VPC endpoints exist
- Check IAM role has required permissions
- Review CloudWatch logs for specific errors (see the sketch below)
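Endpoint container logs land in the /aws/sagemaker/Endpoints/&lt;endpoint-name&gt; log group. A quick boto3 sketch for pulling error lines (the endpoint name is a placeholder):

```python
import boto3

logs = boto3.client("logs", region_name="eu-north-1")

# Endpoint container logs live under /aws/sagemaker/Endpoints/<endpoint-name>
resp = logs.filter_log_events(
    logGroupName="/aws/sagemaker/Endpoints/my-llama-endpoint",
    filterPattern="?Error ?ERROR ?Exception",
)
for event in resp["events"]:
    print(event["message"])
```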
Endpoint Failed to Deploy
Problem: Status changes to "Failed"
Solution:
- Check model artifacts are accessible in S3
- Verify inference image exists and is accessible
- Ensure instance type is available in your region
- Review CloudWatch logs for error details
High Latency or Timeout
Problem: Slow response times or request timeouts
Solution:
- Increase instance size or use GPU instances
- Optimize model (quantization, pruning)
- Enable auto-scaling to handle traffic spikes
- Check network configuration (VPC, security groups)
- Monitor endpoint metrics in CloudWatch
Unexpected Costs
Problem: Higher than expected charges
Solution:
- Delete unused endpoints immediately
- Use Staque IO's cost monitoring dashboard
- Set up AWS Budgets and alerts
- Review instance utilization and downsize if possible
- Consider switching to Bedrock for variable workloads