AWS SageMaker

Amazon SageMaker is a fully managed machine learning service that enables you to deploy custom models on dedicated infrastructure with full control over the deployment environment.

Key Features

  • Custom Model Support: Deploy any model from HuggingFace, local files, or S3
  • Dedicated Infrastructure: Predictable performance with dedicated GPU/CPU instances
  • Full Control: Complete control over instance type, scaling, and configuration
  • JumpStart Catalog: Access to pre-trained models and solutions
  • Auto-Scaling: Automatic scaling based on traffic patterns

Available Models

SageMaker JumpStart

Access hundreds of pre-trained models from the JumpStart catalog:

  • Llama Models: Meta's Llama 2 (7B, 13B, 70B) and Llama 3 (8B, 70B)
  • Falcon Models: TII's Falcon models optimized for inference
  • Mistral Models: Mistral AI's open-source models
  • BLOOM: BigScience's multilingual model
  • Stable Diffusion: Image generation models
  • Domain-Specific: Financial, healthcare, and legal models

Custom Models

Deploy your own models by providing:

  • Model artifacts from S3
  • Inference container image (PyTorch, TensorFlow, custom)
  • Inference script for request handling (see the handler sketch below)
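
As a rough illustration, the handler below follows the conventions of the SageMaker PyTorch inference toolkit, which looks for model_fn, input_fn, predict_fn, and output_fn in your inference script. The file name and model details are hypothetical, not Staque IO specifics.

# inference.py - illustrative handler skeleton for the SageMaker PyTorch
# inference containers. The model file and payload shape are assumptions.
import json
import torch

def model_fn(model_dir):
    # Load the artifacts that SageMaker extracts from model.tar.gz.
    model = torch.jit.load(f"{model_dir}/model.pt", map_location="cpu")
    model.eval()
    return model

def input_fn(request_body, content_type):
    # Deserialize the incoming request payload.
    if content_type == "application/json":
        return json.loads(request_body)
    raise ValueError(f"Unsupported content type: {content_type}")

def predict_fn(data, model):
    # Run inference on the deserialized input.
    with torch.no_grad():
        return model(torch.tensor(data["inputs"])).tolist()

def output_fn(prediction, accept):
    # Serialize the prediction for the response.
    return json.dumps({"outputs": prediction})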

How It Works in Staque IO

1. Model Selection (JumpStart or custom)
   ↓
2. Infrastructure Configuration (instance type, VPC, IAM)
   ↓
3. Deployment (5-10 minutes provisioning time)
   ↓
4. Endpoint Creation (Model → Config → Endpoint)
   ↓
5. Ready for Inference
   ↓
6. Ongoing: Hourly charges apply while the endpoint is active

Deployment Process

When you deploy a SageMaker model through Staque IO, the platform performs the following steps (sketched in code after the list):

  1. Creates a Model: Registers the model with SageMaker
  2. Creates Endpoint Configuration: Defines instance type and scaling
  3. Creates Endpoint: Provisions the infrastructure
  4. Waits for InService: Typically takes 5-10 minutes
  5. Enables Inference: Endpoint is ready to receive requests
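
For reference, here is a minimal boto3 sketch of the same Model → EndpointConfig → Endpoint sequence. The names, image URI, and S3 path are placeholders, not Staque IO internals.

# Minimal boto3 sketch of the three-step deployment sequence.
import boto3

sm = boto3.client("sagemaker", region_name="eu-north-1")

# 1. Register the model (container image + artifacts).
sm.create_model(
    ModelName="my-model",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    PrimaryContainer={
        "Image": "763104351884.dkr.ecr.eu-north-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz",
    },
)

# 2. Define the instance type and initial scale.
sm.create_endpoint_config(
    EndpointConfigName="my-model-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",
        "InstanceType": "ml.g4dn.xlarge",
        "InitialInstanceCount": 1,
    }],
)

# 3. Provision the endpoint and block until it reaches InService
#    (typically 5-10 minutes).
sm.create_endpoint(
    EndpointName="my-llama-endpoint",
    EndpointConfigName="my-model-config",
)
sm.get_waiter("endpoint_in_service").wait(EndpointName="my-llama-endpoint")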

Automatic Resource Management

Staque IO automatically handles:

  • Name sanitization (AWS resource naming requirements; sketched below)
  • Region-specific inference image selection
  • VPC configuration for secure deployments
  • IAM role assignment for execution
  • Status monitoring and health checks
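
To show what the name-sanitization step involves: SageMaker endpoint names allow only alphanumerics and hyphens, must start with an alphanumeric character, and are capped at 63 characters. The function below is an illustrative sketch, not Staque IO's actual implementation.

# Illustrative sanitizer for AWS/SageMaker resource names.
import re

def sanitize_name(raw: str, max_len: int = 63) -> str:
    # Replace runs of disallowed characters with a single hyphen,
    # then trim leading/trailing hyphens and enforce the length cap.
    name = re.sub(r"[^a-zA-Z0-9]+", "-", raw).strip("-")
    return name[:max_len].rstrip("-")

print(sanitize_name("meta/llama_2_7b chat"))  # -> "meta-llama-2-7b-chat"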

Instance Types

SageMaker offers various instance types optimized for different workloads:

GPU Instances (Recommended for LLMs)

Instance Type   | GPU            | GPU Memory | Best For            | Est. Cost/hr
----------------|----------------|------------|---------------------|-------------
ml.g4dn.xlarge  | 1x NVIDIA T4   | 16 GB      | Small models (7B)   | $0.85
ml.g5.xlarge    | 1x NVIDIA A10G | 24 GB      | Medium models (13B) | $1.41
ml.g5.12xlarge  | 4x NVIDIA A10G | 96 GB      | Large models (70B)  | $7.09
ml.p4d.24xlarge | 8x NVIDIA A100 | 320 GB     | Very large models   | $32.77

CPU Instances

For smaller models or non-LLM workloads:

  • ml.m5.xlarge - General purpose, $0.23/hr
  • ml.c5.2xlarge - Compute optimized, $0.41/hr
  • ml.r5.xlarge - Memory optimized, $0.30/hr

Pricing

Hourly Instance Costs

SageMaker charges are based on:

  • Instance Hours: Charged per second, minimum 60 seconds
  • Always Running: Costs accrue whether the endpoint is used or not
  • Data Transfer: Minimal charges for requests/responses
  • Storage: S3 storage for model artifacts

⚠️ Important: Unlike Bedrock, SageMaker incurs costs even when idle. Delete endpoints when not in use to avoid unnecessary charges.

Cost Example

ml.g4dn.xlarge @ $0.85/hour:
- Daily cost: $20.40 (24 hours)
- Monthly cost: $612.00 (30 days)
- With 50% uptime: $306.00/month

ml.g5.xlarge @ $1.41/hour:
- Daily cost: $33.84
- Monthly cost: $1,015.20
- With 50% uptime: $507.60/month

Configuration

Required Environment Variables

# AWS Credentials
STAQUE_AWS_REGION=eu-north-1
STAQUE_AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
STAQUE_AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# SageMaker Configuration
SAGEMAKER_EXECUTION_ROLE_ARN=arn:aws:iam::123456789012:role/SageMakerRole
SAGEMAKER_SUBNET_IDS=subnet-12345678,subnet-87654321
SAGEMAKER_SECURITY_GROUP_IDS=sg-12345678

# Optional: JWT Secret
JWT_SECRET=your-secret-key-here

IAM Role Requirements

The SageMaker execution role needs these permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateModel",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateEndpoint",
        "sagemaker:DeleteEndpoint",
        "sagemaker:DescribeEndpoint",
        "sagemaker:InvokeEndpoint",
        "s3:GetObject",
        "s3:ListBucket",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}

VPC Configuration

For security, deploy SageMaker endpoints in a VPC (a boto3 sketch follows the list):

  • Subnets: At least 2 subnets in different availability zones
  • Security Groups: Configure ingress/egress rules for model access
  • VPC Endpoints: S3 and ECR endpoints for private connectivity
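
The VPC settings are passed to CreateModel, which places the endpoint's network interfaces in your private subnets. A minimal sketch, with the IDs from the configuration above as placeholders:

# Sketch: attaching VPC settings to a SageMaker model.
import boto3

sm = boto3.client("sagemaker", region_name="eu-north-1")
sm.create_model(
    ModelName="my-model",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    PrimaryContainer={
        "Image": "763104351884.dkr.ecr.eu-north-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz",
    },
    # ENIs for the endpoint are created in these subnets/security groups.
    VpcConfig={
        "Subnets": ["subnet-12345678", "subnet-87654321"],
        "SecurityGroupIds": ["sg-12345678"],
    },
)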

Usage Examples

Deploying a JumpStart Model

POST /api/deploy/sagemaker
{
  "endpointName": "my-llama-endpoint",
  "instanceType": "ml.g4dn.xlarge",
  "modelPackageArn": "arn:aws:sagemaker:eu-north-1:...:model-package/llama-2-7b",
  "dryRun": false
}

// Response (deployment takes 5-10 minutes)
{
  "success": true,
  "message": "Endpoint creation started",
  "endpointName": "my-llama-endpoint",
  "endpoint": "https://runtime.sagemaker.eu-north-1.amazonaws.com/..."
}
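
Once the endpoint is InService, you can also call it directly through the SageMaker Runtime API. A minimal sketch, assuming a JSON-in/JSON-out serving container; the exact payload shape depends on the model:

# Invoke the deployed endpoint via the SageMaker Runtime API.
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="eu-north-1")
response = runtime.invoke_endpoint(
    EndpointName="my-llama-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "Explain VPC endpoints in one sentence."}),
)
print(json.loads(response["Body"].read()))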

Deploying a Custom Model

POST /api/deploy/sagemaker
{
  "endpointName": "my-custom-model",
  "instanceType": "ml.g5.xlarge",
  "inferenceImage": "763104351884.dkr.ecr.eu-north-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310",
  "modelDataUrl": "s3://my-bucket/model.tar.gz",
  "executionRoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
  "dryRun": false
}

Checking Endpoint Status

GET /api/resources/resource-uuid/status

// Response
{
  "success": true,
  "resource": {
    "type": "sagemaker",
    "status": "InService",  // Creating, Updating, InService, etc.
    "health": "healthy",
    "instance_type": "ml.g4dn.xlarge",
    "instance_count": 1
  },
  "metrics": {
    "response_time_ms": 342,
    "throughput_per_minute": 87,
    "cpu_utilization": 35,
    "memory_utilization": 48
  },
  "costs": {
    "hourly_cost": 0.85,
    "daily_cost": 20.40,
    "monthly_estimate": 612.00
  }
}
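
If you want to poll the underlying SageMaker status yourself, DescribeEndpoint returns the same EndpointStatus values surfaced above. A short sketch:

# Check endpoint status directly against the SageMaker API.
import boto3

sm = boto3.client("sagemaker", region_name="eu-north-1")
desc = sm.describe_endpoint(EndpointName="my-llama-endpoint")
print(desc["EndpointStatus"])           # e.g. "Creating", "InService", "Failed"
if desc["EndpointStatus"] == "Failed":
    print(desc.get("FailureReason"))    # Why provisioning failed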

Deleting an Endpoint

POST /api/resources/resource-uuid/control
{
  "action": "delete",
  "confirm": true  // Required to prevent accidental deletion
}

// Response
{
  "success": true,
  "message": "Endpoint deletion initiated",
  "action": "delete",
  "status": "deleting"
}
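
Deleting the endpoint is what stops the hourly charges, but the endpoint configuration and model registrations remain (they cost nothing by themselves). A full cleanup in boto3, assuming the resource names from the earlier sketches:

# Remove the endpoint, its configuration, and the model registration.
import boto3

sm = boto3.client("sagemaker", region_name="eu-north-1")
sm.delete_endpoint(EndpointName="my-llama-endpoint")          # stops billing
sm.delete_endpoint_config(EndpointConfigName="my-model-config")
sm.delete_model(ModelName="my-model")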

Best Practices

Instance Selection

  • 7B Models: ml.g4dn.xlarge or ml.g5.xlarge
  • 13B Models: ml.g5.2xlarge or ml.g5.4xlarge
  • 70B Models: ml.g5.12xlarge or ml.p4d.24xlarge
  • Test First: Always start with smaller instances and scale up as needed

Cost Optimization

  • Delete When Idle: Delete endpoints that aren't actively used
  • Use Auto-Scaling: Scale instance count based on traffic (see the sketch after this list)
  • Spot Instances: Use Managed Spot Training to cut training costs (Spot capacity is not offered for real-time endpoints)
  • Right-Size: Monitor utilization and downgrade if under-utilized
  • Schedule Deletion: Use automation to delete non-production endpoints after hours
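
SageMaker endpoint auto-scaling is configured through the Application Auto Scaling API. Below is a target-tracking sketch that scales on invocations per instance; the endpoint/variant names and target values are illustrative:

# Target-tracking auto-scaling for a SageMaker endpoint variant.
import boto3

aas = boto3.client("application-autoscaling", region_name="eu-north-1")
resource_id = "endpoint/my-llama-endpoint/variant/AllTraffic"

# Register the variant's instance count as a scalable target (1-4 instances).
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale out when average invocations per instance exceed the target.
aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)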

Performance Optimization

  • Model Optimization: Use quantization (INT8, FP16) to reduce memory
  • Batch Requests: Process multiple requests together when possible
  • Enable Multi-Model Endpoints: Host multiple models on one endpoint
  • Use TensorRT: For NVIDIA GPU acceleration
  • Monitor Metrics: Track latency, throughput, and resource utilization

Security Best Practices

  • Always deploy in a VPC with private subnets
  • Use VPC endpoints for S3 and ECR access
  • Implement least-privilege IAM policies
  • Enable encryption at rest and in transit
  • Use security groups to restrict network access

Troubleshooting

Endpoint Stuck in "Creating" Status

Problem: Endpoint takes longer than 15 minutes to deploy

Solution:

  • Check VPC configuration (subnets, security groups, route tables)
  • Verify ECR and S3 VPC endpoints exist
  • Check IAM role has required permissions
  • Review CloudWatch logs for specific errors

Endpoint Failed to Deploy

Problem: Status changes to "Failed"

Solution:

  • Check model artifacts are accessible in S3
  • Verify inference image exists and is accessible
  • Ensure instance type is available in your region
  • Review CloudWatch logs for error details

High Latency or Timeout

Problem: Slow response times or request timeouts

Solution:

  • Increase instance size or use GPU instances
  • Optimize model (quantization, pruning)
  • Enable auto-scaling to handle traffic spikes
  • Check network configuration (VPC, security groups)
  • Monitor endpoint metrics in CloudWatch

Unexpected Costs

Problem: Higher than expected charges

Solution:

  • Delete unused endpoints immediately
  • Use Staque IO's cost monitoring dashboard
  • Set up AWS Budgets and alerts
  • Review instance utilization and downsize if possible
  • Consider switching to Bedrock for variable workloads