AWS SageMaker Deployment

Deploy custom AI models on dedicated infrastructure with AWS SageMaker.

⚠️ Important

SageMaker deployments provision real infrastructure and incur hourly costs even when idle. Endpoint creation takes 5-10 minutes. Always start with a dry run and carefully review costs.

Prerequisites

AWS Configuration

  • AWS account with SageMaker access
  • IAM execution role for SageMaker
  • VPC with subnets and security groups configured
  • S3 bucket for model artifacts (if using custom models)

Required Environment Variables

# AWS Credentials
STAQUE_AWS_REGION=eu-north-1
STAQUE_AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
STAQUE_AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# SageMaker-specific
SAGEMAKER_EXECUTION_ROLE_ARN=arn:aws:iam::123456789012:role/SageMakerExecutionRole
SAGEMAKER_SUBNET_IDS=subnet-12345678,subnet-87654321
SAGEMAKER_SECURITY_GROUP_IDS=sg-12345678
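Missing variables are a common deployment failure (see Troubleshooting below), so it can help to validate them up front. A minimal sketch, assuming the variable names listed above:

```python
import os

# Variables required before any SageMaker deployment, per the list above.
REQUIRED_VARS = [
    "STAQUE_AWS_REGION",
    "STAQUE_AWS_ACCESS_KEY_ID",
    "STAQUE_AWS_SECRET_ACCESS_KEY",
    "SAGEMAKER_EXECUTION_ROLE_ARN",
    "SAGEMAKER_SUBNET_IDS",
    "SAGEMAKER_SECURITY_GROUP_IDS",
]

def check_sagemaker_env() -> list:
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]
```

Run this check before issuing a deploy request and fail fast with the list of missing names.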

Step 1: Prepare IAM Role

Create an IAM role with the following permissions:

Trust Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Required Policies

  • AmazonSageMakerFullAccess - For SageMaker operations
  • AmazonS3ReadOnlyAccess - For reading model artifacts
  • Custom policy for ECR access (if using custom images)

Step 2: Configure VPC

Endpoints deployed through this integration run inside your VPC, so the following must be configured:

Subnets

  • Use at least 2 subnets in different availability zones
  • Ensure subnets have sufficient IP addresses
  • Private subnets recommended for security

Security Groups

  • Allow HTTPS (443) inbound from your application
  • Allow all outbound traffic for model downloads

Step 3: Choose Deployment Method

Option A: Deploy from JumpStart (Recommended)

Use pre-built models from SageMaker JumpStart:

1. List Available Models

GET /api/models/sagemaker?source=jumpstart&max=20

// Response
{
  "success": true,
  "source": "jumpstart",
  "models": [
    {
      "modelPackageArn": "arn:aws:sagemaker:...:model-package/jumpstart-llama2-7b-...",
      "displayName": "Llama 2 7B",
      "supportedRealtimeInferenceInstanceTypes": [
        "ml.g5.xlarge",
        "ml.g5.2xlarge",
        "ml.g4dn.xlarge"
      ]
    }
  ]
}
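When a model supports several instance types, you typically want the cheapest one that is available. A sketch of that selection, using assumed hourly prices (check current AWS pricing; the `HOURLY_PRICE` table here is illustrative):

```python
# Assumed on-demand hourly prices in USD; verify against current AWS pricing.
HOURLY_PRICE = {
    "ml.g4dn.xlarge": 0.74,
    "ml.g5.xlarge": 1.01,
    "ml.g5.2xlarge": 1.21,
}

def cheapest_supported(model: dict) -> str:
    """Pick the cheapest supported instance type from the listing response."""
    supported = model["supportedRealtimeInferenceInstanceTypes"]
    priced = [t for t in supported if t in HOURLY_PRICE]
    if not priced:
        raise ValueError("No known price for any of: %s" % supported)
    return min(priced, key=HOURLY_PRICE.__getitem__)
```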

2. Dry Run Deployment

POST /api/deploy/sagemaker
Content-Type: application/json

{
  "endpointName": "my-llama2-endpoint",
  "instanceType": "ml.g5.xlarge",
  "modelPackageArn": "arn:aws:sagemaker:...:model-package/jumpstart-llama2-7b-...",
  "dryRun": true
}

// Response
{
  "success": true,
  "dryRun": true,
  "plan": {
    "endpointName": "my-llama2-endpoint",
    "instanceType": "ml.g5.xlarge",
    "roleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "vpc": {
      "subnets": ["subnet-12345678", "subnet-87654321"],
      "securityGroups": ["sg-12345678"]
    }
  }
}
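The dry-run and real deploy requests differ only in the `dryRun` flag, so a small helper can build both. A sketch that defaults to dry run, mirroring the request body shown above:

```python
def build_deploy_request(endpoint_name: str, instance_type: str,
                         model_package_arn: str, dry_run: bool = True) -> dict:
    """Build the JSON body for POST /api/deploy/sagemaker.

    Defaults to a dry run so real infrastructure is never provisioned
    by accident; pass dry_run=False only after reviewing the plan.
    """
    return {
        "endpointName": endpoint_name,
        "instanceType": instance_type,
        "modelPackageArn": model_package_arn,
        "dryRun": dry_run,
    }
```

Defaulting `dry_run` to `True` makes the safe path the easy path, which matches the warning at the top of this guide.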

3. Deploy

POST /api/deploy/sagemaker
Content-Type: application/json

{
  "endpointName": "my-llama2-endpoint",
  "instanceType": "ml.g5.xlarge",
  "modelPackageArn": "arn:aws:sagemaker:...:model-package/jumpstart-llama2-7b-...",
  "dryRun": false
}

// Response
{
  "success": true,
  "message": "Endpoint creation started",
  "endpointName": "my-llama2-endpoint",
  "endpoint": "https://runtime.sagemaker.eu-north-1.amazonaws.com/endpoints/my-llama2-endpoint/invocations"
}

Option B: Deploy Custom Model

Deploy your own model from S3:

1. Prepare Model Artifacts

  • Package model files in model.tar.gz
  • Upload to S3 bucket
  • Ensure SageMaker role has read access
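The packaging step above can be sketched with the standard library. SageMaker extracts the archive and expects model files at the archive root, so add files with relative archive names rather than absolute paths:

```python
import tarfile
from pathlib import Path

def package_model(model_dir: str, output: str = "model.tar.gz") -> str:
    """Pack the contents of model_dir into a gzipped tarball.

    Files are added with paths relative to model_dir so they sit at
    the archive root, which is the layout SageMaker expects.
    """
    with tarfile.open(output, "w:gz") as tar:
        for path in sorted(Path(model_dir).rglob("*")):
            tar.add(path, arcname=path.relative_to(model_dir))
    return output
```

Upload the resulting file to S3 (for example with `aws s3 cp model.tar.gz s3://my-bucket/models/`) and use that S3 URL as `modelDataUrl` below.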

2. Deploy Custom Model

POST /api/deploy/sagemaker
Content-Type: application/json

{
  "endpointName": "my-custom-model",
  "instanceType": "ml.g4dn.xlarge",
  "inferenceImage": "763104351884.dkr.ecr.eu-north-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310",
  "modelDataUrl": "s3://my-bucket/models/model.tar.gz",
  "dryRun": false
}

Step 4: Monitor Deployment

Track the deployment status:

// Poll every 30 seconds
GET /api/resources/<resource-id>/status

// During deployment
{
  "success": true,
  "resource": {
    "status": "Creating",
    "health": "unknown"
  }
}

// When ready
{
  "success": true,
  "resource": {
    "status": "InService",
    "health": "healthy",
    "instance_type": "ml.g5.xlarge",
    "instance_count": 1
  },
  "metrics": {
    "response_time_ms": 342,
    "throughput_per_minute": 87,
    "cpu_utilization": 35
  },
  "costs": {
    "hourly_cost": 1.006,
    "daily_cost": 24.14,
    "monthly_estimate": 724.32
  }
}
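The polling described above can be wrapped in a small loop that stops on `InService`, fails on `Failed`, and gives up after a timeout. A sketch, where `fetch_status` is any callable that returns the status response shown above:

```python
import time

def wait_until_in_service(fetch_status, poll_seconds=30, timeout_seconds=900):
    """Poll fetch_status() until the endpoint reports InService.

    fetch_status wraps GET /api/resources/<resource-id>/status and
    returns the parsed JSON. Raises on a Failed status or on timeout
    (default 15 minutes, covering the typical 5-10 minute creation).
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = fetch_status()["resource"]["status"]
        if status == "InService":
            return
        if status == "Failed":
            raise RuntimeError("Deployment failed; check CloudWatch logs")
        time.sleep(poll_seconds)
    raise TimeoutError("Endpoint did not reach InService before timeout")
```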

Instance Types Guide

GPU Instances (Recommended for LLMs)

Instance Type     GPU       vCPUs   Memory   Hourly Cost
ml.g4dn.xlarge    1x T4     4       16 GB    ~$0.74/hr
ml.g5.xlarge      1x A10G   4       16 GB    ~$1.01/hr
ml.g5.2xlarge     1x A10G   8       32 GB    ~$1.21/hr
ml.p3.2xlarge     1x V100   8       61 GB    ~$3.83/hr

CPU Instances (For Smaller Models)

Instance Type     vCPUs   Memory   Hourly Cost
ml.m5.xlarge      4       16 GB    ~$0.23/hr
ml.c5.2xlarge     8       16 GB    ~$0.40/hr

Cost Management

Understanding SageMaker Costs

  • Always-on billing: Charged for every hour the endpoint is running
  • No auto-scaling by default: Fixed instance count
  • Data transfer costs: Additional charges for data in/out
  • Storage costs: S3 storage for model artifacts
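The always-on billing model makes the arithmetic simple: cost scales linearly with the hourly rate, the instance count, and the hours the endpoint exists. A sketch matching the estimates returned by the status API (which appear to use a 720-hour, i.e. 30-day, month):

```python
def endpoint_costs(hourly_rate: float, instance_count: int = 1) -> dict:
    """Estimate endpoint costs from the per-instance hourly rate.

    Uses a 720-hour (30-day) month, consistent with the hourly/daily/
    monthly figures shown in the status response above. Excludes data
    transfer and S3 storage charges.
    """
    hourly = hourly_rate * instance_count
    return {
        "hourly_cost": round(hourly, 3),
        "daily_cost": round(hourly * 24, 2),
        "monthly_estimate": round(hourly * 720, 2),
    }
```

For example, one ml.g5.xlarge at ~$1.006/hr comes to roughly $24 per day and about $724 per month if left running continuously, which is why deleting idle endpoints matters.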

Cost Optimization Strategies

  • Right-size instances: Start small and scale up if needed
  • Delete unused endpoints: Don't let idle endpoints run
  • Use Savings Plans: Commit to usage for discounts
  • Monitor utilization: Track CPU/GPU usage to optimize

Delete Endpoint When Done

POST /api/resources/<resource-id>/control
Content-Type: application/json

{
  "action": "delete",
  "confirm": true
}

// Response
{
  "success": true,
  "message": "Endpoint deletion initiated",
  "action": "delete",
  "status": "deleting"
}
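Because deletion is irreversible and the API requires an explicit `confirm` flag, a client-side guard can keep accidental deletes from ever reaching the API. A sketch of a builder for the control request body shown above:

```python
def build_control_request(action: str, confirm: bool = False) -> dict:
    """Build the body for POST /api/resources/<resource-id>/control.

    Refuses to build a delete request unless confirm is explicitly
    True, mirroring the API's own confirmation requirement.
    """
    if action == "delete" and not confirm:
        raise ValueError("Deletion is irreversible; pass confirm=True")
    body = {"action": action}
    if action == "delete":
        body["confirm"] = True
    return body
```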

Endpoint Management

Update Endpoint

Update endpoint configuration (requires creating new endpoint config):

  • Change instance type
  • Modify instance count
  • Update model version

Restart Endpoint

POST /api/resources/<resource-id>/control
Content-Type: application/json

{
  "action": "restart"
}

// Takes 5-10 minutes
{
  "success": true,
  "message": "Endpoint restart initiated",
  "action": "restart",
  "status": "updating"
}

Regional Considerations

eu-north-1 (Stockholm)

  • Limited instance type availability
  • Use older inference images for better compatibility
  • Lower costs than us-east-1

us-east-1 (N. Virginia)

  • Widest selection of instance types
  • Best for testing and development
  • Latest inference images available

Troubleshooting

Common Issues

Error: "SAGEMAKER_SUBNET_IDS must be set"

Cause: Missing VPC configuration environment variables

Solution: Set SAGEMAKER_SUBNET_IDS and SAGEMAKER_SECURITY_GROUP_IDS

Error: "ResourceLimitExceeded"

Cause: Exceeded instance quota for the instance type

Solution: Request quota increase through AWS Service Quotas console

Status: "Failed"

Cause: Various deployment failures (role permissions, VPC config, image issues)

Solution: Check CloudWatch logs for detailed error messages. Common issues: IAM role permissions, invalid VPC configuration, missing inference image in region

Slow Response Times

Cause: Instance too small for model size or high traffic

Solution: Upgrade to larger instance type or add more instances

Best Practices

Development

  • Always start with dry run deployments
  • Use smallest viable instance type for testing
  • Delete test endpoints immediately after testing
  • Monitor costs daily during development

Production

  • Enable auto-scaling for variable workloads
  • Set up CloudWatch alarms for errors and latency
  • Use multiple availability zones for high availability
  • Implement A/B testing with traffic splitting
  • Back up model artifacts regularly

Security

  • Use private subnets for endpoints
  • Restrict security group rules to minimum required
  • Enable VPC endpoints for S3 and ECR
  • Use IAM roles with least-privilege access
  • Enable encryption at rest and in transit

Next Steps