AWS SageMaker Deployment

Deploy custom AI models on dedicated infrastructure with AWS SageMaker.

⚠️ Important

SageMaker deployments provision real infrastructure and incur hourly costs even when idle. Endpoint creation takes 5-10 minutes. Always start with a dry run and carefully review costs.

Prerequisites

AWS Configuration

  • AWS account with SageMaker access
  • IAM execution role for SageMaker
  • VPC with subnets and security groups configured
  • S3 bucket for model artifacts (if using custom models)

Required Environment Variables

# AWS Credentials
STAQUE_AWS_REGION=eu-north-1
STAQUE_AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
STAQUE_AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# SageMaker-specific
SAGEMAKER_EXECUTION_ROLE_ARN=arn:aws:iam::123456789012:role/SageMakerExecutionRole
SAGEMAKER_SUBNET_IDS=subnet-12345678,subnet-87654321
SAGEMAKER_SECURITY_GROUP_IDS=sg-12345678
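Missing variables are a common deployment failure (see Troubleshooting below), so it can help to validate them up front. A minimal sketch, assuming the variable names listed above:

```python
import os

# Variables required before any SageMaker deployment, per the list above.
REQUIRED_VARS = [
    "STAQUE_AWS_REGION",
    "STAQUE_AWS_ACCESS_KEY_ID",
    "STAQUE_AWS_SECRET_ACCESS_KEY",
    "SAGEMAKER_EXECUTION_ROLE_ARN",
    "SAGEMAKER_SUBNET_IDS",
    "SAGEMAKER_SECURITY_GROUP_IDS",
]

def check_sagemaker_env() -> list:
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]
```

Run this check before issuing a deploy request and fail fast with the list of missing names.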

Step 1: Prepare IAM Role

Create an IAM role with the following permissions:

Trust Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Required Policies

  • AmazonSageMakerFullAccess - For SageMaker operations
  • AmazonS3ReadOnlyAccess - For reading model artifacts
  • Custom policy for ECR access (if using custom images)

Step 2: Configure VPC

Endpoints deployed through this integration run inside your VPC, so the following must be configured:

Subnets

  • Use at least 2 subnets in different availability zones
  • Ensure subnets have sufficient IP addresses
  • Private subnets recommended for security

Security Groups

  • Allow HTTPS (443) inbound from your application
  • Allow all outbound traffic for model downloads

Step 3: Choose Deployment Method

Option A: Deploy from JumpStart (Recommended)

Use pre-built models from SageMaker JumpStart:

1. List Available Models

GET /api/models/sagemaker?source=jumpstart&max=20

// Response
{
  "success": true,
  "source": "jumpstart",
  "models": [
    {
      "modelPackageArn": "arn:aws:sagemaker:...:model-package/jumpstart-llama2-7b-...",
      "displayName": "Llama 2 7B",
      "supportedRealtimeInferenceInstanceTypes": [
        "ml.g5.xlarge",
        "ml.g5.2xlarge",
        "ml.g4dn.xlarge"
      ]
    }
  ]
}
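When a model supports several instance types, you typically want the cheapest one that is available. A sketch of that selection, using assumed hourly prices (check current AWS pricing; the `HOURLY_PRICE` table here is illustrative):

```python
# Assumed on-demand hourly prices in USD; verify against current AWS pricing.
HOURLY_PRICE = {
    "ml.g4dn.xlarge": 0.74,
    "ml.g5.xlarge": 1.01,
    "ml.g5.2xlarge": 1.21,
}

def cheapest_supported(model: dict) -> str:
    """Pick the cheapest supported instance type from the listing response."""
    supported = model["supportedRealtimeInferenceInstanceTypes"]
    priced = [t for t in supported if t in HOURLY_PRICE]
    if not priced:
        raise ValueError("No known price for any of: %s" % supported)
    return min(priced, key=HOURLY_PRICE.__getitem__)
```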

2. Dry Run Deployment

POST /api/deploy/sagemaker
Content-Type: application/json

{
  "endpointName": "my-llama2-endpoint",
  "instanceType": "ml.g5.xlarge",
  "modelPackageArn": "arn:aws:sagemaker:...:model-package/jumpstart-llama2-7b-...",
  "dryRun": true
}

// Response
{
  "success": true,
  "dryRun": true,
  "plan": {
    "endpointName": "my-llama2-endpoint",
    "instanceType": "ml.g5.xlarge",
    "roleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "vpc": {
      "subnets": ["subnet-12345678", "subnet-87654321"],
      "securityGroups": ["sg-12345678"]
    }
  }
}
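The dry-run and real deploy requests differ only in the `dryRun` flag, so a small helper can build both. A sketch that defaults to dry run, mirroring the request body shown above:

```python
def build_deploy_request(endpoint_name: str, instance_type: str,
                         model_package_arn: str, dry_run: bool = True) -> dict:
    """Build the JSON body for POST /api/deploy/sagemaker.

    Defaults to a dry run so real infrastructure is never provisioned
    by accident; pass dry_run=False only after reviewing the plan.
    """
    return {
        "endpointName": endpoint_name,
        "instanceType": instance_type,
        "modelPackageArn": model_package_arn,
        "dryRun": dry_run,
    }
```

Defaulting `dry_run` to `True` makes the safe path the easy path, which matches the warning at the top of this guide.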

3. Deploy

POST /api/deploy/sagemaker
Content-Type: application/json

{
  "endpointName": "my-llama2-endpoint",
  "instanceType": "ml.g5.xlarge",
  "modelPackageArn": "arn:aws:sagemaker:...:model-package/jumpstart-llama2-7b-...",
  "dryRun": false
}

// Response
{
  "success": true,
  "message": "Endpoint creation started",
  "endpointName": "my-llama2-endpoint",
  "endpoint": "https://runtime.sagemaker.eu-north-1.amazonaws.com/endpoints/my-llama2-endpoint/invocations"
}

Option B: Deploy Custom Model

Deploy your own model from S3:

1. Prepare Model Artifacts

  • Package model files in model.tar.gz
  • Upload to S3 bucket
  • Ensure SageMaker role has read access
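The packaging step above can be sketched with the standard library. SageMaker extracts the archive and expects model files at the archive root, so add files with relative archive names rather than absolute paths:

```python
import tarfile
from pathlib import Path

def package_model(model_dir: str, output: str = "model.tar.gz") -> str:
    """Pack the contents of model_dir into a gzipped tarball.

    Files are added with paths relative to model_dir so they sit at
    the archive root, which is the layout SageMaker expects.
    """
    with tarfile.open(output, "w:gz") as tar:
        for path in sorted(Path(model_dir).rglob("*")):
            tar.add(path, arcname=path.relative_to(model_dir))
    return output
```

Upload the resulting file to S3 (for example with `aws s3 cp model.tar.gz s3://my-bucket/models/`) and use that S3 URL as `modelDataUrl` below.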

2. Deploy Custom Model

POST /api/deploy/sagemaker
Content-Type: application/json

{
  "endpointName": "my-custom-model",
  "instanceType": "ml.g4dn.xlarge",
  "inferenceImage": "763104351884.dkr.ecr.eu-north-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310",
  "modelDataUrl": "s3://my-bucket/models/model.tar.gz",
  "dryRun": false
}

Step 4: Monitor Deployment

Track the deployment status:

// Poll every 30 seconds
GET /api/resources/<resource-id>/status

// During deployment
{
  "success": true,
  "resource": {
    "status": "Creating",
    "health": "unknown"
  }
}

// When ready
{
  "success": true,
  "resource": {
    "status": "InService",
    "health": "healthy",
    "instance_type": "ml.g5.xlarge",
    "instance_count": 1
  },
  "metrics": {
    "response_time_ms": 342,
    "throughput_per_minute": 87,
    "cpu_utilization": 35
  },
  "costs": {
    "hourly_cost": 1.006,
    "daily_cost": 24.14,
    "monthly_estimate": 724.32
  }
}
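The polling described above can be wrapped in a small loop that stops on `InService`, fails on `Failed`, and gives up after a timeout. A sketch, where `fetch_status` is any callable that returns the status response shown above:

```python
import time

def wait_until_in_service(fetch_status, poll_seconds=30, timeout_seconds=900):
    """Poll fetch_status() until the endpoint reports InService.

    fetch_status wraps GET /api/resources/<resource-id>/status and
    returns the parsed JSON. Raises on a Failed status or on timeout
    (default 15 minutes, covering the typical 5-10 minute creation).
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = fetch_status()["resource"]["status"]
        if status == "InService":
            return
        if status == "Failed":
            raise RuntimeError("Deployment failed; check CloudWatch logs")
        time.sleep(poll_seconds)
    raise TimeoutError("Endpoint did not reach InService before timeout")
```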

Instance Types Guide

GPU Instances (Recommended for LLMs)

Instance Type     GPU       vCPUs   Memory   Hourly Cost
ml.g4dn.xlarge    1x T4     4       16 GB    ~$0.74/hr
ml.g5.xlarge      1x A10G   4       16 GB    ~$1.01/hr
ml.g5.2xlarge     1x A10G   8       32 GB    ~$1.21/hr
ml.p3.2xlarge     1x V100   8       61 GB    ~$3.83/hr

CPU Instances (For Smaller Models)

Instance Type     vCPUs   Memory   Hourly Cost
ml.m5.xlarge      4       16 GB    ~$0.23/hr
ml.c5.2xlarge     8       16 GB    ~$0.40/hr

Cost Management

Understanding SageMaker Costs

  • Always-on billing: Charged for every hour the endpoint is running
  • No auto-scaling by default: Fixed instance count
  • Data transfer costs: Additional charges for data in/out
  • Storage costs: S3 storage for model artifacts
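The always-on billing model makes the arithmetic simple: cost scales linearly with the hourly rate, the instance count, and the hours the endpoint exists. A sketch matching the estimates returned by the status API (which appear to use a 720-hour, i.e. 30-day, month):

```python
def endpoint_costs(hourly_rate: float, instance_count: int = 1) -> dict:
    """Estimate endpoint costs from the per-instance hourly rate.

    Uses a 720-hour (30-day) month, consistent with the hourly/daily/
    monthly figures shown in the status response above. Excludes data
    transfer and S3 storage charges.
    """
    hourly = hourly_rate * instance_count
    return {
        "hourly_cost": round(hourly, 3),
        "daily_cost": round(hourly * 24, 2),
        "monthly_estimate": round(hourly * 720, 2),
    }
```

For example, one ml.g5.xlarge at ~$1.006/hr comes to roughly $24 per day and about $724 per month if left running continuously, which is why deleting idle endpoints matters.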

Cost Optimization Strategies

  • Right-size instances: Start small and scale up if needed
  • Delete unused endpoints: Don't let idle endpoints run
  • Use Savings Plans: Commit to usage for discounts
  • Monitor utilization: Track CPU/GPU usage to optimize

Delete Endpoint When Done

POST /api/resources/<resource-id>/control
Content-Type: application/json

{
  "action": "delete",
  "confirm": true
}

// Response
{
  "success": true,
  "message": "Endpoint deletion initiated",
  "action": "delete",
  "status": "deleting"
}
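Because deletion is irreversible and the API requires an explicit `confirm` flag, a client-side guard can keep accidental deletes from ever reaching the API. A sketch of a builder for the control request body shown above:

```python
def build_control_request(action: str, confirm: bool = False) -> dict:
    """Build the body for POST /api/resources/<resource-id>/control.

    Refuses to build a delete request unless confirm is explicitly
    True, mirroring the API's own confirmation requirement.
    """
    if action == "delete" and not confirm:
        raise ValueError("Deletion is irreversible; pass confirm=True")
    body = {"action": action}
    if action == "delete":
        body["confirm"] = True
    return body
```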

Endpoint Management

Update Endpoint

Update endpoint configuration (requires creating new endpoint config):

  • Change instance type
  • Modify instance count
  • Update model version

Restart Endpoint

POST /api/resources/<resource-id>/control
Content-Type: application/json

{
  "action": "restart"
}

// Takes 5-10 minutes
{
  "success": true,
  "message": "Endpoint restart initiated",
  "action": "restart",
  "status": "updating"
}

Regional Considerations

eu-north-1 (Stockholm)

  • Limited instance type availability
  • Use older inference images for better compatibility
  • Lower costs than us-east-1

us-east-1 (N. Virginia)

  • Widest selection of instance types
  • Best for testing and development
  • Latest inference images available

Troubleshooting

Common Issues

Error: "SAGEMAKER_SUBNET_IDS must be set"

Cause: Missing VPC configuration environment variables

Solution: Set SAGEMAKER_SUBNET_IDS and SAGEMAKER_SECURITY_GROUP_IDS

Error: "ResourceLimitExceeded"

Cause: Exceeded instance quota for the instance type

Solution: Request quota increase through AWS Service Quotas console

Status: "Failed"

Cause: Various deployment failures (role permissions, VPC config, image issues)

Solution: Check CloudWatch logs for detailed error messages. Common issues: IAM role permissions, invalid VPC configuration, missing inference image in region

Slow Response Times

Cause: Instance too small for model size or high traffic

Solution: Upgrade to larger instance type or add more instances

Best Practices

Development

  • Always start with dry run deployments
  • Use smallest viable instance type for testing
  • Delete test endpoints immediately after testing
  • Monitor costs daily during development

Production

  • Enable auto-scaling for variable workloads
  • Set up CloudWatch alarms for errors and latency
  • Use multiple availability zones for high availability
  • Implement A/B testing with traffic splitting
  • Back up model artifacts regularly

Security

  • Use private subnets for endpoints
  • Restrict security group rules to minimum required
  • Enable VPC endpoints for S3 and ECR
  • Use IAM roles with least-privilege access
  • Enable encryption at rest and in transit

Next Steps