AWS SageMaker Deployment
Deploy custom AI models on dedicated infrastructure with AWS SageMaker.
⚠️ Important
SageMaker deployments provision real infrastructure and incur hourly costs even when idle. Endpoint creation takes 5-10 minutes. Always start with a dry run and carefully review costs.
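One way to make the dry-run-first habit hard to skip is to default it in client code. A minimal sketch (the `deploy_request` helper is hypothetical, not part of any shipped client):

```python
# Sketch: build a payload for POST /api/deploy/sagemaker that defaults
# to a dry run, so a real deployment must be requested explicitly.
def deploy_request(endpoint_name, instance_type, model_package_arn, dry_run=True):
    """Return the JSON body for a SageMaker deployment request."""
    return {
        "endpointName": endpoint_name,
        "instanceType": instance_type,
        "modelPackageArn": model_package_arn,
        "dryRun": dry_run,  # defaults to True; pass dry_run=False to deploy
    }
```

Calling `deploy_request("my-llama2-endpoint", "ml.g5.xlarge", arn)` produces a dry-run payload unless you explicitly opt out.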
Prerequisites
AWS Configuration
- AWS account with SageMaker access
- IAM execution role for SageMaker
- VPC with subnets and security groups configured
- S3 bucket for model artifacts (if using custom models)
Required Environment Variables
```shell
# AWS Credentials
STAQUE_AWS_REGION=eu-north-1
STAQUE_AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
STAQUE_AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# SageMaker-specific
SAGEMAKER_EXECUTION_ROLE_ARN=arn:aws:iam::123456789012:role/SageMakerExecutionRole
SAGEMAKER_SUBNET_IDS=subnet-12345678,subnet-87654321
SAGEMAKER_SECURITY_GROUP_IDS=sg-12345678
```
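A quick pre-flight check can catch missing variables before a deployment attempt fails mid-way. A minimal Python sketch (the `missing_vars` helper is illustrative, not an existing API):

```python
import os

# The variables this guide requires, as listed above.
REQUIRED_VARS = [
    "STAQUE_AWS_REGION",
    "STAQUE_AWS_ACCESS_KEY_ID",
    "STAQUE_AWS_SECRET_ACCESS_KEY",
    "SAGEMAKER_EXECUTION_ROLE_ARN",
    "SAGEMAKER_SUBNET_IDS",
    "SAGEMAKER_SECURITY_GROUP_IDS",
]

def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Run `missing_vars()` at startup and refuse to deploy if the returned list is non-empty.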
Step 1: Prepare IAM Role
Create an IAM role with the following permissions:
Trust Policy
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Required Policies
- AmazonSageMakerFullAccess - For SageMaker operations
- AmazonS3ReadOnlyAccess - For reading model artifacts
- Custom policy for ECR access (if using custom images)
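When scripting role creation, the trust policy above can be generated programmatically and passed as the role's assume-role policy document (for example via the AWS console, CLI, or an SDK). A sketch:

```python
import json

def sagemaker_trust_policy():
    """Trust policy allowing SageMaker to assume the execution role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

# Serialize for use as the assume-role policy document:
policy_json = json.dumps(sagemaker_trust_policy(), indent=2)
```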
Step 2: Configure VPC
SageMaker endpoints require VPC configuration:
Subnets
- Use at least 2 subnets in different availability zones
- Ensure subnets have sufficient IP addresses
- Private subnets recommended for security
Security Groups
- Allow HTTPS (443) inbound from your application
- Allow all outbound traffic for model downloads
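The subnet requirements above can be validated before attempting a deployment. A sketch (the `check_vpc_config` helper is hypothetical; `subnet_azs` maps subnet IDs to their availability zones):

```python
def check_vpc_config(subnet_azs, security_group_ids):
    """Return a list of problems with the planned VPC configuration.

    subnet_azs: mapping of subnet ID -> availability zone name.
    security_group_ids: list of security group IDs.
    """
    problems = []
    if len(subnet_azs) < 2:
        problems.append("need at least 2 subnets")
    if len(set(subnet_azs.values())) < 2:
        problems.append("subnets must span at least 2 availability zones")
    if not security_group_ids:
        problems.append("at least one security group is required")
    return problems
```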
Step 3: Choose Deployment Method
Option A: Deploy from JumpStart (Recommended)
Use pre-built models from SageMaker JumpStart:
1. List Available Models
```
GET /api/models/sagemaker?source=jumpstart&max=20

// Response
{
  "success": true,
  "source": "jumpstart",
  "models": [
    {
      "modelPackageArn": "arn:aws:sagemaker:...:model-package/jumpstart-llama2-7b-...",
      "displayName": "Llama 2 7B",
      "supportedRealtimeInferenceInstanceTypes": [
        "ml.g5.xlarge",
        "ml.g5.2xlarge",
        "ml.g4dn.xlarge"
      ]
    }
  ]
}
```

2. Dry Run Deployment
```
POST /api/deploy/sagemaker
Content-Type: application/json

{
  "endpointName": "my-llama2-endpoint",
  "instanceType": "ml.g5.xlarge",
  "modelPackageArn": "arn:aws:sagemaker:...:model-package/jumpstart-llama2-7b-...",
  "dryRun": true
}

// Response
{
  "success": true,
  "dryRun": true,
  "plan": {
    "endpointName": "my-llama2-endpoint",
    "instanceType": "ml.g5.xlarge",
    "roleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "vpc": {
      "subnets": ["subnet-12345678", "subnet-87654321"],
      "securityGroups": ["sg-12345678"]
    }
  }
}
```

3. Deploy
```
POST /api/deploy/sagemaker
Content-Type: application/json

{
  "endpointName": "my-llama2-endpoint",
  "instanceType": "ml.g5.xlarge",
  "modelPackageArn": "arn:aws:sagemaker:...:model-package/jumpstart-llama2-7b-...",
  "dryRun": false
}

// Response
{
  "success": true,
  "message": "Endpoint creation started",
  "endpointName": "my-llama2-endpoint",
  "endpoint": "https://runtime.sagemaker.eu-north-1.amazonaws.com/endpoints/my-llama2-endpoint/invocations"
}
```

Option B: Deploy Custom Model
Deploy your own model from S3:
1. Prepare Model Artifacts
- Package model files in model.tar.gz
- Upload to S3 bucket
- Ensure SageMaker role has read access
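The packaging step can be scripted with the standard library. A sketch (the `package_model` helper is illustrative; the exact archive layout your inference container expects may differ):

```python
import tarfile
from pathlib import Path

def package_model(model_dir, output="model.tar.gz"):
    """Create the model.tar.gz archive from a directory of model files.

    Files are added at the archive root (paths relative to model_dir),
    which is the layout most inference containers expect.
    """
    with tarfile.open(output, "w:gz") as tar:
        for path in sorted(Path(model_dir).rglob("*")):
            if path.is_file():
                tar.add(path, arcname=str(path.relative_to(model_dir)))
    return output
```

Upload the result with, for example, `aws s3 cp model.tar.gz s3://my-bucket/models/model.tar.gz`.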
2. Deploy Custom Model
```
POST /api/deploy/sagemaker
Content-Type: application/json

{
  "endpointName": "my-custom-model",
  "instanceType": "ml.g4dn.xlarge",
  "inferenceImage": "763104351884.dkr.ecr.eu-north-1.amazonaws.com/pytorch-inference:2.1.0-gpu-py310",
  "modelDataUrl": "s3://my-bucket/models/model.tar.gz",
  "dryRun": false
}
```

Step 4: Monitor Deployment
Track the deployment status:
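The 30-second polling loop can be sketched as follows (`wait_for_endpoint` is a hypothetical helper; `get_status` is assumed to be a callable returning the parsed JSON from the status endpoint):

```python
import time

def wait_for_endpoint(get_status, poll_seconds=30, timeout_seconds=1800,
                      sleep=time.sleep):
    """Poll until the endpoint leaves the 'Creating' state.

    Returns the final status string, e.g. "InService" or "Failed".
    """
    waited = 0
    while waited <= timeout_seconds:
        status = get_status()["resource"]["status"]
        if status != "Creating":
            return status
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError("endpoint did not become ready in time")
```

The `sleep` parameter exists so the loop can be exercised without real delays.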
```
// Poll every 30 seconds
GET /api/resources/<resource-id>/status

// During deployment
{
  "success": true,
  "resource": {
    "status": "Creating",
    "health": "unknown"
  }
}

// When ready
{
  "success": true,
  "resource": {
    "status": "InService",
    "health": "healthy",
    "instance_type": "ml.g5.xlarge",
    "instance_count": 1
  },
  "metrics": {
    "response_time_ms": 342,
    "throughput_per_minute": 87,
    "cpu_utilization": 35
  },
  "costs": {
    "hourly_cost": 1.006,
    "daily_cost": 24.14,
    "monthly_estimate": 724.32
  }
}
```

Instance Types Guide
GPU Instances (Recommended for LLMs)
| Instance Type | GPU | vCPUs | Memory | Hourly Cost |
|---|---|---|---|---|
| ml.g4dn.xlarge | 1x T4 | 4 | 16 GB | ~$0.74/hr |
| ml.g5.xlarge | 1x A10G | 4 | 16 GB | ~$1.01/hr |
| ml.g5.2xlarge | 1x A10G | 8 | 32 GB | ~$1.21/hr |
| ml.p3.2xlarge | 1x V100 | 8 | 61 GB | ~$3.83/hr |
CPU Instances (For Smaller Models)
| Instance Type | vCPUs | Memory | Hourly Cost |
|---|---|---|---|
| ml.m5.xlarge | 4 | 16 GB | ~$0.23/hr |
| ml.c5.2xlarge | 8 | 16 GB | ~$0.40/hr |
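Given a model's supportedRealtimeInferenceInstanceTypes list (as returned by the JumpStart listing), a small helper can pick the cheapest type you have a price for. A sketch using the approximate rates in the tables above (verify current pricing for your region):

```python
# Approximate on-demand hourly prices (USD) from the tables above.
HOURLY_PRICE = {
    "ml.g4dn.xlarge": 0.74,
    "ml.g5.xlarge": 1.01,
    "ml.g5.2xlarge": 1.21,
    "ml.p3.2xlarge": 3.83,
    "ml.m5.xlarge": 0.23,
    "ml.c5.2xlarge": 0.40,
}

def cheapest_supported(supported_types):
    """Return the cheapest instance type with a known price."""
    priced = [t for t in supported_types if t in HOURLY_PRICE]
    if not priced:
        raise ValueError("no known prices for the supported instance types")
    return min(priced, key=HOURLY_PRICE.__getitem__)
```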
Cost Management
Understanding SageMaker Costs
- Always-on billing: Charged for every hour the endpoint is running
- No auto-scaling by default: Fixed instance count
- Data transfer costs: Additional charges for data in/out
- Storage costs: S3 storage for model artifacts
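Always-on billing makes instance costs straightforward to project. A sketch that mirrors the cost fields in the status response above (rates are approximate; `hours_per_month` assumes a 30-day month):

```python
def estimate_costs(hourly_cost, instance_count=1, hours_per_month=720):
    """Project daily and monthly cost from an always-on hourly rate."""
    hourly = hourly_cost * instance_count
    return {
        "hourly_cost": round(hourly, 3),
        "daily_cost": round(hourly * 24, 2),
        "monthly_estimate": round(hourly * hours_per_month, 2),
    }
```

For example, `estimate_costs(1.006)` reproduces the ml.g5.xlarge figures shown in the monitoring response.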
Cost Optimization Strategies
- Right-size instances: Start small and scale up if needed
- Delete unused endpoints: Don't let idle endpoints run
- Use Savings Plans: Commit to usage for discounts
- Monitor utilization: Track CPU/GPU usage to optimize
Delete Endpoint When Done
```
POST /api/resources/<resource-id>/control
Content-Type: application/json

{
  "action": "delete",
  "confirm": true
}

// Response
{
  "success": true,
  "message": "Endpoint deletion initiated",
  "action": "delete",
  "status": "deleting"
}
```

Endpoint Management
Update Endpoint
Update the endpoint configuration (this requires creating a new endpoint config):
- Change instance type
- Modify instance count
- Update model version
Restart Endpoint
```
POST /api/resources/<resource-id>/control
Content-Type: application/json

{
  "action": "restart"
}

// Takes 5-10 minutes
{
  "success": true,
  "message": "Endpoint restart initiated",
  "action": "restart",
  "status": "updating"
}
```

Regional Considerations
eu-north-1 (Stockholm)
- Limited instance type availability
- Verify that the inference image version you need is published here; older image tags tend to have wider availability
- Lower costs than us-east-1
us-east-1 (N. Virginia)
- Widest selection of instance types
- Best for testing and development
- Latest inference images available
Troubleshooting
Common Issues
Error: "SAGEMAKER_SUBNET_IDS must be set"
Cause: Missing VPC configuration environment variables
Solution: Set SAGEMAKER_SUBNET_IDS and SAGEMAKER_SECURITY_GROUP_IDS
Error: "ResourceLimitExceeded"
Cause: Exceeded instance quota for the instance type
Solution: Request quota increase through AWS Service Quotas console
Status: "Failed"
Cause: Various deployment failures (role permissions, VPC config, image issues)
Solution: Check CloudWatch logs for detailed error messages. Common issues: IAM role permissions, invalid VPC configuration, missing inference image in region
Slow Response Times
Cause: Instance too small for model size or high traffic
Solution: Upgrade to larger instance type or add more instances
Best Practices
Development
- Always start with dry run deployments
- Use smallest viable instance type for testing
- Delete test endpoints immediately after testing
- Monitor costs daily during development
Production
- Enable auto-scaling for variable workloads
- Set up CloudWatch alarms for errors and latency
- Use multiple availability zones for high availability
- Implement A/B testing with traffic splitting
- Regular backup of model artifacts
Security
- Use private subnets for endpoints
- Restrict security group rules to minimum required
- Enable VPC endpoints for S3 and ECR
- Use IAM roles with least-privilege access
- Enable encryption at rest and in transit