    Streamline machine learning workflows with SkyPilot on Amazon SageMaker HyperPod

    July 11, 2025

    This post is co-written with Zhanghao Wu, co-creator of SkyPilot.

    The rapid advancement of generative AI and foundation models (FMs) has significantly increased computational resource requirements for machine learning (ML) workloads. Modern ML pipelines require efficient systems for distributing workloads across accelerated compute resources, while making sure developer productivity remains high. Organizations need infrastructure solutions that are not only powerful but also flexible, resilient, and straightforward to manage.

    SkyPilot is an open source framework that simplifies running ML workloads by providing a unified abstraction layer that helps ML engineers run their workloads on different compute resources without managing underlying infrastructure complexities. It offers a simple, high-level interface for provisioning resources, scheduling jobs, and managing distributed training across multiple nodes.

    Amazon SageMaker HyperPod is purpose-built infrastructure for developing and deploying large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also delivers optimal performance through same-spine placement of instances, along with built-in resiliency. Combining the resiliency of SageMaker HyperPod with the efficiency of SkyPilot provides a powerful framework to scale up your generative AI workloads.

    In this post, we share how SageMaker HyperPod, in collaboration with SkyPilot, is streamlining AI development workflows. This integration makes our advanced GPU infrastructure more accessible to ML engineers, enhancing productivity and resource utilization.

    Challenges of orchestrating machine learning workloads

    Kubernetes has become popular for ML workloads due to its scalability and rich open source tooling. SageMaker HyperPod orchestrated on Amazon Elastic Kubernetes Service (Amazon EKS) combines the power of Kubernetes with the resilient environment of SageMaker HyperPod designed for training large models. Amazon EKS support in SageMaker HyperPod strengthens resilience through deep health checks, automated node recovery, and job auto-resume capabilities, providing uninterrupted training for large-scale and long-running jobs.

    ML engineers transitioning from traditional VM or on-premises environments often face a steep learning curve. The complexity of Kubernetes manifests and cluster management can pose significant challenges, potentially slowing down development cycles and resource utilization.

    Furthermore, AI infrastructure teams face the challenge of balancing the need for advanced management tools with the desire to provide a user-friendly experience for their ML engineers. They need a solution that offers both high-level control and ease of use for day-to-day operations.

    SageMaker HyperPod with SkyPilot

    To address these challenges, we partnered with SkyPilot to showcase a solution that uses the strengths of both platforms. SageMaker HyperPod excels at managing the underlying compute resources and instances, providing the robust infrastructure necessary for demanding AI workloads. SkyPilot complements this by offering an intuitive layer for job management, interactive development, and team coordination.

    Through this partnership, we can offer our customers the best of both worlds: the powerful, scalable infrastructure of SageMaker HyperPod, combined with a user-friendly interface that significantly reduces the learning curve for ML engineers. For AI infrastructure teams, this integration provides advanced management capabilities while simplifying the experience for their ML engineers, creating a win-win situation for all stakeholders.

    SkyPilot helps AI teams run their workloads on different infrastructures with a unified high-level interface and powerful management of resources and jobs. An AI engineer can bring their AI framework and specify the resource requirements for the job; SkyPilot will intelligently schedule the workload on the best available infrastructure: find the available GPUs, provision them, run the job, and manage its lifecycle.
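
    For example, a minimal task definition (a hypothetical task.yaml; the accelerator type and commands are placeholders) is enough for SkyPilot to find a GPU, provision it, and run the workload:

    resources:
        accelerators: H100:1    # request one H100 wherever SkyPilot can find it

    setup: |
        pip install torch       # placeholder dependency installation

    run: |
        python train.py         # placeholder training command

    Launching it with sky launch task.yaml provisions the resources, syncs the local working directory, and starts the job.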

    Solution overview

    Implementing this solution is straightforward, whether you’re working with existing SageMaker HyperPod clusters or setting up a new deployment. For existing clusters, you can connect using AWS Command Line Interface (AWS CLI) commands to update your kubeconfig and verify the setup. For new deployments, we guide you through setting up the API server, creating clusters, and configuring high-performance networking options like Elastic Fabric Adapter (EFA).

    The following diagram illustrates the solution architecture.

    In the following sections, we show how to run SkyPilot jobs for multi-node distributed training on SageMaker HyperPod. We go over the process of creating a SageMaker HyperPod cluster, installing SkyPilot, creating a SkyPilot cluster, and deploying a SkyPilot training job.

    Prerequisites

    You must have the following prerequisites:

    • An existing SageMaker HyperPod cluster with Amazon EKS (to create one, refer to Deploy Your HyperPod Cluster). You must provision a single ml.p5.48xlarge instance for the code samples in the following sections.
    • Access to the AWS CLI and kubectl command line tools.
    • A Python environment for installing SkyPilot.

    Create a SageMaker HyperPod cluster

    You can create an EKS cluster with a single AWS CloudFormation stack following the instructions in Using CloudFormation, configured with a virtual private cloud (VPC) and storage resources.

    To create and manage SageMaker HyperPod clusters, you can use either the AWS Management Console or AWS CLI. If you use the AWS CLI, specify the cluster configuration in a JSON file and choose the EKS cluster created from the CloudFormation stack as the orchestrator of the SageMaker HyperPod cluster. You then create the cluster worker nodes with NodeRecovery set to Automatic to enable automatic node recovery, and for OnStartDeepHealthChecks, add InstanceStress and InstanceConnectivity to enable deep health checks. See the following code:

    cat > cluster-config.json << EOL
    {
        "ClusterName": "hp-cluster",
        "Orchestrator": {
            "Eks": {
                "ClusterArn": "${EKS_CLUSTER_ARN}"
            }
        },
        "InstanceGroups": [
            {
                "InstanceGroupName": "worker-group-1",
                "InstanceType": "ml.p5.48xlarge",
                "InstanceCount": 2,
                "LifeCycleConfig": {
                    "SourceS3Uri": "s3://${BUCKET_NAME}",
                    "OnCreate": "on_create.sh"
                },
                "ExecutionRole": "${EXECUTION_ROLE}",
                "ThreadsPerCore": 1,
                "OnStartDeepHealthChecks": [
                    "InstanceStress",
                    "InstanceConnectivity"
                ]
            },
      ....
        ],
        "VpcConfig": {
            "SecurityGroupIds": [
                "$SECURITY_GROUP"
            ],
            "Subnets": [
                "$SUBNET_ID"
            ]
        },
        "ResilienceConfig": {
            "NodeRecovery": "Automatic"
        }
    }
    EOL

    You can add InstanceStorageConfigs to provision and mount additional Amazon Elastic Block Store (Amazon EBS) volumes on SageMaker HyperPod nodes.
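
    For example, the following addition to an instance group (a sketch; the 500 GB volume size is arbitrary) attaches an extra EBS volume to each node in that group:

    "InstanceStorageConfigs": [
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 500
            }
        }
    ]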

    To create the cluster using the SageMaker HyperPod APIs, run the following AWS CLI command:

    aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json
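
    Cluster creation takes several minutes. You can poll the status with describe-cluster (the cluster name matches the configuration file above) and wait for it to report InService:

    aws sagemaker describe-cluster --cluster-name hp-cluster --query 'ClusterStatus'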

    You are now ready to set up SkyPilot on your SageMaker HyperPod cluster.

    Connect to your SageMaker HyperPod EKS cluster

    From your AWS CLI environment, run the aws eks update-kubeconfig command to update your local kube config file (located at ~/.kube/config) with the credentials and configuration needed to connect to your EKS cluster using the kubectl command (provide your specific EKS cluster name):

    aws eks update-kubeconfig --name $EKS_CLUSTER_NAME

    You can verify that you are connected to the EKS cluster by running the following command:

    kubectl config current-context
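
    You can also list the worker nodes to confirm that kubectl can reach the cluster; the HyperPod instances you provisioned should appear as Kubernetes nodes:

    kubectl get nodes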

    Install SkyPilot with Kubernetes support

    Use the following code to install SkyPilot with Kubernetes support using pip:

    pip install skypilot[kubernetes]

    This installs the latest build of SkyPilot, which includes the necessary Kubernetes integrations.

    Verify SkyPilot’s connection to the EKS cluster

    Check if SkyPilot can connect to your Kubernetes cluster:

    sky check k8s

    The output should look similar to the following code:

    Checking credentials to enable clouds for SkyPilot.
    Kubernetes: enabled [compute]
    
    To enable a cloud, follow the hints above and rerun: sky check
    If any problems remain, refer to detailed docs at: https://docs.skypilot.co/en/latest/getting-started/installation.html
    
    🎉 Enabled clouds 🎉
    Kubernetes [compute]
    Active context: arn:aws:eks:us-east-2:XXXXXXXXXXXXX:cluster/sagemaker-hyperpod-eks-cluster
    
    Using SkyPilot API server: http://127.0.0.1:46580

    If this is your first time using SkyPilot with this Kubernetes cluster, you might see a prompt to create GPU labels for your nodes. Follow the instructions by running the following code:

    python -m sky.utils.kubernetes.gpu_labeler --context <your-eks-context>

    This script helps SkyPilot identify what GPU resources are available on each node in your cluster. The GPU labeling job might take a few minutes depending on the number of GPU resources in your cluster.
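
    When the labeling job finishes, you can verify the result (assuming the labeler's default skypilot.co/accelerator label key) by listing the nodes with that label:

    kubectl get nodes -L skypilot.co/accelerator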

    Discover available GPUs in the cluster

    To see what GPU resources are available in your SageMaker HyperPod cluster, use the following code:

    sky show-gpus --cloud k8s

    This will list the available GPU types and their counts. We have two p5.48xlarge instances, each equipped with 8 NVIDIA H100 GPUs:

     Kubernetes GPUs
    GPU REQUESTABLE_QTY_PER_NODE TOTAL_GPUS TOTAL_FREE_GPUS
    H100 1, 2, 4, 8 16 16
    
    Kubernetes per node accelerator availability
    NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
    hyperpod-i-00baa178bc31afde3 H100 8 8
    hyperpod-i-038beefa954efab84 H100 8 8

    Launch an interactive development environment

    With SkyPilot, you can launch a SkyPilot cluster for interactive development:

    sky launch -c dev --gpus H100

    This command creates an interactive development environment (IDE) with a single H100 GPU and will sync the local working directory to the cluster. SkyPilot handles the pod creation, resource allocation, and setup of the IDE.

    Considered resources (1 node):
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
     CLOUD        INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                                                 COST ($)   CHOSEN   
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Kubernetes   2CPU--8GB--H100:1   2       8         H100:1         arn:aws:eks:us-east-2:XXXXXXXXXX:cluster/sagemaker-hyperpod-eks-cluster   0.00          ✔     
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Launching a new cluster 'dev'. Proceed? [Y/n]: Y
    • Launching on Kubernetes.
    Pod is up.
    ✔ Cluster launched: dev. View logs: sky api logs -l sky-2025-05-05-15-28-47-523797/provision.log
    • Syncing files.
    Run commands not specified or empty.
    Useful Commands
    Cluster name: dev
    To log into the head VM:   ssh dev
    To submit a job:           sky exec dev yaml_file
    To stop the cluster:       sky stop dev
    To teardown the cluster:   sky down dev

    After it’s launched, you can connect to your IDE:

    ssh dev

    This gives you an interactive shell in your IDE, where you can run your code, install packages, and perform ML experiments.
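
    For example, running nvidia-smi from this shell should show the single H100 allocated to the environment. You can also submit one-off commands from your workstation without opening a shell:

    sky exec dev nvidia-smi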

    Run training jobs

    With SkyPilot, you can run distributed training jobs on your SageMaker HyperPod cluster. The following is an example of launching a distributed training job using a YAML configuration file.

    First, create a file named train.yaml with your training job configuration:

    resources:
        accelerators: H100
    
    num_nodes: 1
    
    setup: |
        git clone --depth 1 https://github.com/pytorch/examples || true
        cd examples
        git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
        # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
        uv venv --python 3.10
        source .venv/bin/activate
        uv pip install -r requirements.txt "numpy<2" "torch"
    
    run: |
        cd examples
        source .venv/bin/activate
        cd mingpt
        export LOGLEVEL=INFO
    
        MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
        echo "Starting distributed training, head node: $MASTER_ADDR"
    
        torchrun \
            --nnodes=$SKYPILOT_NUM_NODES \
            --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
            --master_addr=$MASTER_ADDR \
            --master_port=8008 \
            --node_rank=${SKYPILOT_NODE_RANK} \
            main.py

    Then launch your training job:

    sky launch -c train train.yaml

    This creates a training job on a single p5.48xlarge node, equipped with 8 NVIDIA H100 GPUs. You can monitor the output with the following command:

    sky logs train
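
    Because the node count is declared in train.yaml, scaling the same job across both p5.48xlarge instances only requires changing num_nodes: 1 to num_nodes: 2 and launching on a fresh cluster (train-multi is an illustrative cluster name); the SKYPILOT_NODE_IPS, SKYPILOT_NUM_NODES, and SKYPILOT_NODE_RANK variables used in the run section are populated automatically for each node:

    sky launch -c train-multi train.yaml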

    Running multi-node training jobs with EFA

    Elastic Fabric Adapter (EFA) is a network interface for Amazon Elastic Compute Cloud (Amazon EC2) instances that enables you to run applications requiring high levels of inter-node communications at scale on AWS through its custom-built operating system bypass hardware interface. This enables applications to communicate directly with the network hardware while bypassing the operating system kernel, significantly reducing latency and CPU overhead. This direct hardware access is particularly beneficial for distributed ML workloads where frequent inter-node communication during gradient synchronization can become a bottleneck. By using EFA-enabled instances such as p5.48xlarge or p6-b200.48xlarge, data scientists can scale their training jobs across multiple nodes while maintaining the low-latency, high-bandwidth communication essential for efficient distributed training, ultimately reducing training time and improving resource utilization for large-scale AI workloads.

    The following code snippet shows how to incorporate this into your SkyPilot job:

    name: nccl-test-efa
    
    resources:
      cloud: kubernetes
      accelerators: H100:8
      image_id: docker:public.ecr.aws/hpc-cloud/nccl-tests:latest
    
    num_nodes: 2
    
    envs:
      USE_EFA: "true"
    
    run: |
      if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
        echo "Head node"
    
        # Total number of processes, NP should be the total number of GPUs in the cluster
        NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
    
        # Append :${SKYPILOT_NUM_GPUS_PER_NODE} to each IP as slots
        nodes=""
        for ip in $SKYPILOT_NODE_IPS; do
          nodes="${nodes}${ip}:${SKYPILOT_NUM_GPUS_PER_NODE},"
        done
        nodes=${nodes::-1}
        echo "All nodes: ${nodes}"
    
        # Set environment variables
        export PATH=$PATH:/usr/local/cuda-12.2/bin:/opt/amazon/efa/bin:/usr/bin
        export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:/usr/local/nvidia/lib:$LD_LIBRARY_PATH
        export NCCL_HOME=/opt/nccl
        export CUDA_HOME=/usr/local/cuda-12.2
        export NCCL_DEBUG=INFO
        export NCCL_BUFFSIZE=8388608
        export NCCL_P2P_NET_CHUNKSIZE=524288
        export NCCL_TUNER_PLUGIN=/opt/aws-ofi-nccl/install/lib/libnccl-ofi-tuner.so
    
        if [ "${USE_EFA}" == "true" ]; then
          export FI_PROVIDER="efa"
        else
          export FI_PROVIDER=""
        fi
    
        /opt/amazon/openmpi/bin/mpirun \
          --allow-run-as-root \
          --tag-output \
          -H $nodes \
          -np $NP \
          -N $SKYPILOT_NUM_GPUS_PER_NODE \
          --bind-to none \
          -x FI_PROVIDER \
          -x PATH \
          -x LD_LIBRARY_PATH \
          -x NCCL_DEBUG=INFO \
          -x NCCL_BUFFSIZE \
          -x NCCL_P2P_NET_CHUNKSIZE \
          -x NCCL_TUNER_PLUGIN \
          --mca pml ^cm,ucx \
          --mca btl tcp,self \
          --mca btl_tcp_if_exclude lo,docker0,veth_def_agent \
          /opt/nccl-tests/build/all_reduce_perf \
          -b 8 \
          -e 2G \
          -f 2 \
          -g 1 \
          -c 5 \
          -w 5 \
          -n 100
      else
        echo "Worker nodes"
      fi
    
    config:
      kubernetes:
        pod_config:
          spec:
            containers:
            - resources:
                limits:
                  vpc.amazonaws.com/efa: 32
                requests:
                  vpc.amazonaws.com/efa: 32
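
    Save the configuration (as nccl-test-efa.yaml here, an illustrative file name) and launch it like any other SkyPilot job; the all_reduce_perf output in the logs reports the bus bandwidth achieved across the two nodes:

    sky launch -c nccl-test nccl-test-efa.yaml
    sky logs nccl-test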

    Clean up

    To delete your SkyPilot cluster, run the following command:

    sky down <cluster_name>

    To delete the SageMaker HyperPod cluster created in this post, you can use either the SageMaker AI console or the following AWS CLI command:

    aws sagemaker delete-cluster --cluster-name <cluster_name>

    Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.
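
    You can also confirm from the AWS CLI that no SageMaker HyperPod clusters remain:

    aws sagemaker list-clusters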

    If you used the CloudFormation stack to create resources, you can delete it using the following command:

    aws cloudformation delete-stack --stack-name <stack_name>

    Conclusion

    By combining the robust infrastructure capabilities of SageMaker HyperPod with SkyPilot’s user-friendly interface, we’ve showcased a solution that helps teams focus on innovation rather than infrastructure complexity. This approach not only simplifies operations but also enhances productivity and resource utilization across organizations of all sizes. To get started, refer to SkyPilot in the Amazon EKS Support in Amazon SageMaker HyperPod workshop.


    About the authors

    Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers—from small startups to large enterprises—train and deploy foundation models efficiently on AWS. He is passionate about computational optimization problems and improving the performance of AI workloads.

    Zhanghao Wu is a co-creator of the SkyPilot open source project and holds a PhD in computer science from UC Berkeley. He works on SkyPilot core, client-server architecture, managed jobs, and improving the AI experience on diverse cloud infrastructure in general.

    Ankit Anand is a Senior Foundation Models Go-To-Market (GTM) Specialist at AWS. He partners with top generative AI model builders, strategic customers, and AWS service teams to enable the next generation of AI/ML workloads on AWS. Ankit’s experience includes product management expertise within the financial services industry for high-frequency and low-latency trading and business development for Amazon Alexa.
