Training time on a p3.2xlarge instance suddenly increased 10-fold! Very slow. #1921

@nectario

Training my deep learning model on a p3.2xlarge instance used to take about 2 seconds per epoch. Now, all of a sudden, it takes around 39 seconds per epoch! Total training time used to be 15 minutes, and now it can go up to 2 hours!

Please advise why this is happening.

See image attachment.

Thank you!

Nektarios

Describe the bug
Training is very slow on p3.2xlarge. The GPU may not be in use.
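
One way to test the GPU suspicion is to log which devices TensorFlow can see at the start of the training script. The snippet below is a minimal sketch, not taken from the actual training code: `gpu_report` and `visible_gpus` are hypothetical helper names, and the only TensorFlow call is `tf.config.list_physical_devices`, which is available in TensorFlow 2.3. A p3.2xlarge has a single V100 GPU, so an empty list would suggest the job is running on CPU.

```python
import importlib.util


def gpu_report(gpu_names):
    """Summarize a list of GPU device names as a one-line diagnostic."""
    if not gpu_names:
        return "No GPU visible to TensorFlow: training would fall back to CPU"
    return f"{len(gpu_names)} GPU(s) visible: {', '.join(gpu_names)}"


def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf  # imported lazily so the helper loads without TensorFlow

    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    names = visible_gpus()
    print("TensorFlow not installed" if names is None else gpu_report(names))
```

Printing this once at the top of the entry-point script puts the result in the training job logs, where it can be compared between a fast run and a slow run.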

To reproduce
A clear, step-by-step set of instructions to reproduce the bug.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots or logs
If applicable, add screenshots or logs to help explain your problem.
[Image: SageMaker_Slow_Training]

System information
A description of your system. Please provide:

  • SageMaker Python SDK version:
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow
  • Framework version: 2.3
  • Python version: 3.7
  • CPU or GPU: GPU
  • Custom Docker image (Y/N):

Additional context
Add any other context about the problem here.
