
Developer Documentation

# Let's Data : Focus on the data - we'll manage the infrastructure!

Cloud infrastructure that simplifies how you process, analyze and transform data.

Metrics

Overview

The #Let's Data dataset execution generates metrics that can be used to monitor the execution and troubleshoot issues. These are surfaced on the Dataset page as a Dataset Summary dashboard and a Task Details dashboard for quick access, and the raw data behind these dashboards is available via the API. Here is an in-depth look at each of these dashboards.

Dashboards

Dashboard: Dataset Summary

  • Task Success & Volume:
    •    Each datapoint is the percentage of tasks that completed successfully at that minute, plotted on the left axis. 1.0 means 100% of the tasks succeeded, 0.0 means all tasks that completed at that minute failed.
    •    Plotted on the right axis is the number of tasks that completed at that minute.
    •    For example, a value of 0.75 with 4 completed tasks means that 75% of the 4 completing tasks succeeded at that minute and 25% failed (worked through in the sketch below).

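To make the relationship between the two axes concrete, here is a minimal arithmetic sketch (illustrative Python, not #Let's Data code) using the 75% / 4-task example above:

```python
# Illustrative arithmetic for the Task Success & Volume graph (hypothetical numbers).
# Suppose 4 tasks completed in a given minute: 3 succeeded, 1 failed.
succeeded = 3
failed = 1

completed = succeeded + failed          # right axis: tasks completed this minute -> 4
success_ratio = succeeded / completed   # left axis: 0.75, i.e. 75% of completing tasks succeeded

print(f"tasks completed this minute: {completed}")
print(f"task success ratio:          {success_ratio:.2f}")
```
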
  • Task Latency:
    •    Each datapoint is the average latency of completion for the tasks that completed at that minute.
    •    For example, a value of 196,926.0 milliseconds means that the tasks completing at that minute took roughly 197 seconds to complete on average.
    •    This metric can be correlated to the number of tasks metric for that minute in the previous graph.
    •    For example, if 4 tasks completed at that minute and the average latency is 196,926.0 at that minute, then the sample size for the average is the 4 tasks that completed that minute.

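As a minimal sketch (illustrative Python; the per-task latencies are hypothetical, chosen so the numbers line up with the example above), the datapoint is simply the mean completion latency over the tasks that completed in that minute:

```python
# Illustrative arithmetic for the Task Latency graph (hypothetical per-task latencies, in ms).
# 4 tasks completed in this minute; the plotted datapoint is their average completion latency.
latencies_ms = [180_500.0, 210_004.0, 195_200.0, 202_000.0]

avg_latency_ms = sum(latencies_ms) / len(latencies_ms)   # -> 196,926.0 ms, i.e. ~197 s
sample_size = len(latencies_ms)                          # the 4 tasks that completed this minute

print(f"sample size: {sample_size} tasks")
print(f"average completion latency: {avg_latency_ms:,.1f} ms (~{avg_latency_ms / 1000:.0f} s)")
```
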
  • Task Checkpoint Success & Latency:
    •    Each datapoint is the percentage of task checkpoints that completed successfully at that minute, plotted on the left axis.
    •    1.0 means 100% checkpoints succeeded, 0.0 means all checkpoints at that minute failed.
    •    Plotted on the right axis is the average latency of each checkpoint in milliseconds.
    •    A task checkpoints periodically, for example after every 500 records read from the file. A task can create multiple checkpoints in a minute, and multiple tasks run concurrently; this datapoint is the average across all of these checkpoints.
    •    While this isn't controllable by the user, knowing how much time each checkpoint takes can help with diagnosing task issues. The #Let's Data team closely monitors this metric to find any DB throttling / DB latency issues. We expect this metric to be within the ~25 msec range.

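A minimal sketch of how this per-minute average is formed (illustrative Python with hypothetical checkpoint outcomes; the actual checkpointing is internal to #Let's Data):

```python
# Illustrative arithmetic for the Task Checkpoint Success & Latency graph.
# Hypothetical data: 3 concurrent tasks, each checkpointing a few times in the minute
# (e.g. after every 500 records read). Latencies are in milliseconds.
checkpoints = [
    # (succeeded, latency_ms)
    (True, 22.0), (True, 24.5), (True, 26.0),   # task 1
    (True, 23.0), (False, 41.0),                # task 2 (one failed checkpoint)
    (True, 25.5), (True, 24.0),                 # task 3
]

success_ratio = sum(ok for ok, _ in checkpoints) / len(checkpoints)    # left axis
avg_latency_ms = sum(ms for _, ms in checkpoints) / len(checkpoints)   # right axis

print(f"checkpoint success ratio: {success_ratio:.2f}")      # healthy datasets stay near 1.0
print(f"avg checkpoint latency:   {avg_latency_ms:.1f} ms")  # expected around the ~25 ms range
```
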
  • Task Records Processed:
    •    Each datapoint is the sum of the number of records processed (red), skipped (green) and errored (blue) by each task at that minute, plotted on the left axis.
    •    These records are the composite records being sent to the write destination - not necessarily the records read from the file.
    •    For example, if a task is reading data from 2 files and produces a single composite document (or a decision to skip / error) from multiple records from each file, the metric counts that as 1.

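The counting rule is the key detail here: a composite document built from several raw records still increments the metric by 1. A minimal sketch (illustrative Python, hypothetical outcomes):

```python
# Illustrative sketch of the counting rule behind the Task Records Processed graph.
# Each composite document (or a decision to skip / error it) counts as 1, regardless of
# how many raw records from how many files were combined to produce it.
composites = [
    {"raw_records_used": 5, "outcome": "processed"},   # e.g. 3 records from file 1 + 2 from file 2
    {"raw_records_used": 4, "outcome": "processed"},
    {"raw_records_used": 6, "outcome": "skipped"},
    {"raw_records_used": 3, "outcome": "errored"},
]

counts = {"processed": 0, "skipped": 0, "errored": 0}   # red, green and blue lines respectively
for composite in composites:
    counts[composite["outcome"]] += 1

print(counts)   # {'processed': 2, 'skipped': 1, 'errored': 1}
print("raw records read:", sum(c["raw_records_used"] for c in composites))  # 18, not what the metric plots
```
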
  • Task Write Connector Success & Volume:
    •    Each datapoint is the percentage of task checkpoints that completed successfully at that minute, plotted on the left axis.
    •    1.0 means 100% checkpoints succeeded, 0.0 means all checkpoints at that minute failed.
    •    Plotted on the right axis is the average latency of each checkpoint in milliseconds.
    •    A task checkpoints periodically, for example after every 500 records read from the file. A task can create multiple checkpoints in a minute, and multiple tasks run concurrently; this datapoint is the average across all of these checkpoints.
    •    While this isn't controllable by the user, knowing how much time each checkpoint takes can help with diagnosing task issues. The #Let's Data team closely monitors this metric to find any DB throttling / DB latency issues. We expect this metric to be within the ~25 msec range.

  • Task Write Connector Put Retry %:
    •    Each datapoint is the average percentage of the Write Connector Put API calls that were retried by the tasks.
    •    A value of 0.0 means that there were no retries to the Write Connector (the Write Connector is adequately scaled, and one can even evaluate whether scaling it down a little would cause any issues).
    •    In a minute, a task can call the Write Connector Put API multiple times and there are multiple tasks running concurrently. This is the average retry percentage across all of these task Put API calls.
    •    A value of 0.25 means that, on average, 25% of the Write Connector calls were retried across the tasks. If there are 4 tasks running concurrently, this is the average of each task's retry percentage. For example, each task could be retrying 25% of the time before the call succeeds (possible Write Connector scaling issues), or maybe 1 task is retrying 100% of the time while the other 3 are retrying 0% (possibly an issue with that task - a contrived and unlikely example). The sketch below works through both scenarios.
    •    Users can use the Put Retries graph (along with the Write Connector Bytes Written, volume and latency graphs) to determine whether the Write Connector stream scaling needs fine-tuning.

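A minimal sketch of the averaging (illustrative Python, hypothetical per-task retry percentages) covering the two scenarios described above:

```python
# Illustrative arithmetic for the Task Write Connector Put Retry % graph.
# The datapoint is the average of the per-task retry percentages for the minute.
def average_retry_pct(per_task_retry_pcts):
    return sum(per_task_retry_pcts) / len(per_task_retry_pcts)

# Scenario A: each of the 4 tasks retries 25% of its Put calls (possible Write Connector scaling issue).
print(average_retry_pct([0.25, 0.25, 0.25, 0.25]))   # 0.25

# Scenario B: 1 task retries 100% of the time, the other 3 never retry (possible issue with that task).
print(average_retry_pct([1.0, 0.0, 0.0, 0.0]))       # 0.25
```
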
  • Task Record Latencies & Volume:
    •    Each datapoint plotted on the left axis is the (avg, min and max) latency of record extraction by the user handlers (readers and parsers).
    •    Plotted on the right axis is the sample count for the latency metric.
    •    This is pure CPU work that the parsers and readers do on the bytes from the file to extract the records and create composite records (plus some work that the system does to put them into buffers etc.).
    •    This is a good metric to look at to find performance issues with the parser; we expect these latencies to be < 10 ms (min, avg) and < 30 ms (max, for example for large documents).

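A minimal sketch of how the plotted statistics relate to the raw per-record extraction latencies (illustrative Python, hypothetical samples), including a check against the rough expectations above:

```python
# Illustrative arithmetic for the Task Record Latencies & Volume graph (hypothetical samples, in ms).
extraction_latencies_ms = [2.1, 3.4, 1.8, 4.9, 2.7, 28.5, 3.1]

sample_count = len(extraction_latencies_ms)             # right axis: volume (sample count)
min_ms = min(extraction_latencies_ms)                   # left axis: min
avg_ms = sum(extraction_latencies_ms) / sample_count    # left axis: avg
max_ms = max(extraction_latencies_ms)                   # left axis: max (e.g. a large document)

print(f"samples={sample_count} min={min_ms:.1f} ms avg={avg_ms:.1f} ms max={max_ms:.1f} ms")
print("within rough expectations:", min_ms < 10 and avg_ms < 10 and max_ms < 30)
```
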
  • Task Bytes Read and Bytes Written:
    •    Each datapoint plotted on the left axis is the bytes read by the readers from S3 (in KBs).
    •    In a minute, the readers can read from S3 many times and there could be multiple readers in a task (one per file type). There could be many tasks running concurrently. This is the average KBs read by all readers across all concurrently running tasks.
    •    Each datapoint plotted on the right axis is the bytes written by the task to Write Connector (in KBs).
    •    In a minute, the tasks can write to Write Connector multiple times. There could be many tasks running concurrently. This is the average KBs written per min across all concurrently running tasks.
    •    Users can look at these metrics to reason about the system's throughput and debug any issues that may arise from network read / write.
    •    We expect max / avg network throughput of XYZ / ABC from S3 file read. We expect max / avg network throughput of XYZ / ABC to Write Connector for each shard (multiply by number of shards).

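A minimal sketch of how the per-minute datapoints are formed (illustrative Python; all byte counts are hypothetical and are not the XYZ / ABC expectations referenced above):

```python
# Illustrative arithmetic for the Task Bytes Read and Bytes Written graph (hypothetical numbers).
# Both datapoints are averages across all concurrently running tasks for the minute, in KB.
kb_read_from_s3_per_task = [12_288, 10_240, 14_336, 11_264]       # KB read from S3 by each task
kb_written_to_connector_per_task = [2_048, 1_792, 2_304, 1_920]   # KB written to the Write Connector

avg_kb_read = sum(kb_read_from_s3_per_task) / len(kb_read_from_s3_per_task)                     # left axis
avg_kb_written = sum(kb_written_to_connector_per_task) / len(kb_written_to_connector_per_task)  # right axis

print(f"avg KB read from S3 per task:             {avg_kb_read:,.0f} KB/min")
print(f"avg KB written to Write Connector / task: {avg_kb_written:,.0f} KB/min")
```
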