# QuitoBench

QuitoBench is a comprehensive benchmark for evaluating time series forecasting models on billion-scale CloudOps data. This repository contains the official implementation accompanying the paper:

> **QuitoBench: A High-Quality Billion-Scale CloudOps Time Series Benchmark**
> Alipay
QuitoBench provides a unified framework that supports multiple state-of-the-art time series models including PatchTST, iTransformer, TSMixer, Crossformer, Pyraformer, and more. It offers a standardized interface for training, fine-tuning, evaluation, and hyperparameter tuning across different models and datasets.
## Table of Contents

- About
- Features
- Installation
- The QuitoBench Dataset
- Quick Start
- Benchmarking Protocol
- Supported Models
- Documentation
- Citation
- License
## About

This repository provides:
- Benchmark Dataset: Access to the QuitoBench dataset with billions of CloudOps time series observations
- Evaluation Framework: Standardized protocols for fair comparison across models
- Model Zoo: Pre-configured implementations of 10+ state-of-the-art forecasting models
- Quality Analysis Tools: Comprehensive dataset quality assessment utilities
- Baseline Results: Reference performance metrics for all benchmark tasks
QuitoBench aims to advance time series forecasting research by providing a large-scale, high-quality benchmark that reflects real-world CloudOps challenges.
## Features

- Billion-Scale Benchmark: High-quality CloudOps time series data at unprecedented scale
- Multiple Model Support: PatchTST, iTransformer, TSMixer, Crossformer, Pyraformer, DLinear, TiRex, Chronos, TimesFM
- Unified Interface: Consistent API across all models through YAML configuration
- Distributed Training: Multi-GPU and multi-node support via PyTorch DistributedDataParallel
- Hyperparameter Tuning: Built-in Ray Tune integration for efficient parameter search
- Dataset Quality Analysis: Comprehensive tools for evaluating time series data quality
- Zero-Shot Inference: Support for pre-trained foundation models (Chronos, TimesFM, TiRex)
- Reproducible Benchmarks: Standardized evaluation protocols and metrics
## Installation

### Requirements

- Python >= 3.11
- PyTorch >= 2.8.0
- CUDA (for GPU support)
```bash
git clone https://github.com/alipay/quito-10b.git
cd quito-10b
pip install -r requirements.txt
```

Install QuitoBench with CLI support:

```bash
pip install -e .
```

This will install the `quito-cli` command for easy access to all training and evaluation scripts.
For zero-shot inference with foundation models:

```bash
# Chronos-2
pip install chronos-forecasting

# TimesFM-2.5
# Follow instructions at: https://github.com/google-research/timesfm/tree/master

# TiRex-Zero
# Follow instructions at: https://github.com/NX-AI/tirex/tree/main

# Dataset quality analysis
pip install statsmodels arch matplotlib
```

Download the data from https://huggingface.co/collections/hq-bench/quitobench and place it in `examples/datasets/cluster_data`.

Place your Quito model checkpoints in `models/{model_name}`.
## The QuitoBench Dataset

QuitoBench is a billion-scale benchmark dataset derived from real-world CloudOps operations at Alipay. The dataset features:
- Scale: Billions of time series observations from production systems
- Quality: High-quality, curated data with comprehensive quality metrics
- Diversity: Multiple frequency patterns (hourly, daily, etc.) and characteristics
- Real-World: Actual CloudOps metrics from large-scale cloud infrastructure
- Forecasting Tasks: Multiple prediction horizons (96, 192, 336, 720 steps)
### Quality Metrics

Each time series in QuitoBench is evaluated using:
- Forecastability (0-1): Measures predictability based on spectral entropy
- Seasonality Strength (0-1): Quantifies seasonal pattern strength
- Stationarity: ADF test statistics for trend analysis
- Missing Data: Percentage and patterns of missing values
- Variability: Coefficient of variation and statistical properties
See docs/DATASET_QUALITY.md for detailed information.
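These metrics can be approximated in a few lines of NumPy. The sketch below is illustrative only and may differ from the benchmark's own implementation: it computes the missing-data percentage, the coefficient of variation, and a forecastability score defined here as one minus the normalized spectral entropy of the power spectrum.

```python
import numpy as np

def quality_metrics(x: np.ndarray) -> dict:
    """Toy versions of three QuitoBench-style quality metrics (illustrative only)."""
    missing_pct = float(np.isnan(x).mean() * 100)        # Missing Data (%)
    v = x[~np.isnan(x)]
    cv = float(np.std(v) / np.abs(np.mean(v)))           # Coefficient of Variation
    # Forecastability as 1 - normalized spectral entropy of the power spectrum
    psd = np.abs(np.fft.rfft(v - v.mean())) ** 2
    p = psd / psd.sum()
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum() / np.log(len(psd))
    return {"missing_pct": missing_pct, "cv": cv, "forecastability": float(1 - entropy)}

t = np.arange(512, dtype=float)
seasonal = np.sin(2 * np.pi * t / 24)                    # clean hourly pattern
noise = np.random.default_rng(0).normal(size=512)        # unpredictable series
print(quality_metrics(seasonal)["forecastability"] > quality_metrics(noise)["forecastability"])  # True
```

A strongly seasonal series concentrates its spectral power in a few frequencies, so it scores near 1; white noise spreads power uniformly and scores near 0.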
## Quick Start

Example: Evaluate Chronos on QuitoBench test data:

```bash
quito-cli evaluate --config_path configs/evaluate/chronos/config.yaml
```

Or run it from the `scripts` directory:

```bash
cd scripts
quito-cli evaluate --config_path ../configs/evaluate/chronos/config.yaml
```

Alternatively, you can run the scripts directly from the repository root:

```bash
# Pre-training with distributed training
torchrun --nproc_per_node 4 quito/scripts/pretrain.py \
    --config_path configs/pretrain/patchtst/config.yaml --use_gpu 1

# Fine-tuning
torchrun --nproc_per_node 4 quito/scripts/finetune.py \
    --config_path configs/finetune/patchtst/config.yaml --use_gpu 1

# Evaluation
python quito/scripts/evaluate.py \
    --config_path configs/evaluate/patchtst/config.yaml \
    --num_processes 2 --use_gpu 1

# Hyperparameter tuning
python quito/scripts/tune.py \
    --config_path configs/tune/patchtst/config.yaml \
    --tuning_config_path configs/tune/patchtst/tune_config.yaml \
    --num_processes 4 \
    --num_samples 100 \
    --use_gpu 1
```

### Configuration

All models use YAML configuration files. Example structure:
```yaml
data:
  common:
    seq_len: 512              # Input sequence length
    forecast_horizon: 96      # Prediction horizon
    features: "S"             # S: univariate, M: multivariate
    freq: "H"                 # H: hourly, D: daily, etc.
  datasets:
    - dataset_name: "my_dataset"
      file_name: "datasets/parquet_data/open_hour_train/data.parquet"

model:
  model_name: "patchtst"
  # Model-specific parameters

training:
  task_type: "pretrain"       # pretrain, finetune, evaluate
  num_epochs: 10
  batch_size: 32
  learning_rate: 0.001
  device: "cuda"
  num_gpus: 1
```

Configuration files are organized in `configs/`:

- `configs/pretrain/` - Pre-training configurations
- `configs/finetune/` - Fine-tuning configurations
- `configs/evaluate/` - Evaluation configurations
- `configs/tune/` - Hyperparameter tuning configurations
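Since the configuration is plain YAML, it maps directly onto nested dictionaries. A minimal sketch of parsing and sanity-checking such a file with PyYAML (`load_config` is a hypothetical helper written for this example, not part of the quito package; the field names follow the example above):

```python
import yaml  # PyYAML

CONFIG = """
data:
  common:
    seq_len: 512
    forecast_horizon: 96
training:
  batch_size: 32
  learning_rate: 0.001
"""

def load_config(text: str) -> dict:
    """Parse a QuitoBench-style YAML config and sanity-check key fields."""
    cfg = yaml.safe_load(text)
    common = cfg["data"]["common"]
    assert common["seq_len"] > 0 and common["forecast_horizon"] > 0
    return cfg

cfg = load_config(CONFIG)
print(cfg["training"]["batch_size"])  # 32
```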
## Supported Models

- PatchTST: Patch-based transformer for long-term forecasting
- iTransformer: Inverted transformer architecture
- TSMixer: MLP-based time series model
- Crossformer: Cross-dimension attention for multivariate forecasting
- Pyraformer: Pyramidal attention mechanism
- DLinear: Simple linear model baseline
- TSTransformer: Classic transformer for time series
- Chronos-2: Amazon's pre-trained foundation model
- TimesFM-2.5: Google's time series foundation model
- TiRex-Zero: NX-AI's zero-shot forecasting model
Note: Zero-shot models are for inference only and cannot be fine-tuned.
## Usage

### Pre-training

Train a model from scratch on your pre-training dataset:

```bash
quito-cli pretrain --config_path configs/pretrain/patchtst/config.yaml
```

This trains the model on unlabeled time series data to learn general patterns.
### Fine-tuning

Fine-tune a pre-trained model on specific downstream tasks:

```bash
quito-cli finetune --config_path configs/finetune/patchtst/config.yaml
```

Fine-tuning uses the TRAIN portion of your TRAIN/TEST split.
### Hyperparameter Tuning

Optimize hyperparameters using the TRAIN/VALID split:

```bash
quito-cli tune --config_path configs/tune/patchtst/config.yaml \
    --tuning_config_path configs/tune/patchtst/tune_config.yaml \
    --num_workers 4 \
    --num_samples 100
```

The tuning process uses Ray Tune for efficient hyperparameter search.
### Evaluation

Evaluate model performance on test data:

```bash
quito-cli evaluate --config_path configs/evaluate/patchtst/config.yaml --num_gpus 2
```

Evaluation computes forecasting metrics (MSE, MAE, etc.) on the TEST set.
## Dataset Quality Analysis

QuitoBench includes comprehensive tools for analyzing time series dataset quality:
```bash
# Analyze individual dataset
python examples/data_analysis/analyze_dataset_quality.py

# Compare multiple datasets
python examples/data_analysis/compare_datasets_quality.py

# Analyze your own parquet files
python examples/data_analysis/analyze_open_hour_train_quality.py \
    --max_length 5000 \
    --max_series_per_file 50 \
    --sampling_strategy uniform
```

Metrics reported:

- Forecastability (0-1): Predictability based on spectral entropy
- Seasonality Strength (0-1): Strength of seasonal patterns
- Missing Data: Percentage of missing values
- Coefficient of Variation: Relative variability
- ADF Statistic: Stationarity measure
See docs/DATASET_QUALITY.md for detailed information.
## Benchmarking Protocol

QuitoBench provides a standardized benchmarking protocol:
- Standardized Splits: Pre-defined train/validation/test splits for fair comparison
- Multiple Horizons: Evaluate across 96, 192, 336, and 720 step forecasts
- Comprehensive Metrics: MSE, MAE, MASE, MAPE, SMAPE, and domain-specific metrics
- Quality Stratification: Evaluate model performance across different data quality tiers
- Zero-Shot Evaluation: Test foundation models on unseen time series
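To make the multi-horizon setup concrete, here is a minimal, framework-agnostic sketch of slicing one series into (input, target) windows for each benchmark horizon. This is illustrative only; the actual data pipeline lives in `quito/datasets.py`, and `seq_len=512` is taken from the example configuration above.

```python
import numpy as np

def make_windows(series: np.ndarray, seq_len: int, horizon: int):
    """Slice a 1-D series into overlapping (input, target) forecasting windows."""
    n = len(series) - seq_len - horizon + 1
    inputs = np.stack([series[i : i + seq_len] for i in range(n)])
    targets = np.stack([series[i + seq_len : i + seq_len + horizon] for i in range(n)])
    return inputs, targets

series = np.arange(2000, dtype=float)
for horizon in (96, 192, 336, 720):      # the four QuitoBench horizons
    x, y = make_windows(series, seq_len=512, horizon=horizon)
    print(horizon, x.shape, y.shape)
```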
### Evaluation Metrics

QuitoBench uses the following metrics for comprehensive evaluation:
- MSE (Mean Squared Error): Standard squared error metric
- MAE (Mean Absolute Error): Absolute error metric
- MASE (Mean Absolute Scaled Error): Scale-independent metric
- MAPE (Mean Absolute Percentage Error): Percentage-based error
- SMAPE (Symmetric MAPE): Symmetric percentage error
- MASE-Leak: MASE with leakage considerations
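These metrics have standard definitions; a minimal NumPy sketch of a few of them is shown below. The benchmark's own implementations live in `quito/metrics.py` and may differ in details (MASE-Leak is omitted here).

```python
import numpy as np

def mse(y, yhat):
    return float(np.mean((y - yhat) ** 2))

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))

def smape(y, yhat):
    # Symmetric MAPE, in percent
    return float(np.mean(2 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat))) * 100)

def mase(y, yhat, y_train, m=1):
    """MASE: forecast MAE scaled by the in-sample seasonal-naive MAE (period m)."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return float(np.mean(np.abs(y - yhat)) / scale)

y_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_true = np.array([6.0, 7.0])
y_pred = np.array([5.0, 9.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))  # 2.5 1.5
```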
## Data Format

QuitoBench expects Parquet files with the following structure:
Required columns:
- timestamp: Time index
- value: Time series values
Optional columns:
- item_id: For multiple series in one file
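A minimal sketch of building a frame in this schema with pandas (the `item_id` values here are hypothetical; writing to Parquet requires an engine such as pyarrow):

```python
import pandas as pd

# Two hourly series in one frame, following the required/optional columns above.
df = pd.DataFrame({
    "item_id": ["cpu_usage"] * 3 + ["mem_usage"] * 3,
    "timestamp": list(pd.date_range("2024-01-01", periods=3, freq="h")) * 2,
    "value": [0.42, 0.47, 0.45, 0.81, 0.80, 0.83],
})

print(sorted(df.columns))                      # ['item_id', 'timestamp', 'value']
print(df.groupby("item_id").size().to_dict())  # {'cpu_usage': 3, 'mem_usage': 3}

# Writing requires a parquet engine (pyarrow or fastparquet), e.g.:
# df.to_parquet("data.parquet", index=False)
```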
Example dataset structure:

```
datasets/
└── parquet_data/
    └── open_hour_train/
        ├── hour_train_hour_p1.parquet
        ├── hour_train_hour_p2.parquet
        └── ...
```

Generate sample data:

```bash
python examples/data_analysis/create_data.py
```

## Distributed Training

```bash
# Using torchrun (recommended)
torchrun --nproc_per_node 4 quito/scripts/pretrain.py \
    --config_path configs/pretrain/patchtst/config.yaml --use_gpu 1

# Or using quito-cli
CUDA_VISIBLE_DEVICES=0,1,2,3 quito-cli pretrain \
    --config_path configs/pretrain/patchtst/config.yaml --num_processes 4
```

Multi-node training:

```bash
# Node 0 (master)
torchrun --nproc_per_node 4 --nnodes 2 --node_rank 0 \
    --master_addr master_ip --master_port 29500 \
    quito/scripts/pretrain.py --config_path configs/pretrain/patchtst/config.yaml --use_gpu 1

# Node 1 (worker)
torchrun --nproc_per_node 4 --nnodes 2 --node_rank 1 \
    --master_addr master_ip --master_port 29500 \
    quito/scripts/pretrain.py --config_path configs/pretrain/patchtst/config.yaml --use_gpu 1
```

## Examples

The `examples/data_analysis/` directory contains self-contained scripts:

- `create_data.py`: Generate synthetic time series data
- `analyze_dataset_quality.py`: Analyze dataset quality metrics
- `compare_datasets_quality.py`: Compare multiple datasets
- `analyze_open_hour_train_quality.py`: Analyze your own parquet files
- `build_cluster_files.py`: Build cluster-specific datasets
See examples/data_analysis/README.md for detailed information.
## Documentation

- pretrain.md: Pre-training guide
- finetune.md: Fine-tuning guide
- evaluate.md: Evaluation guide
- tune.md: Hyperparameter tuning guide
- DATASET_QUALITY.md: Dataset quality analysis guide
## Project Structure

```
quito-10b/
├── configs/            # YAML configuration files
│   ├── pretrain/       # Pre-training configs
│   ├── finetune/       # Fine-tuning configs
│   ├── evaluate/       # Evaluation configs
│   └── tune/           # Hyperparameter tuning configs
├── docs/               # Documentation
├── examples/           # Example scripts and data analysis tools
├── quito/              # Core package
│   ├── config/         # Configuration classes
│   ├── datasets.py     # Dataset loading
│   ├── metrics.py      # Evaluation metrics
│   ├── models/         # Model implementations
│   ├── trainers/       # Training logic
│   └── utils/          # Utilities
├── scripts/            # Main training scripts
│   ├── pretrain.py     # Pre-training script
│   ├── finetune.py     # Fine-tuning script
│   ├── evaluate.py     # Evaluation script
│   └── tune.py         # Hyperparameter tuning script
├── cli.py              # Command-line interface
├── pyproject.toml      # Package configuration
└── README.md           # This file
```
## Example Workflow

```bash
# Create sample data (from repo root)
python examples/data_analysis/create_data.py

# Analyze data quality (from repo root)
python examples/data_analysis/analyze_dataset_quality.py

# Pre-train model
quito-cli pretrain --config_path configs/pretrain/patchtst/config.yaml

# Fine-tune on your specific task
quito-cli finetune --config_path configs/finetune/patchtst/config.yaml

# Tune hyperparameters
quito-cli tune --config_path configs/tune/patchtst/config.yaml \
    --tuning_config_path configs/tune/patchtst/tune_config.yaml \
    --num_processes 4 \
    --num_samples 100

# Evaluate pre-trained foundation model
quito-cli evaluate --config_path configs/evaluate/chronos/config.yaml

# Evaluate multiple configs (from repo root using scripts directly)
for config in configs/evaluate/*/config.yaml; do
    python quito/scripts/evaluate.py --config_path $config --num_processes 2
done
```

## Troubleshooting

**ModuleNotFoundError**

```bash
# Install missing dependencies
pip install -r requirements.txt
pip install -r requirements-optional.txt  # For foundation models
```

**CUDA Out of Memory**
- Reduce `batch_size` in config
- Reduce `seq_len` or `forecast_horizon`
- Use gradient accumulation
- Enable mixed precision training

**FileNotFoundError: parquet file not found**

```bash
# Generate sample data first
python examples/data_analysis/create_data.py
```

**RuntimeError: element 0 of tensors does not require grad**

- This occurs with zero-shot models (Chronos, TimesFM, TiRex)
- Use `evaluate` instead of `pretrain` or `finetune`
## Performance Tips

- Use appropriate batch size: Start with 32 and adjust based on GPU memory
- Enable mixed precision: Set `use_amp: true` in config
- Use multiple GPUs: Leverage distributed training for faster training
- Analyze data quality first: Use quality analysis tools before training
- Start with small models: Test with DLinear or smaller models first
## Citation

If you use QuitoBench in your research, please cite:
## License

See LICENSE file for details.
## Acknowledgments

QuitoBench is developed and maintained by the Alipay Research Team. We thank all contributors who have helped make this benchmark possible. The dataset is derived from real-world CloudOps operations at Alipay, representing years of production system experience.
## Support

For issues and questions:

- Open an issue on GitHub Issues
- Check existing documentation in `docs/`
- Review examples in `examples/`
- Read the paper for detailed methodology and results
## Contributing

We welcome contributions to QuitoBench! Please:

- Fork the repository
- Create a feature branch (`git checkout -b feature/new-model`)
- Make your changes with clear commit messages
- Add tests and documentation as needed
- Submit a pull request

See CONTRIBUTING.md for detailed guidelines (if available).
QuitoBench - Advancing Time Series Forecasting Research with Billion-Scale CloudOps Data 📈
Developed by Alipay Research Team