This project aims to predict future high-frequency stock returns using past returns. The goal is to build, train, and evaluate a deep learning model capable of forecasting the next 10-minute return for a cross-section of stocks. The project includes data preprocessing, model training, performance evaluation, and a simulated trading backtest to assess the strategy's potential, accounting for factors like transaction costs.
This repository is submitted as part of the “Machine-Learning in Finance” course project.
The codebase is organized to be modular, clean, and reproducible, as per the project guidelines:
High-Frequency-Trading-with-Deep-Learning/
│
├── data/
│   ├── raw/                              # Original, untouched 10-minute frequency data
│   └── processed/                        # Processed data ready for modeling
│
├── notebooks/                            # Jupyter notebooks for exploration and prototyping
│
├── results/
│   ├── models/                           # Saved model weights
│   ├── figures/                          # All plots and charts for the final report
│   ├── tables/                           # All LaTeX tables for the final report
│   ├── parameters/                       # Saved parameters for benchmark models
│   └── predictions/                      # Saved out-of-sample predictions from benchmarks
│
├── src/
│   ├── __init__.py
│   ├── models/                           # PyTorch model definitions & datasets
│   ├── transformer/                      # Custom transformer components
│   ├── data_analysis.py                  # EDA and descriptive statistics pipeline
│   ├── linear_benchmarks.py              # OLS, Ridge, and Lasso model pipeline
│   ├── time_series_analysis.py           # ARIMA and GARCH model pipeline
│   ├── transformer_train_experiments.py  # Transformer training experiments
│   ├── strategy.py                       # Portfolio backtesting logic
│   └── utils.py                          # Helper functions used across modules
│
├── .gitattributes
├── .gitignore
├── main.py                               # Main controller to run pipeline stages
├── run.sh                                # Shell script to execute the full end-to-end pipeline
├── requirements.txt                      # Required Python libraries
└── README.md                             # This file
- Rui Azevedo rui.azevedoleitao@epfl.ch
- Nicolò Baldovin nicolo.baldovin@epfl.ch
- Emanuele Durante emanuele.durante@epfl.ch
- Alex Martínez alex.martinezdefrancisco@epfl.ch
- Filippo Passerini filippo.passerini@epfl.ch
- Letizia Seveso letizia.seveso@epfl.ch
Follow these steps to set up the project environment and run the pipeline:
git clone [your-repository-url]
cd high_frequency_project
You can set up the project environment using either Conda (recommended) or venv:
- Create the environment from the YAML file. This single command creates a new environment named ml_finance and installs all the required packages from the specified channels:
conda env create -f environment.yml
- Activate the environment:
conda activate ml_finance
Or alternatively, create a virtual environment with venv:
python -m venv venv
source venv/bin/activate # On Windows: `venv\Scripts\activate`
Conda is still the recommended option here:
conda create -n ml_finance python=3.11
conda activate ml_finance
Then install the dependencies:
pip install -r requirements.txt
- Place the raw data files in the "data/raw/high_10m" directory. All files inside that directory must be in the same "*.csv.gz" format to be processed correctly.
python main.py --load-data
This command will process the raw data files, generating the necessary processed data files in the "data/processed" directory. The processed data will be used for training and evaluation of the models.
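To illustrate what reading the raw "*.csv.gz" files can look like, here is a minimal sketch using only the Python standard library. It is not the project's actual loader: the function name `load_raw_files` and the added `source_file` field are hypothetical, and no assumption is made about the column names in the raw files.

```python
import csv
import gzip
from pathlib import Path


def load_raw_files(raw_dir):
    """Read every *.csv.gz file in raw_dir into a list of row dictionaries.

    Hypothetical helper for illustration; the real preprocessing lives in
    the project's own code and may differ.
    """
    rows = []
    # Sort the paths so the load order is deterministic across runs.
    for path in sorted(Path(raw_dir).glob("*.csv.gz")):
        # Text mode ("rt") decompresses the file on the fly while reading.
        with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Remember which file each row came from.
                row["source_file"] = path.name
                rows.append(row)
    return rows
```

Calling `load_raw_files("data/raw/high_10m")` would return one dictionary per CSV row, keyed by whatever header the files carry. Files in the directory that do not match "*.csv.gz" are skipped by the glob pattern.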
The project is controlled via main.py, which allows you to run each stage of the pipeline independently using flags.
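A flag-driven controller of this kind can be sketched with `argparse`. The flag names below are the ones documented in this README; everything else (the `STAGE_FLAGS` list, `build_parser`, `selected_stages`) is a hypothetical illustration and may not match how main.py is actually written.

```python
import argparse

# Stage flags named in this README, in the order they are documented.
STAGE_FLAGS = [
    "--load-data",
    "--data-analysis",
    "--train-benchmarks",
    "--evaluate-benchmarks",
    "--train-transformer",
    "--strategy",
]


def build_parser():
    """Create a parser with one independent on/off switch per stage."""
    parser = argparse.ArgumentParser(description="Run selected pipeline stages.")
    for flag in STAGE_FLAGS:
        # store_true means the stage runs only when its flag is present.
        parser.add_argument(flag, action="store_true",
                            help=f"Run the {flag[2:]} stage.")
    return parser


def selected_stages(argv):
    """Return the names of the requested stages, in documented order."""
    args = vars(build_parser().parse_args(argv))
    # argparse stores "--load-data" under the key "load_data".
    return [flag[2:] for flag in STAGE_FLAGS
            if args[flag[2:].replace("-", "_")]]
```

With this layout, `selected_stages(["--strategy", "--load-data"])` yields the stages in documented order regardless of the order typed on the command line, which keeps multi-flag invocations predictable.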
Run Data Analysis & EDA:
- This generates descriptive statistics, tables, and plots about the dataset.
python main.py --data-analysis
Train Benchmark Models:
- This runs the OLS, Ridge, Lasso, ARIMA, and GARCH models for all stocks, saving their parameters and predictions.
python main.py --train-benchmarks
Evaluate Benchmark Models:
- This uses the saved parameters to generate summary tables and figures for the benchmark models.
python main.py --evaluate-benchmarks
Train Transformer Model:
- This runs 15 training experiments on our transformer model, saving the best model weights and all evaluation metrics.
python main.py --train-transformer
Run Trading Strategy:
- This runs a portfolio optimization backtest using the predictions from all previously trained models. It simulates performance with different transaction costs and saves the resulting cumulative return plots.
python main.py --strategy
For a full list of commands and arguments, you can use the help flag:
python main.py --help
To ensure full reproducibility and execute the entire pipeline from data processing to the final backtest, use the provided shell script. This is the recommended method for generating the final results for the report.
sh run.sh
This script will execute all necessary stages in the correct sequence (e.g., data analysis, benchmark training, transformer training, evaluation, etc.).
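For readers who cannot run shell scripts, the same idea of running the stages one after another and stopping at the first failure can be sketched in Python. The stage order below is an assumption based on the order in which the flags appear in this README; the authoritative sequence is whatever run.sh defines. The names `PIPELINE_FLAGS`, `pipeline_commands`, and `run_pipeline` are hypothetical.

```python
import subprocess
import sys

# Assumed stage order (taken from this README, not from run.sh itself).
PIPELINE_FLAGS = [
    "--load-data",
    "--data-analysis",
    "--train-benchmarks",
    "--evaluate-benchmarks",
    "--train-transformer",
    "--strategy",
]


def pipeline_commands(python=sys.executable):
    """Build one 'python main.py <flag>' command per stage."""
    return [[python, "main.py", flag] for flag in PIPELINE_FLAGS]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later stages never run on top of a failed earlier stage.
        subprocess.run(command, check=True)
```

Calling `run_pipeline(pipeline_commands())` from the repository root would then launch the stages sequentially with the current interpreter.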
This project is licensed under the MIT License.