Machine Learning for High-Frequency Return Prediction

Description

This project aims to predict future high-frequency stock returns using past returns. The goal is to build, train, and evaluate a deep learning model capable of forecasting the next 10-minute return for a cross-section of stocks. The project includes data preprocessing, model training, performance evaluation, and a simulated trading backtest to assess the strategy's potential, accounting for factors like transaction costs.
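As a rough illustration of the prediction setup described above, the sketch below pairs each window of past returns with the return that follows it. The function name `make_lagged_samples` and the window length `n_lags` are hypothetical choices for this example; the actual feature construction in `src/` may differ.

```python
from typing import List, Tuple


def make_lagged_samples(
    returns: List[float], n_lags: int
) -> Tuple[List[List[float]], List[float]]:
    """Pair each window of n_lags past returns with the return that follows it."""
    features, targets = [], []
    for t in range(n_lags, len(returns)):
        features.append(returns[t - n_lags:t])  # strictly past information only
        targets.append(returns[t])              # the next 10-minute return
    return features, targets


# Toy series of 10-minute returns for a single stock (made-up numbers).
returns = [0.001, -0.002, 0.0005, 0.003, -0.001]
X, y = make_lagged_samples(returns, n_lags=3)
```

Each row of `X` contains only returns observed before the corresponding entry of `y`, which avoids look-ahead bias in this toy setup.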

This repository is submitted as part of the “Machine-Learning in Finance” course project.

Project Structure

The codebase is organized to be modular, clean, and reproducible, as per the project guidelines:

High-Frequency-Trading-with-Deep-Learning/
│
├── data/
│   ├── raw/                # Original, untouched 10-minute frequency data
│   └── processed/          # Processed data ready for modeling
│
├── notebooks/              # Jupyter notebooks for exploration and prototyping
│
├── results/
│   ├── models/             # Saved model weights
│   ├── figures/            # All plots and charts for the final report
│   ├── tables/             # All LaTeX tables for the final report
│   ├── parameters/         # Saved parameters for benchmark models
│   └── predictions/        # Saved out-of-sample predictions from benchmarks
│
├── src/
│   ├── __init__.py
│   ├── models/                             # PyTorch model definitions & datasets
│   ├── transformer/                        # Custom transformer components
│   ├── data_analysis.py                    # EDA and descriptive statistics pipeline
│   ├── linear_benchmarks.py                # OLS, Ridge, and Lasso model pipeline
│   ├── time_series_analysis.py             # ARIMA and GARCH model pipeline
│   ├── transformer_train_experiments.py    # Transformer training experiments
│   ├── strategy.py                         # Portfolio backtesting logic
│   └── utils.py                            # Helper functions used across modules
│
├── .gitattributes
├── .gitignore
├── environment.yml         # Conda environment definition
├── main.py                 # Main controller to run pipeline stages
├── run.sh                  # Shell script to execute the full end-to-end pipeline
├── requirements.txt        # Required Python libraries
└── README.md               # This file

Authors

Rui Azevedo (rui.azevedoleitao@epfl.ch)
Nicolò Baldovin (nicolo.baldovin@epfl.ch)
Emanuele Durante (emanuele.durante@epfl.ch)
Alex Martínez (alex.martinezdefrancisco@epfl.ch)
Filippo Passerini (filippo.passerini@epfl.ch)
Letizia Seveso (letizia.seveso@epfl.ch)

Quickstart

Follow these steps to set up the project environment and run the pipeline:

1. Clone the Repository

git clone [your-repository-url]
cd High-Frequency-Trading-with-Deep-Learning

2. Create and Activate a Virtual Environment

You can set up the project environment with either Conda (recommended) or venv. Pick one of the options below.

Option A, Conda from the YAML file: this single command creates a new environment named ml_finance and installs all the required packages from the specified channels.

conda env create -f environment.yml

Activate the environment:

conda activate ml_finance

Option B, venv:

python -m venv venv
source venv/bin/activate   # On Windows: `venv\Scripts\activate`

Option C, an empty Conda environment (the packages are then installed with pip in step 3):

conda create -n ml_finance python=3.11
conda activate ml_finance

3. Install dependencies

pip install -r requirements.txt

4. Process the Data

Place the raw data files in the "data/raw/high_10m" directory. All files in that directory must be in the "*.csv.gz" format to be processed correctly.

python main.py --load-data

This command will process the raw data files, generating the necessary processed data files in the "data/processed" directory. The processed data will be used for training and evaluation of the models.
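For orientation, here is a minimal sketch of how gzip-compressed CSV files in a directory can be read using only the Python standard library. The helper `load_raw_files` is hypothetical and is not the loader used by `main.py --load-data`; the column layout of the raw files is not documented here, so rows are kept as plain dictionaries keyed by whatever header each file has.

```python
import csv
import gzip
from pathlib import Path
from typing import Dict, List


def load_raw_files(raw_dir: str) -> Dict[str, List[dict]]:
    """Read every *.csv.gz file in raw_dir into a list of row dictionaries."""
    tables: Dict[str, List[dict]] = {}
    for path in sorted(Path(raw_dir).glob("*.csv.gz")):
        # "rt" opens the compressed file in text mode so csv can parse it.
        with gzip.open(path, mode="rt", newline="") as handle:
            tables[path.name] = list(csv.DictReader(handle))
    return tables
```

Calling `load_raw_files("data/raw/high_10m")` would return one entry per compressed file. Files with any other extension are skipped by the glob pattern, which is one reason to keep the directory in a uniform format.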

5. Run a pipeline stage

The project is controlled via main.py, which allows you to run each stage of the pipeline independently using flags.

Run Data Analysis & EDA:

This generates descriptive statistics, tables, and plots about the dataset.

python main.py --data-analysis

Train Benchmark Models:

This runs the OLS, Ridge, Lasso, ARIMA, and GARCH models for all stocks, saving their parameters and predictions.

python main.py --train-benchmarks
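To make the linear benchmarks more concrete, the sketch below fits a one-lag regression of each return on the previous one, with an optional L2 (Ridge) penalty. It is a deliberately simplified illustration: the single-lag feature, the absence of an intercept, and the names `fit_ar1_ridge` and `predict_next` are assumptions for this example, not a description of `src/linear_benchmarks.py`.

```python
from typing import List


def fit_ar1_ridge(returns: List[float], penalty: float = 0.0) -> float:
    """Slope of r[t] on r[t-1] without intercept; penalty=0 reduces to OLS."""
    x = returns[:-1]  # lagged returns
    y = returns[1:]   # returns one step ahead
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    # Closed-form Ridge solution for a single feature: Sxy / (Sxx + lambda).
    return sxy / (sxx + penalty)


def predict_next(returns: List[float], beta: float) -> float:
    """One-step-ahead forecast from the most recent observed return."""
    return beta * returns[-1]
```

A larger `penalty` shrinks the slope toward zero, which is the usual motivation for Ridge when the signal in past returns is weak relative to noise.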

Evaluate Benchmark Models:

This uses the saved parameters to generate summary tables and figures for the benchmark models.

python main.py --evaluate-benchmarks

Train Transformer Model:

This runs 15 training experiments on our transformer model, saving the best model weights and all evaluation metrics.

python main.py --train-transformer

Run Trading Strategy:

This runs a portfolio optimization backtest using the predictions from all previously trained models. It simulates performance with different transaction costs and saves the resulting cumulative return plots.

python main.py --strategy
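The effect of transaction costs on a backtest can be sketched as follows. This example assumes proportional costs charged on turnover (the sum of absolute weight changes at each rebalance) and ignores weight drift between rebalances; the function `net_cumulative_return` and the cost model are illustrative assumptions, not the logic in `src/strategy.py`.

```python
from typing import List


def net_cumulative_return(
    weights: List[List[float]],
    asset_returns: List[List[float]],
    cost_rate: float,
) -> float:
    """Compound per-period portfolio returns net of proportional trading costs.

    weights[t] are the weights held over period t, and asset_returns[t] are
    the asset returns realized over that same period.
    """
    previous = [0.0] * len(weights[0])  # start from an all-cash position
    wealth = 1.0
    for w, r in zip(weights, asset_returns):
        gross = sum(wi * ri for wi, ri in zip(w, r))
        turnover = sum(abs(wi - pi) for wi, pi in zip(w, previous))
        wealth *= 1.0 + gross - cost_rate * turnover
        previous = w
    return wealth - 1.0
```

Running the same weights with several values of `cost_rate` shows how quickly a high-turnover strategy loses its edge, which is the kind of comparison the cumulative return plots are meant to support.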

For a full list of commands and arguments, you can use the help flag:

python main.py --help
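The flag-based control described in this step could be wired up along the following lines with `argparse`. The flag names are the ones listed above; the helper names `build_parser` and `selected_stages` are hypothetical, and the real `main.py` may accept additional arguments or be organized differently.

```python
import argparse
from typing import List

# Stage flags as documented in this README, in pipeline order.
STAGE_FLAGS = [
    "load-data",
    "data-analysis",
    "train-benchmarks",
    "evaluate-benchmarks",
    "train-transformer",
    "strategy",
]


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with one boolean switch per pipeline stage."""
    parser = argparse.ArgumentParser(description="Run pipeline stages independently.")
    for flag in STAGE_FLAGS:
        parser.add_argument(f"--{flag}", action="store_true", help=f"Run the {flag} stage")
    return parser


def selected_stages(argv: List[str]) -> List[str]:
    """Return the requested stages, always in pipeline order."""
    args = build_parser().parse_args(argv)
    # argparse stores "--load-data" under the attribute name "load_data".
    return [flag for flag in STAGE_FLAGS if getattr(args, flag.replace("-", "_"))]
```

For example, `selected_stages(["--strategy", "--load-data"])` returns the two stages in pipeline order regardless of the order in which the flags were given.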

6. Run End-to-End Pipeline with `run.sh`

To ensure full reproducibility and execute the entire pipeline from data processing to the final backtest, use the provided shell script. This is the recommended method for generating the final results for the report.

sh run.sh

This script will execute all necessary stages in the correct sequence (e.g., data analysis, benchmark training, transformer training, evaluation, etc.).

License

This project is licensed under the MIT License.
