
Leveraging Data Aggregation and Transfer Learning for Enhanced Machine Learning Model Performance

Overview

This project explores whether data aggregation and transfer learning can improve machine learning model performance for global rental price prediction. Using publicly available Kaggle data, it compares market-specific baseline models against leave-one-out, data augmentation, and transfer learning approaches to assess how well each generalizes across diverse, region-specific rental markets.


Table of Contents

  1. Introduction
  2. Features
  3. Data
  4. Methodologies
  5. Results
  6. Contributing
  7. License

Introduction

This project addresses the challenges of market-specific machine learning models, including data limitations and fragmented approaches. It aims to test the hypothesis that a global model trained on aggregated data can outperform isolated, market-specific models.

Project Structure

Leveraging-Data-Aggregation-and-Transfer-Learning/

├── workspace/                          
│   ├── data/  
│   │   ├── 0_raw_data/
│   │   │   └── cleaning_NB.ipynb
│   │   ├── 1_clean_data/
│   │   │   ├── clean_merge.ipynb
│   │   │   ├── feature_engineering.ipynb
│   │   │   └── scraper.py
│   │   └── 2_working_data/
│   │       ├── experimental_notebooks/
│   │       │   ├── encoder_nb.ipynb
│   │       │   └── experiment.ipynb
│   │       ├── 3_encoded_combined_data.csv
│   │       ├── standardize_target.ipynb
│   │       └── total_merge.ipynb
│   └── src/
│       ├── models/
│       │   ├── 0_local_model.py
│       │   ├── 1_leave_one_out_model.py
│       │   ├── 2_data_augmentation_model.py
│       │   └── 3_transfer_learning_model.py
│       └── utils/
│           ├── data_utils.py
│           ├── evaluation.py
│           ├── mlflow_utils.py
│           └── model_utils.py
├── .gitignore      
├── LICENSE.txt                                       
├── README.md                      
└── requirements.txt   

Key Objectives:

  • Improve prediction accuracy through data aggregation and transfer learning.
  • Adapt open-source rental price data to construct a global prediction model.

Features

  • Data Aggregation: Combines datasets from 13 countries to build a global dataset.
  • Transfer Learning: Pretrains a global model and fine-tunes it for regional data.
  • Advanced Metrics: Evaluates models with mean absolute error (MAE) and R² to capture both the typical size of prediction errors and the share of variance explained.
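
As a rough illustration of the two metrics named above, the following minimal sketch computes MAE and R² in plain Python. The function names and the sample rents are hypothetical and chosen only for illustration; they are not taken from the project's evaluation.py.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical monthly rents in EUR, for illustration only.
observed = [800.0, 1200.0, 1000.0, 1400.0]
predicted = [900.0, 1100.0, 1000.0, 1300.0]

mae = mean_absolute_error(observed, predicted)  # typical error in EUR
r2 = r2_score(observed, predicted)              # share of variance explained
print(f"MAE: {mae:.1f} EUR, R2: {r2:.2f}")
```

MAE is expressed in the target's unit (here EUR), which makes it easy to interpret per market, while R² is unitless and easier to compare across markets with different price levels.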

Data

The global dataset aggregates 13 publicly available rental property datasets sourced from Kaggle, covering countries such as Germany, India, and the USA. Features were standardized (e.g., converting local currencies to EUR and adjusting for inflation) so that records from different markets are comparable.
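
The standardization step could look roughly like the sketch below. It is a simplified assumption, not the project's actual pipeline: the exchange rates, inflation factors, field names, and the `standardize_rent` helper are all hypothetical placeholders.

```python
# Hypothetical exchange rates (EUR per one unit of local currency) and
# cumulative inflation factors relative to an assumed base year (2022).
# These are placeholder values, not the ones used in the project.
EUR_PER_UNIT = {"EUR": 1.0, "USD": 0.9, "INR": 0.01}
INFLATION_TO_BASE_YEAR = {2020: 1.10, 2022: 1.00}


def standardize_rent(amount, currency, year):
    """Convert a local-currency rent to EUR, then scale it to base-year prices."""
    rent_eur = amount * EUR_PER_UNIT[currency]
    return rent_eur * INFLATION_TO_BASE_YEAR[year]


# Hypothetical listings, one per example country.
listings = [
    {"country": "Germany", "rent": 1000.0, "currency": "EUR", "year": 2022},
    {"country": "USA", "rent": 2000.0, "currency": "USD", "year": 2020},
    {"country": "India", "rent": 50000.0, "currency": "INR", "year": 2022},
]

for listing in listings:
    listing["rent_eur"] = standardize_rent(
        listing["rent"], listing["currency"], listing["year"]
    )
    print(listing["country"], round(listing["rent_eur"], 2))
```

This sketch applies a single inflation factor after currency conversion; a real pipeline would need to decide whether to deflate in local currency first and which exchange-rate date to use.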


Methodologies

  • Baseline Model: Individual market-specific models for comparison.
  • Leave One Out (LOO): Trains on every region except the target and evaluates on the held-out region, testing generalization without region-specific data.
  • Data Augmentation: Enriches local data with global information.
  • Transfer Learning: Pretraining on global data and fine-tuning for regional dynamics.
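
One way to see how the four approaches differ is by the training data each one uses for a given target market. The sketch below is an assumption about that data layout, not the code in src/models/: `build_training_sets`, the `enrichment_fraction` parameter, and the choice to exclude the target market from the pretraining set are all hypothetical.

```python
def build_training_sets(data_by_country, target, enrichment_fraction=1.0):
    """Return the training rows each strategy would use for one target market.

    data_by_country maps a country name to a list of training rows.
    enrichment_fraction (hypothetical) controls how much out-of-market data
    is added to the local data in the augmentation setting.
    """
    local_rows = list(data_by_country[target])
    other_rows = [
        row
        for country, rows in data_by_country.items()
        if country != target
        for row in rows
    ]
    # A real pipeline would likely sample at random; slicing keeps the
    # sketch deterministic.
    n_extra = int(len(other_rows) * enrichment_fraction)
    return {
        "baseline": local_rows,                  # target market only
        "leave_one_out": other_rows,             # every market except the target
        "data_augmentation": local_rows + other_rows[:n_extra],
        "transfer_learning": {
            "pretrain": other_rows,              # assumed: global data without target
            "fine_tune": local_rows,             # then adapt on the target market
        },
    }


# Hypothetical rows; strings stand in for feature/target records.
data = {
    "Germany": ["de_1", "de_2"],
    "India": ["in_1"],
    "USA": ["us_1", "us_2", "us_3"],
}
sets = build_training_sets(data, "India", enrichment_fraction=0.5)
for name, rows in sets.items():
    print(name, rows)
```

In every setting the model would be evaluated on held-out data from the target market, so the comparison isolates the effect of the training data rather than the test set.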

Results

  • Baseline Performance: Highlighted variability in prediction accuracy across markets.
  • LOO: Demonstrated the need for region-specific data.
  • Data Augmentation: Showed mixed results depending on the enrichment level.
  • Transfer Learning: Achieved incremental improvements in most regions, underscoring its potential.

License

This project is licensed under the MIT License. See LICENSE.txt for details.