Claudio Maggioni 07232eddcc Final version of the bug-triaging project Commit history has been discarded to remove large files from the repo.		2024-01-03 15:22:56 +01:00

Assignment 1: Automated Bug Triaging

Group 2: Baris Aksakal, Edoardo Riggio, Claudio Maggioni

Repository structure

/docs: LaTeX report code;
/out
- /csv: Cleaner output;
- /json: Scraper output;
- /model: Pickled models (model training output) and model evaluation output;
- /plots: Plots for the dataset statistical analysis;
/src
- /analysis: Notebook for the dataset statistical analysis;
- /model-dl
  - /bert_medium.ipynb: Original implementation of the classifier model, now broken down into Python files;
  - /model*.ipynb: Alternative model implementation by Baris Aksakal. Not used in the final implementation;
- /{cleaner,modelimpl,scraper}: Python modules used for scraper, cleaner, and model script implementation;
- /auc.py: ROC curve generation script;
- /clean.py: Cleaner script;
- /runmodel.py: Model execution script;
- /scrape.py: Scraper script;
- /trainmodel.py: Model training script;
/environment-dev.yml: Conda environment file for development environment;
/environment-server.yml: Conda environment file for model training and execution (to be used with gym.si.usi.ch).

Setup

Conda Environment

Training and running models is only supported on a CUDA 11.6-compatible environment such as gym.si.usi.ch. The following commands create and activate a Conda environment with all the dependencies required to scrape, clean, train and run the model:

conda env remove -n bug-triaging-env || true  # delete environment if already present
conda env create --name bug-triaging-env --file=environment-server.yml
conda activate bug-triaging-env

Development environment

(may not work on all platforms/architectures)

A pytorch-free version of the environment can be installed for development purposes. Only the scraper and cleaner scripts can be run using this environment. To install the development environment, run:

conda env remove -n bug-triaging-env-dev || true  # delete environment if already present
conda env create --name bug-triaging-env-dev --file=environment-dev.yml
conda activate bug-triaging-env-dev

GitHub API token

To run the scraper and the model executor, a GitHub API token is needed. The token must be placed in a .env file in this directory, in a variable named GITHUB_TOKEN. The contents of the file should look like this:

GITHUB_TOKEN=<insert-token-here>
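
As an illustration of how such a file can be consumed, here is a minimal sketch that parses KEY=VALUE lines using only the standard library. The `load_env` helper is hypothetical and is not necessarily how the scripts in this repository read the token.

```python
import tempfile
from pathlib import Path


def load_env(path):
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Demonstration with a throwaway file; a real run would read the .env file
# placed in the repository directory.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text("# GitHub credentials\nGITHUB_TOKEN=example-token\n", encoding="utf-8")
    env = load_env(env_file)

github_token = env["GITHUB_TOKEN"]
```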

Scraper

The scraper script is located in src/scrape.py and takes no arguments. It downloads all issues in the microsoft/vscode repository and saves them as JSON files, one per issue, in a gzip-compressed archive at out/json/issues.tar.gz. If the archive already exists, it is deleted first.

To run the scraper, run:

python3 src/scrape.py
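
The archive layout described above (one JSON file per issue inside a gzip-compressed tar) can be sketched with the standard `tarfile` module. This is an illustration only: the member naming scheme `<number>.json` and the helper names are assumptions, not taken from the scraper's code.

```python
import io
import json
import tarfile


def write_issues(archive, issues):
    """Store each issue dict as '<number>.json' in a gzip-compressed tar (naming is assumed)."""
    with tarfile.open(fileobj=archive, mode="w:gz") as tar:
        for issue in issues:
            payload = json.dumps(issue).encode("utf-8")
            info = tarfile.TarInfo(name=f"{issue['number']}.json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def read_issues(archive):
    """Return every issue stored in the archive, decoding one JSON member at a time."""
    issues = []
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                issues.append(json.load(tar.extractfile(member)))
    return issues


# Round trip through an in-memory buffer instead of out/json/issues.tar.gz.
buffer = io.BytesIO()
write_issues(buffer, [{"number": 1, "title": "First"}, {"number": 2, "title": "Second"}])
buffer.seek(0)
loaded = read_issues(buffer)
```

Reading members one at a time keeps memory usage proportional to a single issue rather than to the whole dataset, which matters for an archive of this size.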

Cleaner

The cleaner script is located in src/clean.py and takes no arguments. It reads out/json/issues.tar.gz, performs the cleaning process, and then performs the train-test split according to the instructions given in the assignment document. The output is saved in three CSV files and one text file:

out/csv/issues_train_000001_170000.csv, including all issues that belong to the complete training set;
out/csv/issues_train_recent_150000_170000.csv, including all issues that belong to the training set made up of "recent" issues;
out/csv/issues_test_170001_180000.csv, including all issues that belong to the test set;
out/csv/issues_removed_count.txt, including the count of issues (excluding PRs) that were discarded by the cleaning process in the entire dataset.

The script will overwrite these files if they exist. To run the cleaner script run:

python3 src/clean.py
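
The output file names suggest that the split is driven by issue number. The following sketch shows that idea under two assumptions: the ranges are read off the file names, and their bounds are inclusive. It does not reproduce the cleaning rules themselves, and `subsets_for` is a hypothetical helper.

```python
# Issue-number ranges as they appear in the CSV file names (bounds assumed inclusive).
RANGES = {
    "train": (1, 170_000),
    "train_recent": (150_000, 170_000),
    "test": (170_001, 180_000),
}


def subsets_for(issue_number):
    """Return the names of every subset whose range contains the issue number."""
    return [name for name, (low, high) in RANGES.items() if low <= issue_number <= high]


old_issue = subsets_for(100)
recent_issue = subsets_for(160_000)
test_issue = subsets_for(175_000)
out_of_range = subsets_for(180_001)
```

Note that the "recent" training set is a subset of the complete training set, so a recent issue belongs to both.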

Training script

The script used to train the model is located in src/trainmodel.py. The script takes the following arguments:

usage: trainmodel.py [-h] [-r LEARNING_RATE] [-c] [-f] {all,recent} epochs

Training and evaluation script. The script will train and save the obtained model and then perform test set evaluation.
If the given parameters match with a model that was already saved, the script only runs the evaluation procedure.

positional arguments:
  {all,recent}          The dataset to train with
  epochs                Number of epochs of the training process

options:
  -h, --help              show this help message and exit
  -r LEARNING_RATE, --learning-rate LEARNING_RATE
                          The learning rate fed in the Adam optimizer
  -c, --force-cpu         disables CUDA support. Useful when debugging
  -f, --force-retraining  forces training of a new model even if a matching model is already found within the saved 
                          models

The script loads the generated CSV datasets in out/csv and will output three files in out/model:

out/model/bug_triaging_{all,recent}_{epochs}e_{LEARNING_RATE}lr_final.pt, the pytorch "pickled" model;
out/model/bug_triaging_{all,recent}_{epochs}e_{LEARNING_RATE}lr_final.label_range.txt, a text file containing two lines which determine the numeric range of classification labels outputted by the model (this file is used when using the ROC and model execution scripts);
out/model/bug_triaging_{all,recent}_{epochs}e_{LEARNING_RATE}lr_final.labels.csv, a CSV file matching the assignee usernames with the numeric encoding used to train and execute the model with (this file is used when using the ROC and model execution scripts).

({all,recent}, {epochs} and {LEARNING_RATE} are placeholders whose value will match the parameters given to the script)
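
To make the naming scheme concrete, here is a small sketch that builds the three paths from the script parameters. It assumes the learning rate is rendered with Python's default float formatting, which is consistent with the `5e-06` that appears in the file names used elsewhere in this README; the function names are hypothetical.

```python
def model_basename(dataset, epochs, learning_rate):
    """Build the common file name prefix shared by the three training outputs."""
    if dataset not in ("all", "recent"):
        raise ValueError(f"unknown dataset: {dataset!r}")
    # Assumption: the learning rate is formatted as a Python float, so '5e-6' becomes '5e-06'.
    return f"bug_triaging_{dataset}_{int(epochs)}e_{float(learning_rate)}lr_final"


def output_files(dataset, epochs, learning_rate):
    """Return the model, label range, and labels CSV paths for a configuration."""
    base = model_basename(dataset, epochs, learning_rate)
    return [
        f"out/model/{base}.pt",
        f"out/model/{base}.label_range.txt",
        f"out/model/{base}.labels.csv",
    ]


paths = output_files("all", 4, "5e-6")
```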

To train the configurations that were chosen for the report execute:

python3 src/trainmodel.py all 4 -r '5e-6'
python3 src/trainmodel.py recent 4 -r '5e-6'

NOTE: The pickled pytorch model files have not been committed to this repo due to file size restrictions. They are however saved in gym.si.usi.ch:/home/SA23-G2/bug-triaging/out/model.

ROC curve generation script

The script used to generate the ROC curves is located in src/auc.py. The script takes the following arguments:

usage: auc.py [-h] [-c] modelfile

ROC curve and AUC computation script. The script evaluates the given model against the test set and generates a OvR ROC
curve plot with one curve per class, a micro-averaged OvR ROC plot and the corresponding AUC value.

positional arguments:
  modelfile        Path to the pickled pytorch model to classify the issue with

options:
  -h, --help       show this help message and exit
  -c, --force-cpu  disables CUDA support. Useful when debugging

modelfile must contain a path to one of the .pt files generated with the training script. The label range text file and the labels CSV file are assumed to be in the same directory as the pickled model.
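
The lookup of the two companion files can be sketched as follows. `companion_files` is a hypothetical helper that only derives the expected paths from the naming scheme of the training script; it does not check that the files exist.

```python
from pathlib import Path


def companion_files(modelfile):
    """Derive the label range and labels CSV paths that sit next to a .pt model file."""
    model_path = Path(modelfile)
    if model_path.suffix != ".pt":
        raise ValueError(f"expected a .pt file, got: {modelfile}")
    stem = model_path.stem  # file name without the trailing '.pt'
    return {
        "label_range": model_path.with_name(stem + ".label_range.txt"),
        "labels": model_path.with_name(stem + ".labels.csv"),
    }


files = companion_files("out/model/bug_triaging_all_4e_5e-06lr_final.pt")
```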

The script outputs two PNG plots and a text file:

out/model/{model}.ovr_curves.png contains a plot of the One-vs-Rest ROC curves for each class (assignee) appearing both in the train and test set;
out/model/{model}.ovr_avg.png contains a plot of the micro-averaged One-vs-Rest ROC curve;
out/model/{model}.auc.txt contains the AUC for the micro-average ROC curve.

({model} is a placeholder for the filename without extension - the output of the shell command basename {modelfile} .pt - for the pickled pytorch model given as argument)
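
For readers unfamiliar with micro-averaging, the sketch below computes the AUC of a micro-averaged One-vs-Rest ROC curve in plain Python. It is an illustration of the metric, not the code in src/auc.py: every (sample, class) pair is treated as one binary decision, and the AUC is obtained as the probability that a positive pair receives a higher score than a negative pair, with ties counted as one half.

```python
def micro_average_auc(true_labels, scores):
    """AUC of the micro-averaged One-vs-Rest ROC curve.

    true_labels: one class index per sample.
    scores: for each sample, a list with one score per class.
    """
    # One-vs-Rest binarization, flattened across all classes.
    positives, negatives = [], []
    for label, row in zip(true_labels, scores):
        for class_index, score in enumerate(row):
            (positives if class_index == label else negatives).append(score)
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")
    # Pairwise comparison; quadratic, which is fine for a small illustration.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Every true class gets the highest score in its row: perfect ranking.
perfect = micro_average_auc([0, 1, 2], [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.2, 0.2, 0.6]])
# The first sample is misclassified: 3 of the 4 positive/negative pairs are ordered correctly.
partial = micro_average_auc([0, 1], [[0.4, 0.6], [0.3, 0.7]])
```

Because all classes are pooled before ranking, frequent assignees weigh more in this metric than they would in a macro average, where each class contributes equally.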

To generate the curves for the two trained models run:

python3 src/auc.py out/model/bug_triaging_all_4e_5e-06lr_final.pt
python3 src/auc.py out/model/bug_triaging_recent_4e_5e-06lr_final.pt

Execution script

The script used to run the model is located in src/runmodel.py. The script takes the following arguments:

usage: runmodel.py [-h] [-t TOP] [-c] modelfile issue_id

Model execution script. Downloads a given issue id from the microsoft/vscode repository, performs the cleaning process 
and recommends an assignee using the given model. The script may fail if the issue title and body do not contain any 
latin characters.

positional arguments:
  modelfile          Path to the pickled pytorch model to classify the issue with
  issue_id           The microsoft/vscode GitHub issue id to classify

options:
  -h, --help         show this help message and exit
  -t TOP, --top TOP  Number of recommendations to output
  -c, --force-cpu    disables CUDA support. Useful when debugging

By default, the script outputs the top-5 assignee recommendations for the given issue (the number can be changed with -t), along with the actual assignee if the issue has already been assigned.

Alongside each assignee, the script outputs the corresponding numeric encoding. A numeric encoding equal to -1 in the truth label denotes that the assignee does not appear in the training set (after the train/validation split).

The script also outputs the number of commits each assignee authored in the repository.

This is an example of the script output for issue 192213:

1: 'roblourens' (44) (confidence: 16.37%) (3932 commits authored)
2: 'lramos15' (36) (confidence: 12.62%) (829 commits authored)
3: 'bpasero' (16) (confidence: 7.29%) (11589 commits authored)
4: 'jrieken' (32) (confidence: 4.53%) (9726 commits authored)
5: 'hediet' (28) (confidence: 3.84%) (1231 commits authored)
Truth: 'alexdima' (9) (6564 commits authored)
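
A ranking of this shape can be produced from the raw model outputs as sketched below. The sketch assumes that the model emits one logit per class and that the reported confidence is the softmax probability; both are assumptions, and the function names and usernames are placeholders.

```python
import math


def recommend(logits, labels, top=5):
    """Return the `top` (username, class index, confidence) triples, most confident first."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: exps[i], reverse=True)
    return [(labels[i], i, exps[i] / total) for i in ranked[:top]]


def encode_truth(assignee, labels):
    """Numeric encoding of the actual assignee, or -1 if unseen during training."""
    return labels.index(assignee) if assignee in labels else -1


labels = ["alice", "bob", "carol"]  # placeholder usernames, index = numeric encoding
ranking = recommend([1.0, 3.0, 2.0], labels, top=2)
for position, (username, index, confidence) in enumerate(ranking, start=1):
    print(f"{position}: '{username}' ({index}) (confidence: {confidence:.2%})")

known = encode_truth("carol", labels)
unknown = encode_truth("dave", labels)
```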

To execute the model trained on the all dataset for issue 192213, run:

python3 src/runmodel.py out/model/bug_triaging_all_4e_5e-06lr_final.pt 192213

To execute the model trained on the recent dataset for issue 192213, run:

python3 src/runmodel.py out/model/bug_triaging_recent_4e_5e-06lr_final.pt 192213

Report

To compile the report run:

cd docs
pdflatex -interaction=nonstopmode -output-directory=. main.tex
pdflatex -interaction=nonstopmode -output-directory=. main.tex