A quick tutorial on how to use ESPnet
Basic directory structure
Data format: dump/raw/<subset-name>/
Task definition: espnet2/tasks/enh.py
Model definition: espnet2/enh/
Model configuration: conf/tuning/xxx.yaml
Data format
***** Data directory used for training/evaluation 👆
***** SLURM config (cmd_backend: 'local' / 'slurm' /...)
***** Environment config file (export XXX=YYY)
***** Common script used for speech enhancement
***** Entry script for running the recipe
** Python scripts used for training/inference
** Used for toolkit installation
```sh
cd <espnet-root-dir>/egs2/urgent24/enh1
./run.sh --stage 1 --stop-stage 1
mkdir -p dump/raw
cp -r data/* dump/raw/
```
Model configuration
***** Data preparation scripts (used by run.sh)
***** Directory generated by local/data.sh
***** Data directory used for training/evaluation
***** Exp directory for model training/inference
****** Length stats used for minibatch construction
***** SLURM config file ('local', 'slurm', etc.)
***** Environment config file (export XXX=YYY)
***** Common script used for speech enhancement
***** Entry script for running the recipe
** Python scripts used for training/inference
** Used for toolkit installation
```sh
cd <espnet-root-dir>/egs2/urgent24/enh1
./run.sh --stage 5 --stop-stage 5 --nj 8
```
Model configuration
***** Data preparation scripts (used by run.sh)
***** Directory generated by local/data.sh
***** Data directory used for training/evaluation
***** Exp directory for model training/inference
****** Trained model directory
***** SLURM config file ('local', 'slurm', etc.)
***** Environment config file (export XXX=YYY)
***** Common script used for speech enhancement
***** Entry script for running the recipe
** Python scripts used for training/inference
** Used for toolkit installation
```sh
cd <espnet-root-dir>/egs2/urgent24/enh1
./run.sh --stage 6 --stop-stage 6 --enh_config conf/tuning/xxx.yaml --num_nodes 1 --ngpu 1
```
Model configuration
***** Data preparation scripts (used by run.sh)
***** Directory generated by local/data.sh
***** Data directory used for training/evaluation
***** Exp directory for model training/inference
****** Trained model directory
****** Training curves
****** Latest checkpoint
****** Best checkpoint (when finished)
****** Training log
***** SLURM config file ('local', 'slurm', etc.)
***** Environment config file (export XXX=YYY)
***** Common script used for speech enhancement
***** Entry script for running the recipe
** Python scripts used for training/inference
** Used for toolkit installation
Model configuration
***** Data preparation scripts (used by run.sh)
***** Directory generated by local/data.sh
***** Data directory used for training/evaluation
***** Exp directory for model training/inference
****** Trained model directory
******* Enhanced audios (spk1.scp)
***** SLURM config file ('local', 'slurm', etc.)
***** Environment config file (export XXX=YYY)
***** Common script used for speech enhancement
***** Entry script for running the recipe
** Python scripts used for training/inference
** Used for toolkit installation
```sh
cd <espnet-root-dir>/egs2/urgent24/enh1
./run.sh --stage 7 --stop-stage 7 --enh_config conf/tuning/xxx.yaml --inference_nj 8 --gpu_inference true
```
Model configuration
***** Data preparation scripts (used by run.sh)
***** Directory generated by local/data.sh
***** Data directory used for training/evaluation
***** Exp directory for model training/inference
****** Trained model directory
******* Enhanced audios (spk1.scp)
******** Enhanced audio directory
******** list of all enhanced audios
***** SLURM config file ('local', 'slurm', etc.)
***** Environment config file (export XXX=YYY)
***** Common script used for speech enhancement
***** Entry script for running the recipe
** Python scripts used for training/inference
** Used for toolkit installation
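After stage 7 finishes, the generated spk1.scp lists all enhanced audios and can be iterated over programmatically. Below is a small sketch; `exp/xxx/enhanced_test/spk1.scp` is a placeholder path, since the actual experiment and subset directory names depend on your configuration.

```python
import soundfile as sf

# Placeholder path: the actual enhanced directory name depends on the
# experiment/config and the name of the evaluation subset.
scp_path = "exp/xxx/enhanced_test/spk1.scp"

with open(scp_path) as f:
    for line in f:
        uid, wav_path = line.strip().split(maxsplit=1)
        audio, fs = sf.read(wav_path)
        print(uid, fs, audio.shape)
```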
Task definition
*** Speech enhancement model definition 👉 Model definition
**** Available encoders (STFT, Conv1d, etc.)
**** Available separators (BSRNN, TF-GridNet, etc.)
**** Available decoders (iSTFT, ConvTranspose1d, etc.)
**** Detailed layer definitions used in separators
*** Loss functions
**** Criterion functions (time / time-frequency domain)
**** Wrappers (fixed permutation, PIT, etc.)
*** Common framework of speech enhancement models
** Used for toolkit installation
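To get a feel for these building blocks, the sketch below instantiates an STFT encoder/decoder pair and passes a random waveform through them. The constructor arguments are assumptions based on espnet2/enh/encoder/stft_encoder.py and espnet2/enh/decoder/stft_decoder.py; check those files for the exact signatures.

```python
import torch

from espnet2.enh.decoder.stft_decoder import STFTDecoder
from espnet2.enh.encoder.stft_encoder import STFTEncoder

encoder = STFTEncoder(n_fft=512, hop_length=128)
decoder = STFTDecoder(n_fft=512, hop_length=128)

speech = torch.randn(1, 16000)        # (batch, num_samples): 1 second at 16 kHz
ilens = torch.tensor([16000])

spec, flens = encoder(speech, ilens)  # complex spectrum: (batch, frames, freq)
wav, wlens = decoder(spec, flens)     # waveform reconstructed by iSTFT
print(spec.shape, wav.shape)
```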
In speech enhancement recipes, we need to prepare the following seven files:
In each data file, the data format follows the Kaldi convention, i.e., a table-style format with the first column as the utterance ID (key) and the remaining columns as the value.
File name | Example content | Note |
---|---|---|
wav.scp | uid1 /path/to/noisy/uid1.flac | Essential |
spk1.scp | uid1 /path/to/clean/uid1.flac | Needed in stage 8 |
text | uid1 It is also very valuable. | Needed in stage 8+ |
utt2spk | uid1 sid1 | Essential |
spk2utt | sid1 uid1 uid7 uid16 uid99 | Essential |
utt2fs | uid1 32000 | Essential for multi-fs data |
utt2category | uid1 1ch_32000Hz | Essential for multi-fs data |
Note that utt2fs is required to run model inference in stage 7 of run.sh so that the enhanced audio can be stored in the correct sampling rate.
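The spk2utt file is simply the inverse mapping of utt2spk. If you already have utt2spk, a small sketch like the following can derive spk2utt from it (Kaldi-style recipes typically also ship an equivalent utils/utt2spk_to_spk2utt.pl):

```python
from collections import defaultdict

spk2utt = defaultdict(list)
with open("utt2spk") as f:  # path relative to the data directory
    for line in f:
        uid, sid = line.strip().split(maxsplit=1)
        spk2utt[sid].append(uid)

with open("spk2utt", "w") as f:
    for sid in sorted(spk2utt):
        f.write(f"{sid} {' '.join(sorted(spk2utt[sid]))}\n")
```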
For the officially released validation/test subset (downloaded audios), you will need to generate the above data files manually. To manually generate data files for an audio directory `audios/`, you could run the following commands (assuming all file names are unique):
```sh
mkdir -p dump/raw/<subset-name>
find audios/ -iname '*.flac' | awk -F'[/.]' '{print($(NF-1)" "$0)}' | \
    sort -u > dump/raw/<subset-name>/wav.scp
find audios/ -iname '*.flac' | awk -F'[/.]' '{print($(NF-1)" "$(NF-1))}' | \
    sort -u > dump/raw/<subset-name>/utt2spk
find audios/ -iname '*.flac' | awk -F'[/.]' '{print($(NF-1)" "$(NF-1))}' | \
    sort -u > dump/raw/<subset-name>/spk2utt
python -c '
import soundfile as sf

with open("dump/raw/<subset-name>/utt2fs", "w") as f1:
    with open("dump/raw/<subset-name>/utt2category", "w") as f2:
        with open("dump/raw/<subset-name>/wav.scp", "r") as f3:
            for line in f3:
                uid, path = line.strip().split(maxsplit=1)
                info = sf.info(path)
                f1.write(f"{uid} {info.samplerate}\n")
                f2.write(f"{uid} {info.channels}ch_{info.samplerate}Hz\n")
'
```
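Since all of these files are keyed by utterance ID, it is worth checking that they stay consistent. The following is a small sketch (the checked file set and the `<subset-name>` placeholder mirror the table above) that verifies every prepared file covers the same set of IDs as wav.scp:

```python
import os

subset_dir = "dump/raw/<subset-name>"  # replace with the actual subset directory

def read_keys(name):
    with open(os.path.join(subset_dir, name)) as f:
        return {line.split(maxsplit=1)[0] for line in f if line.strip()}

ref = read_keys("wav.scp")
for name in ["utt2spk", "utt2fs", "utt2category"]:  # add spk1.scp / text if present
    path = os.path.join(subset_dir, name)
    if not os.path.exists(path):
        print(f"{name}: missing")
        continue
    keys = read_keys(name)
    print(f"{name}: {'OK' if keys == ref else 'MISMATCH'} ({len(keys)} utterances)")
```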
The speech enhancement task espnet2.tasks.enh.EnhancementTask
is a high-level class used for model training as well as for loading pre-trained models.
It generally defines the supported

- model architectures (encoder_choices, separator_choices, decoder_choices)
- loss functions (loss_wrapper_choices, criterion_choices)
  - The wrapper is used mainly for handling the permutation problem in speech separation tasks. For single-speaker tasks, please simply use `wrapper: fixed_order`.
- preprocessing (preprocessor_choices)
  - used for preprocessing each input sample during both training and inference stages, e.g., for data augmentation / dynamic mixing, and normalization

and the arguments used for training.
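For lower-level access than the SeparateSpeech wrapper shown next, the task class can also rebuild a trained model directly from its training config and checkpoint via AbsTask.build_model_from_file. The snippet below is a sketch; `exp/xxx` stands for your actual experiment directory.

```python
from espnet2.tasks.enh import EnhancementTask

# Rebuild the trained ESPnetEnhancementModel from a config + checkpoint pair.
model, train_args = EnhancementTask.build_model_from_file(
    config_file="exp/xxx/config.yaml",
    model_file="exp/xxx/valid.loss.best.pth",
    device="cpu",
)
model.eval()
print(type(model).__name__)
```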
To use a pre-trained speech enhancement model for enhancing audios, you can run the following script:
```python
import soundfile as sf

from espnet2.bin.enh_inference import SeparateSpeech

model = SeparateSpeech(
    train_config="exp/xxx/config.yaml",
    model_file="exp/xxx/valid.loss.best.pth",
    normalize_output_wav=True,
    device="cuda",
)
audio, fs = sf.read("/path/to/noisy/utt1.flac")
enhanced = model(audio[None, :], fs=fs)[0]
```
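To store the result, you can write the output back to disk. Continuing the snippet above, and assuming each element returned by the model is a (batch, num_samples) array:

```python
# enhanced has shape (1, num_samples); take the single batch entry and keep
# the original sampling rate so the written file matches the input.
sf.write("/path/to/enhanced/utt1.flac", enhanced[0], fs)
```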
All speech enhancement models generally share the same template defined in espnet2.enh.espnet_model.ESPnetEnhancementModel
, which consists of three modules: encoder → separator → decoder.
The main enhancement function is achieved by the separator module, which can be of various architectures, e.g., BSRNN, TF-GridNet, and so on.
To add your own model, you can follow the instructions in egs2/TEMPLATE/enh1/README.md.
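As a concrete starting point, the sketch below outlines what a minimal custom separator could look like. It assumes the AbsSeparator interface (a forward returning the processed feature(s), the lengths, and an auxiliary dict, plus a num_spk property); check espnet2/enh/separator/abs_separator.py and an existing separator such as bsrnn_separator.py for the exact signature, and note that a new separator also has to be registered in separator_choices in espnet2/tasks/enh.py.

```python
from collections import OrderedDict

import torch

from espnet2.enh.separator.abs_separator import AbsSeparator


class ToyMaskSeparator(AbsSeparator):
    """Illustrative single-speaker masking separator (sketch only)."""

    def __init__(self, input_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(input_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, input_dim),
            torch.nn.Sigmoid(),
        )

    def forward(self, input, ilens, additional=None):
        # input: (batch, frames, freq) encoder output; may be a complex spectrum.
        mask = self.net(input.abs())   # estimate a magnitude mask
        masked = [input * mask]        # one output stream per speaker
        others = OrderedDict(mask_spk1=mask)
        return masked, ilens, others

    @property
    def num_spk(self):
        return 1
```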
The configuration (YAML) file is defined in conf/tuning/. A typical config file consists of four parts:
1. Basic hyperparameters (1/2)
```yaml
max_epoch: 100             # Max training epochs
batch_type: folded         # See espnet2/samplers/build_batch_sampler.py
batch_size: 4              # Batch size (number of chunks per minibatch when using the chunk iterator)
iterator_type: chunk       # Using chunk-based iterator for data loading
chunk_length: 200          # Each training sample will be segmented into overlapped 4-sec (= 200 / 50) chunks
chunk_default_fs: 50       # Used for automatically scaling the chunk length based on the input sampling rate
num_iters_per_epoch: 8000  # Number of samples per epoch (each sample can generate >1 chunks when using the chunk iterator)
num_workers: 4             # Number of parallel workers for data loading
grad_clip: 5.0             # Gradient clipping threshold
optim: adam                # Optimizer type (See espnet2/tasks/abs_task.py)
optim_conf:                # Optimizer configuration
  lr: 1.0e-03              # See the docstring of torch.optim.Adam
  eps: 1.0e-08
  weight_decay: 1.0e-05
patience: 40               # Patience for early stopping. If the validation performance does not improve for 40 consecutive epochs, the training will be ended.
```
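As a quick sanity check of the chunk settings above, the scaling implied by the comments (chunk_length is defined relative to chunk_default_fs) can be reproduced as follows. The helper is only illustrative, not an ESPnet API:

```python
def scaled_chunk_samples(chunk_length: int, chunk_default_fs: int, fs: int) -> int:
    """Illustrative helper: chunk size in samples for a given sampling rate."""
    return int(chunk_length / chunk_default_fs * fs)

# 200 / 50 = 4-second chunks, independent of the input sampling rate:
print(scaled_chunk_samples(200, 50, 16000))  # 64000 samples
print(scaled_chunk_samples(200, 50, 48000))  # 192000 samples
```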
1. Basic hyperparameters (2/2)
```yaml
val_scheduler_criterion:   # Validation scheduler configuration. Used to collect stats for certain epoch-based optimizers that require the validation performance per epoch.
- valid                    # Name of the phase for stats recording. Can be 'train' or 'valid'.
- loss                     # Name of the stat. Can be any key name in the `stats` dict returned by ESPnetEnhancementModel.forward in espnet2/enh/espnet_model.py.
best_model_criterion:      # Configuration of recording stats for determining the best-performing checkpoint.
- - valid                  # Name 1 of the phase for stats recording. Can be 'train' or 'valid'.
  - loss                   # Name of the stat. Same as above.
  - min                    # The best model 1 is determined by the minimum value of the stat. Can be 'min' or 'max'.
- - valid                  # Name 2 of the phase for stats recording. Can be 'train' or 'valid'.
  - acc                    # Name of the stat. Same as above.
  - max                    # The best model 2 is determined by the maximum value of the stat. Can be 'min' or 'max'.
keep_nbest_models: 1       # Number of best models to keep. The best model is determined by `best_model_criterion`.
scheduler: steplr          # Scheduler type. See scheduler_classes in espnet2/tasks/abs_task.py.
scheduler_conf:            # Scheduler configuration
  step_size: 2             # See the docstring of torch.optim.lr_scheduler.StepLR
  gamma: 0.99
allow_multi_rates: true    # Whether to allow loading audios of different sampling rates (if true, special treatment is required in the preprocessor to make sure data of different sampling rates are grouped in different categories, and thus different minibatches)
```
2. Preprocessor configuration
```yaml
preprocessor: enh            # Preprocessor type. See preprocessor_choices in espnet2/tasks/enh.py.
                             # The 'enh' preprocessor allows automatic processing of different sampling rates based on the utt2fs and utt2category files.
force_single_channel: true   # Whether to force the input audio to be single-channel in the preprocessor.
channel_reordering: true     # Whether to reorder the multiple channels of the input audio in the preprocessor.
categories:                  # List of all possible categories used in dump/raw/*/utt2category files.
                             # This is used by the preprocessor to group samples according to their category.
- 1ch_8000Hz                 # For this challenge, the category format can be simply '1ch_{fs}Hz'.
- 1ch_16000Hz
- 1ch_22050Hz
- 1ch_24000Hz
- 1ch_32000Hz
- 1ch_44100Hz
- 1ch_48000Hz
num_spk: 1                   # Number of speakers in the input audio. Should be 1 for single-speaker tasks.
```
3. Model configuration (1/2)
```yaml
model_conf:                        # Configuration for espnet2/enh/espnet_model.ESPnetEnhancementModel
  normalize_variance_per_ch: true  # Whether to normalize the variance of (each channel of) the input speech
  always_forward_in_48k: false     # Whether to always upsample the input speech to 48 kHz for model processing.
                                   # The model output will be downsampled back to its original sampling rate.
                                   # When processing data of different sampling rates, this must be true for speech enhancement models that only support one sampling rate.
                                   # For sampling-frequency-independent (SFI) models such as BSRNN and TF-GridNet-v3, this can be false for more efficient processing.
  categories:                      # Must have the same value as in the preprocessor config (see last slide).
  - 1ch_8000Hz
  - 1ch_16000Hz
  - 1ch_22050Hz
  - 1ch_24000Hz
  - 1ch_32000Hz
  - 1ch_44100Hz
  - 1ch_48000Hz
```
3. Model configuration (2/2)
```yaml
encoder: stft                # Encoder type. See encoder_choices in espnet2/tasks/enh.py.
encoder_conf:                # Encoder configuration.
  n_fft: 960                 # FFT size for each window.
  hop_length: 480            # Hop size for the sliding window.
  use_builtin_complex: true  # Whether to use PyTorch's builtin complex tensor type.
  default_fs: 48000          # Used for automatically adjusting the STFT window/hop sizes based on the input sampling rate.
                             # This should be used together with frequency-domain SFI models such as BSRNN and TF-GridNet-v3.
decoder: stft                # Decoder type. See decoder_choices in espnet2/tasks/enh.py.
decoder_conf:                # Decoder configuration.
  n_fft: 960                 # Same as above. Usually should have the same config as in encoder_conf.
  hop_length: 480
  default_fs: 48000
separator: bsrnn             # Separator type. See separator_choices in espnet2/tasks/enh.py.
separator_conf:              # Separator configuration.
  num_spk: 1                 # Number of speakers to separate. Should be 1 for single-speaker tasks.
  ...                        # See espnet2/enh/separator/bsrnn_separator.py.
```
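The default_fs mechanism above can be illustrated with a small helper (illustrative only; the actual adjustment happens inside the STFT encoder/decoder): with n_fft=960 and default_fs=48000, the window always spans 20 ms, so a 16 kHz input would be processed with a proportionally smaller FFT size and hop.

```python
def scaled_stft_size(size: int, default_fs: int, fs: int) -> int:
    """Illustrative helper: scale an STFT size to the input sampling rate."""
    return int(size * fs / default_fs)

print(scaled_stft_size(960, 48000, 16000))  # 320-sample window (20 ms)
print(scaled_stft_size(480, 48000, 16000))  # 160-sample hop (10 ms)
```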
4. Training criterion configuration
```yaml
criterions:            # Training criterion configuration.
- name: mr_l1_tfd      # The first criterion. See criterion_choices in espnet2/tasks/enh.py.
  conf:                # Criterion configuration.
    window_sz: [256, 512, 768, 1024]  # See MultiResL1SpecLoss in espnet2/enh/loss/criterions/time_domain.py.
    hop_sz: null
    eps: 1.0e-8
    time_domain_weight: 0.5
    normalize_variance: true
  wrapper: fixed_order # Wrapper type for handling the permutation problem. See loss_wrapper_choices in espnet2/tasks/enh.py.
  wrapper_conf:        # Wrapper configuration.
    weight: 1.0        # Loss weight for the first criterion. Can be used for balancing multiple criteria.
- name: si_snr         # The second criterion.
  conf:                # Criterion configuration.
    ...                # See SISNRLoss in espnet2/enh/loss/criterions/time_domain.py.
```
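To see how a criterion/wrapper pair from this list behaves in isolation, the sketch below wraps an SI-SNR criterion with the fixed-order solver. The constructor and call signatures are assumptions based on espnet2/enh/loss/wrappers/fixed_order.py and espnet2/enh/loss/criterions/time_domain.py, so double-check them there.

```python
import torch

from espnet2.enh.loss.criterions.time_domain import SISNRLoss
from espnet2.enh.loss.wrappers.fixed_order import FixedOrderSolver

# Wrap one criterion with the fixed-order solver (no permutation search),
# mirroring `wrapper: fixed_order` in the YAML above.
wrapper = FixedOrderSolver(criterion=SISNRLoss(), weight=1.0)

ref = [torch.randn(2, 16000)]  # list over speakers, each (batch, num_samples)
inf = [torch.randn(2, 16000)]
loss, stats, others = wrapper(ref, inf)
print(loss, stats)
```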
Training:

- run.sh → enh.sh (stage 6) → espnet2/bin/enh_train.py
  - → espnet2/tasks/enh.py EnhancementTask & espnet2/tasks/abs_task.py AbsTask.main
  - → espnet2/train/trainer.py Trainer.run
    - ESPnetDataset
    - EnhPreprocessor (If used)
    - ChunkIterFactory.build_iter (If used)
  - → espnet2/enh/espnet_model.py ESPnetEnhancementModel.forward
    - STFTEncoder.forward (If used)
    - BSRNNSeparator.forward (If used)
    - STFTDecoder.forward (If used)
    - FixedOrderSolver.forward (If used)
    - SISNRLoss.forward (If used)
Inference:

- run.sh → enh.sh (stage 7) → espnet2/bin/enh_inference.py SeparateSpeech
  - → EnhancementTask & AbsTask.build_model_from_file
  - → EnhancementTask.build_streaming_iterator
    - IterableESPnetDataset
    - EnhPreprocessor (If used)
  - → espnet2/enh/espnet_model.py ESPnetEnhancementModel.forward
    - STFTEncoder.forward (If used)
    - BSRNNSeparator.forward (If used)
    - STFTDecoder.forward (If used)