emonet.data_prep module

Prepare data manifests for dataset creation.

This only needs to be run once, at initial setup. It assumes that you have metadata and wav files within an emonet-data directory in your home folder; wavs should be in ~/emonet-data/wavs, or in ~/emonet-data/vad_wavs if you’ve run the VAD model.

The intent is to create a master dataset manifest that includes file paths, per-therapist emotion ratings, sample duration, and other metadata. The master manifest is then split into train/valid/test manifests for each therapist.

Sets a random seed for reproducibility.
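The directory layout described above can be resolved along the following lines. This is a minimal sketch, not the module's own code: the helper name find_wav_dir is hypothetical, and preferring vad_wavs over wavs when both exist is an assumption.

```python
from pathlib import Path


def find_wav_dir(data_dir: Path) -> Path:
    """Return the directory expected to hold wav files.

    Assumption: VAD-processed audio in ``vad_wavs`` takes priority
    over the raw audio in ``wavs`` when the former exists.
    """
    vad_dir = data_dir / "vad_wavs"
    if vad_dir.is_dir():
        return vad_dir
    return data_dir / "wavs"


# Default location named in the module description.
data_dir = Path.home() / "emonet-data"
wav_dir = find_wav_dir(data_dir)
```

Resolving the data directory from Path.home() at call time keeps the location portable across machines.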

emonet.data_prep.get_metadata(files: List[pathlib.Path], labels_fn: pathlib.Path = PosixPath('/Users/christophersantiago/emonet-data/voice_labeling_report/voice_labels.json'), data_dir: pathlib.Path = PosixPath('/Users/christophersantiago/emonet-data')) → Dict

Get metadata from audio files and compile it into a manifest.

Parameters
  • files (List[pathlib.Path]) – List of files to include in manifest.

  • labels_fn (pathlib.Path) – File containing original voice labels/report.

  • data_dir (pathlib.Path) – Directory containing all data files.

Returns

Dict – Dictionary mapping a unique file ID (stem) to respective metadata and labels.

emonet.data_prep.get_entry(file: pathlib.Path, data_dir=PosixPath('/Users/christophersantiago/emonet-data')) → Dict

Create a dictionary entry based on file metadata.

Parameters
  • file (pathlib.Path) – Path to audio file.

  • data_dir (pathlib.Path) – Directory containing all data files.

Returns

Dict – Dictionary mapping file stem to file metadata.
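As an illustration of a stem-keyed entry, the sketch below reads basic properties of a wav file with the standard-library wave module. The function name sketch_entry and the field names (filepath, duration, sample_rate) are assumptions chosen for the example; the fields that get_entry actually records are not documented here.

```python
import wave
from pathlib import Path
from typing import Dict


def sketch_entry(file: Path, data_dir: Path) -> Dict:
    """Hypothetical stand-in for get_entry; field names are assumed."""
    with wave.open(str(file), "rb") as wav:
        n_frames = wav.getnframes()
        sample_rate = wav.getframerate()
    return {
        # The file stem serves as the unique ID, as in the manifest.
        file.stem: {
            # Store the path relative to the data directory.
            "filepath": str(file.relative_to(data_dir)),
            # Duration in seconds = frame count / frames per second.
            "duration": n_frames / sample_rate,
            "sample_rate": sample_rate,
        }
    }
```

Storing the path relative to data_dir is a design choice of this sketch; it keeps a manifest usable if the data directory is moved.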

emonet.data_prep.get_labels_by_id(filepath: pathlib.Path, therapist: Optional[str] = None) → Dict

Get emotion labels for each file within a metadata manifest.

Parameters
  • filepath (pathlib.Path) – Path to metadata file.

  • therapist (Optional[str]) – Specific therapist to retrieve labels from. Defaults to None, which returns labels from all therapists.

Returns

Dict – A dictionary of file keys mapped to labels.
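The optional therapist filter might behave roughly as follows. This sketch assumes a JSON manifest in which each entry holds a "labels" mapping from therapist name to ratings; that layout, the function name sketch_labels_by_id, and the choice to skip files the therapist did not rate are all assumptions, not the documented format.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def sketch_labels_by_id(filepath: Path, therapist: Optional[str] = None) -> Dict:
    """Hypothetical stand-in for get_labels_by_id (manifest layout assumed)."""
    manifest = json.loads(Path(filepath).read_text())
    labels = {}
    for file_id, entry in manifest.items():
        ratings = entry["labels"]  # assumed key: {therapist: {emotion: score}}
        if therapist is None:
            # No filter: keep every therapist's ratings for this file.
            labels[file_id] = ratings
        elif therapist in ratings:
            # Filter: keep only the requested therapist's ratings.
            labels[file_id] = ratings[therapist]
    return labels
```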

emonet.data_prep.prepare_data(metadata: Union[pathlib.Path, Dict], therapist: Optional[str] = None, split_ratio: Union[List, Tuple] = (80, 10, 10), data_dir: pathlib.Path = PosixPath('/Users/christophersantiago/emonet-data'), return_dict: bool = False) → Optional[Dict]

Prepare data for training/validation/testing.

Parameters
  • metadata (Union[pathlib.Path, Dict]) – Dataset metadata.

  • therapist (Optional[str]) – Optional therapist to filter on.

  • split_ratio (Union[List, Tuple]) – Train/valid/test split ratio.

  • data_dir (pathlib.Path) – Path to data directory.

  • return_dict (bool) – Whether to return the manifest as a dictionary; if False, the output is exported to a .json file instead.

Returns

Optional[Dict] – Dictionary containing data manifest or None, depending on call args.

emonet.data_prep.filter_bad_files(meta: Dict) → Dict

Drop bad files from manifest.

Parameters

meta (Dict) – Dataset metadata.

Returns

Dict – Dictionary of metadata with bad files removed.

emonet.data_prep.filter_meta(files: List, meta: Dict) → Dict

Remove files that are not included in the dataset from the metadata manifest.

Parameters
  • files (List) – List of files included in dataset.

  • meta (Dict) – Dictionary of dataset metadata.

Returns

Dict – Dictionary containing metadata for only the audio files listed in files.
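Since the manifest is keyed by file stem, this kind of filtering can be sketched as a dictionary comprehension. The function name sketch_filter_meta is hypothetical, and matching on stems alone (ignoring directories and extensions) is an assumption.

```python
from pathlib import Path
from typing import Dict, List


def sketch_filter_meta(files: List, meta: Dict) -> Dict:
    """Hypothetical stand-in for filter_meta; matches entries by stem."""
    # Collect the stems of all files that belong to the dataset.
    keep = {Path(f).stem for f in files}
    # Keep only manifest entries whose key is one of those stems.
    return {file_id: entry for file_id, entry in meta.items() if file_id in keep}
```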

emonet.data_prep.split_sets(meta: Dict, split_ratio: Union[List, Tuple] = (80, 10, 10), data_dir: pathlib.Path = PosixPath('/Users/christophersantiago/emonet-data')) → Dict

Split data manifest into train/valid/test splits.

Parameters
  • meta (Dict) – Master manifest.

  • split_ratio (Union[List, Tuple]) – Train/valid/test split percentages; default (80, 10, 10).

  • data_dir (pathlib.Path) – Directory containing all data files.

Returns

Dict – Dictionary containing train/valid/test dataset metadata.
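A percentage-based split with a fixed seed could look like the sketch below. It is not the module's implementation: the function name sketch_split, the seed parameter, the shuffle-then-slice strategy, and the "train"/"valid"/"test" keys are assumptions, and writing the splits to data_dir is omitted.

```python
import random
from typing import Dict, Sequence


def sketch_split(meta: Dict, split_ratio: Sequence[int] = (80, 10, 10), seed: int = 0) -> Dict:
    """Hypothetical stand-in for split_sets (in-memory only)."""
    if len(split_ratio) != 3 or sum(split_ratio) != 100:
        raise ValueError("split_ratio must hold three percentages summing to 100")
    # Sort first so the result depends only on the seed, not on dict order.
    keys = sorted(meta)
    random.Random(seed).shuffle(keys)
    n_train = len(keys) * split_ratio[0] // 100
    n_valid = len(keys) * split_ratio[1] // 100
    parts = {
        "train": keys[:n_train],
        "valid": keys[n_train:n_train + n_valid],
        # The test split takes the remainder, so no file is dropped.
        "test": keys[n_train + n_valid:],
    }
    return {name: {k: meta[k] for k in ids} for name, ids in parts.items()}
```

Using a dedicated random.Random(seed) instance rather than the global generator keeps the split reproducible regardless of other random calls in the program.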

emonet.data_prep.check_therapist(therapist: str) → None

Check that a valid therapist name was passed.

emonet.data_prep.check_labels(labels: Dict) → None

Check files for number of labels/ratings.

emonet.data_prep.get_therapist_metadata(meta: Dict, therapist: str) → Dict

Get therapist-specific metadata.

Parameters
  • meta (Dict) – Dataset metadata.

  • therapist (str) – Therapist to filter data on.

Returns

Dict – Dictionary of therapist-specific metadata.

emonet.data_prep.to_records(meta: Dict, therapists: List[str] = ['Michelle Lyn', 'Pegah Moghaddam', 'Sedara Burson', 'Yared Alemu']) → List[Dict]

Reformat metadata into a list of records.

Parameters
  • meta (Dict) – Dataset metadata.

  • therapists (List[str]) – List of therapists.

Returns

List[Dict] – List of metadata records.
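One way to flatten a stem-keyed manifest into records is sketched below. The record layout (an "id" field plus one column per therapist), the "labels" and "duration" keys, and the function name sketch_to_records are assumptions made for illustration; the real record schema is not documented here.

```python
from typing import Dict, List


def sketch_to_records(meta: Dict, therapists: List[str]) -> List[Dict]:
    """Hypothetical stand-in for to_records (record layout assumed)."""
    records = []
    for file_id, entry in meta.items():
        # Move the dictionary key into the record itself.
        record = {"id": file_id, "duration": entry.get("duration")}
        for name in therapists:
            # One column per therapist; None when that therapist gave no rating.
            record[name] = entry.get("labels", {}).get(name)
        records.append(record)
    return records
```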

emonet.data_prep.filter_splits(therapist: str) → None

Filter existing train/valid/test manifests within therapist folders.

emonet.data_prep.get_avg_score(item: Dict, emotion: str) → float

Get average emotional severity across therapists.

Used for regression task.

Parameters
  • item (Dict)

  • emotion (str) – Emotion whose severity is averaged.
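Averaging one emotion's severity across therapists can be sketched as follows. The item layout (a "labels" mapping from therapist name to per-emotion scores), the function name sketch_avg_score, and the choice to skip therapists who did not rate the emotion are assumptions, not the documented behavior.

```python
from statistics import mean
from typing import Dict


def sketch_avg_score(item: Dict, emotion: str) -> float:
    """Hypothetical stand-in for get_avg_score (item layout assumed)."""
    # Gather this emotion's score from every therapist who rated it.
    scores = [
        ratings[emotion]
        for ratings in item["labels"].values()
        if emotion in ratings
    ]
    if not scores:
        raise ValueError(f"no ratings found for emotion {emotion!r}")
    return float(mean(scores))
```

A continuous average like this suits a regression target, in contrast to the discrete per-therapist ratings.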

emonet.data_prep.get_metadata_from_files()

Iterate through all wav files to create metadata manifest.

emonet.data_prep.make_therapist_train_valid_test_sets(remove_splits: bool = False)

Make train/valid/test sets per therapist.

emonet.data_prep.add_avg_scores_to_meta()

Add average scores data to metadata manifest.

emonet.data_prep.make_main_train_valid_test_sets() → None

Make train/valid/test sets for full wav files.

emonet.data_prep.make_therapist_splits_train_valid_test_sets()

Use the per-therapist splits to make aggregate splits.

emonet.data_prep.main()

Run data preparation.