emonet.data_prep module
Preparing data manifests for dataset creation.
This only needs to be run once, initially. It assumes that you have metadata and wav files within an emonet-data directory in your home folder; wavs should be in ~/emonet-data/wavs, or in ~/emonet-data/vad_wavs if you’ve run the VAD model.
The intent is to create a master dataset manifest, including paths to files, emotion ratings by therapist, sample duration, and other metadata. The master manifest is further broken into train/valid/test manifests for each therapist.
Sets seed for reproducibility.
- emonet.data_prep.get_metadata(files: List[pathlib.Path], labels_fn: pathlib.Path = PosixPath('/Users/christophersantiago/emonet-data/voice_labeling_report/voice_labels.json'), data_dir: pathlib.Path = PosixPath('/Users/christophersantiago/emonet-data')) → Dict
Get metadata from audio files, compile into manifest.
- Parameters
files (List[pathlib.Path]) – List of files to include in manifest.
labels_fn (pathlib.Path) – File containing original voice labels/report.
data_dir (pathlib.Path) – Directory containing all data files.
- Returns
Dict – Dictionary mapping a unique file ID (stem) to respective metadata and labels.
- emonet.data_prep.get_entry(file: pathlib.Path, data_dir=PosixPath('/Users/christophersantiago/emonet-data')) → Dict
Create a dictionary entry based on file metadata.
- Parameters
file (pathlib.Path) – Path to audio file.
data_dir (pathlib.Path) – Directory containing all data files.
- Returns
Dict – Dictionary mapping file stem to file metadata.
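The stem-keyed entry described above can be sketched as follows. This is a minimal illustration only: sketch_entry and the "filepath" and "suffix" fields are hypothetical stand-ins, and the fields that get_entry actually records (for example sample duration) are not shown here.

```python
from pathlib import Path
from typing import Dict


def sketch_entry(file: Path, data_dir: Path) -> Dict:
    """Hypothetical stand-in for get_entry; the real metadata fields may differ."""
    return {
        # The file stem (name without extension) serves as the unique key.
        file.stem: {
            # Storing the path relative to data_dir is an assumed convention.
            "filepath": file.relative_to(data_dir).as_posix(),
            "suffix": file.suffix,
        }
    }


data_dir = Path("emonet-data")
entry = sketch_entry(data_dir / "wavs" / "sample_001.wav", data_dir)
print(entry)
```

Keying on the stem keeps the entry compatible with the manifest returned by get_metadata, which maps a unique file ID (stem) to metadata and labels.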
- emonet.data_prep.get_labels_by_id(filepath: pathlib.Path, therapist: Optional[str] = None) → Dict
Get emotion labels for each file within a metadata manifest.
- Parameters
filepath (pathlib.Path) – Path to metadata file.
therapist (Optional[str]) – Specific therapist to retrieve labels from. Defaults to None, which returns labels from all therapists.
- Returns
Dict – A dictionary of file keys mapped to labels.
- emonet.data_prep.prepare_data(metadata: Union[pathlib.Path, Dict], therapist: Optional[str] = None, split_ratio: Union[List, Tuple] = (80, 10, 10), data_dir: pathlib.Path = PosixPath('/Users/christophersantiago/emonet-data'), return_dict: bool = False) → Optional[Dict]
Prepare data for training/validation/testing.
- Parameters
metadata (Union[pathlib.Path, Dict]) – Dataset metadata.
therapist (Optional[str]) – Optional therapist to filter on.
split_ratio (Union[List, Tuple]) – Train/valid/test split ratio.
data_dir (pathlib.Path) – Path to data directory.
return_dict (bool) – Whether to return the manifest as a dictionary; if False, the output is exported to a .json file instead.
- Returns
Optional[Dict] – Dictionary containing data manifest or None, depending on call args.
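The return-or-export behaviour controlled by return_dict can be sketched as below. This is an assumption-laden illustration, not the module's code: sketch_finish, the out_path argument, and the manifest contents are hypothetical, and the file name and location that prepare_data really writes to are not documented here.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


def sketch_finish(manifest: Dict, out_path: Path, return_dict: bool = False) -> Optional[Dict]:
    """Hypothetical helper: return the manifest, or write it to JSON and return None."""
    if return_dict:
        return manifest
    out_path.write_text(json.dumps(manifest, indent=2))
    return None


manifest = {"train": {"sample_001": {"duration": 3.2}}}
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "manifest.json"  # assumed output name
    returned = sketch_finish(manifest, out_file, return_dict=True)
    written = sketch_finish(manifest, out_file, return_dict=False)
    reloaded = json.loads(out_file.read_text())
```

With return_dict=True nothing touches the disk, which is convenient for tests; the default path leaves a file that later training runs can load.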
- emonet.data_prep.filter_bad_files(meta: Dict) → Dict
Drop bad files from manifest.
- Parameters
meta (Dict) – Dataset metadata.
- Returns
Dict – Dictionary of metadata with bad files removed.
- emonet.data_prep.filter_meta(files: List, meta: Dict) → Dict
Remove files from metadata manifest not included in dataset.
- Parameters
files (List) – List of files included in dataset.
meta (Dict) – Dictionary of dataset metadata.
- Returns
Dict – Dictionary containing metadata of only audio in files arg.
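A minimal sketch of this kind of filtering, assuming the manifest is keyed by file stem (as the get_metadata return value suggests). sketch_filter_meta and the sample data are hypothetical and are not the module's implementation.

```python
from pathlib import Path
from typing import Dict, List


def sketch_filter_meta(files: List[Path], meta: Dict) -> Dict:
    """Keep only manifest entries whose key matches the stem of a listed file."""
    keep = {Path(f).stem for f in files}
    return {key: value for key, value in meta.items() if key in keep}


meta = {
    "sample_001": {"duration": 3.2},
    "sample_002": {"duration": 4.1},
    "sample_003": {"duration": 2.7},
}
files = [Path("wavs/sample_001.wav"), Path("wavs/sample_003.wav")]
filtered = sketch_filter_meta(files, meta)
print(sorted(filtered))
```

Building a set of stems first keeps the lookup cheap even when the manifest is large.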
- emonet.data_prep.split_sets(meta: Dict, split_ratio: Union[List, Tuple] = (80, 10, 10), data_dir: pathlib.Path = PosixPath('/Users/christophersantiago/emonet-data')) → Dict
Split data manifest into train/valid/test splits.
- Parameters
meta (Dict) – Master manifest.
split_ratio (Union[List, Tuple]) – Train/valid/test split percentages; default (80, 10, 10).
data_dir (pathlib.Path) – Directory containing all data files.
- Returns
Dict – Dictionary containing train/valid/test dataset metadata.
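One way to picture a seeded percentage split of a manifest is sketched below. It is illustrative only: sketch_split, its seed argument, and the "train"/"valid"/"test" key names are assumptions, the data_dir argument of split_sets is omitted, and the module's actual shuffling and rounding rules may differ.

```python
import random
from typing import Dict, Sequence


def sketch_split(meta: Dict, split_ratio: Sequence[int] = (80, 10, 10), seed: int = 0) -> Dict[str, Dict]:
    """Shuffle manifest keys with a fixed seed, then cut them by percentage."""
    if len(split_ratio) != 3 or sum(split_ratio) != 100:
        raise ValueError("split_ratio must hold three percentages summing to 100")
    keys = sorted(meta)  # sort first so the result does not depend on dict order
    random.Random(seed).shuffle(keys)
    n_train = len(keys) * split_ratio[0] // 100
    n_valid = len(keys) * split_ratio[1] // 100
    parts = {
        "train": keys[:n_train],
        "valid": keys[n_train:n_train + n_valid],
        "test": keys[n_train + n_valid:],  # remainder, so no entry is dropped
    }
    return {name: {k: meta[k] for k in ks} for name, ks in parts.items()}


meta = {f"clip_{i:02d}": {"duration": float(i)} for i in range(20)}
splits = sketch_split(meta)
print({name: len(part) for name, part in splits.items()})
```

Giving the test set whatever remains after the train and valid cuts guarantees the three splits cover the whole manifest even when the percentages do not divide the item count evenly.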
- emonet.data_prep.check_labels(labels: Dict) → None
Check files for number of labels/ratings.
- emonet.data_prep.get_therapist_metadata(meta: Dict, therapist: str) → Dict
Get therapist-specific metadata.
- Parameters
meta (Dict) – Dataset metadata.
therapist (str) – Therapist to filter data on.
- Returns
Dict – Dictionary of therapist-specific metadata.
- emonet.data_prep.to_records(meta: Dict, therapists: List[str] = ['Michelle Lyn', 'Pegah Moghaddam', 'Sedara Burson', 'Yared Alemu']) → List[Dict]
Reformat metadata into a list of records.
- Parameters
meta (Dict) – Dataset metadata.
therapists (List[str]) – List of therapists.
- Returns
List[Dict] – List of metadata records.
- emonet.data_prep.filter_splits(therapist: str) → None
Filter existing train/valid/test manifests within therapist folders.
- emonet.data_prep.get_avg_score(item: Dict, emotion: str) → float
Get average emotional severity across therapists.
Used for regression task.
- Parameters
item (Dict)
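Averaging one emotion's severity across therapists could look roughly like the sketch below. The layout of item is not documented here, so the nested "ratings" mapping, the therapist keys "A", "B", "C", and the function name sketch_avg_score are all hypothetical.

```python
from statistics import mean
from typing import Dict


def sketch_avg_score(item: Dict, emotion: str) -> float:
    """Average one emotion's rating over every therapist who rated it."""
    # Assumed layout: item["ratings"][therapist][emotion] -> numeric severity.
    scores = [r[emotion] for r in item["ratings"].values() if emotion in r]
    if not scores:
        raise ValueError(f"no ratings found for emotion {emotion!r}")
    return float(mean(scores))


item = {
    "ratings": {
        "A": {"anger": 1, "sadness": 2},
        "B": {"anger": 3},
        "C": {"sadness": 4},
    }
}
avg_anger = sketch_avg_score(item, "anger")
print(avg_anger)
```

Therapists who did not rate the requested emotion are skipped rather than counted as zero, so a missing rating does not drag the regression target down.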
- emonet.data_prep.get_metadata_from_files()
Iterate through all wav files to create metadata manifest.
- emonet.data_prep.make_therapist_train_valid_test_sets(remove_splits: bool = False)
Make train/valid/test sets per therapist.
- emonet.data_prep.make_main_train_valid_test_sets() → None
Make train/valid/test sets for full wav files.