2022 Short Course on the Application of Machine Learning for Automated Quantification of Behavior ################################################################################################################################## Day 4 Workshop - Training Classifiers to Classify Animal Behavior =========================================================================================== Purpose/Expectations ==================== The contents of this course will grant insight into important characteristics while trainig behavioral classifiers. The usage of JABS code in this tutorial is strictly limited to reading in project annotations and feature vectors. Within this tutorial, you will be provided with 2 example annotated projects that contain sparse labels that will be used for training and dense labels that you will use for final evaluation of your best model. Expected knowledge before coming doing the course * Familiarity with animal pose data and frame-level features derived from that data * Basic python experience * Some numpy experience * Familiarity with generating plots in python (preference to using plotnine, which is ggplot-style syntax) Expected takeaways of the course * Learn the core process to train a frame-wise predictor of behavior * Become familiar with key components to take into account when training these types of classifiers * Gain insight for some of the design decision considerations our group integrated into the JABS software This is a static web-version of the ipython notebook found `here `_. Step 0 ------ Import a lot of libraries that we will be using in this tutorial .. code-block:: console import numpy as np import pandas as pd from itertools import chain import sklearn # JABS install directory import sys sys.path.append('JABS-behavior-classifier/') from src.project import Project from src.classifier import Classifier # Plotting library import plotnine as p9 %matplotlib notebook Step 1 ------ Import the project annotations that we are providing These annotations were created using our JABS software, so we can use that library to extract the annotations in python .. code-block:: console project = Project('JABS-Training-Data/') # We're picking the first behavior in the project. Each JABS project can contain multiple behavior annotations. behavior = project.load_metadata()['behaviors'][0] # Here we read in the all important information for all labels training_data = project.get_labeled_features(behavior, window_size=5, use_social_features=True) # Since this is more of an internal format, we can separate out the data to be more easily accessible all_labels = training_data[0]['labels'] annotation_animal = training_data[0]['groups'] annotation_animal_names = [training_data[1][x]['video'] + ' animal ' + str(training_data[1][x]['identity']) for x in annotation_animal] feature_names = training_data[0]['column_names'] all_features = Classifier().combine_data(training_data[0]['per_frame'],training_data[0]['window']) Also extract some bout-level groups, something that JABS doesn’t normally do .. code-block:: console raw_annotations = [project.load_video_labels(training_data[1][x]['video']).as_dict()['labels'][str(training_data[1][x]['identity'])][behavior] for x in np.unique(annotation_animal)] num_animals_annotated = len(raw_annotations) bout_data = [[np.repeat(animal_idx+bout_idx*num_animals_annotated, bout['end']-bout['start']+1) for bout_idx, bout in enumerate(animal_data)] for animal_idx, animal_data in enumerate(raw_annotations)] bout_data = np.concatenate(list(chain.from_iterable(bout_data))) At this point we have extracted feature data and annotations from a JABS project. Here’s a description of important variables: * ``all_features`` contains the matrix of all the features. The first dimension is the frame index and the second dimension is the feature index. * ``all_labels`` contains a vector of labels. 0 = not behavior, 1 = behavior * ``annotation_animal`` contains a vector with a unique number for each animal * ``bout_data`` contains a vector with a unique bout number for each bout annotated \* ``feature_names`` contains the name of each feature. Each value here is the name of the 2nd dimension of the feature vector. Step 2 ------ Read in the held-out test set This set is distinct from the validation dataset we’ll be splitting in 2 key factors: 1. This set will be held out and should not be used in tuning model parameters. 2. It is also densely annotated (all frames contain a label), rather than the sparse annotation provided for training. .. code-block:: console test_project = Project('JABS-Test-Data/') test_data = test_project.get_labeled_features(behavior, window_size=5, use_social_features=True) test_labels = test_data[0]['labels'] test_animals = test_data[0]['groups'] test_features = Classifier().combine_data(test_data[0]['per_frame'],test_data[0]['window']) Step 3 ------ Inspect characteristics of the dataset Since the annotations are located in ``all_labels``, we should inspect some characteristics of it Question 1 ~~~~~~~~~~ How many annotations do we have? .. code-block:: console print(len(all_labels)) .. parsed-literal:: 3407 Question 2 ~~~~~~~~~~ How many labels do we have of each class: 0 (not-behavior) and 1 (behavior) .. code-block:: console num_not_behavior = sum(all_labels==0) print(num_not_behavior) num_behavior = sum(all_labels==1) print(num_behavior) .. parsed-literal:: 2294 1113 Question 3-4 ~~~~~~~~~~~~ How many bouts were annotated? How many animals were annotated? .. code-block:: console num_bouts = len(np.unique(bout_data)) print(num_bouts) num_animals = len(np.unique(annotation_animal)) print(num_animals) .. parsed-literal:: 235 21 Discussion 1 ------------ What are characteristics of the dataset that may be important about creating a good model? Hypotheticals to think about: * If you have a rare behavior, is it okay to label a lot more “not behavior”? What might the model try and do to improve accuracy if that ratio becomes 1000:1? * If one animal tends to express the behavior more than another, is it okay to provide more labels on from that individual? Discussion on the balancing of data Rules of thumb: * Balanced annotations tend to create better classifiers. The higher the degree of imbalance, the more likely the classifier will just cheat and learn to predict one state over another. Generally, you should aim to not have more than a 2:1 ratio, but performance only particularly degrades when worse than a 5:1 or 10:1 ratio. Re-sampling can address this. * More bouts is generally better, because within-bout annotations have high temporal correlation and are therefore less informative for making decisions that generalize in favor of making decisions specific to that bout. * More animals is generally better, because some animals may express behavior in a more unique style. Having more animals enables better generalization. We’ve observed in the past that limiting to specific strains of animals in training hurts generalization to different strains. This keys in on adequately sampling from the population variation of the experiments you plan on running inferences on - include at least some of each genotype. Extra visualizations of the dataset ----------------------------------- .. code-block:: console plotting_df = pd.DataFrame({'annotations':all_labels, 'bout':bout_data, 'animal':annotation_animal, 'annotation_idx':np.arange(len(all_labels))}) # Histogram of the class labels plot_labels = p9.ggplot() + \ p9.geom_histogram(mapping=p9.aes(x='factor(annotations)'), data=plotting_df, bins=2, fill='#9e9e9e', color='#000000', size=2) + \ p9.labs(title='Class Labels', x='Class', y='Count') + \ p9.scale_x_discrete(labels=['Not-Behavior',behavior]) + \ p9.theme_bw() plot_labels.draw().show() # Class labels per-animal plot_animals = p9.ggplot() + \ p9.geom_histogram(mapping=p9.aes(x='factor(annotations)'), data=plotting_df, bins=2, fill='#9e9e9e', color='#000000', size=2) + \ p9.labs(title='Class Labels by Animal', x='Class', y='Count') + \ p9.scale_x_discrete(labels=['Not-Behavior',behavior]) + \ p9.facet_wrap('animal') + \ p9.theme_bw() plot_animals.draw().show() .. parsed-literal:: Example Output: .. raw:: html .. raw:: html Step 4 ------ Splitting the training data into train and validation We will be using the training portion of the data to allow the algorithm to learn the best parameters to make predictions The validation will be held out to evaluate performance of the classifier Here we define a handful of functions that accepts the features and labels and returns the split data .. code-block:: console # Defining functions to split the data # Here we're going to manually shuffle and split # Naive approach 1 - random sorted split (deterministic) def split_data(features, labels, percent_train=0.75): available_examples = np.arange(len(labels)) train_idxs = available_examples[:int(len(labels)*percent_train)] test_idxs = available_examples[int(len(labels)*percent_train):] # Separate out the training data train_features = features[train_idxs] train_labels = labels[train_idxs] # Separate out the validation data valid_features = features[test_idxs] valid_labels = labels[test_idxs] return train_features, train_labels, valid_features, valid_labels .. code-block:: console # Naive approach 2 - random shuffle def random_split_data(features, labels, percent_train=0.75): available_examples = np.arange(len(labels)) np.random.shuffle(available_examples) train_idxs = available_examples[:int(len(labels)*percent_train)] test_idxs = available_examples[int(len(labels)*percent_train):] # Separate out the training data train_features = features[train_idxs] train_labels = labels[train_idxs] # Separate out the validation data valid_features = features[test_idxs] valid_labels = labels[test_idxs] return train_features, train_labels, valid_features, valid_labels .. code-block:: console # Using sklearn to split the data # Stratified splitting: # Attempts to preserve the train/valid class representations def sklean_stratified_split(features, labels, percent_train=0.75): train_idxs, test_idxs = list(sklearn.model_selection.StratifiedShuffleSplit(n_splits=1, train_size=percent_train).split(features, labels))[0] # Separate out the training data train_features = features[train_idxs] train_labels = labels[train_idxs] # Separate out the validation data valid_features = features[test_idxs] valid_labels = labels[test_idxs] return train_features, train_labels, valid_features, valid_labels .. code-block:: console # Group splitting # Leaves one group out within the split # Note that this one group that is left out is not guaranteed to contain both labels, so sometime performance will contain things like division by 0. def sklearn_logo_split(features, labels, groups, percent_train=0.75): train_idxs, test_idxs = list(sklearn.model_selection.LeaveOneGroupOut().split(features, labels, groups))[0] # Separate out the training data train_features = features[train_idxs] train_labels = labels[train_idxs] # Separate out the validation data valid_features = features[test_idxs] valid_labels = labels[test_idxs] return train_features, train_labels, valid_features, valid_labels .. code-block:: console # Group splitting #2 # Leaves multiple groups out with a target percent annotations that fall into the training set def sklearn_group_split(features, labels, groups, percent_train=0.75): train_idxs, test_idxs = list(sklearn.model_selection.GroupShuffleSplit(n_splits=1, train_size=percent_train).split(features, labels, groups))[0] # Separate out the training data train_features = features[train_idxs] train_labels = labels[train_idxs] # Separate out the validation data valid_features = features[test_idxs] valid_labels = labels[test_idxs] return train_features, train_labels, valid_features, valid_labels Experiment 1 ------------ Inspect/Visualize the effects of different splits of the annotations .. code-block:: console train_features, train_labels, valid_features, valid_labels = sklearn_group_split(all_features, all_labels, annotation_animal) train_df = pd.DataFrame({'state':'train', 'label':train_labels}) valid_df = pd.DataFrame({'state':'valid', 'label':valid_labels}) plot_df = pd.concat([train_df, valid_df]) split_plot = p9.ggplot(plot_df) + \ p9.geom_bar(p9.aes(x='state', fill='factor(label)')) + \ p9.theme_bw() split_plot.draw().show() .. parsed-literal:: Example Output: .. raw:: html Label Balancing --------------- Since we’re aware that we have roughly a 3x more annotations for not-behavior, we should probably consider balancing the labels We can take multiple approaches to creating a more balanced dataset: 1. Downsampling: Discard data from the class which has the most annotation until the labels are balanced 2. Upsampling: Duplicate data from the class which has the least annotation until the labels are balanced .. code-block:: console # Downsamples the featuers, labels, and groups such that the labels are balanced def downsample_balanced(features, labels, groups=None): _, label_counts = np.unique(labels, return_counts=True) smallest_class_count = np.min(label_counts) class_0_idxs = np.where(labels==0)[0] class_1_idxs = np.where(labels==1)[0] class_0_idxs = np.random.choice(class_0_idxs, smallest_class_count, replace=False) class_1_idxs = np.random.choice(class_1_idxs, smallest_class_count, replace=False) selected_samples = np.sort(np.concatenate([class_0_idxs, class_1_idxs])) new_features = features[selected_samples,:] new_labels = labels[selected_samples] if groups is not None: new_groups = groups[selected_samples] return new_features, new_labels, new_groups else: return new_features, new_labels, None Question 5 ~~~~~~~~~~ Write a function to upsample the data .. code-block:: console # Downsamples the featuers, labels, and groups such that the labels are balanced def upsample_balanced(features, labels, groups=None): _, label_counts = np.unique(labels, return_counts=True) largest_class_count = np.max(label_counts) class_0_idxs = np.where(labels==0)[0] class_1_idxs = np.where(labels==1)[0] class_0_idxs = np.random.choice(class_0_idxs, largest_class_count, replace=True) class_1_idxs = np.random.choice(class_1_idxs, largest_class_count, replace=True) selected_samples = np.sort(np.concatenate([class_0_idxs, class_1_idxs])) new_features = features[selected_samples,:] new_labels = labels[selected_samples] if groups is not None: new_groups = groups[selected_samples] return new_features, new_labels, new_groups else: return new_features, new_labels, None Experiment 2 ------------ Inspect/Visualize the effects of downsampling the annotations .. code-block:: console balanced_features, balanced_labels, balanced_groups = downsample_balanced(all_features, all_labels, annotation_animal) train_features, train_labels, valid_features, valid_labels = sklearn_group_split(balanced_features, balanced_labels, balanced_groups) train_df = pd.DataFrame({'state':'train', 'label':train_labels}) valid_df = pd.DataFrame({'state':'valid', 'label':valid_labels}) plot_df = pd.concat([train_df, valid_df]) split_plot = p9.ggplot(plot_df) + \ p9.geom_bar(p9.aes(x='state', fill='factor(label)')) + \ p9.theme_bw() split_plot.draw().show() .. parsed-literal:: Example Output: .. raw:: html Step 5 ------ Train classifiers For the purposes of this tutorial, we will be relying on using SKLearn to training classifiers. Good recommended classifiers: 1. `Adaboost `__ 2. `Random Forest `__ .. code-block:: console # Define how to train a decision tree classifier def train_dt_classifier(train_features, train_labels): # Create a classifier object # Note here we are using a bunch of default values that SKLearn has picked for us. # Check the documentation of the various parameters you can adjust for the decision trees: # https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn-tree-decisiontreeclassifier classifier = sklearn.tree.DecisionTreeClassifier() # Train the classifier on our training features + labels classifier = classifier.fit(train_features, train_labels) return classifier Question 6 ~~~~~~~~~~ Write functions to train an adaboost and random forest classifier .. code-block:: console # Adaboost Classifier def train_ada_classifier(train_features, train_labels): classifier = sklearn.ensemble.AdaBoostClassifier() classifier = classifier.fit(train_features, train_labels) return classifier .. code-block:: console # Random forest Classifier def train_rf_classifier(train_features, train_labels): classifier = sklearn.ensemble.RandomForestClassifier() classifier = classifier.fit(train_features, train_labels) return classifier Step 6 ------ Frame-wise evaluation of the classifier Since our classifier predicts for every frame, we can estimate its performance by comparing the held-out validation data with the predictions it is making. For evaluating classifier performance, we must first calculate the 4 values of a confusion matrix: 1. True positives (TP): When both the classifier and the ground truth assign “behavior” 2. False negatives (FN): When the classifier predicts “not behavior” and the ground truth assigns “behavior” 3. False positives (FP): When the classifier predicts “behavior” and the ground truth assigns “not behavior” 4. True negatives (TN): When both the classifier and the ground truth assign “not behavior” From these, we can calculate 4 additional useful performance metrics for describing the classifier. Wikipedia has a `good article `__ outlining these as well as more, but these are the 4 most common. 1. Accuracy: The total number of predictions that were correct. 2. Precision: Of the predictions for “behavior”, how many were correct? 3. Recall: Of the ground truths labeled “behavior”, how many did the classifier predict? 4. F-beta (F1, F-score): Harmonic mean of precision and recall Question 7 ~~~~~~~~~~ Complete the definitions of the confusion matrix and the performance metrics below .. code-block:: console # Routine for evaluating the classifier performance def eval_classifier_performance(classifier, features, labels, print_results=True): # Predict on held-out set predictions = classifier.predict(features) # Accuracy # Students will write these acc = np.mean(predictions == labels) # Confusion matrix values for a binary classifier tp_count = np.sum(np.logical_and(predictions==1, labels==1)) tn_count = np.sum(np.logical_and(predictions==0, labels==0)) fp_count = np.sum(np.logical_and(predictions==1, labels==0)) fn_count = np.sum(np.logical_and(predictions==0, labels==1)) # Calculate other metrics # Of the predictions we made, how many were correct precision = tp_count/(tp_count + fp_count) # Of the things we were looking for, how many did we find recall = tp_count/(tp_count + fn_count) # Harmonic mean of Pr and Re f1 = 2*(precision * recall)/(precision + recall) if print_results: print('Accuracy: ' + str(acc)) print('Precision: ' + str(precision)) print('Recall: ' + str(recall)) print('F1-score: ' + str(f1)) # We can also show that sklearn has tools for providing these metrics # Note that we've calculated the "precision" and "recall" for the behavior (value == 1) # print(sklearn.metrics.classification_report(labels, predictions)) # Manually obtaining each # sklearn.metrics.precision_score(labels, predictions) # sklearn.metrics.recall_score(labels, predictions) # sklearn.metrics.f1_score(labels, predictions) return acc, precision, recall, f1 Step 7 ------ Run through the steps: 1. Split the training data into train/valid 2. Train a classifier on the train portion of the split 3. Evaluate the classifier .. code-block:: console # Step 1: Split the data into train/valid train_features, train_labels, valid_features, valid_labels = random_split_data(all_features, all_labels) # Train a classifier dt_classifier = train_dt_classifier(train_features, train_labels) # Evaluate the classifier eval_classifier_performance(dt_classifier, valid_features, valid_labels) # Repeat these steps to test how different parameters influence performance .. parsed-literal:: Accuracy: 0.9647887323943662 Precision: 0.9651567944250871 Recall: 0.9326599326599326 F1-score: 0.9486301369863013 Experiment 3 ------------ See performance of different classifiers and sampling strategies .. code-block:: console # Example with extra step for balanced labels balanced_features, balanced_labels, balanced_groups = downsample_balanced(all_features, all_labels, annotation_animal) train_features, train_labels, valid_features, valid_labels = sklearn_group_split(balanced_features, balanced_labels, balanced_groups) dt_classifier = train_dt_classifier(train_features, train_labels) eval_classifier_performance(dt_classifier, valid_features, valid_labels) .. parsed-literal:: Accuracy: 0.7521663778162911 Precision: 0.7605633802816901 Recall: 0.7422680412371134 F1-score: 0.7513043478260869 Experiment 4 ------------ Expand the search of parameters by also adjusting the parameters of the model, eg the number of trees in the random forest. We provide an example structure for looping through 10 training splits for the decision tree classifier. Suggested tests: 1. Change the classifier type 2. Change the train/valid split approach 3. Include balanced data 4. Test various parameters of the classifier .. code-block:: console all_results = [] # random splits for test_model in np.arange(10): train_features, train_labels, valid_features, valid_labels = random_split_data(all_features, all_labels) dt_classifier = train_dt_classifier(train_features, train_labels) acc, pr, re, f1 = eval_classifier_performance(dt_classifier, valid_features, valid_labels, print_results=False) all_results.append(pd.DataFrame({'classifier':['decision tree'], 'split_method':['random'], 'accuracy':[acc], 'precision':[pr], 'recall':[re], 'f1':[f1]})) ada_classifier = train_ada_classifier(train_features, train_labels) acc, pr, re, f1 = eval_classifier_performance(ada_classifier, valid_features, valid_labels, print_results=False) all_results.append(pd.DataFrame({'classifier':['ada boost'], 'split_method':['random'], 'accuracy':[acc], 'precision':[pr], 'recall':[re], 'f1':[f1]})) rf_classifier = train_rf_classifier(train_features, train_labels) acc, pr, re, f1 = eval_classifier_performance(rf_classifier, valid_features, valid_labels, print_results=False) all_results.append(pd.DataFrame({'classifier':['random forest'], 'split_method':['random'], 'accuracy':[acc], 'precision':[pr], 'recall':[re], 'f1':[f1]})) # leave bouts out splits for test_model in np.arange(10): train_features, train_labels, valid_features, valid_labels = sklearn_group_split(all_features, all_labels, bout_data) dt_classifier = train_dt_classifier(train_features, train_labels) acc, pr, re, f1 = eval_classifier_performance(dt_classifier, valid_features, valid_labels, print_results=False) all_results.append(pd.DataFrame({'classifier':['decision tree'], 'split_method':['leave bouts out'], 'accuracy':[acc], 'precision':[pr], 'recall':[re], 'f1':[f1]})) ada_classifier = train_ada_classifier(train_features, train_labels) acc, pr, re, f1 = eval_classifier_performance(ada_classifier, valid_features, valid_labels, print_results=False) all_results.append(pd.DataFrame({'classifier':['ada boost'], 'split_method':['leave bouts out'], 'accuracy':[acc], 'precision':[pr], 'recall':[re], 'f1':[f1]})) rf_classifier = train_rf_classifier(train_features, train_labels) acc, pr, re, f1 = eval_classifier_performance(rf_classifier, valid_features, valid_labels, print_results=False) all_results.append(pd.DataFrame({'classifier':['random forest'], 'split_method':['leave bouts out'], 'accuracy':[acc], 'precision':[pr], 'recall':[re], 'f1':[f1]})) # leave animals out splits for test_model in np.arange(10): train_features, train_labels, valid_features, valid_labels = sklearn_group_split(all_features, all_labels, annotation_animal) dt_classifier = train_dt_classifier(train_features, train_labels) acc, pr, re, f1 = eval_classifier_performance(dt_classifier, valid_features, valid_labels, print_results=False) all_results.append(pd.DataFrame({'classifier':['decision tree'], 'split_method':['leave animal out'], 'accuracy':[acc], 'precision':[pr], 'recall':[re], 'f1':[f1]})) ada_classifier = train_ada_classifier(train_features, train_labels) acc, pr, re, f1 = eval_classifier_performance(ada_classifier, valid_features, valid_labels, print_results=False) all_results.append(pd.DataFrame({'classifier':['ada boost'], 'split_method':['leave animal out'], 'accuracy':[acc], 'precision':[pr], 'recall':[re], 'f1':[f1]})) rf_classifier = train_rf_classifier(train_features, train_labels) acc, pr, re, f1 = eval_classifier_performance(rf_classifier, valid_features, valid_labels, print_results=False) all_results.append(pd.DataFrame({'classifier':['random forest'], 'split_method':['leave animal out'], 'accuracy':[acc], 'precision':[pr], 'recall':[re], 'f1':[f1]})) # Flatten the list of dataframes into one results_df = pd.concat(all_results) Inspect the results of your scan(s) .. code-block:: console accuracy_plot = p9.ggplot(data=results_df) + \ p9.geom_point(p9.aes(x='classifier', y='accuracy')) + \ p9.facet_wrap('split_method') + \ p9.labs(title='Classifier accuracy', x='Classifier type', y='Accuracy') + \ p9.theme_bw() accuracy_plot.draw().show() .. parsed-literal:: Example Output: .. raw:: html Step 8 ------ Evaluate your best classifier on the held-out test dataset You could also potentially just train a classifier on the entire training set (after you’ve tuned the numbers with the validation set) .. code-block:: console #best_classifier = dt_classifier best_classifier = train_rf_classifier(all_features, all_labels) eval_classifier_performance(best_classifier, test_features, test_labels) .. parsed-literal:: Accuracy: 0.9844444444444445 Precision: 0.8036332179930796 Recall: 0.6761280931586608 F1-score: 0.7343873517786562 You may observe that performance has dropped a bit, but if you selected the correct train/valid split approach and didn’t over-fit on your validation data, it should still perform well. Discussion 2 ------------ Up until now, we’ve been focussing on frame-level agreement, which is simple enough for training classifiers. Are there better ways to evaluate performance of a behavior classifier? One approach could be evaluating dense annotations for bout-level agreement Other notes on the importance of bout-level agreement – eg filter/stitching/other post-processing steps Step 9 ------ Bout-level agreement on dense annotation In order to measure bout-level agreement, we need to transform the data into starts and ends We can either write the detection of bouts within predictions manually or re-use a classical compression algorithm called run length encoding Note that typical RLE algorithms will encode all states (both “behavior” and “not-behavior”), while we primarily care about only 1 of the 2 states. .. code-block:: console # Run length encoding, implemented using numpy # Accepts a 1d vector # Returns a tuple containing (starts, durations, values) def rle(inarray): ia = np.asarray(inarray) n = len(ia) if n == 0: return (None, None, None) else: y = ia[1:] != ia[:-1] i = np.append(np.where(y), n - 1) z = np.diff(np.append(-1, i)) p = np.cumsum(np.append(0, z))[:-1] return(p, z, ia[i]) Use the RLE algorithm to encode predictions We also need to separate the data by animal, because bouts can only exist within animal .. code-block:: console test_predictions = best_classifier.predict(test_features) # Arrays to store the list of bouts by animal all_pr_bouts = [] all_gt_bouts = [] # Loop over each animal and add their bouts to the lists for animal_id in np.unique(test_animals): animal_indices = test_animals==animal_id test_bout_predictions = rle(test_predictions[animal_indices]) # Only store the bouts of behavior behavior_bouts = test_bout_predictions[2]==1 if np.any(behavior_bouts): test_bout_predictions = (test_bout_predictions[0][behavior_bouts], test_bout_predictions[1][behavior_bouts]) all_pr_bouts.append(test_bout_predictions) else: all_pr_bouts.append(([],[])) test_bout_gt = rle(test_labels[animal_indices]) # Only store the bouts of behavior behavior_bouts = test_bout_gt[2]==1 if np.any(behavior_bouts): test_bout_gt = (test_bout_gt[0][behavior_bouts], test_bout_gt[1][behavior_bouts]) all_gt_bouts.append(test_bout_gt) else: all_gt_bouts.append(([],[])) Display the predictions next to their GT .. code-block:: console bout_data = [] for animal_idx, animal_id in enumerate(np.unique(test_animals)): gt_bout_data = all_gt_bouts[animal_idx] if len(gt_bout_data[0])>0: gt_bout_df = pd.DataFrame({'state':'GT', 'animal':animal_id, 'animal_idx':animal_idx, 'start_time':gt_bout_data[0], 'end_time':gt_bout_data[0]+gt_bout_data[1]}) bout_data.append(gt_bout_df) pr_bout_data = all_pr_bouts[animal_idx] if len(pr_bout_data[0])>0: pr_bout_df = pd.DataFrame({'state':'PR', 'animal':animal_id, 'animal_idx':animal_idx, 'start_time':pr_bout_data[0], 'end_time':pr_bout_data[0]+pr_bout_data[1]}) bout_data.append(pr_bout_df) bout_data = pd.concat(bout_data) bout_plot = p9.ggplot(bout_data) + \ p9.geom_rect(p9.aes(xmin='start_time', xmax='end_time', ymin='animal_idx-0.25', ymax='animal_idx+0.25', fill='factor(state)'), alpha=0.5) + \ p9.labs(title='Bout annotations and predictions', x='Time, frame', y='Animal index', fill='State') + \ p9.theme_bw() bout_plot.draw().show() .. parsed-literal:: Example Output: .. raw:: html Step 10 ------- Strategies for evaluating performance of bout-based metrics Since predictions are almost never going to be identical to the ground truth, we need to adjust our strategy for calling correct and incorrect classifications of the confusion matrix. To do this, we should attempt to detect how much the predictions overlap. In classical image processing, this was done via calculating the intersection over union (IoU) between predictions and ground truth. We can adopt the same technique for our 1-D problem (time). .. code-block:: console # Calculates the intersection of 2 bouts # Each bout is a tuple of (start, duration) def calculate_intersection(gt_bout, pr_bout): # Detect the larger of the 2 start times max_start_time = np.max([gt_bout[0], pr_bout[0]]) # Detect the smaller of the 2 end times gt_bout_end = gt_bout[0]+gt_bout[1] pr_bout_end = pr_bout[0]+pr_bout[1] min_end_time = np.min([gt_bout_end, pr_bout_end]) # Detect if the 2 bouts intersected at all if max_start_time < min_end_time: return min_end_time-max_start_time else: return 0 # Calculates the union of 2 bouts # Each bout is a tuple of (start, duration) def calculate_union(gt_bout, pr_bout): # If the 2 don't intersect, we can just sum up the durations if calculate_intersection(gt_bout, pr_bout) == 0: return gt_bout[1] + pr_bout[1] # They do intersect else: min_start_time = np.min([gt_bout[0], pr_bout[0]]) gt_bout_end = gt_bout[0]+gt_bout[1] pr_bout_end = pr_bout[0]+pr_bout[1] max_end_time = np.max([gt_bout_end, pr_bout_end]) return max_end_time - min_start_time .. code-block:: console all_intersections = [] all_unions = [] all_ious = [] # Loop over the animals again num_test_animals = len(np.unique(test_animals)) for animal_idx in np.arange(num_test_animals): # For each animal, we want a matrix of intersections, unions, and ious num_gt_bouts = len(all_gt_bouts[animal_idx][0]) num_pr_bouts = len(all_pr_bouts[animal_idx][0]) animal_intersection_mat = np.zeros([num_gt_bouts, num_pr_bouts]) animal_union_mat = np.zeros([num_gt_bouts, num_pr_bouts]) # Do a second loop for each GT bout for gt_idx in np.arange(num_gt_bouts): gt_bout = [all_gt_bouts[animal_idx][0][gt_idx], all_gt_bouts[animal_idx][1][gt_idx]] # Final loop for each proposed bout for pr_idx in np.arange(num_pr_bouts): pr_bout = [all_pr_bouts[animal_idx][0][pr_idx], all_pr_bouts[animal_idx][1][pr_idx]] # Calculate the intersections, unions, and IoUs animal_intersection_mat[gt_idx, pr_idx] = calculate_intersection(gt_bout, pr_bout) animal_union_mat[gt_idx, pr_idx] = calculate_union(gt_bout, pr_bout) # The IoU matrix will just be i / u animal_iou_mat = animal_intersection_mat/animal_union_mat # Add the animal data to the resulting lists all_intersections.append(animal_intersection_mat) all_unions.append(animal_union_mat) all_ious.append(animal_iou_mat) Now that we have IoUs, we can try and apply thresholds for calculating things like precision/recall Since it’s hard to define “True Negatives” with the IoU technique, we can skip that for now .. code-block:: console # Define detection metrics, given a IoU threshold def calc_temporal_iou_metrics(iou_data, threshold): tp_counts = 0 fn_counts = 0 fp_counts = 0 for cur_iou_mat in iou_data: tp_counts += np.sum(np.any(cur_iou_mat>threshold, axis=1)) fn_counts += np.sum(np.all(cur_iou_mat Discussion 3 ------------ Pros and Cons to bout based analysis False positives are overestimated, because there may be multiple predicted bouts within 1 real bout. Are there potential methods to try and fix this issue? Possible techniques include: 1. Filtering out short/spurious predictions 2. Stitching together multiple predictions close in time 3. Adjusting what we’re calling TP/FN/FP to be more lenient on the exact number of bouts Step 11 ------- Post-processing techniques .. code-block:: console def merge_behavior_gaps(bout_starts, bout_durations, bout_states, max_gap_size, state_to_merge = False): gaps_to_remove = np.logical_and(bout_states==state_to_merge, bout_durations .. code-block:: console bout_data = [] for animal_idx, animal_id in enumerate(np.unique(test_animals)): gt_bout_data = all_gt_bouts[animal_idx] if len(gt_bout_data[0])>0: gt_bout_df = pd.DataFrame({'state':'GT', 'animal':animal_id, 'animal_idx':animal_idx, 'start_time':gt_bout_data[0], 'end_time':gt_bout_data[0]+gt_bout_data[1]}) bout_data.append(gt_bout_df) pr_bout_data = filtered_pr_bouts[animal_idx] if len(pr_bout_data[0])>0: pr_bout_df = pd.DataFrame({'state':'PR', 'animal':animal_id, 'animal_idx':animal_idx, 'start_time':pr_bout_data[0], 'end_time':pr_bout_data[0]+pr_bout_data[1]}) bout_data.append(pr_bout_df) bout_data = pd.concat(bout_data) bout_plot = p9.ggplot(bout_data) + \ p9.geom_rect(p9.aes(xmin='start_time', xmax='end_time', ymin='animal_idx-0.25', ymax='animal_idx+0.25', fill='factor(state)'), alpha=0.5) + \ p9.labs(title='Bout annotations and predictions', x='Time, frame', y='Animal index', fill='State') + \ p9.theme_bw() bout_plot.draw().show() .. parsed-literal:: Example Output: .. raw:: html