The algorithm selects the most difficult sequences from each cluster.
Prerequisites:
Cluster the sequences on their automatically calculated attributes (excluding the occlusion attribute) using affinity propagation (AP) with the Hamming distance.
This yields K clusters, where cluster k contains N_k sequences. The goal is a dataset of M sequences sampled proportionally from the clusters, as follows.
Compute the maximal number of sequences that may be sampled from each cluster as floor(N_k*M/N_all), where N_all is the total number of sequences over all clusters.
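The per-cluster limit can be illustrated with a minimal Python sketch. The function name cluster_quotas and the example sizes are hypothetical, and N_all is assumed to be the sum of all cluster sizes.

```python
def cluster_quotas(cluster_sizes, m):
    """Return floor(N_k * M / N_all) for every cluster size N_k."""
    n_all = sum(cluster_sizes)  # assumption: N_all is the total sequence count
    # Integer floor division avoids floating-point rounding surprises.
    return [(n_k * m) // n_all for n_k in cluster_sizes]


even_split = cluster_quotas([10, 25, 15], 20)
uneven_split = cluster_quotas([7, 5, 3], 10)
print(even_split, sum(even_split))
print(uneven_split, sum(uneven_split))
```

Because of the floor, the limits can add up to fewer than M (as in the second call). The description here does not say how such a shortfall is handled.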
Initialization:
Main loop:
Compute the average attribute vector of the selected sequences. Normalize this vector so that it sums to 1 (denote this as the balance vector). Each entry is then a value between 0 and 1 indicating how strongly that attribute is represented in the selected dataset.
Identify the attribute that is least represented in the selected set of sequences (the minimum of the balance vector). Several attributes may be equally poorly represented, so select them as follows. Let h = [h1 h2 h3] be the balance vector. Scale it to h/h_max, where h_max is the value of the most represented attribute, and select every attribute for which h < (h_min + 0.1/nc) holds. The hysteresis of 0.1/nc corresponds to 10% of a uniform distribution over nc classes, where nc is the number of attributes used for sampling (config.sequence_selection_attr).
Among all remaining sequences in all clusters, identify the sequences that contain the under-represented attributes identified in the previous step (ignore the clusters whose maximal number of sampled sequences has been reached).
From the sequences identified in the previous step, select the one with the highest level of difficulty and add it to the new dataset.
Return to the first step of the main loop (recomputing the balance vector) and continue until the dataset contains M sequences.
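The main loop can be sketched in Python as below. This is an illustration under stated assumptions, not the toolkit's implementation. All function and variable names are hypothetical. Attributes are assumed to be binary, difficulty is assumed to be one scalar per sequence, h_min is assumed to be taken after scaling by h_max, and a sequence is assumed to qualify if it has any of the under-represented attributes. The initialization is not specified here, so the sketch takes a non-empty seed selection as input.

```python
def balance_vector(rows):
    """Average attribute vector of the selected rows, normalized to sum to 1."""
    n_attr = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(n_attr)]
    total = sum(mean)
    return [v / total for v in mean] if total > 0 else [0.0] * n_attr


def underrepresented(h, hysteresis=0.1):
    """Indices j with h[j]/h_max < h_min + hysteresis/nc (h_min after scaling)."""
    nc = len(h)
    h_max = max(h)
    scaled = [v / h_max for v in h] if h_max > 0 else list(h)
    threshold = min(scaled) + hysteresis / nc
    return [j for j, v in enumerate(scaled) if v < threshold]


def select_sequences(attributes, difficulty, labels, quotas, m, seed):
    """Greedily add the most difficult eligible sequence until m are selected."""
    selected = list(seed)  # assumption: initialization supplies a non-empty seed
    taken = {}             # number of sequences already sampled per cluster
    for i in selected:
        taken[labels[i]] = taken.get(labels[i], 0) + 1
    while len(selected) < m:
        h = balance_vector([attributes[i] for i in selected])
        wanted = underrepresented(h)
        candidates = [
            i for i in range(len(attributes))
            if i not in selected
            and taken.get(labels[i], 0) < quotas[labels[i]]  # skip full clusters
            and any(attributes[i][j] for j in wanted)        # assumption: any match
        ]
        if not candidates:
            break  # no eligible sequence; this case is not covered by the description
        best = max(candidates, key=lambda i: difficulty[i])
        selected.append(best)
        taken[labels[best]] = taken.get(labels[best], 0) + 1
    return selected


# Toy data: six sequences, three binary attributes, two clusters.
attributes = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1], [1, 1, 0]]
difficulty = [0.9, 0.2, 0.5, 0.7, 0.6, 0.1]
labels = [0, 0, 0, 1, 1, 1]
result = select_sequences(attributes, difficulty, labels, quotas=[2, 2], m=4, seed=[0])
print(result)
```

The sketch stops early when no candidate remains, which can happen if the per-cluster limits sum to less than M.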
Input:
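config (structure): configuration structure (fields used here: sequence_selection_attr, result_base_dir, show_visualization, result_directory_clusters_img)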
sequences (cell): an array of sequence structures.
feature_vectors_scaled (matrix #sequences x #num_attributes): feature vector for each sequence (row-wise)
clusters_struct (structure): clustering structure from compute_clusters function (clusters_ap was used for VOT)
performance_accumulated (matrix #sequences x 2): average quantized robustness and accuracy for each sequence (row-wise)
Output:
'final_selection.txt': file with the selected sequences, written to the directory specified in the config.result_base_dir variable
(optional, only if config.show_visualization == 1) 'cluster-selection_*.png': visualization of the selected sequences, written to the config.result_directory_clusters_img directory