Evaluation 📊


Score Metrics

We used the mean F1 (mF1) score as the primary metric, defined as the average of the F1 scores over all cell classes. The F1 score is a commonly used metric for cell detection because it accounts for precision and recall simultaneously. For each cell class, the F1 score is computed by the following equations,

  • F1 = 2 * Precision * Recall / (Precision + Recall)
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)

where TP, FP, and FN denote True Positive, False Positive, and False Negative detections, respectively. Note that only mF1 is used for ranking the algorithms; the other metrics (per-class precision, recall, and F1) are visualized in the leaderboard for reference only.
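
As a quick illustration, here is a minimal Python sketch of this scoring. It is not the official implementation (see the repository linked at the end of this section for that); the class names and counts below are hypothetical example data.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from detection counts; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-class detection counts, purely for illustration.
counts = {
    "tumor_cell": {"tp": 80, "fp": 15, "fn": 20},
    "background_cell": {"tp": 120, "fp": 30, "fn": 10},
}

per_class_f1 = {cls: f1_score(**c) for cls, c in counts.items()}
# mF1 is the unweighted average of the per-class F1 scores.
mf1 = sum(per_class_f1.values()) / len(per_class_f1)
print(per_class_f1, mf1)
```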


Hit Criterion for Cell Detection 

To determine the TP, FP, and FN counts, we followed the process below for each cell class (a code sketch of the procedure follows the steps).

Step 1. Retrieve the cell predictions and ground-truth cells of a given class.
Step 2. Sort the cell predictions by their confidence score.
Step 3. Starting from the cell prediction with the highest confidence score, check whether any ground-truth cell is within a valid distance (~15 pixels, ~3 µm) of the cell prediction.
      Step 3.1. If there is no ground-truth cell within the valid distance, the cell prediction is counted as an FP.
      Step 3.2. If there are one or more ground-truth cells within the valid distance, the cell prediction is counted as a TP. The nearest ground-truth cell is matched with the cell prediction and is not considered for further matching.
Step 4. Repeat Step 3 for each remaining cell prediction, in descending order of confidence, until the prediction with the lowest confidence score has been processed.
Step 5. Any remaining ground-truth cells that are not matched with a cell prediction are counted as FN.
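
The following is a minimal sketch of this greedy matching for a single class, assuming predictions are (x, y, confidence) tuples and ground-truth cells are (x, y) coordinates in pixels. The function name and signature are illustrative only; the official implementation lives in the repository linked below.

```python
import numpy as np

def match_predictions(preds, gts, valid_dist=15.0):
    """Greedily match predictions to ground-truth cells for one class.

    preds: list of (x, y, confidence); gts: list of (x, y).
    Returns (tp, fp, fn) counts.
    """
    # Step 2: sort predictions by confidence, highest first.
    preds = sorted(preds, key=lambda p: p[2], reverse=True)
    gts = [np.asarray(g, dtype=float) for g in gts]
    matched = [False] * len(gts)
    tp = fp = 0

    # Steps 3-4: process predictions in descending confidence order.
    for x, y, _conf in preds:
        p = np.array([x, y], dtype=float)
        # Find the nearest still-unmatched ground-truth cell within valid_dist.
        best_idx, best_dist = None, valid_dist
        for i, g in enumerate(gts):
            if matched[i]:
                continue  # already matched, not considered again (Step 3.2)
            d = np.linalg.norm(p - g)
            if d <= best_dist:
                best_idx, best_dist = i, d
        if best_idx is None:
            fp += 1  # Step 3.1: no ground-truth cell nearby -> FP
        else:
            matched[best_idx] = True
            tp += 1  # Step 3.2: nearest ground-truth cell matched -> TP

    # Step 5: unmatched ground-truth cells are false negatives.
    fn = matched.count(False)
    return tp, fp, fn

# Hypothetical example: two predictions, one ground-truth cell.
print(match_predictions([(10, 10, 0.9), (100, 100, 0.4)], [(12, 11)]))
# -> (1, 1, 0): one TP, one FP, no FN
```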

Please refer to the diagram below to better understand how the matching process works.




Evaluation Code on GitHub

Please refer to https://github.com/lunit-io/ocelot23algo/tree/main/evaluation for more details and the actual code used for the evaluation.