Evaluation 📊
Score Metrics
We used the mean F1 (mF1) score as the primary metric, which is the average of the per-class F1 scores over all cell classes. The F1 score is a commonly used metric for cell detection that considers precision and recall simultaneously. For each cell class, the F1 score is computed by the following equations,
- F1 = 2 * Precision * Recall / (Precision + Recall)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
where TP, FP, and FN denote True Positive, False Positive, and False Negative detections, respectively. Note that only mF1 is used for ranking the algorithms; the other metrics (per-class precision, recall, and F1) are shown in the leaderboard for reference only.
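For concreteness, the snippet below computes these metrics in Python. It is a minimal sketch, not the challenge's evaluation interface: the function names and the per-class (TP, FP, FN) input format are assumptions made for illustration (the official code is linked at the end of this page).

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from detection counts; defined as 0 when precision + recall is 0."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_f1(counts_per_class: dict[str, tuple[int, int, int]]) -> float:
    """mF1: unweighted mean of the per-class F1 scores."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in counts_per_class.values()]
    return sum(scores) / len(scores)


# Hypothetical counts for two cell classes:
print(mean_f1({"tumor": (80, 10, 20), "background": (150, 30, 25)}))
```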
Hit Criterion for Cell Detection
To determine the TP, FP, and FN counts, we followed the process below for each cell class.
Step 1. Retrieve the cell predictions and the ground-truth cells of a given class.
Step 2. Sort the cell predictions by their confidence scores.
Step 3. Starting from the cell prediction with the highest confidence score, check whether any ground-truth cell lies within a valid distance (~15 pixels, ~3 µm) of the cell prediction.
Step 3.1. If no ground-truth cell is within the valid distance, the cell prediction is counted as an FP.
Step 3.2. If one or more ground-truth cells are within the valid distance, the cell prediction is counted as a TP. The nearest ground-truth cell is matched with the cell prediction and is excluded from further matching.
Step 4. Repeat Step 3 for the remaining predictions, in descending order of confidence, until the cell prediction with the lowest confidence score has been processed.
Step 5. The remaining ground-truth cells that are not matched with any cell prediction are counted as FNs.
Please refer to the diagram below to better understand how the process works.
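To make the matching steps concrete, here is a minimal Python sketch of the greedy matching for a single cell class. The point formats ((x, y, confidence) for predictions, (x, y) for ground truth) and the function name `match_detections` are illustrative assumptions; the official implementation is in the repository linked below.

```python
import math

# "Valid distance" from Step 3: ~15 pixels (~3 µm). Assumed constant for illustration.
VALID_DISTANCE = 15.0


def match_detections(predictions, ground_truths, valid_distance=VALID_DISTANCE):
    """Greedy matching for one cell class.

    predictions: list of (x, y, confidence) tuples
    ground_truths: list of (x, y) tuples
    Returns the (TP, FP, FN) counts.
    """
    # Steps 2-3: visit predictions from highest to lowest confidence.
    predictions = sorted(predictions, key=lambda p: p[2], reverse=True)
    unmatched = list(ground_truths)  # ground-truth cells still available
    tp = fp = 0
    for px, py, _conf in predictions:
        # Find the nearest ground-truth cell that is still unmatched.
        best_idx, best_dist = None, float("inf")
        for i, (gx, gy) in enumerate(unmatched):
            dist = math.hypot(px - gx, py - gy)
            if dist < best_dist:
                best_idx, best_dist = i, dist
        if best_idx is not None and best_dist <= valid_distance:
            # Step 3.2: TP; the matched ground-truth cell leaves the pool.
            tp += 1
            unmatched.pop(best_idx)
        else:
            # Step 3.1: nothing within the valid distance -> FP.
            fp += 1
    # Step 5: leftover ground-truth cells are FNs.
    return tp, fp, len(unmatched)
```

For example, `match_detections([(10, 10, 0.9), (40, 40, 0.3)], [(12, 11)])` returns (1, 1, 0): the confident prediction matches the nearby ground-truth cell (TP), and the remaining prediction finds no ground truth within 15 pixels (FP).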
Evaluation code on GitHub
Please refer to https://github.com/lunit-io/ocelot23algo/tree/main/evaluation for more details and the actual code used for the evaluation.