Dataset📂


Download the OCELOT Dataset

The OCELOT dataset can be downloaded from Zenodo. We collect basic information (name, email, institution) along with a short justification for requesting the dataset.


Introduction

The OCELOT dataset is a histopathology dataset designed to facilitate the development of methods that utilize cell and tissue relationships. The dataset comprises small and large field-of-view (FoV) patches with overlapping regions, extracted from digitally scanned whole slide images (WSIs). The small and large FoV patches are accompanied by annotations of cells and tissues, respectively. The WSIs are sourced from the publicly available TCGA database and were stained using the H&E method before being scanned with an Aperio scanner. Each sample of the OCELOT dataset is composed of six components,

(x_s, y_s^c, x_l, y_l^t, c_x, c_y)

where x_s, x_l are the small and large FoV patches extracted from the WSI, y_s^c, y_l^t refer to the corresponding cell and tissue annotations, respectively, and c_x, c_y are the relative coordinates of the center of x_s within x_l. The figure below shows the visualization of a sample.

Each sample of the dataset consists of two input patches and the corresponding annotations. The left shows the large FoV patch x_l with the tissue segmentation annotation y_l^t, where green denotes the cancer area. The right shows the small FoV patch x_s with the cell point annotation y_s^c, where blue and yellow dots denote tumor and background cells, respectively. The red box indicates the size and location of x_s with respect to x_l. Note that for every sample, x_s and x_l overlap, i.e. x_s lies inside x_l. However, the relative location of x_s within x_l varies per sample.


Patch Configurations

Cell detection tasks benefit from fine-grained spatial information to better capture detailed cell properties (e.g. border, shape, color, and opacity). In contrast, tissue segmentation requires a larger context to enable a better understanding of the overall structural information. Therefore, we define the FoV sizes of x_s (cell detection) and x_l (tissue segmentation) as 1024×1024 and 4096×4096 pixels, respectively, at a resolution of 0.2 microns per pixel (MPP). Finally, the large FoV patches and tissue annotations (x_l, y_l^t) are down-sampled by a factor of 4, resulting in a size of 1024×1024 pixels, i.e. an effective resolution of 0.8 MPP.
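The patch geometry above implies that x_s covers one quarter of the side length of x_l, i.e. a 256×256 pixel window in the down-sampled x_l. The following minimal sketch computes that window (the red box in the figure). It assumes that c_x and c_y are stored as fractions in [0, 1] of the width and height of x_l; the function name small_fov_box and its parameters are hypothetical and not part of the dataset tooling.

```python
def small_fov_box(c_x, c_y, large_size=1024, fov_ratio=4):
    """Locate x_s inside the down-sampled x_l.

    Assumption: (c_x, c_y) are fractions in [0, 1] of the width and
    height of x_l. Returns (left, top, right, bottom) in pixels of the
    down-sampled x_l.
    """
    if not (0.0 <= c_x <= 1.0 and 0.0 <= c_y <= 1.0):
        raise ValueError("c_x and c_y are assumed to lie in [0, 1]")
    # x_s spans 1024 of the 4096 original pixels of x_l (1/4 of its side),
    # which is 256 pixels once x_l has been down-sampled to 1024x1024.
    side = large_size // fov_ratio
    left = round(c_x * large_size - side / 2)
    top = round(c_y * large_size - side / 2)
    return left, top, left + side, top + side


# A small FoV patch centered in the large FoV patch.
print(small_fov_box(0.5, 0.5))  # (384, 384, 640, 640)
```

Cropping y_l^t with this box and up-sampling the crop by a factor of 4 would give a tissue map aligned with x_s, under the same assumption about c_x and c_y.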


Subsets

The dataset is divided into three subsets: training, validation, and test, following an approximate 6:2:2 ratio. Specifically, the training, validation, and test splits consist of 400, 137, and 130 patch pairs, respectively. To prevent information leakage among the data subsets, we randomly split the dataset per WSI, so that patches from the same WSI never appear in different splits. We maintain consistent cancer-type ratios in each subset.
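For readers who want to build their own leakage-free splits, the per-WSI idea can be sketched as follows. This is an illustration only, not the procedure used to create the official splits: it applies the ratio to the number of WSIs rather than patch pairs, and it does not balance cancer types. The function split_by_wsi and the patch-to-WSI mapping are hypothetical.

```python
import random
from collections import defaultdict


def split_by_wsi(patch_to_wsi, ratios=(0.6, 0.2, 0.2), seed=0):
    """Assign whole WSIs to train/val/test so no WSI spans two splits.

    patch_to_wsi: dict mapping a patch id to the id of its source WSI
    (a hypothetical input format).
    """
    # Group patches by their source WSI.
    groups = defaultdict(list)
    for patch_id, wsi_id in patch_to_wsi.items():
        groups[wsi_id].append(patch_id)

    # Shuffle WSI ids reproducibly, then cut the list by the ratios.
    wsi_ids = sorted(groups)
    random.Random(seed).shuffle(wsi_ids)
    n_train = round(len(wsi_ids) * ratios[0])
    n_val = round(len(wsi_ids) * ratios[1])
    parts = {
        "train": wsi_ids[:n_train],
        "val": wsi_ids[n_train:n_train + n_val],
        "test": wsi_ids[n_train + n_val:],
    }
    # Expand each group of WSIs back into its patches.
    return {
        name: sorted(p for w in ids for p in groups[w])
        for name, ids in parts.items()
    }


# Toy example: 10 WSIs with 3 patches each.
toy = {f"patch_{w}_{k}": f"wsi_{w}" for w in range(10) for k in range(3)}
splits = split_by_wsi(toy)
print({name: len(ids) for name, ids in splits.items()})
```

Because whole WSIs are assigned to a split, the patch-level ratio only approximates the target when WSIs contribute different numbers of patches.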


Label Information

We use the following class schemes:

  • Cell: Background Cell (BC, index 1) and Tumor Cell (TC, index 2)
  • Tissue: Background (BG, index 1), Cancer Area (CA, index 2), and Unknown (not labeled, index 255)

For cell point annotations, we use an x-y coordinate system whose origin (0,0) is the top-left pixel and whose last pixel (1023,1023) is at the bottom-right.
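The class indices and coordinate range above can be checked with a few lines of code. The sketch below assumes that cell annotations have already been loaded as (x, y, class_index) tuples and that a tissue mask is available as a 2D list of class indices; both in-memory formats and the helper names are assumptions made for illustration, not a description of the files on disk.

```python
CELL_CLASSES = {1: "BC", 2: "TC"}  # Background Cell, Tumor Cell
TISSUE_BG, TISSUE_CA, TISSUE_UNKNOWN = 1, 2, 255


def count_cells(points, size=1024):
    """Validate (x, y, class_index) tuples and count cells per class."""
    counts = {name: 0 for name in CELL_CLASSES.values()}
    for x, y, cls in points:
        # Coordinates run from (0, 0) top-left to (size-1, size-1).
        if not (0 <= x < size and 0 <= y < size):
            raise ValueError(f"point ({x}, {y}) lies outside the patch")
        if cls not in CELL_CLASSES:
            raise ValueError(f"unknown cell class index {cls}")
        counts[CELL_CLASSES[cls]] += 1
    return counts


def cancer_area_fraction(mask):
    """Fraction of labeled tissue pixels that are Cancer Area.

    Pixels with index 255 (Unknown) are excluded from the denominator.
    Returns None if the mask contains no labeled pixels.
    """
    labeled = 0
    cancer = 0
    for row in mask:
        for value in row:
            if value == TISSUE_UNKNOWN:
                continue
            labeled += 1
            if value == TISSUE_CA:
                cancer += 1
    return cancer / labeled if labeled else None


print(count_cells([(0, 0, 1), (1023, 1023, 2), (10, 20, 2)]))
print(cancer_area_fraction([[1, 2], [255, 2]]))
```

Excluding index 255 in this way mirrors the usual practice of ignoring unlabeled pixels when computing segmentation losses or metrics.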


For more details about the dataset, please refer to https://lunit-io.github.io/research/ocelot_dataset/.