BRSET: A Brazilian Multilabel Ophthalmological Dataset of Retina Fundus Photos

Introduction: The Brazilian Multilabel Ophthalmological Dataset (BRSET) addresses the scarcity of publicly available ophthalmological datasets in Latin America. BRSET comprises 16,266 color fundus retinal photos from 8,524 Brazilian patients, aiming to enhance data representativeness, serving as a research and teaching tool. It contains sociodemographic information, enabling investigations into differential model performance across demographic groups. Methods: Data from three São Paulo outpatient centers yielded demographic and medical information from electronic records, including nationality, age, sex, clinical history, insulin use, and duration of diabetes diagnosis. A retinal specialist labeled images for anatomical features (optic disc, blood vessels, macula), quality control (focus, illumination, image field, artifacts), and pathologies (e.g., diabetic retinopathy). Diabetic retinopathy was graded using International Clinic Diabetic Retinopathy and Scottish Diabetic Retinopathy Grading. Validation used Dino V2 Base for feature extraction, with 70% training and 30% testing subsets. Support Vector Machines (SVM) and Logistic Regression (LR) were employed with weighted training. Performance metrics included area under the receiver operating curve (AUC) and Macro F1-score. Results: BRSET comprises 65.1% Canon CR2 and 34.9% Nikon NF5050 images. 61.8% of the patients are female, and the average age is 57.6 years. Diabetic retinopathy affected 15.8% of patients, across a spectrum of disease severity. Anatomically, 20.2% showed abnormal optic discs, 4.9% abnormal blood vessels, and 28.8% abnormal macula. Models were trained on BRSET in three prediction tasks: “diabetes diagnosis”; “sex classification”; and “diabetic retinopathy diagnosis”. Discussion: BRSET is the first multilabel ophthalmological dataset in Brazil and Latin America. It provides an opportunity for investigating model biases by evaluating performance across demographic groups. The model performance of three prediction tasks demonstrates the value of the dataset for external validation and for teaching medical computer vision to learners in Latin America using locally relevant data sources.


Introduction
In ophthalmological practice, imaging assists in the diagnosis and follow-up of ocular conditions, including retinal fundus photos, ocular anterior segment photos, corneal topography, visual field tests, and optical coherence tomography [1,2].Artificial intelligence (AI) algorithms can potentially improve medical care by facilitating access to screening, diagnosis, and monitoring in resource-limited settings and assist with the decision-making process [1][2][3].In ophthalmology, AI holds promise for ocular diseases such as diabetic retinopathy, age-related macular degeneration, glaucoma, and retinopathy of prematurity [1,2,[4][5][6][7][8].While AI represents a breakthrough technology, concerns for unfair algorithms resulting from non-representative data and biased models cannot be ignored [9][10][11].
The open science movement in healthcare has not gained traction in Latin America [14].In ophthalmology, the majority of the available datasets come from high-income countries, as can be seen in Table 1.In addition, datasets lack demographic and crucial clinical information such as comorbidities [12].In low and middle-income countries (LMIC), the number of ophthalmologists relative to the population is not adequate [15].Autonomous systems may increase ophthalmological coverage and reduce preventable blindness; however, datasets that do not adequately represent those who are disproportionately impacted by the disease lead to biased and harmful algorithms [12].
. BRSET is the first publicly available Latin American ophthalmological dataset across a range of disease severity and sociodemographic categories.

Materials and methods
This study was approved by the Sao Paulo Federal University (UNIFESP) IRB (CAAE 33842220.7.0000.5505)and included retinal fundus photos and clinical data.In this dataset, identifiable patient information was removed from all images.

Data sources
We included data from three outpatient Brazilian ophthalmological centers in São Paulo evaluated from 2010 to 2020 and from the Sao Paulo Federal University ophthalmology sector.

Data Collection
Images present in this dataset were collected using different retinal fundus cameras, including Nikon NF505 (Nikon, Tokyo, Japan), Canon CR-2 (Canon Inc, Melville, NY, USA) retinal camera, and Phelcom Eyer (Phelcom Technologies, MA, USA).Retinal photos were taken by previously trained non-medical professionals in pharmacological mydriasis.

Dataset preparation
The file identification was removed from all fundus photos, as well as sensitive data (e.g., patient name, exam date).Every image was reviewed to ensure the absence of protected health information in images.The images were exported directly from retinal cameras in JPEG format, and no preprocessing techniques were performed.The image viewpoint can be macula-centered or optic disc-centered.The dataset does not include fluorescein angiogram photos, non-retinal images, or duplicated images.

Metadata
Each retinal image is labeled with the retinal camera device, image center position, patient nationality, age in years, sex, comorbidities, insulin use, and duration of diabetes diagnosis.
The demographics and medical features were collected from the electronic medical records.
A retinal specialist ophthalmologist labeled all the images based on criteria that were established by the research group [16].The following characteristics were labeled: • Anatomic classification: The retinal optic disc (vertical cup-disc ratio of ≥0.65 [17]), retinal vessels (tortuosity and width), and macula (abnormal findings) were categorized as either normal or abnormal.
• Quality control parameters: Parameters including image focus, illumination, image field, and artifacts, were assessed and classified as satisfactory or unsatisfactory.
The criteria are defined in Table 2.
• Diabetic retinopathy classification: Diabetic retinopathy was classified using the International Clinic Diabetic Retinopathy (ICDR) grading and Scottish Diabetic Retinopathy Grading (SDRG), as can be seen in Table 3 [18,19].

Illumination
This parameter is graded as adequate when both of the following requirements are met: 1) Absence of dark, bright, or washed-out areas that interfere with detailed grading; 2) In the case of peripheral shadows (e.g., due to pupillary constriction) the readable part should reach more than 80% of the whole image.

Image Field
This parameter is graded as adequate when all the following requirements are met: 1) The optic disc is at least 1 disc diameter (DD) from the nasal edge; 2) The macular center is at least 2 DD from the temporal edge; 3) The superior and inferior temporal arcades are visible in a length of at least 2 DD

Artifacts
The following artifacts are considered: haze, dust, and dirt.This parameter is graded as adequate when the image is sufficiently artifact-free to allow adequate grading.

Focus
This parameter is graded as adequate when the focus is sufficient to identify third-generation branches within one optic disc diameter around the macula.Among the anatomical criteria, 3,281 (20.2%) of the images have an abnormal optic disc, 807 (4.9%) have abnormal blood vessels, and 4,685 (28.8%) have an abnormal macula.The distribution of normal exams and pathological findings is described in Table 5.

Quality Metadata:
On the other hand, regarding the quality characteristics of the metadata duplicate data and missing data were assessed.The dataset does not have duplicate data, and only 3 columns contain missing data.As can be seen in Figure 3, the columns patient_age, diabetes_time_y, and insulin contain missing values.Dino V2 Base [18], a foundational computer vision model, was used in all the experiments to extract embeddings from the BRSET images.Embeddings are representations of the original data in a lower dimensionality space, which allows us to develop a parsimonious solution, without the need for large computational resources.Although the use of these embeddings provides an effective and computationally efficient strategy, these results must be interpreted as a baseline, and other models or more complex techniques are likely to improve the tasks' performance.
After being extracted, the image embeddings were saved in a CSV file with 769 columns, where the first column represented the identifier of the image, and the remaining 768 columns represented the embeddings of each image.These resulting embeddings were given the imbalance, were measured as performance metrics (Table 6).
As depicted in Table 6, the dataset can be used, not only for clinical classification tasks but also to identify and quantify differential model performance across sex.Table 6: Performance metrics of the BRSET for the three selected binary classification tasks.
The tasks are "diabetes diagnosis", "sex classification", and "diabetic retinopathy diagnosis".training, and sign a data use agreement that forbids re-identification of patients and sharing it those who are not credentialed.

Limitations
Our dataset only includes one nationality and represents general ophthalmological outpatients.As a result, the disease distribution is imbalanced, with high percentage of normal and mild cases.

Strengths
The BRSET is the first multilabel ophthalmological dataset from Brazil and Latin America.
BRSET aims to improve data representativeness and create the framework for ophthalmological dataset development.To the best of our knowledge, BRSET is the only publicly available retinal image dataset that also contains sociodemographic data such as sex and age in Latin America, which allows us to investigate algorithmic bias across different demographic groups.

Figure 1 .
Figure 1.Histogram of age distribution in BRSET.The x-axis represents age groups, while the y-axis indicates the frequency of individuals in each age category.

Figure 2 .
Figure 2. Sample Retina Images with and without Diabetic Retinopathy (DR) from the BRSET Dataset.This figure presents a visual sample of three six images selected from the dataset, with labels indicating the presence (1) or absence (0) of diabetic retinopathy (DR).

Figure 3 .
Figure 3. Missing Values Percentage per Column in the BRSET Dataset

.
used as features to train classification Machine Learning (ML) models.The three tasks were binary classification tasks: "diabetes diagnosis", "sex classification", and "diabetic retinopathy diagnosis".The dataset was divided into training and testing using 70%(11,386 embeddings)  for training and 30% (4,880 embeddings) for evaluation.The chosen models for the classification tasks were Support Vector Machines (SVM) with linear kernel and Logistic Regression (LR).The models were trained using class weights to tackle class imbalance during training.The weights were calculated using the equation (1), where w are the weights of class C, N is the total samples in the training dataset, K is the number of classes, and Nc is the samples of class in the training dataset.Finally, the AUC, and the Macro F1-score,

Table 1 :
Comparative table with open-access ophthalmological datasets.
This dataset may be used to build computer vision models that predict demographic characteristics and multi-label disease classification.BRSET consists of 16,266 images from 8,524 Brazilian patients, and a metadata file called labels.csv.Columns are detailed in

Table 4 .Table 4 .
Data Dictionary -description of present columns

Table 5 :
Distribution of retina images according to diagnoses and anatomical criteria