ABSTRACT
Purpose We present a first epidemiological survey from North Cyprus to determine predictive factors for breast cancer subtypes.
Methods More than 300 breast cancer patients, 90% of them having subtype information, are surveyed from the State Hospital in Nicosia between 2006 – 2015 for their demographic, reproductive, genetic, epidemiological factors. The breast cancer subtypes, Estrogen receptor (ER) +/-, Progesterone receptor (PR) +/-, and human epidermal growth factor 2 (HER2) +/- status, are determined. Single and multiple variable, regularized regressions, with predictive factors as independent variables, breast cancer subtypes as dependent variables are conducted.
Results Our cohort differs significantly from larger cohorts (e.g., Breast Cancer Family Registry), in age, menopause status, age of menarche, parity, education, oral contraceptive use, breastfeeding, but the distribution of breast subtypes is not significantly different. Subtype distribution in our cohort is also not different from another Turkish cohort. We show that the ER+ subtype is positively related to age/post-menopause; ER+/PR+ is positively associated with age, but negatively associated with cancer stage; HER2+, which is negatively correlated with ER+ and ER+/PR+, is positively related to cancer stage but negatively associated with age/post-menopause.
Conclusion Assuming ER+ and ER+/PR+ to have better prognostic, HER+ to have worse prognostic, then older age and postmenopause seem to be beneficial, smoking and family history of cancer seem to be detrimental. Next steps include looking at potential biomarkers and using cure models to determine long-term survivors.
- breast cancer subtypes
- predictive factors
- estrogen
- progesterone
- human epidermal receptors
- regularized regression
- LASSO
- ridge
- elastic nets
Introduction
Breast cancer is the most common type of cancer diagnosed in the Western part of the world. In Europe, more than 523,000 women were diagnosed with breast cancer in 2018 and more than 138,000 women died from it [1]. World-wide, close to 2 million women are diagnosed with breast cancer each year and approximately 30% die from this disease. Breast cancer is largely viewed as a disease predominantly influenced by risk factors related to lifestyle [2] though through the twin studies of heritability of breast cancer, genetic contribution can still be significant [3]. Recent work to combine contribution from many genetic variants to breast cancer achieves an above 60% area-under-receive-operator-curve prediction rate [4, 5], and 20% variance explained [6].
Female hormones may affect breast cancer, and their status have been used to classify breast cancers. In particular, estrogen receptor positive (ER+) or negative (ER-), progesterone receptor positive (PR+) or negative (PR-), human epidermal growth factor 2 positive (HER2+) or negative (HER-) are the major classification schemes of breast cancer subtypes. It has been shown that ER+/-, PR+/-, HER+/- breast cancer subtypes have different clinical features [7], the cancer etiology of these subtypes can be heterogeneous, and treatment strategies also diverge. In particular, hormone receptor-positive (ER+ or PR+) subtype should expect good prognosis, using drugs like Tamoxifen/Nolvadex Similarly, the more aggressive HER2+ subtype can be treated successfully with drugs like Trastuzumab/Herceptin. In contrast, the triple-negative subtype (ER-PR-HER2-) poses challenges in treatment strategies [8].
Although international and national studies of breast cancer have been conducted with large sample sizes, such as BCFR (www.bcfamilyregistry.org), GICR (gicr.iarc.fr), BCSC (www.bcsc-research.org), there has never been a breast cancer survey on the subtype distributions, potentially explanatory variables, and correlation between these variables and breast cancer subtypes, in North Cyprus (though there are some studies in Turkey [9]). To fill this gap, we present a first epidemiological survey of around 300 breast cancer patients from North Cyprus – among them around 230 Turkish Cypriots.
We collected and analyzed reproductive (age of menarche, number of children (zero for nulliparity), menopause status, hormone therapy, oral contraceptive use, breastfeeding, left/right breast with cancer), demographic (age at diagnosis, education level, housewife/employed), genetic (first relative having cancer), and epidemiological (smoking, other cancers) information. Most of these factors are known to be risk factors for breast cancer, e.g., early menarche, late menopause, nulliparity, long hormone replacement therapy, older age, family history of breast cancer, but it is unclear which factor is predictive for breast cancer subtypes.
Our analysis strategy is as follows: We treat ER, PR, HER2 as dependent variables, and other as independent variables. As we do not have control (noncancer) samples, it is a case-only analysis or subtypes-with-case analysis [10, 11]. The first analysis is to compare our independent and dependent variables distribution with a major public breast cancer databases. Second, correlation between the cancer subtypes are determined. Third, uni-variate, multiple, and regularization logistic regression are performed to detect any factor-subtype association, i.e., to identify potential predictive factors for breast cancer subtypes. We will show that though there are some minor surprises, our cohort conforms with some other studies concerning predictive factors of breast cancer subtypes.
Methods
Sample collection
We included 324 samples (321 female, 3 male) collected retrospectively from the Dr. Burhan Nalbantoğlu State Hospital (BNSH) in Nicosia, North Cyprus between 2006–2015, mostly from years 2011–2015 (93%). This represented around 40% of total beast cancer cases that exist in the archives during this period. The data consists of reproductive factors, histology and biomarker information such as the ER, PR, and HER2 status. Permission was obtained from the Ministry of Health from the Turkish Republic of Cyprus for scientific use of the data. Additionally, ethical approval from the Eastern Mediterranean University Ethics Committee in Famagusta was granted to conduct the study. IRB approval number was AAAP8950. Patient consent forms were not required. Telephone interviews were made when necessary to collect information from patients to fill in the missing factor values.
For these cases, pathologists from the BNSH ascertained ER and PR status from patient tumor tissue using immunohistochemistry (IHC) and/or pathology reports using a standardized protocol and pathology reporting forms. For all cases, HER2 status available (around 290) was provided from patient medical reports. Where tumor tissue was available, pathologists used IHC testing for ER and PR, and categorized tumors as ER and PR positive if ≥ 10% of tumor cells stained positive. When the ER or PR +/- status is not labeled, but with a specific percentage, we treat it as unknown. Menopausal and other information were extracted either from the medical records (with guidance/approval from an oncologist) or by telephone interviews.
Pre-processing of data
We remove the three male samples, reducing the sample size from 324 to 321. For hormone receptor status, if the left/right breast has different value, it is labeled as NA (unknown). Also, if the hormone receptor status is not binarized but represented by a percentage, it is labeled as NA.
Other re-coding of the data include: (smoking) seldom=0, quit=1, x-number-pocket=1; (family history) first degree relatives are parents, children, and siblings; (other cancer) anything not “no” is considered as yes (including metastasis); (education) 0,1,2,3 are for no school, primary/middle school, high school, college or more; (housewife/employment) retied is considered as the same as employed; (tumor stage) “high stage” is considered as 3, inoperable is considered as 4, A/B/C are ignored; (invasive cancer) IDC(invasive ductal carcinoma)/ICC(invasive cribriform cancer)/ISC(invasive secretory cancer) are invasive, everything else is not invasive. Of the 321 samples, 300 have ER or PR information, 291 have HER2 information; all have age information, but only 214, 215, 217, 222 patients have breastfeeding, age of menarche, use of oral contraceptive, use of hormone therapy information. The amount of missing data for other factors is given in Tables 1,2. An independent variable is removed if the missing rate is too high (e.g., > 0.2). The rest of missing value is imputed from the known variable value (e.g., if x is the independent variable, two values are missing, they are replaced by (R code): sample(x[!is.na(x)][1:2]).
Turkish Cypriots and non-Turkish-Cypriots patients
Of the 321 samples, 314 of them report the birth country, where the majority are born in Cyprus (n = 233), 53 born in Turkey, and remaining 28 born in other countries, including UK, Turkmenistan, and Bulgaria. Although our analysis is not on genetic or ethnicity’s contribution to the breast cancer subtypes, and foreign-born does not automatically imply non-Turkish-Cypriots, we run all analyses twice: one on n=321 samples and another on n=233 Turkish-Cypriots only samples.
Statistical programs used
All statistical analysis either used R 3.5.1 (www.r-project.org, released July 2018) or SPSS 17.0 (released 2008, Chicago: SPSS Inv.). The Rtsne R package is used for the t-SNE analysis (github.com/jkrijthe/Rtsne), with the default parameter setting (e.g., perplexity=30, dims=2). The glmnet R package (web.stanford.edu/ hastie/glmnet) [12] is used for the following regularized regression: least absolute shrinkage and selection operator, or LASSO (alpha=1, family=”binomial”), elastic net (alpha=0.5), ridge (alpha=1). The logistic regression is carried out by the standard R function: glm(… family=binomial(link=”logit”)), and the Fisher’s test was conducted using the standard R function fisher.test. The independent two-sample t-test between age distribution, one from raw data and another from summary statistics, was conducted using our customized R script.
Results
Visual inspection of the data by t-SNE
The t-distributed stochastic neighbor embedding (t-SNE) [13] is a popular method to represent high-dimensional data in 2 or 3 dimensions. We previously use the technique in other applications in biology/genomics [14, 15].
We use 3 dependent variables (ER, PR, HER2), 5 quantitative independent variables (age of diagnosis, age of menarche, number of children, education level, cancer stage), and 10 binary independent variables (left/right breast, menopause, first relative with cancer, other cancer, smoker, hormone therapy, oral contraceptive use, breastfeeding, housewife/employment, cancer invasiveness). The quantitative variables are standardized to have zero-mean and unit-variance (z-transformation).
Due to high missing rates for age of menarche (missing rate = 33%), hormone therapy (31%), oral contraceptive use (32%), breastfeeding (33%), we only keep samples who have information on these factors. This reduced the sample size from 321 to 211 for t-SNE plot. For these 211 patients, other missing data (of much lower missing rate) are imputed.
Fig.1 shows one run of t-SNE. Because ER, PR, HER2 are part of the input, it is not surprising that their values are well partitioned in the plot (e.g., ER+ and ER-samples). It can be seen that ER+ samples tend to PR+, and HER2-, ER-samples tend to be PR- and HER2+. The 7 samples with other cancers (including metastasis) form a distinct cluster from the rest of the samples. While ER, PR, HER2 values separate in up-down direction in Fig.1, other factors, such as menopause status, breastfeeding, age, etc. seem to be separated in (not completely) orthogonal direction.
The non-Turkish-Cypriots are marked with different symbols (Turkey-born patients in circles, other foreign-country-born patients in crosses) in the top two rows of Fig.1. There is no evidence that the location of these points in the plot, or their collective features, are very different from the rest Turkish-Cypriots samples.
Distribution of patient’s factors
Table 1 shows that our cohort (n=321) is distinct from the Breast Cancer Family Registry (BCFR) samples, which are mostly of USA/Canada/Australia origin, in several demographic or reproductive factors.
The North Cyprus cohort was older in age, had lower education level and less use of oral contraceptives, and had greater number of postmenopause subjects, lesser number of subjects with young (≤ 11) age at menarche, fewer nulliparous subjects, and greater number of breast-feeding subjects. The same analysis is carried out for n=233 Turkish-Cypriots patients only, and the conclusions remain the same (data not shown).
There are two explanations for these significant differences. The first is due to cultural and customary differences between countries (e.g., use of oral contraceptives). The second explanation is that our samples are collected from the state hospital, and a higher percentage of affluent patients may select to be treated at private hospitals, or hospitals overseas. The differences remain even for ER+/PR+ subgroup, and for ER-/PR-subgroup (though less significant due to smaller sample sizes).
Within our North Cyprus cohort, when the ER+/PR+ and ER-/PR-groups are compared on these factors, only ER-/PR- is slightly younger (t-test p-value = 0.06) (Table 1). Differences in other factors are not yet significant, either truly not different, or not enough sample size to prove the difference.
There are also some other factors distributed not very differently between our cohort and BCFR, as summarized in Table 2. These include hormone therapy usage and having a first-degree relative with cancer. The rest of factors in Table 2 do not have the corresponding information in BCFR, including other cancers, smoking status, left/right breast with cancer, housewife/employed, and cancer invasiveness. Only for ER+/PR+ subtype, the North Cyprus cohort is significantly less likely to have hormone therapy than the BCFR samples.
We also examine correlation between factors. Using all breast cancer patients without considering the subtypes, these correlations are observed: (1) patients who breast feed are less likely to go through hormone therapy (OR=7.1, Fisher p-value 9 ×10−5); (2) patients who work are more likely to smoke than housewives (OR=3.1, p-value= 1.3 ×10−4); (3) patients who work are more likely to be pre-menopause than housewives (OR=2.8, p-value= 1.3 ×10−4).
Conversion of age to age-group
Age is a special factor different from others because it is a continuous variable spanning a wide range of values. Discretizing or categorizing a continuous variable is an involved topic by itself. Age is a well known target for such a categorization [16]. We partition samples into younger and older age-group by an age-threshold. When an age threshold is chosen, 2-by-2 count table can be constructed by the binary age group and binary breast cancer subtype. Age group versus breast cancer subtype association can be judged by Fisher’s test.
Fig.2 shows (top) (− log) Fisher’s p-value and (bottom) adds-ratio as a function of the age threshold for converting age to age group. For HER2, the best p-value is 0.0002 when age threshold is 42, and there is a broad range of age threshold (41-47) where the Fisher’s test is significant at 0.01 level. For ER, PR, ER/PR, this age threshold range also lead to some significant test results, indicating that patients younger than mid-40s may form a distinct group, that tend to ER, PR, ER/PR negative and HER2 positive. At age threshold 67-68, there is a second peak, indicating that patients older than that age tend to be ER, PR, ER/PR positive.
Distribution of breast cancer subtypes
We compare the hormone receptor defined subtype distribution between the North Cyprus cohort and BCFR in Table 3. The count in each of the 8 ER/PR/HER2 subtypes in n=321 set and (Cyprus-born only) n=233 set is listed in Table 3(A)). These counts are not available for BCFR cohort [17], but the distribution by ER/PR subtypes, and by HER2/(ER-PR- and not) subtypes are available which is reproduced in Table 3(B) and (C).
The Fisher’s test for these subtype groupings have been carried out: ER, PR, ER+/PR+ vs ER-/PR-, four ER/PR groups, HER2, triple-negative vs rest, and the p-values are listed in Table 3(D). The lack of significant difference in subtype distribution between North Cyprus cohort and BCFR, indicating certain similarity, is in strong contrast with the dissimilarity of many demographic and reproductive factors as shown in Table 1.
The largest difference is probably in proportion of PR subtypes (higher PR+ proportion in North Cyprus cohort than in BCFR), which may also cause a relatively larger difference for the four ER/PR groups. The highly significant correlation between ER and PR may make PR measurement redundant. In fact, it is argued that added value of PR is questionable [18]. More specifically, ER-/PR+ subtype is rare and may not be reproducible (i.e., can be reclassified to another subtype by another method) [18].
The distribution of breast cancer subtype, defined as the combination of HER2 and ER-PR- (see, e.g. Table 3(C)), is also striking similar between our cohort, the Breast Cancer Association Consortium (BCAC) in UK [19], which consists mostly Europeans/Caucasians. The subtype information is obtained from [20], and Fisher’s p-value for testing triple-negative-only proportion in the two cohorts is not significant (Table 3(D)). Testing four subtypes’ proportions (triple-negative, HER2+only, ER+ and/or PR+ and HER2+, ER+ and/or PR+ and HER2-) is also not significant.
Predictive factors for breast cancer subtypes
The comparison of factor values between ER+ or PR+ and ER- or PR-samples can also be cast into a regression of ER/PR (dependent variable) over individual factors (independent variables). Table 4 shows all results which are significant at 0.1 level from regressing ER, or PR, or HER2, over either single factor by univariate logistic regression, or all factors by a multiple logistic regression; and for either n=321 samples and for Cyprus-only n=233 samples.
Table 4 shows age being positively correlated with ER+ and PR+, but negatively correlated with HER2+. These results are similar to what is shown in Table 1 as well as Fig.2 that ER+PR+ patients are older. Due to the positive correlation between ER and PR, ER+ patients and HER2-patients are older.
Table 4 also shows that post-menopause is positively correlated with ER+ but negatively correlated with HER2+. It can be stated that menopause plays a protective role as post-menopause patients are more likely to be in the more curable ER+ subtype and less likely to have the worse prognosis ER-type [21]. The positive correlation between menopause status and age is self-explanatory, and the association between menopause status and ER or HER2 is also easily explained by the age. The last three minor conclusions are: HER2+ patients are more likely to have a first relative with cancer, PR+ patients tend to have more children, and HER2+ patients tend to be smokers.
Between univariate and multiple regression, we also applied three closely related regularized regressions, LASSO (least absolute shrinkage and selection operator), elastic net, and ridge [22], to study the situation with a few explanatory variables. Regularized regression accomplished the task of variable selection (e.g., [23, 24]) by imposing constraint on the sum of absolute value of all fitting coefficients, effectively setting many coefficients to be zero, thus removing the contribution from these variables. Fig.3 shows how the coefficient of each explanatory variable increases, from left to right, when the number of non-zero-coefficient variables increases, for (rows) the dependent variables of ER, ER/PR, HER2, and for (columns) LASSO, elastic net, ridge.
First of all, we see that LASSO, elastic net, and ridge regression show similar trend with the same breast cancer subtype. Therefore, we may only focus on LASSO plot, which are the first column in Fig.3. Secondly, in order to see how n=321 dataset may differ from the n=233 Turkish-Cypriots only dataset, we draw the LASSO results side by side in Fig.4.
For ER subtype, the dominant contribution from menopause status is not only consistently shown for both n=321 and n=233 datasets, but also consistent with Table 4. The hormone therapy is a promising signal for n=321 dataset but not so for n=233 one. For ER/PR, the cancer stage is a consistent signal which is negatively correlated with the ER+/PR+ status. This signal is also highlighted in BCFR paper [17]. Breast-feeding is a promising signal for n=321 dataset, but seem to be less important for n=233 dataset. More discussion on the benefit of breast-feeding in reducing the probability in acquiring poor prognostic breast cancers, such as triple negative subtypes, can be found in [25, 26]. For HER2 subtype, similar to univariate and multiple regression (Table 4), menopause is a dominant factor with negative correlation. Interestingly, there is a signal for n=321 dataset from first-degree-relative cancer history, but that signal becomes weaker for the n=233 dataset.
Discussion
Without control samples, we carried out a case-only analysis of potential predictive factors of different subtypes of breast cancer. Case-only design has been implemented in breast cancer studies before, and it is “an important initial step in understanding the extent of etiologic heterogeneity between tumor subtypes” [10, 11]. Since different subtypes of breast cancer have different prognostics, it is important to assess their distribution.
One of the striking results we obtained is that our North Cyprus has very similar ER+, ER+/PR+, and HER2+ as BCFR, even though our cohort is much older, more post-menopause, less educated, less hormone therapy use, more breastfeeding, etc. If older/post-menopause patients tend to have ER+, and our older cohort should have higher proportion of ER+ subtype than BCFR. Although it is indeed the case (74.5% ER+ in n=321 Cyprus dataset and 71.9% ER+ in BCFR samples), the difference on underlying factors (age or menopause status) is very significant between the two cohorts, but not significant for ER distributions.
This topic can be discussed in a general term: can correlation at one level be translated to correlation at another level? In our case, we examine the potential similarity/dissimilarity of distribution of a factor in two data sets (low-level), and wonder whether it can be translated to the similarity/dissimilarity of distribution of a subtype affected by these predictive factor (high-level) in those two datasets. In our previous investigation of a very different issue, linkage/association analysis of multiple correlated phenotypes in a lipid panel, we have observed that correlation at the high-level (phenotype) does not necessarily translate to a correlation at the low-level (linkage-disequilibrium or co-localization between genetic variants) [27].
It could also be that the causal link between the two levels are not strong enough to transfer correlation from one level to another. In our LASSO analysis (Fig.3 and 4), it can be seen that the fraction of deviation explained (range of x axis) of ER, ER/PR, HER2 is at most a few percent, even using all factors. Random forest run on the same data also show that the classification rate on ER, or ER/PR, or HER2 status is not high: on average barely over 50% (results not shown). It highlights the fact that many true predictive factors of breast cancer subtypes are not yet included in our data (e.g., body mass index (BMI)), and also the known genetic causes of breast cancer (e.g., BRCA1 and BRCA2) is not part of the analysis.
In a recent systematic meta analysis of African breast cancer subtypes [28], it is found that proportion of ER+ and PR+ samples fluctuate greatly from study to study. There are also data showing the triple-negative subtype rate is much higher in African women than European/Caucasian [29, 30]. To double check whether breast cancer subtype distribution in our North Cyprus cohort is still the same with another study, we picked a published summary statistics from a southeastern Turkish cohort [31]. The ER+, ER+/PR+, HER2+ proportions in the Turkish cohort are 73.5%, 81.8%, 30.4%, compared to our 74.5%, 75.9% and 24.1%, leading to Fisher test p-values of 0.8, 0.086, and 0.076 (number of sample in Turkish cohort with the subtype information are 438, 437, 434). These differences are within ranges, and are not significant.
It could be of great interest to compare our breast cancer subtypes statistics with a Greek Cypriots cohort. We found two surveys on breast cancer in Greek Cypriots, one with 1100 patients conducted from 1999 to 2005 [32], and another more than 4000 patients conducted from year 2005 to 2013 [33]. Unfortunately, there is no hormone receptor subtype information for the Greek Cypriots cohort in both time periods. However, we could compare the distribution of other factors when the comparable data is available. We found that for 1995-2005 period, number of children distribution is almost identical between the Greek and Turkish Cypriot cohorts; age, smoking status are not significantly different; Greek Cypriot patients have higher education level (p-value=1E-8), less family breast cancer history (p-value=5E-7), more early (less or equal 11 years old) menarche (p-value=0.04), less breastfeeding (p-value= 0.02), less use of oral contraceptive (p-value= 0.004). For 2005-2013 data, smoking status difference is not significant, and there are more invasive patients in Greek Cypriots than Turkish Cypriots (p-value=1E-14).
Our regularized regression (LASSO, elastic net and ridge) (Fig.3, 4) reveal potential predictive factors to breast cancer subtypes. However, these weak signals can only be considered to be “trend” which is not yet proven by statistical test as those in Table 4. In [34], benign disease proliferation risk is higher in ER+ subtype patients than in ER-group. This can be compared our positive contribution from other cancer (including metastasis) and ER+, or ER+/PR+ subtype (Fig.4). In [17], not breastfeeding is associated with ER-/PR-subtype, which can be compared with our result that breastfeeding is positively correlated with the ER+/PR+ subtype. In [21], ER-cancer rate stop to increase at certain age, whereas ER+ rate continue to increase. This can be compared to our result that post-menopause is is positively correlated with the ER+ subtype (Fig.4). In [35], significant difference of age, menopause status, past use of hormone therapy is observed in four ER/PR group. In [36], early age at menarche (≤ 12 years) is less common in PR-group than PR+ group, and it is also true in our data comparing ER-/PR- and ER+/PR+ groups. To summarize, many of our observed predictive factors for breast cancer subtypes are consistent with the literature. The positive correlation between cancer family history and HER2+ subtype (Fig.4) remains intriguing.
In conclusion, we use a unique cohort of breast cancer in a under-studied population to survey the breast cancer subtypes and related factors. We use a simplified analysis framework: keeping breast cancer subtypes at one level, and all factors at another level. Distribution of many factors are extremely different from that of another large breast cancer registry, while the subtype distribution is similar. This indirectly shows that we have not exhaustively measured all predictive factors of breast cancer subtypes. The relationship between the two levels is investigated by regression, with one variable, all variable, or subset of variables. These regression analyses show post-menopause and/or older breast cancer patients are more likely to have the ER+ subtype and HER2-subtype. We have also observed several other trends which should be validated in a larger cohort.
Data Availability
individual person's data is not available. summary data is available upon request.
Acknowledgements
This study was supported through the Fulbright Visiting Research Scholarship Grant by the US Department of State. AU would like to thank her advisor Prof. Mary Terry Beth and her colleagues at the BCFR at the Department of Epidemiology at Columbia University in New York for their help and support. AU would also like to thank to the Ministry of Health at the Turkish Republic of Northern Cyprus for access to the breast cancer archives at the Burhan Nalbantoglu State Hospital in Nicosia, and Dr. Nilay Acar, Dr. Mehmet Ali Alpdoğan, Dr. Fuat Ağlarcan, and Dr.Whitney A. Onuorah from the Faculty of Medicine, Eastern Mediterranean University for their help and guidance for the data collection. WT would like to thank the support from Robert S Boas Center for Genomics and Human Genetics.
Footnotes
Some errors are corrected and a subset which excludes non ethnic Turkish Cypriots is analyzed separately.