ABSTRACT
More than three hundred breast cancer patients from North Cyprus with subtype information are surveyed for demographic, reproductive, genetic, and epidemiological factors. Although our cohort differs significantly from some larger cohorts (e.g., the Breast Cancer Family Registry (BCFR), with samples from USA/Canada/Australia) in age, menopause status, age of menarche, parity, education level, oral contraceptive use, and breast feeding, the distribution of breast cancer subtypes is not significantly different. Using regularized regressions, we show that the estrogen-receptor-positive (ER+) subtype is positively associated with post-menopause and negatively associated with hormone therapy; the estrogen-receptor-positive and progesterone-receptor-positive (ER+/PR+) subtype is positively associated with breast feeding and negatively associated with hormone therapy. On the other hand, the human epidermal growth factor 2 positive (HER2+) subtype, which itself is negatively correlated with ER+ and ER+/PR+, is positively associated with having a first-degree relative with cancer and negatively associated with post-menopause. Single and multiple regression also identify older age as positively correlated with the ER+ and ER+/PR+ subtypes and negatively correlated with the HER2+ subtype. Assuming that the ER+ and ER+/PR+ subtypes have a better prognosis, post-menopause and breast feeding are beneficial, and hormone therapy is detrimental.
Introduction
Breast cancer is the most common type of cancer diagnosed in the Western part of the world. In Europe, more than 523,000 women were diagnosed with breast cancer in 2018 and more than 138,000 women died from it (Ferlay et al., 2018). World-wide, close to 2 million women are diagnosed with breast cancer each year and approximately 30% die from this disease (Bray et al., 2018). Breast cancer is largely viewed as a disease predominantly influenced by lifestyle-related risk factors (Madigan et al., 1995; McPherson et al., 2000; Martin and Weber, 2000; Key et al., 2001; Singletary, 2003; Hulka and Moorman, 2008), although twin studies of breast cancer heritability show that the genetic contribution can still be significant (Peto and Mack, 2003; Möller et al., 2016). Recent work combining the contributions of many genetic variants achieves an area under the receiver operating characteristic curve above 60% (Mavaddat et al., 2019; Shieh et al., 2019b) and explains 20% of the variance (Lee et al., 2019).
Female hormones may affect breast cancer, and hormone receptor status has been used to classify breast cancers. In particular, estrogen receptor positive (ER+) or negative (ER-) (Knight et al., 1977; Hähnel et al., 1979), progesterone receptor positive (PR+) or negative (PR-) (Osborne et al., 1980; Clark et al., 1983), and human epidermal growth factor 2 positive (HER2+) or negative (HER2-) (Wolff et al., 2007) are the major classification schemes of breast cancer subtypes. It has been shown that the ER+/-, PR+/-, and HER2+/- breast cancer subtypes have different clinical features (Richard et al., 1987; Fisher et al., 1980; Onitilo et al., 2009), that the etiology of these subtypes can be heterogeneous, and that treatment strategies also diverge. In particular, the hormone receptor-positive (ER+ or PR+) subtype can expect a good prognosis when treated with drugs like Tamoxifen/Nolvadex (Jordan, 2003; Nasrazadani et al., 2018). Similarly, the more aggressive HER2+ subtype can be treated successfully with drugs like Trastuzumab/Herceptin (Pegram et al., 1998). On the other hand, the triple-negative subtype (ER-/PR-/HER2-) remains challenging to treat (Cleator et al., 2007; Lehmann et al., 2011).
There have been international and national studies of breast cancer with large sample sizes, such as BCFR (www.bcfamilyregistry.org), GICR (gicr.iarc.fr), and BCSC (www.bcsc-research.org). However, there has never been a breast cancer survey in North Cyprus on the subtype distributions, potentially explanatory variables, and the correlation between these variables and breast cancer subtypes (though there are some studies in Turkey (Kuzhan et al., 2013; Yildiz et al., 2014; Özmen, 2014; Özmen et al., 2019)). To fill this gap, we present the first epidemiological survey of close to 300 breast cancer patients from North Cyprus.
We collected and analyzed reproductive (age of menarche, number of children (zero for nulliparity), menopause status, hormone therapy or not, oral contraceptive use or not, breast feeding or not, left or right breast with cancer), demographic (age at diagnosis, education level, housewife or employed), genetic (whether a first-degree relative has cancer), and epidemiological (smoking or not, whether the patient has other cancers) information. Most of these factors are known risk factors for breast cancer, e.g., early menarche, late menopause, nulliparity, long hormone replacement therapy, older age, and family history of breast cancer, but it is unclear which factors are predictive of breast cancer subtypes. Some information was collected but not used because the values lack diversity. For example, even though there are three male samples, the extremely imbalanced sample size makes it unlikely that useful information can be extracted. Therefore, we exclude male samples and discard the gender information. Another example is alcohol use, whose value is “No” for all samples and which is therefore not useful for the analysis.
Our analysis strategy is the following. We separate ER, PR, and HER2, the dependent variables, from the other factors, which are the independent variables. Since we do not have control (non-breast cancer) samples, it is a case-only analysis or subtypes-with-case analysis (Martínez et al., 2010; Redondo et al., 2012). First, we compare the distributions of our independent variables with those of another major public breast cancer database, and also compare the distributions of the dependent variables (i.e., breast cancer subtypes). Even without access to the raw data of that database, summary statistics (sample size, mean, and standard deviation) are enough for statistical tests. Second, the correlations between the cancer subtypes are determined. Third, univariate, multiple, and regularized logistic regressions are performed to detect any factor-subtype association, i.e., to identify potential predictive factors for breast cancer subtypes. We will show that, though there are some minor surprises, our cohort conforms with some other studies concerning predictive factors of breast cancer subtypes.
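The point that sample size, mean, and standard deviation suffice for a two-group comparison can be illustrated with a Welch two-sample t-test computed from summary statistics alone. This is a minimal sketch: the text does not state which test variant was applied, so the Welch form, the function name welch_from_summary, and every number below are assumptions chosen for illustration, not values from Table 1.

```r
# Welch two-sample t-test computed from summary statistics only.
# All numeric inputs below are hypothetical placeholders, not study values.
welch_from_summary <- function(n1, mean1, sd1, n2, mean2, sd2) {
  v1 <- sd1^2 / n1                      # squared standard error, group 1
  v2 <- sd2^2 / n2                      # squared standard error, group 2
  t_stat <- (mean1 - mean2) / sqrt(v1 + v2)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value
  list(t = t_stat, df = df, p.value = p_value)
}

# Hypothetical example: a small cohort versus a larger registry
res <- welch_from_summary(n1 = 100, mean1 = 60, sd1 = 10,
                          n2 = 400, mean2 = 55, sd2 = 12)
print(res)
```

For binary factors, where only counts are available, the same idea applies with a 2x2 table of counts passed to fisher.test.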
Results
Visual inspection of the data by t-SNE
The t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten and Hinton, 2008) is a popular method to represent high-dimensional data in 2 or 3 dimensions. Its application ranges from handwriting recognition (Van der Maaten, 2009) to single-cell expression data analysis (Kobak and Berens, 2018), to genetics/genomics (Li et al., 2017; Gaspar and Breen, 2019), and other biological topics (Hirata et al., 2019; Li et al., 2019).
We use 3 dependent variables (ER, PR, HER2), 5 quantitative independent variables (age at diagnosis, age of menarche, number of children, education level (0-3), cancer grade (1-4)), and 10 binary independent variables (left or right breast, menopause or not, first-degree relative with cancer or not, having other cancer or not, smoker or not, hormone therapy or not, oral contraceptive use or not, breast feeding or not, housewife or not, invasive cancer or not). The quantitative variables are standardized to have zero mean and unit variance (z-transformation).
Since many values are missing for age of menarche (missing rate = 33%), hormone therapy (31%), oral contraceptive use (32%), and breast feeding (33%), we only keep patients who have information on these factors. This reduces the sample size from 321 to 211. For these 211 patients, other missing data (with much lower missing rates) are imputed.
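The two preparation steps described here, keeping only patients with complete values for the high-missing-rate factors and z-transforming the quantitative variables, can be sketched in base R as follows. The data frame, its column names, and its values are hypothetical and are not taken from the study data.

```r
# Toy data frame; column names and values are hypothetical.
patients <- data.frame(
  age_dx       = c(45, 62, 58, 71, 39, 66),
  age_menarche = c(12, NA, 13, 14, 11, 13),
  hormone_tx   = c(0, 1, NA, 0, 0, 1),
  oral_contra  = c(1, 0, 0, 0, NA, 0),
  breast_feed  = c(1, 1, 1, 0, 1, 1)
)

# Keep only rows with no missing value in the high-missing-rate factors
high_missing <- c("age_menarche", "hormone_tx", "oral_contra", "breast_feed")
kept <- patients[complete.cases(patients[, high_missing]), ]

# z-transformation: subtract the mean and divide by the standard deviation
quant <- c("age_dx", "age_menarche")
kept[quant] <- lapply(kept[quant], function(v) as.numeric(scale(v)))

print(kept)
```

Standardizing after filtering, as done in this sketch, makes the retained sample itself have zero mean and unit variance; whether the original analysis standardized before or after filtering is not stated.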
Fig.1 shows one run of t-SNE (different runs would lead to different layouts of the points but similar cluster patterns). Because ER, PR, and HER2 are part of the variables used in the input, it is not surprising that their values are partitioned in the plot (e.g., ER+ and ER- samples). It can be seen that ER+ samples tend to be PR+ and HER2-, while ER- samples tend to be PR- and HER2+. The 7 samples with other cancers (including metastasis) form a distinct cluster from the rest of the samples. While ER, PR, and HER2 values separate in the up-down direction in Fig.1, other factors, such as menopause status, breast feeding, and age, seem to be separated in a (not completely) orthogonal direction.
Distribution of patient’s factors
Table 1 shows that our cohort is distinct from the Breast Cancer Family Registry (BCFR) samples, which are mostly of USA/Canada/Australia origin, in several demographic or reproductive factors. The North Cyprus cohort is older, has more post-menopausal patients, fewer patients with young (≤ 11) age at menarche, fewer nulliparous patients, a lower education level, less use of oral contraceptives, and more breast feeding. There are two possible explanations for these significant differences. The first is cultural and customary differences between countries (e.g., use of oral contraceptives). The second is that our samples were collected from the state hospital, and a higher percentage of well-to-do patients may choose to be treated at private hospitals or hospitals overseas. The differences remain within the ER+/PR+ subgroup and within the ER-/PR- subgroup (though less significant due to smaller sample sizes).
Within our North Cyprus cohort, when the ER+/PR+ and ER-/PR- groups are compared on these factors, only the education level is significantly different (ER-/PR- patients are less educated), and ER-/PR- patients are slightly younger (t-test p-value = 0.07) (Table 1). Differences in other factors could become significant with a larger sample size (e.g., there is a trend that ER-/PR- patients breast feed less, are more likely to have had their first menstruation at an age older than 13, and are more likely to be pre-menopausal), but they are not significant with our limited sample size.
On the other hand, some other demographic and other factors are not very different between our cohort and BCFR, as summarized in Table 2. These include hormone therapy usage, having a first-degree relative with cancer, and tumor grade. We also list factors which do not have the corresponding information in BCFR: having other cancers, smoking status, left or right breast with cancer, housewife or employed, invasive cancer or not. Only for ER+/PR+ subtype, the North Cyprus cohort is significantly less likely to have hormone therapy than the BCFR samples.
We also examine correlation between factors. Using all breast cancer patients without considering the subtypes, these correlations are observed: (1) patients who breast feed are less likely to go through hormone therapy (OR=7.1, Fisher p-value = 9 × 10^-5); (2) patients who work are more likely to smoke than housewives (OR=3.1, p-value = 1.3 × 10^-4); (3) patients who work are more likely to be pre-menopause than housewives (OR=2.8, p-value = 1.3 × 10^-4).
Distribution of breast cancer subtypes
There are n=290 samples with all ER, PR, HER2 subtype information available. The distribution of ER, PR, and HER2 values of these samples is listed in Table 3, together with the marginal counts. ER and PR are strongly positively correlated (Fisher p-value = 1.6 × 10^-35, Odds-Ratio (OR) = 63). The majority of samples (68%, 198/290) are ER+/PR+. ER and HER2 are negatively correlated (Fisher p-value = 0.018, OR=0.47). PR and HER2 are also negatively correlated (Fisher p-value = 2.3 × 10^-4, OR=0.34). There are 40 triple-negative patients (ER-/PR-/HER2-), or 13.8% of the total.
If we ignore HER2 status, there are four ER/PR groups with the following distribution: ER+/PR+ n=198 (68.3%), ER+/PR- n=18 (6.2%), ER-/PR+ n=11 (3.8%), ER-/PR- n=63 (21.7%). This distribution can be compared to that from the Breast Cancer Family Registry (Work et al., 2014): ER+/PR+ n=2486 (62%), ER+/PR- n=397 (9.9%), ER-/PR+ n=208 (5.2%), ER-/PR- n=920 (22.9%). A Fisher’s test comparing these two distributions has a p-value of 0.08, which is not significant at the 0.05 or 0.01 level.
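The ER and PR association can be reproduced from the four North Cyprus group counts given above. The sketch below builds the 2x2 table, computes the cross-product odds ratio, and runs fisher.test; the variable names are illustrative. Note that fisher.test reports a conditional maximum-likelihood estimate of the odds ratio, which can differ slightly from the cross-product value.

```r
# ER x PR counts as reported in the text (n = 290).
er_pr <- matrix(c(198, 18,
                   11, 63),
                nrow = 2, byrow = TRUE,
                dimnames = list(ER = c("ER+", "ER-"), PR = c("PR+", "PR-")))

# Sample (cross-product) odds ratio: (198 * 63) / (18 * 11)
sample_or <- (er_pr["ER+", "PR+"] * er_pr["ER-", "PR-"]) /
             (er_pr["ER+", "PR-"] * er_pr["ER-", "PR+"])

# Fisher's exact test of independence between ER and PR status
ft <- fisher.test(er_pr)

print(sample_or)
print(ft$p.value)
print(ft$estimate)   # conditional MLE of the odds ratio
```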
The highly significant correlation between ER and PR may make the PR measurement redundant. In fact, it has been argued that the added value of PR is questionable (Hefti et al., 2013). More specifically, the ER-/PR+ subtype is rare and may not be reproducible (i.e., it can be reclassified to another subtype by another method) (Hefti et al., 2013). If we ignore PR and HER2, the ER- and ER+ frequencies in North Cyprus (25.5% and 74.5%) are not significantly different from those in the Breast Cancer Family Registry (28.1% and 71.9%) (Fisher test p-value is 0.38).
If we ignore the ER+/PR- and ER-/PR+ subtypes, and only compare the ER+/PR+ and ER-/PR- subtype frequencies between North Cyprus (75.9% and 24.1%) and BCFR (73.0% and 27.0%), the Fisher test p-value is 0.35. If we compare the HER2+ and HER2- frequencies in the two groups, those in North Cyprus are 24.1% and 75.9%, and those in BCFR are 23.4% and 76.6%, again with no significant difference (Fisher p-value is 0.81).
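The ER-only and concordant-group comparisons with BCFR can be set up directly from the ER/PR group counts quoted earlier for the two cohorts. The following is a sketch with illustrative variable names; it collapses the four groups into 2x2 tables and applies fisher.test to each. The HER2 comparison is omitted because the BCFR HER2 counts are not given in the text.

```r
# ER/PR group counts from the text: North Cyprus (n = 290) and BCFR (Work et al., 2014).
nc   <- c("ER+/PR+" = 198,  "ER+/PR-" = 18,  "ER-/PR+" = 11,  "ER-/PR-" = 63)
bcfr <- c("ER+/PR+" = 2486, "ER+/PR-" = 397, "ER-/PR+" = 208, "ER-/PR-" = 920)

er_pos <- c("ER+/PR+", "ER+/PR-")
er_neg <- c("ER-/PR+", "ER-/PR-")

# ER+ versus ER-, ignoring PR
er_tab <- rbind(NorthCyprus = c(pos = sum(nc[er_pos]),   neg = sum(nc[er_neg])),
                BCFR        = c(pos = sum(bcfr[er_pos]), neg = sum(bcfr[er_neg])))

# ER+/PR+ versus ER-/PR-, dropping the two discordant groups
conc_tab <- rbind(NorthCyprus = nc[c("ER+/PR+", "ER-/PR-")],
                  BCFR        = bcfr[c("ER+/PR+", "ER-/PR-")])

p_er   <- fisher.test(er_tab)$p.value
p_conc <- fisher.test(conc_tab)$p.value

# Row percentages, for comparison with the frequencies quoted in the text
print(round(100 * prop.table(er_tab, margin = 1), 1))
print(round(100 * prop.table(conc_tab, margin = 1), 1))
print(c(ER = p_er, concordant = p_conc))
```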
The similarity between breast cancer subtype frequencies in North Cyprus and BCFR is in strong contrast with the dissimilarity of many demographic and reproductive factors.
The breast cancer subtype distribution in our cohort is also strikingly similar to that of another mostly European/Caucasian database: the Breast Cancer Association Consortium (BCAC) in UK (Breast Cancer Association Consortium, 2006). From Table 3, the luminal A/B (or simply hormone receptor positive; the difference between A and B is determined by a lower/higher protein level of Ki-67, whose information is not included in the table), HER2+ and hormone receptor positive (or luminal HER2), HER2+ only, and triple-negative subtypes have frequencies of 62%, 16%, 8%, and 14%, compared to 66%, 13%, 7%, and 13% in BCAC (Brouckaert et al., 2017), with a Fisher p-value of 0.44. Due to the limited information provided in (Brouckaert et al., 2017), we cannot carry out a more systematic comparison of factors. But we do observe some differences: e.g., the percentage of patients without children is around 10% in our cohort, but more than 15% in BCAC (Brouckaert et al., 2017).
Predictive factors for breast cancer subtypes
The comparison of factor values between ER+/PR+ and ER-/PR- samples can also be cast as a regression of ER/PR (dependent variable) on individual factors (independent variables). Table 4 shows all results that are significant at the 0.1 level from regressing ER/PR, ER, or HER2 on either a single factor by univariate logistic regression, or all factors by multiple logistic regression.
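The equivalence between a two-group comparison and a regression can be seen in a minimal univariate logistic regression with glm. The 2x2 counts below are hypothetical and are NOT taken from the study; they are chosen so that the cross-product odds ratio is exactly 2.5, which the fitted coefficient should reproduce after exponentiation.

```r
# Hypothetical 2x2 counts (NOT from the study), expanded to one row per patient.
#                  ER+   ER-
# post-menopause   150    40
# pre-menopause     60    40
toy <- data.frame(
  postmeno = rep(c(1, 1, 0, 0), times = c(150, 40, 60, 40)),
  er_pos   = rep(c(1, 0, 1, 0), times = c(150, 40, 60, 40))
)

# Univariate logistic regression of ER status on menopause status
fit <- glm(er_pos ~ postmeno, data = toy, family = binomial(link = "logit"))

# For a single binary predictor, exp(coefficient) equals the sample odds ratio:
# (150 / 40) / (60 / 40) = 2.5
or_hat <- exp(coef(fit)[["postmeno"]])

print(summary(fit)$coefficients)
print(or_hat)
```

A multiple logistic regression follows the same pattern with more terms on the right-hand side of the formula.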
Table 4 shows that age is positively correlated with ER+/PR+ and ER+, but negatively correlated with HER2+. These results are consistent with Table 1, where ER+/PR+ patients are older. Because ER is positively correlated with PR and negatively correlated with HER2, ER+ patients and HER2- patients are also older.
Table 4 also shows that post-menopause is positively correlated with ER+/PR+ and ER+, but negatively correlated with HER2+. Since post-menopause is associated with the more treatable ER+ type, menopause has been described as protective in the sense of lowering the chance of the ER- type (Tarone and Chu, 2002). The positive correlation between menopause status and age is self-explanatory, and the association between menopause status and ER or HER2 may also be explained by age. Finally, Table 4 shows that HER2+ patients are more likely to have a first-degree relative with cancer.
As an intermediate between univariate and multiple regression, we also applied a regularized regression, LASSO (least absolute shrinkage and selection operator) (Tibshirani, 1996), to study the situation with a few explanatory variables. LASSO accomplishes the task of variable selection (e.g., (Halinski and Feldt, 1970; Li and Yang, 2002)) by imposing a constraint on the sum of the absolute values of all fitting coefficients, effectively setting many coefficients to zero and thus removing the contribution from these variables. Fig.2 shows how the coefficient of each explanatory variable changes, from left to right, as the number of variables with non-zero coefficients increases, for the dependent variables ER, ER/PR, and HER2.
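For reference, the constraint on the sum of absolute coefficient values can be written in its penalized form for logistic regression, which is the objective minimized by glmnet with alpha=1 and a binomial family. The symbols are introduced here for illustration and do not appear in the original text: y_i is the 0/1 subtype status of patient i, x_i the vector of p factors, N the number of patients, and λ ≥ 0 the regularization strength.

```latex
\hat{\beta}(\lambda) \;=\; \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}
     \Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
            - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

A large λ forces all β_j to zero; as λ decreases, variables enter the model one after another, which is what a coefficient-path plot displays.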
For the ER subtype, Fig.2(A) shows the dominant contribution from menopause status, consistent with Table 4. Fig.2(A) also shows that, after menopause, hormone therapy is the second most important contributing factor, though its contribution is negative. Fig.2(B) shows that breast-feeding contributes positively to the ER+/PR+ status, a result not prominent in the single-variable and all-variable regressions, though the trend can already be seen in Table 1. This result is consistent with reports in the literature that breast-feeding is beneficial in reducing the probability of acquiring breast cancers with poor prognosis, such as the triple-negative subtype (Islami et al., 2015; Fortner et al., 2019). Fig.2(C) confirms the positive contribution from first-degree-relative cancer history and the negative contribution from menopause status to HER2+, with the latter result already seen in Table 4.
Materials and Methods
Sample collection
We included 324 samples collected retrospectively from the Dr. Burhan Nalbantoǧlu State Hospital (BNSH) in Nicosia, North Cyprus, between 2006 and 2015, with the majority from the years 2011-2015 (93%). This represents around 40% of the total breast cancer cases in the archives for this period. The data consist of reproductive factors, histology, and biomarker information such as the estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor 2 (HER2) status. Permission was obtained from the Ministry of Health of the Turkish Republic of Northern Cyprus for scientific use of the data. Additionally, ethical approval to conduct the study was granted by the Eastern Mediterranean University (EMU) Ethics Committee in Famagusta. Telephone interviews were conducted when necessary to collect information from patients to fill in missing factor values.
Tumor marker data collection
For the 324 cases, pathologists from the BNSH ascertained ER and PR status from patient tumor tissue using immunohistochemistry (IHC) and/or pathology reports, using a standardized protocol and pathology reporting forms. HER2 status, where available (300 cases), was taken from patient medical reports. Where tumor tissue was available, pathologists used IHC testing for ER and PR, and categorized tumors as ER and PR positive if ≥ 10% of tumor cells stained positive. When the ER or PR status is given as a specific percentage rather than a +/- label, we treat it as unknown. When the left and right breast are labeled with different ER, PR, or HER2 status, it is treated as unknown. Menopausal and other information was extracted either from the medical records (with guidance/approval from an oncologist) or by telephone interviews.
Pre-processing of data
We remove the three male samples, reducing the sample size from 324 to 321. For hormone receptor status, if the left and right breast have different values, it is labeled as NA (unknown). Also, if the hormone receptor status is not binarized but represented by a percentage, it is labeled as NA.
Other re-coding of the data includes: (smoking) seldom=0, quit=1, x-number-pocket=1; (menopause status) “not clear” is considered as unknown; (family history) first degree relatives are parents, children, and siblings; (other cancer) anything not “no” is considered as yes (including metastasis); (education) 0,1,2,3 are for no school, primary/middle school, high school, college or more; (housewife/employment) “did not work” is considered as housewife, retired is considered the same as employed; (tumor grade) “high grade” is considered as 3, inoperable is considered as 4, A/B/C are ignored; (invasive cancer) IDC (invasive ductal carcinoma)/ICC (invasive cribriform cancer)/ISC (invasive secretory cancer) are invasive, everything else is not invasive.
For regression analysis, when a sub-table of the whole dataset is created, we always exclude samples for which the dependent variable is unknown. An independent variable is removed if its missing rate is too high (e.g., > 0.2). For an independent variable with a low missing rate, the missing values are imputed from the known values of that variable (e.g., if x is the independent variable and two values are missing, they are replaced by (R code): sample(x[!is.na(x)][1:2])).
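One way to read this imputation step is that each missing value is replaced by a value drawn at random from the observed values of the same variable. The sketch below implements that reading in base R; the helper name impute_from_observed, drawing with replacement, the seed, and the toy vector are all assumptions for illustration and may differ from what the one-line snippet above does in detail.

```r
# Replace each NA in x by a value drawn at random from the observed entries of x.
# Hypothetical helper, written for illustration only.
impute_from_observed <- function(x) {
  miss <- is.na(x)
  obs  <- x[!miss]
  if (any(miss) && length(obs) > 0) {
    # Index via sample.int() so that a single observed number is not
    # interpreted by sample() as the range 1:n
    x[miss] <- obs[sample.int(length(obs), size = sum(miss), replace = TRUE)]
  }
  x
}

set.seed(1)                      # arbitrary seed, only for a repeatable illustration
x <- c(0, 1, NA, 1, 1, NA, 0)    # hypothetical binary factor with two missing values
x_imp <- impute_from_observed(x)
print(x_imp)
```

Because the imputed values are random draws, results can vary slightly between runs unless a seed is fixed.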
Statistical programs used: All statistical analyses used either R 3.5.1 (www.r-project.org, released July 2018) or SPSS 17.0 (released 2008, Chicago: SPSS Inc.). The Rtsne R package is used for the t-SNE analysis (github.com/jkrijthe/Rtsne), with the default parameter setting (e.g., perplexity=30, dims=2). The glmnet R package (Friedman et al., 2010) is used for the LASSO analysis (alpha=1, family="binomial"). The logistic regression is carried out with the standard R function glm(… family=binomial(link="logit")), and the Fisher’s test with the R function fisher.test.
Discussion
Without control samples, we carried out a case-only analysis of potential predictive factors of different subtypes of breast cancer. The case-only design has been implemented in breast cancer studies before, and it is “an important initial step in understanding the extent of etiologic heterogeneity between tumor subtypes” (Martínez et al., 2010; Redondo et al., 2012). Since different subtypes of breast cancer have different prognoses, it is important to assess their distribution.
One of the striking results we obtained is that our North Cyprus cohort has ER+, ER+/PR+, and HER2+ frequencies very similar to those of BCFR, even though our cohort is much older, has more post-menopausal patients, is less educated, uses less hormone therapy, breast feeds more, etc. Since we also show a correlation between menopause status and ER subtype, it might seem paradoxical that the higher proportion of post-menopause samples in our data does not lead to a significantly higher ER+ frequency. In fact, the ER+ proportion (74.5%) is indeed higher than that in BCFR (71.9%), but the difference is not statistically significant.
This topic can be discussed in general terms: can correlation at one level be translated to correlation at another level? In our case, we examine the similarity/dissimilarity of the distribution of a factor in two datasets (low level), and ask whether it translates to similarity/dissimilarity of the distribution of a subtype affected by these predictive factors (high level) in those two datasets. In our previous investigation of a very different issue, we did observe that correlation may not be transferable from one level to another. The example is genetic linkage/association analysis of multiple correlated phenotypes (Ulgen et al., 2003). One might guess that simply because these phenotypes are correlated, their risk genes should be located in the same chromosome regions. But our work on traits/phenotypes in a typical lipid panel shows that genetic linkage results are not necessarily correlated even though the phenotypes are (Ulgen et al., 2003).
Another explanation is that the causal link between the two levels is not strong enough to transfer correlation from one level to the other. In our LASSO analysis (Fig.2), it can be seen that the fraction of deviance explained (range of the x axis) for ER, ER/PR, and HER2 is at most a few percent, even using all factors. A random forest run on the same data also shows that the classification rate for ER, ER/PR, or HER2 status is not high: on average barely over 50% (results not shown). This highlights the fact that many true predictive factors of breast cancer subtypes are not yet included in our data, and that the known genetic causes of breast cancer are not part of the analysis.
In a recent systematic meta-analysis of African breast cancer subtypes (Eng et al., 2014), it was found that the proportions of ER+ and PR+ samples fluctuate greatly from study to study. There are also data showing that the triple-negative subtype rate is much higher in African women than in European/Caucasian women (Huo et al., 2009; Zheng et al., 2018). To double check whether the breast cancer subtype distribution in our North Cyprus cohort is also similar to that of another study, we picked published summary statistics from a southeastern Turkish cohort (Kuzhan et al., 2013). The ER+, ER+/PR+, and HER2+ proportions in the Turkish cohort are 73.5%, 81.8%, and 30.4%, compared to our 74.5%, 75.9%, and 24.1%, leading to Fisher test p-values of 0.8, 0.086, and 0.076 (the numbers of samples in the Turkish cohort with the subtype information are 438, 437, and 434). These differences are within the range of study-to-study variation and are not significant at the 0.05 level.
Our regularized regression (LASSO) (Fig.2) reveals several potential predictive factors for breast cancer subtypes. In order to compare our results with other studies, we note that (1) since we do not have normal samples, a risk factor with positive correlation means that its larger value may lead to a higher risk for one breast cancer subtype vs. another, not breast cancer vs. cancer-free (Key et al., 2001); (2) some predictive factors for breast cancer subtype, e.g., body-mass-index (BMI) (Phipps et al., 2011) and mammographic density (Shieh et al., 2019a), are not included in our survey; (3) our analysis treats breast cancer subtype as the dependent variable and all other factors as independent variables in a regression framework, whereas some studies are stratified with some factors fixed.
In (Kerlikowske et al., 2016), benign disease proliferation risk is higher in ER+ subtype patients than in the ER- group. This can be compared with the positive contribution of other cancer (including metastasis) to the ER+ or ER+/PR+ subtype in our data (Fig.2(A,B)). In (Work et al., 2014), not breast feeding is associated with the ER-/PR- subtype, which can be compared with our result that breast feeding is positively correlated with the ER+/PR+ subtype. In (Tarone and Chu, 2002), the ER- cancer rate stops increasing at a certain age, whereas the ER+ rate continues to increase. This can be compared to our result that post-menopause is positively correlated with the ER+ subtype (Fig.2(A)). In (Colditz et al., 2004), significant differences in age, menopause status, and past use of hormone therapy are observed among the four ER/PR groups. In (Yang et al., 2010), early age at menarche (≤ 12 years) is less common in the PR- group than in the PR+ group, and this is also true in our data when comparing the ER-/PR- and ER+/PR+ groups. To summarize, many of our observed predictive factors for breast cancer subtypes are consistent with the literature. The positive correlation between cancer family history and the HER2+ subtype (Fig.2(C)) remains intriguing.
In conclusion, we use a unique cohort of breast cancer patients in an under-studied population to survey breast cancer subtypes and related factors. We use a simplified analysis framework: keeping breast cancer subtypes at one level, and all factors at another level. The distributions of many factors are extremely different from those of another large breast cancer registry, while the subtype distribution is similar. This indirectly shows that we have not exhaustively measured all predictive factors of breast cancer subtypes. The relationship between the two levels is investigated by regression, with one variable, all variables, or a subset of variables. These regression analyses show that post-menopause and/or older breast cancer patients are more likely to have the ER+ subtype and the HER2- subtype; hormone therapy is positively correlated with the ER- or ER-/PR- subtype; breast-feeding and/or older breast cancer patients are more likely to have the ER+/PR+ subtype; and family history is observed more frequently in the HER2+ subtype.
Data Availability
Individual-level patient data are not available. Summary data are available upon request.
Acknowledgements
This study was supported through the Fulbright Visiting Research Scholarship Grant by the US Department of State. AU would like to thank her advisor Prof. Mary Terry Beth and her colleagues at the BCFR at the Department of Epidemiology at Columbia University in New York for their help and support. AU would also like to thank the Ministry of Health of the Turkish Republic of Northern Cyprus for access to the breast cancer archives at the Burhan Nalbantoglu State Hospital in Nicosia, and Dr. Nilay Acar, Dr. Mehmet Ali Alpdoǧan, Dr. Fuat Aǧlarcan, and Dr. Whitney A. Onuorah from the Faculty of Medicine, Eastern Mediterranean University, for their help and guidance with the data collection. WT would like to thank the Robert S Boas Center for Genomics and Human Genetics for its support.
Footnotes
† wli{at}northwell.edu