Potential Predictive Factors for Breast Cancer Subtypes from a North Cyprus Cohort Analysis

We present a first epidemiological survey from North Cyprus to determine predictive factors for breast cancer subtypes. More than 300 breast cancer patients, 90% of them with subtype information, were surveyed at the State Hospital in Nicosia between 2006 and 2015 for their demographic, reproductive, genetic, and epidemiological factors. The breast cancer subtypes, i.e., estrogen receptor (ER) +/-, progesterone receptor (PR) +/-, and human epidermal growth factor receptor 2 (HER2) +/- status, were determined. Single-variable, multiple-variable, and regularized regressions were conducted, with predictive factors as independent variables and breast cancer subtypes as dependent variables. Our cohort differs significantly from larger cohorts (e.g., the Breast Cancer Family Registry) in age, menopause status, age of menarche, parity, education, oral contraceptive use, and breastfeeding, but the distribution of breast cancer subtypes is not significantly different. The subtype distribution in our cohort is also not different from that of another Turkish cohort. We show that the ER+ subtype is positively related to age/post-menopause; ER+/PR+ is positively associated with age, but negatively associated with cancer stage; HER2+, which is negatively correlated with ER+ and ER+/PR+, is positively related to cancer stage but negatively associated with age/post-menopause. Assuming that ER+ and ER+/PR+ have a better prognosis and HER2+ a worse prognosis, older age and post-menopause seem to be beneficial, whereas smoking and family history of cancer seem to be detrimental. Next steps include looking at potential biomarkers and using cure models to determine long-term survivors.


Introduction
Breast cancer is the most common type of cancer diagnosed in the Western part of the world.
In Europe, more than 523,000 women were diagnosed with breast cancer in 2018 and more than 138,000 women died from it (Ferlay et al., 2018). Worldwide, close to 2 million women are diagnosed with breast cancer each year and approximately 30% die from this disease (Bray et al., 2018). Breast cancer is largely viewed as a disease predominantly influenced by lifestyle-related risk factors (Madigan et al., 1995; McPherson et al., 2000; Martin and Weber, 2000; Key et al., 2001; Singletary, 2003; Hulka and Moorman, 2008), though twin studies of the heritability of breast cancer indicate that the genetic contribution can still be significant (Peto and Mack, 2003; Möller et al., 2016). Recent work combining the contributions of many genetic variants to breast cancer achieves an area under the receiver operating characteristic curve above 60% (Mavaddat et al., 2019; Shieh et al., 2019b), and 20% variance explained (Lee et al.,
However, there has never been a breast cancer survey on the subtype distributions, potential explanatory variables, and correlation between these variables and breast cancer subtypes in North Cyprus (though there are some studies in Turkey (Kuzhan et al., 2013; Yildiz et al., 2014; Özmen, 2014; Özmen et al., 2019)). To fill this gap, we present a first epidemiological survey of close to 300 breast cancer patients from North Cyprus.
medRxiv preprint doi: https://doi.org/10.1101/19010181 (which was not peer-reviewed). The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
We collected and analyzed reproductive (age of menarche, number of children (zero for nulliparity), menopause status, hormone therapy or not, oral contraceptive use or not, breast feeding or not, left or right breast with cancer), demographic (age at diagnosis, education level, housewife or employed), genetic (whether a first-degree relative has cancer), and epidemiological (smoking or not, whether the patient has other cancers) information. Most of these factors are known risk factors for breast cancer, e.g., early menarche, late menopause, nulliparity, long hormone replacement therapy, older age, and family history of breast cancer, but it is unclear which factors are predictive of breast cancer subtypes. Some information was collected but not used because its values lack diversity. For example, even though there are three male samples, the extremely imbalanced sample size makes it unlikely that useful information can be extracted. Therefore, we exclude the male samples and discard the gender information. Another example is alcohol use, whose value is "No" for all samples, so it would not be useful for the analysis either.
Our analysis strategy is the following. We separate ER, PR, and HER2, the dependent variables, from the other factors, which are the independent variables. Since we do not have control (non-breast cancer) samples, it is a case-only analysis or subtypes-with-case analysis (Martínez et al., 2010; Redondo et al., 2012). The first analysis compares the distributions of our independent variables with those in another major public breast cancer database, and likewise compares the distributions of the dependent variables (i.e., breast cancer subtypes). Even though we do not have access to the raw data from that database, summary statistics (sample size, mean, standard deviation) are enough for statistical tests. Second, the correlations between the cancer subtypes are determined. Third, univariate, multiple, and regularized logistic regressions are performed to detect any factor-subtype association, i.e., to identify potential predictive factors for breast cancer subtypes. We will show that, though there are some minor surprises, our cohort conforms with some other studies concerning predictive factors of breast cancer subtypes.
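The point that summary statistics suffice for a two-group comparison can be illustrated with a minimal sketch. It is not the authors' code: the function name and the input numbers are hypothetical, the test shown is a Welch two-sample t-test, and the p-value uses a normal approximation that is only reasonable for large samples.

```python
import math
from statistics import NormalDist


def welch_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Welch two-sample t statistic computed from summary statistics only.

    Returns (t, degrees_of_freedom, approximate two-sided p-value).
    The p-value uses a normal approximation to the t distribution,
    an assumption that is adequate only when both samples are large.
    """
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, df, p


# Hypothetical summary statistics (NOT values taken from the paper's tables)
t, df, p = welch_from_summary(300, 52.0, 12.0, 5000, 50.0, 11.0)
print(f"t = {t:.2f}, df = {df:.1f}, approx. p = {p:.3g}")
```

For small samples, the normal approximation should be replaced by the t distribution with the computed degrees of freedom.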

Results
Visual inspection of the data by t-SNE: The t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten and Hinton, 2008) is a popular method to represent high-dimensional data in 2 or 3 dimensions. Its applications range from handwriting recognition (Van der Maaten, 2009) to single-cell expression data analysis (Kobak and Berens, 2018), to genetics/genomics (Li et al., 2017; Gaspar and Breen, 2019), and other biological topics (Hirata et al., 2019; Li et al., 2019).
We use 3 dependent variables (ER, PR, HER2), 5 quantitative independent variables (age of diagnosis, age of menarche, number of children, education level (0-3), cancer grade (1-4)), and 10 binary independent variables (left or right breast, menopause or not, first relative with cancer or not, having other cancer or not, smoker or not, hormone therapy or not, oral contraceptive use or not, breast feeding or not, housewife or not, invasive cancer or not). The quantitative variables are standardized to have zero-mean and unit-variance (z-transformation).
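The z-transformation mentioned above can be sketched as follows. This is an illustrative assumption about the preprocessing, not the authors' code: the function name and example ages are hypothetical, and the choice of the population standard deviation (rather than the sample one) is arbitrary.

```python
from statistics import mean, pstdev


def z_transform(values):
    """Standardize a list of numbers to zero mean and unit variance.

    Missing entries (None) are skipped when estimating the mean and
    standard deviation and are passed through unchanged.
    """
    known = [v for v in values if v is not None]
    mu = mean(known)
    sigma = pstdev(known)  # population SD; the sample SD is an equally valid choice
    if sigma == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [None if v is None else (v - mu) / sigma for v in values]


# Hypothetical ages at diagnosis, with one missing value
ages = [42, 55, 61, None, 70, 48]
z = z_transform(ages)
print(z)
```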
Since there are many missing data for age of menarche (missing rate = 33%), hormone therapy (31%), oral contraceptive use (32%), and breast feeding (33%), we only keep samples that have information on these factors. This reduces the sample size from 321 to 211. For these 211 patients, other missing data (with much lower missing rates) are imputed. Fig.1 shows one run of t-SNE (different runs would lead to a different layout of the points but similar cluster patterns). Because ER, PR, and HER2 are among the input variables, it is not surprising that their values are partitioned in the plot (e.g., ER+ and ER- samples).
It can be seen that ER+ samples tend to be PR+ and HER2-, whereas ER- samples tend to be PR- and HER2+. The 7 samples with other cancers (including metastasis) form a cluster distinct from the rest of the samples. While ER, PR, and HER2 values separate in the up-down direction in Fig.1, other factors, such as menopause status, breast feeding, and age, seem to be separated in a (not completely) orthogonal direction.
Distribution of patient's factors: Table 1 shows that our cohort is distinct from the Breast Cancer Family Registry (BCFR) samples, which are mostly of USA/Canada/Australia origin, in several demographic or reproductive factors. The North Cyprus cohort is older, more often post-menopausal, less often had menarche at a young age (≤ 11), is less often nulliparous, is less educated, uses oral contraceptives less, and breast feeds more. There are two possible explanations for these significant differences. The first is cultural and customary differences between countries (e.g., in the use of oral contraceptives). The second is that our samples were collected from the state hospital, and a higher percentage of well-to-do patients may choose to be treated at private hospitals or at hospitals overseas. The differences remain even within the ER+/PR+ subgroup and within the ER-/PR- subgroup (though they are less significant due to smaller sample sizes).
Within our North Cyprus cohort, when the ER+/PR+ and ER-/PR- groups are compared on these factors, only the education level is significantly different (ER-/PR- patients are less educated), and the ER-/PR- group is slightly younger (t-test p-value = 0.07) (Table 1). Differences in other factors could become significant with a larger sample size (e.g., there is a trend that ER-/PR- patients breast feed less, are more likely to have had their first menstruation at an age older than 13, and are more likely to be pre-menopausal), but they are not significant with our limited sample size.
On the other hand, some other demographic and other factors are not very different between our cohort and BCFR, as summarized in Table 2. These include hormone therapy usage, having a first-degree relative with cancer, and tumor grade. We also list factors which do not have the corresponding information in BCFR: having other cancers, smoking status, left or right breast with cancer, housewife or employed, invasive cancer or not. Only for ER+/PR+ subtype, the North Cyprus cohort is significantly less likely to have hormone therapy than the BCFR samples.
We also examine the correlation between factors. Using all breast cancer patients without considering the subtypes, these correlations are observed: (1) patients who breast feed are less likely to go through hormone therapy (OR=7.1, Fisher p-value = 9 × 10^-5); (2) patients who work are more likely to smoke than housewives (OR=3.1, p-value = 1.3 × 10^-4); (3) patients who work are more likely to be pre-menopausal than housewives (OR=2.8, p-value = 1.3 × 10^-4).
Distribution of breast cancer subtypes: There are n=290 samples with all of ER, PR, and HER2 subtype information available. The distribution of ER, PR, and HER2 values of these samples, together with their marginal counts, is listed in Table 3. ER and PR are strongly positively correlated (Fisher p-value = 1.6 × 10^-35, odds ratio (OR) = 63). The majority of samples (68%, 198/290) are ER+/PR+. ER and HER2 are negatively correlated (Fisher p-value = 0.018, OR=0.47). PR and HER2 are also negatively correlated (Fisher p-value = 2.3 × 10^-4, OR=0.34). There are 40 triple-negative patients (ER-/PR-/HER2-), or 13.8% of the total.
The highly significant correlation between ER and PR may make PR measurement redundant. In fact, it is argued that added value of PR is questionable (Hefti et al., 2013). More specifically, ER-/PR+ subtype is rare and may not be reproducible (i.e., can be reclassified to another subtype by another method) (Hefti et al., 2013). If we ignore PR and HER2, the ER-, ER+ frequencies in North Cyprus (25.5% and 74.5%) are not significantly different from those in the Breast Cancer Family Registry (28.1% and 71.9%) (Fisher test p-value is 0.38).
If we ignore the ER+/PR- and ER-/PR+ subtypes, and only compare the ER+/PR+ and ER-/PR- subtype frequencies between North Cyprus (75.9% and 24.1%) and BCFR (73.0% and 27.0%), the Fisher test p-value is 0.35. If we compare the HER2+ and HER2- frequencies in the two groups, they are 24.1% and 75.9% in North Cyprus and 23.4% and 76.6% in BCFR, again with no significant difference (Fisher p-value is 0.81).
The similarity between breast cancer subtype frequencies in North Cyprus and BCFR is in strong contrast with the dissimilarity of many demographic and reproductive factors.
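The Fisher tests and odds ratios used throughout this section operate on 2×2 count tables. A minimal sketch in pure Python is given here for illustration; it is not the authors' code, the function names are hypothetical, and the counts in the usage example are hypothetical as well. Note that some statistical packages report a conditional maximum-likelihood odds ratio, which can differ slightly from the sample odds ratio computed here.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) of the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # probability that the top-left cell equals x, given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # a small relative tolerance guards against floating-point ties
    p = sum(q for q in probs if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Hypothetical counts (rows: two cohorts; columns: ER+, ER-)
table = (216, 74, 700, 300)
print(odds_ratio(*table), fisher_exact_two_sided(*table))
```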
The breast cancer subtype distribution in our cohort is also strikingly similar to another
Table 4 also shows that post-menopause is positively correlated with ER+/PR+ and ER+, but negatively correlated with HER2+. Since post-menopause implies a more curable ER+ type, it is said that menopause plays a protective role by lowering the chance of the ER- type (Tarone and Chu, 2002). The positive correlation between menopause status and age is self-explanatory, and the association between menopause status and ER or HER2 is also easily explained by age. Finally, Table 4 shows that HER2+ patients are more likely to have a first-degree relative with cancer.
Between univariate and multiple regression, we also applied a regularized regression, LASSO (least absolute shrinkage and selection operator) (Tibshirani, 1996), to study the situation with a few explanatory variables. LASSO accomplishes the task of variable selection (e.g., (Halinski and Feldt, 1970; Li and Yang, 2002)) by imposing a constraint on the sum of the absolute values of all fitting coefficients, effectively setting many coefficients to zero and thus removing the contribution from these variables. Fig.2 shows how the coefficient of each explanatory variable changes, from left to right, as the number of variables with non-zero coefficients increases, for the dependent variables ER, ER/PR, and HER2.
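How an L1 penalty sets coefficients exactly to zero can be seen in a small sketch. The code below is an assumption-laden illustration, not the authors' implementation (the software they used is not stated here): it fits an L1-penalized logistic regression by proximal gradient descent with soft-thresholding, and the function names, step size, iteration count, and toy data are all hypothetical.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t*|v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_logistic(X, y, lam, step=0.5, n_iter=500):
    """L1-penalized logistic regression fitted by proximal gradient descent.

    Minimizes mean log-loss + lam * sum(|w_j|); the intercept b is not
    penalized. X is a list of rows, y a list of 0/1 labels.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        # residuals: predicted probability minus observed label
        r = []
        for row, yi in zip(X, y):
            eta = b + sum(wj * xj for wj, xj in zip(w, row))
            r.append(1.0 / (1.0 + math.exp(-eta)) - yi)
        grad_b = sum(r) / n
        grad_w = [sum(ri * row[j] for ri, row in zip(r, X)) / n for j in range(p)]
        b -= step * grad_b
        # gradient step followed by soft-thresholding (the L1 proximal step)
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad_w)]
    return w, b


# Toy data (hypothetical): feature 0 determines the label, feature 1 is noise
X = [[(i - 19.5) / 10.0, 1.0 if i % 2 == 0 else -1.0] for i in range(40)]
y = [1 if row[0] > 0 else 0 for row in X]

for lam in (10.0, 0.1):
    w, b = lasso_logistic(X, y, lam)
    print(lam, [round(wj, 3) for wj in w])
```

With a large penalty every coefficient stays at zero; as the penalty is relaxed, the informative variable enters first, which is the behavior a coefficient-path plot such as Fig.2 displays.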
For the ER subtype, Fig.2(A) shows the dominant contribution from menopause status, consistent with Table 4. Fig.2(A) also shows that, after menopause, hormone therapy is the second most important contributing factor, though the contribution is negative. Fig.2(B) shows that breast-feeding contributes positively to the ER/PR+ status, a result not prominent in the single-variable and all-variable regressions, though the trend can already be seen in Table 1. This result is consistent with reports in the literature that breast-feeding is beneficial in reducing the probability of acquiring breast cancers with poor prognosis, such as triple-negative subtypes (Islami et al., 2015; Fortner et al., 2019). Fig.2(C) confirms the positive contribution from first-degree-relative cancer history, and the negative contribution from menopause status, to HER2+, with the latter result already seen in Table 4.

Famagusta was granted to conduct the study. Telephone interviews were made when necessary to collect information from patients to fill in the missing factor values.
Tumor marker data collection: For the 324 cases, pathologists from the BNSH ascertained ER and PR status from patient tumor tissue using immunohistochemistry (IHC) and/or pathology reports, using a standardized protocol and pathology reporting forms. HER2 status, where available (300 cases), was taken from patient medical reports. Where tumor tissue was available, pathologists used IHC testing for ER and PR, and categorized tumors as ER or PR positive if ≥ 10% of tumor cells stained positive. When the ER or PR +/- status is not labeled but only a specific percentage is given, we treat it as unknown. When the left and right breast are labeled with different ER, PR, or HER2 status, it is treated as unknown. Menopausal and other information was extracted either from the medical records (with guidance/approval from an oncologist) or by telephone interviews.
Pre-processing of data: We remove the three male samples, reducing the sample size from 324 to 321. For hormone receptor status, if the left and right breast have different values, the status is labeled as NA (unknown). Also, if the hormone receptor status is not binarized but represented by a percentage, it is labeled as NA.
Other re-coding of the data includes: (smoking) seldom=0, quit=1, x-number-pocket=1; (menopause status) "not clear" is considered as unknown; (family history) first-degree relatives are parents, children, and siblings; (other cancer) anything not "no" is considered as yes (including metastasis); (education) 0, 1, 2, 3 are for no school, primary/middle school, high school, college or more; (housewife/employment) "did not work" is considered as housewife, retired is considered the same as employed; (tumor grade) "high grade" is considered as 3, inoperable is considered as 4, A/B/C are ignored; (invasive cancer) IDC (invasive ductal carcinoma), ICC (invasive cribriform cancer), and ISC (invasive secretory cancer) are invasive, everything else is not invasive.
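A few of these re-coding rules can be written down as a short sketch. The raw label strings and function names below are hypothetical (the actual spelling of entries in the hospital records is not given), so this only illustrates the stated rules rather than reproducing the authors' pipeline.

```python
def recode_receptor(*labels):
    """Recode hormone-receptor entries to '+', '-', or None (unknown).

    Follows the rules in the text: an entry given only as a percentage,
    or discordant left/right breast labels, is treated as unknown.
    """
    given = [x for x in labels if x is not None]
    if not given or any(x not in ("+", "-") for x in given):
        return None  # missing, or not binarized (e.g. "30%")
    return given[0] if len(set(given)) == 1 else None


# Education level: 0 = no school, 1 = primary/middle school,
# 2 = high school, 3 = college or more (raw labels are hypothetical)
EDUCATION = {"no school": 0, "primary": 1, "middle": 1, "high school": 2, "college": 3}


def recode_other_cancer(value):
    """Anything other than 'no' counts as yes (1), including metastasis."""
    if value is None:
        return None
    return 0 if value.strip().lower() == "no" else 1


print(recode_receptor("+", "-"), recode_receptor("30%"), recode_other_cancer("metastasis"))
```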
For regression analysis, when a sub-table of the whole dataset is created, we always exclude samples where the dependent variable is unknown. An independent variable is removed if its missing rate is too high (e.g., > 0.2). For an independent variable with a low missing rate, the missing value is imputed from the known variable values (e.g., if x is the independent
One of the striking results we obtained is that our North Cyprus cohort has ER+, ER+/PR+, and HER2+ frequencies very similar to those of BCFR, even though our cohort is much older, more often post-menopausal, less educated, uses hormone therapy less, breast feeds more, etc. Since we also show a correlation between menopause status and ER subtype, it might seem paradoxical that the higher proportion of post-menopause samples in our data does not lead to a significantly higher ER+ frequency. In fact, the ER+ proportion (74.5%) is indeed higher than that in BCFR (71.9%), only the difference is not statistically significant.
This topic can be discussed in general terms: can correlation at one level be translated to correlation at another level? In our case, we examine the similarity/dissimilarity of the distribution of a factor in two datasets (low level), and ask whether it translates to the similarity/dissimilarity of the distribution of a subtype affected by these predictive factors (high level) in those two datasets. In our previous investigation of a very different issue, we did observe that correlation may not be transferable from one level to another. It is the example of genetic linkage/association analysis of multiple correlated phenotypes (Ulgen et al., 2003).
One might guess that simply because these phenotypes are correlated, their risk genes should be located in the same chromosome regions. But our work on traits/phenotypes in a typical lipid panel shows that genetic linkage results are not necessarily correlated even though the phenotypes are (Ulgen et al., 2003).
Another explanation is that the causal link between the two levels is not strong enough to transfer correlation from one level to another. In our LASSO analysis (Fig.2), it can be seen that the fraction of deviance explained (range of the x axis) for ER, ER/PR, and HER2 is at most a few percent, even using all factors. A random forest run on the same data also shows that the classification rate on ER, ER/PR, or HER2 status is not high: on average barely over 50% (results not shown). This highlights the fact that many true predictive factors of breast cancer subtypes are not yet included in our data, and that the known genetic causes of breast cancer are not part of the analysis.
In a recent systematic meta-analysis of African breast cancer subtypes (Eng et al., 2014), it was found that the proportions of ER+ and PR+ samples fluctuate greatly from study to study.
There are also data showing that the triple-negative subtype rate is much higher in African women than in European/Caucasian women (Huo et al., 2009; Zheng et al., 2018). To double-check whether the breast cancer subtype distribution in our North Cyprus cohort is also similar to that of another study, we used published summary statistics from a southeastern Turkish cohort (Kuzhan et al., 2013). The ER+, ER+/PR+, and HER2+ proportions in the Turkish cohort are 73.5%, 81.8%, and 30.4%, compared to our 74.5%, 75.9%, and 24.1%, leading to Fisher test p-values of 0.8, 0.086, and 0.076 (the numbers of samples in the Turkish cohort with the subtype information are 438, 437, and 434). These differences are within range and are not significant.
Our regularized regression (LASSO) (Fig.2) reveals several potential predictive factors for breast cancer subtypes. In order to compare our results with other studies, we note (1) In (Kerlikowske et al., 2016), benign disease proliferation risk is higher in ER+ subtype patients than in the ER- group. This can be compared with the positive contribution we find from other cancer (including metastasis) to the ER+ or ER+/PR+ subtype (Fig.2(A,B)). In (Work et al., 2014), not breast feeding is associated with the ER-/PR- subtype, which can be compared with our result that breast feeding is positively correlated with the ER+/PR+ subtype. In (Tarone and Chu, 2002), the ER- cancer rate stops increasing at a certain age, whereas the ER+ rate continues to increase.
This can be compared to our result that post-menopause is positively correlated with the ER+ subtype (Fig.2(A)). In (Colditz et al., 2004), significant differences in age, menopause status, and past use of hormone therapy are observed among four ER/PR groups. In (Yang et al., 2010), early age at menarche (≤ 12 years) is less common in the PR- group than in the PR+ group, and this is also true in our data when comparing the ER-/PR- and ER+/PR+ groups. To summarize, many of our observed predictive factors for breast cancer subtypes are consistent with the literature. The positive correlation between cancer family history and the HER2+ subtype (Fig.2(C)) remains intriguing.
In conclusion, we use a unique cohort of breast cancer patients in an under-studied population to survey the breast cancer subtypes and related factors. We use a simplified analysis framework: keeping breast cancer subtypes at one level, and all factors at another level. The distributions of many factors are extremely different from those of another large breast cancer registry, while the subtype distribution is similar. This indirectly shows that we have not exhaustively measured all predictive factors of breast cancer subtypes. The relationship between the two levels is investigated by regression, with one variable, all variables, or a subset of variables. These regression analyses show that post-menopause and/or older breast cancer patients are more likely to have the ER+ subtype and the HER2- subtype; hormone therapy is positively correlated with the ER- or ER-/PR- subtype; breast-feeding and/or older breast cancer patients are more likely to have the ER+/PR+ subtype; and family history is observed more frequently in HER2+ subtypes.

The nine subplots are the same plot labeled with different information: ER subtype (red for ER+, blue for ER-), PR subtype, HER2 subtype, menopause status (post-menopause in red, pre-menopause in blue), whether the patient has other cancer (red for yes, blue for no), breast feeding (red for yes, blue for no), age at diagnosis (red if 50 years old or younger), parity/number of children, and education level (0 for none, 1 for primary or middle school, 2 for high school, 3 for college or higher).
