  • Comment

Avoiding common pitfalls in machine learning omic data science

This Comment describes common pitfalls encountered in deriving and validating predictive statistical models from high-dimensional data. It offers a fresh perspective on key statistical issues and provides guidelines to avoid these pitfalls and to help readers unfamiliar with them better assess the reliability and significance of their results.

Fig. 1: The curse of dimensionality and overfitting.
Fig. 2: Avoiding bias when training and evaluating molecular predictors.
Fig. 3: Unknown confounders and class prediction.
Fig. 4: Avoiding bias when comparing feature selection methods.

Corresponding author

Correspondence to Andrew E. Teschendorff.

Cite this article

Teschendorff, A.E. Avoiding common pitfalls in machine learning omic data science. Nat. Mater. 18, 422–427 (2019). https://doi.org/10.1038/s41563-018-0241-z

  • DOI: https://doi.org/10.1038/s41563-018-0241-z
