By design, the i-Forget study will find associations between genetics, the intestinal microbiome and human disease. Because we are measuring 3 billion genetic variables and 2 billion bacterial variables in every participant, we are likely to find something that distinguishes people at risk of dementia from those who are not.
We refer to large and complex sets of data like this as big data. The volume and variety of big data require advanced analytic methods to reveal patterns, trends, and associations. Technological advances have made this large-scale analysis possible.
Out of the fantastically rich i-Forget data set will come associations that researchers will then need to confirm in separate data sets, followed by animal and, eventually, human intervention studies. We are beginning a process that we need to carry out with careful analytical methods to produce quality results.
The analysis of large data sets is not new. It has been the cornerstone of population-based medical research for a century or more. However, history shows that analyzing these large data sets to find a single version of the truth can be challenging.
Perhaps the most celebrated example of finding truth in large data is the work of the British epidemiologist Sir Richard Doll, who was among the first to analyze health data collected from a very large number of people. In 1954, he published a report titled “The mortality of doctors in relation to their smoking habits: a preliminary report.” After studying surveys completed by over 40,000 doctors, he reported higher lung cancer mortality among smokers. This landmark observation contributed to the 1964 U.S. Surgeon General’s report warning that cigarette smoking causes lung cancer.
Other large data set studies have revealed analysis pitfalls. In 1981, researchers from the Harvard School of Public Health announced a link between drinking coffee and pancreatic cancer. After studying 369 pancreatic cancer patients and 644 similar people who did not have pancreatic cancer, the authors reported a 3-fold increased risk among those who drank more than five cups a day. Subsequent large studies failed to confirm the link, and recent research suggests that coffee consumption may actually protect against pancreatic cancer.
What causes big data studies to generate misleading conclusions? One common explanation is that a specific data set may, by chance, show associations that cannot be reproduced in other data sets. The original observation may have been a fluke: a peculiarity of that one data set rather than a true disease association (as with coffee and pancreatic cancer).
Another source of failure arises when researchers “analyze their data to death,” meaning they look at one thing after another until they find some disease association. While this approach can find genuine associations that go on to be confirmed by other studies, it can also fool investigators into believing that relationships exist where there are none.
For example, say a study hypothesized that 6-year-old boys are taller than girls but found no difference. The investigators might go on to examine other traits (hair color, eye color, arm length, and so on) until they find a difference, let’s say in the number of freckles. However, when others do a follow-up study focusing solely on freckles (a clear, pre-defined variable), their analysis shows no difference. The apparent difference in freckles between girls and boys was false, a fluke. This example highlights how measuring one variable after another, without defining in advance what you are examining, can lead to false associations.
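The freckles scenario above can be simulated. The sketch below is a hypothetical illustration (the trait names and group sizes are invented, not from the i-Forget study): it draws two groups from the same distribution, so no real difference exists, then tests 20 unrelated traits. With a p < 0.05 threshold, testing many traits often produces at least one “significant” difference by chance alone.

```python
import random

random.seed(0)

def two_sample_p(a, b, n_perm=2000):
    """Permutation-test p-value for a difference in group means."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    count = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            count += 1
    return count / n_perm

# Test 20 unrelated traits; both groups are drawn from the SAME
# distribution, so every "significant" result is a false positive.
n_traits = 20
false_positives = 0
for _ in range(n_traits):
    boys = [random.gauss(0, 1) for _ in range(30)]
    girls = [random.gauss(0, 1) for _ in range(30)]
    if two_sample_p(boys, girls) < 0.05:
        false_positives += 1

print(f"'Significant' traits out of {n_traits}: {false_positives}")
```

Because each trait has a roughly 5% chance of crossing the threshold, the odds of at least one false positive across 20 traits are close to two in three, which is exactly why a pre-defined hypothesis matters.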
To combat these failures, data science experts recommend that researchers identify their study questions in advance and that the analysis focus on producing a specific answer. Also, when possible, researchers should accumulate data prospectively, meaning they follow individuals over time and collect data as their characteristics or circumstances change. Researchers often use prospective data collection to determine whether a specific intervention changes disease risk or alters symptoms.
If you have participated in a randomized controlled trial (RCT) testing a new medication, that is an example of prospective data collection. Some of the study subjects are randomly selected to get the new drug while others are not. The researchers then compare the outcomes of the two groups over time. RCTs are expensive and time-consuming, but their results are more likely to be accurate.
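The core RCT logic described above (random assignment, then comparison of group outcomes) can be sketched in a few lines. Everything here is invented for illustration, including the assumed 2-point benefit of the hypothetical drug; this is a toy simulation, not an analysis plan from the i-Forget study.

```python
import random

random.seed(42)

# Randomly assign 100 hypothetical participants to drug or placebo.
participants = list(range(100))
random.shuffle(participants)
drug_group = participants[:50]
placebo_group = participants[50:]

def simulated_outcome(treated):
    # Assume (hypothetically) the drug lowers a symptom score by
    # 2 points on average; individual variation is random noise.
    baseline = random.gauss(10, 2)
    return baseline - (2 if treated else 0)

drug_scores = [simulated_outcome(True) for _ in drug_group]
placebo_scores = [simulated_outcome(False) for _ in placebo_group]

def mean(xs):
    return sum(xs) / len(xs)

print(f"Mean symptom score, drug:    {mean(drug_scores):.1f}")
print(f"Mean symptom score, placebo: {mean(placebo_scores):.1f}")
```

Because assignment is random, any systematic gap between the two group means can be attributed to the intervention rather than to pre-existing differences between the groups.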
Large data sets can point out statistical associations. But the critical next step is to develop hypotheses to be confirmed in independent data sets and, finally (if important enough), by prospective studies. Any association requires confirmation, meaning it must be repeatedly observed by different investigators using different patient data sets.
Copyright © 2024 i-Forget - All Rights Reserved.