Why do some people have a more severe course of COVID-19 disease than others? A database created by an international collaboration of researchers — including many from the University of Toronto and partner hospitals — may hold the answers to this question, and many more.
In late 2019 and early 2020, reports of a novel form of coronavirus started emerging, first from China, then from many other locations across the globe. Lisa Strug, senior scientist at The Hospital for Sick Children (SickKids) and academic director of U of T’s Data Sciences Institute, remembers what happened next.
“In my research, I use data science techniques to map the genes responsible for complex traits,” says Strug, who is a professor in the departments of statistical sciences and computer science in the Faculty of Arts & Science at U of T and in the biostatistics division of the Dalla Lana School of Public Health. She is also the associate director of SickKids’ Centre for Applied Genomics, which is one of three sites across Canada that form CGEn, Canada’s national platform for genome sequencing infrastructure for research.
“We knew that genes were a factor in the severity of previous SARS infections, so it made sense that COVID-19, which is caused by a closely related virus, would have a genetic component too. Very early on, I started getting messages from several scientists who wanted to set up different studies that would help us find those genes.”
Over the next few months, Strug collaborated with nearly 100 researchers from across U of T and its partner hospitals and institutions, as well as other researchers from across Canada, to enrol individuals with COVID-19 and sequence their genomes.
Some of the key team members from the Toronto community included:
Partner hospitals and institutions included:
Together with researchers at other universities, hospitals and research institutions across Canada, the team eventually created what came to be known as CGEn HostSeq — Canadian COVID-19 Human Host Genome Sequencing Databank.
The project was initiated by Scherer and CGEn’s Naveed Aziz, with Strug, and was funded by a $20M grant from Innovation, Science and Economic Development Canada, administered through Genome Canada.
“We had to go right to the top to get this project funded fast, and our labs and teams worked seven days a week on the project right through the pandemic,” Scherer recalls.
Identifying associations between individual genes and complex traits typically requires thousands of genomes, both from those with the trait and those without. Though there was no shortage of cases to choose from, it was critical to gather and sequence the DNA, and to organize the resulting data, in a way that would be ethical, efficient and useful to researchers now and in the future.
“One of our key mandates at the Data Sciences Institute is developing techniques and programs that ensure that data remains as open, accessible and reproducible as it can be,” says Strug.
“That vision was brought to bear as we assembled the data infrastructure for this project: for example, ensuring that consent forms were as broad as possible, so that this data could be linked with other sources, from electronic medical records to other health databases.”
“We wanted to be sure that even after the COVID-19 pandemic was over, this could be a national whole genome sequencing resource to ask all kinds of questions about health and our genes. The development of the database and its open nature also enabled Canada to collaborate effectively with similar projects in other countries.”
In the end, the project gathered more than 11,000 full genome sequences from across Canada, representing patients with a wide range of health outcomes. Those data were then combined with even more sequences from patients in other countries under what came to be called the COVID-19 Host Genetics Initiative.
It didn’t take long for patterns to start to emerge. A paper published in Nature in 2021 identified 13 genome-wide significant loci that are associated with SARS-CoV-2 infection or severe manifestations of COVID-19.
Since then, even more data have been added, and subsequent analysis has confirmed the significance of existing loci while also identifying new ones. The most recent update to the project, published in Nature earlier this year, brings the total number of distinct, genome-wide significant loci to 51.
“Identification of these loci can help one predict who might be more prone to a severe course of COVID-19 disease,” says Strug.
“When you identify a trait-associated locus, you can also unravel the mechanism by which this genetic region contributes to COVID-19 disease. This potentially identifies therapeutic targets and approaches that a future drug could be designed around.”
While it will take many more years to fully untangle the effects of the different loci that have been identified, Strug says that the database is already showing its worth in other ways.
“It can be difficult to find datasets with whole genome sequence and approved for linkage with other health information that are this large, and we want people to know that it is open and available for all kinds of research, well beyond COVID, through a completely independent data access committee,” she says.
“For example, several investigators from across Canada have been approved to use these data and we’ve even provided funding to trainees to encourage them to develop new data science methodologies or ask novel health questions using the CGEn HostSeq data.”
“This was a humongous effort, where researchers from across Canada came together during the COVID-19 pandemic to recruit, obtain and sequence DNA from more than 11,000 Canadians, in a systematic, cooperative, aligned way to create a made-in-Canada data resource that will hopefully be useful for years to come. I think that was really miraculous.”