Chapter 4 Sample selection

In this Chapter we describe two major approaches to sample selection for genomic epidemiological studies. We describe when you should use these different sampling strategies, focusing on which types of questions you may be interested in, and how the particular sampling strategy supports the goal of your investigation. This Chapter is pertinent for any readers who are actively seeking to implement genomic surveillance programs and integrate genomic epidemiology into their public health investigations. Readers who want to understand why different sampling strategies are useful will also benefit from reading these sections.

4.1 Representative Sampling

In representative sampling, a set of specimens is selected for sequencing such that the pathogen genetic diversity in the sequenced sample set is representative of the pathogen genetic diversity circulating in the broader population. This means that the investigator should be able to detect the same suite of genotypes in their sample as circulate in the broader population, and that the frequencies with which they observe different genotypes in their sequenced dataset should reflect the frequencies with which those genotypes are found in the broader population.

To maintain these attributes and protect the validity of their dataset, the investigator must ensure that they do not accidentally enrich for certain genotypes by preferentially sequencing samples that have a particular diagnostic trait, come from patients with a particular clinical presentation, or come from a particular demographic group.
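One way to avoid this kind of accidental enrichment is to let a random draw, rather than a person, decide which specimens are sequenced. The sketch below is a minimal illustration only; the function name `select_for_sequencing` and the specimen identifiers are hypothetical, and it assumes every positive specimen in the list is equally eligible for sequencing.

```python
import random


def select_for_sequencing(specimen_ids, n_to_sequence, seed=None):
    """Draw a simple random sample of specimens, without replacement.

    Selection ignores diagnostic traits, clinical presentation, and
    demographics, so no genotype is preferentially enriched by the draw.
    """
    unique_ids = sorted(set(specimen_ids))  # sort so a fixed seed is reproducible
    if n_to_sequence > len(unique_ids):
        raise ValueError("Cannot sequence more specimens than are available.")
    rng = random.Random(seed)
    return rng.sample(unique_ids, n_to_sequence)


# Hypothetical example: 500 positive specimens, capacity to sequence 50.
positives = [f"SPEC-{i:04d}" for i in range(1, 501)]
chosen = select_for_sequencing(positives, n_to_sequence=50, seed=2024)
```

Recording the seed alongside the list of selected specimens makes the draw auditable later. A random draw only protects against bias introduced at the selection step; it cannot correct for specimens that never reached the laboratory.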

Furthermore, to maintain a representative dataset, the investigator must also avoid systematically excluding certain genotypes that are circulating in the population. Genotype exclusion can occur when you lack sequences from a particular portion of your population that does not mix homogeneously with everyone else in your population. For example, if you have an under-served population whose members mix mostly with one another and have limited contact with other groups in your community, then a certain pathogen genotype could circulate primarily within that population. If that population lacks equitable access to testing, then you may not detect transmission within it or sequence the circulating genotype(s). This situation would lead you to miss or underestimate the prevalence of those genotypes.
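A simple check for this kind of exclusion is to compare each group's share of your sequences with its share of a reference denominator. The sketch below is illustrative only: the function names, the district labels, the counts, and the 0.5 threshold are all assumptions made for the example. Note that if you use reported cases as the reference, a group with poor access to testing will be under-counted there too, so a population-based denominator may be more revealing.

```python
def sequencing_coverage(reference_counts, sequenced_counts):
    """Return {group: (share_of_reference, share_of_sequences)}."""
    n_reference = sum(reference_counts.values())
    n_sequenced = sum(sequenced_counts.values())
    if n_reference == 0 or n_sequenced == 0:
        raise ValueError("Both inputs need at least one observation.")
    return {
        group: (
            count / n_reference,
            sequenced_counts.get(group, 0) / n_sequenced,
        )
        for group, count in reference_counts.items()
    }


def underrepresented_groups(coverage, ratio_threshold=0.5):
    """Flag groups whose share of sequences is below
    ratio_threshold times their share of the reference denominator."""
    return sorted(
        group
        for group, (reference_share, sequence_share) in coverage.items()
        if sequence_share < ratio_threshold * reference_share
    )


# Hypothetical counts: residents per district, and sequences per district.
population = {"district_a": 6000, "district_b": 3000, "district_c": 1000}
sequenced = {"district_a": 140, "district_b": 58, "district_c": 2}

coverage = sequencing_coverage(population, sequenced)
flagged = underrepresented_groups(coverage)
```

In this made-up example, `district_c` holds 10% of the population but contributes only 1% of sequences, so it is flagged for follow-up.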

Generally, we recommend using representative sampling when exploring surveillance questions, such as:

  • What is the frequency of this variant in my population?

  • How are frequencies of these different variants changing over time?

  • What is the spatial distribution of different pathogen genotypes?

  • When was this particular pathogen genotype introduced to this population, and how long has it been circulating?

  • How much pathogen diversity do we observe in our community, and how does this relate to pathogen diversity in other communities?
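As an illustration of the first two questions in the list above, the following sketch tallies genotype frequencies by ISO week from a representatively sampled dataset. The function name, the genotype labels "A" and "B", and the records are hypothetical; the frequencies are only meaningful as population estimates if the underlying sample is representative.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_genotype_frequencies(records):
    """Compute genotype frequencies per ISO week.

    records: iterable of (collection_date, genotype) pairs.
    Returns {(iso_year, iso_week): {genotype: frequency}}.
    """
    counts = defaultdict(Counter)
    for collected, genotype in records:
        iso = collected.isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(iso[0], iso[1])][genotype] += 1
    return {
        week: {g: n / sum(tally.values()) for g, n in tally.items()}
        for week, tally in sorted(counts.items())
    }


# Hypothetical records spanning two ISO weeks.
records = [
    (date(2024, 1, 1), "A"), (date(2024, 1, 2), "A"),
    (date(2024, 1, 3), "B"), (date(2024, 1, 4), "A"),
    (date(2024, 1, 8), "B"), (date(2024, 1, 9), "B"),
    (date(2024, 1, 10), "A"), (date(2024, 1, 11), "B"),
]
frequencies = weekly_genotype_frequencies(records)
```

With only a handful of sequences per week, as here, the estimated frequencies are very uncertain; in practice you would also want to report the number of sequences behind each estimate.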

4.2 Targeted Sampling

In targeted sampling we aim to sequence as many samples as possible from a particular population, outbreak, or transmission chain, in order to understand the specific genotypes and disease dynamics associated with that population or setting. Examples of when we use targeted sampling include:

  • Investigating populations showing particular clinical presentations of a disease to see whether a specific pathogen genotype appears correlated with an altered disease presentation.

  • Exploring whether an outbreak in a localized setting, such as a workplace, school, or medical facility, is the result of transmission occurring within the setting. Alternatively, infections could be acquired in the broader community, and simply detected in the localized setting, for example due to increased screening in the facility.

  • Investigating a particular set of cases that report an epidemiologic link to determine whether they are indeed part of the same transmission chain.

  • Investigating individuals presenting with a second acute period of illness, to distinguish between a reinfection event and reactivation of latent disease.
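Several of the investigations in the list above come down to asking how genetically similar the sequenced infections are to one another. The sketch below counts pairwise nucleotide differences between aligned genome sequences. It is a simplified illustration, assuming the sequences are already aligned to the same length; the function names, case labels, and toy sequences are hypothetical. How many differences are compatible with direct transmission depends on the pathogen and the time between cases, so no threshold is built in.

```python
from itertools import combinations

UNAMBIGUOUS = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count sites where two aligned sequences differ.

    Sites where either sequence has a gap or ambiguous base (e.g. N)
    are skipped rather than counted as differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must be aligned to the same length.")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in UNAMBIGUOUS and b in UNAMBIGUOUS and a != b
    )


def pairwise_distances(aligned):
    """Return {(name_1, name_2): distance} for every pair of sequences."""
    return {
        (x, y): snp_distance(aligned[x], aligned[y])
        for x, y in combinations(sorted(aligned), 2)
    }


# Toy alignment: three epidemiologically linked cases and one community case.
aligned = {
    "case_1": "ACGTACGTAC",
    "case_2": "ACGTACGTAC",
    "case_3": "ACGTTCGTAN",
    "community_1": "TCGAACCTAC",
}
distances = pairwise_distances(aligned)
```

Pairwise distances are a coarse first look. A phylogenetic tree that places the targeted sequences among other sequences from the community is usually needed before drawing conclusions about transmission.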

4.3 Contextual data

Targeted sampling focuses on deeply sampling particular settings, transmission chains, or infections that meet case definitions. Yet, to analyze those datasets appropriately, you need to include representatively-sampled genome sequences alongside your targeted samples. Within genomic epidemiology, we often refer to these other representative samples as contextual data. They provide a backdrop for what is happening more broadly outside of your densely sequenced target population, and can improve analytic inferences. Importantly, they also serve as controls, enabling you to see whether dynamics observed in your targeted population are unique to that population, or whether they are typical of the broader set of sequenced cases.

4.3.1 Contextual data as a backdrop

Including contextual data in the analysis of transmission dynamics within a targeted setting enables the public health practitioner to see whether the pathogen genotypes associated with the outbreak circulated in the community before and/or after the outbreak. This contextual information can potentially clarify how an outbreak began, and when it has truly ended. Furthermore, including contextual data in a targeted analysis can help elucidate links between an outbreak in the targeted population and transmission in the broader community. That information can help the epidemiologist to see whether an outbreak amplified transmission and seeded it in the broader population. Additionally, if the bounds of an outbreak aren’t truly known, representative samples that appear related to targeted samples may indicate a connection to the outbreak that was not previously known. Finally, including sequences sampled over longer time periods will make estimates of the molecular clock more accurate, and often more precise. When the cases of interest within your targeted analysis occur over a short time window, as in the case of small, localized outbreaks, including contextual sequences sampled over longer periods of time helps keep your time tree analyses accurate.
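The point about sampling windows can be illustrated with a root-to-tip regression, a common heuristic in which each sequence's genetic divergence from the root of the tree is regressed on its sampling date and the slope is read as a rough evolutionary rate. The sketch below is a simplified illustration, not a substitute for a full molecular clock analysis: the function name, the rate, and the small synthetic offsets are all made up, and the ordinary least-squares standard error ignores the shared ancestry among sequences.

```python
import math


def root_to_tip_rate(sample_dates, divergences):
    """Least-squares slope of divergence on sampling date, with its
    standard error. Dates are decimal years; divergences are
    substitutions per site measured from the root of the tree."""
    n = len(sample_dates)
    if n != len(divergences) or n < 3:
        raise ValueError("Need at least three matched date/divergence pairs.")
    mean_t = sum(sample_dates) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in sample_dates)
    if sxx == 0:
        raise ValueError("All samples share one date; the rate is not identifiable.")
    sxy = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(sample_dates, divergences))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_t
    rss = sum((d - (intercept + slope * t)) ** 2
              for t, d in zip(sample_dates, divergences))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope, standard_error


# Synthetic illustration: the same rate and the same offsets, sampled over
# about three weeks (outbreak only) versus 18 months (with contextual data).
TRUE_RATE = 1e-3  # substitutions per site per year (made-up value)
offsets = [1e-5, -1e-5, -1e-5, 1e-5]
outbreak_dates = [2021.00, 2021.02, 2021.04, 2021.06]
context_dates = [2019.5, 2020.0, 2020.5, 2021.0]


def simulate(dates):
    return [TRUE_RATE * (t - 2019.0) + e for t, e in zip(dates, offsets)]


narrow_rate, narrow_se = root_to_tip_rate(outbreak_dates, simulate(outbreak_dates))
wide_rate, wide_se = root_to_tip_rate(context_dates, simulate(context_dates))
```

Because the standard error of the slope shrinks as the spread of sampling dates grows, the wider sampling window gives a much tighter rate estimate here even though the scatter around the line is identical in both datasets.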

4.3.2 Contextual data as controls

Imagine that you are interested in a cluster of illnesses that have a different disease presentation. You wonder whether a change in the pathogen itself might be responsible for the changed clinical manifestation. To investigate, you decide to sequence pathogen genomes from cases that meet a case definition for the new clinical presentation. In the absence of an association between pathogen genotype and disease presentation, you may find that all of these individuals with similar disease presentations are infected with distinct and diverse pathogen genotypes. However, what if the individuals who meet your case definition are infected with the same or similar genotypes? Does that mean that there is an association between the genotype and disease presentation?

Not necessarily. Only looking at sequences from individuals showing a specific clinical course would be like only looking at the people who got sick after eating at a “poisoned picnic”. If every person who got sick at the picnic ate the potato salad, then you might conclude that the potato salad is to blame. But what if everyone ate the potato salad, including your controls who did not get sick after the picnic? Then the potato salad is probably not the culprit. The validity of your study depends on the controls.

Similarly, representatively-sampled contextual sequences act as controls in genomic epidemiological analyses. They allow you to see which pathogen genotypes circulate in individuals not presenting with the altered clinical course. Much like the case where everyone at the picnic ate the potato salad, it is completely possible that a particular genotype is dominantly circulating in a community and causing infections with varied clinical presentations. If this is the case, then your targeted samples may all be infected with the same genotype, but your representatively-sampled contextual sequences will also show that same genotype.
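The comparison between targeted cases and contextual controls can be summarized in a two-by-two table: genotype present or absent, crossed with targeted case or contextual control. The sketch below computes a one-sided Fisher's exact test on such a table using only the standard library. It is an illustration under simplifying assumptions; the function name and all counts are hypothetical, and the test treats infections as independent, which is often not true for cases in the same transmission chain.

```python
from math import comb


def fisher_exact_greater(cases_with, cases_without,
                         controls_with, controls_without):
    """One-sided Fisher's exact test for a 2x2 table.

    Returns the probability of seeing at least `cases_with` cases carrying
    the genotype if genotype and presentation were unrelated, given the
    table's row and column totals.
    """
    counts = (cases_with, cases_without, controls_with, controls_without)
    if any(c < 0 for c in counts):
        raise ValueError("Counts must be non-negative.")
    n_cases = cases_with + cases_without
    n_with = cases_with + controls_with
    n_total = sum(counts)
    upper = min(n_cases, n_with)
    numerator = sum(
        comb(n_with, x) * comb(n_total - n_with, n_cases - x)
        for x in range(cases_with, upper + 1)
    )
    return numerator / comb(n_total, n_cases)


# "Everyone ate the potato salad": the genotype dominates in cases AND controls.
p_dominant = fisher_exact_greater(10, 0, 20, 0)

# The genotype appears in every case but in none of the contextual controls.
p_enriched = fisher_exact_greater(10, 0, 0, 10)
```

In the first scenario the result is 1.0, meaning the data give no support for an association, even though every targeted case carries the genotype. In the second, the result is very small, which would justify a closer look rather than prove causation.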

Just like controls in traditional epidemiological studies, contextual sequences should be collected from individuals who would have been sequenced as part of your targeted sampling effort had they fit the criteria that are guiding your targeted sampling effort.