-
1. INTRO
Proteomics. Lipidomics. Metabolomics. Welcome to the world of mass spectrometry-based omics. Magellan Bioanalytics’ computational tools are designed for thorough exploration of large-scale mass spec-based omics datasets.
-
2. DATA SCALE
Data scale has two components: first, the number of examples or subjects under study (n) and, second, the number of individual datapoints considered for each example or subject. Most machine learning problems are built to detect signals in datasets with massive numbers of examples but relatively few datapoints per example. Omics research tends towards the opposite, with massive numbers of individual datapoints per example but relatively few examples to consider. This has important implications for how machine learning can facilitate omics research.
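As a rough illustration of the difference in shape, consider the entirely hypothetical dataset dimensions in this sketch; the numbers are placeholders chosen only to show the contrast.

```python
# Hypothetical dataset shapes, for illustration only
# (n = examples/subjects, p = datapoints per example).
tech_shape  = (10_000_000, 10_000)   # consumer-tech ML: many examples, fewer datapoints each
omics_shape = (1_000, 1_000_000)     # mass spec-based omics: few subjects, huge feature counts

for name, (n, p) in [("tech ML", tech_shape), ("omics", omics_shape)]:
    print(f"{name:8s} n = {n:>12,}  p = {p:>12,}  n/p = {n / p:,.4f}")
```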
-
3. PURPOSE OF ‘N’
N refers to the number of subjects in a study. Studies with larger numbers of subjects are often thought of as more ‘powerful.’ Low-n studies are a red flag; they can show evidence of biological connections where there are none. Increasing n gives greater confidence that biological connections are real, resolving the accuracy of the study with ever greater precision and increasing the signal-to-noise ratio.
-
4. THE LIMITS OF ‘N’
Increasing n is not a solution to all data problems. In biology, many connections are inherently limited in strength. For example, a single biomarker may predict an outcome with 80% accuracy. Increasing n affects this accuracy value only in terms of how precisely we know it (is it 80% accurate or 79.999% accurate?). Increasing n does not improve the accuracy of the biomarker in this example. And if there is no signal in the dataset, increasing n is futile.
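A quick simulation makes the point. The 80% figure and the study sizes below are hypothetical; the sketch only shows that more subjects narrow the uncertainty around the biomarker’s accuracy without changing the accuracy itself.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_ACCURACY = 0.80   # hypothetical biomarker that is right 80% of the time

for n in (50, 500, 5_000, 50_000):
    # Simulate many studies of size n; each subject is classified correctly with probability 0.80.
    estimates = rng.binomial(n, TRUE_ACCURACY, size=2_000) / n
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    print(f"n = {n:>6,}: estimated accuracy typically falls in [{lo:.3f}, {hi:.3f}]")

# The interval shrinks as n grows, but it shrinks around the same 0.80:
# more subjects tell us the accuracy more precisely; they do not raise it.
```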
-
5. WHAT DOES INCREASING DATAPOINTS PER SUBJECT DO?
Increasing the number of datapoints considered per subject means ‘casting a wider net.’ It increases the probability that individual datapoints with connections to biological parameters will be found. It does not guarantee that connections will be strong (that high-accuracy biomarkers will be found, say) or affect their strength. Nor does it affect the signal-to-noise ratio. It simply alters the likelihood of successful discovery.
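The effect can be seen with a toy probability calculation. The 1-in-100,000 chance assumed below is invented purely for illustration.

```python
# Toy calculation: assume (hypothetically) that each measured datapoint has an
# independent 1-in-100,000 chance of being connected to the biology under study.
P_INFORMATIVE = 1e-5

for datapoints_per_subject in (1_000, 10_000, 100_000, 1_000_000):
    # Probability that at least one measured datapoint is informative.
    p_at_least_one = 1 - (1 - P_INFORMATIVE) ** datapoints_per_subject
    print(f"{datapoints_per_subject:>9,} datapoints/subject -> "
          f"P(find at least one informative datapoint) = {p_at_least_one:.1%}")

# Measuring more datapoints raises the odds of catching an informative one;
# it does nothing to make that datapoint a stronger biomarker once found.
```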
-
6. THE DATA CHALLENGE FOR OMICS
Researchers leveraging omics techniques have one goal: to determine the molecular basis of biological differences. But the most interesting biological differences are also the most subtle. That means that signals are rare and non-obvious. While sufficiently powered (large n) studies are required for confidence and sensitivity, large datasets are critical to successful discovery of novel connections.
-
7. BIG DATA IN OMICS RESEARCH
How ‘big’ is big data in omics research? 30,000 genes? 90,000 proteins? 190,000 lipid species? 1 million metabolites? Understanding how variations in this molecular complexity alter human health is foundational to the future of medicine. Integrated omics, the analysis of changing levels of all of the major biomolecule subtypes, is the ultimate big data problem.
-
8. WEAK BIOLOGICAL CONNECTIONS AND INFORMATION CONTENT
Connections between biomolecules and biology are complex. Most biomarkers imperfectly reflect biology, meaning they have low information content with respect to an associated biological parameter. The more subtle the biological difference, the more likely that individual biomarkers will have limited accuracy in predicting the specific biology.
-
9. ORTHOGONAL DATA: DIFFERENCES IN INFORMATION CONTENT
How much more can you learn from increasing the number of high information datapoints? That depends on how orthogonal the data are, meaning how much the information content differs between the datapoints. If multiple datapoints each provide the same information, no additional insight is gained. If multiple datapoints each provide unique information content, new insight is added with each new datapoint.
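One way to see this is with synthetic data: two nearly redundant measurements of a hidden signal add almost nothing over a single measurement, while two independent (orthogonal) measurements noticeably improve the combined estimate. All values in this sketch are simulated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
biology = rng.normal(size=n)   # hidden biological parameter (synthetic)

# Redundant datapoints: the second is almost a copy of the first.
redundant_a = biology + rng.normal(scale=1.0, size=n)
redundant_b = redundant_a + rng.normal(scale=0.1, size=n)

# Orthogonal datapoints: each carries its own independent view of the signal.
orthogonal_a = biology + rng.normal(scale=1.0, size=n)
orthogonal_b = biology + rng.normal(scale=1.0, size=n)

def combined_r(x1, x2, target):
    # Correlation between the target and the simple average of two datapoints.
    return np.corrcoef((x1 + x2) / 2, target)[0, 1]

print("single datapoint vs biology: r =", round(np.corrcoef(redundant_a, biology)[0, 1], 3))
print("redundant pair vs biology:   r =", round(combined_r(redundant_a, redundant_b, biology), 3))
print("orthogonal pair vs biology:  r =", round(combined_r(orthogonal_a, orthogonal_b, biology), 3))
```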
-
10. MULTIVARIATE ANALYSIS AND DATA SCALE
Low information content datapoints can be combined in multivariate analysis to predict connected biology with high accuracy. The key is that it requires increasing numbers of individual biomarkers with different information content. Finding sufficient numbers of datapoints, each with unique and relevant information content, requires large numbers of datapoints per subject.
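A minimal sketch of the idea, using synthetic data rather than a real omics pipeline: each simulated marker is individually a poor predictor, but a standard multivariate model built on a couple hundred of them predicts the outcome well. The subject counts, marker counts, and effect size are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_subjects, n_weak_markers = 2_000, 200

outcome = rng.integers(0, 2, size=n_subjects)            # binary biological difference
# Each marker shifts only slightly with the outcome, so alone it is a weak biomarker.
X = rng.normal(size=(n_subjects, n_weak_markers)) + 0.15 * outcome[:, None]

X_train, X_test, y_train, y_test = train_test_split(X, outcome, random_state=0)

single = LogisticRegression().fit(X_train[:, :1], y_train)             # one weak marker
combined = LogisticRegression(max_iter=1_000).fit(X_train, y_train)    # all 200 weak markers

print("one weak biomarker:  accuracy =", round(single.score(X_test[:, :1], y_test), 3))
print("200 weak biomarkers: accuracy =", round(combined.score(X_test, y_test), 3))
```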
-
11. DATA SCALE’S GLASS CEILING
As biological differences become finer, high information content datapoints become scarcer and the information content of each datapoint can fall. Thus, the number of datapoints evaluated per sample ultimately limits the capacity of multivariate analysis to distinguish biological differences. How many datapoints you consider is the major determinant of your success analyzing complex and subtle biology.
-
12. WHY 2 + 2 = 5: ORTHOGONAL DATA AND INTEGRATED OMICS
Integrated omics is the combined analysis of different major biomolecule categories, like transcriptomics (RNA) and proteomics (proteins). Integrated omics is particularly powerful because each omics subtype considered is orthogonal to the others. In other words, datapoints from different omics subtypes are more likely to contain distinct information content, and this maximizes the opportunity to successfully identify the molecular basis of complex biology.
-
13. OBSTACLES TO FULL OMICS INTEGRATION
Integrated analysis of different omics is difficult because major omics approaches generate datasets in vastly different formats. The current approach to integrated omics is to analyze one omics dataset and use the results for targeted analysis of datasets generated with other omics subtypes. This approach, a coordinated but independent analysis of different omics datasets, biases analysis of ‘secondary’ omics approaches employed in the study.
-
14. BIOMOLECULES: DNA AND EVERYONE ELSE
Comprehensive analysis of biomolecular species can be divided into two categories: nucleotide sequences and everything else. Nucleotide sequences, both DNA and RNA, are analyzed by sequencer instruments that produce hundreds of millions of reads. Other biomolecules, such as proteins, lipids, and metabolites, are detected using mass spectrometry.
-
15. EVOLUTION OF SEQUENCING-BASED OMICS
Evolving from methods that produced short sequences in limited numbers, today’s instruments have achieved ‘singularity.’ Today’s sequencers can yield the complete nucleotide sequence of an entire genome. They can quantify every single RNA sequence in a sample, with sensitivity sufficient to detect the rarest of transcripts. In short, researchers work with datasets that are 100% complete.
-
16. INCOMPLETE DATASETS AND SUCCESSFUL DISCOVERY
Scientists sometimes have to work with omics datasets that are limited to specific target biomolecules. Researchers must hope that signals in their incomplete dataset connect to the biology under study. Working with a dataset of a few genes or proteins offers limited potential to find connections. The capacity to analyze an entire set of biomolecules offers the opportunity to find any connection.
-
17. COMPLETE DATASETS AND A REVOLUTION IN OMICS THINKING
After the sequencing ‘singularity,’ a new analytical approach emerged. GWAS, or genome-wide association studies, link differences found anywhere in the genome with biology. GWAS are totally unbiased in their search for molecular differences and were made possible by the reproducible generation of complete genomics datasets. Numerous unexpected biological connections were and continue to be discovered.
-
18. SEQUENCING ENTERS THE INFORMATION AGE
Sequencing-based omics has transitioned from a technical, instrumentation-based problem to a data science problem. In other words, scientists now face the problem of extracting and understanding information from sequencing datasets, not of generating them. The entire scientific field of bioinformatics is devoted to computational processing of sequencing datasets, and its practice does not require technical expertise in sequencing methodology.
-
19. INTEGRATED OMICS: BIOINFORMATICS AND MASS SPEC-BASED OMICS
An advantage of bioinformatics’ emergence as a discipline is that real integration of different sequencing-based omics subtypes (genomics, transcriptomics, and microbiomics, say) is possible. Real integration, meaning analysis in parallel and not in series, with mass spec-based omics, like proteomics, metabolomics, and lipidomics, is not currently possible.
-
20. MASS SPEC AND THE INFORMATION AGE
While mass spec generates omics datasets that are formatted very differently from sequencing datasets, this is not the only obstacle to real integration with sequencing-based omics. Put simply, mass spec-based omics has not entered the information age. Mass spec-based research remains centered around overcoming technical and instrumentation challenges.
-
21. INTEGRATION OF MASS SPEC-BASED OMICS
Mass spec-based omics analyses can be integrated with one another. But that is theoretical only, since the high degree of specialization in mass spec analytical software interferes with easy integration of the data. Real integration of sequencing- and mass spec-based omics is not currently possible. With the mass spec research community focused on technical and instrumentation challenges, a lack of computational tools plagues mass spec-based omics.
-
22. MASS SPECTROMETRY METHODOLOGY
Mass spectrometry can measure the abundance of proteins, lipids, metabolites, and other molecules. Each molecule is detected as one or more ions, whose mass-to-charge ratio (m/z) and abundance are measured. In one scan, many different molecules can be detected and measured. Mass specs can handle specimens containing a complex mixture of molecules, using chromatography to separate the mixture into simpler fractions that are subjected to repeated scanning.
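As a rough mental model (not a real instrument file format; the field names and values below are invented), a run can be thought of as a series of scans, each recording the m/z and abundance of the ions detected at one point in the chromatographic separation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Scan:
    retention_time_min: float   # when in the chromatographic run the scan was taken
    mz: List[float]             # mass-to-charge ratio of each detected ion
    intensity: List[float]      # measured abundance of each detected ion

run = [
    Scan(retention_time_min=12.01, mz=[428.3680, 760.5851], intensity=[1.2e6, 8.9e5]),
    Scan(retention_time_min=12.03, mz=[428.3681, 760.5849, 810.6007], intensity=[1.4e6, 9.1e5, 2.0e5]),
]

for scan in run:
    top = scan.intensity.index(max(scan.intensity))
    print(f"RT {scan.retention_time_min:.2f} min: {len(scan.mz)} ions, "
          f"most abundant m/z = {scan.mz[top]:.4f}")
```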
-
23. MASS SPEC AND MOLECULAR IDENTITY
Determining the molecular identity of species detected by mass spec instruments can be challenging. The m/z and chromatographic behavior of a detected feature may already be known, but additional instrument activity is required if its molecular identity must be established or confirmed. For some molecules, like proteins and peptides, this is routine. Others are more recalcitrant to identification, and each identification requires significant effort.
-
24. MS1 AND MS2 SCANS
Mass spec instruments scan samples to identify molecular species as they enter the mass spec. Ionization is kept gentle for these initial MS1 scans to avoid fragmenting the molecules into smaller pieces that would each be detected as separate features. Thus, MS1 scans contain abundance information for all of the molecules detected by the mass spec. Some of these are then selected for high-energy fragmentation in subsequent MS2 scans. The fragments of the original molecule are detected as distinct features, from whose reconstruction the molecular identity can be determined or confirmed.
-
25. SELECTING SPECIES FOR FRAGMENTATION
Which species in the MS1 scan should be subjected to MS2 analysis? Several different algorithms are used, but the point is that not all the features detected in MS1 enter MS2 scans. All algorithms use abundance and familiarity, meaning a high likelihood of matching something in an MS2 database, as criteria. For typical analysis of a complex sample, 90% of the instrument time is used for MS2 scanning.
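A simplified illustration of the idea, not any vendor's actual acquisition logic: rank MS1 features by familiarity and abundance, and fragment only the top few. The library entries, tolerance, and feature values are all made up.

```python
MS2_LIBRARY_MZ = {428.3680, 760.5851}   # hypothetical previously identified species
MZ_TOLERANCE = 0.005

def is_familiar(mz: float) -> bool:
    return any(abs(mz - known) <= MZ_TOLERANCE for known in MS2_LIBRARY_MZ)

def select_precursors(ms1_features, top_n=3):
    """Rank MS1 features by (familiarity, abundance) and keep only top_n for MS2 scans."""
    ranked = sorted(ms1_features, key=lambda f: (is_familiar(f["mz"]), f["intensity"]), reverse=True)
    return ranked[:top_n]

ms1_features = [
    {"mz": 428.3679, "intensity": 5.0e5},
    {"mz": 760.5853, "intensity": 3.2e5},
    {"mz": 512.1144, "intensity": 9.8e6},   # abundant but unfamiliar
    {"mz": 633.4501, "intensity": 1.1e4},   # low abundance: unlikely to ever be fragmented
    {"mz": 810.6007, "intensity": 2.4e5},
]

for f in select_precursors(ms1_features):
    print(f"selected m/z {f['mz']:.4f} (intensity {f['intensity']:.1e}, familiar={is_familiar(f['mz'])})")
```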
-
26. MS2 LIBRARIES
As mass spec features are successfully identified, they enter MS2 libraries. These libraries then inform MS2 feature selection going forward. When MS2 scans are used as part of mass spec dataset generation, instruments become more and more focused on previously identified species. Opportunities to find novel species, and thus unanticipated discovery, can suffer.
-
27. INSTRUMENT OVERLOAD: GAPS IN MS2 DATASETS
MS2 scans are instrument intensive. For complex samples, only a portion of the features detected in MS1 scans are subjected to MS2 scanning. The same features may not be subjected to MS2, even when running an identical sample. The result is that mass spec-based omics datasets generated with MS2 scanning, where molecular identities are known, have gaps. This creates an analytical challenge: is a given species absent, or was it simply not selected for MS2 in a subset of the samples?
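The toy feature table below (all values invented) shows how that ambiguity looks in practice: a missing entry could mean the molecule was absent from that sample, or that the instrument never selected it for an MS2 scan in that run.

```python
import numpy as np
import pandas as pd

# Toy table of MS2-identified species across three samples (hypothetical values).
abundances = pd.DataFrame(
    {
        "sample_1": [1.2e6, 4.5e5, np.nan],
        "sample_2": [1.1e6, np.nan, np.nan],
        "sample_3": [1.3e6, 4.9e5, 2.2e5],
    },
    index=["lipid_A", "lipid_B", "lipid_C"],
)

print(abundances)
print("\nmissing measurements per identified species:")
print(abundances.isna().sum(axis=1))

# From this table alone there is no way to tell whether lipid_B is truly absent
# from sample_2 or just went unselected for MS2 scanning in that run.
```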
-
28. DATA CAPACITY AND ACCURACY OF MODERN MASS SPEC INSTRUMENTS
Modern mass spec instruments are highly accurate, measuring m/z to 4 decimal places. Accuracy increases the ability to generate more data per scan, since molecules with similar m/z values can be distinguished. Today’s most powerful instruments can detect hundreds of thousands or millions of features during runs with relatively short chromatographic separations. In short, modern instruments are capable of measuring essentially every biomolecule, even in complex samples.
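A toy example of why decimal places matter (the two m/z values below are invented): at low precision, two nearby species collapse into a single feature, while at four decimal places they remain distinct.

```python
# Two hypothetical molecules that are close in mass.
species_mz = [760.5851, 760.5902]

for decimals in (1, 2, 3, 4):
    distinct = {round(mz, decimals) for mz in species_mz}
    status = "resolved as separate features" if len(distinct) == 2 else "collapse into one feature"
    print(f"measured to {decimals} decimal place(s): {sorted(distinct)} -> {status}")
```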
-
29. LESSONS FROM FEATURE-FIRST PROTEOMICS
Feature-first proteomics is the approach of performing analysis using datasets containing only MS1 scans. Mass spec is performed without MS2 scans. Using the same logic as for GWAS studies in bioinformatics, the idea is to identify MS1 datapoints that are connected to biological differences without any bias, including that arising from MS2 target feature selection algorithms. The approach treats mass spec studies as data science problems.
-
30. INFORMATION OVERLOAD AND LACK OF COMPUTATIONAL TOOLS
Why are mass spec-based omics studies so much smaller in scale than bioinformatics studies? Comparative analysis of mass spec datasets is challenging because analytical tools do not match the capacity of modern mass spec instruments. Though instruments are capable of detecting and measuring the levels of millions of biomolecules per sample, software capability breaks down at this scale.
-
31. CURRENT INFORMATION OVERLOAD WORKAROUNDS
Why attempt to analyze all of the available data without appropriate computational tools? Current approaches to mass spec-based omics focus on generating the highest quality data, but in quantities that make for easy analysis. Any study using MS2 scans accepts data loss in exchange for molecular identification. Reflecting that mass spec-based omics is stuck in the technical and instrumentation problem paradigm, most new solutions focus on sample preparation methods designed to extract potentially high information content species from complex samples, reducing the number of biomolecules under study.
-
32. INFORMATION OVERLOAD: THE N SQUARED PROBLEM
Why is comparative analysis of mass spec-based omics datasets so challenging? Because connections between datapoints emerge only after analysis, mass spec datasets cannot be collapsed to smaller, tractable numbers for analysis. Comparing complex samples means computing values for the square of the number of datapoints, then multiplying that number by the number of subjects. Comparing datasets with one million datapoints each is not computationally possible, even with a supercomputer.
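The back-of-the-envelope arithmetic below (one million datapoints per subject and a thousand subjects, both placeholder numbers) shows how quickly the pairwise comparison count explodes.

```python
datapoints_per_subject = 1_000_000
subjects = 1_000

pairwise_per_subject = datapoints_per_subject ** 2     # 10^12 pairwise values per subject
total_values = pairwise_per_subject * subjects         # 10^15 values across the study

print(f"pairwise values per subject: {pairwise_per_subject:.1e}")
print(f"values across {subjects:,} subjects: {total_values:.1e}")

# At a nominal 8 bytes per value, just storing one subject's pairwise matrix
# would take roughly 8 terabytes, before any comparative analysis even starts.
print(f"approx. storage for one subject's matrix: {pairwise_per_subject * 8 / 1e12:.0f} TB")
```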
-
33. MACHINE LEARNING TOOLS AND DATA SCALES
Can artificial intelligence help? Absolutely, but existing machine learning tools are ill suited to the task. Widely used machine learning algorithms were built to handle data generated by tech companies, with large numbers of examples (tens of millions) and relatively few datapoints per example (tens of thousands). Mass spec-based omics datasets present the opposite challenge: massive numbers of datapoints (tens of millions for an integrated omics dataset) with relatively few examples (thousands).