New and emerging environmental contaminants are chemicals that have not been previously detected or that are being detected at levels significantly different from those expected, in both the biological and ecological arenas (that is, in humans, wildlife, and the environment). Many chemicals can originate from a variety of sources, including consumer products, agriculture, and industry, as well as natural and/or anthropogenic disaster scenarios. For example, endocrine-disrupting chemicals (EDCs), pharmaceuticals, and personal care products (such as therapeutic, nontherapeutic, and veterinary drugs, as well as cosmetics and fragrances) are known to be present in many of the world's water bodies and are thought to originate from a variety of sources, including improper disposal into municipal sewage, agribusiness, and veterinary practices. The detection and quantification of these chemicals from a toxicology and exposure perspective are paramount to understanding their effects on both the ecosystem and human health. EDCs act on the endocrine system and are known to alter sexual development and fertility in many vertebrate species. It is suspected that they may play a role in species population declines as well as in public health issues.
Discriminating between potential contaminants and noncontaminants (for example, EDCs versus non-EDCs) can be an exhaustive and costly endeavor. In some cases, the available methods rely on specialized detection apparatus, discrete samples, and complicated sampling techniques, and they raise bioethical issues because testing may require the sacrifice of animals. Currently, high-throughput screening (HTS) efforts (such as in vitro toxicity assays) are helping to reduce some of the challenges and hurdles in testing these chemicals. However, because of the voluminous amount of data generated, powerful data analytics are required to summarize and make sense of the results. Informatics-based approaches such as cheminformatics hold the best possibility of deciphering this barrage of data within a statistical and chemical context.
Data mining and informatics
Informatics is a broad field of study encompassing computer science and information technology, from the retrieval and storage of data to the mining of patterns that exist within the stored data streams. Data mining itself is one step in a process commonly known as data or knowledge discovery (KD). Data curation and storage are critical steps in this process, but it is analysis and interpretation that help researchers elucidate and summarize patterns and relationships within the data, through sophisticated algorithmic and visualization techniques. Examples of these data-mining techniques include the location of predetermined groups (such as decision-tree analysis), the organization of data according to logical relationships (such as hierarchical and cluster analysis), the identification of associations/dependencies (for example, associative rule mining), and the prediction of patterns based on historical data (for example, predictive analytics). Many of these techniques are routinely applied in the retail, finance, and marketing sectors (for example, predicting consumer buying habits and market trends). Informatics-based approaches in the life sciences are largely dominated by cheminformatics and bioinformatics, which employ the same techniques to understand the relationships (associations and patterns) related to chemical structures and properties and to biological functions, respectively. Historically, this has been done with an eye toward the discovery of new chemicals. This discovery aspect has apparent implications in both the pharmaceutical and material sciences, but the same tools and techniques are beginning to be used in a variety of research areas (such as geographical information systems and environment- and genome-wide association studies). A visual representation created with IBM's Many Eyes service (http://www-958.ibm.com) from 374 abstracts found in PubMed highlights the relationships among concepts in the current literature (Fig. 1).
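As a minimal illustration of one such technique, the short Python sketch below applies hierarchical cluster analysis to a handful of hypothetical data records (the data and parameter choices are illustrative assumptions, not drawn from any study discussed here):

```python
# Minimal sketch of hierarchical cluster analysis, one of the data-mining
# techniques described above. The five two-feature "records" are hypothetical;
# a real application would use many more records and features.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[1.0, 2.0],
                 [1.2, 1.9],
                 [8.0, 8.1],
                 [7.9, 8.3],
                 [1.1, 2.2]])

# Build the cluster tree (Ward linkage) and cut it into two flat clusters.
tree = linkage(data, method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # rows 0, 1, and 4 group together; rows 2 and 3 form the other cluster
```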

Chemical space and cheminformatics
Application of data-mining techniques in the arena of knowledge discovery for new and emerging chemicals encompasses a wide variety of chemicals, from exposure biomarkers and pesticides to drugs and EDCs. Since "chemical space" is defined by the set of all energetically stable stoichiometric combinations of atoms, nuclei, and electrons, it is not difficult to imagine that the possible combinations are astronomical, easily surpassing the current list of emerging contaminants. The number of small organic chemicals alone has been estimated to be on the order of 10⁶⁰. Sampling and characterizing this space would require multiple lifetimes at the current level of technology: optimistically, on the order of 10⁵² years, assuming most HTS efforts can process about 100,000 chemicals per day. A far more efficient approach would be to apply cheminformatics-based data-mining techniques to the subset of already known chemicals and arrive at a qualitative, and potentially quantitative, predictive framework (that is, through clustering, classification, association, and predictive analytics). This approach would provide the context for characterizing existing chemicals as well as new and emerging chemicals through a combination of molecular descriptor generation, molecular fingerprinting, and predictive analytics.
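A back-of-the-envelope check of this time scale, under the stated assumption of roughly 100,000 chemicals screened per day:

```python
# Order-of-magnitude check of the screening-time estimate quoted above.
chemical_space = 1e60      # estimated number of small organic chemicals
throughput_per_day = 1e5   # assumed HTS throughput (~100,000 chemicals/day)
days_per_year = 365

years = chemical_space / (throughput_per_day * days_per_year)
print(f"{years:.1e} years")  # ~2.7e52, that is, on the order of 10^52 years
```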
The description of chemical space is largely dictated by structure-based information. For example, one could define chemical space in terms of the subset of chemical properties (that is, molecular descriptors) that might be of interest for a particular biological activity or outcome. Molecular descriptors have a long history within cheminformatics and can be categorized as either mathematical constructs or empirically based measurements that enumerate or quantify information about a chemical, spanning the range from a simple quantification of a chemical's relative partition between octanol and water (log Po/w) to more complex quantum-mechanical descriptors that rely on the electron density of a molecule.
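As a concrete sketch of descriptor generation, the example below assumes the open-source RDKit toolkit (the article does not prescribe any particular software) and computes two simple descriptors for a toy molecule:

```python
# Minimal sketch of molecular descriptor generation, assuming the open-source
# RDKit toolkit (an assumption; not named in the article). Ethanol is a toy input.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CCO")  # ethanol

print(Descriptors.MolWt(mol))    # molecular weight, a simple constitutional descriptor
print(Descriptors.MolLogP(mol))  # calculated octanol/water partition (log Po/w)
```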
Molecular fingerprints, much as their name implies, are encoded structure-based information (such as molecular descriptors and fragments) that is ideally unique to a particular chemical. As variable- or fixed-size representations, they can encode structural keys related to both two- and three-dimensional (2D and 3D) molecular information. The power of molecular fingerprints is that they can be rapidly evaluated and compared against existing fingerprints in a database, making similarity/dissimilarity searches trivial via standard similarity measures (such as the Tanimoto index). Chemical similarity is largely based on the principle that similar compounds have similar properties; by association, chemicals can be grouped on the basis of some derived similarity in their selected molecular fingerprints (for example, P-glycoprotein inhibitors versus noninhibitors). Calculated distance matrices in a database of chemicals can also aid in identifying structure observed in the data (that is, clustering of like properties and/or biological activity).
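The Tanimoto comparison itself is simple enough to sketch directly. In the minimal Python example below, the fingerprints are hypothetical sets of set-bit positions standing in for real structural keys produced by a cheminformatics toolkit:

```python
# Minimal sketch of fingerprint comparison via the Tanimoto index:
# shared bits divided by total distinct bits across the two fingerprints.

def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto index between two fingerprints given as sets of set-bit positions."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Two hypothetical 2D structural-key fingerprints (set-bit positions only).
chem_a = {3, 17, 42, 128, 255}
chem_b = {3, 17, 99, 128}

print(f"Tanimoto similarity: {tanimoto(chem_a, chem_b):.2f}")  # 3 shared / 6 total = 0.50
```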
One of the classic predictive analytic methods of cheminformatics is quantitative structure-activity relationship (QSAR) modeling, which seeks to find statistical correlations between a finite set of structure-based features (for example, molecular descriptors) and an observed outcome (for example, molecular and/or biological activity). Because of the feature-selection problem (that is, which descriptors to choose for the model), a variety of algorithms have evolved that use data-mining techniques, such as neural networks, support-vector machines, and ensemble-average and kernel-based methods. Applicability-domain issues (that is, the relevancy and applicability of a predictive model to a wide range of chemicals) are always prevalent in such models, as they rely heavily on the available data to "train" their predictive associations. In such cases, local models defined by a nearest-neighbors association may provide more predictive power than global models by interpolating within the data rather than extrapolating outside it. However, these local models may suffer from sparse data or small training sets, making it difficult to accurately quantify the applicability domain.
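A minimal QSAR-style sketch, using randomly generated stand-in descriptors and activities (a real study would use curated descriptors and measured outcomes), might look as follows:

```python
# Minimal QSAR-style sketch: fit a statistical model relating molecular
# descriptors to an observed activity, then predict a new chemical.
# All data here are random stand-ins, not real descriptors or assay results.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(100, 5))  # 100 "chemicals" x 5 "descriptors"
activity = descriptors @ [0.8, -0.5, 0.0, 0.3, 0.1] + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(descriptors, activity)

# Predict the activity of a "new" chemical from its descriptors alone.
# Note: predictions are only trustworthy inside the model's applicability
# domain, that is, near the region of descriptor space covered by training data.
new_chemical = rng.normal(size=(1, 5))
print(model.predict(new_chemical))
```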
Visualization of multiple molecules of interest within a set of prescribed descriptor dimensions can rapidly convey information on chemical similarity or dissimilarity, as well as on the general clustering of chemicals. Reduced-dimensionality visualization approaches, such as three-dimensional principal component analysis (3D-PCA) plots, can provide rapid visual insights. In this case, the similarity or dissimilarity of chemicals is based on the relative mapped positions of one molecular entity's structure-based properties with respect to those of a neighboring entity in a reduced Euclidean space composed of multiple molecular descriptors. To illustrate, Fig. 2 shows a chemographic representation, based on CHEMGPS-NP (http://chemgps.bmc.uu.se), of several open-access chemical databases, illustrating how such a representation of multidimensional data can identify overlaps among datasets with similar principal components.
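A minimal sketch of such a reduced-dimensionality projection, assuming a hypothetical descriptor matrix, is shown below:

```python
# Minimal sketch of reduced-dimensionality visualization: project a hypothetical
# high-dimensional descriptor matrix onto its first three principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
descriptors = rng.normal(size=(50, 12))  # 50 "chemicals" x 12 "descriptors" (toy data)

pca = PCA(n_components=3)
coords = pca.fit_transform(descriptors)  # 50 x 3: one 3D point per chemical

# Nearby points in this reduced Euclidean space suggest similar chemicals;
# coords can be passed to any 3D scatter-plot routine for a 3D-PCA plot.
print(pca.explained_variance_ratio_)  # variance captured by each component
```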

Exposure science and pharmacokinetics
Detection of a chemical merely establishes its presence in the environment; presence alone does not dictate the effect on ecosystem and human health. Analogously, exposure to a chemical does not necessarily mean that an adverse effect (such as toxicity or disease) will arise. A complex relationship exists among many determining factors, including the physicochemical properties of the chemical, its concentration in the environment, its subsequent fate and transport within both biology and the environment, and the discrete exposure-related behaviors (that is, time-activity patterns) of the biological receptor (such as nontarget wildlife species and susceptible individuals/populations). Understanding these factors is a primary concern of exposure science, which seeks to understand the continuum of processes from a chemical source to a tissue dose within an organism. The range of predicted physicochemical properties for new and emerging contaminants, however, may influence these key factors, making efforts to determine chemical similarity and associated properties with predictive analytics a critical step in characterizing these chemicals.
Environmental fate and transport, as well as their biological analogue, pharmacokinetics/pharmacodynamics, are described by the chemical's interactions within the system. In the pharmaceutical sciences, simple pharmacokinetic absorption-distribution-metabolism-elimination (ADME) rules of thumb are commonly used as selection criteria for screening drug candidates quickly and efficiently. The most famous of these is Lipinski's Rule of Five (RO5) and its subsequent variations, which seek to identify "druglikeness" in candidate compounds (that is, orally active drugs for humans) based on their permeability/absorption into the body. As with many generalizations, it is far from perfect, with limitations stemming from its inability to cover all of drug space (that is, domain-of-applicability issues arising from the use of only four simple molecular descriptors). But as a screening tool, it was transformative for the science, successfully showing that drug permeability could be screened on the basis of simple molecular descriptors, thus reducing the pool of candidate drugs cheaply and efficiently.
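Because the commonly cited RO5 thresholds involve only four simple descriptors (molecular weight ≤ 500, log P ≤ 5, no more than 5 hydrogen-bond donors, and no more than 10 hydrogen-bond acceptors), such a screen reduces to a few comparisons. The sketch below uses hypothetical input values for a candidate compound:

```python
# Minimal sketch of a Lipinski Rule of Five (RO5) screen, using the commonly
# cited thresholds. The candidate compound's values are hypothetical.
def passes_ro5(mol_weight: float, log_p: float,
               h_donors: int, h_acceptors: int) -> bool:
    """Flag likely oral 'druglikeness' from four simple molecular descriptors."""
    return (mol_weight <= 500
            and log_p <= 5
            and h_donors <= 5
            and h_acceptors <= 10)

# Hypothetical candidate: MW 320, log P 2.1, 2 H-bond donors, 5 acceptors.
print(passes_ro5(320.0, 2.1, 2, 5))  # True -> retain for further screening
```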
From a human exposure perspective, ADME concepts can be used to characterize the exposure potentials of chemicals based on a rate-limiting-step assumption about how chemicals enter and exit the body. If we assume that ADME, one step along the source-to-outcome continuum, describes the biological process whereby a chemical crosses the body's barriers (absorption), is metabolized and distributed, and exits the body (elimination), then a simplified binary (fast/slow) diagram can illustrate the effect on exposure-dose relationships, categorizing 16 unique scenarios (2⁴ possible combinations) or dose categorizations (Fig. 3). In these scenarios, two dominant exposure-dose themes are observed: (1) absorption limited (AL) via slow absorption and (2) elimination limited (EL) via fast absorption into the body. In this thought experiment, one could flag potential chemicals of concern based on their ability to enter the body quickly and exit it slowly. For example, dose categories 13, 14, 15, and 16 would be of the highest concern, given that elimination is slow and absorption is fast; conversely, dose categories 1, 2, 3, and 4 would be of the lowest concern, based on slow absorption and fast elimination. Assuming simple metabolic clearance (that is, no metabolic activation of toxicity pathways), inclusion of metabolism would delineate each category further by faster/slower metabolism, which would result in quicker/slower clearance of a chemical, thus reducing/increasing its dose at a target tissue. Since all steps in the ADME process can be influenced by a chemical's physicochemical properties, generic pharmacokinetic modeling should be used when possible to give context to the relative mappings of potential ADME behaviors alongside their predicted molecular descriptors.
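The thought experiment itself is easy to reproduce. The sketch below enumerates all 2⁴ fast/slow combinations and flags the two dominant exposure-dose themes (the category numbering here is illustrative and need not match Fig. 3 exactly):

```python
# Minimal sketch of the binary (fast/slow) ADME thought experiment described
# above: enumerate the 2^4 = 16 dose categorizations and flag concern levels.
from itertools import product

steps = ("absorption", "distribution", "metabolism", "elimination")

for i, rates in enumerate(product(("fast", "slow"), repeat=4), start=1):
    scenario = dict(zip(steps, rates))
    if scenario["absorption"] == "fast" and scenario["elimination"] == "slow":
        concern = "highest (enters quickly, exits slowly)"
    elif scenario["absorption"] == "slow" and scenario["elimination"] == "fast":
        concern = "lowest (enters slowly, exits quickly)"
    else:
        concern = "intermediate"
    print(f"category {i:2d}: {scenario} -> {concern}")
```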

Conclusions
Cheminformatics techniques are typically far less resource-intensive to apply than experimental approaches, yet they provide key insights into the nature of chemicals, especially in the context of knowledge discovery. For many contaminants, there are insufficient data available to parameterize models and perform the necessary risk-assessment studies. Data-mining and informatics-based approaches allow us to induce predictive models and to understand the potential chemical similarities/dissimilarities of new and emerging contaminants, and thereby their relevance to the environment and to public health. However, care should also be taken when considering exposure-dose relationships, especially with respect to pharmacokinetic ADME concepts. As more information becomes available via HTS studies, the associative power of these predictive models should become richer and more detailed, improving on the current state of the science. Ultimately, the ability to rapidly characterize the presence of new and emerging chemicals, as well as their effects on individuals, populations, and ecosystems, will have beneficial implications for both exposure risk assessment and risk mitigation.
[Disclaimer: The United States Environmental Protection Agency through its Office of Research and Development funded and managed the research described here. It has been subjected to Agency review and approved for publication. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.]
See also: Algorithm; Chemometrics; Computational chemistry; Data mining; Data reduction; Database management system; Environmental engineering; Environmental toxicology; Hazardous waste; Molecular simulation; Mutagens and carcinogens; Neural network; Pharmacology; Risk assessment and management; Toxicology; Trophic ecology