Self-organising maps and correlation analysis as a tool to explore patterns in excitation-emission matrix data sets and to discriminate dissolved organic matter fluorescence components.

Ejarque-Gonzalez E, Butturini A - PLoS ONE (2014)

Bottom Line: SOM is a pattern recognition method which clusters input EEMs and reduces their dimensionality without relying on any assumption about the data structure. According to our results, chemical industry effluents appeared to have unique and distinctive spectral characteristics. We conclude that SOM coupled with a correlation analysis procedure is a promising tool for studying large and heterogeneous EEM data sets.


Affiliation: Departament d'Ecologia, Facultat de Biologia, Universitat de Barcelona, Barcelona, Catalunya, Spain.

ABSTRACT
Dissolved organic matter (DOM) is a complex mixture of organic compounds, ubiquitous in marine and freshwater systems. Fluorescence spectroscopy, by means of Excitation-Emission Matrices (EEMs), has become an indispensable tool to study DOM sources, transport and fate in aquatic ecosystems. However, the statistical treatment of large and heterogeneous EEM data sets still represents an important challenge for biogeochemists. Recently, Self-Organising Maps (SOM) have been proposed as a tool to explore patterns in large EEM data sets. SOM is a pattern recognition method which clusters input EEMs and reduces their dimensionality without relying on any assumption about the data structure. In this paper, we show how SOM, coupled with a correlation analysis of the component planes, can be used both to explore patterns among samples and to identify individual fluorescence components. We analysed a large and heterogeneous EEM data set, including samples from a river catchment collected under a range of hydrological conditions, along a 60-km downstream gradient, and under the influence of different degrees of anthropogenic impact. According to our results, chemical industry effluents appeared to have unique and distinctive spectral characteristics. On the other hand, river samples collected under flash flood conditions showed homogeneous EEM shapes. The correlation analysis of the component planes suggested the presence of four fluorescence components, consistent with DOM components previously described in the literature. A remarkable strength of this methodology was that outlier samples appeared naturally integrated in the analysis. We conclude that SOM coupled with a correlation analysis procedure is a promising tool for studying large and heterogeneous EEM data sets.


Figure 2. Summary of the methodology applied in this study. A) N initial samples are reduced to M prototype EEMs by SOM analysis. B) EEM prototypes are clustered to facilitate exploration of the relationships between the sample EEMs. C) SOM is performed on the correlation matrix of the component planes of the Q-mode SOM analysis. The output corresponds to an aggregation of highly correlated wavelength coordinates in a single neuron unit. D) Neuron units are clustered in order to find groups of highly correlated wavelength coordinates. E) Wavelength coordinate clusters are displayed in an EEM optical space in order to evaluate their biogeochemical meaning. Adapted and extended from Vesanto and Alhoniemi [38].
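Step C of the workflow takes the correlation matrix of the component planes as its input. A minimal sketch of how such a matrix could be built is given below, assuming the trained Q-mode map is available as a codebook of M prototype vectors with one value per wavelength coordinate. The function names and the toy codebook are hypothetical and are not taken from the original study; treating a constant component plane as uncorrelated is also an assumption made only for this illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant component plane is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def component_plane_correlations(codebook):
    """Correlate every pair of component planes of a SOM codebook.

    codebook: list of M prototype vectors, each holding P wavelength coordinates.
    The component plane of coordinate p is the p-th value of every map unit.
    Returns a P x P matrix (list of lists) of Pearson correlations.
    """
    n_coords = len(codebook[0])
    planes = [[unit[p] for unit in codebook] for p in range(n_coords)]
    return [
        [pearson(planes[i], planes[j]) for j in range(n_coords)]
        for i in range(n_coords)
    ]


# Toy codebook: 4 map units x 3 wavelength coordinates (hypothetical values).
toy_codebook = [
    [0.1, 0.2, 0.9],
    [0.2, 0.4, 0.7],
    [0.3, 0.6, 0.5],
    [0.4, 0.8, 0.3],
]
corr = component_plane_correlations(toy_codebook)
```

In this toy case the first two coordinates rise together across the map while the third falls, so the first pair would end up in the same group of wavelength coordinates in steps C and D and the third would not.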

Mentions: SOM analysis was conducted using the Kohonen package for R [37]. The successive steps undertaken in our computations are conceptualised in the flow diagram shown in Figure 2. EEMs were pre-processed by normalising their fluorescence intensity by their maximum, in order to remove the effects of changes in concentration and focus specifically on qualitative variations [43]. The input matrix for the SOM analysis in the Q-mode contained 270 linearised EEMs with fluorescence data from 366 λex–λem coordinate pairs (Figure 2A). The output layer was a hexagonal grid (Figure 2B). Its size was chosen to be the largest size that ensured stability of the quantization error [44]. In addition, dimensions were set to preserve the proportions of the two largest eigenvalues of the covariance matrix of the input data [19], [45]–[47]. During the training phase, the learning rate decreased linearly from 0.05 to 0.01. The initial neighbourhood size included two-thirds of all distances of the map units, and decreased linearly during the first third of the iterations. After that, only the winning unit was adapted. In order to emphasise dissimilarities between the neurons of the SOM grid, a hierarchical cluster analysis with complete linkage was performed using the Lance-Williams update formula [48].
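The pre-processing and training schedule described above can be sketched in a few lines. The following is an illustration only, written in Python rather than R, and it does not reproduce the internals of the Kohonen package. The end points of the learning rate (0.05 and 0.01) come from the text; the function names, the assumption that the neighbourhood radius reaches zero exactly at one third of the iterations, and the example values are hypothetical. The last function is the standard Lance-Williams update for complete linkage, which reduces to taking the larger of the two distances.

```python
def normalise_by_max(eem):
    """Divide every intensity of a linearised EEM by its maximum intensity."""
    peak = max(eem)
    if peak <= 0:
        raise ValueError("EEM has no positive intensity to normalise by")
    return [value / peak for value in eem]


def learning_rate(step, n_steps, start=0.05, end=0.01):
    """Linear decay of the learning rate from `start` to `end` over training."""
    if n_steps < 2:
        return start
    fraction = step / (n_steps - 1)
    return start + (end - start) * fraction


def neighbourhood_radius(step, n_steps, initial_radius):
    """Radius shrinks linearly during the first third of the iterations.

    Assumption: the radius reaches zero at one third of training, after which
    a radius of 0.0 means that only the winning unit is adapted.
    """
    shrink_steps = n_steps / 3
    if step >= shrink_steps:
        return 0.0
    return initial_radius * (1 - step / shrink_steps)


def complete_linkage_update(d_ki, d_kj):
    """Lance-Williams update for complete linkage.

    Returns the distance from cluster k to the merger of clusters i and j,
    given the distances d(k, i) and d(k, j). Equivalent to max(d_ki, d_kj).
    """
    return 0.5 * d_ki + 0.5 * d_kj + 0.5 * abs(d_ki - d_kj)


# Hypothetical linearised EEM with three wavelength coordinates.
normalised = normalise_by_max([2.0, 4.0, 1.0])
```

After normalisation every EEM has a maximum of 1, so two samples that differ only in DOM concentration map to the same input vector, which is the stated purpose of the pre-processing step.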

