Auxiliary Material for Paper 2008GC002314


Mineral phase analysis of deep-sea hydrothermal particulates by a Raman Spectroscopy Expert Algorithm: Toward autonomous in situ experimentation and exploration


J. A. Breier, C. R. German, and S. N. White
Woods Hole Oceanographic Institution, Woods Hole, Massachusetts, USA 


Breier, J. A., C. R. German, and S. N. White (2009), Mineral phase analysis of deep-sea hydrothermal particulates by a Raman Spectroscopy Expert Algorithm: Toward autonomous in situ experimentation and exploration, Geochem. Geophys. Geosyst., 10, Q05T05, doi:10.1029/2008GC002314.


Introduction:
RaSEA is designed for automated point counting based on Raman spectroscopy. Many 
spectra must be analyzed for one measurement. The general data management 
philosophy is to keep all raw spectra files for one measurement in a single 
directory and to create one composite database of the raw spectra plus their 
processing results. This composite database is kept in the same directory 
as the raw files. Most RaSEA functions are designed to be called from within a 
batch file that processes each spectrum sequentially.
 
Requirements 
Matlab 7.4 with the Wavelet and Optimization Toolboxes (cwtpeakfind uses the 
cwt function in the Wavelet Toolbox; patternmatch uses the lsqcurvefit function, 
which MathWorks distributes in the Optimization Toolbox).
 
Licensing 
The RaSEA functions and database are released under the GNU General Public 
License included in the release.
 
Function List 
database: deepsea_vents_v1.txt	 
Being developed for deep-sea hydrothermal vent mineral assemblages. Contains the 
compound specific peak parameters used to identify and match compound peak sets 
to unknown samples. 
 
Main Functions 
 
examinespec: 
Analyzes a spectrum to determine significant peak locations, fit a baseline, and 
estimate peak sizes and other metrics (e.g. fluorescence index). Saves results 
with the original spectrum in a single .mat file. Best use is to call this from a 
batch file to analyze multiple spectra.
 
identifyspec: 
Matches a spectrum, previously processed by examinespec, to a compound in the 
RaSEA database. Saves results with the original spectrum in a single .mat file. 
Best use is to call this from a batch file to identify multiple spectra. 

Subfunctions
 
raseasettings: 
Contains all the settings used by RaSEA functions. This subfunction is called by 
other RaSEA functions; settings can be changed by editing the file directly.
 
cwtpeakfind: 
Identifies peaks based on a continuous wavelet transform. 
 
peakcheck: 
Checks peak validity near the notch filter baseline shift. 
 
raseabaseline: 
Fits a smooth baseline to a spectrum.
 
peaklookup: 
Examines the database for potential peak matches. 
 
patternmatch: 
Performs a constrained curve fit for all potential matches and determines the 
best match with the unknown spectrum. 
 
raseapeaks: 
A mixed Gaussian/Lorentzian shape function for individual peaks. 
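
The mixed Gaussian/Lorentzian shape named above can be illustrated with a 
common pseudo-Voigt form. This is only a sketch: the exact expression coded in 
raseapeaks is not given in this supplement, and the variable names and 
numerical values below are assumptions chosen for illustration. 

```matlab
% Hypothetical pseudo-Voigt peak: a linear mix of a Lorentzian and a Gaussian
% that share one center, height, and full width at half height (FWHH).
% This is a common form, not necessarily the one coded in raseapeaks.
center = 380;    % Raman shift of the peak center, cm-1 (illustrative)
height = 1.0;    % peak height (illustrative)
fwhh   = 12;     % full width at half height, cm-1 (illustrative)
fl     = 0.5;    % fraction Lorentzian: 0 = pure Gaussian, 1 = pure Lorentzian

x = (300:0.5:460)';                    % Raman shift axis, cm-1
u = (x - center) ./ fwhh;              % offset in units of FWHH
lorentz = 1 ./ (1 + 4 .* u.^2);        % unit-height Lorentzian
gauss   = exp(-4 .* log(2) .* u.^2);   % unit-height Gaussian
y = height .* (fl .* lorentz + (1 - fl) .* gauss);

% Both components fall to half height at center +/- fwhh/2, so the mix does too.
halfpoint = interp1(x, y, center + fwhh/2);
fprintf('Value at half-width point: %.3f of peak height\n', halfpoint / height);
```

In this form the fraction Lorentzian and the full width at half height play the 
roles that the icfl and icfwhh database columns appear to describe. 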
 
Utilities 
 
zscorenan: 
Normalizes a data vector to unit standard deviation centered about zero - 
similar to Matlab zscore but ignores NaN values. 
 
makecol: 
Takes any 1-D vector and rewrites it in a vertical format. 
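
As a rough illustration of the two utilities described above, the following 
sketch reshapes a vector into a column and normalizes it while ignoring NaN 
values. It is not the toolkit source; in particular, the use of the sample 
(N-1) standard deviation is an assumption made to mirror Matlab zscore. 

```matlab
% Hypothetical re-creation of the two utility steps; not the RaSEA source.
v = [2 4 NaN 6 8];            % row vector with a missing value

% makecol-style step: force any 1-D vector into a column.
vcol = v(:);

% zscorenan-style step: center and scale using only the non-NaN entries.
ok = ~isnan(vcol);
mu = mean(vcol(ok));
sigma = std(vcol(ok));        % sample standard deviation (N-1), assumed
z = (vcol - mu) ./ sigma;     % NaN entries stay NaN

disp(z');
```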
 
Batch Processing Examples 

raseabatch: 
An example batch file illustrating the sequential use of the RaSEA functions. 

directoryprocess: 
An example batch file using examinespec to process a set of spectral files. 

flat2struct: 
Assembles the spectral database. 

directoryidentify: 
An example batch file using identifyspec to identify a set of spectral files. 

raseaplots: 
A batch file to graphically review results of multiple spectra. 

identifysummary: 
An example batch file used to tabulate the results of a set of RaSEA analyses.
 
Example Data 
 
/data/dirnames.dat 
/data/py_50_goe_50 
 
A 100-spectrum point count of a 1:1 mixture of pyrite and goethite (90-250 um 
particulates), a sample of the data from Table 3. The file dirnames.dat is an 
example of how multiple such subdirectories are processed; all but the 
subdirectory included with this supplement are commented out.
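
A list file of this kind might be read as sketched below. The layout of 
dirnames.dat is not documented here, so the sketch assumes one subdirectory 
name per line and '%' as the comment character; the commented-out names 
py_100 and goe_100 are hypothetical. 

```matlab
% Write a small stand-in for dirnames.dat (contents are illustrative only).
listfile = [tempname '.dat'];
fid = fopen(listfile, 'w');
fprintf(fid, '%% py_100\n');          % hypothetical commented-out entry
fprintf(fid, 'py_50_goe_50\n');
fprintf(fid, '%% goe_100\n');         % hypothetical commented-out entry
fclose(fid);

% Read the list, keeping only lines that are not blank or commented out.
% ASSUMPTION: '%' marks a commented-out entry.
dirs = {};
fid = fopen(listfile, 'r');
tline = fgetl(fid);
while ischar(tline)
    tline = strtrim(tline);
    if ~isempty(tline) && tline(1) ~= '%'
        dirs{end+1, 1} = tline;
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(listfile);

% Each remaining entry names one measurement directory to process.
for k = 1:numel(dirs)
    fprintf('Would process directory: %s\n', dirs{k});
end
```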
 
Database format: 
The RaSEA database is a tab-delimited text file. Each spectral peak is given one 
line. A compound can have an unlimited number of spectral peaks, but only two of 
them are designated as primary and secondary. The order of the peaks is 
arbitrary; RaSEA functions sort the database entries by Raman shift prior to 
use. For best performance, databases should be developed for specific 
applications and contain just the compounds that may be encountered in the 
samples, which minimizes misidentification rates. The following is an 
explanation of each column in the database:
 
Column: 		Definition 
 
compound: 	The compound the peak belongs to. 
 
rs:			The median Raman shift of the peak. 
 
rsmean:		The mean Raman shift of the peak. 
 
rssd:		When known, the standard deviation of the peak Raman shift; 
otherwise a range of 5 cm-1 is a good initial estimate. 
 
ppi:			The intensity of the peak relative to the primary peak in the 
spectrum. 
 
ppimin:		The minimum relative peak intensity. 
 
ppimax:		The maximum relative peak intensity. 
 
distinct:		(Y)es or (N)o, as to whether the peak is distinct enough to 
use for peak lookup; regardless, all peaks are used for curve fitting unless the 
curve fitting terms are omitted. 
 
peakorder:	1, 2, or 3 for primary, secondary, or tertiary peak. 
 
icfwhh:		Curve fitting initial condition for full width half height. 
 
icfwhhmin:	Curve fitting minimum full width half height. 
 
icbfwhhmax:	Curve fitting maximum full width half height. 
 
icfl:		Curve fitting initial condition for fraction Lorentzian. 
 
icbflmin:		Curve fitting minimum fraction Lorentzian. 
 
icbflmax:		Curve fitting maximum fraction Lorentzian. 
 
ll:			Curve fitting lower Raman shift regional bound. 
 
ul:			Curve fitting upper Raman shift regional bound. 
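
Reading such a tab-delimited file and sorting it by Raman shift could look 
like the sketch below. It is not the toolkit's own loader. It assumes a header 
row, uses only four of the columns listed above to stay short, and fills them 
with placeholder compound names and values rather than real database entries. 

```matlab
% Build a tiny stand-in database with a header row and four of the columns.
% ASSUMPTIONS: header row present; names and values are placeholders.
dbfile = [tempname '.txt'];
fid = fopen(dbfile, 'w');
fprintf(fid, 'compound\trs\trssd\tpeakorder\n');
fprintf(fid, 'compoundA\t430\t5\t2\n');
fprintf(fid, 'compoundB\t300\t5\t1\n');
fprintf(fid, 'compoundA\t350\t5\t1\n');
fclose(fid);

% Read the tab-delimited columns: one string column, three numeric columns.
fid = fopen(dbfile, 'r');
cols = textscan(fid, '%s %f %f %f', 'Delimiter', '\t', 'HeaderLines', 1);
fclose(fid);
delete(dbfile);

% Sort every column by Raman shift, since the file order is arbitrary.
[rs, order] = sort(cols{2});
compound  = cols{1}(order);
rssd      = cols{3}(order);
peakorder = cols{4}(order);

for k = 1:numel(rs)
    fprintf('%-10s rs = %g (sd %g) cm-1, peak order %d\n', ...
        compound{k}, rs(k), rssd(k), peakorder(k));
end
```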
 
Key Function Descriptions: 
 
examinespec 
[experiment,nri,analysis,baseline,xc,yc] = examinespec(meta,pow,rs,ri,file) 
 
Inputs:  
 
meta: Any collection of meta data relevant to the measurements. Currently RaSEA 
uses a meta data block produced by Kaiser Optics Hologram Raman acquisition 
software, but only the exposure meta data are used in calculations. 

pow: Laser power used during measurements. This isn't automatically included in 
the meta variable, so it is added here.
 
rs: The Raman shifts of the spectra being identified (the x-axis). 
 
ri: The Raman intensities of the spectra being identified (the y-axis). 
 
file: The file name being processed for recording in the data structure. 
 
Output:	 
	 
experiment: 	Meta data worth retaining. 
 
nri:          	Normalized Raman intensity. 
 
analysis:     	A struct variable including the identified peaks, peak 
intensities, and other spectra statistics. 
 
baseline:     	The baseline. 
 
xc:           	A subset of the original Raman shift. 
 
yc:           	A subset of the original Raman intensity. 
 
The most important part of the output is the analysis.peaks portion of the data 
structure:
 
analysis.peaks.wnum: The Raman shift of the identified peaks. 
 
analysis.peaks.snr: The signal to noise ratio of the peak. 
 
analysis.peaks.rsnr: The signal to noise ratio of the peak based on the 
normalized Raman intensity. 
 
analysis.peaks.height: Absolute peak intensity. 
 
analysis.peaks.base: Baseline height of the peak. 
 
analysis.peaks.size: Peak size as estimated by the continuous wavelet transform. 
 
analysis.peaks.relsize: Peak size relative to the largest peak. 
 
analysis.peaks.fwhh: Peak full width half height estimate. 
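
The peak fields listed above might be used as in the following sketch, which 
keeps peaks above a signal to noise threshold and ranks them by relative size. 
The placeholder values, the threshold, and the assumption that each field holds 
one vector entry per peak are illustrative, not taken from the toolkit. 

```matlab
% Hypothetical stand-in for analysis.peaks; the values are placeholders and the
% vector-per-field layout is an assumption.
peaks.wnum    = [250 380 540 690];   % Raman shift of each peak, cm-1
peaks.snr     = [4 35 9 2];          % signal to noise ratio
peaks.relsize = [0.1 1.0 0.3 0.05];  % size relative to the largest peak

minsnr = 3;                          % illustrative threshold, not a RaSEA setting
keep = peaks.snr >= minsnr;          % logical mask of peaks to retain

% Rank the retained peaks from largest to smallest relative size.
wnum    = peaks.wnum(keep);
relsize = peaks.relsize(keep);
[relsize, order] = sort(relsize, 'descend');
wnum = wnum(order);

fprintf('Largest retained peak at %g cm-1\n', wnum(1));
```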
 
identifyspec 
[identified] = identifyspec(peakset,rs,ri,baseline,alglevel,sublib) 
 
Inputs:  
 
peakset: The peak parameters (e.g. location, height) generated by examinespec. 
 
rs: Raman shift (the x-axis) - a data vector. 
 
ri: Raman intensity (the y-axis) - same length as rs. 
 
baseline:	Baseline (the y-axis with the peaks removed) - same length as rs. 
 
alglevel: 1 for fast but less accurate one-step pattern matching; 2 for slower, 
higher accuracy two-step pattern matching. 

sublib: 'full' considers all compounds in the database as possible matches; 
'pyrite','chalcopyrite', for example, limits possible matches to those specific 
compounds. Limiting the list is more accurate if other compounds can be 
excluded based on other data (e.g. chemical).
 
Output: A struct variable of the pattern matching results. The first two items 
(identified.compounds and identified.count) give the result; all other items 
are for reference.
 
identified.compounds: The list of compounds evaluated as possible matches. 
Subsequent variables are matched to this order. 
 
identified.count: The decision list, 1 for matched, 0 for not matched. 
 
identified.maxpeak: Whether the compound includes the most intense peak in the 
unknown peakset. 
 
identified.secpeak: Whether the compound includes the second most intense peak 
in the unknown peakset. 
 
identified.weights: The weights list, total peak lookup score. 
 
identified.w1: A score if the compound primary peak was found. 
 
identified.w2: A score if the compound secondary peak was found. 
 
identified.w3: A score that increases for every compound tertiary peak found. 
 
identified.w4: A score that increases each time the spacing matches between 
adjacent peaks for a database compound.
 
identified.w5: A score if either the 1st or 2nd most intense unknown peak is one 
of the database compound peaks. 
 
identified.w6: A score that increases each time the relative intensity 
difference matches between adjacent peaks for a database compound.
 
identified.mainpeakheight: Intensity of the most intense peak in the unknown 
peakset. 
 
identified.secpeakheight: Intensity of the 2nd most intense peak in the unknown 
peakset. 
 
identified.mainpeakbase: Baseline intensity of the most intense peak in the 
unknown peakset. 
 
identified.secpeakbase: Baseline intensity of the 2nd most intense peak in the 
unknown peakset. 
 
identified.mainpeakrs: Raman shift of the most intense peak in the unknown 
peakset. 
 
identified.secpeakrs: Raman shift of the 2nd most intense peak in the unknown 
peakset.
 
identified.cmperr: The composite error of a curve fit, if performed for a 
database compound. The composite error is the sum of the root mean square error, 
the correlation coefficient, the differences in Raman shift and intensity 
between the curve fit and the spectrum at just the peaks, and the differences in 
Raman shift, intensity, and width for just the most intense peak. The smallest 
error is the best match.
 
identified.rmse: The fit root mean square error. 
 
identified.rsq: The fit correlation coefficient. 
 
identified.mfitrs: The difference in Raman shift, for just the peaks, between 
the curve fit and the unknown spectrum. 

identified.mfitri: The difference in intensity, for just the peaks, between the 
curve fit and the unknown spectrum. 

identified.mpmfitrs: The difference in Raman shift, for the most intense peak, 
between the curve fit and the unknown spectrum. 

identified.mpmfitri: The difference in intensity, for the most intense peak, 
between the curve fit and the unknown spectrum. 

identified.mpmfitfwhh: The difference in full width half height, for the most 
intense peak, between the curve fit and the unknown spectrum. 

identified.spmfitrs: The difference in Raman shift, for the 2nd most intense 
peak, between the curve fit and the unknown spectrum. 

identified.spmfitri: The difference in intensity, for the 2nd most intense peak, 
between the curve fit and the unknown spectrum.
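
The way the result fields relate to one another can be sketched as follows. 
The struct below is a hypothetical stand-in with placeholder compound names and 
values, and the use of NaN to mark a compound with no curve fit is an 
assumption, not documented behavior. 

```matlab
% Hypothetical stand-in for the identifyspec output; values are placeholders.
identified.compounds = {'compoundA', 'compoundB', 'compoundC'};
identified.count     = [0 1 0];          % decision list: 1 = matched
identified.cmperr    = [2.7 0.4 NaN];    % composite error; NaN = no fit (assumed)

% The decision list gives the result directly.
matched = identified.compounds(identified.count == 1);

% For reference, the candidate with the smallest composite error is the best fit.
[besterr, idx] = min(identified.cmperr);  % min ignores NaN entries
bestfit = identified.compounds{idx};

fprintf('Matched: %s (smallest composite error %.2f for %s)\n', ...
    matched{1}, besterr, bestfit);
```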
 
Example:  
s is the struct variable created from a set of spectra. 
This function is normally used in a loop that processes all spectra in s. 

v = 1;  % the first spectrum in s
 
s(v,1).analysis.identified = identifyspec(... 
			[s(v,1).analysis.peaks],...	% Significant peak statistics 
			[s(v,1).rs],...			% Spectra Raman shift. 
			[s(v,1).ri],...			% Spectra Raman intensity. 
			[s(v,1).baseline],...		% Spectra fitted baseline. 
			2,...				% Full matching algorithm. 
			'full');				% Uses entire database. 
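
The loop mentioned above might be arranged as in this sketch. It builds a 
one-element placeholder for s so that it runs without the toolkit, and it only 
calls identifyspec when that function is found on the Matlab path; the 
placeholder contents and the guard are assumptions added for illustration. 

```matlab
% ASSUMPTION: s is an N-by-1 struct array laid out as in the example above.
% A one-element placeholder is built here so the loop runs without the toolkit.
s = struct('rs', (100:110)', 'ri', zeros(11, 1), 'baseline', zeros(11, 1), ...
           'analysis', struct('peaks', []));

for v = 1:size(s, 1)
    if exist('identifyspec', 'file') == 2
        % Same call as the single-spectrum example, applied to spectrum v.
        s(v,1).analysis.identified = identifyspec( ...
            s(v,1).analysis.peaks, s(v,1).rs, s(v,1).ri, ...
            s(v,1).baseline, 2, 'full');
    else
        s(v,1).analysis.identified = [];   % toolkit not on the path
    end
end
```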
 
raseasettings 
[] = raseasettings 
 
Inputs: No direct inputs, called directly by the RaSEA functions. Modify the 
variables documented within the function to modify the behavior of the program. 
 
Outputs: Calling the function populates the variable space with the control 
variables. 
 
The raseasettings function contains the user adjustable settings for all the 
RaSEA functions. The settings can be altered to tailor the toolkit for specific 
applications, specifically different spectral ranges, peak sensitivity, 
detection thresholds, and data locations.  
 
raseaplots 
[] = raseaplots(dir) 
 
Inputs:  
 
dir:	The directory within the current directory that contains the point counts 
to be reviewed. 
 
Output: Graphical 
 
identifysummary 
[cnts, compounds] = identifysummary 
 
This is an example for use with the data provided by RaSEA. It builds a summary 
table of counts for each database compound. It operates on the subdirectories 
listed in dirnames.dat that are contained in the current directory. The order in 
which the database compounds appear in the table is hardwired at the end of this 
function. Adjust the order for your needs.
 
Inputs: none 
 
Outputs: 
 
cnts: A summary table of counts for each database compound. Each row of the 
table is the result for one of the subdirectories in the parent directory. The 
last element of each row is the total number of spectra counted.  
 
compounds: The compounds that represent the column headings of cnts. 
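
A count table of this shape can be converted to percentages by dividing each 
row by its last element, as sketched below. The compound names, the counts, 
and the presence of an 'unidentified' column are placeholders chosen for 
illustration, not results from the example data. 

```matlab
% Hypothetical identifysummary output for one subdirectory; placeholder values.
compounds = {'compoundA', 'compoundB', 'unidentified'};
cnts = [46 51 3 100];          % last element = total spectra counted

ncomp = numel(compounds);
total = cnts(:, end);          % one total per row (subdirectory)
% Divide each row of counts by that row's total to get percentages.
pct = 100 .* cnts(:, 1:ncomp) ./ repmat(total, 1, ncomp);

for k = 1:ncomp
    fprintf('%-14s %5.1f %%\n', compounds{k}, pct(1, k));
end
```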
 
Example processing with RaSEA functions 
 
Example data is included with the toolbox. There is a folder in the /exampledata 
parent directory that contains the spectra acquired during a 100 point count 
compositional analysis of a 1:1 binary mixture of pyrite and goethite mineral 
standards; these are the same spectra collected for the relevant portion of 
Table 3. The sequence of commands in raseabatch will process and identify the 
spectra. The result will be a struct variable saved in the measurement directory 
as a .mat file named for the directory. The results can be viewed graphically 
using the raseaplots function. The results can be summarized using 
identifysummary.