Auxiliary Material for Paper 2008GC002314

Mineral phase analysis of deep-sea hydrothermal particulates by a Raman Spectroscopy Expert Algorithm: Toward autonomous in situ experimentation and exploration

J. A. Breier, C. R. German, and S. N. White
Woods Hole Oceanographic Institution, Woods Hole, Massachusetts, USA

Breier, J. A., C. R. German, and S. N. White (2009), Mineral phase analysis of deep-sea hydrothermal particulates by a Raman Spectroscopy Expert Algorithm: Toward autonomous in situ experimentation and exploration, Geochem. Geophys. Geosyst., 10, Q05T05, doi:10.1029/2008GC002314.

Introduction:
RaSEA is designed for automated point counting based on Raman spectroscopy. Many spectra must be analyzed for one measurement. The general data management philosophy is to keep all raw spectra files for one measurement in a single directory, and to create one composite database of the raw spectra plus their processing results. This composite database is kept in the same directory as the raw files. Most RaSEA functions are designed to be called from within a batch file that processes each spectrum sequentially.

Requirements
Matlab 7.4 and the Wavelet and Curve Fitting Toolboxes (cwtpeakfind uses the cwt function in the Wavelet Toolbox; patternmatch uses the lsqcurvefit function in the Curve Fitting Toolbox).

Licensing
The RaSEA functions and database are released under the GNU Public License included in the release.

Function List

database:
deepsea_vents_v1.txt
  Being developed for deep-sea hydrothermal vent mineral assemblages. Contains the compound-specific peak parameters used to identify and match compound peak sets to unknown samples.

Main Functions
examinespec: Analyzes a spectrum to determine significant peak locations, fit a baseline, and estimate peak sizes and other metrics (e.g. fluorescence index). Saves results with the original spectrum in a single .mat file. Best use is to call this from a batch file to analyze multiple spectra.
identifyspec: Matches a spectrum, previously processed by examinespec, to a compound in the RaSEA database. Saves results with the original spectrum in a single .mat file. Best use is to call this from a batch file to identify multiple spectra.

subfunctions
raseasettings: Contains all the settings used by RaSEA functions. This subfunction is called by other RaSEA functions; settings can be changed by editing the file directly.
cwtpeakfind: Identifies peaks based on a continuous wavelet transform.
peakcheck: Checks peak validity near the notch filter baseline shift.
raseabaseline: Fits a smooth baseline to a spectrum.
peaklookup: Examines the database for potential peak matches.
patternmatch: Performs a constrained curve fit for all potential matches and determines the best match with the unknown spectrum.
raseapeaks: A mixed Gaussian/Lorentzian shape function for individual peaks.

Utilities
zscorenan: Normalizes a data vector to unit standard deviation centered about zero - similar to Matlab zscore but ignores NaN values.
makecol: Takes any 1-D vector and rewrites it in a vertical format.

batch processing examples
raseabatch: An example batch file illustrating the sequential use of the RaSEA functions.
directoryprocess: An example batch file using examinespec to process a set of spectral files.
flat2struct: Assembles the spectral database.
directoryidentify: An example batch file using identifyspec to identify a set of spectral files.
raseaplots: A batch file to graphically review results of multiple spectra.
identifysummary: An example batch file used to tabulate the results of a set of RaSEA analyses.

Example Data
/data/dirnames.dat
/data/py_50_goe_50
  100 spectra point counts of a 1:1 mixture of pyrite and goethite (90-250 um particulates) (a sample of data from Table 3). The file dirnames.dat is an example of how multiple such subdirectories are processed - all but the subdirectory included with this supplement are commented out.
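
The two utilities listed above are simple enough to approximate in a few lines. The following is a minimal sketch, not the RaSEA source: the names zscorenan_sketch and makecol_sketch are hypothetical, and it assumes the sample (N-1) standard deviation that Matlab zscore uses by default.

```matlab
% Hypothetical stand-ins for makecol and zscorenan (not the RaSEA code).
makecol_sketch = @(v) v(:);   % any 1-D vector becomes a column vector

% Center and scale using only the non-NaN entries; NaN entries stay NaN.
zscorenan_sketch = @(v) (v - mean(v(~isnan(v)))) ./ std(v(~isnan(v)));

x = [2 4 NaN 6];              % made-up row vector with one missing value
z = zscorenan_sketch(makecol_sketch(x));
disp(z)                       % z is the column [-1; 0; NaN; 1]
```

Keeping NaN entries in place preserves the alignment between the normalized intensities and the Raman shift axis.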

Database format:
The RaSEA database is a tab-delimited text file. Each spectral peak is given one line. A compound can have any number of spectral peaks, but only one can be designated as primary and one as secondary. The order of the peaks is arbitrary; RaSEA functions sort the database entries by Raman shift prior to use. For best performance, databases should be developed for specific applications and contain just the compounds that may be encountered in the samples - this minimizes the misidentification rates. The following is an explanation of each column in the database:

Column: Definition
compound: The compound the peak belongs to.
rs: The median Raman shift of the peak.
rsmean: The mean Raman shift of the peak.
rssd: When known, the standard deviation of the peak Raman shift; otherwise a range of 5 cm-1 is a good initial estimate.
ppi: The intensity of the peak relative to the primary peak in the spectrum.
ppimin: The minimum relative peak intensity.
ppimax: The maximum relative peak intensity.
distinct: (Y)es or (N)o, as to whether the peak is distinct enough to use for peak lookup; regardless, all peaks are used for curve fitting unless the curve fitting terms are omitted.
peakorder: 1, 2, or 3 for primary, secondary, or tertiary peak.
icfwhh: Curve fitting initial condition for full width half height.
icfwhhmin: Curve fitting minimum full width half height.
icbfwhhmax: Curve fitting maximum full width half height.
icfl: Curve fitting initial condition for fraction Lorentzian.
icbflmin: Curve fitting minimum fraction Lorentzian.
icbflmax: Curve fitting maximum fraction Lorentzian.
ll: Curve fitting lower Raman shift regional bound.
ul: Curve fitting upper Raman shift regional bound.
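
As an illustration of the format described above, the sketch below parses a few tab-delimited rows and sorts them by Raman shift. It is only a sketch under stated assumptions: the rows are made-up placeholders (not entries from deepsea_vents_v1.txt), only three of the columns are included, and the real RaSEA loader may work differently.

```matlab
% Made-up rows with a subset of the columns: compound, rs, peakorder.
tab = sprintf('\t');
rows = { ...
    ['mineralA' tab '379' tab '1'], ...
    ['mineralB' tab '299' tab '1'], ...
    ['mineralA' tab '343' tab '2'], ...
    ['mineralB' tab '385' tab '2']};

n = numel(rows);
compound  = cell(n, 1);
rs        = zeros(n, 1);
peakorder = zeros(n, 1);
for k = 1:n
    f = strsplit(rows{k}, tab);      % split one database line into fields
    compound{k}  = f{1};
    rs(k)        = str2double(f{2});
    peakorder(k) = str2double(f{3});
end

% Row order in the file is arbitrary, so sort every column by Raman shift.
[rs, idx] = sort(rs);
compound  = compound(idx);
peakorder = peakorder(idx);

% Compounds ordered by the Raman shift of their primary peak.
primary = compound(peakorder == 1);
```

Sorting all columns with the same index vector keeps each peak's parameters together after reordering.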

Key Function Descriptions:

examinespec
[experiment,nri,analysis,baseline,xc,yc] = examinespec(meta,pow,rs,ri,file)

Inputs:
meta: Any collection of meta data relevant to the measurements. Currently RaSEA uses a meta data block produced by Kaiser Optics Hologram Raman acquisition software, but only the exposure meta data are used in calculations.
pow: Laser power used during measurements. This isn't automatically included in the meta variable, so it is passed separately.
rs: The Raman shifts of the spectrum being identified (the x-axis).
ri: The Raman intensities of the spectrum being identified (the y-axis).
file: The file name being processed, for recording in the data structure.

Output:
experiment: Meta data worth retaining.
nri: Normalized Raman intensity.
analysis: A struct variable including the identified peaks, peak intensities, and other spectrum statistics.
baseline: The baseline.
xc: A subset of the original Raman shift.
yc: A subset of the original Raman intensity.

The most important part of the output is the analysis.peaks portion of the data:
analysis.peaks.wnum: The Raman shift of the identified peaks.
analysis.peaks.snr: The signal to noise ratio of the peak.
analysis.peaks.rsnr: The signal to noise ratio of the peak based on the normalized Raman intensity.
analysis.peaks.height: Absolute peak intensity.
analysis.peaks.base: Baseline height of the peak.
analysis.peaks.size: Peak size as estimated by the continuous wavelet transform.
analysis.peaks.relsize: Peak size relative to the largest peak.
analysis.peaks.fwhh: Peak full width half height estimate.

identifyspec
[identified] = identifyspec(peakset,rs,ri,baseline,alglevel,sublib)

Inputs:
peakset: The peak parameters (e.g. location, height) generated by examinespec.
rs: Raman shift (the x-axis) - a data vector.
ri: Raman intensity (the y-axis) - same length as rs.
baseline: Baseline (the y-axis with the peaks removed) - same length as rs.
alglevel: 1 for fast but less accurate one step pattern matching.
          2 for slower, higher accuracy two step pattern matching.
sublib: 'full' considers all compounds in the database as a possible match.
        'pyrite','chalcopyrite' for example limits possible matches to specific compounds. This is more accurate if other compounds can be excluded based on other data (e.g. chemical).

Output:
A struct variable of the pattern matching results. The first two items (compounds and count) give the result; all other items are for reference.
identified.compounds: The list of compounds evaluated as possible matches. Subsequent variables are matched to this order.
identified.count: The decision list, 1 for matched, 0 for not matched.
identified.maxpeak: Whether the compound includes the most intense peak in the unknown peakset.
identified.secpeak: Whether the compound includes the second most intense peak in the unknown peakset.
identified.weights: The weights list, total peak lookup score.
identified.w1: A score if the compound primary peak was found.
identified.w2: A score if the compound secondary peak was found.
identified.w3: A score that increases for every compound tertiary peak found.
identified.w4: A score that increases each time the spacing matches between adjacent peaks for a database compound.
identified.w5: A score if either the 1st or 2nd most intense unknown peak is one of the database compound peaks.
identified.w6: A score that increases each time the relative intensity difference matches between adjacent peaks for a database compound.
identified.mainpeakheight: Intensity of the most intense peak in the unknown peakset.
identified.secpeakheight: Intensity of the 2nd most intense peak in the unknown peakset.
identified.mainpeakbase: Baseline intensity of the most intense peak in the unknown peakset.
identified.secpeakbase: Baseline intensity of the 2nd most intense peak in the unknown peakset.
identified.mainpeakrs: Raman shift of the most intense peak in the unknown peakset.
identified.secpeakrs: Raman shift of the 2nd most intense peak in the unknown peakset.
identified.cmperr: The composite error of a curve fit, if performed for a database compound. The composite error is the sum of the root mean square error, the correlation coefficient, the differences in Raman shift and intensity at just the peaks between the curve fit and the spectrum, and the differences in Raman shift, intensity, and width for just the most intense peak. The smallest error is the best match.
identified.rmse: The fit root mean square error.
identified.rsq: The fit correlation coefficient.
identified.mfitrs: The difference in Raman shift, for just the peaks, between the curve fit and the unknown spectrum.
identified.mfitri: The difference in intensity, for just the peaks, between the curve fit and the unknown spectrum.
identified.mpmfitrs: The difference in Raman shift, for the most intense peak, between the curve fit and the unknown spectrum.
identified.mpmfitri: The difference in intensity, for the most intense peak, between the curve fit and the unknown spectrum.
identified.mpmfitfwhh: The difference in full width half height, for the most intense peak, between the curve fit and the unknown spectrum.
identified.spmfitrs: The difference in Raman shift, for the 2nd most intense peak, between the curve fit and the unknown spectrum.
identified.spmfitri: The difference in intensity, for the 2nd most intense peak, between the curve fit and the unknown spectrum.

Example:
s is the struct variable created from a set of spectra. This function is normally used in a loop that processes all spectra in s.

    v = 1;                                % The first spectrum in s.
    s(v,1).analysis.identified = identifyspec(...
        [s(v,1).analysis.peaks],...       % Significant peak statistics.
        [s(v,1).rs],...                   % Spectrum Raman shift.
        [s(v,1).ri],...                   % Spectrum Raman intensity.
        [s(v,1).baseline],...             % Spectrum fitted baseline.
        2,...                             % Full matching algorithm.
        'full');                          % Uses entire database.
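
Because the smallest composite error marks the best match, the winning compound can be read from the output struct with a single min call. The sketch below builds a stand-in struct by hand rather than calling identifyspec. It assumes, without confirmation from the RaSEA source, that cmperr is a numeric vector in the same order as compounds and holds NaN where no curve fit was performed. The error values are made up.

```matlab
% Hand-built stand-in for the identifyspec output (values are made up).
identified.compounds = {'pyrite', 'goethite', 'chalcopyrite'};
identified.cmperr    = [0.42, NaN, 1.30];   % NaN: no curve fit performed (assumed)

% min skips NaN entries, so compounds without a fit cannot win.
[besterr, best] = min(identified.cmperr);
bestcompound = identified.compounds{best};

fprintf('Best match: %s (composite error %.2f)\n', bestcompound, besterr);
```

In normal use the decision is already recorded in identified.count, so a lookup like this is mainly useful when reviewing how close the competing fits were.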

raseasettings
[] = raseasettings

Inputs:
No direct inputs; called directly by the RaSEA functions. Modify the variables documented within the function to change the behavior of the program.

Outputs:
Calling the function populates the variable space with the control variables.

The raseasettings function contains the user adjustable settings for all the RaSEA functions. The settings can be altered to tailor the toolkit for specific applications, specifically different spectral ranges, peak sensitivity, detection thresholds, and data locations.

raseaplots
[] = raseaplots(dir)

Inputs:
dir: The directory within the current directory that contains the point counts to be reviewed.

Output:
Graphical

identifysummary
[cnts, compounds] = identifysummary

This is an example for use with the data provided with RaSEA. It builds a summary table of counts for each database compound. It operates on the subdirectories listed in dirnames.dat that are contained in the current directory. The order in which the database compounds appear in the table is hardwired at the end of this function. Adjust the order for your needs.

Inputs:
none

Outputs:
cnts: A summary table of counts for each database compound. Each row of the table is the result for one of the subdirectories in the parent directory. The last element of each row is the total number of spectra counted.
compounds: The compounds that represent the column headings of cnts.

Example processing with RaSEA functions
Example data is included with the toolbox. There is a folder in the /exampledata parent directory that contains the spectra acquired during a 100 point count compositional analysis of a 1:1 binary mixture of pyrite and goethite mineral standards; these are the same spectra collected for the relevant portion of Table 3. The sequence of commands in raseabatch will process and identify the spectra. The result will be a struct variable saved in the measurement directory as a .mat file named for the directory.
The results can be viewed graphically using the raseaplots function. The results can be summarized using identifysummary.
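
The kind of table identifysummary returns can be sketched from the per-spectrum decision lists (identified.count). This is an illustration only, not the identifysummary code: the decisions matrix is made up, and the layout (one row per spectrum, one column per compound) is an assumption.

```matlab
% Column headings of the summary table.
compounds = {'pyrite', 'goethite'};

% Made-up decision lists: one row per spectrum, one column per compound,
% 1 for matched and 0 for not matched.
decisions = [1 0;
             0 1;
             1 0;
             0 0];

% One summary row for this measurement directory: the count for each
% compound, followed by the total number of spectra counted.
cnts = [sum(decisions, 1), size(decisions, 1)];
disp(cnts)   % 2 1 4
```

A spectrum with no match, such as the last row here, still contributes to the total, so the per-compound counts need not sum to the final element.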