Overview: To make the GCMS-ID webserver as user-friendly as possible, we developed a separate software package that automatically generates derivatized structures for RI prediction. With this package, users only need to provide the SMILES string or structure of the underivatized (base) compound and select the derivatization reagent, which is limited to trimethylsilyl (TMS) and tert-butyldimethylsilyl (TBDMS).

To generate structures for all possible derivatized products of each query compound, the computational derivatization script (called AUTOSILATOR) attaches TMS and/or TBDMS groups, according to the user-supplied ‘derivatization type’, at chemically appropriate positions. The script applies separate derivatization rules for silylating each class of functional group, such as acids, thiols, ketones, aldehydes and amines. It also automatically generates a name for each derivatized structure in the format “base_compound_name, n TMS/TBDMS”, where n is the total number of TMS or TBDMS groups attached to the base molecule.
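The enumeration and naming steps described above can be illustrated with a minimal sketch. It is not the AUTOSILATOR code: the function names `derivative_name` and `enumerate_derivatives` are hypothetical, the reactive sites are passed in as plain labels rather than detected from a structure, and generating every non-empty subset of sites is an assumption about what "all possible derivatized products" means.

```python
from itertools import combinations

ALLOWED_REAGENTS = ("TMS", "TBDMS")  # the two reagents named in the text


def derivative_name(base_name, n_groups, reagent):
    """Format a name as 'base_compound_name, n TMS' or '..., n TBDMS'."""
    if reagent not in ALLOWED_REAGENTS:
        raise ValueError(f"unsupported reagent: {reagent}")
    return f"{base_name}, {n_groups} {reagent}"


def enumerate_derivatives(base_name, sites, reagent):
    """Yield (name, chosen_sites) for every non-empty subset of sites.

    `sites` is a list of hypothetical labels for derivatizable positions;
    a real implementation would find them with a cheminformatics toolkit.
    """
    for n in range(1, len(sites) + 1):
        for chosen in combinations(sites, n):
            yield derivative_name(base_name, n, reagent), chosen


if __name__ == "__main__":
    # Two illustrative sites on a base compound (labels are placeholders).
    for name, chosen in enumerate_derivatives("glycine", ["N", "O"], "TMS"):
        print(name, chosen)
```

Note that several distinct products can share one name under this format, because the name records only the number of attached groups and not their positions.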

To ensure chemical viability, two filtering steps are used to remove incorrect or otherwise problematic structures. First, the software calculates the molecular weight (MW) of each generated compound and keeps only those with MW < 900 Da, which is the upper MW limit typically measurable on commercial GC-MS instruments.
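As a rough illustration of the MW filter, the sketch below estimates the MW of a derivative from the MW of the base compound and the number of attached groups. It assumes that each silyl group replaces one active hydrogen, so that each TMS group adds a net C3H8Si and each TBDMS group a net C6H14Si. The function names are hypothetical, and the actual software may compute MW directly from the generated structure instead.

```python
# Standard atomic weights (g/mol), rounded to three decimals.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "Si": 28.085}

# Net formula change per group, assuming one active H is replaced:
#   TMS:   -H + Si(CH3)3            -> +C3H8Si
#   TBDMS: -H + Si(CH3)2C(CH3)3     -> +C6H14Si
NET_FORMULA = {
    "TMS": {"C": 3, "H": 8, "Si": 1},
    "TBDMS": {"C": 6, "H": 14, "Si": 1},
}

MW_LIMIT_DA = 900.0  # cutoff stated in the text


def group_increment(reagent):
    """Net mass added by one silyl group of the given reagent."""
    formula = NET_FORMULA[reagent]
    return sum(ATOMIC_WEIGHT[el] * count for el, count in formula.items())


def derivatized_mw(base_mw, n_groups, reagent):
    """Estimated MW of the base compound carrying n_groups silyl groups."""
    return base_mw + n_groups * group_increment(reagent)


def passes_mw_filter(base_mw, n_groups, reagent, limit=MW_LIMIT_DA):
    """Keep only derivatives whose estimated MW is strictly below the limit."""
    return derivatized_mw(base_mw, n_groups, reagent) < limit


if __name__ == "__main__":
    # Glucose (MW 180.156) with five TMS groups stays under the cutoff.
    print(round(derivatized_mw(180.156, 5, "TMS"), 3))
    print(passes_mw_filter(180.156, 5, "TMS"))
```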

Second, all compounds are passed through ChEMBL to assess the validity and feasibility of the computationally generated derivative structures. The ChEMBL program is a bond/stereochemical evaluation program that automatically identifies issues with chemical structures, such as mol-InChI stereo mismatches or atoms and functional groups placed in invalid positions. ChEMBL uses this information to assign each structure a complexity value from 0 (no issues) to 9 (many issues). In our program, any generated compound with a ChEMBL complexity score greater than 5 is discarded.
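The score-based cutoff can be sketched as follows. This example does not call ChEMBL itself. It assumes that the checker reports a list of (score, message) pairs for each structure and that the overall score of a structure is the highest score among its reported issues, with 0 when nothing is reported. The function names and the issue messages in the usage example are hypothetical.

```python
SCORE_CUTOFF = 5  # structures scoring above this value are discarded


def overall_score(issues):
    """Return the highest score among (score, message) pairs, or 0 if none."""
    return max((score for score, _message in issues), default=0)


def keep_structure(issues, cutoff=SCORE_CUTOFF):
    """Keep a structure unless its overall score is greater than the cutoff."""
    return overall_score(issues) <= cutoff


def filter_structures(scored, cutoff=SCORE_CUTOFF):
    """Return names of the structures that pass the cutoff.

    `scored` maps a structure name to its list of (score, message) pairs.
    """
    return [name for name, issues in scored.items()
            if keep_structure(issues, cutoff)]


if __name__ == "__main__":
    # Illustrative input only; the messages are placeholders.
    example = {
        "compound A, 1 TMS": [],
        "compound A, 2 TMS": [(5, "stereo mismatch")],
        "compound A, 3 TMS": [(2, "minor issue"), (7, "invalid placement")],
    }
    print(filter_structures(example))
```

A score of exactly 5 is kept in this sketch, because the text discards only structures with a score greater than 5.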