Errors

The errors detected by the parser fall into three categories. The parser attempts to warn about these errors at the earliest opportunity.

  1. Syntax errors

  2. Valence errors - a type of semantic error, where an atom’s valence is not on a list of allowed valences

  3. Kekulization errors - another semantic error, where an aromatic system cannot be kekulized

Note that detection of stereochemistry-related errors is not currently supported.

The following example shows how to catch all exceptions raised by the parser with a single except statement. You may prefer to catch each of the three types of exception individually for the purpose of statistics or logging (see below for details). Avoid using a generic except: statement, as this may mask errors in your own code.

import sys
import partialsmiles as ps

try:
    ps.ParseSmiles("c1cccc1", partial=False)
except ps.Error as e:
    print(repr(e), file=sys.stderr)
    # KekulizationFailure('Aromatic system cannot be kekulized', 'c1cccc1', 5)
    print(str(e), file=sys.stderr)
    # Aromatic system cannot be kekulized
    #  c1cccc1
    #       ^
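
If per-category statistics are wanted, one option is to tally exceptions by class name. The sketch below is illustrative only: tally_errors is a hypothetical helper, not part of partialsmiles, and it takes the parse function and the base exception class as arguments so that it can be run without the library installed.

```python
from collections import Counter

def tally_errors(smiles_list, parse, base_error):
    """Count parse outcomes by exception class name (hypothetical helper)."""
    counts = Counter()
    for smi in smiles_list:
        try:
            parse(smi)
        except base_error as e:
            # Only exceptions derived from base_error are counted; anything
            # else propagates, so bugs in your own code stay visible.
            counts[type(e).__name__] += 1
        else:
            counts["ok"] += 1
    return counts
```

With partialsmiles, this could be called as tally_errors(smiles, lambda s: ps.ParseSmiles(s, partial=False), ps.Error).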

Syntax errors

The SMILES parser warns about obvious syntax errors such as illegal characters, missing brackets, unmatched parentheses and ring closure digits, and so forth.

The parser goes further than this, however. Because partialsmiles is designed to warn about errors in SMILES strings at the very earliest opportunity, by default it makes (and enforces) certain assumptions about the structure of the SMILES string. These assumptions will be true for SMILES strings generated by a cheminformatics toolkit. Given that machine-learning models are typically trained on such SMILES strings, it is reasonable for the validation procedure to enforce them.

  1. Bond closures may not span a dot (e.g. C1.C1 is rejected).

  2. Dots may not occur within parentheses (e.g. C(C.C)C is rejected).

  3. The final branch must be unparenthesised (e.g. C(C(O))C is rejected).

Taken together, (1) and (2) allow the parser to verify, whenever a dot is observed, that all rings are closed and all parentheses are matched. (3) allows the parser to place a more accurate lower bound on the valence of the atom being added to. For example, for the partial SMILES string C(, the explicit valence of the carbon is at least 2 if (3) holds, but only 1 otherwise. This allows valence errors due to hypervalent atoms to be caught more quickly.
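
The three assumptions can be illustrated with a small standalone checker. This is a simplified sketch and not the library's implementation: check_structure is a hypothetical name, it treats its input as a complete (not partial) SMILES string, and it skips over bracket atoms without validating them.

```python
def check_structure(smiles):
    """Return None if the three assumptions hold, else a short message."""
    open_rings = set()  # ring-closure labels opened but not yet closed
    depth = 0           # current parenthesis nesting depth
    i, n = 0, len(smiles)
    while i < n:
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms whole: digits inside are isotopes,
            # hydrogen counts or charges, not ring closures.
            close = smiles.find("]", i)
            if close == -1:
                return None  # incomplete bracket atom; out of scope here
            i = close + 1
            continue
        if ch == "%":
            # Two-digit ring closure such as %12: toggle its open/closed state.
            open_rings ^= {smiles[i + 1:i + 3]}
            i += 3
            continue
        if ch.isdigit():
            open_rings ^= {ch}
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            following = smiles[i + 1:i + 2]
            if following in ("", ")", "."):
                return "(3) final branch is parenthesised"
        elif ch == ".":
            if depth > 0:
                return "(2) dot inside parentheses"
            if open_rings:
                return "(1) bond closure spans a dot"
        i += 1
    return None

for smi in ["C1.C1", "C(C.C)C", "C(C(O))C", "CC(C)C.[NH4+]"]:
    print(smi, "->", check_structure(smi))
```

Because of (1) and (2), the only state this sketch has to inspect at a dot is the set of open ring closures and the parenthesis depth, which is the property the paragraph above describes.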

Syntax errors can be caught as follows:

import sys
import partialsmiles as ps

try:
    ps.ParseSmiles("C[C(=O)C", partial=False)
except ps.SMILESSyntaxError as e:
    print(repr(e), file=sys.stderr)
    # SMILESSyntaxError('Missing the close bracket', 'C[C(=O)C', 3)
    print(str(e), file=sys.stderr)
    # Missing the close bracket
    #   C[C(=O)C
    #      ^

Valence errors

A valence error is raised when the valence of an atom in a particular charge state is not present in a list of allowed valences. For example, neutral carbon is only allowed to be 4-valent, and C+ is only allowed to be 3-valent. More than one allowed valence can be specified for a given charge state.

The valence-checker provided works as follows:

  1. If an element is not on the list, an exception is raised.

  2. Otherwise, if its charge state is not on the list for that element, an exception is raised.

  3. Otherwise, if its actual valence is not included in the list of allowed valences for that charge state, an exception is raised.
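
The three steps above can be sketched as a nested lookup. The table layout, the names ALLOWED_VALENCES, ValenceProblem and check_valence, and the sulfur entry are all assumptions made for illustration; the actual dictionary in valence.py may be organised differently.

```python
# Hypothetical table: element symbol -> charge -> allowed valences.
# The carbon and neutral nitrogen entries follow the text above; the
# sulfur entry is an assumption, included to show multiple allowed values.
ALLOWED_VALENCES = {
    "C": {0: [4], 1: [3]},
    "N": {0: [3]},
    "S": {0: [2, 4, 6]},
}

class ValenceProblem(Exception):
    """Stand-in for the library's ValenceError (hypothetical)."""

def check_valence(element, charge, valence):
    """Raise ValenceProblem unless (element, charge, valence) is allowed."""
    by_charge = ALLOWED_VALENCES.get(element)
    if by_charge is None:  # step 1: element not on the list
        raise ValenceProblem(f"element {element!r} is not on the list")
    allowed = by_charge.get(charge)
    if allowed is None:    # step 2: charge state not listed for this element
        raise ValenceProblem(f"charge {charge:+d} not listed for {element}")
    if valence not in allowed:  # step 3: valence not among those allowed
        raise ValenceProblem(f"valence {valence} not allowed for {element}")

check_valence("C", 0, 4)      # passes silently
try:
    check_valence("C", 0, 5)  # 5-valent neutral carbon
except ValenceProblem as e:
    print(e)
```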

Before using the partialsmiles library to check SMILES strings generated by a machine-learning method, the training set should first be checked. If any valence errors are found, consider editing the dictionary of allowed valences in valence.py (but also think about whether there’s something wrong with your structure).

Warning

Note that the only allowed valence for neutral nitrogen is 3. Although this can be edited to include ‘hypervalent’ (i.e. 5-valent) nitrogen, I recommend against doing so; instead, convert any hypervalent nitrogens in the training data to 3-valent. Allowing hypervalent nitrogen yields no benefit but makes it difficult to catch erroneous nitrogen valences.

It is also worth noting that specifying every atom in the training set with square brackets (e.g. ethane as [CH3][CH3] instead of CC) promotes early termination, because disallowed valences are detected earlier. This may also improve the machine-learning model.

Valence errors can be caught as follows:

import sys
import partialsmiles as ps

try:
    ps.ParseSmiles("C(C)(C)(C)(C)C", partial=False)
except ps.ValenceError as e:
    print(repr(e), file=sys.stderr)
    # ValenceError('Uncommon valence or charge state', 'C(C)(C)(C)(C)C', 0)
    print(str(e), file=sys.stderr)
    # Uncommon valence or charge state
    #   C(C)(C)(C)(C)C
    #   ^

Kekulization errors

A kekulization error is raised if an alternating pattern of single and double bonds cannot be found to cover an aromatic system (some details omitted). By definition, it is not possible to check for an error until the entire aromatic system is read (i.e. all connected lowercase atoms in the aromatic system). This means that all atoms connected to the system also need to be resolved as the parser can’t know whether these will turn out to be aromatic.

For example, we cannot attempt to kekulize the aromatic system in the partial SMILES string c1ccccc1 as any additional character may affect the kekulization; once this is provided, e.g. c1ccccc1C, we can try to kekulize it. Similarly, for c1ccc2cc1C we cannot attempt to kekulize until the identity of the atom at the other end of the bond closure “2” is known.
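
To make the idea concrete, kekulization can be viewed as finding a perfect matching: every aromatic atom that needs a double bond must be paired with exactly one neighbour. The following is a minimal sketch under that simplification, using brute-force backtracking on a small graph. The names can_kekulize and ring are hypothetical, and this is not the algorithm used by partialsmiles.

```python
def can_kekulize(adjacency):
    """adjacency maps each atom needing a double bond to its neighbours."""
    def match(unmatched):
        if not unmatched:
            return True  # every atom has been paired
        atom = min(unmatched)
        # Try pairing this atom with each neighbour that is still unpaired.
        return any(
            match(unmatched - {atom, nbr})
            for nbr in adjacency[atom]
            if nbr in unmatched
        )
    return match(frozenset(adjacency))

def ring(n):
    """A simple n-membered ring in which every atom needs a double bond."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

print(can_kekulize(ring(6)))  # like c1ccccc1: True
print(can_kekulize(ring(5)))  # like c1cccc1: False (odd number of atoms)

# Like c1cc[nH]cc1: the [nH] (atom 3) takes no double bond, so it is left
# out, which leaves a chain of five atoms that cannot all be paired.
chain = {0: {1, 5}, 1: {0, 2}, 2: {1}, 4: {5}, 5: {4, 0}}
print(can_kekulize(chain))    # False
```

The sketch also shows why the whole aromatic system must be known first: adding or removing a single atom in the graph can change the answer.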

Kekulization errors can be caught as follows. Note that the indicated location of the error may be any atom in the problematic aromatic system:

import sys
import partialsmiles as ps

try:
    ps.ParseSmiles("c1cc[nH]cc1", partial=False)
except ps.KekulizationFailure as e:
    print(repr(e), file=sys.stderr)
    # KekulizationFailure('Aromatic system cannot be kekulized', 'c1cc[nH]cc1', 3)
    print(str(e), file=sys.stderr)
    # Aromatic system cannot be kekulized
    #   c1cc[nH]cc1
    #      ^