Skip to content

COAs

COA Data Parser

Certificates of analysis (COAs) are abundant for cultivators, processors, retailers, and consumers too, but the data is often locked away. Rich, valuable laboratory data so close, yet so far away! CoADoc puts these vital data points in your hands by parsing PDFs and URLs, finding all the data, standardizing the data, and cleanly returning the data to you.

In order to parse CoAs well, CoADoc takes into consideration:

  • The lab or LIMS that generated the CoA;
  • PDF properties, such as the fonts used and the page dimensions;
  • All detected words, lines, columns, white-space, etc.

The roadmap for CoADoc is to continue adding lab and LIMS CoA parsing algorithms until a general CoA parsing algorithm may be able to be created. As CoA parsing algorithms are still under development, if you want a specific lab or LIMS CoA parsed, then please contact The Cannlytics Team: dev@cannlytics.com

Algorithms

You can methodically get data from COAs for various labs with the combination of a handful of tools plus a little human ingenuity. All COAs that are extracted through manually written routines are quite useful because these can then be used for training natural language processing (NLP) models to further extract data from COAs!

Status
🟠 Development 🟡 Operational 🟢 Functional
Lab / LIMS Algorithm Status
ACS Labs parse_acs_coa 🟡
Anresco Laboratories parse_anresco_coa 🟢
Cannalysis parse_cannalysis_coa 🟢
Confidence Analytics parse_confidence_coa 🟢
Confident Cannabis parse_cc_coa 🟢
Green Leaf Lab parse_green_leaf_lab_coa 🟢
Kaycha Labs parse_kaycha_coa 🟡
KCA Labs parse_kca_coa 🟠
MCR Labs parse_mcr_labs_coa 🟢
Modern Canna Science parse_moderncanna_coa 🟠
Method Testing Labs parse_mtl_coa 🟠
SC Labs parse_sc_labs_coa 🟢
Sonoma Lab Works parse_sonoma_coa 🟢
Steep Hill parse_steephill_coa 🟢
TagLeaf LIMS parse_tagleaf_coa 🟢
TerpLife Labs parse_terplife_coa 🟠
Veda Scientific parse_veda_coa 🟡

Installation

First, make sure that you have installed cannlytics:

pip install cannlytics

Then make sure to install the following dependencies:

  1. Install Tesseract (Windows | iOS).
  2. Install ImageMagick (Download | Cloud
  3. Install zbar (Download).

Usage

Initialize a CoADoc parsing client.

from cannlytics.data.coas import CoADoc

# Initialize a COA parser.
parser = CoADoc()

Parse a PDF.

# Parse a PDF.
filename = 'Classic Jack.pdf'
data = parser.parse_pdf(filename)

Parse a URL.

# Parse a URL.
url = 'https://lims.tagleaf.com/coa_/F6LHqs9rk9'
data = parser.parse_url(url)

Parse a list or URLs or PDFs, a mix of both is okay.

# Parse a list of COAs.
data = parser.parse([filename, url])

Close the client when you are finished to perform garbage cleaning.

# Close the parser.
parser.quit()

Core Methods

CoADoc comes ready to rumble, but to unleash CoADoc's true power, ensure that you have installed ChromeDriver (and that ChromeDriver is in your PATH) for parsing complex lab/LIMS webpages. Also, you can install ImageMagick and Tesseract to use optical character recognition (OCR) to attempt to recognize the text of COAs stored as images.

Function Description
identify_lims(doc, lims=None) Identify if a COA was created by a common LIMS. Search all of the text of the LIMS name or URL. If no LIMS is identified from the text, then the images are attempted to be decoded, searching for a QR code URL.
parse(data, headers={}, kind='url', lims=None, max_delay=7, persist=True) Parse all COAs given a directory, a list of files, or a list of URLs.
parse_pdf(pdf, headers={}, kind='url', lims=None, max_delay=7, persist=True) Parse a COA PDF. The method searches PDF images, alternating from the first then the last to the middle, for a QR code that decodes to a URL. If a URL is found, then results are attempted to be collected from the web. If no QR code is found or results can't be found on the web, then data is extracted from the PDF.
parse_url(url, headers={}, kind='url', lims=None, max_delay=7, persist=True) Parse a COA URL using web data collection methods.
pdf_ocr(filename, outfile, temp_path='/tmp', resolution=300, cleanup=True) Pass a PDF through OCR to recognize its text. Outputs a new PDF. A temporary directory is used, because the algorithm is to: 1. Convert all PDF pages to images. 2. Convert each image to PDF with text. 3. Compile the PDFs with text to a single PDF. The rendered images and individual PDF files are removed by default.
save(data, outfile, codings=None, column_order=None, nuisance_columns=None, numeric_columns=None, standard_analyses=None, standard_analytes=None, standard_fields=None, google_maps_api_key=None) Save all COA data, elongating results and widening values. That is, a Workbook is created with a "Details" worksheet that has all of the raw data, a "Results" worksheet with long-form data where each row is a result for an analyte, and a "Values" worksheet with wide-form data where each row is an observation and each column is the value field for each of the results.
standardize(data, codings=None, column_order=None, nuisance_columns=None, numeric_columns=None, how='details', details_data=None, results_data=None, standard_analyses=None, standard_analytes=None, standard_fields=None, google_maps_api_key=None) Standardize (and normalize) given data. Pass A google_maps_api_key to supplement addresses with latitude and longitude. Specify column_order as a list of columns in desired order. Specify nuisance_columns as a list of column suffixes to remove. Specify numeric_columns as a list of columns to apply codings and convert to numeric values. Specify how for a simple clean of the data details by default. Alternatively specify wide for a wide-form DataFrame of values or long for a long-form DataFrame of results. Specify standard_analyses, standard_analytes, standard_fields to use custom standardization mappings.
quit() Close any driver, end any session, and reset the parameters.

Common COA Data Points

Below is a non-exhaustive list of fields, used to standardize the various data that are encountered, that you may expect to be returned from the parsing of a COA.

Field Example Description
analyses ["cannabinoids"] A list of analyses performed on a given sample.
{analysis}_status "pass" The pass, fail, or N/A status for pass / fail analyses.
methods [{"analysis: "cannabinoids", "method": "HPLC"}] The methods used for each analysis.
coa_urls [{"url": "", "filename": ""}] A list of certificate of analysis (COA) URLs.
date_collected 2022-04-20T04:20 An ISO-formatted time when the sample was collected.
date_tested 2022-04-20T16:20 An ISO-formatted time when the sample was tested.
date_received 2022-04-20T12:20 An ISO-formatted time when the sample was received.
distributor "Your Favorite Dispo" The name of the product distributor, if applicable.
distributor_address "Under the Bridge, SF, CA 55555" The distributor address, if applicable.
distributor_street "Under the Bridge" The distributor street, if applicable.
distributor_city "SF" The distributor city, if applicable.
distributor_state "CA" The distributor state, if applicable.
distributor_zipcode "55555" The distributor zip code, if applicable.
distributor_license_number "L2Stat" The distributor license number, if applicable.
images [{"url": "", "filename": ""}] A list of image URLs for the sample.
lab_results_url "https://cannlytics.com/results" A URL to the sample results online.
producer "Grow Tent" The producer of the sampled product.
producer_address "3rd & Army, SF, CA 55555" The producer's address.
producer_street "3rd & Army" The producer's street.
producer_city "SF" The producer's city.
producer_state "CA" The producer's state.
producer_zipcode "55555" The producer's zipcode.
producer_license_number "L2Calc" The producer's license number.
product_name "Blue Rhino Pre-Roll" The name of the product.
lab_id "Sample-0001" A lab-specific ID for the sample.
product_type "flower" The type of product.
batch_number "Order-0001" A batch number for the sample or product.
metrc_ids ["1A4060300002199000003445"] A list of relevant Metrc IDs.
metrc_lab_id "1A4060300002199000003445" The Metrc ID associated with the lab sample.
metrc_source_id "1A4060300002199000003445" The Metrc ID associated with the sampled product.
product_size 2000 The size of the product in milligrams.
serving_size 1000 An estimated serving size in milligrams.
servings_per_package 2 The number of servings per package.
sample_weight 1 The weight of the product sample in grams.
results [{…},…] A list of results, see below for result-specific fields.
status "pass" The overall pass / fail status for all contaminant screening analyses.
total_cannabinoids 14.20 The analytical total of all cannabinoids measured.
total_thc 14.00 The analytical total of THC and THCA.
total_cbd 0.20 The analytical total of CBD and CBDA.
total_terpenes 0.42 The sum of all terpenes measured.
sample_id "{sha256-hash}" A generated ID to uniquely identify the producer, product_name, and date_tested.
strain_name "Blue Rhino" A strain name, if specified. Otherwise, can be attempted to be parsed from the product_name.

Each result can contain the following fields.

Field Example Description
analysis "pesticides" The analysis used to obtain the result.
key "pyrethrins" A standardized key for the result analyte.
name "Pyrethrins" The lab's internal name for the result analyte
value 0.42 The value of the result.
mg_g 0.00000042 The value of the result in milligrams per gram.
units "ug/g" The units for the result value, limit, lod, and loq.
limit 0.5 A pass / fail threshold for contaminant screening analyses.
lod 0.01 The limit of detection for the result analyte. Values below the lod are typically reported as ND.
loq 0.1 The limit of quantification for the result analyte. Values above the lod but below the loq are typically reported as <LOQ.
status "pass" The pass / fail status for contaminant screening analyses.

Advanced Usage

If you are developing a new parsing routine for a lab or LIMS, then you will need to follow these steps. 1. Implement a parse_{lab}_coa, parse_{lab}_pdf and/or a parse_{lab}_url function to parse results from a given COA. If there is a QR code on the COA containing the sample's lab_results_url, then the PDF parsing routine can be as simple as follows. If not, then you can implement your own PDF parsing logic.

def parse_cc_pdf(
        self,
        doc: Any,
        max_delay: Optional[float] = 7,
        persist: Optional[bool] = False,
    ) -> dict:
    """Parse a Confident Cannabis COA PDF.
    Args:
        doc (str or PDF): A file path to a PDF or a pdfplumber PDF.
        max_delay (float): The maximum number of seconds to wait
            for the page to load.
        persist (bool): Whether to persist the driver.
            The default is `False`. If you do persist
            the driver, then make sure to call `quit`
            when you are finished.
    Returns:
        (dict): The sample data.
    """
    # TODO: Implement any custom PDF parsing here....

    return self.parse_pdf(
        self,
        doc,
        lims='Confident Cannabis',
        max_delay=max_delay,
        persist=persist,
    )
Your algorithm to parse from a lab or LIMS COA URL can be as simple or as complex as necessary. If the lab or LIMS has implemented an API, then the algorithm can simply be a function to make a request to the lab's API. Be sure to create a unique sample_id before returning the observation (obs) data.
from cannlytics.data.data import create_sample_id

def parse_cc_url(
        self,
        url: str,
        headers: Optional[Any] = None,
        max_delay: Optional[float] = 7,
        persist: Optional[bool] = False,
    ) -> dict:
    """Parse a Confident Cannabis COA URL.
    Args:
        url (str): The COA URL.
        headers (Any): Optional headers for standardization.
        max_delay (float): The maximum number of seconds to wait
            for the page to load.
        persist (bool): Whether to persist the driver.
            The default is `False`. If you do persist
            the driver, then make sure to call `quit`
            when you are finished.
    Returns:
        (dict): The sample data.
    """
    # TODO: Implement API request and parsing here....

    # Return the sample with a freshly minted sample ID.
    obs['sample_id'] = create_sample_id(
        private_key=producer,
        public_key=product_name,
        salt=date_tested,
    )
    return obs
Finally, you can create a parse_{lab}_coa function to parse either a PDF or a URL.
def parse_cc_coa(
    parser,
    doc: Any,
    **kwargs,
) -> dict:
"""Parse a Confident Cannabis COA PDF or URL.
Args:
    doc (str or PDF): A PDF file path or pdfplumber PDF.
Returns:
    (dict): The sample data.
"""
if isinstance(doc, str):
    if doc.startswith('http'):
        return parse_cc_url(parser, doc, **kwargs)
    elif doc.endswith('.pdf'):
        return parse_cc_pdf(parser, doc, **kwargs)
    else:
        return parse_cc_pdf(parser, doc, **kwargs)
else:
    return parse_cc_pdf(parser, doc, **kwargs)
2. Import the details for your lab / LIMS in coas.py, where your lab or LIMS details contain the following fields:
YOUR_FAVORITE_LIMS = {
    'coa_algorithm': 'favorite.py',
    'coa_algorithm_entry_point': 'parse_favorite_coa',
    'lims': 'Your Favorite LIMS',
    'url': 'https://cannlytics.com',
}
3. Add the lab details to the LIMS dictionary with the preferred name of your lab / LIMS as the key. For Example:
LIMS = {
  'Your Favorite LIMS': YOUR_FAVORITE_LIMS,
}

You can use CoADoc's built-in helper functions in your parsing algorithms.

Function Description
decode_pdf_qr_code(page, img, resolution=300) Decode a PDF QR Code from a given image.
find_pdf_qr_code_url(pdf, image_index=None) Find the QR code given a COA PDF or page. If no image_index is provided, then all images are tried to be decoded until a QR code is found. If no QR code is found, then a IndexError is raised.
find_metrc_id(pdf) Under Development
get_metrc_results Under Development
get_pdf_creation_date(pdf) Get the creation date of a PDF in ISO format.
identify_lims(doc) Identify if a COA was created by a common LIMS.

Once you have created a function or functions to parse COAs for a new lab or LIMS, then you can create a pull request to have your algorithm reviewed and included in the main Cannlytics repository upon approval. Once published, your algorithm and the knowledge that it generates can be used to help parse COAs all around the world. As more and more COAs are parsed, the knowledge base will grow and grow. Now you have rich COA data, cleverly unlocked, continuously being polished, and ripe for your plundering.