COAs
Certificates of analysis (COAs) are abundant for cultivators, processors, retailers, and consumers too, but the data is often locked away. Rich, valuable laboratory data so close, yet so far away! CoADoc
puts these vital data points in your hands by parsing PDFs and URLs, finding all the data, standardizing the data, and cleanly returning the data to you.
In order to parse CoAs well, CoADoc takes into consideration:
- The lab or LIMS that generated the CoA;
- PDF properties, such as the fonts used and the page dimensions;
- All detected words, lines, columns, white-space, etc.
The roadmap for CoADoc is to continue adding lab and LIMS CoA parsing algorithms until a general CoA parsing algorithm may be able to be created. As CoA parsing algorithms are still under development, if you want a specific lab or LIMS CoA parsed, then please contact The Cannlytics Team: dev@cannlytics.com
Algorithms
You can methodically get data from COAs for various labs with the combination of a handful of tools plus a little human ingenuity. All COAs that are extracted through manually written routines are quite useful because these can then be used for training natural language processing (NLP) models to further extract data from COAs!
Status | ||
---|---|---|
🟠Development | 🟡 Operational | 🟢 Functional |
Lab / LIMS | Algorithm | Status |
---|---|---|
ACS Labs | parse_acs_coa |
🟡 |
Anresco Laboratories | parse_anresco_coa |
🟢 |
Cannalysis | parse_cannalysis_coa |
🟢 |
Confidence Analytics | parse_confidence_coa |
🟢 |
Confident Cannabis | parse_cc_coa |
🟢 |
Green Leaf Lab | parse_green_leaf_lab_coa |
🟢 |
Kaycha Labs | parse_kaycha_coa |
🟡 |
KCA Labs | parse_kca_coa |
🟠|
MCR Labs | parse_mcr_labs_coa |
🟢 |
Modern Canna Science | parse_moderncanna_coa |
🟠|
Method Testing Labs | parse_mtl_coa |
🟠|
SC Labs | parse_sc_labs_coa |
🟢 |
Sonoma Lab Works | parse_sonoma_coa |
🟢 |
Steep Hill | parse_steephill_coa |
🟢 |
TagLeaf LIMS | parse_tagleaf_coa |
🟢 |
TerpLife Labs | parse_terplife_coa |
🟠|
Veda Scientific | parse_veda_coa |
🟡 |
Installation
First, make sure that you have installed cannlytics
:
pip install cannlytics
Then make sure to install the following dependencies:
Usage
Initialize a CoADoc
parsing client.
from cannlytics.data.coas import CoADoc
# Initialize a COA parser.
parser = CoADoc()
Parse a PDF.
# Parse a PDF.
filename = 'Classic Jack.pdf'
data = parser.parse_pdf(filename)
Parse a URL.
# Parse a URL.
url = 'https://lims.tagleaf.com/coa_/F6LHqs9rk9'
data = parser.parse_url(url)
Parse a list or URLs or PDFs, a mix of both is okay.
# Parse a list of COAs.
data = parser.parse([filename, url])
Close the client when you are finished to perform garbage cleaning.
# Close the parser.
parser.quit()
Core Methods
CoADoc
comes ready to rumble, but to unleash CoADoc
's true power, ensure that you have installed ChromeDriver (and that ChromeDriver is in your PATH) for parsing complex lab/LIMS webpages. Also, you can install ImageMagick and Tesseract to use optical character recognition (OCR) to attempt to recognize the text of COAs stored as images.
Function | Description |
---|---|
identify_lims(doc, lims=None) |
Identify if a COA was created by a common LIMS. Search all of the text of the LIMS name or URL. If no LIMS is identified from the text, then the images are attempted to be decoded, searching for a QR code URL. |
parse(data, headers={}, kind='url', lims=None, max_delay=7, persist=True) |
Parse all COAs given a directory, a list of files, or a list of URLs. |
parse_pdf(pdf, headers={}, kind='url', lims=None, max_delay=7, persist=True) |
Parse a COA PDF. The method searches PDF images, alternating from the first then the last to the middle, for a QR code that decodes to a URL. If a URL is found, then results are attempted to be collected from the web. If no QR code is found or results can't be found on the web, then data is extracted from the PDF. |
parse_url(url, headers={}, kind='url', lims=None, max_delay=7, persist=True) |
Parse a COA URL using web data collection methods. |
pdf_ocr(filename, outfile, temp_path='/tmp', resolution=300, cleanup=True) |
Pass a PDF through OCR to recognize its text. Outputs a new PDF. A temporary directory is used, because the algorithm is to: 1. Convert all PDF pages to images. 2. Convert each image to PDF with text. 3. Compile the PDFs with text to a single PDF. The rendered images and individual PDF files are removed by default. |
save(data, outfile, codings=None, column_order=None, nuisance_columns=None, numeric_columns=None, standard_analyses=None, standard_analytes=None, standard_fields=None, google_maps_api_key=None) |
Save all COA data, elongating results and widening values. That is, a Workbook is created with a "Details" worksheet that has all of the raw data, a "Results" worksheet with long-form data where each row is a result for an analyte, and a "Values" worksheet with wide-form data where each row is an observation and each column is the value field for each of the results . |
standardize(data, codings=None, column_order=None, nuisance_columns=None, numeric_columns=None, how='details', details_data=None, results_data=None, standard_analyses=None, standard_analytes=None, standard_fields=None, google_maps_api_key=None) |
Standardize (and normalize) given data. Pass A google_maps_api_key to supplement addresses with latitude and longitude. Specify column_order as a list of columns in desired order. Specify nuisance_columns as a list of column suffixes to remove. Specify numeric_columns as a list of columns to apply codings and convert to numeric values. Specify how for a simple clean of the data details by default. Alternatively specify wide for a wide-form DataFrame of values or long for a long-form DataFrame of results. Specify standard_analyses , standard_analytes , standard_fields to use custom standardization mappings. |
quit() |
Close any driver, end any session, and reset the parameters. |
Common COA Data Points
Below is a non-exhaustive list of fields, used to standardize the various data that are encountered, that you may expect to be returned from the parsing of a COA.
Field | Example | Description |
---|---|---|
analyses |
["cannabinoids"] | A list of analyses performed on a given sample. |
{analysis}_status |
"pass" | The pass, fail, or N/A status for pass / fail analyses. |
methods |
[{"analysis: "cannabinoids", "method": "HPLC"}] | The methods used for each analysis. |
coa_urls |
[{"url": "", "filename": ""}] | A list of certificate of analysis (COA) URLs. |
date_collected |
2022-04-20T04:20 | An ISO-formatted time when the sample was collected. |
date_tested |
2022-04-20T16:20 | An ISO-formatted time when the sample was tested. |
date_received |
2022-04-20T12:20 | An ISO-formatted time when the sample was received. |
distributor |
"Your Favorite Dispo" | The name of the product distributor, if applicable. |
distributor_address |
"Under the Bridge, SF, CA 55555" | The distributor address, if applicable. |
distributor_street |
"Under the Bridge" | The distributor street, if applicable. |
distributor_city |
"SF" | The distributor city, if applicable. |
distributor_state |
"CA" | The distributor state, if applicable. |
distributor_zipcode |
"55555" | The distributor zip code, if applicable. |
distributor_license_number |
"L2Stat" | The distributor license number, if applicable. |
images |
[{"url": "", "filename": ""}] | A list of image URLs for the sample. |
lab_results_url |
"https://cannlytics.com/results" | A URL to the sample results online. |
producer |
"Grow Tent" | The producer of the sampled product. |
producer_address |
"3rd & Army, SF, CA 55555" | The producer's address. |
producer_street |
"3rd & Army" | The producer's street. |
producer_city |
"SF" | The producer's city. |
producer_state |
"CA" | The producer's state. |
producer_zipcode |
"55555" | The producer's zipcode. |
producer_license_number |
"L2Calc" | The producer's license number. |
product_name |
"Blue Rhino Pre-Roll" | The name of the product. |
lab_id |
"Sample-0001" | A lab-specific ID for the sample. |
product_type |
"flower" | The type of product. |
batch_number |
"Order-0001" | A batch number for the sample or product. |
metrc_ids |
["1A4060300002199000003445"] | A list of relevant Metrc IDs. |
metrc_lab_id |
"1A4060300002199000003445" | The Metrc ID associated with the lab sample. |
metrc_source_id |
"1A4060300002199000003445" | The Metrc ID associated with the sampled product. |
product_size |
2000 | The size of the product in milligrams. |
serving_size |
1000 | An estimated serving size in milligrams. |
servings_per_package |
2 | The number of servings per package. |
sample_weight |
1 | The weight of the product sample in grams. |
results |
[{…},…] | A list of results, see below for result-specific fields. |
status |
"pass" | The overall pass / fail status for all contaminant screening analyses. |
total_cannabinoids |
14.20 | The analytical total of all cannabinoids measured. |
total_thc |
14.00 | The analytical total of THC and THCA. |
total_cbd |
0.20 | The analytical total of CBD and CBDA. |
total_terpenes |
0.42 | The sum of all terpenes measured. |
sample_id |
"{sha256-hash}" | A generated ID to uniquely identify the producer , product_name , and date_tested . |
strain_name |
"Blue Rhino" | A strain name, if specified. Otherwise, can be attempted to be parsed from the product_name . |
Each result can contain the following fields.
Field | Example | Description |
---|---|---|
analysis |
"pesticides" | The analysis used to obtain the result. |
key |
"pyrethrins" | A standardized key for the result analyte. |
name |
"Pyrethrins" | The lab's internal name for the result analyte |
value |
0.42 | The value of the result. |
mg_g |
0.00000042 | The value of the result in milligrams per gram. |
units |
"ug/g" | The units for the result value , limit , lod , and loq . |
limit |
0.5 | A pass / fail threshold for contaminant screening analyses. |
lod |
0.01 | The limit of detection for the result analyte. Values below the lod are typically reported as ND . |
loq |
0.1 | The limit of quantification for the result analyte. Values above the lod but below the loq are typically reported as <LOQ . |
status |
"pass" | The pass / fail status for contaminant screening analyses. |
Advanced Usage
If you are developing a new parsing routine for a lab or LIMS, then you will need to follow these steps.
1. Implement a parse_{lab}_coa
, parse_{lab}_pdf
and/or a parse_{lab}_url
function to parse results from a given COA. If there is a QR code on the COA containing the sample's lab_results_url
, then the PDF parsing routine can be as simple as follows. If not, then you can implement your own PDF parsing logic.
def parse_cc_pdf(
self,
doc: Any,
max_delay: Optional[float] = 7,
persist: Optional[bool] = False,
) -> dict:
"""Parse a Confident Cannabis COA PDF.
Args:
doc (str or PDF): A file path to a PDF or a pdfplumber PDF.
max_delay (float): The maximum number of seconds to wait
for the page to load.
persist (bool): Whether to persist the driver.
The default is `False`. If you do persist
the driver, then make sure to call `quit`
when you are finished.
Returns:
(dict): The sample data.
"""
# TODO: Implement any custom PDF parsing here....
return self.parse_pdf(
self,
doc,
lims='Confident Cannabis',
max_delay=max_delay,
persist=persist,
)
sample_id
before returning the observation (obs
) data.
from cannlytics.data.data import create_sample_id
def parse_cc_url(
self,
url: str,
headers: Optional[Any] = None,
max_delay: Optional[float] = 7,
persist: Optional[bool] = False,
) -> dict:
"""Parse a Confident Cannabis COA URL.
Args:
url (str): The COA URL.
headers (Any): Optional headers for standardization.
max_delay (float): The maximum number of seconds to wait
for the page to load.
persist (bool): Whether to persist the driver.
The default is `False`. If you do persist
the driver, then make sure to call `quit`
when you are finished.
Returns:
(dict): The sample data.
"""
# TODO: Implement API request and parsing here....
# Return the sample with a freshly minted sample ID.
obs['sample_id'] = create_sample_id(
private_key=producer,
public_key=product_name,
salt=date_tested,
)
return obs
parse_{lab}_coa
function to parse either a PDF or a URL.
def parse_cc_coa(
parser,
doc: Any,
**kwargs,
) -> dict:
"""Parse a Confident Cannabis COA PDF or URL.
Args:
doc (str or PDF): A PDF file path or pdfplumber PDF.
Returns:
(dict): The sample data.
"""
if isinstance(doc, str):
if doc.startswith('http'):
return parse_cc_url(parser, doc, **kwargs)
elif doc.endswith('.pdf'):
return parse_cc_pdf(parser, doc, **kwargs)
else:
return parse_cc_pdf(parser, doc, **kwargs)
else:
return parse_cc_pdf(parser, doc, **kwargs)
coas.py
, where your lab or LIMS details contain the following fields:
YOUR_FAVORITE_LIMS = {
'coa_algorithm': 'favorite.py',
'coa_algorithm_entry_point': 'parse_favorite_coa',
'lims': 'Your Favorite LIMS',
'url': 'https://cannlytics.com',
}
LIMS
dictionary with the preferred name of your lab / LIMS as the key. For Example:
LIMS = {
'Your Favorite LIMS': YOUR_FAVORITE_LIMS,
}
You can use CoADoc
's built-in helper functions in your parsing algorithms.
Function | Description |
---|---|
decode_pdf_qr_code(page, img, resolution=300) |
Decode a PDF QR Code from a given image. |
find_pdf_qr_code_url(pdf, image_index=None) |
Find the QR code given a COA PDF or page. If no image_index is provided, then all images are tried to be decoded until a QR code is found. If no QR code is found, then a IndexError is raised. |
find_metrc_id(pdf) |
Under Development |
get_metrc_results |
Under Development |
get_pdf_creation_date(pdf) |
Get the creation date of a PDF in ISO format. |
identify_lims(doc) |
Identify if a COA was created by a common LIMS. |
Once you have created a function or functions to parse COAs for a new lab or LIMS, then you can create a pull request to have your algorithm reviewed and included in the main Cannlytics repository upon approval. Once published, your algorithm and the knowledge that it generates can be used to help parse COAs all around the world. As more and more COAs are parsed, the knowledge base will grow and grow. Now you have rich COA data, cleverly unlocked, continuously being polished, and ripe for your plundering.