Cannlytics Utilities
The cannlytics.utils
module contains constants and general utility functions for working with cannabis data. Due to the use of zoneinfo
for managing time, Python 3.9+ is recommended.
Constants
There are a number of useful constants in the cannlytics.utils.constants
submodule that you can use for standardizing data.
Constant | Description |
---|---|
ANALYSES |
A map of encountered analyses to their standardized analysis. |
ANALYTES |
A map of encountered analytes to their standardized analyte. |
STANDARD_ANALYSES |
Standard analysis key map. |
STANDARD_FIELDS |
A map of encountered fields to their standardized field. |
STANDARD_UNITS |
A map of standard units by analysis to use when no units are obtainable. |
PRODUCT_TYPES |
A map of encountered product types to their standardized product type. |
STRAINS |
A map of encountered strains to their standardized strain name. |
CODINGS |
Standard value codings. |
DECARB |
Cannabinoid decarboxylation rate. |
DEFAULT_HEADERS |
Default headers to use for HTTP requests, because we are AI and should not be treated as a bot. |
RANDOM_STRING_CHARS |
Random characters to use in password generation. |
states |
A map of state abbreviations to state names. |
state_names |
A map of state names to state abbreviations. |
state_time_zones |
A map of state abbreviations to timezone. |
Utility Functions
String Utilities
Function | Description |
---|---|
camelcase(string) |
Turn a given string to CamelCase. |
camel_to_snake(string) |
Turn a camel-case string to a snake-case string. This function handles CamelCase better than snake_case . The function does not do well with all caps, e.g. "APP_ID". |
kebab_case(string) |
Turn a string into a kebab-case string. |
format_billions(value) |
Format a number in billions. |
format_millions(value) |
Format a number in millions. |
format_thousands(value) |
Format a number in thousands. |
get_keywords(string) |
Get keywords for a given string. |
get_random_string(length, allowed_chars=RANDOM_STRING_CHARS) |
Return a securely generated random string. |
sentence_case(string) |
Format a string as a sentence. |
snake_case(string) |
Turn a given string to snake case. Handles CamelCase, replaces known special characters with preferred namespaces, replaces spaces with underscores, and removes all other nuisance characters. |
strip_whitespace(string) |
Strip whitespace from a string. |
Number Utilities
Function | Description |
---|---|
convert_to_numeric(string, strip=False) |
Convert a string to numeric, optionally replacing non-numeric characters. |
List Utilities
Function | Description |
---|---|
sandwich_list(a) |
Create a range that cycles from start to the end to the middle. |
sorted_nicely(a) |
Sort the given iterable in the way that humans expect. |
split_list(a, at_index=None) |
Split a list in half or at a given index. |
Dictionary Utilities
Function | Description |
---|---|
clean_dictionary(data, function=snake_case) |
Format dictionary keys with given function, snake case by default. |
clean_nested_dictionary(data, function=snake_case) |
Format nested (at most 2 levels) dictionary keys with a given function, snake case by default. |
remove_dict_fields(data, fields) |
Remove multiple keys from a dictionary. |
remove_dict_nulls(data) |
Return a shallow copy of a dictionary with all None values excluded. |
update_dict(context, function=camel_to_snake, **kwargs) |
Update dictionary with keyword arguments. |
DataFrame Utilities
Function | Description |
---|---|
clean_column_strings(data, columns) |
Clean the column names of a given DataFrame. |
end_of_period_timeseries(data, period='M') |
Convert a DataFrame from beginning-of-the-period to end-of-the-period timeseries. |
nonzero_columns(data) |
Return the non-zero column names of a DataFrame. |
nonzero_rows(data) |
Return the non-zero row keys of a DataFrame. |
combine_columns(data, new_key, old_key, drop=True) |
Combine two numeric columns of a DataFrame. |
reorder_columns(data, columns) |
Re-order a DataFrame given a specific order of columns. Remaining columns will be appended to the end of the DataFrame. |
reverse_dataframe(data) |
Reverse the ordering of a DataFrame. |
sum_columns(data, new_key, columns, drop=True) |
Sum multiple numeric columns of a DataFrame. |
rmerge(left, right, **kwargs) |
Perform a merge using pandas with optional removal of overlapping column names not associated with the join. |
set_training_period(series, date_start, date_end) |
Helper function to restrict a series to the desired training time period. |
to_excel_with_style(data, file_name, index=False, sheet_name='Sheet1', style=None) |
Save a DataFrame to Excel with no style. |
Time Utilities
Function | Description |
---|---|
convert_month_year_to_date(x) |
Convert a month, year series to datetime. E.g. 'April 2022' . |
end_of_month(value) |
Format a datetime as an ISO formatted date at the end of the month. |
end_of_year(value) |
Format a datetime as an ISO formatted date at the end of the year. |
format_iso_date(date, sep='/') |
Format a human-written date into an ISO formatted date. |
get_timestamp(date, past=0, future=0, zone='utc) |
Get an ISO formatted timestamp. |
months_elapsed(start, end) |
Calculate the months elapsed between two times, returning 0 if a negative time span. |
File Utilities
Function | Description |
---|---|
decode_pdf(data, destination) |
Save an base-64 encoded string as a PDF. |
encode_pdf(filename) |
Open a PDF file in binary mode. |
get_directory_files(target_dir, file_type) |
Get all of the files of a specified type in a given directory. |
get_number_of_lines(file_name, encoding='utf-16', errors='ignore') |
Read the number of lines in a large file. |
download_file_from_url(url, destination='', ext='') |
Download a file from a URL to a given directory. |
unzip_files(zip_dir, extension='.zip') |
Unzip all files in a specified folder. Alternatively, pass a .zip file to extract that file. |