Add slack notification (#44)

PyMedic · web-flow · commit 071246cb3fd0 · 2025-07-29T12:12:21.000-07:00
* Added the Lambda function for the Slack notification.
* Added the AWS resource for the Slack notification Lambda function and the Webhook URL parameter. Also added new parameters for the SendNotification task of the CD2RefreshStateMachine that will be passed to the Slack notification Lambda function.
* Reverted the SNS topic name because it is unnecessary.
* Added the Slack Webhook URL secret name parameter value.
* Removed response.text due to a GitHub code scanning alert.
* Added the If condition that decides where to get the SNS Topic ARN depending on the CreateNotificationTopic condition.
* Removed the Lambda function code because we will use the existing SNS topic.
* Removed the Lambda function AWS resource and Slack-related parameter because they are no longer needed.
* Updated the message format for SendNotification to match the one expected by the Lambda function named ubcla-notifications-sns-slack-notification.
* Added a comma in the JSON format for the SNS message.
* Updated the Message format for SendNotification because using the Sub function in the JSON caused an error during the CloudFormation stack update.
* Removed quotes in the Message.$ parameter.
* Fixed the cfn linting error.
* Revert "Removed the Lambda function code because we will use the existing SNS topic." This reverts commit b5118f8.
* Revert "Removed the Lambda function AWS resource and Slack related paramter because they are no longer needed." This reverts commit 7a37c39.
* Reverted the previous changes to the Message value for the SendNotification step.
* Added a function that processes the processed CD2 table data and outputs a human-readable message.
* Took the inline IAM policy out of SlackNotificationFunction and created a new IAM role for it.
* Removed unnecessary parameters from SendNotification that I added previously.
* Added a step to transform the string message into the real data.
* Fixed the Slack secret name references in the AWS resources.
* Removed the value for SlackWebHookURLSecretNameParameter. This value will be provided via the environment variables in AWS CodePipeline.
* Changed the Docker image source from docker.io to public.ecr.aws to prevent anonymous pull rate limits from Docker Hub during the CodeBuild.
* Added the AWS environment value to the SNS message title to tell whether the message is from staging or production.
* Commented out the complete tables because they make the notification too crowded. The failed tables are the more important information for the notification.
* Added the error message to the sync_table payload message because this information is missing from the sync_table output. Added the parameter called failed_error_messages to the PivotResults state in the step function.
* Added lines to test whether failed_error_messages gets passed correctly from the step function.
* Changed the output payload from the PivotResult state to the SendNotification state to send the entire input data. We will process the CD2 data update result in the Lambda function.
* Added generate_error_string() to generate an error string. Added get_ecs_log_url() to get the CloudWatch log URL. Replaced the hardcoded error messages with generate_error_string() calls.
* Replaced the encoding for the forward slash in the CloudWatch log URL with the URL-encoded version.
* Encoded the forward slash in the CloudWatch log stream string.
* Fixed a typo in the CloudWatch log URL template string.
* Changed the error string format. Also added a condition to handle the case where the exception message is empty.
* Refactored the message processing part to reflect the input data format changes from the step function.
* Simplified the error format.
* Introduced red and green emojis in the Slack notification message.
* Added the complete table with schema update.
* Shortened the message category headers.
* Removed failed_init and failed_sync because they are included under the failed section.
* Added 2 different thresholds for the number of failed tables. Added a conditional statement to provide different emojis depending on the number of failed tables. When there is a high number of failed tables, we will alert everyone in the Slack notification channel.
* Updated the Slack emoji codes.
* Added the debugging message for testing.
* Revert "Added the dubugging message for testing." This reverts commit 496f107.
* Added a comment regarding the edge case where we use the exception class name when the error message is missing.
* Simplified the Slack notification title. Added the CloudFormation stack name to the title in case you have multiple stacksets for multiple Canvas environments.
* Merged the failed tables and errors in the notification message.
* Fixed the missing bold tag for the notification title.
* Moved get_secret_value() and send_to_slack() into the shared folder because these functions will be used in multiple places.
* Created a Lambda layer for functions shared between Lambda functions.
* Added a step to properly name the environment type in the notification title.
* Moved the environment name conversion code into a separate function and put it into the Lambda layer because it will be used in multiple Lambda functions. Removed sys.path.append() because it is not needed when referencing functions from Lambda layers.
* Added a Slack notification message for the case where the list_tables operation fails for any reason.
* Added the requests module for sending a Slack message.
* Added the red x emoji for the listTables error.
* Added permission for ListTablesRole to access the Slack Webhook URL in AWS Secrets Manager for the Slack notification.
* Added comments.
* Replaced the print() statement with logger.exception().
* Removed the unnecessary instruction on what to do for the invalid_client error.
* Changed the LambdaLayer-related names.
* Removed the parameters that have been commented out.
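Several of the items above concern encoding the forward slash in the CloudWatch log URL. As a minimal sketch of that idea (not the committed code itself): the CloudWatch console deep link expects each `/` inside a log group or log stream name to appear as `$252F` in the URL fragment, which looks like the double-percent-encoded form `%252F` with `%` swapped for `$`. The helper names `encode_for_console` and `build_log_url` and the sample group and stream names are hypothetical.

```python
def encode_for_console(name: str) -> str:
    """Replace each "/" with "$252F", the same substitution get_ecs_log_url() applies."""
    return name.replace("/", "$252F")


def build_log_url(region: str, log_group: str, log_stream: str) -> str:
    """Assemble a CloudWatch console deep link for a single log stream."""
    return (
        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#logsV2:log-groups/log-group/{encode_for_console(log_group)}"
        f"/log-events/{encode_for_console(log_stream)}"
    )


# Hypothetical names, for illustration only.
url = build_log_url("ca-central-1", "/ecs/sync-table", "ecs/sync_table/abc123")
print(url)
```

In the committed code the log group and stream come from the ECS task metadata endpoint rather than from arguments; passing them in here only keeps the sketch runnable outside ECS.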
diff --git a/init_table/Dockerfile b/init_table/Dockerfile
@@ -1,4 +1,4 @@
-FROM docker.io/library/python:3-alpine
+FROM public.ecr.aws/docker/library/python:3-alpine
 
 ARG UID=1012
 ARG GID=1012
diff --git a/lambda-layers/python/shared/__init__.py b/lambda-layers/python/shared/__init__.py
diff --git a/lambda-layers/python/shared/utils.py b/lambda-layers/python/shared/utils.py
@@ -0,0 +1,43 @@
+import boto3
+import requests
+from botocore.exceptions import ClientError
+import logging
+
+logger = logging.getLogger()
+logger.setLevel(logging.INFO)
+
+def get_secret_value(secret_name, region):
+    client = boto3.client("secretsmanager", region)
+
+    try:
+        response = client.get_secret_value(SecretId=secret_name)
+
+        if 'SecretString' in response:
+            secret = response['SecretString']
+        else:
+            # Decode binary secret if necessary
+            secret = response['SecretBinary']
+
+        return secret
+
+    except ClientError as e:
+        logger.exception(f"Error while fetching secret: {e}")
+        return None
+
+def send_to_slack(message, slack_webhook_url):
+    """Send a message to Slack."""
+    try:
+        response = requests.post(slack_webhook_url, json={"text": message})
+
+        if not response.ok:
+            logger.error(f"Failed to send message to Slack: {response.status_code}")
+    except Exception as e:
+        logger.exception(f"An error occurred during the send_to_slack() operation: {e}")
+
+def get_full_environment_name(environment_string):
+    name = environment_string.lower()
+    if "stg" in name or "stag" in name:
+        return "Staging"
+    if "prod" in name:
+        return "Production"
+    return environment_string  # unrecognized name: fall back to the raw value instead of raising UnboundLocalError
diff --git a/list_tables/app.py b/list_tables/app.py
@@ -7,6 +7,7 @@
 from botocore.config import Config
 from dap.api import DAPClient
 from dap.dap_types import Credentials
+from shared.utils import get_secret_value, send_to_slack, get_full_environment_name
 
 region = os.environ.get('AWS_REGION')
 
@@ -23,28 +24,52 @@
 
 namespace = 'canvas'
 
+REGION = os.environ["AWS_REGION"]
+SLACK_WEBHOOK_URL_SECRET_NAME = os.getenv("SLACK_WEBHOOK_SECRET_NAME")
+STACK_NAME =  os.environ["STACK_NAME"]
+SLACK_WEBHOOK_URL = get_secret_value(SLACK_WEBHOOK_URL_SECRET_NAME, REGION)
+
+def generate_error_message(input_error):
+    environment_name = get_full_environment_name(env)
+    red_cross_mark_emoji = ':x:'
+
+    sns_title = f"<!channel> *{STACK_NAME} ({environment_name})*\n"
+    message = f"{red_cross_mark_emoji} The ListTables step failed with the following error: \n {input_error}"
+
+    return sns_title + message
 
 @logger.inject_lambda_context(log_event=True)
 def lambda_handler(event, context: LambdaContext):
-    params = ssm_provider.get_multiple(param_path, max_age=600, decrypt=True)
+    try:
+        params = ssm_provider.get_multiple(param_path, max_age=600, decrypt=True)
+
+        dap_client_id = params['dap_client_id']
+        dap_client_secret = params['dap_client_secret']
 
-    dap_client_id = params['dap_client_id']
-    dap_client_secret = params['dap_client_secret']
+        logger.info(f"dap_client_id: {dap_client_id}")
 
-    logger.info(f"dap_client_id: {dap_client_id}")
+        credentials = Credentials.create(client_id=dap_client_id, client_secret=dap_client_secret)
 
-    credentials = Credentials.create(client_id=dap_client_id, client_secret=dap_client_secret)
+        os.chdir("/tmp/")
 
-    os.chdir("/tmp/")
+        tables = asyncio.get_event_loop().run_until_complete(async_get_tables(api_base_url, credentials, namespace))
 
-    tables = asyncio.get_event_loop().run_until_complete(async_get_tables(api_base_url, credentials, namespace))
+        # we can skip certain tables if necessary by setting an environment variable (comma-separated list)
+        skip_tables = os.environ.get('SKIP_TABLES', '').split(',')
 
-    # we can skip certain tables if necessary by setting an environment variable (comma-separated list)
-    skip_tables = os.environ.get('SKIP_TABLES', '').split(',')
+        tmap = list(map(lambda t: {'table_name': t, "state": "needs_sync"}, [t for t in tables if t not in skip_tables]))
 
-    tmap = list(map(lambda t: {'table_name': t, "state": "needs_sync"}, [t for t in tables if t not in skip_tables]))
+        return {'tables': tmap}
+    except Exception as e:
+        logger.exception(e)
+        message = generate_error_message(e)
 
-    return {'tables': tmap}
+        # Send a slack notification to alert any issue during the list_tables operation.
+        try:
+            send_to_slack(message, SLACK_WEBHOOK_URL)
+        except Exception as e:
+            logger.exception(f"Slack notification failed: {e}")
+            raise
 
 
 async def async_get_tables(api_base_url: str, credentials: Credentials, namespace: str):
diff --git a/list_tables/requirements.txt b/list_tables/requirements.txt
@@ -1,3 +1,4 @@
 aws-lambda-powertools==2.43.1
 pysqlsync==0.8.2
 instructure-dap-client[postgresql]==1.0.0
+requests ~= 2.32.3
diff --git a/slack_notification/__init__.py b/slack_notification/__init__.py
diff --git a/slack_notification/app.py b/slack_notification/app.py
@@ -0,0 +1,74 @@
+import os
+import sys
+import logging
+import ast
+
+from shared.utils import get_secret_value, send_to_slack, get_full_environment_name
+
+logger = logging.getLogger()
+logger.setLevel(logging.INFO)
+
+REGION = os.environ["AWS_REGION"]
+ENVIRONMENT = os.environ["AWS_ENVIRONMENT"]
+SLACK_WEBHOOK_URL_SECRET_NAME = os.getenv("SLACK_WEBHOOK_SECRET_NAME")
+STACK_NAME =  os.environ["STACK_NAME"]
+
+SLACK_WEBHOOK_URL = get_secret_value(SLACK_WEBHOOK_URL_SECRET_NAME, REGION)
+
+def process_table_update_message(message):
+    green_check_mark_emoji = ':white_check_mark:'
+    red_cross_mark_emoji = ':x:'
+    warning_mark_emoji = ':warning:'
+    failed_table_number_emoji = green_check_mark_emoji
+    failed_tables_number_lower_threshold = 2
+    failed_tables_number_upper_threshold = 10
+
+    environment_name = get_full_environment_name(ENVIRONMENT)
+
+    sns_title = f"*{STACK_NAME} ({environment_name})*\n"
+
+    # Parse the stringified message from the step function into Python data.
+    message = ast.literal_eval(message)
+
+    # Extract different table state information and error messages from the input data from the SNS topic.
+    complete_tables = [item["table_name"] for item in message if item.get("state") == "complete"]
+    complete_tables_with_schema_update = [item["table_name"] for item in message if item.get("state") == "complete_with_update"]
+    failed = [item for item in message if item.get("state") in ("failed", "needs_init", "needs_sync")]
+    failed_tables = [item["table_name"] for item in failed]
+    error_messages = [item.get("error_message") for item in failed]
+
+    number_of_failed_tables = len(failed_tables)
+
+    # Apply different emojis and the <!channel> tag depending on the number of errors.
+    if number_of_failed_tables > failed_tables_number_upper_threshold:
+        sns_title = "<!channel> " + sns_title
+        failed_table_number_emoji = red_cross_mark_emoji
+    elif number_of_failed_tables > failed_tables_number_lower_threshold:
+        failed_table_number_emoji = red_cross_mark_emoji
+    elif number_of_failed_tables > 0:
+        failed_table_number_emoji = warning_mark_emoji
+
+    # Create a multi-line message for the slack notification.
+    message = (
+        f'{green_check_mark_emoji} Complete: {str(len(complete_tables))} \n'
+        f'{green_check_mark_emoji} Complete w/ Schema Update: {str(len(complete_tables_with_schema_update))} \n'
+        f'{failed_table_number_emoji} Failed: {str(number_of_failed_tables)} \n'
+        f'Failed Tables: \n' + '\n'.join(f'{i + 1}. {msg}' for i, msg in enumerate(error_messages))
+    )
+
+    message = sns_title + message
+
+    return message
+
+def lambda_handler(event, context):
+
+    # Get the SNS message payload
+    sns_message = event['Records'][0]['Sns']['Message']
+
+    sns_message = process_table_update_message(sns_message)
+
+    try:
+        send_to_slack(sns_message, SLACK_WEBHOOK_URL)
+    except Exception as e:
+        logger.exception(f"Slack notification failed: {e}")
+        raise
diff --git a/slack_notification/requirements.txt b/slack_notification/requirements.txt
@@ -0,0 +1 @@
+requests ~= 2.32.3
diff --git a/sync_table/Dockerfile b/sync_table/Dockerfile
@@ -1,4 +1,4 @@
-FROM docker.io/library/python:3-alpine
+FROM public.ecr.aws/docker/library/python:3-alpine
 
 ARG UID=1012
 ARG GID=1012
diff --git a/sync_table/app.py b/sync_table/app.py
@@ -13,6 +13,7 @@
 from dap.integration.database_errors import NonExistingTableError
 from dap.replicator.sql import SQLReplicator
 from pysqlsync.base import QueryException
+import requests
 
 region = os.environ.get("AWS_REGION")
 
@@ -31,6 +32,38 @@
 param_path = f"/{env}/canvas_data_2"
 api_base_url = os.environ.get("API_BASE_URL", "https://api-gateway.instructure.com")
 
+FUNCTION_NAME = 'sync_table'
+
+def get_ecs_log_url():
+    # Get region from env
+    region = os.environ.get('AWS_REGION', 'ca-central-1')  # fallback if not set
+
+    # Get ECS metadata
+    metadata_uri = os.environ.get('ECS_CONTAINER_METADATA_URI_V4')
+    if not metadata_uri:
+        raise Exception("ECS_CONTAINER_METADATA_URI_V4 not set")
+
+    metadata = requests.get(f"{metadata_uri}/task").json()
+
+    log_group = metadata['Containers'][0]['LogOptions']['awslogs-group']
+    log_stream = metadata['Containers'][0]['LogOptions']['awslogs-stream']
+
+    log_url = (
+        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
+        f"?region={region}#logsV2:log-groups/log-group/{log_group.replace('/', '$252F')}/log-events/{log_stream.replace('/', '$252F')}"
+    )
+
+    return log_url
+
+def generate_error_string(function_name, table_name, state, exception, cloudwatch_log_url):
+    if len(str(exception)) != 0:
+        return f"{table_name} - {function_name} - {state}, Error: {str(exception)} (<{cloudwatch_log_url}|CloudWatch Log>)"
+
+    # This is for the ProcessingError thrown by the tables: grading_period_groups and grading_periods.
+    # This particular error object doesn't have any error message. In this case, we use the name of the class of an exception object.
+    else:
+        return f"{table_name} - {function_name} - {state}, Error: {type(exception).__name__} (<{cloudwatch_log_url}|CloudWatch Log>)"
+
 def start(event):
     params = ssm_provider.get_multiple(param_path, max_age=600, decrypt=True)
 
@@ -52,6 +85,8 @@ def start(event):
         client_id=dap_client_id, client_secret=dap_client_secret
     )
 
+    cloudwatch_log_url = get_ecs_log_url()
+
     table_name = event["table_name"]
 
     logger.info(f"syncing table: {table_name}")
@@ -80,22 +115,27 @@ def start(event):
             except Exception as e:
                 logger.exception(e)
                 event["state"] = "failed"
+                # Record the error as a human-readable string.
+                event["error_message"] = generate_error_string(FUNCTION_NAME, table_name, event["state"], e, cloudwatch_log_url)
             finally:
                 restore_dependencies(db_name="cd2", table_name=table_name)
         else:
             event["state"] = "failed"
     except NonExistingTableError as e:
         logger.exception(e)
         event["state"] = "needs_init"
+        event["error_message"] = generate_error_string(FUNCTION_NAME, table_name, event["state"], e, cloudwatch_log_url)
     except ValueError as e:
         logger.exception(e)
         if "table not initialized" in str(e):
             event["state"] = "needs_init"
         else:
             event["state"] = "failed"
+        event["error_message"] = generate_error_string(FUNCTION_NAME, table_name, event["state"], e, cloudwatch_log_url)
     except Exception as e:
         logger.exception(e)
         event["state"] = "failed"
+        event["error_message"] = generate_error_string(FUNCTION_NAME, table_name, event["state"], e, cloudwatch_log_url)
 
     logger.info(f"event: {event}")
 
@@ -105,7 +145,6 @@ def start(event):
 async def sync_table(credentials, api_base_url, db_connection, namespace, table_name):
     async with DAPClient(api_base_url, credentials) as session:
         await SQLReplicator(session, db_connection).synchronize(namespace, table_name)
-        
 
 
 def drop_dependencies(db_name, table_name):
@@ -176,9 +215,9 @@ def restore_dependencies(db_name, table_name):
     if token:
         stepfunctions.send_task_success(
             taskToken=token,
-            output=json.dumps(payload)) 
+            output=json.dumps(payload))
 
-""" 
+"""
     if token and result['state'] == 'failed':
         stepfunctions.send_task_failure(
             taskToken=token,
diff --git a/sync_table/requirements.txt b/sync_table/requirements.txt
@@ -1,4 +1,5 @@
 aws-lambda-powertools==2.43.1
 pysqlsync==0.8.2
 instructure-dap-client[postgresql]==1.4.0
-boto3==1.34.144
+boto3==1.34.144
+requests ~= 2.32.3
diff --git a/template.yaml b/template.yaml
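
As a rough illustration of how the Slack notification Lambda interprets its input, the sketch below parses a stringified list of table results with `ast.literal_eval` and groups the entries by `state`, mirroring the logic in `slack_notification/app.py`. The sample payload, the table names, and the helper name `summarize` are hypothetical; the real payload shape is produced by the Step Functions state machine, which is not shown in this diff.

```python
import ast

# States that the notification code reports under "Failed".
FAILED_STATES = {"failed", "needs_init", "needs_sync"}


def summarize(payload: str) -> dict:
    """Parse the stringified result list and count tables per outcome."""
    results = ast.literal_eval(payload)  # evaluates literals only; no code is executed
    failed = [r for r in results if r.get("state") in FAILED_STATES]
    return {
        "complete": sum(r.get("state") == "complete" for r in results),
        "complete_with_update": sum(r.get("state") == "complete_with_update" for r in results),
        "failed": len(failed),
        "errors": [r.get("error_message") for r in failed],
    }


# Hypothetical payload, for illustration only.
sample = str([
    {"table_name": "users", "state": "complete"},
    {"table_name": "courses", "state": "complete_with_update"},
    {"table_name": "grading_periods", "state": "failed",
     "error_message": "grading_periods - sync_table - failed, Error: ProcessingError"},
])
summary = summarize(sample)
print(summary)
```

Tables left in `needs_init` or `needs_sync` are counted as failed here because the committed code treats any table that did not reach a complete state as a failure.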
