{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Importing S3 Data into FinSpace\n", "\n", "This notebook shows how to use the FinSpace APIs to create a dataset and populate it with data from an S3 source that is external to FinSpace.\n", "\n", "## Preparation\n", "Before running the cells below, create an S3 bucket, load the dataset (in CSV format) into that bucket, and then apply the bucket policy described below so that FinSpace can access the CSV file on S3.\n", "\n", "## DataSet\n", "ESG News Sentiment Dataset - Trial\n", "\n", "Provided By: Amenity Analytics \n", "\n", "https://aws.amazon.com/marketplace/pp/prodview-4doy3qrqm3y6g?ref_=srh_res_product_title\n", "\n", "Copy the data into a bucket that the FinSpace service account is entitled to access. That bucket must grant the \n", "s3:GetObject and s3:ListBucket actions to the service account ARN.\n", "\n", "FinSpace Service Account ARN (replace with your environment's service account): \n", " arn:aws:iam::**INFRASTRUCTURE_ACCOUNT_ID**:role/FinSpaceServiceRole\n", "\n", "## S3 Bucket Policy to be used\n", "\n", "- S3 bucket is externally accessible\n", "- replace INFRASTRUCTURE_ACCOUNT_ID with the account ID of your environment's service account\n", "- replace S3_BUCKET with the name of your S3 bucket\n", "\n", "```\n", "{\n", " \"Version\": \"2012-10-17\",\n", " \"Id\": \"CrossAccountAccess\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"AWS\": [\n", " \"arn:aws:iam::INFRASTRUCTURE_ACCOUNT_ID:role/FinSpaceServiceRole\"\n", " ]\n", " },\n", " \"Action\": \"s3:GetObject\",\n", " \"Resource\": \"arn:aws:s3:::S3_BUCKET/*\"\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"AWS\": [\n", " \"arn:aws:iam::INFRASTRUCTURE_ACCOUNT_ID:role/FinSpaceServiceRole\"\n", " ]\n", " },\n", " \"Action\": \"s3:ListBucket\",\n", " \"Resource\": \"arn:aws:s3:::S3_BUCKET\"\n", " }\n", " ]\n", "}\n", " ```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connecting to cluster - fin-cluster-3d77[ar6oql9k]\n", "cleared existing credential location\n", "Persisted krb5.conf secret to /etc/krb5.conf\n", "re-establishing connection...\n", "Persisted keytab secret to /home/sagemaker-user/livy.keytab\n", "Authenticated to Spark cluster\n", "Persisted sparkmagic config to /home/sagemaker-user/.sparkmagic/config.json\n", "Started Spark cluster with clusterId: ar6oql9k\n", "finished reloading all magics & configurations\n", "Persisted finspace cluster connection info to /home/sagemaker-user/.sparkmagic/finspace_connection_info.json\n" ] } ], "source": [ "%local\n", "from aws.finspace.cluster import FinSpaceClusterManager\n", "\n", "# if this was already run, no need to run again\n", "if 'finspace_clusters' not in globals():\n", " finspace_clusters = FinSpaceClusterManager()\n", " finspace_clusters.auto_connect()\n", "else:\n", " print(f'connected to cluster: {finspace_clusters.get_connected_cluster_id()}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Import Python Utility Classes" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starting Spark application\n" ] }, { "data": { "text/html": [ "
ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
---|---|---|---|---|---|---|
2 | application_1628754622409_0003 | pyspark | idle | Link | Link | ✔ |