From 7eef0aca44be0947abb9fe289e1d7045586ecff0 Mon Sep 17 00:00:00 2001
From: Jun-te Kim <juntekim@googlemail.com>
Date: Fri, 27 Feb 2026 12:50:16 +0000
Subject: [PATCH] save

---
 etl/hubSpotClient/scripts/README.md | 325 ++++++++++++++++++++++++++++
 1 file changed, 325 insertions(+)
 create mode 100644 etl/hubSpotClient/scripts/README.md

diff --git a/etl/hubSpotClient/scripts/README.md b/etl/hubSpotClient/scripts/README.md
new file mode 100644
index 0000000..81360e3
--- /dev/null
+++ b/etl/hubSpotClient/scripts/README.md
@@ -0,0 +1,325 @@
+# HubSpot Client Scripts - Onboarding Guide
+
+## Overview
+
+The scripts in this directory form a **3-stage ETL pipeline** for syncing HubSpot data (companies and deals) into the local database:
+
+1. **Stage 0 (Seed)**: `hubspot_company.py` — Load company master records
+2. **Stage 1 (Bulk Load)**: `hubspot_gather_all_deals.py` — Initial load of all deals for a company
+3. **Stage 2 (Sync/Update)**: `hubspot_update_script.py` — Ongoing synchronization (runs on a schedule)
+
+These scripts work together to keep your local database in sync with HubSpot while handling photo uploads to S3 and maintaining data integrity.
+
+---
+
+## Onboarding a New Client
+
+Follow these steps in order when adding a new company/client:
+
+### Step 1: Add the Company to the `Companies` Enum
+
+Edit `../hubspotClient.py` and add your new company to the `Companies` enum class:
+
+```python
+class Companies(Enum):
+    ABRI = "237615001799"
+    SOUTHERN_HOUSING_GROUP = "109343619305"
+    LIVEWEST = "86205872354"
+    SURESERVE = "301745289413"
+    HOMEGROUP = "94946071794"
+    APPLE = "184769046716"
+    THE_GUINESS_PARTNERSHIP = "86970043613"
+    YOUR_NEW_COMPANY = "YOUR_HUBSPOT_COMPANY_ID"  # ← Add here
+```
+
+**How to find your HubSpot Company ID:**
+- Log into HubSpot
+- Navigate to **Contacts** → **Companies**
+- Click on the company name
+- The URL will be: `https://app.hubspot.com/crm/xxx/objects/companies/COMPANY_ID` — copy that ID
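+
+If you have several companies to add, the ID can be pulled out of the URL programmatically. The sketch below is a minimal illustration, assuming the URL follows the pattern shown above; `company_id_from_url` is a hypothetical helper and is not part of `hubspotClient.py`.
+
+```python
+from urllib.parse import urlparse
+
+
+def company_id_from_url(url: str) -> str:
+    """Return the path segment that follows 'companies' in a HubSpot record URL.
+
+    Hypothetical helper: assumes the .../objects/companies/COMPANY_ID layout.
+    """
+    segments = [s for s in urlparse(url).path.split("/") if s]
+    if "companies" not in segments:
+        raise ValueError(f"No 'companies' segment in URL: {url}")
+    idx = segments.index("companies")
+    if idx + 1 >= len(segments):
+        raise ValueError(f"No company ID after 'companies' in URL: {url}")
+    return segments[idx + 1]
+```
+
+The returned string is the value to paste into the `Companies` enum.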
+
+### Step 2: Update Each Script to Include Your Company
+
+After adding the enum member, update the company lists in all three scripts:
+
+#### `hubspot_company.py` (line ~6)
+```python
+companies = [
+    Companies.THE_GUINESS_PARTNERSHIP,
+    Companies.YOUR_NEW_COMPANY  # ← Add here
+]
+```
+
+#### `hubspot_gather_all_deals.py` (line ~7)
+```python
+valuable_companies = [
+    Companies.THE_GUINESS_PARTNERSHIP.value,
+    Companies.YOUR_NEW_COMPANY.value  # ← Add here
+]
+```
+
+#### `hubspot_update_script.py` (line ~12)
+```python
+companies = [
+    Companies.THE_GUINESS_PARTNERSHIP,
+    Companies.YOUR_NEW_COMPANY  # ← Add here
+]
+```
+
+### Step 3: Run `hubspot_company.py` (One-time setup)
+
+This script seeds the company record into the `hubspot_company_data` table. Run it once:
+
+```bash
+python etl/hubSpotClient/scripts/hubspot_company.py
+```
+
+**What it does:**
+- Connects to HubSpot and fetches company information (name, ID)
+- Inserts the company record into the local database
+
+**Output:** You'll see the company added to the `hubspot_company_data` table.
+
+### Step 3.5: Update Group ID in Database (Manual)
+
+After the company record is created, you need to manually update the **group ID** for the new company. This is done via DBeaver or pgAdmin:
+
+**Steps:**
+1. Open DBeaver or pgAdmin and connect to the database
+2. Navigate to the `hubspot_company_data` table
+3. Find the row with your new company (search by `company_name` or `company_id`)
+4. Edit the **`group_id`** column to the portfolio/group ID you want to track for this company
+5. Save the changes
+
+**Example Query** (if you prefer SQL):
+```sql
+UPDATE hubspot_company_data
+SET group_id = 'YOUR_GROUP_ID'
+WHERE company_id = 'YOUR_COMPANY_ID';
+```
+
+**What is Group ID?**
+- The group ID identifies which portfolio/group in your system this company belongs to
+- Each company can be associated with one group ID for tracking and organization
+- This field is used for tracking and reporting across your survey data
+
+### Step 4: Run `hubspot_gather_all_deals.py` (One-time bulk load)
+
+This script performs the initial load of all deals for your company, filtered by the `OPERATIONS_SOCIAL_HOUSING` pipeline. Run it once per company:
+
+```bash
+python etl/hubSpotClient/scripts/hubspot_gather_all_deals.py
+```
+
+**What it does:**
+- Fetches all deal IDs associated with your company from HubSpot
+- For each deal, retrieves detailed properties:
+  - `dealname`, `dealstage`, `pipeline`, `outcome`, `outcome_notes`, `project_code`
+  - `major_condition_issue_description`, `major_condition_issue_photos`
+  - `coordination_status__stage_1_`, `retrofit_design_status`
+- Filters to only deals in the `OPERATIONS_SOCIAL_HOUSING` pipeline
+- Fetches the associated listing (UPRN, property IDs)
+- Inserts each deal into the `hubspot_data` table
+- **Downloads photo evidence files** and uploads them to S3 (bucket: `retrofit-data-dev`)
+
+**⚠️ Note:** This script can take a long time if your company has many deals. It processes deals serially with progress reporting via `tqdm`.
+
+**Output:** Deals appear in the `hubspot_data` table; photos appear in S3 at `s3://retrofit-data-dev/hubspot/awaabs_law_evidence/`.
+
+### Step 5: `hubspot_update_script.py` (Automatic scheduling)
+
+After the initial setup, **no manual action is needed**. This script runs automatically every 15 minutes during working hours as a scheduled job.
+
+**What it does:**
+- Queries the local database for all stored deals for each company
+- Compares each deal's stored fields against the live HubSpot data (13 fields checked)
+- Updates the database if any values have changed in HubSpot
+- **Uploads newly available photos** to S3 (with SHA-256 integrity verification)
+- Prints a summary report of changes, updates, and any failures
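+
+The SHA-256 integrity check in the list above can be pictured as comparing the digest of the bytes downloaded from HubSpot with the digest of the bytes read back after upload. This is a minimal sketch using only the standard library; the function names are hypothetical, and how the real script obtains the stored bytes is an assumption.
+
+```python
+import hashlib
+
+
+def sha256_hex(data: bytes) -> str:
+    """Return the SHA-256 digest of data as a hex string."""
+    return hashlib.sha256(data).hexdigest()
+
+
+def upload_is_intact(downloaded: bytes, stored: bytes) -> bool:
+    """True when the stored copy hashes to the same digest as the download.
+
+    Hypothetical helper: the real script may compare digests differently.
+    """
+    return sha256_hex(downloaded) == sha256_hex(stored)
+```
+
+Under this scheme, the S3 URL would only be written to the database when `upload_is_intact` returns `True`.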
+
+---
+
+## Script Reference
+
+### `hubspot_company.py`
+
+**Stage:** Seed (one-time setup)
+**Frequency:** Run once per new company
+**Speed:** Fast
+
+**Purpose:** Load company master data into the database.
+
+**Database Output:**
+- Table: `hubspot_company_data`
+- Fields: `company_id`, `company_name`
+
+**Code Flow:**
+```
+For each company in config:
+  1. Call HubSpot API: get_company_information(company_id)
+  2. Insert record into hubspot_company_data table
+```
+
+---
+
+### `hubspot_gather_all_deals.py`
+
+**Stage:** Bulk Load (one-time per company)
+**Frequency:** Run once per company (manually triggered)
+**Speed:** Slow (serial processing of all deals)
+
+**Purpose:** Perform initial load of all deals for target companies.
+
+**Database Output:**
+- Table: `hubspot_data`
+- Fields: `deal_id`, `deal_name`, `company_id`, `stage`, `outcome`, `photos_s3_url`, and others
+
+**S3 Output:**
+- Bucket: `retrofit-data-dev`
+- Path: `hubspot/awaabs_law_evidence/{filename}`
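+
+As an illustration of the path layout above, the sketch below builds an S3 key from a photo URL. It assumes the filename is taken from the last segment of the URL path; the real script may derive the filename differently, and `s3_key_for` is a hypothetical name.
+
+```python
+import posixpath
+from urllib.parse import urlparse
+
+S3_PREFIX = "hubspot/awaabs_law_evidence"  # prefix documented above
+
+
+def s3_key_for(photo_url: str) -> str:
+    """Build the S3 key for a photo (assumption: filename = last URL path segment)."""
+    filename = posixpath.basename(urlparse(photo_url).path)
+    if not filename:
+        raise ValueError(f"Cannot derive a filename from: {photo_url}")
+    return f"{S3_PREFIX}/{filename}"
+```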
+
+**Code Flow:**
+```
+For each company in config:
+  1. Fetch all deal IDs from HubSpot
+  2. For each deal:
+     a. Get deal properties from HubSpot
+     b. Filter by OPERATIONS_SOCIAL_HOUSING pipeline
+     c. Fetch associated listing data (UPRN, property IDs)
+     d. Insert deal into hubspot_data table
+     e. If photos exist: download from HubSpot URL, upload to S3, save S3 URL to DB
+     f. Print progress: "Uploaded deal_id {id} to db"
+```
+
+**Error Handling:** None — script will abort on first error. Re-run to retry.
+
+---
+
+### `hubspot_update_script.py`
+
+**Stage:** Sync/Update (ongoing maintenance)
+**Frequency:** Every 15 minutes during working hours (automated schedule)
+**Speed:** Fast (only processes stored deals, compares, updates deltas)
+
+**Purpose:** Keep database synchronized with live HubSpot data; handle new/updated photos.
+
+**Database Operations:**
+- Reads: All deals from `hubspot_data` for each company
+- Writes: Updates only when fields differ from HubSpot
+- S3 Uploads: New or previously missing photos
+
+**Summary Report:**
+After completion, prints a table of per-company statistics:
+```
+Company | Checked | Updated | Up-to-date | Failed
+```
+
+Plus detailed error messages for any failed updates.
+
+**Code Flow:**
+```
+1. Initialize HubSpot client (warm-up: get_deal_stages)
+2. For each company:
+   a. Query DB for all deals with company_id
+   b. For each deal:
+      - Fetch live deal data from HubSpot
+      - Compare 13 fields: deal_id, company_id, landlord_property_id, outcome,
+                          dealstage, dealname, project_code, uprn,
+                          outcome_notes, major_condition_issue_description,
+                          major_condition_issue_photos, coordination_status,
+                          design_status
+      - If any field differs: call upsert_hubspot_deal() to update DB
+      - If photos exist in HubSpot but not yet in S3:
+        * Download file from HubSpot URL
+        * Upload to S3
+        * Verify SHA-256 hash integrity
+        * Save S3 URL back to DB
+      - Collect success/failure counts
+   c. Print per-company summary
+3. Print all failures (if any) with error messages
+```
+
+**Error Handling:** Wrapped in try/except per deal. Failures are logged, and the script continues to the next deal.
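+
+The compare-then-update loop and its per-deal error handling can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: `changed_fields`, `sync_deals`, `fetch_live`, and `apply_update` are hypothetical names, deals are modelled as plain dicts, and only a subset of the 13 fields is listed.
+
+```python
+FIELDS_TO_COMPARE = ("dealname", "dealstage", "outcome")  # illustrative subset
+
+
+def changed_fields(stored: dict, live: dict, fields=FIELDS_TO_COMPARE) -> dict:
+    """Map each differing field name to its (stored, live) pair."""
+    return {
+        name: (stored.get(name), live.get(name))
+        for name in fields
+        if stored.get(name) != live.get(name)
+    }
+
+
+def sync_deals(stored_deals, fetch_live, apply_update):
+    """Compare each stored deal with its live copy; continue when one deal fails."""
+    stats = {"checked": 0, "updated": 0, "up_to_date": 0, "failed": 0}
+    failures = []
+    for stored in stored_deals:
+        stats["checked"] += 1
+        try:
+            live = fetch_live(stored["deal_id"])
+            if changed_fields(stored, live):
+                apply_update(live)  # stands in for the real upsert call
+                stats["updated"] += 1
+            else:
+                stats["up_to_date"] += 1
+        except Exception as exc:  # one bad deal must not stop the run
+            stats["failed"] += 1
+            failures.append((stored.get("deal_id"), str(exc)))
+    return stats, failures
+```
+
+The `stats` dict mirrors the columns of the summary report, and `failures` holds the per-deal error messages printed at the end.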
+
+---
+
+## Common Tasks
+
+### I added a new company but deals aren't showing up
+
+**Checklist:**
+- [ ] Company added to `Companies` enum in `hubspotClient.py`
+- [ ] Company added to the company list in **all three scripts** (`companies`, or `valuable_companies` in `hubspot_gather_all_deals.py`)
+- [ ] Ran `hubspot_company.py` successfully
+- [ ] Ran `hubspot_gather_all_deals.py` and watched for "Uploaded deal_id" messages
+- [ ] Check database: `SELECT COUNT(*) FROM hubspot_data WHERE company_id = 'YOUR_ID'`
+- [ ] Check HubSpot: Does the company have any deals in the OPERATIONS_SOCIAL_HOUSING pipeline?
+
+### Deals exist in HubSpot but aren't syncing
+
+The `hubspot_gather_all_deals.py` script only loads deals in the `OPERATIONS_SOCIAL_HOUSING` pipeline. If deals are in a different pipeline, they won't be loaded. Check the deal's pipeline in HubSpot.
+
+### Photos aren't uploading
+
+- The first run of `hubspot_gather_all_deals.py` uploads photos at import time
+- Later runs of `hubspot_update_script.py` upload any newly available photos
+- Check S3 bucket `retrofit-data-dev` under `hubspot/awaabs_law_evidence/`
+- Check DB field `major_condition_issue_photos` (photo S3 URL is stored here)
+
+### I need to re-sync everything for a company
+
+1. Clear the deals from the database:
+   ```sql
+   DELETE FROM hubspot_data WHERE company_id = 'YOUR_COMPANY_ID';
+   ```
+2. Clear the company:
+   ```sql
+   DELETE FROM hubspot_company_data WHERE company_id = 'YOUR_COMPANY_ID';
+   ```
+3. Re-run from **Step 3** above (run `hubspot_company.py`, reapply the `group_id` from Step 3.5 because the company row was deleted, then run `hubspot_gather_all_deals.py`)
+
+---
+
+## Dependencies
+
+All scripts depend on:
+- `HubSpotClient` from `../hubspotClient.py` — Handles HubSpot API calls
+- `HubspotTodb` from `../../db/hubSpotLoad.py` — Handles database operations (insert/upsert/query)
+- `tqdm` — Progress bars
+- Python `requests` — HTTP downloads for photo files
+
+Environment Requirements:
+- Valid HubSpot API token (configured in `HubSpotClient.__init__()`)
+- Database connection (configured in `HubspotTodb`)
+- S3 credentials (for photo uploads)
+- Network access to HubSpot API and S3
+
+---
+
+## Notes & Tips
+
+1. **Idempotency:** `hubspot_gather_all_deals.py` and `hubspot_update_script.py` use upsert logic, so they can be run multiple times without creating duplicates.
+
+2. **Large Portfolios:** If a company has thousands of deals, `hubspot_gather_all_deals.py` will take a while. Watch the `tqdm` progress bar to track how far it has got.
+
+3. **Error Handling:** `hubspot_update_script.py` has error handling per deal. `hubspot_company.py` and `hubspot_gather_all_deals.py` do not — any failure aborts the script. If interrupted, simply re-run.
+
+4. **Schedule:** `hubspot_update_script.py` is scheduled to run every 15 minutes during business hours (typically configured as a cron job or similar scheduler).
+
+5. **Photo Integrity:** `hubspot_update_script.py` verifies downloaded photos using SHA-256 hashing before committing the S3 URL to the database.
+
+6. **Unused Variables:** `hubspot_gather_all_deals.py` populates the `deals_to_add` and `deal_to_companies` dicts but never uses them downstream. This is harmless but could be cleaned up in a future refactor.
+
+---
+
+## Troubleshooting
+
+| Issue | Likely Cause | Solution |
+|-------|--------------|----------|
+| "Company not found" error | Company enum not added or typo in name | Double-check `Companies` enum in `hubspotClient.py` |
+| Deal count mismatch | Company wasn't added to the script's companies list | Ensure company is in `valuable_companies` / `companies` in all 3 scripts |
+| Slow script execution | Large portfolio or network latency | Normal for first run; `hubspot_update_script.py` is faster on subsequent runs |
+| Photos not uploading | Deal doesn't have `major_condition_issue_photos` property | Photos only upload if HubSpot deal has photos attached |
+| S3 upload fails | Credentials or bucket issues | Check IAM permissions and bucket name (`retrofit-data-dev`) |
+| Update script reports failures | Stale data or missing DB fields | Check error messages in summary report; may need to re-sync company |
+