From 7eef0aca44be0947abb9fe289e1d7045586ecff0 Mon Sep 17 00:00:00 2001 From: Jun-te Kim Date: Fri, 27 Feb 2026 12:50:16 +0000 Subject: [PATCH] save --- etl/hubSpotClient/scripts/README.md | 325 ++++++++++++++++++++++++++++ 1 file changed, 325 insertions(+) create mode 100644 etl/hubSpotClient/scripts/README.md diff --git a/etl/hubSpotClient/scripts/README.md b/etl/hubSpotClient/scripts/README.md new file mode 100644 index 0000000..81360e3 --- /dev/null +++ b/etl/hubSpotClient/scripts/README.md @@ -0,0 +1,325 @@ +# HubSpot Client Scripts - Onboarding Guide + +## Overview + +The scripts in this directory form a **3-stage ETL pipeline** for syncing HubSpot data (companies and deals) into the local database: + +1. **Stage 0 (Seed)**: `hubspot_company.py` — Load company master records +2. **Stage 1 (Bulk Load)**: `hubspot_gather_all_deals.py` — Initial load of all deals for a company +3. **Stage 2 (Sync/Update)**: `hubspot_update_script.py` — Ongoing synchronization (runs on a schedule) + +These scripts work together to keep your local database in sync with HubSpot while handling photo uploads to S3 and maintaining data integrity. + +--- + +## Onboarding a New Client + +Follow these steps in order when adding a new company/client: + +### Step 1: Add the Company to the `Companies` Enum + +Edit `../hubspotClient.py` and add your new company to the `Companies` enum class: + +```python +class Companies(Enum): + ABRI = "237615001799" + SOUTHERN_HOUSING_GROUP = "109343619305" + LIVEWEST = "86205872354" + SURESERVE = "301745289413" + HOMEGROUP = "94946071794" + APPLE = "184769046716" + THE_GUINESS_PARTNERSHIP = "86970043613" + YOUR_NEW_COMPANY = "YOUR_HUBSPOT_COMPANY_ID" # ← Add here +``` + +**How to find your HubSpot Company ID:** +- Log into HubSpot +- Navigate to **Contacts** → **Companies** +- Click on the company name +- The URL will be: `https://app.hubspot.com/crm/xxx/objects/companies/COMPANY_ID` — copy that ID + +### Step 2: Update Each Script to Include Your Company + +After adding the enum, update the company lists in all three scripts: + +#### `hubspot_company.py` (line ~6) +```python +companies = [ + Companies.THE_GUINESS_PARTNERSHIP, + Companies.YOUR_NEW_COMPANY # ← Add here +] +``` + +#### `hubspot_gather_all_deals.py` (line ~7) +```python +valuable_companies = [ + Companies.THE_GUINESS_PARTNERSHIP.value, + Companies.YOUR_NEW_COMPANY.value # ← Add here +] +``` + +#### `hubspot_update_script.py` (line ~12) +```python +companies = [ + Companies.THE_GUINESS_PARTNERSHIP, + Companies.YOUR_NEW_COMPANY # ← Add here +] +``` + +### Step 3: Run `hubspot_company.py` (One-time setup) + +This script seeds the company record into the `hubspot_company_data` table. Run it once: + +```bash +python etl/hubSpotClient/scripts/hubspot_company.py +``` + +**What it does:** +- Connects to HubSpot and fetches company information (name, ID) +- Inserts the company record into the local database + +**Output:** You'll see the company added to `hubspot_company_data` table. + +### Step 3.5: Update Group ID in Database (Manual) + +After the company record is created, you need to manually update the **group ID** for the new company. This is done via DBeaver or pgAdmin: + +**Steps:** +1. Open DBeaver or pgAdmin and connect to the database +2. Navigate to the `hubspot_company_data` table +3. Find the row with your new company (search by `company_name` or `company_id`) +4. Edit the **`group_id`** column to the portfolio/group ID you want to track for this company +5. Save the changes + +**Example Query** (if you prefer SQL): +```sql +UPDATE hubspot_company_data +SET group_id = 'YOUR_GROUP_ID' +WHERE company_id = 'YOUR_COMPANY_ID'; +``` + +**What is Group ID?** +- The group ID identifies which portfolio/group in your system this company belongs to +- Each company can be associated with one group ID for tracking and organization +- This field is used for tracking and reporting across your survey data + +### Step 4: Run `hubspot_gather_all_deals.py` (One-time bulk load) + +This script performs the initial load of all deals for your company, filtered by the `OPERATIONS_SOCIAL_HOUSING` pipeline. Run it once per company: + +```bash +python etl/hubSpotClient/scripts/hubspot_gather_all_deals.py +``` + +**What it does:** +- Fetches all deal IDs associated with your company from HubSpot +- For each deal, retrieves detailed properties: + - `dealname`, `dealstage`, `pipeline`, `outcome`, `outcome_notes`, `project_code` + - `major_condition_issue_description`, `major_condition_issue_photos` + - `coordination_status__stage_1_`, `retrofit_design_status` +- Filters to only deals in the `OPERATIONS_SOCIAL_HOUSING` pipeline +- Fetches the associated listing (UPRN, property IDs) +- Inserts each deal into the `hubspot_data` table +- **Downloads photo evidence files** and uploads them to S3 (bucket: `retrofit-data-dev`) + +**⚠️ Note:** This script can take a long time if your company has many deals. It processes deals serially with progress reporting via `tqdm`. + +**Output:** Deals appear in `hubspot_data` table; photos appear in S3 at `s3://retrofit-data-dev/hubspot/awaabs_law_evidence/`. + +### Step 5: `hubspot_update_script.py` (Automatic scheduling) + +After the initial setup, **no manual action is needed**. This script runs automatically every 15 minutes during working hours as a scheduled job. + +**What it does:** +- Queries the local database for all stored deals for each company +- Compares each deal's stored fields against the live HubSpot data (13 fields checked) +- Updates the database if any values have changed in HubSpot +- **Uploads newly available photos** to S3 (with SHA-256 integrity verification) +- Prints a summary report of changes, updates, and any failures + +--- + +## Script Reference + +### `hubspot_company.py` + +**Stage:** Seed (one-time setup) +**Frequency:** Run once per new company +**Speed:** Fast + +**Purpose:** Load company master data into the database. + +**Database Output:** +- Table: `hubspot_company_data` +- Fields: `company_id`, `company_name` + +**Code Flow:** +``` +For each company in config: + 1. Call HubSpot API: get_company_information(company_id) + 2. Insert record into hubspot_company_data table +``` + +--- + +### `hubspot_gather_all_deals.py` + +**Stage:** Bulk Load (one-time per company) +**Frequency:** Run once per company (manually triggered) +**Speed:** Slow (serial processing of all deals) + +**Purpose:** Perform initial load of all deals for target companies. + +**Database Output:** +- Table: `hubspot_data` +- Fields: `deal_id`, `deal_name`, `company_id`, `stage`, `outcome`, `photos_s3_url`, and others + +**S3 Output:** +- Bucket: `retrofit-data-dev` +- Path: `hubspot/awaabs_law_evidence/{filename}` + +**Code Flow:** +``` +For each company in config: + 1. Fetch all deal IDs from HubSpot + 2. For each deal: + a. Get deal properties from HubSpot + b. Filter by OPERATIONS_SOCIAL_HOUSING pipeline + c. Fetch associated listing data (UPRN, property IDs) + d. Insert deal into hubspot_data table + e. If photos exist: download from HubSpot URL, upload to S3, save S3 URL to DB + f. Print progress: "Uploaded deal_id {id} to db" +``` + +**Error Handling:** None — script will abort on first error. Re-run to retry. + +--- + +### `hubspot_update_script.py` + +**Stage:** Sync/Update (ongoing maintenance) +**Frequency:** Every 15 minutes during working hours (automated schedule) +**Speed:** Fast (only processes stored deals, compares, updates deltas) + +**Purpose:** Keep database synchronized with live HubSpot data; handle new/updated photos. + +**Database Operations:** +- Reads: All deals from `hubspot_data` for each company +- Writes: Updates only when fields differ from HubSpot +- S3 Uploads: New or previously missing photos + +**Summary Report:** +After completion, prints a table of per-company statistics: +``` +Company | Checked | Updated | Up-to-date | Failed +``` + +Plus detailed error messages for any failed updates. + +**Code Flow:** +``` +1. Initialize HubSpot client (warm-up: get_deal_stages) +2. For each company: + a. Query DB for all deals with company_id + b. For each deal: + - Fetch live deal data from HubSpot + - Compare 13 fields: deal_id, company_id, landlord_property_id, outcome, + dealstage, dealname, project_code, uprn, + outcome_notes, major_condition_issue_description, + major_condition_issue_photos, coordination_status, + design_status + - If any field differs: call upsert_hubspot_deal() to update DB + - If photos exist in HubSpot but not yet in S3: + * Download file from HubSpot URL + * Upload to S3 + * Verify SHA-256 hash integrity + * Save S3 URL back to DB + - Collect success/failure counts + c. Print per-company summary +3. Print all failures (if any) with error messages +``` + +**Error Handling:** Wrapped in try/except per deal. Failures are logged, and the script continues to the next deal. + +--- + +## Common Tasks + +### I added a new company but deals aren't showing up + +**Checklist:** +- [ ] Company added to `Companies` enum in `hubspotClient.py` +- [ ] Company added to the `companies` list in **all three scripts** +- [ ] Ran `hubspot_company.py` successfully +- [ ] Ran `hubspot_gather_all_deals.py` and watched for "Uploaded deal_id" messages +- [ ] Check database: `SELECT COUNT(*) FROM hubspot_data WHERE company_id = 'YOUR_ID'` +- [ ] Check HubSpot: Does the company have any deals in the OPERATIONS_SOCIAL_HOUSING pipeline? + +### Deals exist in HubSpot but aren't syncing + +The `hubspot_gather_all_deals.py` script only loads deals in the `OPERATIONS_SOCIAL_HOUSING` pipeline. If deals are in a different pipeline, they won't be loaded. Check the deal's pipeline in HubSpot. + +### Photos aren't uploading + +- First run of `hubspot_gather_all_deals.py` should upload photos at import time +- Subsequent runs of `hubspot_update_script.py` will upload newly available photos +- Check S3 bucket `retrofit-data-dev` under `hubspot/awaabs_law_evidence/` +- Check DB field `major_condition_issue_photos` (photo S3 URL is stored here) + +### I need to re-sync everything for a company + +1. Clear the deals from the database: + ```sql + DELETE FROM hubspot_data WHERE company_id = 'YOUR_COMPANY_ID'; + ``` +2. Clear the company: + ```sql + DELETE FROM hubspot_company_data WHERE company_id = 'YOUR_COMPANY_ID'; + ``` +3. Re-run from **Step 3** above (run `hubspot_company.py`, then `hubspot_gather_all_deals.py`) + +--- + +## Dependencies + +All scripts depend on: +- `HubSpotClient` from `../hubspotClient.py` — Handles HubSpot API calls +- `HubspotTodb` from `../../db/hubSpotLoad.py` — Handles database operations (insert/upsert/query) +- `tqdm` — Progress bars +- Python `requests` — HTTP downloads for photo files + +Environment Requirements: +- Valid HubSpot API token (configured in `HubSpotClient.__init__()`) +- Database connection (configured in `HubspotTodb`) +- S3 credentials (for photo uploads) +- Network access to HubSpot API and S3 + +--- + +## Notes & Tips + +1. **Idempotency:** `hubspot_gather_all_deals.py` and `hubspot_update_script.py` use upsert logic, so they can be run multiple times without creating duplicates. + +2. **Large Portfolios:** If a company has thousands of deals, `hubspot_gather_all_deals.py` will take a while. Use `tqdm` progress indicators to monitor. + +3. **Error Handling:** `hubspot_update_script.py` has error handling per deal. `hubspot_company.py` and `hubspot_gather_all_deals.py` do not — any failure aborts the script. If interrupted, simply re-run. + +4. **Schedule:** `hubspot_update_script.py` is scheduled to run every 15 minutes during business hours (typically configured as a cron job or similar scheduler). + +5. **Photo Integrity:** The `hubspot_update_script.py` verifies downloaded photos using SHA-256 hashing before committing the S3 URL to the database. + +6. **Unused Fields:** The scripts populate `deals_to_add` and `deal_to_companies` dicts in `hubspot_gather_all_deals.py` but don't use them downstream. This is harmless but could be cleaned up in future refactors. + +--- + +## Troubleshooting + +| Issue | Likely Cause | Solution | +|-------|--------------|----------| +| "Company not found" error | Company enum not added or typo in name | Double-check `Companies` enum in `hubspotClient.py` | +| Deal count mismatch | Company wasn't added to the script's companies list | Ensure company is in `valuable_companies` / `companies` in all 3 scripts | +| Slow script execution | Large portfolio or network latency | Normal for first run; `hubspot_update_script.py` is faster on subsequent runs | +| Photos not uploading | Deal doesn't have `major_condition_issue_photos` property | Photos only upload if HubSpot deal has photos attached | +| S3 upload fails | Credentials or bucket issues | Check IAM permissions and bucket name (`retrofit-data-dev`) | +| Update script reports failures | Stale data or missing DB fields | Check error messages in summary report; may need to re-sync company | +