mirror of https://github.com/Hestia-Homes/survey-extraction.git synced 2026-06-30 13:10:56 +00:00

History

Jun-te Kim b51a96e7d4 fail gracefully		2026-03-20 11:53:01 +00:00
..
debug_one_address.py	fail gracefully	2026-03-20 11:53:01 +00:00
hubspot_company.py	updated script to do live tracking and folder generation better	2026-03-03 14:52:21 +00:00
hubspot_gather_all_deals.py	updated script to do live tracking and folder generation better	2026-03-03 14:52:21 +00:00
hubspot_update_script.py	updated script to do live tracking and folder generation better	2026-03-03 14:52:21 +00:00
README.md	save	2026-02-27 12:50:16 +00:00

README.md

HubSpot Client Scripts - Onboarding Guide

Overview

The scripts in this directory form a 3-stage ETL pipeline for syncing HubSpot data (companies and deals) into the local database:

Stage 0 (Seed): hubspot_company.py — Load company master records
Stage 1 (Bulk Load): hubspot_gather_all_deals.py — Initial load of all deals for a company
Stage 2 (Sync/Update): hubspot_update_script.py — Ongoing synchronization (runs on a schedule)

These scripts work together to keep your local database in sync with HubSpot while handling photo uploads to S3 and maintaining data integrity.

Onboarding a New Client

Follow these steps in order when adding a new company/client:

Step 1: Add the Company to the `Companies` Enum

Edit ../hubspotClient.py and add your new company to the Companies enum class:

class Companies(Enum):
    ABRI = "237615001799"
    SOUTHERN_HOUSING_GROUP = "109343619305"
    LIVEWEST = "86205872354"
    SURESERVE = "301745289413"
    HOMEGROUP = "94946071794"
    APPLE = "184769046716"
    THE_GUINESS_PARTNERSHIP = "86970043613"
    YOUR_NEW_COMPANY = "YOUR_HUBSPOT_COMPANY_ID"  # ← Add here

How to find your HubSpot Company ID:

Log into HubSpot
Navigate to Contacts → Companies
Click on the company name
The URL will be: https://app.hubspot.com/crm/xxx/objects/companies/COMPANY_ID — copy that ID

Step 2: Update Each Script to Include Your Company

After adding the enum, update the company lists in all three scripts:

`hubspot_company.py` (line ~6)

companies = [
    Companies.THE_GUINESS_PARTNERSHIP,
    Companies.YOUR_NEW_COMPANY  # ← Add here
]

`hubspot_gather_all_deals.py` (line ~7)

valuable_companies = [
    Companies.THE_GUINESS_PARTNERSHIP.value,
    Companies.YOUR_NEW_COMPANY.value  # ← Add here
]

`hubspot_update_script.py` (line ~12)

companies = [
    Companies.THE_GUINESS_PARTNERSHIP,
    Companies.YOUR_NEW_COMPANY  # ← Add here
]

Step 3: Run `hubspot_company.py` (One-time setup)

This script seeds the company record into the hubspot_company_data table. Run it once:

python etl/hubSpotClient/scripts/hubspot_company.py

What it does:

Connects to HubSpot and fetches company information (name, ID)
Inserts the company record into the local database

Output: You'll see the company added to hubspot_company_data table.

Step 3.5: Update Group ID in Database (Manual)

After the company record is created, you need to manually update the group ID for the new company. This is done via DBeaver or pgAdmin:

Steps:

Open DBeaver or pgAdmin and connect to the database
Navigate to the hubspot_company_data table
Find the row with your new company (search by company_name or company_id)
Edit the group_id column to the portfolio/group ID you want to track for this company
Save the changes

Example Query (if you prefer SQL):

UPDATE hubspot_company_data
SET group_id = 'YOUR_GROUP_ID'
WHERE company_id = 'YOUR_COMPANY_ID';

What is Group ID?

The group ID identifies which portfolio/group in your system this company belongs to
Each company can be associated with one group ID for tracking and organization
This field is used for tracking and reporting across your survey data

Step 4: Run `hubspot_gather_all_deals.py` (One-time bulk load)

This script performs the initial load of all deals for your company, filtered by the OPERATIONS_SOCIAL_HOUSING pipeline. Run it once per company:

python etl/hubSpotClient/scripts/hubspot_gather_all_deals.py

What it does:

Fetches all deal IDs associated with your company from HubSpot
For each deal, retrieves detailed properties:
- dealname, dealstage, pipeline, outcome, outcome_notes, project_code
- major_condition_issue_description, major_condition_issue_photos
- coordination_status__stage_1_, retrofit_design_status
Filters to only deals in the OPERATIONS_SOCIAL_HOUSING pipeline
Fetches the associated listing (UPRN, property IDs)
Inserts each deal into the hubspot_data table
Downloads photo evidence files and uploads them to S3 (bucket: retrofit-data-dev)

⚠️ Note: This script can take a long time if your company has many deals. It processes deals serially with progress reporting via tqdm.

Output: Deals appear in hubspot_data table; photos appear in S3 at s3://retrofit-data-dev/hubspot/awaabs_law_evidence/.

Step 5: `hubspot_update_script.py` (Automatic scheduling)

After the initial setup, no manual action is needed. This script runs automatically every 15 minutes during working hours as a scheduled job.

What it does:

Queries the local database for all stored deals for each company
Compares each deal's stored fields against the live HubSpot data (13 fields checked)
Updates the database if any values have changed in HubSpot
Uploads newly available photos to S3 (with SHA-256 integrity verification)
Prints a summary report of changes, updates, and any failures

Script Reference

`hubspot_company.py`

Stage: Seed (one-time setup) Frequency: Run once per new company Speed: Fast

Purpose: Load company master data into the database.

Database Output:

Table: hubspot_company_data
Fields: company_id, company_name

Code Flow:

For each company in config:
  1. Call HubSpot API: get_company_information(company_id)
  2. Insert record into hubspot_company_data table

`hubspot_gather_all_deals.py`

Stage: Bulk Load (one-time per company) Frequency: Run once per company (manually triggered) Speed: Slow (serial processing of all deals)

Purpose: Perform initial load of all deals for target companies.

Database Output:

Table: hubspot_data
Fields: deal_id, deal_name, company_id, stage, outcome, photos_s3_url, and others

S3 Output:

Bucket: retrofit-data-dev
Path: hubspot/awaabs_law_evidence/{filename}

Code Flow:

For each company in config:
  1. Fetch all deal IDs from HubSpot
  2. For each deal:
     a. Get deal properties from HubSpot
     b. Filter by OPERATIONS_SOCIAL_HOUSING pipeline
     c. Fetch associated listing data (UPRN, property IDs)
     d. Insert deal into hubspot_data table
     e. If photos exist: download from HubSpot URL, upload to S3, save S3 URL to DB
     f. Print progress: "Uploaded deal_id {id} to db"

Error Handling: None — script will abort on first error. Re-run to retry.

`hubspot_update_script.py`

Stage: Sync/Update (ongoing maintenance) Frequency: Every 15 minutes during working hours (automated schedule) Speed: Fast (only processes stored deals, compares, updates deltas)

Purpose: Keep database synchronized with live HubSpot data; handle new/updated photos.

Database Operations:

Reads: All deals from hubspot_data for each company
Writes: Updates only when fields differ from HubSpot
S3 Uploads: New or previously missing photos

Summary Report: After completion, prints a table of per-company statistics:

Company | Checked | Updated | Up-to-date | Failed

Plus detailed error messages for any failed updates.

Code Flow:

1. Initialize HubSpot client (warm-up: get_deal_stages)
2. For each company:
   a. Query DB for all deals with company_id
   b. For each deal:
      - Fetch live deal data from HubSpot
      - Compare 13 fields: deal_id, company_id, landlord_property_id, outcome,
                          dealstage, dealname, project_code, uprn,
                          outcome_notes, major_condition_issue_description,
                          major_condition_issue_photos, coordination_status,
                          design_status
      - If any field differs: call upsert_hubspot_deal() to update DB
      - If photos exist in HubSpot but not yet in S3:
        * Download file from HubSpot URL
        * Upload to S3
        * Verify SHA-256 hash integrity
        * Save S3 URL back to DB
      - Collect success/failure counts
   c. Print per-company summary
3. Print all failures (if any) with error messages

Error Handling: Wrapped in try/except per deal. Failures are logged, and the script continues to the next deal.

Common Tasks

I added a new company but deals aren't showing up

Checklist:

Company added to Companies enum in hubspotClient.py
Company added to the companies list in all three scripts
Ran hubspot_company.py successfully
Ran hubspot_gather_all_deals.py and watched for "Uploaded deal_id" messages
Check database: SELECT COUNT(*) FROM hubspot_data WHERE company_id = 'YOUR_ID'
Check HubSpot: Does the company have any deals in the OPERATIONS_SOCIAL_HOUSING pipeline?

Deals exist in HubSpot but aren't syncing

The hubspot_gather_all_deals.py script only loads deals in the OPERATIONS_SOCIAL_HOUSING pipeline. If deals are in a different pipeline, they won't be loaded. Check the deal's pipeline in HubSpot.

Photos aren't uploading

First run of hubspot_gather_all_deals.py should upload photos at import time
Subsequent runs of hubspot_update_script.py will upload newly available photos
Check S3 bucket retrofit-data-dev under hubspot/awaabs_law_evidence/
Check DB field major_condition_issue_photos (photo S3 URL is stored here)

I need to re-sync everything for a company

Clear the deals from the database:

DELETE FROM hubspot_data WHERE company_id = 'YOUR_COMPANY_ID';

Clear the company:

DELETE FROM hubspot_company_data WHERE company_id = 'YOUR_COMPANY_ID';

Re-run from Step 3 above (run hubspot_company.py, then hubspot_gather_all_deals.py)

Dependencies

All scripts depend on:

HubSpotClient from ../hubspotClient.py — Handles HubSpot API calls
HubspotTodb from ../../db/hubSpotLoad.py — Handles database operations (insert/upsert/query)
tqdm — Progress bars
Python requests — HTTP downloads for photo files

Environment Requirements:

Valid HubSpot API token (configured in HubSpotClient.__init__())
Database connection (configured in HubspotTodb)
S3 credentials (for photo uploads)
Network access to HubSpot API and S3

Notes & Tips

Idempotency: hubspot_gather_all_deals.py and hubspot_update_script.py use upsert logic, so they can be run multiple times without creating duplicates.
Large Portfolios: If a company has thousands of deals, hubspot_gather_all_deals.py will take a while. Use tqdm progress indicators to monitor.
Error Handling: hubspot_update_script.py has error handling per deal. hubspot_company.py and hubspot_gather_all_deals.py do not — any failure aborts the script. If interrupted, simply re-run.
Schedule: hubspot_update_script.py is scheduled to run every 15 minutes during business hours (typically configured as a cron job or similar scheduler).
Photo Integrity: The hubspot_update_script.py verifies downloaded photos using SHA-256 hashing before committing the S3 URL to the database.
Unused Fields: The scripts populate deals_to_add and deal_to_companies dicts in hubspot_gather_all_deals.py but don't use them downstream. This is harmless but could be cleaned up in future refactors.

Troubleshooting

Issue	Likely Cause	Solution
"Company not found" error	Company enum not added or typo in name	Double-check `Companies` enum in `hubspotClient.py`
Deal count mismatch	Company wasn't added to the script's companies list	Ensure company is in `valuable_companies` / `companies` in all 3 scripts
Slow script execution	Large portfolio or network latency	Normal for first run; `hubspot_update_script.py` is faster on subsequent runs
Photos not uploading	Deal doesn't have `major_condition_issue_photos` property	Photos only upload if HubSpot deal has photos attached
S3 upload fails	Credentials or bucket issues	Check IAM permissions and bucket name (`retrofit-data-dev`)
Update script reports failures	Stale data or missing DB fields	Check error messages in summary report; may need to re-sync company

README.md

HubSpot Client Scripts - Onboarding Guide

Overview

Onboarding a New Client

Step 1: Add the Company to the Companies Enum

Step 2: Update Each Script to Include Your Company

hubspot_company.py (line ~6)

hubspot_gather_all_deals.py (line ~7)

hubspot_update_script.py (line ~12)

Step 3: Run hubspot_company.py (One-time setup)

Step 3.5: Update Group ID in Database (Manual)

Step 4: Run hubspot_gather_all_deals.py (One-time bulk load)

Step 5: hubspot_update_script.py (Automatic scheduling)

Script Reference

hubspot_company.py

hubspot_gather_all_deals.py

hubspot_update_script.py

Common Tasks

I added a new company but deals aren't showing up

Deals exist in HubSpot but aren't syncing

Photos aren't uploading

I need to re-sync everything for a company

Dependencies

Notes & Tips

Troubleshooting

Step 1: Add the Company to the `Companies` Enum

`hubspot_company.py` (line ~6)

`hubspot_gather_all_deals.py` (line ~7)

`hubspot_update_script.py` (line ~12)

Step 3: Run `hubspot_company.py` (One-time setup)

Step 4: Run `hubspot_gather_all_deals.py` (One-time bulk load)

Step 5: `hubspot_update_script.py` (Automatic scheduling)

`hubspot_company.py`

`hubspot_gather_all_deals.py`

`hubspot_update_script.py`