survey-extraction/etl/hubSpotClient/scripts
2026-03-20 11:53:01 +00:00
..
debug_one_address.py fail gracefully 2026-03-20 11:53:01 +00:00
hubspot_company.py updated script to do live tracking and folder generation better 2026-03-03 14:52:21 +00:00
hubspot_gather_all_deals.py updated script to do live tracking and folder generation better 2026-03-03 14:52:21 +00:00
hubspot_update_script.py updated script to do live tracking and folder generation better 2026-03-03 14:52:21 +00:00
README.md save 2026-02-27 12:50:16 +00:00

HubSpot Client Scripts - Onboarding Guide

Overview

The scripts in this directory form a 3-stage ETL pipeline for syncing HubSpot data (companies and deals) into the local database:

  1. Stage 0 (Seed): hubspot_company.py — Load company master records
  2. Stage 1 (Bulk Load): hubspot_gather_all_deals.py — Initial load of all deals for a company
  3. Stage 2 (Sync/Update): hubspot_update_script.py — Ongoing synchronization (runs on a schedule)

These scripts work together to keep your local database in sync with HubSpot while handling photo uploads to S3 and maintaining data integrity.


Onboarding a New Client

Follow these steps in order when adding a new company/client:

Step 1: Add the Company to the Companies Enum

Edit ../hubspotClient.py and add your new company to the Companies enum class:

class Companies(Enum):
    ABRI = "237615001799"
    SOUTHERN_HOUSING_GROUP = "109343619305"
    LIVEWEST = "86205872354"
    SURESERVE = "301745289413"
    HOMEGROUP = "94946071794"
    APPLE = "184769046716"
    THE_GUINESS_PARTNERSHIP = "86970043613"
    YOUR_NEW_COMPANY = "YOUR_HUBSPOT_COMPANY_ID"  # ← Add here

How to find your HubSpot Company ID:

  • Log into HubSpot
  • Navigate to ContactsCompanies
  • Click on the company name
  • The URL will be: https://app.hubspot.com/crm/xxx/objects/companies/COMPANY_ID — copy that ID

Step 2: Update Each Script to Include Your Company

After adding the enum, update the company lists in all three scripts:

hubspot_company.py (line ~6)

companies = [
    Companies.THE_GUINESS_PARTNERSHIP,
    Companies.YOUR_NEW_COMPANY  # ← Add here
]

hubspot_gather_all_deals.py (line ~7)

valuable_companies = [
    Companies.THE_GUINESS_PARTNERSHIP.value,
    Companies.YOUR_NEW_COMPANY.value  # ← Add here
]

hubspot_update_script.py (line ~12)

companies = [
    Companies.THE_GUINESS_PARTNERSHIP,
    Companies.YOUR_NEW_COMPANY  # ← Add here
]

Step 3: Run hubspot_company.py (One-time setup)

This script seeds the company record into the hubspot_company_data table. Run it once:

python etl/hubSpotClient/scripts/hubspot_company.py

What it does:

  • Connects to HubSpot and fetches company information (name, ID)
  • Inserts the company record into the local database

Output: You'll see the company added to hubspot_company_data table.

Step 3.5: Update Group ID in Database (Manual)

After the company record is created, you need to manually update the group ID for the new company. This is done via DBeaver or pgAdmin:

Steps:

  1. Open DBeaver or pgAdmin and connect to the database
  2. Navigate to the hubspot_company_data table
  3. Find the row with your new company (search by company_name or company_id)
  4. Edit the group_id column to the portfolio/group ID you want to track for this company
  5. Save the changes

Example Query (if you prefer SQL):

UPDATE hubspot_company_data
SET group_id = 'YOUR_GROUP_ID'
WHERE company_id = 'YOUR_COMPANY_ID';

What is Group ID?

  • The group ID identifies which portfolio/group in your system this company belongs to
  • Each company can be associated with one group ID for tracking and organization
  • This field is used for tracking and reporting across your survey data

Step 4: Run hubspot_gather_all_deals.py (One-time bulk load)

This script performs the initial load of all deals for your company, filtered by the OPERATIONS_SOCIAL_HOUSING pipeline. Run it once per company:

python etl/hubSpotClient/scripts/hubspot_gather_all_deals.py

What it does:

  • Fetches all deal IDs associated with your company from HubSpot
  • For each deal, retrieves detailed properties:
    • dealname, dealstage, pipeline, outcome, outcome_notes, project_code
    • major_condition_issue_description, major_condition_issue_photos
    • coordination_status__stage_1_, retrofit_design_status
  • Filters to only deals in the OPERATIONS_SOCIAL_HOUSING pipeline
  • Fetches the associated listing (UPRN, property IDs)
  • Inserts each deal into the hubspot_data table
  • Downloads photo evidence files and uploads them to S3 (bucket: retrofit-data-dev)

⚠️ Note: This script can take a long time if your company has many deals. It processes deals serially with progress reporting via tqdm.

Output: Deals appear in hubspot_data table; photos appear in S3 at s3://retrofit-data-dev/hubspot/awaabs_law_evidence/.

Step 5: hubspot_update_script.py (Automatic scheduling)

After the initial setup, no manual action is needed. This script runs automatically every 15 minutes during working hours as a scheduled job.

What it does:

  • Queries the local database for all stored deals for each company
  • Compares each deal's stored fields against the live HubSpot data (13 fields checked)
  • Updates the database if any values have changed in HubSpot
  • Uploads newly available photos to S3 (with SHA-256 integrity verification)
  • Prints a summary report of changes, updates, and any failures

Script Reference

hubspot_company.py

Stage: Seed (one-time setup) Frequency: Run once per new company Speed: Fast

Purpose: Load company master data into the database.

Database Output:

  • Table: hubspot_company_data
  • Fields: company_id, company_name

Code Flow:

For each company in config:
  1. Call HubSpot API: get_company_information(company_id)
  2. Insert record into hubspot_company_data table

hubspot_gather_all_deals.py

Stage: Bulk Load (one-time per company) Frequency: Run once per company (manually triggered) Speed: Slow (serial processing of all deals)

Purpose: Perform initial load of all deals for target companies.

Database Output:

  • Table: hubspot_data
  • Fields: deal_id, deal_name, company_id, stage, outcome, photos_s3_url, and others

S3 Output:

  • Bucket: retrofit-data-dev
  • Path: hubspot/awaabs_law_evidence/{filename}

Code Flow:

For each company in config:
  1. Fetch all deal IDs from HubSpot
  2. For each deal:
     a. Get deal properties from HubSpot
     b. Filter by OPERATIONS_SOCIAL_HOUSING pipeline
     c. Fetch associated listing data (UPRN, property IDs)
     d. Insert deal into hubspot_data table
     e. If photos exist: download from HubSpot URL, upload to S3, save S3 URL to DB
     f. Print progress: "Uploaded deal_id {id} to db"

Error Handling: None — script will abort on first error. Re-run to retry.


hubspot_update_script.py

Stage: Sync/Update (ongoing maintenance) Frequency: Every 15 minutes during working hours (automated schedule) Speed: Fast (only processes stored deals, compares, updates deltas)

Purpose: Keep database synchronized with live HubSpot data; handle new/updated photos.

Database Operations:

  • Reads: All deals from hubspot_data for each company
  • Writes: Updates only when fields differ from HubSpot
  • S3 Uploads: New or previously missing photos

Summary Report: After completion, prints a table of per-company statistics:

Company | Checked | Updated | Up-to-date | Failed

Plus detailed error messages for any failed updates.

Code Flow:

1. Initialize HubSpot client (warm-up: get_deal_stages)
2. For each company:
   a. Query DB for all deals with company_id
   b. For each deal:
      - Fetch live deal data from HubSpot
      - Compare 13 fields: deal_id, company_id, landlord_property_id, outcome,
                          dealstage, dealname, project_code, uprn,
                          outcome_notes, major_condition_issue_description,
                          major_condition_issue_photos, coordination_status,
                          design_status
      - If any field differs: call upsert_hubspot_deal() to update DB
      - If photos exist in HubSpot but not yet in S3:
        * Download file from HubSpot URL
        * Upload to S3
        * Verify SHA-256 hash integrity
        * Save S3 URL back to DB
      - Collect success/failure counts
   c. Print per-company summary
3. Print all failures (if any) with error messages

Error Handling: Wrapped in try/except per deal. Failures are logged, and the script continues to the next deal.


Common Tasks

I added a new company but deals aren't showing up

Checklist:

  • Company added to Companies enum in hubspotClient.py
  • Company added to the companies list in all three scripts
  • Ran hubspot_company.py successfully
  • Ran hubspot_gather_all_deals.py and watched for "Uploaded deal_id" messages
  • Check database: SELECT COUNT(*) FROM hubspot_data WHERE company_id = 'YOUR_ID'
  • Check HubSpot: Does the company have any deals in the OPERATIONS_SOCIAL_HOUSING pipeline?

Deals exist in HubSpot but aren't syncing

The hubspot_gather_all_deals.py script only loads deals in the OPERATIONS_SOCIAL_HOUSING pipeline. If deals are in a different pipeline, they won't be loaded. Check the deal's pipeline in HubSpot.

Photos aren't uploading

  • First run of hubspot_gather_all_deals.py should upload photos at import time
  • Subsequent runs of hubspot_update_script.py will upload newly available photos
  • Check S3 bucket retrofit-data-dev under hubspot/awaabs_law_evidence/
  • Check DB field major_condition_issue_photos (photo S3 URL is stored here)

I need to re-sync everything for a company

  1. Clear the deals from the database:
    DELETE FROM hubspot_data WHERE company_id = 'YOUR_COMPANY_ID';
    
  2. Clear the company:
    DELETE FROM hubspot_company_data WHERE company_id = 'YOUR_COMPANY_ID';
    
  3. Re-run from Step 3 above (run hubspot_company.py, then hubspot_gather_all_deals.py)

Dependencies

All scripts depend on:

  • HubSpotClient from ../hubspotClient.py — Handles HubSpot API calls
  • HubspotTodb from ../../db/hubSpotLoad.py — Handles database operations (insert/upsert/query)
  • tqdm — Progress bars
  • Python requests — HTTP downloads for photo files

Environment Requirements:

  • Valid HubSpot API token (configured in HubSpotClient.__init__())
  • Database connection (configured in HubspotTodb)
  • S3 credentials (for photo uploads)
  • Network access to HubSpot API and S3

Notes & Tips

  1. Idempotency: hubspot_gather_all_deals.py and hubspot_update_script.py use upsert logic, so they can be run multiple times without creating duplicates.

  2. Large Portfolios: If a company has thousands of deals, hubspot_gather_all_deals.py will take a while. Use tqdm progress indicators to monitor.

  3. Error Handling: hubspot_update_script.py has error handling per deal. hubspot_company.py and hubspot_gather_all_deals.py do not — any failure aborts the script. If interrupted, simply re-run.

  4. Schedule: hubspot_update_script.py is scheduled to run every 15 minutes during business hours (typically configured as a cron job or similar scheduler).

  5. Photo Integrity: The hubspot_update_script.py verifies downloaded photos using SHA-256 hashing before committing the S3 URL to the database.

  6. Unused Fields: The scripts populate deals_to_add and deal_to_companies dicts in hubspot_gather_all_deals.py but don't use them downstream. This is harmless but could be cleaned up in future refactors.


Troubleshooting

Issue Likely Cause Solution
"Company not found" error Company enum not added or typo in name Double-check Companies enum in hubspotClient.py
Deal count mismatch Company wasn't added to the script's companies list Ensure company is in valuable_companies / companies in all 3 scripts
Slow script execution Large portfolio or network latency Normal for first run; hubspot_update_script.py is faster on subsequent runs
Photos not uploading Deal doesn't have major_condition_issue_photos property Photos only upload if HubSpot deal has photos attached
S3 upload fails Credentials or bucket issues Check IAM permissions and bucket name (retrofit-data-dev)
Update script reports failures Stale data or missing DB fields Check error messages in summary report; may need to re-sync company