12 KiB
HubSpot Client Scripts - Onboarding Guide
Overview
The scripts in this directory form a 3-stage ETL pipeline for syncing HubSpot data (companies and deals) into the local database:
- Stage 0 (Seed):
hubspot_company.py— Load company master records - Stage 1 (Bulk Load):
hubspot_gather_all_deals.py— Initial load of all deals for a company - Stage 2 (Sync/Update):
hubspot_update_script.py— Ongoing synchronization (runs on a schedule)
These scripts work together to keep your local database in sync with HubSpot while handling photo uploads to S3 and maintaining data integrity.
Onboarding a New Client
Follow these steps in order when adding a new company/client:
Step 1: Add the Company to the Companies Enum
Edit ../hubspotClient.py and add your new company to the Companies enum class:
class Companies(Enum):
ABRI = "237615001799"
SOUTHERN_HOUSING_GROUP = "109343619305"
LIVEWEST = "86205872354"
SURESERVE = "301745289413"
HOMEGROUP = "94946071794"
APPLE = "184769046716"
THE_GUINESS_PARTNERSHIP = "86970043613"
YOUR_NEW_COMPANY = "YOUR_HUBSPOT_COMPANY_ID" # ← Add here
How to find your HubSpot Company ID:
- Log into HubSpot
- Navigate to Contacts → Companies
- Click on the company name
- The URL will be:
https://app.hubspot.com/crm/xxx/objects/companies/COMPANY_ID— copy that ID
Step 2: Update Each Script to Include Your Company
After adding the enum, update the company lists in all three scripts:
hubspot_company.py (line ~6)
companies = [
Companies.THE_GUINESS_PARTNERSHIP,
Companies.YOUR_NEW_COMPANY # ← Add here
]
hubspot_gather_all_deals.py (line ~7)
valuable_companies = [
Companies.THE_GUINESS_PARTNERSHIP.value,
Companies.YOUR_NEW_COMPANY.value # ← Add here
]
hubspot_update_script.py (line ~12)
companies = [
Companies.THE_GUINESS_PARTNERSHIP,
Companies.YOUR_NEW_COMPANY # ← Add here
]
Step 3: Run hubspot_company.py (One-time setup)
This script seeds the company record into the hubspot_company_data table. Run it once:
python etl/hubSpotClient/scripts/hubspot_company.py
What it does:
- Connects to HubSpot and fetches company information (name, ID)
- Inserts the company record into the local database
Output: You'll see the company added to hubspot_company_data table.
Step 3.5: Update Group ID in Database (Manual)
After the company record is created, you need to manually update the group ID for the new company. This is done via DBeaver or pgAdmin:
Steps:
- Open DBeaver or pgAdmin and connect to the database
- Navigate to the
hubspot_company_datatable - Find the row with your new company (search by
company_nameorcompany_id) - Edit the
group_idcolumn to the portfolio/group ID you want to track for this company - Save the changes
Example Query (if you prefer SQL):
UPDATE hubspot_company_data
SET group_id = 'YOUR_GROUP_ID'
WHERE company_id = 'YOUR_COMPANY_ID';
What is Group ID?
- The group ID identifies which portfolio/group in your system this company belongs to
- Each company can be associated with one group ID for tracking and organization
- This field is used for tracking and reporting across your survey data
Step 4: Run hubspot_gather_all_deals.py (One-time bulk load)
This script performs the initial load of all deals for your company, filtered by the OPERATIONS_SOCIAL_HOUSING pipeline. Run it once per company:
python etl/hubSpotClient/scripts/hubspot_gather_all_deals.py
What it does:
- Fetches all deal IDs associated with your company from HubSpot
- For each deal, retrieves detailed properties:
dealname,dealstage,pipeline,outcome,outcome_notes,project_codemajor_condition_issue_description,major_condition_issue_photoscoordination_status__stage_1_,retrofit_design_status
- Filters to only deals in the
OPERATIONS_SOCIAL_HOUSINGpipeline - Fetches the associated listing (UPRN, property IDs)
- Inserts each deal into the
hubspot_datatable - Downloads photo evidence files and uploads them to S3 (bucket:
retrofit-data-dev)
⚠️ Note: This script can take a long time if your company has many deals. It processes deals serially with progress reporting via tqdm.
Output: Deals appear in hubspot_data table; photos appear in S3 at s3://retrofit-data-dev/hubspot/awaabs_law_evidence/.
Step 5: hubspot_update_script.py (Automatic scheduling)
After the initial setup, no manual action is needed. This script runs automatically every 15 minutes during working hours as a scheduled job.
What it does:
- Queries the local database for all stored deals for each company
- Compares each deal's stored fields against the live HubSpot data (13 fields checked)
- Updates the database if any values have changed in HubSpot
- Uploads newly available photos to S3 (with SHA-256 integrity verification)
- Prints a summary report of changes, updates, and any failures
Script Reference
hubspot_company.py
Stage: Seed (one-time setup) Frequency: Run once per new company Speed: Fast
Purpose: Load company master data into the database.
Database Output:
- Table:
hubspot_company_data - Fields:
company_id,company_name
Code Flow:
For each company in config:
1. Call HubSpot API: get_company_information(company_id)
2. Insert record into hubspot_company_data table
hubspot_gather_all_deals.py
Stage: Bulk Load (one-time per company) Frequency: Run once per company (manually triggered) Speed: Slow (serial processing of all deals)
Purpose: Perform initial load of all deals for target companies.
Database Output:
- Table:
hubspot_data - Fields:
deal_id,deal_name,company_id,stage,outcome,photos_s3_url, and others
S3 Output:
- Bucket:
retrofit-data-dev - Path:
hubspot/awaabs_law_evidence/{filename}
Code Flow:
For each company in config:
1. Fetch all deal IDs from HubSpot
2. For each deal:
a. Get deal properties from HubSpot
b. Filter by OPERATIONS_SOCIAL_HOUSING pipeline
c. Fetch associated listing data (UPRN, property IDs)
d. Insert deal into hubspot_data table
e. If photos exist: download from HubSpot URL, upload to S3, save S3 URL to DB
f. Print progress: "Uploaded deal_id {id} to db"
Error Handling: None — script will abort on first error. Re-run to retry.
hubspot_update_script.py
Stage: Sync/Update (ongoing maintenance) Frequency: Every 15 minutes during working hours (automated schedule) Speed: Fast (only processes stored deals, compares, updates deltas)
Purpose: Keep database synchronized with live HubSpot data; handle new/updated photos.
Database Operations:
- Reads: All deals from
hubspot_datafor each company - Writes: Updates only when fields differ from HubSpot
- S3 Uploads: New or previously missing photos
Summary Report: After completion, prints a table of per-company statistics:
Company | Checked | Updated | Up-to-date | Failed
Plus detailed error messages for any failed updates.
Code Flow:
1. Initialize HubSpot client (warm-up: get_deal_stages)
2. For each company:
a. Query DB for all deals with company_id
b. For each deal:
- Fetch live deal data from HubSpot
- Compare 13 fields: deal_id, company_id, landlord_property_id, outcome,
dealstage, dealname, project_code, uprn,
outcome_notes, major_condition_issue_description,
major_condition_issue_photos, coordination_status,
design_status
- If any field differs: call upsert_hubspot_deal() to update DB
- If photos exist in HubSpot but not yet in S3:
* Download file from HubSpot URL
* Upload to S3
* Verify SHA-256 hash integrity
* Save S3 URL back to DB
- Collect success/failure counts
c. Print per-company summary
3. Print all failures (if any) with error messages
Error Handling: Wrapped in try/except per deal. Failures are logged, and the script continues to the next deal.
Common Tasks
I added a new company but deals aren't showing up
Checklist:
- Company added to
Companiesenum inhubspotClient.py - Company added to the
companieslist in all three scripts - Ran
hubspot_company.pysuccessfully - Ran
hubspot_gather_all_deals.pyand watched for "Uploaded deal_id" messages - Check database:
SELECT COUNT(*) FROM hubspot_data WHERE company_id = 'YOUR_ID' - Check HubSpot: Does the company have any deals in the OPERATIONS_SOCIAL_HOUSING pipeline?
Deals exist in HubSpot but aren't syncing
The hubspot_gather_all_deals.py script only loads deals in the OPERATIONS_SOCIAL_HOUSING pipeline. If deals are in a different pipeline, they won't be loaded. Check the deal's pipeline in HubSpot.
Photos aren't uploading
- First run of
hubspot_gather_all_deals.pyshould upload photos at import time - Subsequent runs of
hubspot_update_script.pywill upload newly available photos - Check S3 bucket
retrofit-data-devunderhubspot/awaabs_law_evidence/ - Check DB field
major_condition_issue_photos(photo S3 URL is stored here)
I need to re-sync everything for a company
- Clear the deals from the database:
DELETE FROM hubspot_data WHERE company_id = 'YOUR_COMPANY_ID'; - Clear the company:
DELETE FROM hubspot_company_data WHERE company_id = 'YOUR_COMPANY_ID'; - Re-run from Step 3 above (run
hubspot_company.py, thenhubspot_gather_all_deals.py)
Dependencies
All scripts depend on:
HubSpotClientfrom../hubspotClient.py— Handles HubSpot API callsHubspotTodbfrom../../db/hubSpotLoad.py— Handles database operations (insert/upsert/query)tqdm— Progress bars- Python
requests— HTTP downloads for photo files
Environment Requirements:
- Valid HubSpot API token (configured in
HubSpotClient.__init__()) - Database connection (configured in
HubspotTodb) - S3 credentials (for photo uploads)
- Network access to HubSpot API and S3
Notes & Tips
-
Idempotency:
hubspot_gather_all_deals.pyandhubspot_update_script.pyuse upsert logic, so they can be run multiple times without creating duplicates. -
Large Portfolios: If a company has thousands of deals,
hubspot_gather_all_deals.pywill take a while. Usetqdmprogress indicators to monitor. -
Error Handling:
hubspot_update_script.pyhas error handling per deal.hubspot_company.pyandhubspot_gather_all_deals.pydo not — any failure aborts the script. If interrupted, simply re-run. -
Schedule:
hubspot_update_script.pyis scheduled to run every 15 minutes during business hours (typically configured as a cron job or similar scheduler). -
Photo Integrity: The
hubspot_update_script.pyverifies downloaded photos using SHA-256 hashing before committing the S3 URL to the database. -
Unused Fields: The scripts populate
deals_to_addanddeal_to_companiesdicts inhubspot_gather_all_deals.pybut don't use them downstream. This is harmless but could be cleaned up in future refactors.
Troubleshooting
| Issue | Likely Cause | Solution |
|---|---|---|
| "Company not found" error | Company enum not added or typo in name | Double-check Companies enum in hubspotClient.py |
| Deal count mismatch | Company wasn't added to the script's companies list | Ensure company is in valuable_companies / companies in all 3 scripts |
| Slow script execution | Large portfolio or network latency | Normal for first run; hubspot_update_script.py is faster on subsequent runs |
| Photos not uploading | Deal doesn't have major_condition_issue_photos property |
Photos only upload if HubSpot deal has photos attached |
| S3 upload fails | Credentials or bucket issues | Check IAM permissions and bucket name (retrofit-data-dev) |
| Update script reports failures | Stale data or missing DB fields | Check error messages in summary report; may need to re-sync company |