Instagram Datasets for Machine Learning
Training machine learning models on Instagram data usually starts with the wrong question: "How do I scrape enough data?" The better question is "Where can I buy or license pre-built Instagram datasets that match my feature schema?" — because scraping 10M+ records yourself takes weeks of engineering, thousands of dollars in proxies, and constant upkeep, while pre-built datasets ship as a single CSV download.
This guide covers what Instagram datasets typically look like, what to evaluate when buying, and how DataLikers and competing providers compare.
What's in a Typical Instagram Dataset
Most Instagram datasets break down into three core tables:
- Profiles — username, full name, bio, follower / following / post counts, verification, business contact fields, external URL, profile image URL.
- Posts — post ID / shortcode, author user ID, caption, media URL, taken-at timestamp, like count, comment count, hashtags, mentions, location, sponsored flag.
- Comments — comment ID, post ID, author user ID, text, taken-at timestamp, like count, parent comment ID (for replies).
Higher-quality datasets add:
- Reels metrics — play count, share count, music ID.
- Engagement scores — derived metrics from raw counts (engagement rate, follower quality).
- Demographics — aggregated geography, age, gender (derived from profile / comment analysis).
- Face embeddings — visual similarity vectors for deduplication or creator clustering.
Make sure the schema matches your feature needs before buying — re-deriving fields from raw data is where dataset projects get slow.
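One way to make that check concrete is to compare the header of a vendor sample against the columns your features need. The sketch below assumes hypothetical column names (`follower_count`, `media_count`, `taken_at`, and so on); real vendors name fields differently, so substitute the names from the data dictionary you are given.

```python
import csv
import io

# Hypothetical column names; replace with the ones in your vendor's data dictionary.
REQUIRED_COLUMNS = {
    "profiles": {"username", "follower_count", "media_count", "biography"},
    "posts": {"shortcode", "author_user_id", "caption", "taken_at", "like_count"},
}


def missing_columns(table, header):
    """Return the required columns for `table` that are absent from `header`, sorted."""
    return sorted(REQUIRED_COLUMNS[table] - set(header))


# Stand-in for the first lines of a vendor sample file.
sample = io.StringIO("username,follower_count,biography\nnatgeo,1000,Explore\n")
header = next(csv.reader(sample))
print(missing_columns("profiles", header))
```

An empty list means every required column is at least present in the file. It says nothing about whether the values are populated, which is a separate check.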
Common ML Use Cases
1. Creator recommendation / audience matching
You want to find creators whose audience overlaps with a target brand. Features needed: profile fields, post topics (from caption NLP), follower overlap hints.
2. Influencer authenticity / fraud detection
You want to detect fake-follower accounts. Features needed: follower-to-engagement ratios, comment-to-like ratios, follower demographic diversity, posting cadence.
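As a rough illustration of the ratio features above, the sketch below derives two of them from per-account aggregates. The field names and the engagement-rate definition (average likes plus comments, divided by followers) are assumptions made for this example; definitions vary between teams and vendors.

```python
from dataclasses import dataclass


@dataclass
class AccountStats:
    follower_count: int
    avg_likes: float     # mean likes per post over a recent window
    avg_comments: float  # mean comments per post over the same window


def ratio_features(stats):
    """Return simple ratio features; None where the denominator is zero."""
    engagement = stats.avg_likes + stats.avg_comments
    return {
        "engagement_rate": engagement / stats.follower_count if stats.follower_count else None,
        "comment_to_like": stats.avg_comments / stats.avg_likes if stats.avg_likes else None,
    }


features = ratio_features(AccountStats(follower_count=100_000, avg_likes=500.0, avg_comments=25.0))
print(features)
```

Returning None rather than zero for empty denominators keeps "no data" distinguishable from "no engagement" when the features reach a model.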
3. Content performance prediction
Predict which posts will go viral from the first hour of engagement. Features needed: historical post data at minute-level granularity (harder to source), caption text, hashtags, creator history.
4. Trend detection
Find emerging hashtags or topics. Features needed: time-series hashtag post counts, caption NLP.
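A minimal sketch of the time-series side, assuming each post has already been reduced to a posting date and a list of hashtags (both hypothetical shapes for this example). It counts posts per day and hashtag, then compares two days as a crude growth signal. A production pipeline would likely smooth over longer windows.

```python
from collections import Counter
from datetime import date


def daily_hashtag_counts(posts):
    """Count posts per (day, hashtag). `posts` is an iterable of (date, [hashtags])."""
    counts = Counter()
    for day, hashtags in posts:
        # Lowercase first, then deduplicate, so each tag counts once per post.
        for tag in {t.lower() for t in hashtags}:
            counts[(day, tag)] += 1
    return counts


def growth(counts, tag, earlier, later):
    """Ratio of post counts between two days; None if the earlier day has none."""
    base = counts[(earlier, tag)]
    return counts[(later, tag)] / base if base else None


posts = [
    (date(2024, 1, 1), ["travel", "food"]),
    (date(2024, 1, 2), ["travel"]),
    (date(2024, 1, 2), ["Travel", "travel"]),
    (date(2024, 1, 2), ["food", "travel"]),
]
counts = daily_hashtag_counts(posts)
print(growth(counts, "travel", date(2024, 1, 1), date(2024, 1, 2)))
```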
5. Brand / product discovery from images
Identify branded products in Instagram posts. Features needed: post images + computer-vision pipeline. Face / logo embeddings help.
Each use case has a different minimum viable dataset — the first rule is matching your use case to the right dataset shape, not buying the biggest one.
Buying vs. Building
Scraping Instagram at dataset scale from scratch typically involves:
- Residential proxies ($500–$2,000/mo during build, lower after)
- Signed-request maintenance (signature algorithms change; plan for ongoing engineer time)
- Session pool management (fresh authenticated sessions, rotated on challenges)
- Several weeks of ramp time before the first usable full crawl
- Ongoing monitoring — one missed Instagram update can break the pipeline overnight
At mid-scale (1M–10M records), buying is almost always cheaper and faster. At very high scale (100M+ records), you're usually better off contracting with a provider who already has the crawl infrastructure running, whether that's a dataset purchase or a custom engagement.
Dataset Providers
| Provider | Starting price | Volume | Format | Best For |
|---|---|---|---|---|
| DataLikers | Per-dataset pricing | Curated per use case | CSV | ML training, market analysis, custom shapes |
| Bright Data | $0.0025/record, $250 min | 970M+ Instagram records | CSV, JSON | Enterprise-scale snapshots, compliance |
| Apify Datastore | Per-dataset | Varies by actor | CSV, JSON | Bundled with scraper purchase |
| ScrapeCreators | $47–$497 one-time | Varies | CSV | One-time purchase, 20+ platforms |
| Scrape.do / RapidAPI providers | Varies | Small | API-delivered | Small exports |
DataLikers offers curated datasets sized to your use case plus a custom Dataset Builder for specific schemas. If you need 500K tech-niche creators with bio + follower count + last-30-day posts, you get exactly that schema — not 970M records you have to filter down.
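To put the per-record pricing in the table into perspective, here is a small arithmetic sketch. It uses the $0.0025/record and $250 figures from the Bright Data row and assumes the $250 minimum acts as a floor on the order price, which is an interpretation rather than something the table states.

```python
from decimal import Decimal

PRICE_PER_RECORD = Decimal("0.0025")  # from the provider table
MINIMUM_ORDER = Decimal("250")        # assumed to act as a price floor


def snapshot_cost(records):
    """Estimated cost in USD for a per-record snapshot under the assumed floor."""
    return max(PRICE_PER_RECORD * records, MINIMUM_ORDER)


for n in (50_000, 500_000, 10_000_000):
    print(n, snapshot_cost(n))
```

Under that assumption the floor stops mattering at 100,000 records, and a 500K-record pull costs about $1,250 before any filtering work.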
Example: Pulling a Custom Dataset via API
You can also compose a custom dataset programmatically through the Cache API. The example below writes a few profile fields for a list of usernames to a CSV file:
    import csv

    import requests

    KEY = 'YOUR_DL_KEY'
    # The source list is elided; extend it with the usernames you need.
    creators = ['natgeo', 'nasa', 'blueoriginals', 'patagonia']  # ...

    with open('creators.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['username', 'followers', 'posts', 'bio'])
        writer.writeheader()
        for username in creators:
            # Fetch one profile per request from the Cache API.
            response = requests.get(
                'https://api.datalikers.com/v1/user/by/username',
                params={'username': username},
                headers={'x-access-key': KEY},
                timeout=30,
            )
            response.raise_for_status()  # stop on HTTP errors instead of writing empty rows
            profile = response.json()
            writer.writerow({
                'username': username,
                'followers': profile.get('follower_count'),
                'posts': profile.get('media_count'),
                'bio': profile.get('biography'),
            })
For larger lists (tens of thousands of profiles), contact DataLikers for a pre-built dataset — typically cheaper than paginated API calls at that volume.
Evaluating a Dataset Before You Buy
- Sample first. Ask for a 1,000-row sample. Check that all the fields you need are populated (not just documented).
- Freshness. When was the data last crawled? Instagram changes rapidly; a 6-month-old dataset has stale follower counts.
- Coverage. Is the dataset a uniform random sample or filtered to a niche? Make sure the distribution matches your use case.
- Licensing. Can you use it commercially? Re-distribute it? Train models you later monetize? Enterprise buyers in particular should read the ToS before purchasing.
- Update cadence. Is it a one-time snapshot or a refreshing feed? Predictive models usually need refreshing.
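The "sample first" check can be automated with a few lines. The sketch below computes, for each field you care about, the fraction of sample rows in which it is actually populated. The column names and the inline sample are placeholders; in practice you would open the vendor's sample file instead.

```python
import csv
import io


def fill_rates(rows, fields):
    """Fraction of rows in which each field is present and non-blank."""
    rows = list(rows)
    if not rows:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for row in rows if (row.get(field) or "").strip()) / len(rows)
        for field in fields
    }


# Placeholder sample; replace with open("sample.csv", newline="", encoding="utf-8").
sample = io.StringIO(
    "username,follower_count,biography\n"
    "a,100,hello\n"
    "b,,\n"
    "c,300, \n"
    "d,400,bio\n"
)
rates = fill_rates(csv.DictReader(sample), ["username", "follower_count", "biography"])
print(rates)
```

A field that is documented but only half populated, like `biography` in this toy sample, is worth raising with the vendor before you commit.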
Summary
Instagram datasets for ML exist on a spectrum, from $47 one-time bundles that cover 20+ platforms to enterprise-scale snapshots from Bright Data at $0.0025/record. DataLikers sits in the middle: curated datasets scoped to your use case, a Cache API for assembling any custom shape yourself, and an MCP Server for ad-hoc exploration before you commit to a full dataset purchase.
Start: datalikers.com/datasets — browse available datasets, request a sample, or talk to the team about a custom shape.