TikTok Datasets for Machine Learning
Training machine learning models on TikTok data usually starts with the wrong question: "How do I scrape enough videos?" The better question is "Where can I buy or license pre-built TikTok datasets that match my feature schema?" Scraping 10M+ records yourself takes weeks of engineering, thousands of dollars in proxies, and constant upkeep, while a pre-built dataset ships as a CSV download.
This guide covers what TikTok datasets typically look like, what to evaluate when buying, and how DataLikers and competing providers compare.
What's in a Typical TikTok Dataset
Most TikTok datasets break down into three core tables:
- Users — unique_id (handle), nickname, signature (bio), follower / following / video counts, verified, region, profile image URL.
- Videos — video PK, author user PK, description (caption), media URL, created_at timestamp, play count, like count, comment count, share count, music ID, hashtags.
- Comments — comment PK, video PK, author user PK, text, created_at timestamp, like count, reply count, parent comment PK.
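As a rough sketch, the three core tables could be modeled as typed records like the ones below. The Python attribute names, the `pk` key names, and the choice of Unix timestamps are illustrative assumptions, not a documented provider schema; only the list of fields comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class User:
    pk: str                 # assumed primary-key column name
    unique_id: str          # handle
    nickname: str
    signature: str          # bio
    follower_count: int
    following_count: int
    video_count: int
    verified: bool
    region: str
    avatar_url: str         # profile image URL


@dataclass
class Video:
    pk: str
    author_pk: str          # joins to User.pk
    description: str        # caption
    media_url: str
    created_at: int         # assumed: Unix timestamp in seconds
    play_count: int
    like_count: int
    comment_count: int
    share_count: int
    music_id: Optional[str] = None
    hashtags: List[str] = field(default_factory=list)


@dataclass
class Comment:
    pk: str
    video_pk: str           # joins to Video.pk
    author_pk: str          # joins to User.pk
    text: str
    created_at: int
    like_count: int
    reply_count: int
    parent_pk: Optional[str] = None   # None for top-level comments
```

Writing the join keys down this way makes it easier to see which fields a model depends on before comparing provider schemas.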
Higher-quality datasets add:
- Engagement metrics — derived ratios (engagement rate, viral score, follower quality).
- Hashtag time series — daily / hourly post counts and view counts per hashtag.
- Playlist (mix) data — playlist IDs and the video PKs they contain.
- Music tracks — music PK, title, artist, originality flag, video count using the track.
- Region / language — aggregate distributions per user or hashtag.
Make sure the schema matches your feature needs before buying — re-deriving fields from raw data is where dataset projects get slow.
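One way to check schema fit is to compare the header row of a sample file against the columns your features need. The sketch below assumes the sample is a CSV with a header row; the column names and the helper `missing_columns` are hypothetical, chosen only for illustration.

```python
import csv
import io
from typing import Iterable, Set


def missing_columns(csv_text: str, required: Iterable[str]) -> Set[str]:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])          # empty list if the file has no rows
    present = {name.strip() for name in header}
    return set(required) - present


# Hypothetical sample and feature list
sample = "unique_id,nickname,follower_count\nalice,Alice,1200\n"
needed = ["unique_id", "follower_count", "region"]
print(missing_columns(sample, needed))   # prints {'region'}
```

An empty result means every required column is present; anything else is a field you would have to re-derive or source elsewhere.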
Common ML Use Cases on TikTok
1. Creator discovery and influencer matching
Find creators whose audience overlaps with a target brand or niche. Features needed: user profile fields, video topics (from caption NLP), follower demographics, engagement rates.
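A very simple starting point for niche matching is set overlap between the hashtags a creator uses and the hashtags that define the target niche. The sketch below uses Jaccard similarity as a crude proxy for topic match; it is a stand-in for real caption NLP or audience-overlap data, and the function names and sample handles are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity of two tag collections, case-insensitive."""
    sa: Set[str] = {t.lower() for t in a}
    sb: Set[str] = {t.lower() for t in b}
    union = sa | sb
    if not union:
        return 0.0
    return len(sa & sb) / len(union)


def rank_creators(creator_tags: Dict[str, List[str]],
                  niche_tags: List[str]) -> List[Tuple[str, float]]:
    """Sort creators by hashtag overlap with the niche, best match first."""
    scores = {handle: jaccard(tags, niche_tags)
              for handle, tags in creator_tags.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data
creators = {
    "runner_amy": ["running", "marathon", "fitness"],
    "chef_bo": ["cooking", "recipes"],
}
print(rank_creators(creators, ["Running", "Fitness", "gym"]))
```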
2. Viral prediction / early-stage trending
Predict which videos will go viral from the first hour of engagement. Features needed: minute-level engagement series (play count, like count, share count), caption embeddings, hashtag co-occurrence, creator history.
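To make the "first hour of engagement" idea concrete, here is a minimal sketch that turns a minute-level series into a few model features. It assumes each snapshot holds cumulative counters and a minute offset from publish time; the tuple layout and feature names are illustrative assumptions, not a provider format.

```python
from typing import Dict, List, Tuple

# (minutes since publish, cumulative plays, cumulative likes, cumulative shares)
Snapshot = Tuple[int, int, int, int]


def first_hour_features(series: List[Snapshot]) -> Dict[str, float]:
    """Summarize engagement using the last snapshot within the first 60 minutes."""
    window = sorted(s for s in series if 0 <= s[0] <= 60)
    if not window:
        return {"plays_60m": 0.0, "like_rate": 0.0,
                "share_rate": 0.0, "plays_per_min": 0.0}
    minute, plays, likes, shares = window[-1]
    return {
        "plays_60m": float(plays),
        "like_rate": likes / plays if plays else 0.0,
        "share_rate": shares / plays if plays else 0.0,
        "plays_per_min": plays / minute if minute else 0.0,
    }


# Hypothetical series; the 90-minute snapshot is outside the window and ignored
series = [(10, 100, 10, 1), (60, 1200, 120, 24), (90, 5000, 400, 50)]
print(first_hour_features(series))
```

These features would typically be combined with caption embeddings and creator history before training; the sketch covers only the engagement-series part.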
3. Trend detection (hashtag and music)
Find emerging hashtags or songs. Features needed: time-series hashtag post counts and view counts, music ID frequency, day-over-day deltas.
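The day-over-day deltas mentioned above can be sketched as follows. The code flags a hashtag as emerging when its daily post count grew by at least a given fraction on each of the last few days; the thresholds (50% growth, 3 days) and function names are arbitrary assumptions to be tuned on real data.

```python
from typing import Dict, List


def day_over_day_growth(daily_counts: List[int]) -> List[float]:
    """Relative change between consecutive days; 0.0 when the prior day is zero."""
    growth = []
    for prev, cur in zip(daily_counts, daily_counts[1:]):
        growth.append((cur - prev) / prev if prev > 0 else 0.0)
    return growth


def emerging(series_by_tag: Dict[str, List[int]],
             min_growth: float = 0.5, days: int = 3) -> List[str]:
    """Tags whose last `days` growth rates all reach `min_growth`."""
    hits = []
    for tag, counts in series_by_tag.items():
        recent = day_over_day_growth(counts)[-days:]
        if len(recent) == days and all(g >= min_growth for g in recent):
            hits.append(tag)
    return hits


# Hypothetical daily post counts per hashtag
history = {
    "newdance": [100, 150, 225, 340],
    "fyp": [1000, 1010, 1005, 1020],
}
print(emerging(history))   # prints ['newdance']
```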
4. Brand / product recognition in UGC
Detect brand mentions or product appearances. Features needed: video metadata, caption NLP, optional frame-level visual analysis.
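For the caption side of brand detection, a keyword matcher is often the first baseline before any NLP or visual model. The sketch below matches brand names as whole words, with or without a leading `#` or `@`; the brand list and helper names are hypothetical, and it assumes a non-empty brand list.

```python
import re
from typing import Iterable, List


def build_matcher(brands: Iterable[str]) -> re.Pattern:
    """Compile a case-insensitive whole-word pattern for the given brand names."""
    # Longest names first so longer brands win over shorter prefixes
    ordered = sorted(brands, key=len, reverse=True)
    alternatives = "|".join(re.escape(b) for b in ordered)
    return re.compile(rf"(?<!\w)[#@]?({alternatives})(?!\w)", re.IGNORECASE)


def find_brands(caption: str, matcher: re.Pattern) -> List[str]:
    """Return the distinct brands mentioned in a caption, lowercased and sorted."""
    return sorted({m.group(1).lower() for m in matcher.finditer(caption)})


matcher = build_matcher(["Nike", "Adidas"])
print(find_brands("New #nike drop, adidas who? Nikey", matcher))
# prints ['adidas', 'nike']
```

Plain keyword matching misses misspellings and products shown without being named, which is where the optional frame-level analysis comes in.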
5. Authenticity / fake-follower detection
Detect inauthentic accounts via engagement asymmetry. Features needed: follower-to-engagement ratios, comment-to-like ratios, comment authenticity (real comments vs bot patterns), posting cadence.
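The engagement-asymmetry idea can be illustrated with two of the ratios named above. This is a heuristic sketch, not a validated detector: the threshold values are placeholder assumptions, and the dictionary keys mirror the like and comment counts from the video table under assumed column names.

```python
from typing import Dict, List


def authenticity_signals(follower_count: int,
                         videos: List[Dict[str, int]]) -> Dict[str, float]:
    """Compute follower-to-engagement and comment-to-like ratios for one account."""
    likes = sum(v["like_count"] for v in videos)
    comments = sum(v["comment_count"] for v in videos)
    avg_likes = likes / len(videos) if videos else 0.0
    return {
        "likes_per_follower": avg_likes / follower_count if follower_count else 0.0,
        "comment_to_like": comments / likes if likes else 0.0,
    }


def looks_suspicious(signals: Dict[str, float],
                     min_likes_per_follower: float = 0.005,
                     min_comment_to_like: float = 0.002) -> bool:
    """Flag accounts whose engagement is very low relative to audience size."""
    return (signals["likes_per_follower"] < min_likes_per_follower
            or signals["comment_to_like"] < min_comment_to_like)


# Hypothetical account: 1M followers but only a few hundred likes per video
signals = authenticity_signals(
    1_000_000,
    [{"like_count": 500, "comment_count": 0},
     {"like_count": 300, "comment_count": 1}],
)
print(signals, looks_suspicious(signals))
```

In practice these ratios would be features for a classifier alongside posting cadence and comment-text patterns, rather than hard rules.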
6. Cross-platform creator graphs
Match TikTok creators to their Instagram presence for cross-platform analytics. Features needed: TikTok handle, Instagram handle (sometimes in bio), engagement metrics on both platforms. DataLikers covers both — see Instagram Datasets for ML.
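Since the Instagram handle sometimes appears in the TikTok bio, a first pass at cross-platform matching can be a regex over the `signature` field. The patterns below are assumptions about common bio conventions ("IG: @name", "insta - name", an instagram.com link) and will miss other phrasings; the function name is hypothetical.

```python
import re
from typing import Optional

# Handle characters: letters, digits, periods, underscores, up to 30 characters
_IG_PATTERNS = [
    re.compile(r"instagram\.com/([A-Za-z0-9._]{1,30})", re.IGNORECASE),
    re.compile(r"\b(?:ig|insta|instagram)\s*[:\-@]\s*@?([A-Za-z0-9._]{1,30})",
               re.IGNORECASE),
]


def instagram_handle_from_bio(bio: str) -> Optional[str]:
    """Return a lowercased Instagram handle found in the bio, or None."""
    for pattern in _IG_PATTERNS:
        match = pattern.search(bio)
        if match:
            # Drop trailing periods picked up from sentence punctuation
            return match.group(1).rstrip(".").lower()
    return None


print(instagram_handle_from_bio("Dancer | IG: @Jane.Doe_ | collabs via email"))
# prints jane.doe_
```

Candidates found this way should still be verified against the Instagram profile before being treated as a confirmed match.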
DataLikers TikTok Datasets
DataLikers ships pre-built TikTok datasets and a Dataset Builder for custom collections.
Pre-built TikTok datasets
- Top TikTok users by country — millions of public user profiles, ranked by follower count, segmented by region.
- TikTok hashtag history — time series for thousands of hashtags, daily post and view counts.
- TikTok music tracks — popular music PKs with metadata and the videos using each.
All pre-built datasets ship in CSV format via S3-signed download links and are refreshed monthly.
Dataset Builder (custom)
Compose a custom TikTok dataset by filter:
- By hashtag — collect all videos under a hashtag with full metadata + comments.
- By user list — pull video catalogs for a fixed list of TikTok handles.
- By music ID — find every video using a specific track.
- By region — TikTok users / videos filtered by reported country.
Custom builds are quoted by record count; per-record pricing follows the same freshness tiers as the Cache API.
Provider Comparison
| Provider | TikTok focus | Format | Custom builds | MCP-ready |
|---|---|---|---|---|
| DataLikers | Yes (primary) | CSV download + Cache API + MCP | Yes — Dataset Builder | Yes (20 TikTok tools) |
| Bright Data | Yes (enterprise) | CSV / JSON dataset | Yes — enterprise quote | No |
| Apify TikTok Scraper | Actor-based | JSON dataset, async | Yes — via actor configuration | Generic actor MCP wrapper |
| Scrape Creators | Yes (primary) | API-only, no pre-built CSV | No — request-by-request | No |
| Kaggle / academic | Snapshot-only, often stale | CSV | No | No |
For research / training workloads where you need millions of records once, DataLikers and Bright Data are the two real choices. DataLikers wins on price ($0.0003-$0.0006 per record vs Bright Data's $0.50-$5 per record) and bundling (Cache API + MCP from the same key). Bright Data wins on enterprise compliance (SOC 2 / ISO 27001) and pre-collected dataset depth.
Buy vs Build Tradeoff
Building your own TikTok dataset means standing up:
- TikTok scraping infrastructure (anti-bot, proxies, fingerprints, regional VPNs)
- TikTok account pool for authenticated endpoints
- A storage and refresh pipeline (typical TikTok scraping projects produce ~1 TB / month of raw data)
- Schema normalization (TikTok's internal response shape changes regularly)
- Ongoing fix-it engineering as TikTok ships countermeasures
For most research / product workloads under 100M records / month, buying from DataLikers or a competitor is significantly cheaper than building. Building makes sense when you need very specific custom features (visual analysis, audio fingerprinting, etc.) that no provider ships.
Evaluation Checklist Before Buying
- Schema fit — does the dataset include every field your model needs, or will you have to re-derive?
- Freshness — when was the dataset last refreshed? For trending / viral work, freshness matters; for static training data, monthly snapshots are fine.
- Sample size — request a 10K-record sample before committing to a full purchase.
- Re-distribution rights — can you redistribute derived models, datasets, or summaries?
- API access alongside — do you also get live API access (DataLikers, yes; Bright Data datasets, no) so you can refresh single records?
- MCP / AI-tool compatibility — does the provider expose an MCP server so your AI agents can query live? DataLikers does; most others do not.
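Part of this checklist can be automated on the sample itself. The sketch below computes per-field fill rates, which helps show whether a column that exists in the schema is actually populated. The row format (a list of dicts, as `csv.DictReader` would produce) and the field names are assumptions.

```python
from typing import Dict, Iterable, List, Mapping


def fill_rates(rows: Iterable[Mapping[str, object]],
               fields: List[str]) -> Dict[str, float]:
    """Fraction of rows in which each field is present and non-empty."""
    rows = list(rows)
    if not rows:
        return {f: 0.0 for f in fields}
    rates = {}
    for f in fields:
        # None and "" count as missing; 0 and False count as filled
        filled = sum(1 for r in rows if r.get(f) not in (None, ""))
        rates[f] = filled / len(rows)
    return rates


# Hypothetical sample rows
sample_rows = [
    {"region": "US", "signature": ""},
    {"region": None, "signature": "hi"},
    {"region": "DE", "signature": "yo"},
    {"region": "FR"},
]
print(fill_rates(sample_rows, ["region", "signature"]))
# prints {'region': 0.75, 'signature': 0.5}
```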
Pricing (DataLikers TikTok)
Same pricing model as the Cache API — pay-per-record at four freshness tiers:
| Plan | Price/record | Best for |
|---|---|---|
| Outdated | $0.0003 | Bulk research, training data |
| Month | $0.0004 | Trend analysis |
| Week | $0.0005 | Weekly reporting |
| Day | $0.0006 | Near-realtime feature pipelines |
100 free requests on signup, $50 minimum deposit, no monthly subscription, funds never expire.
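The per-record arithmetic is simple enough to script when budgeting a pull. This sketch hard-codes the four tier prices from the table above and uses `Decimal` to avoid floating-point rounding on small per-record prices; the function name and lowercase tier keys are illustrative choices.

```python
from decimal import Decimal

# Prices per record, copied from the tier table above
PRICE_PER_RECORD = {
    "outdated": Decimal("0.0003"),
    "month": Decimal("0.0004"),
    "week": Decimal("0.0005"),
    "day": Decimal("0.0006"),
}


def estimate_cost(records: int, tier: str) -> Decimal:
    """Estimated cost in USD for `records` records at the given freshness tier."""
    return PRICE_PER_RECORD[tier.lower()] * records


for tier in PRICE_PER_RECORD:
    print(tier, estimate_cost(1_000_000, tier))
```

For 1M records this gives $300 at the Outdated tier and $600 at the Day tier. Pulls that cost less than $50 still fall under the $50 minimum deposit noted above.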
FAQ
What does a typical TikTok user record cost?
At the Outdated tier (records more than 30 days old), a user record costs $0.0003, so a 1M-user dataset costs $300 (1,000,000 × $0.0003).
Can I get a sample before buying?
Yes — 100 free requests on signup let you spot-check the schema and field quality before committing to a larger pull. Custom dataset builds typically include a 10K-record sample for quote validation.
Do TikTok datasets include music or audio?
DataLikers exposes music track metadata (track PK, title, artist, video count) but does not currently ship raw audio files. For audio fingerprinting workflows, license the audio separately.
Are TikTok comments available at scale?
Yes — DataLikers exposes both individual comment lookups (Cache API /t1/comment/by/id) and bulk comment pulls by user PK. For deep comment research, query via the Dataset Builder rather than per-record.
How does this compare to Instagram datasets?
See Instagram Datasets for ML. Both are operated by DataLikers, with the same freshness tiers and a similar schema shape. The differences: Instagram has 30+ MCP tools versus TikTok's 20, and field names differ (for example, TikTok uses unique_id instead of username for handles).
Getting Started
Register for a DataLikers account to receive 100 free requests. Spot-check the TikTok schema, then request a custom dataset via the Dataset Builder or query records on demand via the Cache API and TikTok MCP Server.