Adren's profile

22 Messages

 • 

428 Points

Saturday, September 24th, 2022

Solved

Missing entries (tconst) in title.basics.tsv with regard to other files

When I check every tconst (tt...) found in the other files (akas, episode and principals) against the main file (title.basics.tsv), there are inconsistencies: some movies and TV shows are missing.

The parent series of some TV show episodes found in title.episode.tsv (the parentTconst value) are absent from title.basics.tsv. For instance, tt0086748 has 8 episodes in title.episode.tsv:

+-----------+--------------+--------------+---------------+
| tconst    | parentTconst | seasonNumber | episodeNumber |
+-----------+--------------+--------------+---------------+
| tt0630512 | tt0086748    | 1            | 6             |
| tt0630513 | tt0086748    | 1            | 8             |
| tt0630514 | tt0086748    | 1            | 7             |
| tt0630515 | tt0086748    | 1            | 4             |
| tt0630516 | tt0086748    | 1            | 2             |
| tt0630517 | tt0086748    | 1            | 5             |
| tt0630518 | tt0086748    | 1            | 1             |
| tt0630519 | tt0086748    | 1            | 3             |
+-----------+--------------+--------------+---------------+

but tt0086748 is not present in title.basics.tsv, while the TV show exists on IMDb: https://www.imdb.com/title/tt0086748/

What is even more surprising is that, in some cases, the parent TV show exists neither in the file nor on the website:

+-----------+--------------+--------------+---------------+
| tconst    | parentTconst | seasonNumber | episodeNumber |
+-----------+--------------+--------------+---------------+
| tt7153894 | tt4824012    | 6            | 21            |
+-----------+--------------+--------------+---------------+

tt4824012 is also absent from title.basics.tsv and it returns a dead page, but tt6781204 shows the episode with a different ID for the series (tt2912216 instead of tt4824012).

And for a more extreme case:

+-----------+--------------+--------------+---------------+
| tconst    | parentTconst | seasonNumber | episodeNumber |
+-----------+--------------+--------------+---------------+
| tt4839832 | tt4820982    | 1            | 17            |
+-----------+--------------+--------------+---------------+

Here, both IDs (episode and parent series) are absent from title.basics.tsv and from the website:

tt4820982 -> 404 Not Found
tt4839832 -> 404 Not Found

In conclusion, from those 3 files:

- title.akas.tsv
- title.episode.tsv
- title.principals.tsv

when we extract respectively the titleId, parentTconst and tconst and check them against the tconst in title.basics.tsv, there are 6220 missing entries:

- 1086 seem to be valid pages on the IMDb website (although 48 are redirections -> HTTP/1.1 308 Redirect)
- but 5134 are dead pages (see tt4820982 above)

The files title.ratings and title.crew don't seem to have the same problem.


Employee

 • 

5.6K Messages

 • 

58.9K Points

3 years ago

Hi

22 Messages

 • 

428 Points

3 years ago

[Update 2022-09-29]

The column "knownForTitles" from the names file (name.basics.tsv) provides a large number of tconst that should also be present in title.basics.tsv. Unfortunately, many of these IDs are unavailable there, such as this 1922 movie (Beyond the Rainbow) https://www.imdb.com/title/tt0012937/ or the more recent TV show (Joe Scott - TMI) https://www.imdb.com/title/tt21318080/

Both tconst are only present in name.basics.tsv:

+------------+----------------+-----------------------------------------+
| nconst     | primaryName    | knownForTitles                          |
+------------+----------------+-----------------------------------------+
| nm0150767  | F. Champury    | tt0005330,tt0012937,tt0011624           |
| nm0604384  | Harry T. Morey | tt0010289,tt0012444,tt0185913,tt0012937 |
| nm12640960 | Joe Scott      | tt14797924,tt21318080                   |
+------------+----------------+-----------------------------------------+

but in no other file from the datasets downloaded last week (23 September 2022).

In conclusion, I found 7895 unique IMDb tconst (tt...) absent from the title.basics.tsv file while they are present in one (or several) of those 4 other files (title.akas.tsv, title.episode.tsv, title.principals.tsv, name.basics.tsv).

Checked individually, the missing IDs from title.basics are distributed as follows:

name.basics 2578
title.akas 5741
title.crew 0
title.episode 27
title.principals 928
title.ratings 0

And when those 7895 tconst are checked on the IMDb website, here is the result:

- 1866 are existing films/series/... (returns an "HTTP/1.1 200 OK"), including 535 that are redirections towards another page (308 redirect)
- but an astonishing number don't exist: 6029 send back a "404 Not Found"
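For anyone who wants to reproduce the knownForTitles check, here is a minimal sketch. It assumes that knownForTitles is the 6th tab-separated column of name.basics.tsv and that tconst is the 1st column of title.basics.tsv; the function name missing_known_for is only an illustration, not something provided with the datasets.

```shell
# Hypothetical helper: list the tconst referenced in knownForTitles
# that are absent from title.basics.tsv.
# Assumption: knownForTitles is the 6th tab-separated column of name.basics.tsv.
missing_known_for() {
    basics=$1   # path to title.basics.tsv
    names=$2    # path to name.basics.tsv
    tmp=$(mktemp -d) || return 1

    # Sorted unique tconst from the reference file (grep drops the header line).
    cut -f1 "$basics" | grep '^tt' | LC_ALL=C sort -u > "$tmp/basics"

    # One tconst per line: split the comma-separated column,
    # then drop the header and the \N placeholders.
    cut -f6 "$names" | tr ',' '\n' | grep '^tt' | LC_ALL=C sort -u > "$tmp/known"

    # Keep only the lines unique to the second list:
    # IDs referenced in knownForTitles but not found in basics.
    LC_ALL=C comm -13 "$tmp/basics" "$tmp/known"

    rm -rf "$tmp"
}
```

It could be called as missing_known_for title.basics.tsv name.basics.tsv | wc -l to get the number of missing IDs. Both lists are sorted with LC_ALL=C so that comm sees them in the same order.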

(edited)

Employee

 • 

5.6K Messages

 • 

58.9K Points

2 years ago

Hi

22 Messages

 • 

428 Points

Hello @Bethanny

Unfortunately, with the files retrieved yesterday (2023-04-26 @ 15:15), there are still thousands of tconst found in both title.akas.tsv and name.basics.tsv that are not present in title.basics.tsv. For instance, these tconst from akas are not in the main title.basics:

tt0021006 tt0021453 tt0023019 tt0024677 tt0036165 tt0038098 tt0046142 tt0052041 tt0052206
...
tt7766088 tt7779806 tt7829938 tt7869672 tt7892078 tt7978886 tt8206494 tt8466868 tt8982514 tt9496006

If I look more closely at the first example (tt0021006), here is the line in title.akas.tsv:

tt0021006 1 Ja, der Himmel über Wien AT \N \N \N \N

but there is no corresponding web page: https://www.imdb.com/title/tt0021006/ (404 Not Found)

On the other hand, there is a movie with this particular title for Austria: https://www.imdb.com/title/tt0023449/releaseinfo/#akas so it should be tt0023449 instead of tt0021006.

To conclude, there are still 5201 tconst missing in akas and 2537 in names.

I hope this will help.

Employee

 • 

5.6K Messages

 • 

58.9K Points

22 Messages

 • 

428 Points

2 years ago

If you want to check how many tconst are missing from the title.basics file compared to title.akas, you can run the following commands on a Unix system (Linux or other) to get both lists of sorted IDs:

    cut -d$'\t' -f1 datasets.imdbws.com/title.basics.tsv | sort -u > /tmp/basics_tconst.tsv
    cut -d$'\t' -f1 datasets.imdbws.com/title.akas.tsv | sort -u > /tmp/akas_tconst.tsv

and to compute the IDs that are only in akas but not in basics:

    comm -13 /tmp/basics_tconst.tsv /tmp/akas_tconst.tsv > /tmp/only_in_akas.tsv

Counting them:

    grep -c ^tt /tmp/only_in_akas.tsv

shows that there are still 5303 tconst in the akas file that are not found in basics (files retrieved on 2023-05-05).

(edited)

22 Messages

 • 

428 Points

1 year ago

Hello @Bethanny

I just wanted to let you know that, as of today, the inconsistency is no longer present in the files. This is probably due to the change mentioned in the message/banner explaining that the "datasets are backed by a new data source as of March 18th, 2024".

You can close this ticket.

22 Messages

 • 

428 Points

1 year ago

Hello @Bethanny

Unfortunately, I have to reopen this ticket, as there is again a large number of tconst in the related (sub)files that are missing from the reference (title.basics.tsv). Here is the detailed count of missing tconst per file:

name.basics 236
title.akas 1854
title.crew 189045
title.episode 1512 (tconst) / 10 (parentTconst)
title.principals 1413
title.ratings 0

I checked the first 10 IDs in title.crew and they are mostly redirects.

22 Messages

 • 

428 Points

As of today (11th of July 2024), the inconsistencies in title.akas and title.episode are fixed. But title.crew still contains nearly 190k tconst missing from title.basics.

Here is the update:

name.basics 314
title.akas 0
title.crew 189820
title.episode 0 / 0
title.principals 1310
title.ratings 0

Here are the first missing tconst (ordered by number) in title.crew that are not present in title.basics:

┌───────────┬─────────────────────┬───────────┐
│ tconst    │ directors           │ writers   │
├───────────┼─────────────────────┼───────────┤
│ tt0000021 │ nm0525910           │           │
│ tt0000136 │ nm0525910           │           │
│ tt0000311 │                     │           │
│ tt0000600 │ nm0488932           │ nm0241414 │
│ tt0000635 │ nm0085865,nm0448682 │ nm0000636 │
└───────────┴─────────────────────┴───────────┘

(all of them are redirections to another page)

(edited)

Employee

 • 

18K Messages

 • 

318.8K Points

11 months ago

Hi

22 Messages

 • 

428 Points

Hi @Michelle

Only the issue with title.principals has been corrected, some weeks ago. As for the other problems, there are still today 192570 tconst in title.crew.tsv that are not present in the title.basics.tsv file:

┌────────────────────────────────────────┬───────────────────────────────┐
│ URL_tt                                 │ directors                     │
├────────────────────────────────────────┼───────────────────────────────┤
│ https://www.imdb.com/title/tt0000021/  │ nm0525910                     │
│ https://www.imdb.com/title/tt0000136/  │ nm0525910                     │
│ https://www.imdb.com/title/tt0000311/  │                               │
│ https://www.imdb.com/title/tt0000600/  │ nm0488932                     │
│ https://www.imdb.com/title/tt0000635/  │ nm0085865,nm0448682           │
│ https://www.imdb.com/title/tt0000702/  │ nm0159015                     │
│ https://www.imdb.com/title/tt0000710/  │ nm0085865,nm0710362           │
│ https://www.imdb.com/title/tt0000735/  │ nm0143333,nm0892614           │
│ https://www.imdb.com/title/tt0000937/  │ nm0000428                     │
│ https://www.imdb.com/title/tt0000973/  │ nm0000428                     │
│ https://www.imdb.com/title/tt0001433/  │ nm0085865                     │
│ https://www.imdb.com/title/tt0001651/  │ nm0159015                     │
│ https://www.imdb.com/title/tt0001745/  │ nm0048864                     │
│ https://www.imdb.com/title/tt0001938/  │ nm0000428                     │
│ https://www.imdb.com/title/tt0001953/  │ nm0135052                     │
│ https://www.imdb.com/title/tt0001958/  │ nm0309163                     │
│ https://www.imdb.com/title/tt0001991/  │ nm0300487                     │
│ https://www.imdb.com/title/tt0002032/  │ nm0085865,nm0448682,nm0949648 │
│ https://www.imdb.com/title/tt0002275/  │ nm0408436                     │
│ https://www.imdb.com/title/tt0002957/  │ nm0102643                     │
│                   ·                    │               ·               │
│                   ·                    │               ·               │
│                   ·                    │               ·               │
│ https://www.imdb.com/title/tt34232952/ │ nm16643084,nm6354377          │
│ https://www.imdb.com/title/tt34235957/ │                               │
│ https://www.imdb.com/title/tt34241255/ │                               │
│ https://www.imdb.com/title/tt34241258/ │                               │
│ https://www.imdb.com/title/tt34241261/ │                               │
│ https://www.imdb.com/title/tt34241262/ │                               │
│ https://www.imdb.com/title/tt34241268/ │                               │
│ https://www.imdb.com/title/tt34259430/ │ nm2980216                     │
│ https://www.imdb.com/title/tt34267400/ │                               │
│ https://www.imdb.com/title/tt34280336/ │                               │
│ https://www.imdb.com/title/tt34281236/ │ nm9742632                     │
│ https://www.imdb.com/title/tt34281469/ │ nm0333132,nm6091692           │
│ https://www.imdb.com/title/tt34284619/ │                               │
│ https://www.imdb.com/title/tt34286337/ │ nm1414582                     │
│ https://www.imdb.com/title/tt34316127/ │ nm10245362                    │
│ https://www.imdb.com/title/tt34316310/ │ nm15705740,nm15384155         │
│ https://www.imdb.com/title/tt34322279/ │ nm0591101                     │
│ https://www.imdb.com/title/tt34338732/ │                               │
│ https://www.imdb.com/title/tt34340098/ │ nm15281673                    │
│ https://www.imdb.com/title/tt34376056/ │                               │
├────────────────────────────────────────┴───────────────────────────────┤
│ 192570 rows (40 shown)                                       2 columns │
└────────────────────────────────────────────────────────────────────────┘

After a quick check, all the IDs tested appear to be redirected. Here are the first four tconst:

https://www.imdb.com/title/tt0000021/ -> https://www.imdb.com/title/tt0000013/
https://www.imdb.com/title/tt0000136/ -> https://www.imdb.com/title/tt0000014/
https://www.imdb.com/title/tt0000311/ -> https://www.imdb.com/title/tt0000265/
https://www.imdb.com/title/tt0000600/ -> https://www.imdb.com/title/tt0000583/

and the last ones in chronological/ranking order:

https://www.imdb.com/title/tt34322279/ -> https://www.imdb.com/title/tt33550053/
https://www.imdb.com/title/tt34338732/ -> https://www.imdb.com/title/tt28255955/
https://www.imdb.com/title/tt34340098/ -> https://www.imdb.com/title/tt34326340/
https://www.imdb.com/title/tt34376056/ -> https://www.imdb.com/title/tt34376050/

I haven't had the time to check them all systematically, but I'm sure that over the past months the number of tconst missing from title.crew has always been above 190k IDs, with most if not all of them redirecting to another page (HTTP code 308 / Permanent Redirect).

Another remaining inconsistency is found in the name.basics.tsv file: among the IDs in the knownForTitles column, 14 of them do not match any tconst in title.basics:

https://www.imdb.com/title/tt4864946/ -> 404 (page not found)
https://www.imdb.com/title/tt8170096/ -> the page exists on www.imdb.com, but tconst is missing from title.basics
https://www.imdb.com/title/tt9174576/ -> redirects to https://www.imdb.com/title/tt8694398/
https://www.imdb.com/title/tt9745406/ -> 404 (page not found)
https://www.imdb.com/title/tt11127492/ -> 404 (page not found)
https://www.imdb.com/title/tt11670206/ -> 404 (page not found)
https://www.imdb.com/title/tt14598938/ -> the page exists on www.imdb.com, but tconst is missing from title.basics
https://www.imdb.com/title/tt22183320/ -> 404 (page not found)
https://www.imdb.com/title/tt27243323/ -> 404 (page not found)
https://www.imdb.com/title/tt29332867/ -> 404 (page not found)
https://www.imdb.com/title/tt29332868/ -> 404 (page not found)
https://www.imdb.com/title/tt31012569/ -> 404 (page not found)
https://www.imdb.com/title/tt32339746/ -> 404 (page not found)

NB: The files on which those numbers are calculated are all dated 2024-10-30, with title.basics.tsv having 11201578 lines (11201577 unique tconst).

Thank you very much for following up on this issue.
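The status checks listed above could be scripted roughly as follows. This is only a sketch under assumptions: the function names classify_status and check_tconst are made up for illustration, the labels are my own wording, and the website may reject or throttle automated requests, in which case the status will fall under "other".

```shell
# Hypothetical helper: map an HTTP status code to a short label.
classify_status() {
    case "$1" in
        200) echo "exists" ;;
        301|302|307|308) echo "redirect" ;;
        404) echo "not found" ;;
        *) echo "other ($1)" ;;
    esac
}

# Hypothetical helper: read one tconst per line on stdin and print
# "tconst<TAB>status<TAB>label<TAB>redirect target" for each of them.
check_tconst() {
    while IFS= read -r id; do
        # -s: silent, -o /dev/null: discard the body,
        # -w: print the status code and the redirect target (if any).
        result=$(curl -s -o /dev/null -w '%{http_code} %{redirect_url}' \
            "https://www.imdb.com/title/$id/")
        code=${result%% *}      # text before the first space
        target=${result#* }     # text after the first space (may be empty)
        printf '%s\t%s\t%s\t%s\n' "$id" "$code" "$(classify_status "$code")" "$target"
        sleep 1                 # pause between requests
    done
}
```

For example, head -n 10 /tmp/only_in_akas.tsv | check_tconst would test the first ten IDs of a list built with the comm command given earlier in this thread. curl is deliberately run without -L here, so that a redirect is reported instead of being followed.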

(edited)

Employee

 • 

18K Messages

 • 

318.8K Points

Hi

22 Messages

 • 

428 Points

Hi @Michelle

I just wanted to let you know that there is no longer a single inconsistency between the files (title.akas, name.basics, title.episode, title.crew and title.ratings) and the "main" file: no tconst they reference is missing from title.basics.tsv anymore.

The same goes for title.principals.tsv, which shows no discrepancy with name.basics.tsv regarding nconst.

Thank you very much for fixing everything; this ticket can be closed (again).