johnny_m's profile

7 Messages

 • 

150 Points

Tuesday, July 26th, 2022

Closed

Solved

title.basics.tsv.gz is broken - https://datasets.imdbws.com/

The title.basics.tsv.gz dataset at https://datasets.imdbws.com/ is broken. It now includes only 3,477,496 titles, when it should have almost three times that number. The data is corrupted after the title "Kneeling for Justice: A San Francisco Memorial for George Floyd". That title text appears in the titleType column, and the tconst value for that record is "ial for George Floyd". Could someone at IMDb please correct this? Thank you!


14 Messages

 • 

330 Points

3 years ago

There are tconst entries that appear to be missing from title.basics.tsv. The entry tt0055928, "Dr. No", was missing for a few days but is present once again. Here are some missing entries of which I am aware; there could be others as well. These tconst values also have no entries in the title.crew.tsv, title.ratings.tsv, or title.episode.tsv files.

tt0562856 tt0562972 tt0811753 tt0811802 tt0811803 tt0811804 tt10756720 tt10927782 tt10927786 tt10927788 tt10927792 tt10927794 tt11252880 tt11301906 tt11669800 tt12919806 tt12919828 tt13056134 tt13056158 tt13286384 tt13286388 tt13422242 tt13422246 tt13431146 tt13675380 tt13675384 tt13825212 tt14739882 tt14883030 tt14921134 tt15130830 tt15588572 tt15806014 tt1747582 tt1752041 tt1771906 tt18231284 tt18258626 tt4462678 tt6579350 tt8084176 tt8893678

Note: This comment was created from a merged conversation originally titled Some tconst titles missing from titles.basic.tsv

2 Messages

 • 

70 Points

3 years ago

Sometime over the last two to three weeks (between files downloaded on 2022-07-10 and 2022-07-24), the IMDb datasets available from https://datasets.imdbws.com/ seem to have stopped including some movies. Download https://datasets.imdbws.com/title.basics.tsv.gz, for instance, and try to find the following IMDb IDs, which were there on July 10 but not on July 24:

tt0044502 Clash by Night (1952)
tt0047573 Them! (1954)
tt0048977 The Bad Seed (1956)
tt0050539 The Incredible Shrinking Man (1957)
tt0053290 Solomon and Sheba (1959)
tt0056700 The Wonderful World of the Brothers Grimm (1962)
tt0057449 The Raven (1963)
tt0060980 The Silencers (1966)
tt0065421 The AristoCats (1970)

The same IMDb IDs seem to have disappeared from https://datasets.imdbws.com/title.ratings.tsv.gz as well. I re-downloaded the files on July 25 and the same entries were still missing. What could explain this?

Note: This comment was created from a merged conversation originally titled IMDB Datasets no longer including some movies?
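For anyone who wants to repeat this check on their own download, here is a minimal sketch of one way to do it. It assumes only that the tconst is the first tab-separated column of the file, which is the case for title.basics.tsv.gz and title.ratings.tsv.gz. The function names `missing_ids` and `missing_ids_in_file` are made up for this example.

```python
import gzip


def missing_ids(lines, wanted):
    """Return the subset of `wanted` IDs that never appear as the first column."""
    remaining = set(wanted)
    for raw in lines:
        # The first tab-separated field is assumed to be the tconst.
        tconst = raw.split("\t", 1)[0]
        remaining.discard(tconst)
        if not remaining:
            break  # every wanted ID was found, no need to read further
    return remaining


def missing_ids_in_file(path, wanted):
    """Stream a gzipped TSV (hypothetical local path) without unpacking it first."""
    with gzip.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        return missing_ids(handle, wanted)
```

Calling `missing_ids_in_file("title.basics.tsv.gz", ["tt0044502", "tt0065421"])` on a local copy would return the set of those IDs that are absent from that copy.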

7 Messages

 • 

150 Points

The dataset is broken. It now includes only 3,477,496 titles, when it should have almost three times that number. The data is corrupted after the title "Kneeling for Justice: A San Francisco Memorial for George Floyd". The value in tconst for that record is "ial for George Floyd". Could someone at IMDb please correct this? Thank you!

7 Messages

 • 

150 Points

3 years ago

The source of the issue might be another record: see tconst = tt14491350. Its genres field contains the tconst of another record.
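A rough way to find records like this one is to scan the file for rows that do not have the expected shape. The sketch below is only an illustration under two assumptions: that title.basics.tsv has 9 tab-separated columns, and that every valid tconst is "tt" followed by digits. The names `find_malformed_rows` and `scan_file` are hypothetical.

```python
import gzip
import re

TCONST_RE = re.compile(r"^tt\d+$")
EXPECTED_COLUMNS = 9  # assumed column count for title.basics.tsv


def find_malformed_rows(lines, expected_columns=EXPECTED_COLUMNS, limit=10):
    """Return up to `limit` tuples (line_number, reason, raw_line) for suspicious rows."""
    problems = []
    for line_number, raw in enumerate(lines, start=1):
        if line_number == 1:
            continue  # skip the header row
        fields = raw.rstrip("\r\n").split("\t")
        if len(fields) != expected_columns:
            problems.append((line_number, f"{len(fields)} fields", raw))
        elif not TCONST_RE.match(fields[0]):
            problems.append((line_number, "bad tconst", raw))
        if len(problems) >= limit:
            break
    return problems


def scan_file(path, limit=10):
    """Scan a local gzipped copy of the file (path is whatever you downloaded)."""
    # newline="\n" keeps a stray carriage return inside a field from splitting a row.
    with gzip.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        return find_malformed_rows(handle, limit=limit)
```

The rows are split on tabs by hand rather than with a quote-aware CSV reader. A quote-aware reader can merge several lines into one field when a title contains an unbalanced double quote, which produces shifted values that look similar to the ones reported here. Nothing in this thread confirms that this was the cause.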

Employee

 • 

18.2K Messages

 • 

320.5K Points

3 years ago

Hi

Employee

 • 

18.2K Messages

 • 

320.5K Points

3 years ago

Hi All - I'm just following up here to confirm that the issue with the 'title.basics.tsv.gz' dataset should now be resolved and the titles should now be included. Cheers!

14 Messages

 • 

330 Points

@Michelle I'm replying about a smaller issue that still exists with the IMDb datasets available from https://datasets.imdbws.com/. Several titles remain missing from various files there. Specifically:

tt8084176 -- "Mr. Robot"; Season 4, Episode 7; "407 Proxy Authentication Required" is available via the web UI, but is not present in any of title.basics.tsv, title.ratings.tsv, title.crew.tsv, or title.episode.tsv.

tt0562856, tt0811802, tt0811803, tt0811804, tt0562972, and tt0811753 -- "Doctor Who (1963)"; Season 19, Episodes 9-14 are available via the web UI, but are not present in any of title.basics.tsv, title.ratings.tsv, title.crew.tsv, or title.episode.tsv.

On Sat, Jul 23, 2022, I mentioned a longer list of titles that are/were missing in a post titled "Some tconst titles missing from titles.basic.tsv". Thanks!

Employee

 • 

18.2K Messages

 • 

320.5K Points

Hi

23 Messages

 • 

450 Points

3 years ago

Here is some additional information that could help find the source of the problem, which still persists.

The IDs missing from the TSV files can be as old as the 1923 movie "The Hunchback of Notre Dame" (tt0014142) or François Truffaut's 1970 film "The Wild Child" (tt0064285), or as recent as "Feast" (tt13097910) and "Sinkhole" (tt21953638), both released in 2021. There are also some TV shows in the list (e.g. "Norman", tt4191702).

What is peculiar is that some IDs can be found in title.akas and title.principals, or even only in name.basics, without appearing in the main title.basics file.

After cross-checking what appears to be the main file, title.basics, against the 4 "sub-files" (name.basics, title.akas, title.crew and title.principals), there seem to be 8712 inconsistencies (mismatches), organized in 2 categories:

- 2670 IMDb IDs (tconst) correspond to an existing movie/TV show/video/... on the website: 2130 of them land on a regular page (HTTP 200) and 540 are 302 redirects to another ID.
- 6042 do not exist on the imdb.com website (Not Found / HTTP 404).

Here are some examples of the IDs found (top/bottom five for each group):

- IDs absent from title.basics that correspond to an existing movie/series
tt0012182 tt0012852 tt0012937 tt0013743 tt0014142 ... tt21953306 tt21953412 tt21953604 tt21953610 tt21953638
total count = 2130

- IDs absent from title.basics that correspond to an existing movie/series after being redirected (302)
tt0014327 -> tt0014325
tt0047941 -> tt0047940
tt0059860 -> tt0059845
tt0088641 -> tt0085111
tt0103358 -> tt0103357
...
tt21312196 -> tt14604694
tt21336160 -> tt7227442
tt21931516 -> tt21905038
tt21943026 -> tt21926422
tt21944362 -> tt14794336
total count = 540

- IDs found in one of the 4 "sub-files" but not in title.basics, and which return a 404 error on the imdb.com website
tt0021006 tt0021453 tt0023019 tt0024677 tt0036165 ... tt20412466 tt20877586 tt21327936 tt21809398 tt21952054
total count = 6042

I can provide the full list if need be.
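A cross-check of this kind can be sketched as a set difference between the IDs in title.basics and the IDs referenced by each sub-file. This is not necessarily how the counts above were produced; it is a minimal illustration. It assumes the header names tconst (title.basics, title.crew, title.principals), titleId (title.akas) and knownForTitles (name.basics, comma-separated), and that `\N` marks an empty value. The function names are hypothetical, and the HTTP status checks against imdb.com are left out.

```python
import gzip


def read_ids(lines, column="tconst"):
    """Collect the title IDs found in `column`; the first line must be the header."""
    iterator = iter(lines)
    header = next(iterator).rstrip("\r\n").split("\t")
    index = header.index(column)
    ids = set()
    for raw in iterator:
        fields = raw.rstrip("\r\n").split("\t")
        if len(fields) <= index:
            continue  # short or damaged row, nothing to read
        # Splitting on commas also covers single-valued columns.
        for value in fields[index].split(","):
            if value and value != "\\N":
                ids.add(value)
    return ids


def read_ids_from_file(path, column="tconst"):
    """Same as read_ids, for a local gzipped copy of a dataset file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        return read_ids(handle, column)


def orphans(basics_ids, sub_file_ids):
    """Map each ID missing from title.basics to the sorted sub-file names that mention it."""
    result = {}
    for name, ids in sub_file_ids.items():
        for tconst in ids - basics_ids:
            result.setdefault(tconst, []).append(name)
    return {tconst: sorted(names) for tconst, names in result.items()}
```

Holding every ID in memory is the simplest approach, and it costs a fair amount of RAM on the full files. Loading the files into a database would be an alternative if that becomes a problem.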

(edited)

Employee

 • 

18.2K Messages

 • 

320.5K Points

Hi

Employee

 • 

5.6K Messages

 • 

58.9K Points

3 years ago

Hello everyone- This has now been solved. Cheers!

1 Message

 • 

60 Points

1 year ago

I think the problem has reappeared: title.basics.tsv.gz downloaded July 9, 2024, does not seem to include Oppenheimer (tt15398776), for example, or other major movies.

23 Messages

 • 

450 Points

As of today (10 July 2024, with 4 files still dated 2024-07-09), the problem seems to be fixed: title.basics.tsv has more than 109M lines. But there are still some inconsistencies between the files, with tconst values that are missing from title.basics (see this other ticket).