gromit82's profile
Champion

 • 

7.5K Messages

 • 

277K Points

Wednesday, April 20th, 2022 3:27 AM

20

Unicode - 25-year update

IMDb used to have a newsletter, and the mid-April 1997 issue contained an item titled "The Great ISO Swap" reporting that IMDb had implemented the ISO 8859-1 (also known as ISO Latin-1) character set, allowing names and titles to use all the common letters with diacritical marks of the major Western European languages (such as å, ç, é, ï, ñ, ô, and ù).

http://web.archive.org/web/20060101140203/http://www.imdb.com/Newsletter/newsletter-13#iso

Near the end of the item the following statement appeared:

Ideally all data should be presented using its native character sets/ pictograms. Technically this is not possible though with current widespread software for web access, e-mail and operating systems in general.

In the future there will be a new huge standardized 16 bit character set called Unicode. It will offer the capability to freely combine Japanese Kanji with ISO 1 text and Hindi, for example. We will use it as it becomes widely available and supported by the industry.

I note that some additional character sets have been made available for the Alternate Titles section over the last few years (among them Greek, Chinese, Japanese, Korean, and Cyrillic), and I personally am not that affected by the lack of full Unicode support. However, I know that some contributors here would like to see further progress made in terms of implementing Unicode, so I am bringing this up to mark the 25 years since IMDb announced plans to implement it.

8 Messages

 • 

184 Points

3 years ago

While the entire world is moving away from proprietary fonts towards Unicode, IMDb still strictly bans the use of Unicode characters.

What is the reason? How does the use of Unicode characters harm the website?

When will Unicode characters be allowed on the site?

Thanks.

Note: This comment was created from a merged conversation originally titled When will Unicode character be allowed?

1 Message

 • 

60 Points

3 years ago

A good example of the importance of full Unicode support:

The Romanian film "Față în Față" ("Face to Face") becomes "Fata în Fata" on IMDb - "The Girl in the Girl"! This is also how "Face/Off" is known in Romania, apparently...
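A minimal Python sketch of how this kind of loss can happen. It assumes (this is an assumption, not IMDb's documented behaviour) that characters outside ISO 8859-1 are folded to their base letters while Latin-1 characters such as î are kept. The helper name `fold_to_latin1` is hypothetical.

```python
import unicodedata


def fold_to_latin1(text: str) -> str:
    """Replace characters outside ISO 8859-1 with their base letter.

    Hypothetical illustration only: Latin-1 characters pass through,
    anything else loses its diacritic, or becomes "?" if no Latin-1
    base letter can be derived.
    """
    out = []
    # Compose first so that each accented letter is a single character.
    for ch in unicodedata.normalize("NFC", text):
        if ord(ch) <= 0xFF:
            out.append(ch)
            continue
        # Decompose into base letter + combining marks, then drop the marks.
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        if base and all(ord(c) <= 0xFF for c in base):
            out.append(base)
        else:
            out.append("?")
    return "".join(out)
```

Under that assumption, "Față în Față" folds to "Fata în Fata": ț and ă lose their marks, while î survives because it is part of Latin-1.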

8 Messages

 • 

184 Points

3 years ago

Without Unicode we cannot even use the Greek alphabet, which astronomy relies on heavily. Most stars have a Greek letter in their names, so without Unicode we cannot refer to a star properly and have to write out the full name of that Greek letter. There are many astronomy-related movies that would need to mention Greek letters.

4 Messages

 • 

84 Points

2 months ago

It would be great if imdb expanded the character set for writing movie and person names. My name contains the character "š", but on the name page I have "s" instead of this character. In order for my name to be spelled correctly, the character set would have to be extended by at least "Latin Extended-A". I guess I'm not the only one who has a garbled name due to this lack.

Note: This comment was created from a merged conversation originally titled Character set extension

Champion

 • 

7.5K Messages

 • 

277K Points

Tomaš: While IMDb has been planning for full implementation of Unicode, and they have added certain character sets for certain purposes, they first announced their intent to implement Unicode 26 years ago (see here). So I can't predict when the "š" in your name will be available for use.

Champion

 • 

3K Messages

 • 

72.5K Points

IMDb now supports these characters in the attribute field, so you can enter the attribute "as Tomaš Wenigr". You can see this with some of the newer productions like The Ark.

4 Messages

 • 

84 Points

@adrian I know about this possibility, but I still find it a shame that some names are changed due to the outdated system. In addition, Google has generated a knowledge panel about me that pulls information from this page, and it shows the wrong name.

Champion

 • 

7.5K Messages

 • 

277K Points

dhtbrowne: The problem is that vowels with macrons are not supported by the main character set used by IMDb, ISO-Latin-1.

All the vowels with macrons will be available once IMDb has fully implemented Unicode. However, that is probably still quite a while away; it has been under contemplation since 1997. (See https://web.archive.org/web/20060101140203/http://www.imdb.com/Newsletter/newsletter-13#iso.)
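As a rough illustration of why macron vowels are rejected, here is a minimal Python sketch. It assumes the only test is whether the text fits in ISO 8859-1; IMDb's actual validation is not public, and both helper names are hypothetical.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character can be encoded in ISO 8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def unsupported_chars(text: str) -> list[str]:
    """List the characters (with code points) that fall outside Latin-1."""
    return [f"{ch} (U+{ord(ch):04X})" for ch in text if ord(ch) > 0xFF]
```

Letters such as é, ñ and ô pass the check, while for "wētā" the second helper reports ē (U+0113) and ā (U+0101).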

35 Messages

 • 

694 Points

Hello,

My name is Adrian Țofei, but because the IMDb platform doesn't accept the accented character "Ț", I could only be listed as Adrian Tofei. Please add that character so that my name can be displayed correctly on IMDb, as written on my movie's credits, website, social media and everywhere else. 

You can find more info on Wikipedia about the Romanian letter Ț. There are many other Romanian actors on IMDb whose names are not correctly spelled because of this issue. 

Thanks a lot!
Note: This comment was created from a merged conversation originally titled Please add the accented letter Ț

Filmmaker/Actor

3 Messages

 • 

90 Points

In the Romanian language, we have the A-breve letter "ă" (https://en.wikipedia.org/wiki/%C4%82), but, unfortunately, IMDB does not allow it:

The Unicode character at code point 259 [ă] is not supported.

Since we should be able to use this common letter for Romanian titles, please add support for it.

Note: This comment was created from a merged conversation originally titled Make the website inclusive for the Romanian language

3 Messages

 • 

70 Points

We're based in New Zealand and the Māori language uses macrons over letters to denote a longer vowel sound. At the moment, in the list of special characters, the closest thing I can find is the tilde (wiggly line above a letter), but that's not technically correct and it should be a macron. I would like to suggest that vowels with macrons be added to the special characters list. In Te Reo Māori (the Māori language) a macron can change the meaning of a word. For example, from the film industry: the famous Wētā Workshop should be spelt with macrons over the e and the a, and is named after an insect. Without the macrons, weta means excrement. Quite the different meaning from two little lines!

Note: This comment was created from a merged conversation originally titled Macrons to be added to special characters

2 Messages

 • 

70 Points

2 months ago

It isn't fair that other languages' alphabet characters are accepted for the Alternative Titles but not Latvian ones. These are the Latvian alphabet characters that weren't accepted for the Alternative Titles: Ā, Č, Ē, Ģ, Ī, Ķ, Ļ, Ņ, Š, Ū, Ž, ā, č, ē, ģ, ī, ķ, ļ, š, ū, ž. IMDb, please accept all Latvian alphabet characters for the Alternative Titles.
Note: This comment was created from a merged conversation originally titled Accept all Latvian Alphabet characters for the Alternative Titles

Champion

 • 

1.9K Messages

 • 

92.6K Points

Unfortunately this is not a simple change.

The problem is that IMDb was begun very early in the internet period, when 8-bit data was fairly new. IMDb was designed to support only the Latin-1 code page, which does not include all the characters required for non-Western European languages.

Because the current update system has become very complex, simply switching to a multi-byte encoding of Unicode (like UTF-8) was considered too likely to cause problems. Some of the high-level extended ASCII characters were (are) used as control characters, and it is possible that a Unicode character could break the update routines. Moving to Unicode would require a complete review of the entire software.
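To make the risk concrete, here is a small Python sketch. The reserved byte 0xA7 and both helper functions are hypothetical and chosen only for illustration; which bytes IMDb actually uses as controls is not stated here.

```python
# Hypothetical: suppose a legacy routine reserves byte 0xA7 as a field separator.
CONTROL_BYTE = b"\xa7"


def split_fields(record: bytes) -> list[bytes]:
    """Split a raw record on the reserved control byte."""
    return record.split(CONTROL_BYTE)


def make_record(name: str, role: str, encoding: str) -> bytes:
    """Encode two text fields and join them with the control byte."""
    return name.encode(encoding) + CONTROL_BYTE + role.encode(encoding)


# A Latin-1 name that avoids the reserved byte splits into two fields.
latin1_fields = split_fields(make_record("Zo\u00eb", "Actor", "latin-1"))

# In UTF-8, U+0127 (h with stroke) is encoded as the bytes C4 A7, so its
# second byte collides with the separator and the record splits in three.
utf8_fields = split_fields(make_record("Na\u0127la", "Actor", "utf-8"))
```

The same routine that works for every Latin-1 record silently cuts a UTF-8 name in half, which is why each byte-level routine would have to be reviewed before a switch.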

As you may know, IMDb is currently rewriting the update system. I assume that they are ensuring that it will be able to support Unicode characters. Therefore I doubt that there will be any change to the titles until after that change is complete.

2 Messages

 • 

70 Points

2 months ago

I am working on a multicultural project that features characters fluent in foreign languages who speak those languages in the TV series. I feel that this, together with the diversity of authority figures in the show, makes it a concept with a high likelihood of international popularity. I tried to list the alternative titles but was unable to, as only the Western and Cyrillic alphabets are supported on IMDb. I also think that allowing talent to use their native languages for their names or AKA blurbs would help increase worldwide use of the site. I have had difficulties casting Asians fluent in their native languages, I suspect because those who don't speak fluent English don't use IMDb, which makes casting for Asian roles where fluent English is not necessary difficult. I think IMDb should be more true to International Movie Database and support international alphabets. What do you think?
Note: This comment was created from a merged conversation originally titled International Text. I think all text types should be supported.

4 Messages

 • 

150 Points

2 months ago

Since 2014, the messageboards have supported the full Unicode set, so it obviously isn't that hard. Why doesn't the main site support Unicode? Right now it's still mired in a Western European-centric interface, where all names have to be transliterated to be posted, despite the transliterations being debatable and unofficial. The submission form even recognizes individual Unicode symbols, but specifically disallows them!
Note: This comment was created from a merged conversation originally titled When will IMDB support full Unicode?

1.8K Messages

 • 

55.3K Points

Admins, please include the votes and merge this thread into:
https://getsatisfaction.com/imdb/topics/support_for_unicode

12 Messages

 • 

370 Points

2 months ago

Unicode is not fully supported in IMDb. For example, in Polish: you could change all references by searching “milosc” and then changing them to “miłość”. And Jiří Hnídek is written without an r-hacek in their first name. The same applies to the ILM person Coşku Özdemır, a Turkish person listed on Cinefex.
Note: This comment was created from a merged conversation originally titled Support for Unicode.

2 Messages

 • 

100 Points

I agree. This affects titles, names, characters, discussion, and probably more.

Where I run into the problem most is in the discussion forums. If you paste non-ASCII characters copied from somewhere else (for example, to show a symbol that was in the film, or indeed to show the native-language title of the film), they just get turned into what appears to be the HTML text code for those characters, instead of the symbol itself.

It's 2013. This shouldn't be happening.

Champion

 • 

19.6K Messages

 • 

478.6K Points

Since this is the International Movie Data Base, it is truly surprising that they do not support unicode.

12 Messages

 • 

370 Points

Since MobyGames supports Unicode, macrons in Japanese are OK for long vowels. Note that the title ends with a punctuation mark (full stop). Hungarian, Czech, Polish, Romanian, Slovak etc. requires a bunch of accented letters.

Champion

 • 

4.6K Messages

 • 

236.3K Points

Except that it's "Internet Movie Database," not "International." ;)

Champion

 • 

19.6K Messages

 • 

478.6K Points

This must be a Freudian slip. [wink]
Reminder to self: Don't post when tired.

Champion

 • 

4.6K Messages

 • 

236.3K Points

LOL. To be honest, though, it's almost like they want it to be known that way. Most mentions of the spelled-out name are gone. Kind of like Kentucky Fried Chicken is only KFC now. New visitors seem to be having a hard time figuring out what the site is...video streaming, file sharing?

Champion

 • 

1.9K Messages

 • 

146.1K Points

Or after taking some random prescription medication you found lying in the meep.

12 Messages

 • 

370 Points

Yeah. Do not post nonsense. You have accidentally removed a comment (you need to dispute this remove).

Champion

 • 

19.6K Messages

 • 

478.6K Points

It is almost like Randall Munroe has been reading this forum.
http://xkcd.com/1209/

12 Messages

 • 

370 Points

You quote the comic: “The Skywriter we hired has terrible Unicode support.”

After correcting Miroslav Kure's surname to Miroslav Kuře (to match Czech support: the Danish/Faroese/Norwegian ø is rcaron) in Battle for Wesnoth 1.11.1 contribution community, you have many problems with the Internet Archive Wayback Machine this time. First the connection is too slow to load and you get the error message “The machine that serves this file is down. We're working on it.” twice. Unicode in their own forum affects subjects (titles) and more. Note that the thread has nonsense!

Champion

 • 

1.9K Messages

 • 

92.6K Points

This has been mentioned many times over the past few years. A bit of history may help here.

When IMDb first started, it was updated by an automated email system. This was at a time when some of the email routers still only handled 7-bit ASCII and special encoding was needed to ensure that 8-bit codes would not be trashed. Moreover, some characters (e.g. | the 'pipe') were used internally (and in the email) as controls/delimiters. This is why you may sometimes see older contributors indicate a credit update as:

John Doe | 2nd Pirate | 22

By the time Unicode became standard, the system had grown quite complex. Before Unicode can be implemented, every part of the system needs to be checked and potentially modified to ensure that it will not be broken by any of the Unicode codes.
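For readers unfamiliar with that style of record, a minimal Python sketch of reading one follows. It assumes the usual safe pattern of decoding the bytes to text first and only then splitting on the delimiter; the function name `parse_credit` is hypothetical and the meaning of each field is not specified here.

```python
def parse_credit(raw: bytes, encoding: str = "utf-8") -> list[str]:
    """Decode a pipe-delimited credit line, then split it into fields.

    Decoding before splitting means the delimiter is matched against
    characters rather than raw bytes, so the chosen encoding cannot
    introduce a stray delimiter inside a name.
    """
    text = raw.decode(encoding)
    return [field.strip() for field in text.split("|")]
```

With this ordering, the example line above yields the three fields "John Doe", "2nd Pirate" and "22", and a name such as Jiří Hnídek comes through intact whichever Unicode encoding carried it.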

IMDb is currently in the process of moving the various lists (sections) to new internal systems. I hope and expect that they are designing these systems so that they will be able to support Unicode.

Once the moves have been completed, we may see support for Unicode, but don't expect it any time soon.

12 Messages

 • 

370 Points

You will need an answer. You have removed the first reply by accident. Where a name includes a suffix, we use a comma to separate it from the name. On game credits and indexes it is not treated as an integral part of the surname. Examples are:

Hernandez, Jonathan, Jr
Rowe, William A., Jr.
Tibbetts, Richard S., III

It thinks that the Get Satisfaction software uses Unicode. It supports different accented characters for Eastern European languages.

Champion

 • 

4.6K Messages

 • 

236.3K Points

The change log says you removed it...??? What the..??

12 Messages

 • 

370 Points

This reply was removed on 2013-03-25.

Champion

 • 

4.6K Messages

 • 

236.3K Points

Yep. And:


3 months ago
taewong, the poster:
Removed a reply in this topic
Reason: removed by the poster

2 Messages

 • 

60 Points

Actually, it seems that after the message-board makeover, Unicode support is even worse! At least with the old ones you could enter most extended ASCII glyphs (assuming proper code-page is set). But now anything that is above 127 doesn’t work.

2 Messages

 • 

82 Points

It's year 2014 and some Czech characters are still not supported.

12 Messages

 • 

160 Points

It's almost 2015 and Greek characters aren't supported AT ALL.

9 Messages

 • 

116 Points

This reply was created from a merged topic originally titled
How many years will it take you to understand UNICODE?.


In 2009, in Contact #3034383 (http://www.imdb.com/helpdesk/thread?tid=3034383), you, the owner of IMDB, promised professional usage of UNICODE "in a little while". It is now 5 years and a half later and your web site is still crippled with no UNICODE implementation. 5 years and a half??? Don't you fill embarrassed with your "professionalism"? Shall we wait another 5 years for IMDB to understand the word "international"?

(This post is addressed solely and specifically to IMDb staff.)

Employee

 • 

18 Messages

 • 

2.9K Points

We are making slow and steady progress on Unicode support.  Note that until every single part of a system supports Unicode, none of it works.  We have a lot of critical backend systems that need to be migrated.  Unfortunately, we don't have a timetable that we can share, but please be aware that we are working on it.

Note that in the last few weeks we've enabled full Unicode support in the message boards:

http://www.imdb.com/board/bd0000043/nest/235469052

We had a number of encoding issues that I believe we have fixed.

Employee

 • 

18 Messages

 • 

2.9K Points

Note that user reviews:

http://www.imdb.com/user/ur2278015/

...and lists:

http://www.imdb.com/list/ls001825868/

...also support Unicode.

12 Messages

 • 

160 Points

Yes, but no movie display titles...

Employee

 • 

18 Messages

 • 

2.9K Points

There is already limited support for this; see the Greek title here:

http://www.imdb.com/title/tt0015648/releaseinfo#akas

Our systems currently use a mixture of ISO-8859-1, UTF-8, and KOI8-R.  Untangling this mess while keeping things running is like changing the fan belt on an engine without switching it off.
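A short Python sketch of what goes wrong when those three encodings are mixed. The helper names `misread` and `repair` are hypothetical; the codec names are the ones Python's standard library provides.

```python
def misread(text: str, stored_as: str, read_as: str) -> str:
    """Encode text one way and decode it another, as a mismatched system would."""
    return text.encode(stored_as).decode(read_as, errors="replace")


def repair(garbled: str, read_as: str, stored_as: str) -> str:
    """Undo a misread, which is only possible when both encodings are known."""
    return garbled.encode(read_as).decode(stored_as)


# UTF-8 bytes for "é" (C3 A9) read as ISO-8859-1 become two characters.
utf8_as_latin1 = misread("\u00e9", "utf-8", "latin-1")

# A Cyrillic word stored as KOI8-R and read as ISO-8859-1 turns into
# unrelated accented Latin letters.
cyrillic = "\u041a\u0438\u043d\u043e"
koi8_as_latin1 = misread(cyrillic, "koi8_r", "latin-1")
```

The garbling is reversible only when you know exactly which encoding each piece of data was written in, which is the hard part of untangling a live system.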

12 Messages

 • 

160 Points

I tried to add a title to a movie but the system didn't let me. It gave an error for every letter I entered.

Employee

 • 

18 Messages

 • 

2.9K Points

Yup.  The submissions pipeline doesn't yet handle Unicode.

12 Messages

 • 

160 Points

So, the movie titles written with Greek characters are made by the people inside?

Is there a timeline when I will be able to contribute Greek titles?

Employee

 • 

18 Messages

 • 

2.9K Points

Yes, there were some cases added manually years ago.

We don't have a timeline yet, but we know people really want it.

2 Messages

 • 

64 Points

3 years have passed and IMDB is still mentally in the pre-Unicode 1990s.

If you don't want to fix your database for Unicode support, then just write parsers and translate user input to HTML codes.
Moreover, some HTML codes are not supported, e.g. ń

NB. It is not possible to have a title with a non-basic-Latin character. Even if I fix a movie and input an HTML code, the form will change it to Unicode on the fly and report a problem (!!)
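For what it's worth, the translation suggested here is short to write. The following Python sketch uses the standard `xmlcharrefreplace` error handler; the function name `to_latin1_with_refs` is hypothetical, and nothing here reflects how IMDb's forms actually work.

```python
import html


def to_latin1_with_refs(text: str) -> str:
    """Keep Latin-1 characters and turn everything else into &#NNN; references."""
    return text.encode("latin-1", errors="xmlcharrefreplace").decode("latin-1")


def from_refs(stored: str) -> str:
    """Turn character references back into the characters they stand for."""
    return html.unescape(stored)
```

With this, "Cześć" would be stored as "Cze&#347;&#263;" and restored on display, although the stored form is then no longer the same string as the title itself.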

2.8K Messages

 • 

84.1K Points

The last update on this (at least in this thread) was two years ago, so can a staffer tell us what has happened these past two years regarding this issue?
(I note that in the message boards on IMDb, one could see exactly when a post was made, here I can only see that Murray responded two years ago, not very specific).

Champion

 • 

7.5K Messages

 • 

277K Points

Marco: In response to your latter comment, you can see the exact time of a post here, at least on the desktop version of GetSatisfaction. To do that, hover your mouse over the time designation of the post (such as "2 years ago"). So, for example, Murray's post that begins "Yes, there were some cases added manually years ago" was posted October 9, 2014 at 10:46:58 PM UTC.

I don't know whether or how it is possible to see the exact date and time on the mobile version of GetSatisfaction.

Employee

 • 

18 Messages

 • 

2.9K Points

Checking in to say that we're still working on it, but at this point can't commit to a timeline.

2.8K Messages

 • 

84.1K Points

Thanks Gromit!
Is there also a way, which I haven't found, to reply to your post instead of to my own?

2.8K Messages

 • 

84.1K Points

Thanks for letting us know you're still working on it.

9 Messages

 • 

116 Points

Come on! If you haven't done much in 7 years, the timeline is clear: for ever! :)

9 Messages

 • 

116 Points

Correction: "don't you FEEL"

461 Messages

 • 

14.6K Points

SGML/HTML/XML character references are no more useful in solving the underlying problem of representing and processing the full range of Unicode than any of a number of other encodings. They make sense if the data is represented and processed in XML - perhaps using technology such as XSLT - but even then they would appear only in externalised forms emitted as output or accepted as input. Since XML is, for preference, represented in UTF-8 in externalised forms, using character references does not give much benefit.

Using SGML character references in internal representations would cause all sorts of problems, especially with searching and matching.

10.7K Messages

 • 

225.5K Points

I see that the IMDb staff has left this proposal in an "under consideration" state. Very interesting.

I shall opine that it is not so challenging for search algorithms to be made to account for strings encoded with standard character entity references, and it would be a shame if most of the libraries and engines behind most search tools deployed in any electronic database anywhere throughout the World-Wide Web lacked such a capability. But likewise, the same could be said of Unicode deployment, or that of Internet Protocol v6 for that matter.

461 Messages

 • 

14.6K Points

According to https://en.wikipedia.org/wiki/SGML_entity#Character_entities - and I have no reason to doubt its accuracy -
HTML 4, for example, has 252 built-in character entities that don't have to be explicitly declared. XML has five. XHTML has the same five as XML, but if its DTDs are explicitly used, then it has 253 (&apos; being the extra entity beyond those in HTML 4).
This calls into doubt the concept of "standard character entity references" and also makes it clear that the sets of character entities that can be considered to be in common use do not cover the range of Unicode codepoints.

If we allow numeric character references - both decimal and hexadecimal - then each Unicode codepoint in the data can be represented in three or four ways in any system that can handle Unicode. The only rational way to deal with that complexity is to decode the data to strings of Unicode codepoints before applying normalisation and then using it in whatever processing is required. Having decoded the data to Unicode codepoints, the simplest and most widely supported encoding to use for any sort of I/O is UTF-8. Unless the data is being embedded in some SGML-like format such as XML, there is no reason to use character references and there is never a reason to use references for characters that do not have specific meanings in the markup if the underlying representation can support Unicode.
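The decode-then-normalise step can be sketched in a few lines of Python. This is a minimal illustration using the standard `html` and `unicodedata` modules; the function name `canonical` is hypothetical, and it assumes the input carries exactly one layer of character references.

```python
import html
import unicodedata


def canonical(text: str) -> str:
    """Decode one layer of character references, then apply NFC normalisation."""
    decoded = html.unescape(text)  # handles named, decimal and hexadecimal forms
    return unicodedata.normalize("NFC", decoded)


# Five spellings of the same letter: named, decimal and hexadecimal
# references, the precomposed character, and base letter + combining accent.
variants = ["&eacute;", "&#233;", "&#xE9;", "\u00e9", "e\u0301"]
unique_forms = {canonical(v) for v in variants}
```

All five spellings collapse to the single code point U+00E9, so matching and searching can work on one form. Note that only one layer is decoded: "&amp;lt;" becomes "&lt;", not "<", which is why tracking how many layers were applied matters.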

The most fundamental requirement in handling character encodings is to be obsessively strict in tracking how each piece of data is encoded. In general, data may have multiple layers of encodings and it is essential to keep track of which have been applied to each piece of data. Each additional kind of encoding adds complexity, especially if it can be layered on other encodings, so the goal should always be to use as few encodings as possible.

I expect that some filmmaker will want to capture the essence of the World Wide Web and will decide to use a title such as "Markup: < &lt; & &amp; changed the world" and whatever encodings are used by IMDb had better be able to cope with that. (and I hope that this forum can handle it too!).

5 Messages

 • 

358 Points

How is this still a thing.

5 Messages

 • 

248 Points

When the time comes, please don't forget Georgian language characters to be in the supported characters list.

Employee

 • 

18 Messages

 • 

2.9K Points

There's a Unicode block for Georgian characters, so they will be supported automatically. Whether or not the characters display properly in browsers will depend on whether people have a font installed locally.... but presumably those who are interested in Georgian characters will!

5 Messages

 • 

204 Points

When will Unicode be fully supported in text fields in IMDB? If this website is really Internet Movie DataBase, it's supposed to support non-English languages, and how come in 2019 your website doesn't support unicode, it's a shame.

3 Messages

 • 

132 Points

Adding to this from 2020 and Quarantine Land: Discovered this after trying to correct Abed's Polish from S01E08 of Community from "Czesc" to "Cześć". It's 2020. If your site is supposed to be international, it really should support Unicode. Although I'll at least grant that it's *uniform* in not allowing non-ASCII, as opposed to the weird trend elsewhere on the internet of only bothering with diacritics if it's a Western European language.

172 Messages

 • 

5.1K Points

Justin Eberlein, IMDb started allowing Unicode in 2019, but it's only allowed in 'Alternative titles' field, not in the 'Original title' field.

9 Messages

 • 

116 Points

They are really dumb! :D I don't think there's another web site with this huge flaw. There is no other web site struggling so hard to get international! :D

80 Messages

 • 

1.5K Points

The main reason IMDb doesn't support Unicode throughout is that it was founded in 1990. Changing that isn't a case of pressing a button and it works, as there is so much data entered in the current system, which is built with the presumption of the non-existence of Unicode, and new data coming in all the time. If you want a database of movies and TV that supports Unicode throughout, you can either complain here (to no effect whatsoever), or use one founded in 2008.

2 Messages

 • 

60 Points

Any news on unicode support? It's 2022 and it's absolutely shameful that IMDB doesn't support unicode characters. So many foreign names on here are misspelt because of it. Unicode characters aren't just aesthetic quirks - they change the meaning of words.

11 Messages

 • 

290 Points

2 months ago

Dear IMDb.

I am writing to share some observations and suggestions regarding the handling of names on your platform. I have noticed that there are some challenges with the correct representations of names, especially when it comes to diacritical marks and non-Latin writing systems.

For example, if one searches for a person named "Lasse Kvelnes", they are directed to a profile named Lasse Kvalsnes, and his real name is only an alternative name, but in reality he doesn't seem to have an alternative name. Maybe it would be a good idea to only use "alternative names" for nicknames or abbreviated names. Otherwise this can be confusing and potentially lead to errors. "Zdena Pelikanova" is a twin profile to "Zdenka Pelikanova" using one of her nicknames, and according to Czech orthography the name should be written "Zdeňka Pelikánová". The profile "Zdenka Pelikonova" would represent a pronunciation error if it could speak and present itself. This highlights the need for IMDb to upgrade the system to accept the full range of Unicode characters. Moreover, it is difficult for users who are searching for individuals with names written in non-Latin scripts such as Hangul.

I understand that there may be technical limitations that make it difficult to implement full Unicode support. However, I believe that this could be a valuable improvement to your platform, making it more inclusive and user-friendly for an international audience.

I look forward to hearing your thoughts on this suggestion.

Best regards Maatamun

Note: This comment was created from a merged conversation originally titled Handling of names on your platform

461 Messages

 • 

14.6K Points

See also https://community-imdb.sprinklr.com/conversations/data-issues-policy-discussions/support-for-unicode/5f4a7a4c8815453dbaa0d1c9 - 54 votes for it when I looked right now.

There have been various posts on better support for Unicode and I think it might be better to consolidate the support into many votes on one idea rather than spreading them across many.

461 Messages

 • 

14.6K Points

The thread I cited has been merged into this one and it looks as if that does not carry over the votes. This thread has 17 votes as I am posting this, so we have lost at least 37 votes unless people who had voted for it had removed their votes.

This is unfortunate if the number of votes is being used as a measure of support for an idea.

2.8K Messages

 • 

84.1K Points

@owenrees I seem to recall the issue of votes not being carried over has been raised before but nothing has been done about it, but I can't find the thread about it...

24 Messages

 • 

724 Points

1 month ago

Suggestion: The title can support the two tone letters ā and ē.

Note: This comment was created from a merged conversation originally titled 2024-12-06 Suggestion

85 Messages

 • 

1.4K Points

27 days ago

I'd like to use the following accented character in a submission: š but it's not on your list of approved accents - can a solution be found?

Note: This comment was created from a merged conversation originally titled Accented character

Employee

 • 

1.9K Messages

 • 

20.7K Points

Hi MartinK75-

Thank you for reporting this issue. I've forwarded this information to the appropriate team for further review (Ref Ticket #V1621449887). We'll reply once we receive further information.

Cheers!

Employee

 • 

17.6K Messages

 • 

314.9K Points

Hi @MartinK75​ -

Unfortunately, at this time IMDb only accepts non-Latin-1 characters in a limited set of places, specifically in title AKAs.