11 comments

  • simonw 1 day ago
    I'm surprised this story didn't mention the scandal with Scots Wikipedia: https://www.theguardian.com/uk-news/2020/aug/26/shock-an-aw-...

    > an American teenager – who does not speak Scots, the language of Robert Burns – has been revealed as responsible for almost half of the entries on the Scots language version of Wikipedia

    It wasn't malicious either, it was someone who started editing Wikipedia at 12 and naively failed to recognise the damage they were doing.

    • debtta 21 hours ago
      The background here is that Scots is not really a language. Try asking a Glasgow taxi driver who addresses you in 'Scots' whether he knows any English. Robert Burns wrote in English, with some of his spelling reflecting pronunciation in the Scottish English dialect.

      The people who want it to be considered as a language for political reasons cannot be bothered to translate Wikipedia themselves. They read and edit English Wikipedia and understand it perfectly.

      • mikemcquaid 19 hours ago
        Sort of?

        The Glaswegian taxi driver may not consider themself to be speaking a different language but, if speaking to another local and leaving aside pronunciation, they’d use words, phrases and even grammar that’s incomprehensible to someone with no experience with Scots.

        I’m a “posh Scot”, raised middle class in Edinburgh so my accent is minimal and thickens up or softens depending on who I’m speaking to. Even for me, there’s a lot of words, phrases and ways of speaking I’ve had to adjust to be consistently understood by American coworkers over the last 10+ years.

        • Xss3 12 hours ago
          Brits do the same. At best it is a dialect at worst an accent. A lot of (most of) Scots is still English but spoken with different grammar or unfamiliar phrases and unfamiliar pronunciation.

          Sort of like extreme cockney rhyming slang or for a more modern example thick BME* full of slang.

          * = British Multicultural English, think fam n blud, lots of Jamaican English influence plus South East Asian influence.

        • mig39 15 hours ago
          > The background here is that Scots is not really a language.

          This is supremely ignorant. Scots is its own language. It's a 'brother' or 'sister' of English, with both English and Scots being descendants of West Germanic languages.

          The fact that many (all?) Scots speakers also speak English doesn't mean Scots is not a language on its own.

          You could make your exact same arguments that Irish isn't a language because you could ask a Cork taxi driver whether he knows any English.

          Scots = a language with some of the same ancestors as English.

          Scottish English = a dialect (and accent) of English

          Scots Gaelic = another language, with the same ancestors as Irish and Manx.

          • debtta 7 hours ago
            Australians, Jamaicans, African Americans and English-speaking South Africans do not have their own Wikipedia, despite all these dialects having more legitimate demographic and linguistic claims to being languages than 'Scots'.

            James Joyce wrote in English; no Irish person pretends that he wrote in a third language distinct from English and Irish. The fact that they do not do so does not compromise the political basis for independence, republicanism or reunification.

            If a Cork taxi driver addressed you in Irish (very unlikely), and you asked him to speak English, the request would be both coherent and reasonable. The point you missed is that the Glasgow taxi driver would look at you with consternation and say "But I am speaking English! What's wrong with my English?" (insert dialect spelling if you like)

            Rabbie Burns wrote in the same language as his compatriots Louis Stevenson and Scott.

            It would be ignorant if I did not know about the meretricious claim of a minority of Scottish people to have their own language, but it is not ignorant to reject that claim. I am Scottish fwiw.

          • zozbot234 19 hours ago
            Scots is partially intelligible in written form to English speakers, but that does not make it the same language as English. You might as well say that Spanish and Portuguese are the same language.
            • debtta 7 hours ago
              You might as well say that US English and Canadian English are different languages.

              Geordie English is closer to Edinburgh 'Scots' than to RP English or US English or Indian English. Is it a dialect of Scots?

            • overfeed 10 hours ago
              What counts as a language is almost always determined by "political reasons" - as the witticism goes: "A language is a dialect with an army and navy."

              There exist dialects that are less mutually intelligible than apparently distinct languages, and the designation of each as "dialect" or "language" is political. Language is often a proxy for culture, and political actors may wish to suppress or boost the legitimacy of such cultural expression depending on their aims.

            • TZubiri 1 day ago
              The Cebuano wiki is a similar case, not spoken often, but it was a personal project of an editor that was mad at political articles and started making animal articles in the Cebuano wiki.

              The solution is to differentiate and tag inputs and outputs, such that outputs can't be fed back in as inputs recursively. Funnily enough, Wikipedia's sourcing policy does this perfectly: not only are sources the input and page content just an output, but page content is a tertiary source, and sources should by policy be secondary (and sometimes primary) sources, so the system is even protected against cross-tertiary-source pollution (say, an encyclopedia feeding off Wikipedia and vice versa).

              It is only when articles posing as secondary sources fail to cite wikipedia that a recursive quality loss can occur, see [[citogenesis]]

              • Many sources for Wikipedia articles refer to Wikipedia without citing it. Many journalists will work from Wikipedia, and most of Wikipedia's sources are journalistic articles. It happens to be that often this isn't noticed because the information obtained this way is true and uncontroversial. Citogenesis only documents examples where, by bad luck, the result is untrue information.
                • thaumasiotes 1 day ago
                  > It is only when articles posing as secondary sources fail to cite wikipedia that a recursive quality loss can occur

                  I've seen a college professor cite wikipedia in support of a false claim. On investigation, the text in wikipedia was cited to an earlier blog post by that same professor.

                  I wasn't convinced.

                • fooker 1 day ago
                  [flagged]
                  • AlienRobot 1 day ago
                    Yes, half of my entire political ideology is based on posts written by 12 year olds on the Internet. The other half is based on posts written by dogs[1].

                    1. https://en.wikipedia.org/wiki/On_the_Internet,_nobody_knows_...

                    • fooker 1 day ago
                      Yep, it's either dogs or r̸u̸s̸s̸i̸a̸n̸ chinese bots.
                      • aspenmayer 1 day ago
                        > Yep, it's either dogs or r̸u̸s̸s̸i̸a̸n̸ chinese bots.

                        Please consider users of screen readers and other assistive technologies, as your nonstandard usage of nonstandard characters makes parsing your comment difficult if not impossible. Not a slight or a correction, as I am a fan of Zalgo text myself, but after being informed by others about how inscrutable it can be to the differently abled, I have reconsidered using it.

                        • fooker 1 day ago
                          Didn't realize that, thanks!

                          I wonder if the future of screen reading applications is bypassing these issues + avoiding parsing weird websites by just doing AI driven OCR.

                          • a2128 1 day ago
                            TalkBack on Android seems to read it just fine, assumedly without needing any fancy AI or OCR
                            • aspenmayer 1 day ago
                              I used to use Zalgo text to make it harder to read my name when I use it, as I use my “real name” and didn’t want it to be scraped by bots, but some folks literally blocked me on social media after a bit of a spat that I’ll admit was caused by a misunderstanding on my part. Apparently, these kinds of characters’ interpretations are context-specific, and using them as a person as strike-throughs is readily apparent for some, but while the meaning is possible to be deduced by an AI, it shouldn’t be expected or assumed to be understood. My HN username in Zalgo text was taking over 30 seconds to read all of the diacritics per post on platforms I used it on, so I had to change my ways or admit that I didn’t care about the experience others had, the latter of which couldn’t be further from the truth.

                              AI has a hard time deriving how many r’s are in strawberry, so I won’t expect it to parse my text on behalf of others any time soon, though I don’t think you meant any harm. In the interest of respect for those who don’t have a choice in using tech to help them do what comes easily and naturally to me, I thought I’d pay forward the knowledge of how the world and our perceptions of it are as unique as every individual.

                      • Jfc, not everything is about that.
                        • fooker 1 day ago
                          I meant it as an example of "the road to hell is paved with good intentions" and "naively failed to recognise the damage they were doing".

                          But you do you.

                          • That's extremely tangential. Bringing hot-button political topics into unrelated threads flattens everything into political arguments and starves all other topics of oxygen.
                            • More specifically, it gets important HN discussions quickly flagged and dropped.
                    • > Wehr, who now teaches Greenlandic in Denmark, speculates that perhaps only one or two Greenlanders had ever contributed.

                      That's the core issue, it's not those who use AI translator or worst like Google translate. If there isn't any Greenlander to contribute to their Wikipedia, they don't deserve to have one and instead must rely on other languages.

                      The difference between an empty Wikipedia and one filled with translated articles that contain errors isn't much. They should instead close that version of Wikipedia until there are enough volunteers.

                      • yorwba 21 hours ago
                        "Only one or two" isn't zero. The problem isn't that a small community can only write a small Wikipedia, but that there's a global supply of fools who want to make every small Wikipedia bigger, even if they're completely unqualified to do so.

                        Wikipedia is built around the basic principle that if you just let everyone contribute, most contributions will be helpful and you can just revert the bad ones after the fact. This works for large communities that easily outnumber the global supply of fools, but below a certain size threshold, the sign flips and the average edit makes that version of Wikipedia worse rather than better.

                        So smaller communities probably need to flip the operating principle of Wikipedia on its head and limit new users to only creating drafts, on the assumption that most will be useless, and an admin can accept the good ones after the fact.

                        I'm not sure whether Wikipedia already has the software features necessary to operate it in such a closed-by-default manner.

                        • justsomehnguy 1 hour ago
                          > a small community can only write a small Wikipedia

                          For whom?

                        • tdeck 21 hours ago
                          It is worse. Imagine if you were trying to learn English from this phrasebook, written by someone who didn't speak English:

                          https://www.exclassics.com/espoke/espkpdf.pdf

                          Wikipedia is prominent. Wikipedia articles in a language without much representation become prime examples of that language to those who read them.

                        • Symbiote 1 day ago
                          The end of the article says they have closed it.
                          • consp 1 day ago
                            That last part creates a chicken and egg problem. You can argue about it but I will bet it will never get traction if there is no basis to start from.
                            • bawolff 1 day ago
                              Wikipedia has an "incubator" setup where people can start working on a language in the incubator until it demonstrates enough interest.
                            • tomlockwood 1 day ago
                              > they don't deserve to have one

                              By what unholy pact have you been beknighted as the bestower of wikis, my friend?

                              • tjwebbnorfolk 1 day ago
                                If the original authors stop maintaining an OSS project, and you are one of only a very few users, you have two options: do the work yourself, or watch it die. If you are unwilling to do the work yourself, then that's a signal it isn't important enough for anyone else to do the work either.

                                Why should a wiki be any different?

                              • mslt 1 day ago
                                Not the commenter, but in this instance it seems like if you want something you need to either be able to make/maintain it or fund someone who will, no?
                              • Mars008 1 day ago
                                > who use AI translator or worst like Google translate

                                It's the same. Google translate uses trained AI models.

                              • strogonoff 1 day ago
                                Wikipedia editors are among the many communities that have for a long time mostly successfully relied on the tendency of relatively superficial, easy to validate capabilities (such as being able to use a website, write something resembling real language, and handle basic communication) to correlate with more valuable but harder to validate qualities (such as ability to write reasonably well and follow rules/guidelines, and generally being a well-intentioned person) as one of their main barriers to entry. Owing to the deluge of commercial LLMs[0] available at such low prices that their operators lose millions to billions of dollars in order to gain market share and ultimately profit, such communities may not be able to continue to exist as is for long, I suspect: either they would be forced to institute more intrusive barriers (be that ID verification, invite-only memberships, or something else) while the deluge lasts, or they may be effectively destroyed when members who secretly lack the requisite qualities and act in bad faith become a majority, damage the community’s reputation, and drive out the existing members.

                                [0] Which paradoxically to a significant degree exist thanks to the unpaid work of volunteers in many of such communities.

                                • muldvarp 13 hours ago
                                  LLMs destroying Wikipedia would be incredibly sad and is one of the things that makes me think that LLMs will have a strong negative impact on the lives of most people.
                                  • georgefrowny 12 hours ago
                                    I have previously translated a very small handful of redlink articles into English from another language. Chasing down the sources in the other language and synthesising and cross-referencing with English sources is a fun challenge. To the best of my knowledge, I did an OK job.

                                    While translation tools are a godsend for that, as well as life in general when dealing with a language I am not that good at, LLMs make me increasingly reluctant to do that much more because there is no way I could detect AI slop in a second language. For all I know I'd be translating junk into English and enabling translingual citogenesis.

                                    Bad as the slopwave is for native speakers, it's absolutely brutal for non-native speakers when you can't pick up on the tells. Maybe the gap will narrow and narrow until the slop is stylistically imperceptible.

                                • r2vcap 19 hours ago
                                  While it makes sense that LLMs and machine translation systems primarily rely on English Wikipedia as a data source, depending on smaller-language Wikipedias for training is far less ideal. English Wikipedia is generally well-regulated by its community, but many other language editions are not — so treating all of Wikipedia as an authoritative source is misguided.

                                  For instance, my mother tongue’s Wikipedia (Korean Wikipedia) suffers from serious governance issues. The community often rejects outside contributors, and many experienced editors have already moved to alternative platforms. As a result, I sometimes get mixed, low-quality responses in my native language when using LLMs.

                                  Ultimately, we need high-quality open data. Yet most Korean-language content is locked behind walled gardens run by chaebols like Naver and Kakao — and now they’re lobbying the government to fund their own “sovereign AI” projects. It’s a lose-lose situation.

                                  • orbital-decay 23 hours ago
                                    That happens by default in low-resource languages, no bad translations needed. They don't have enough either written material to train an LLM, or labels for time periods and various dialects in a continuum. For example even the best multilanguage models will lump up all Berber languages into one unstable abomination nobody speaks, usually writing it in Neo-Tifinagh. Not much can be done about that, training a model in all varieties of these would require a huge specialized effort.
                                    • 1f60c 21 hours ago
                                      And it's a lot more profitable to improve sex mode than to hire a small army of native speakers to make it not suck at Greenlandic.
                                      • orbital-decay 20 hours ago
                                        What makes Greenlandic special among ~7000 languages in the world? Most of them are low-resource as well. To train a model in all of them you also need a ton of specialized linguists and ML people, neither of which grow on trees. And it's only one thing generalist models are supposed to master, out of many. The scale is impossible, this needs to be done by models themselves when (if) they get smart enough.
                                    • foxglacier 1 day ago
                                      If nobody's reading them and nobody's writing them, then perhaps it doesn't matter. We could let Wikipedia-Greenlandic persist as its own evolved language that forks from the original.

                                      > potentially pushing the most vulnerable languages on Earth toward the precipice as future generations begin to turn away from them.

                                      OK? We have lots of dead languages. It's fine. People use whatever languages are appropriate to them and we don't need to maintain them forever.

                                      • phantomathkg 22 hours ago
                                        I thought the argument is that letting it die is OK, but letting wrongly translated text become the source for AI to churn out more wrongly translated text is not OK.
                                        • foxglacier 4 hours ago
                                          Yea, it's kind of pure and good to delete it all. But I'm imagining some self-sustaining evolution of the language through an LLM-Wikipedia feedback loop.
                                        • arthurjj 1 day ago
                                          This was my take from the article also. These languages are clearly dying and not many people speak them as their primary language so the human suffering is minimal. Which means keeping them around is a pastime that some people happen to enjoy (unless there is a Sapir-Whorf hypothesis I'm missing)

                                          But the sentence `well-meaning Wikipedians who think that by creating articles in minority languages they are in some way “helping” those communities` clearly shows the author hasn't really considered the issue.

                                          • I see that this comment gets downvoted, but I think we can agree on the fact that languages, just like species, die while others flourish. And that's fine.

                                            Survival of the fittest, right? Not enough people speaking Greenlandic, too complicated even for its own population, who would rather speak Danish? The very reason I'm speaking English is that it was imposed by military force during the 19th century by the UK and since the 20th by Hollywood.

                                            Just like a virus, if a language doesn't spread, it dies.

                                            • don-bright 1 day ago
                                              We already see the 'best' LLMs switch between different languages while they are 'thinking'. It seems to me that the more languages it can 'think' in, the better off it will be. Different human languages have different concepts of time, numbers, nature, place, intention, relationships, and so forth and so on.
                                              • thaumasiotes 1 day ago
                                                > Survival of the fittest on a long time horizon means the more diversity the better the survival rate will be.

                                                This is just a misapplication of the analogy. For a language, "fitness" refers to similarity to whatever language is spoken by people relevant to you. Diversity is the worst quality a language can exhibit, and is the quality that causes dying languages to die.

                                                There is no such concept as an external force coming in that certain languages handle better, allowing them to temporarily outcompete other languages. Existing pools of diversity are not protective against this, because it can't happen.

                                                Also unlike genetic diversity, linguistic diversity does not need to be maintained as a legacy of the past. It is constantly being generated in much larger quantities than are desired. If you managed to perform the opposite of the Tower of Babel miracle and replaced every currently-spoken language everywhere in the world with a perfect monoculture, within 1-2 generations you'd be back to having mutually unintelligible varieties in different regions.

                                              • jiggawatts 1 day ago
                                                As an immigrant to an anglophone country, I noticed a few things:

                                                When people have varying levels of capability with languages, they’ll switch to whatever is the lowest common denominator — the language that the group can best communicate in. This tended to be English, even amongst a bunch of native speakers of a common foreign language.

                                                Moreover, this is context dependent: when talking about technical matters (especially computing), the Lingua Franca (pun intended) is English. You’ll hear “locals” switch to either mixed or pure English, even if they’re not great at it. Science, aviation, etc… is the same.

                                                Before English it was French that had this role, and before then it was Latin and Greek.

                                                The thing is, when the whole world speaks one common language like Latin or English, this is a tiny bit sad for some Gaelic tribe that got wiped out culturally, but incredibly valuable for everybody everywhere. International commerce becomes practical. Students can study overseas, spreading ideas further and wider. Books have a bigger market, attracting smarter and better authors. There’s a bigger pool of talented authors to begin with, some of which write educational textbooks of exceptional sparkling quality. These all compound to create a more educated, vibrant, and varied culture… because of, not despite the single language.

                                                • overfeed 8 hours ago
                                                  > The thing is, when the whole world speaks one common language like Latin or English, this is a tiny bit sad for some Gaelic tribe that got wiped out culturally, but incredibly valuable for everybody everywhere.

                                                  I find this cultural Darwinism argument incredibly ironic, given how vocal factions in 2 of the largest (native) English-speaking countries have been whinging about "their culture" being sullied by immigrants.

                                            • ratg13 1 day ago
                                              It's ironic that the "solution" to the problem is being driven by yet another person that isn't native to Greenland.

                                              While they may be a Greenlandic teacher, it's almost assured that they are teaching western Greenlandic, which is similar to Canadian Inuktitut.

                                              People in the East of Greenland speak a language that has similarities, but is different enough in vocabulary and sounds that it's often considered a separate language and not a dialect.

                                              When people from East and West Greenland come together, they typically speak Danish because they can't understand each other in their own native language.

                                              So we're talking about a country that has 55k people, and a portion of them don't even speak the official language. This guy would have no way of knowing whether something was written poorly by a computer or by a poorly educated Greenlandic native who maybe isn't so good with the official language.

                                              Given that the majority of the country's citizens do not use the internet at all, it is not even clear what his solution is other than just deciding to be some sort of magic arbiter, which is not realistic or sustainable.

                                              • Uehreka 1 day ago
                                                I wish people on HN would stop acting like “magic arbiter” solutions are “not realistic”, when in reality it’s the only way things have ever worked. Are federal judges “magic arbiters”? Yes. Do judges make bad calls? Yes. Do we not like when large numbers of judges who are unfriendly to our side get life appointments? Yes. Has anyone proposed an actual better way of solving these kinds of problems? No.

                                                So to get back to the point: Yes the solution is to appoint someone a magic arbiter, and hope they don’t screw up. The fact that it’s a deeply imperfect way of solving problems doesn’t mean it’s not workable. It just means it will backfire at some point, and someone else will get appointed instead.

                                                • vacuity 9 hours ago
                                                  > Has anyone proposed an actual better way of solving these kinds of problems? No.

                                                  This is the heart of the matter. Nothing is good or bad in a vacuum, but when two things (say, outcomes) can be compared, distinctions can be drawn. Noticing flaws in the present can't be contrasted with simple models of "the better solution"; this is comparing apples to oranges. Address both the good and the bad of the present, including the days where nothing noteworthy happens and which are therefore below the awareness of most people, and the good and the bad of an elaborated counterpart.

                                                • optionalsquid 1 day ago
                                                  > Given that the majority of the country's citizens do not use the internet at all

                                                  On what do you base this assertion? I was not able to find up-to-date statistics, but 72% of participants in this survey from 2013 had internet access at home, either via PC or via mobile devices, and another 11% had internet access elsewhere:

                                                  https://digitalimik.gl/-/media/datagl/old_filer/strategi_201...

                                                  • bawolff 1 day ago
                                                    > People in the East of Greenland speak a language that has similarities, but is different enough in vocabulary and sounds that it's often considered a separate language and not a dialect.

                                                    If this is true, then the easy solution would be to just have two separate wikipedia editions (assuming there is interest).

                                                    After all, if we have en, sco, jam and ang, surely there is room for two Greenlandics. The limiting factor is user interest.

                                                    • thaumasiotes 1 day ago
                                                      > the easy solution would be to just have two separate wikipedia editions (assuming there is interest)

                                                      That's... a reach.

                                                      An easier, and much more realistic, solution would be to just have one edition in Danish, which was already noted as the language Greenlanders have in common.

                                                      • bawolff 9 hours ago
                                                        Well, is the point to have Greenlanders be able to read it, or is it to preserve a dying language?
                                                    • AlienRobot 1 day ago
                                                      As someone who isn't a native English speaker, I believe most people who use the Internet would benefit from simply learning English rather than having an unchecked AI translate things for them. Reddit for example has joined millions of terrible WordPress websites in auto-translating everything for SEO purposes and Google seems to be fine with this for some reason. It's ironic that it has reached the point that if you search for a "multi-language" plugin for WordPress, most of the results aren't about letting you write an article in multiple languages, they're just about automatically translating a single article to 30 languages with machine translation.

                                                      The reason none of this makes sense to me is that it's intellectually crippling Internet users. Computers and the Internet are tools. If you want something machine translated for you, you can use a tool like Google Translate to translate it for you. If the webmaster does this, it robs people of the opportunity to learn to use those tools and they become dependent on third parties to do this for them when they would have a lot more freedom if they just did it themselves (or if they learned English).

                                                      Teach a man to fish...

                                                      • spookie 1 day ago
                                                        A lot of written text out there in other languages isn't available in English. Simply put, you have many echo chambers of single languages out there. Most people are OK with just reading what they understand.
                                                        • carlosjobim 20 hours ago
                                                          You miss an advantage. If everything is inter-translated, then you can do your search in the language you know and find the answer written in a language you didn't know.
                                                      • johnea 1 day ago
                                                        [flagged]
                                                        • haiji2025 1 day ago
                                                          [flagged]
                                                          • Hendrikto 20 hours ago
                                                            Wow, fuck that site. I had to dismiss 2 cookie banners and 3 popups before I could even read anything, then the second I scrolled one pixel another one popped up.
                                                            • Pooge 20 hours ago
                                                              Sounds like you don't use the modern-day condom: uBlock Origin.

                                                              I live in Europe (a famous place for becoming a professional banner clicker) and I didn't get a single distraction.

                                                              • Hendrikto 19 hours ago
                                                                I am on mobile, I should have said that.
                                                                • user205738 18 hours ago
                                                                  uBlock Origin works on mobile in Firefox, Waterfox, etc.

                                                                  There is also Brave with its own blocker, and there is AdGuard, which blocks ads on websites and in applications regardless of the browser.

                                                              • tim333 20 hours ago
                                                                That's odd. I didn't get much.
                                                              • bradley13 22 hours ago
                                                                I've lived in a couple of countries where there is a "vulnerable" language. I understand the emotional attachment that the native speakers have to their language.

                                                                However, in the larger picture: languages evolve. New ones develop, old ones die. Do artificial attempts to "rescue" a language really make sense?

                                                                • tdeck 21 hours ago
                                                                  It makes no less sense than any other work done to protect or restore something created by human beings. This comment is no more insightful than saying "cathedrals burn down, do artificial attempts to 'restore them' make sense?"
                                                                  • mort96 21 hours ago
                                                                    Languages evolve, but it's probably bad when language evolution is driven by bad AI slop translations made by people who have no relation to the language.
                                                                    • internet_points 8 hours ago
                                                                      The AI slopwave is about as close to natural linguistic evolution as World War 2 was to natural selection (..aaand there we hit Godwin's law, I'll see myself out)