MATLAB: If I have an article, how can I count the number of characters (spaces included) and the number of letters it uses? Thank you

wordcount

Please help me, I'm really new at this. I don't know anything.

Best Answer

Counting alphabetic characters is a much trickier task than it appears. There are (human) languages in which characters such as ü are not considered alphabetic, others in which they are, and yet others that use such characters but treat them as pairs of alphabetic characters, such as ue. Furthermore, there are languages in which a pair of characters might be written separately for storage purposes but is technically considered a single character. For example, if you see ao in a Swedish text, that is considered to be a representation of the single alphabetic character å.
You also have the problem that Unicode offers multiple valid representations for some characters. For example, ü is U+00FC, but it can also be written as u followed by COMBINING DIAERESIS (U+0308). Some characters involve combining more than two code points; and then there are the "stroke" representations for some writing systems, in which the stroke order can vary.
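To make the representation problem concrete, here is a minimal sketch in base MATLAB. The variable names are made up for illustration. It builds ü both ways and shows that a plain length count and a plain comparison treat the two forms differently.

```matlab
% Two encodings of the same visible character (u with diaeresis).
precomposed = char(252);        % U+00FC, the single precomposed code point
decomposed  = ['u' char(776)];  % 'u' followed by U+0308 COMBINING DIAERESIS

nPre = numel(precomposed);      % code units in the precomposed form
nDec = numel(decomposed);       % code units in the decomposed form

% isequal compares code units; it performs no Unicode normalization.
sameText = isequal(precomposed, decomposed);

fprintf('precomposed: %d, decomposed: %d, equal: %d\n', ...
        nPre, nDec, double(sameText));
```

So the same text can report a different character count depending only on how it happened to be stored, which is why normalizing first matters.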
Because of these crazy differences, you cannot count alphabetic characters properly until you know which language you are dealing with and all of its applicable single-to-multi and multi-to-single rules. You then have to translate the input into a canonical representation and check each character against a table of what that language considers to be alphabetic.
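As a rough sketch of that recipe, assuming the language is already known: the sample text, the single normalization rule, and the German alphabet table below are all illustrative assumptions, not a complete treatment. In practice the text might come from something like fileread.

```matlab
% Hypothetical sample: "Fur dich: 2 Apfel." with umlauts, where the first
% umlaut is stored in decomposed form ('u' + U+0308).
txt = ['F' 'u' char(776) 'r dich: 2 ' char(196) 'pfel.'];

% Step 1: canonicalize. Only one multi-to-single rule is shown here;
% a real table would cover every combining sequence the language uses.
canon = strrep(txt, ['u' char(776)], char(252));

% Step 2: total characters, spaces and punctuation included.
nChars = numel(canon);

% Step 3: count letters against an explicit per-language table
% (assumed German: a-z, A-Z, plus the umlauts and sharp s).
alphabet = ['a':'z' 'A':'Z' char([228 246 252 196 214 220 223])];
nLetters = nnz(ismember(canon, alphabet));

fprintf('characters: %d, letters: %d\n', nChars, nLetters);
```

The built-in isletter is a quicker alternative, but it applies one general-purpose classification regardless of language, so it cannot express per-language rules like the ones described above.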
Automatically determining which language is being used based upon which characters appear is difficult...
Ah, and the above assumes that the characters have been represented in Unicode (perhaps in one of the UTF encodings). That is not necessarily the case: the characters might be in one of the various Code Pages, and it might not be obvious which Code Page is in use. I once spent a fair amount of time on some code that tried to figure out which Code Page was in use by assuming that the bytes represented text (and standard controls like newline), and looking carefully at which bytes occurred: if a byte is unassigned in a particular Code Page, then the presence of that byte rules out that Code Page. But it turned out I never had a use for that code.
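The elimination idea can be sketched as follows. This is not the code mentioned above, just an illustration. The candidate list is deliberately tiny, and the unassigned-byte tables are assumptions that should be checked against the official mapping tables before being relied on.

```matlab
% Hypothetical raw bytes of a text file, e.g. read with fread(fid, Inf, '*uint8')'.
bytes = uint8([72 101 108 108 111 32 129 33]);   % includes byte 0x81

% Bytes with no assigned character in each candidate code page
% (illustrative and incomplete; only two candidates are listed).
candidates = struct( ...
    'name',       {'windows-1252', 'ISO-8859-7'}, ...
    'unassigned', {uint8([129 141 143 144 157]), uint8([174 210 255])});

possible = {};
for k = 1:numel(candidates)
    % A code page stays possible only if none of its unassigned bytes occur.
    if ~any(ismember(bytes, candidates(k).unassigned))
        possible{end+1} = candidates(k).name; %#ok<AGROW>
    end
end
disp(possible)
```

Here byte 0x81 rules out windows-1252. A fuller version would also treat unexpected control codes as evidence against a candidate, since the approach assumes the bytes are ordinary text.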
... and then you get to deal with the problem that human languages often allow other languages to be quoted. I mentioned earlier that ao in Swedish text is a representation of the å character. That is true, but it might happen that the text quotes the English words "intraocular" or "aorta" or "chaos", or the Italian "ciao", or the particle physics term "kaon", and within the quoted words, the pair must be counted as two letters, not one.
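One hedged way to handle such quotations is an explicit exception list, sketched below. The sample sentence is made up and simply follows the ao convention described above, and the exception list contains only the words mentioned here. A real solution would need a far larger dictionary or proper language tagging.

```matlab
% Words in which "ao" must be counted as two separate letters.
exceptions = {'intraocular', 'aorta', 'chaos', 'ciao', 'kaon'};

% Made-up sample using the ao convention for a-ring.
txt = 'en baot och en kaon';

words = strsplit(txt, ' ');
nLetters = 0;
for k = 1:numel(words)
    w = lower(words{k});
    n = nnz(isletter(w));                  % start with one letter per character
    if ~ismember(w, exceptions)
        n = n - numel(strfind(w, 'ao'));   % each "ao" pair collapses to one letter
    end
    nLetters = nLetters + n;
end
fprintf('letters: %d\n', nLetters);
```

In this sample "baot" contributes three letters while "kaon" keeps all four, because it is on the exception list.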