Text mining is a fast-growing area of data science that lets data scientists work with textual content. However, some common text-mining practices, such as stopword handling and stemming, do not apply directly to Chinese text because of differences in language structure. At the same time, a study from Internet World Stats showed that Chinese-language Internet users accounted for 23.2% of the world's Internet users (as of December 31, 2013), the second-largest group (native English users are the largest group at 28.6%). The business world clearly has a strong demand for text-mining skills for Chinese text, so it is important to provide the knowledge and tools needed to extend data scientists' text-mining capacity to Chinese content.
What am I going to get from this course?
- Know the basics of Chinese text structures: characters, vocabulary types, sentences
- Understand how computers represent Chinese text, including encodings and conventions: Unicode, GB, HZ, Big5
- Understand the theory behind Chinese text segmentation and apply it using the Jieba library
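
As a small illustration of the encoding topic in the list above, the sketch below encodes one short Chinese string under several encodings using Python's standard codecs. The sample string and the helper name `encode_all` are assumptions chosen for illustration, not course material; "gb2312" stands in here for the GB family.

```python
# Encode the same two-character string under several encodings.
# The sample text is illustrative; both characters exist in GB2312 and Big5.
text = "中文"

def encode_all(s, encodings=("utf-8", "gb2312", "hz", "big5")):
    """Return a dict mapping each encoding name to the encoded bytes."""
    return {name: s.encode(name) for name in encodings}

encoded = encode_all(text)
for name, raw in encoded.items():
    # Decoding with the matching codec recovers the original string.
    assert raw.decode(name) == text
    print(f"{name:8} {len(raw)} bytes  {raw.hex()}")
```

The same two characters map to different byte sequences under each encoding, so decoding bytes with the wrong codec typically produces garbled characters or a decoding error.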