A Passion for Data Science
It was my participation in a Kaggle competition in autumn 2019 that ignited my passion for big data. Our team of three dug deep into a large volume of e-commerce transaction data and tried a number of machine learning models for prediction. The joy of winning the Silver Medal continues to nourish me and has lit my career path toward the thriving big data industry.
Specifically, together with two teammates, I employed machine learning (ML) models to predict the probability of fraudulent online transactions, using a large-scale dataset from a leading payment service company, Vesta Co. I pre-processed the raw data through data cleansing, extracted all the fraud cases to build a dataset that improved training efficiency, and studied the correlations within the data. I then performed feature engineering to select useful features and create new ones, and conducted dimension reduction. For model selection, my team tried three methods: CatBoost, LightGBM, and XGBoost. After parameter fine-tuning and thorough evaluation, we applied ensemble learning to combine the three algorithms into a stronger predictive model. What is unique about our work is that we took the relationship between cards and their owners into consideration. Instead of targeting a single transaction, we chose to predict the behaviour of a card and its owner directly. Thanks to this effort, we placed in the top 88 among 6,312 participating teams worldwide and won the Silver Medal. This experience showed me the valuable insight that can be obtained from real-world data, and I have been determined ever since to continue exploring practical data analysis and to pursue a career in applying advanced big data methods to predict the future.
During my participation in the Mathematical Contest in Modeling (MCM) in winter 2020, I built a model to predict the migration of mackerel and herring habitats over the next 50 years due to climate change. To study the change in sea surface temperature, which is crucial for fish habitats, I built a Scottish water temperature trend model based on the sea surface temperature data set HadISST1, pre-processed with Python, and charted the temperature over time at the original fishing point. We then used a regression model to predict the profit of fishing companies based on production and cost data from the past decades, and gave advice accordingly. Together, my team won an Honourable Mention.
Last summer, to gain more research experience in big data, I completed an independent research project remotely under the supervision of Prof. Mei-Ling Shyu from the University of Miami. This time I focused on fake news detection in social media. After reading a large number of relevant papers, I was drawn to the tri-relationship embedding framework (TriFN) proposed by Kai Shu et al., which models the relationships among publishers, news pieces, and users' social engagement simultaneously. I therefore reproduced this approach from scratch. It comprises news content embedding and user embedding based on nonnegative matrix factorization (NMF) algorithms, user-news interaction embedding that considers user credibility, publisher-news relation embedding that considers the partisan bias of publishers, and a semi-supervised classifier that predicts unlabelled news items. I used the PolitiFact dataset from the Kaggle website and pre-processed the data with the bag-of-words model to generate the feature matrices for the embeddings. In addition, I collected media bias information from mediabiasfactcheck.com and manually labelled it into three categories as required. I then implemented the optimization algorithm, which uses alternating least squares to update the parameters iteratively from a random initialization until convergence. In the end, my implementation ran smoothly. I am thrilled about this valuable experience, for it allowed me to delve into social media data, which is seen as a potential goldmine of insight. Moreover, it greatly enhanced my problem-solving and independent-thinking abilities, preparing me for more complex real-world challenges.
More importantly, I seized the chance to intern at PwC China from last August to December. In the IT department of Global Technology Solutions, I am mainly responsible for product management and business analysis. I also assist in developing a no-code development platform, Digital Market 5.0, which lets clients easily create custom apps; my tasks include testing and updating functions and designing the software interface. The department has just launched the Mainland China Digital Store, and I have become a core member of the Product Development Group, where I am mainly responsible for developing a Qualtrics-like survey software tool. My work has just started, and the product is planned for release next year.
I would also like to highlight my remote participation, since April 2020, in the lab of Dr. Jing Zhang at UC Irvine. There I have investigated Graph Convolutional Networks (GCN), studied the unsupervised GraphSAGE algorithm and trained it on a large-scale tumour network dataset to generate node embeddings, and experimented with scalable representation learning for heterogeneous networks that preserves both the structural and semantic correlations of the network. I assisted in running the model on this dataset and conducted comparative experiments. We are now working on a paper, which we plan to submit to the Learning Meaningful Representations of Life (LMRL) Workshop at NeurIPS 2020. Moreover, my graduation project, which aims to detect and analyse dynamic communities in social networks, is about to start. I will use the Raphtory software, deployed on a cloud system, to analyse data from large social networks.
University College London’s MSc in Knowledge, Information and Data Science is specially designed to train next-generation professionals in data science and knowledge engineering with a solid foundation in artificial intelligence and computational methods. It attracts me because of its rich choice of elective courses. Having observed the work of my mother, who has been a librarian at Fujian Normal University for nearly 30 years, I am looking forward to taking the course Collections Care. Fuzhou is rainy and humid, which causes CD players to malfunction when reading discs. As the archive records were originally all on paper, a great deal of manpower and material resources were spent on migrating them to electronic files. Because of poor computer performance and a lack of custodial experience in those early days, electronic data could even be lost. I believe it is necessary to combine modern approaches to information management with collections care. This course covers digital preservation, photographic media, disaster planning, and archival standards, so I believe it is a good place for me to start.
Furthermore, the Dissertation is the programme component that I like the most. This independent research project is very challenging, and I hope to conduct it under the supervision of Prof. Annemaree Lloyd, on how to improve the current information systems in Chinese libraries. With the rise of digitalisation, almost every Chinese library keeps its collection data in an information system. Although many schools buy online databases for their students, anyone searching for a particular piece of literature may need to look through several databases to find it. This process can be simplified for the convenience of both students and librarians. I hope to find out what libraries really need from their information systems in order to make everyone’s life easier. To do so, I need to equip myself with the corresponding research methods, which is Prof. Annemaree Lloyd’s strength. As an experienced social science researcher based at the Department of Information Studies, she can also advise me on research methodologies and on possible improvements to my research work.
I will also benefit greatly from the stimulating academic environment of the Department of Information Studies (DIS) and its broad range of expertise, working closely on real-world solutions. As the programme is closely aligned with the Knowledge, Information & Data Science (KIDS) research group, it would enable me to undertake research on topics such as the information society and digital humanities. It will undoubtedly prepare me fully for a successful career as a data analyst.
After graduation, I will apply what I have learned to mine and interpret large quantities of data, discovering trends and patterns that are valuable for predictive analytics and that support critical decisions and strategies. After several years of work, I will move on to a role such as senior analyst, taking on predictive modelling and decision-making responsibilities. In the long term, I hope to assume an executive position, using data-driven insights to ensure that the organisation makes the best-informed decisions.