Algorithms and Insights: My Data Science Journey
I first came to grips with the concept of ‘data’ when I conducted my own research project: I learned Python to collect records from a food delivery platform for a piece of market research. But it was only after I started my career at State Street that I encountered truly large data sets -- so many systems held data for accounting and transaction processing. Wanting smoother processes that preserved data integrity across these systems, I first tried storing the transactional data in an Access database. But with at least 1.2 million records generated every day, performance slowed to a crawl. After several attempts, I found that SQLite, a lightweight and portable database, was the best choice for us. I even created customized views and drew entity-relationship diagrams to help our team better understand the database structure.
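In rough outline, the setup looked something like the sketch below; the table, columns, and view are illustrative stand-ins, not the actual production schema.

```python
# Minimal sketch of a SQLite setup of the kind described above;
# names are hypothetical, not the real schema.
import sqlite3

conn = sqlite3.connect("transactions.db")  # a single portable file on disk
cur = conn.cursor()

# Store raw transactional records in one table.
cur.execute("""
CREATE TABLE IF NOT EXISTS transactions (
    txn_id      INTEGER PRIMARY KEY,
    account_id  TEXT NOT NULL,
    txn_date    TEXT NOT NULL,      -- ISO-8601 date string
    amount      REAL NOT NULL,
    system_name TEXT NOT NULL       -- which upstream system produced it
)""")

# Index the column used by daily reconciliation queries.
cur.execute("CREATE INDEX IF NOT EXISTS idx_txn_date ON transactions (txn_date)")

# A customized view summarising each day's load per source system,
# similar to the views that helped the team read the data.
cur.execute("""
CREATE VIEW IF NOT EXISTS daily_load AS
SELECT txn_date, system_name, COUNT(*) AS n_records, SUM(amount) AS total
FROM transactions
GROUP BY txn_date, system_name
""")
conn.commit()
```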
I also learned several data pre-processing methods from real-market situations and applied them in my dissertation, which examined how network relationships influence decision making in mergers and acquisitions. Applying Chauvenet’s criterion, I discovered that some companies appeared to record more than 20 M&A events in a single year. However, events sharing the same first announcement date and the same underlying asset should be treated as one deal, so I corrected these outliers and obtained a more convincing result. As the data I face grows ever more complex, I believe I must keep adding effective preprocessing techniques to achieve higher accuracy. From a paper by Prof. Isaac Triguero, I learned that the KNN algorithm can be a valuable method for correcting data imperfections. I would be excited to have the opportunity to be taught by Prof. Triguero and to discuss with him how to obtain quality data.
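A hedged sketch of those two cleaning steps, using made-up counts, dates, and column names rather than my dissertation data:

```python
# Toy illustration of the outlier correction described above;
# the numbers and column names are invented for the example.
import numpy as np
import pandas as pd
from scipy.stats import norm

# Step 1: Chauvenet's criterion -- flag a count as an outlier if the
# expected number of observations at least that extreme, under a
# normal fit, falls below 0.5.
counts = np.array([2, 3, 1, 4, 2, 3, 25, 2, 1, 3])  # M&A events per company-year
z = np.abs(counts - counts.mean()) / counts.std(ddof=1)
outliers = counts.size * 2 * norm.sf(z) < 0.5
print(counts[outliers])  # -> [25]

# Step 2: records sharing the same first announcement date and the
# same underlying asset describe one deal, so collapse them.
events = pd.DataFrame({
    "first_announce_date": ["2019-03-01", "2019-03-01", "2019-05-10"],
    "underlying_asset":    ["AssetX",     "AssetX",     "AssetY"],
})
deals = events.drop_duplicates(["first_announce_date", "underlying_asset"])
print(len(deals))  # -> 2
```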
Once cleaned and shaped, data takes a form that can be meaningful and valuable, but stopping at un-analysed information is far from enough. Clients love the way I present data-load patterns and report-filing status in an interactive dashboard, yet I still aspire to dig deeper: finding more actionable insights, making reliable forecasts, and, most importantly, letting the data speak. In my research on recommending closed-end funds to investors based on information disclosed in regulatory reports such as N-CEN and N-CSR, I first tried logistic regression for this classification problem, but it performed poorly due to multicollinearity. Inspired by a paper from Facebook, I combined gradient-boosted decision trees with logistic regression: the indices of the leaves each sample falls into are used as input features to the linear model. This combination decreased the cross-entropy loss by more than 5%. Seeing how much a well-chosen algorithm could change performance, I was drawn to the charm of machine learning.
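As a rough sketch of this tree-to-linear-model stacking, here is the pattern on synthetic scikit-learn data; the actual fund-filing features are not shown.

```python
# Sketch of GBDT leaf features feeding a logistic regression,
# on synthetic data rather than the real fund-filing dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Fit the gradient-boosted trees.
gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gbdt.fit(X_tr, y_tr)

# 2) Re-encode each sample by the leaf it lands in within every tree,
#    then one-hot encode those leaf indices.
enc = OneHotEncoder(handle_unknown="ignore")
leaves_tr = gbdt.apply(X_tr)[:, :, 0]   # shape: (n_samples, n_trees)
leaves_te = gbdt.apply(X_te)[:, :, 0]
enc.fit(leaves_tr)

# 3) The linear model sees only the tree-derived features, which are
#    far less collinear than the raw inputs.
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.transform(leaves_tr), y_tr)
print("test accuracy:", lr.score(enc.transform(leaves_te), y_te))
```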
This experience pushed me to contrast the two cultures of statistical modeling -- data modeling and algorithmic modeling. Leo Breiman described data modeling as choosing a simple (often linear) model based on intuition, and algorithmic modeling as choosing whichever model achieves the highest predictive validation accuracy. He argued that we should focus first on accuracy, and only after building a high-performance model think about explaining it. Nevertheless, my industry experience tells a different story. At the beginning of my project analysing fund filing status, I did not have every potentially valuable feature at hand and could not simply plug them all into a black box for tuning. I had to ask around about data availability and access, and to consider interpretable models from the start. Facing this trade-off, I aspire to strengthen both my statistical and computational knowledge so that I can find better solutions. Your course, covering both disciplines, is a natural and exciting next step for me.
Just as humans learn from observation and example to acquire knowledge and solve problems, machines learn from input datasets that train models to produce correct outcomes. I have a hunger to see what knowledge I can draw from machine learning. In particular, having experience with structured data, I am eager to learn to handle unstructured data and to identify objects in images in the Computer Vision module. Moreover, while collecting census-type information from N-CSR reports, I found text mining to be quite crucial. I believe learning from Prof. Ke Zhou, who is also an academic consultant for Yahoo! Research, will enable me to master practical text-mining techniques.
When I came to know the DIKW model, I realised that the stages of my understanding map exactly onto its hierarchy -- from data to information, from information to knowledge, and now from knowledge to wisdom. Wisdom is the ability to increase effectiveness, and I can see machine wisdom enabling innovation. Take my hometown, Hangzhou, as an example: the ‘City Brain’ project is empowering the city to think through data-driven governance, improving administrative capability and easing traffic congestion. I hope to start from the basics, data, in your course and to equip myself with Automated Scheduling and Social Simulation methods to make decisions that support the operation of society -- not merely witnessing these innovations, but participating in them.
What is beyond wisdom? To me, it is life. ‘The organism feeds on negative entropy’ (Erwin Schrödinger, 1943). If the world as a whole is regarded as a complex isolated system, its entropy will never decrease. But the development of the human brain lets us apply wisdom to evolve within a single life cycle. I believe I should deploy energy and information to push back the tide of entropy in the limited days of my life, with the help of revolutionary technology.