“I thrive in situations where I have to get things done or create new systems and new modules. I like to satisfy my curiosity and maker trait,” said the creator of PyTorch Tabular, GATE and LAMA-Net, Manu Joseph. He said that he is fascinated with math, data science, and machine learning, particularly deep learning, because of its flexibility and scalability.
Joseph currently heads applied research at Thoucentric, a niche management company. There, he leads a group of researchers in productionising cutting-edge technology to add value for real-world customers, primarily in causality, predictive maintenance, time series forecasting and NLP, among other areas. Prior to this, he worked with companies such as Philips, Entercoms, Schneider Electric and Cognizant Technology Solutions, among others.
In an exclusive interview with Analytics India Magazine, Joseph talks about his journey into data science, alongside some of his passion projects, tips for people entering data science for better career opportunities, and more.
From starting his career in industrial engineering to working in the IT industry, and later moving to the data science and analytics field, and currently leading the research initiatives, Joseph’s journey has been truly inspirational.
“Transitioning from a STEM role, say, engineering, to data science is relatively easier than other areas,” said Joseph. He said that whatever branch you study in engineering changes the way your brain is wired. “I think that is actually helpful in all of these things,” he added.
However, he said when shifting domains to areas like machine learning, statistics, or computer science, you have to be comfortable with programming. “There’s no way around it,” he added.
He said you can learn all the machine learning you want, but at the end of the day, for any of it to be useful, you need to convert it into code. “In today’s scenario, nobody will do it for you. So you have to do it yourself,” he added, saying that a few years ago there was the luxury of leaving the coding to someone else, but now, with the industry growing rapidly, there is no option but to learn.
Further, Joseph said that you should not be afraid of Math. “It is not going to get in your way in the beginning. You can get away without Math early on, but eventually, it will come knocking on your door and then it will make a lot of difference,” he added, saying that it is a lot easier to communicate concepts in Math than in English. “Understanding what’s happening is actually very important. Otherwise, you will be able to build a model; you will be able to predict and get results out of it. But, the first time you hit a wall, without knowing what is happening in the background you won’t be able to navigate around the problem,” said Joseph.
Lastly, he said that people should start looking at interesting problems, create datasets, participate in hackathons, and develop models to make them more useful. “Move away from your standard Titanic datasets and solve something interesting that makes your resume stand out. It is very easy to identify people who have gone the extra mile,” he added.
An industrial engineer turned data scientist, Joseph said that when you work on a business problem, about 90 per cent of the data is tabular, and classical machine learning methods are what practitioners routinely use on it. However, these methods cover only a small portion of what is possible, and there are many more avenues to explore.
“That is where we started looking at deep learning. During my research, I found out that there was not a lot of work happening in that area,” recalled Joseph, saying that previously people were still using standard feedforward networks and something like categorical embeddings on top of that, for a tabular model.
“Since I was interested in the field, I kept tabs on what was happening. That’s when models like TabNet and a few other models came out. So I did see an acceleration in the space like more and more people were looking at how to use creative architectures for tabular data,” added Joseph.
Further, he said that when all these models came out and people tried to apply them to their own data, it was a lot of hassle. “Because apart from TabNet, which has a very good library, all the other models were mostly code bases. Making it work was extremely cumbersome,” he added.
That was the start of PyTorch Tabular, a framework for deep learning with tabular data. The framework is built on top of PyTorch and PyTorch Lightning and works on pandas data frames directly. It also brings SOTA models such as NODE and TabNet under a unified API.
“I started this as an internal project. At the time, it did not even have a name. The idea, however, was to unify all of that so that you can switch between different models, just like a Scikit-learn setup,” said Joseph. He said once the data pipeline is ready, switching to a new model is just about changing one line of code. That was the guiding principle behind the development of PyTorch Tabular. Soon he open-sourced the library for others to contribute and use. It is one of the most liked and talked about ML libraries on GitHub.
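The “change one line” principle can be illustrated with a toy sketch. The names below (`MeanRegressor`, `MedianRegressor`, `run_pipeline`) are hypothetical stand-ins invented for this illustration and are not PyTorch Tabular's API. The sketch only shows that when models share a `fit`/`predict` interface, in the style of Scikit-learn, the data pipeline stays untouched when the model changes.

```python
from statistics import mean, median


class MeanRegressor:
    """Toy model (hypothetical): always predicts the mean training target."""

    def fit(self, rows, targets):
        self.value_ = mean(targets)
        return self

    def predict(self, rows):
        return [self.value_ for _ in rows]


class MedianRegressor:
    """Toy model (hypothetical): always predicts the median training target."""

    def fit(self, rows, targets):
        self.value_ = median(targets)
        return self

    def predict(self, rows):
        return [self.value_ for _ in rows]


def run_pipeline(model, train_rows, train_targets, test_rows):
    """The pipeline is fixed; only the model object passed in varies."""
    return model.fit(train_rows, train_targets).predict(test_rows)


train_rows = [{"size": 1}, {"size": 2}, {"size": 3}]
train_targets = [10.0, 20.0, 60.0]
test_rows = [{"size": 4}]

model = MeanRegressor()  # the one line to change, e.g. to MedianRegressor()
predictions = run_pipeline(model, train_rows, train_targets, test_rows)
```

Swapping `MeanRegressor()` for `MedianRegressor()` changes the predictions without any change to `run_pipeline` or the data.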
One thing led to another; Joseph and his colleague Harsh Raj later released GATE (Gated Additive Tree Ensemble), a novel high-performance deep learning architecture for tabular data that is efficient in both parameters and computation. Inspired by GRU, GATE uses a gating mechanism as a feature representation learning unit with an in-built feature selection mechanism. It also uses an ensemble of differentiable, non-linear decision trees, re-weighted with simple self-attention, to predict the desired output.
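To give a feel for what “differentiable decision trees” means in general, here is a minimal sketch of a soft decision split. It is not GATE's architecture: the function names are made up for this illustration, and the softmax weighting is a simple stand-in for, not an implementation of, the self-attention re-weighting described above.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_stump(x, feature, threshold, left_value, right_value, temperature=1.0):
    """Soft split: blends the two leaf values instead of picking one.

    A hard split returns left_value if x[feature] < threshold, else
    right_value. Here the routing is a sigmoid, so the output changes
    smoothly with the threshold and leaf values, and gradients exist.
    """
    p_right = sigmoid((x[feature] - threshold) / temperature)
    return (1.0 - p_right) * left_value + p_right * right_value


def soft_ensemble(x, stumps, weights):
    """Combine soft stumps using softmax-normalised weights (an assumption)."""
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return sum((e / total) * soft_stump(x, **s) for e, s in zip(exps, stumps))


x = [0.5, 2.0]
stumps = [
    dict(feature=0, threshold=0.5, left_value=0.0, right_value=1.0),
    dict(feature=1, threshold=0.0, left_value=-1.0, right_value=1.0,
         temperature=0.1),
]
output = soft_ensemble(x, stumps, weights=[0.0, 0.0])
```

A lower `temperature` makes the split behave more like a hard threshold, while a higher one spreads the blend over a wider range of feature values.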
Joseph said that GATE is a competitive alternative to SOTA methods like GBDTs, NODE and FT Transformers, based on experiments on several public datasets (both classification and regression). The code has not yet been open-sourced.
At Thoucentric, Joseph, alongside Varchita Lalwani, recently developed LAMA-Net, a new encoder-decoder (Transformer) based model with an induced bottleneck, latent alignment using maximum mean discrepancy and manifold learning to tackle the problem of unsupervised homogeneous domain adaptation for remaining useful life (RUL) prediction.
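Maximum mean discrepancy (MMD) measures how far apart two samples are in distribution, which is why it is useful for aligning source and target domains. The sketch below is a generic, biased estimate of squared MMD with an RBF kernel. The kernel choice, the `bandwidth` parameter and the function names are assumptions for illustration; the article does not say how LAMA-Net computes or applies MMD on its latent representations.

```python
import math


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples of vectors.

    MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 * E[k(s, t)]
    """
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


source = [[0.0], [1.0]]
shifted_target = [[5.0], [6.0]]

same = mmd_squared(source, source)             # identical samples
shifted = mmd_squared(source, shifted_target)  # shifted distribution
```

Identical samples give a value of zero, and the value grows as the two samples drift apart, so minimising it during training pushes the two sets of representations towards each other.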
Citing predictive maintenance in manufacturing, Joseph said this is more like a domain adaptation technique, where the focus is on using training data with shifting data distributions to train a robust model that predicts remaining useful time.
“In a real-world implementation, it is really difficult to get the data needed to train these models: you will need to have data for multiple failures in the past, and failures are usually a rare event. So, getting the data is difficult,” said Joseph, adding that with domain adaptation, a model trained on existing datasets can now be adapted to a new dataset without any labels.
The latest paper from Applied Research @thoucentric is on Unsupervised #DomainAdaptation for #PredictiveMaintenance. We focus on how we can make use of training data with shifting data distributions to train a robust model to predict remaining useful time.
A thread #ML #DL #AI https://t.co/p4FAGdlN0y
To date, Joseph has worked on more than 20 AI/ML projects professionally, and on more than ten in a personal capacity. At Thoucentric, he is currently building a team of data scientists who will work on new-age technologies to solve customer problems. The team is working on four different projects and plans to publish three papers in the coming months.
Joseph told AIM that he would continue developing new methods and technologies in areas that do not use a lot of training data and build domain-agnostic models. “Because, having worked in the industry for some time now, I know that training data is very difficult to come by. That too, like annotated training data, is very, very difficult to come by,” said Joseph. He said that is why he is interested in areas like transfer learning, self-supervised learning, etc.
© Analytics India Magazine Pvt Ltd 2022