Jupyter Hub Architecture Diagram

Posted on March 15, 2021 by Yash.

Serving Jupyter Notebooks to Thousands of Users

At our organization, Colaberry Inc., we provide professionals from various backgrounds and levels of experience with a platform and the opportunity to learn Data Analytics and Data Science. The Jupyter Notebook platform is one of the most important tools for teaching Data Science. A Jupyter Notebook is a document within an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text.

In this blog, we will cover the basic architecture of JupyterHub, the multi-user Jupyter Notebook platform, how it works, and finally how to set up Jupyter Notebooks to serve a large user base.

Why Jupyter Notebooks?

On our platform, refactored.ai, we give users the opportunity to learn Data Science and AI through courses and lessons on Data Science and machine learning algorithms, the basics of the Python programming language, and topics such as data handling and data manipulation.

Our approach to teaching these topics is to provide an option to “Learn by doing”. In order to provide practical hands-on learning, the content is delivered using the Jupyter Notebooks technology.

Jupyter Notebooks allow users to combine code, text, images, and videos in a single document. This also makes it easy for students to share their work with peers and instructors. Jupyter Notebooks also give users access to computational environments and resources without burdening them with installation and maintenance tasks.

Limitations

One limitation of the Jupyter Notebook server is that it is a single-user environment. When teaching a group of students learning data science, the basic Jupyter Notebook server falls short of serving all the users.

JupyterHub comes to the rescue when it comes to serving multiple users seamlessly, each with their own separate Jupyter Notebook server. This makes JupyterHub equivalent to a web application that can be integrated into any web-based platform, unlike a regular single-user Jupyter Notebook server.

JupyterHub Architecture

The diagram below is a visual explanation of the various components of the JupyterHub platform. In the subsequent sections, we will see what each component is and how the components work together to serve multiple users with Jupyter Notebooks.

Components of JupyterHub

Notebooks

At the core of this platform are the Jupyter Notebooks. These are live documents that contain user code, write-up or documentation, and the results of code execution, all in a single document. The contents of a notebook are rendered directly in the browser, and notebooks are saved with the file extension .ipynb. The figure below depicts how a Jupyter Notebook looks:
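Under the hood, a .ipynb file is plain JSON. The sketch below builds a minimal two-cell notebook by hand to show the structure (nbformat v4); the cell contents and kernelspec values are illustrative, not taken from any real course notebook.

```python
import json

# A minimal, hand-built .ipynb document (nbformat v4): a list of cells
# plus notebook-level metadata. Values here are illustrative examples.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {
            # Markdown cells hold the narrative text.
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Hello\n", "Some narrative text."],
        },
        {
            # Code cells hold executable source plus captured outputs.
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# Serializing this dict produces exactly what is stored on disk as .ipynb.
serialized = json.dumps(notebook, indent=1)
```

Because the format is just JSON, any tool that can parse JSON can inspect or generate notebooks.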

 

Notebook Server

As mentioned above, notebook servers serve Jupyter Notebooks as .ipynb files. The browser loads a notebook and then communicates with the notebook server over WebSockets; the code in the notebook is executed on the notebook server. These are single-user servers by design.

Hub

The Hub is the component that makes it possible to serve Jupyter Notebooks to multiple users. To support multiple users, the Hub relies on several sub-components: the Authenticator, the User Database, and the Spawner.

Authenticator

This component is responsible for authenticating the user via one of several supported mechanisms. It supports OAuth, GitHub, and Google, to name a few of the available options. After the user is successfully authenticated, the Authenticator issues an auth token that grants the corresponding user access.
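As a concrete illustration, here is a configuration fragment in the style of a jupyterhub_config.py file, assuming the oauthenticator package is installed; the callback URL, client ID, secret, and usernames are placeholders, not real values.

```python
# jupyterhub_config.py -- illustrative fragment (placeholder values).
from traitlets.config import get_config
c = get_config()

# Use GitHub OAuth for login (requires the `oauthenticator` package).
c.JupyterHub.authenticator_class = "github"
c.GitHubOAuthenticator.oauth_callback_url = "https://example.com/hub/oauth_callback"
c.GitHubOAuthenticator.client_id = "YOUR_CLIENT_ID"
c.GitHubOAuthenticator.client_secret = "YOUR_CLIENT_SECRET"

# Optionally restrict who may log in and who gets admin rights.
c.Authenticator.allowed_users = {"alice", "bob"}
c.Authenticator.admin_users = {"admin"}
```

Swapping the authenticator class is all it takes to move between providers; the rest of the Hub is unaffected.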

Refer to JupyterHub documentation for an exhaustive list of options. One of the notable options is using an identity aggregator platform such as Auth0 that supports several other options.

User Database

Internally, JupyterHub uses a user database to store user information. This information is used to spawn a separate user pod for each logged-in user and to serve the notebooks contained within that user's pod.

Spawner

A Spawner is a worker component that creates an individual server, or user pod, for each user allowed to access JupyterHub. This mechanism ensures multiple users are served simultaneously. It is to be noted that there is a limit on the number of simultaneous first-time user-pod spawns, which in our deployment is roughly 80 simultaneous users. However, this does not affect the regular usage of the individual servers after the initial user pods are created.
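For a Kubernetes deployment, the Spawner is typically configured along these lines, assuming the kubespawner package is installed; the image name and resource limits are example values, not our production settings.

```python
# jupyterhub_config.py -- illustrative spawner fragment (example values).
from traitlets.config import get_config
c = get_config()

# Spawn each user's server as a Kubernetes pod (requires `kubespawner`).
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Image and per-pod resource limits for the single-user servers.
c.KubeSpawner.image = "jupyter/minimal-notebook:latest"
c.KubeSpawner.cpu_limit = 1
c.KubeSpawner.mem_limit = "1G"

# Cap how many pods the Hub will spawn at the same time.
c.JupyterHub.concurrent_spawn_limit = 80
```

The concurrent_spawn_limit setting is the kind of knob behind the simultaneous first-time spawn limit described above: it bounds concurrent spawns without limiting already-running servers.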

How It All Works Together

The mechanism JupyterHub uses to authenticate multiple users and provide each of them with their own Jupyter Notebook server is described below.

  1. The user requests access to a Jupyter Notebook via the JupyterHub (JH) server.
  2. JupyterHub authenticates the user using one of the configured authentication mechanisms, such as OAuth. This returns an auth token that the user uses to access their user pod.
  3. A separate Jupyter Notebook server is created, and the user is given access to it.
  4. The requested notebook on that server is returned to the user's browser.
  5. The user writes code (or documentation text) in the notebook.
  6. The code is executed on the notebook server, and the response is returned to the user's browser.
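The steps above can be sketched as a toy, in-memory simulation. The class and method names below are illustrative stand-ins, not JupyterHub's real API; the point is the division of labor between Authenticator, Spawner, and the Hub's user database.

```python
import secrets

class Authenticator:
    def authenticate(self, username, password):
        # A real authenticator would delegate to OAuth/GitHub/Google here.
        return secrets.token_hex(8) if password == "s3cret" else None

class Spawner:
    def spawn(self, username):
        # A real spawner starts a notebook server process or user pod;
        # here we just return a made-up internal URL for it.
        return f"http://hub.internal/user/{username}/"

class Hub:
    def __init__(self):
        self.authenticator = Authenticator()
        self.spawner = Spawner()
        self.user_db = {}   # username -> auth token (the "User Database")
        self.servers = {}   # username -> that user's server URL

    def login(self, username, password):
        # Step 1-2: authenticate and record the token.
        token = self.authenticator.authenticate(username, password)
        if token is None:
            raise PermissionError("authentication failed")
        self.user_db[username] = token
        # Step 3: spawn a per-user server on first login, reuse it after.
        if username not in self.servers:
            self.servers[username] = self.spawner.spawn(username)
        # Step 4: hand the user their server; code execution (steps 5-6)
        # then happens against that server, not the Hub.
        return token, self.servers[username]

hub = Hub()
token, server_url = hub.login("alice", "s3cret")
print(server_url)
```

Note that a second login by the same user reuses the existing server, mirroring how JupyterHub routes returning users to their running pod instead of spawning a new one.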

Deployment and Scalability

JupyterHub servers can be deployed in two different ways:

  1. On a cloud platform such as AWS or Google Cloud Platform, using Docker and Kubernetes clusters to scale the servers to support thousands of users.
  2. As a lightweight deployment on a single virtual instance to support a small set of users.

Scalability

To support a few thousand users or more, we use a Kubernetes cluster deployment on the Google Cloud Platform. Alternatively, this could also be done on the Amazon AWS platform to support a similar number of users.

This deployment uses one Hub instance and multiple user instances, each of which is known as a pod (refer to the architecture diagram above). This deployment architecture scales well to support a few thousand users seamlessly.

To learn more about how to set up your own JupyterHub instance, refer to the Zero to JupyterHub documentation.

Conclusion

JupyterHub is a scalable architecture of Jupyter Notebook servers that supports thousands of users in a maintainable cluster environment on popular cloud platforms.

This architecture suits several use cases with thousands of users and a large number of simultaneous users, for example, an online Data Science learning platform such as refactored.ai.

Posted on November 4, 2017 by Ram Dhan Yadav Katamaraja.

The Colaberry learning team, in partnership with the data science club of the University of Texas at Dallas, organized a data science hackathon on October 28th, 2017. It was a whirlwind tour of putting our data science learning platform, https://refactored.ai, which we are cooking up in our labs, to an interesting test in an uncontrolled environment. Here is a rundown of what happened.

Inception:

It all started with Colaberry bringing in its first-ever data science interns to work on the https://refactored.ai platform in the summer of 2017. One of those interns was Kshitij Yadav, who is pursuing a Master's in Business Analytics at the Naveen Jindal School of Management, UTD. He is a very unassuming guy with great hustle, and he hustled his way onto the internship team. He stumbled a lot through the summer and made a lot of mistakes, but he continuously improved through sheer hard work and dedication. In the end, he emerged as one of the best interns we hired. Before completing his internship, he asked: 'I learned a lot on https://refactored.ai. How can I continue to contribute to this platform once I am back at school?' I suggested that he go back, start or join a data science club, continue to learn on the platform, and contribute to it as a volunteer. Kshitij took that advice very seriously: he went back to school, collaborated with the newly formed data science club, convinced us to partner with the club to run a hackathon at the university for the benefit of enthusiastic students, and brought Colaberry to his school to run it.

Opening architecture:

Organizing a data science hackathon was a very interesting challenge for me and our refactored team. I continuously think about the reusability and scalability of the work that we do, so I started with the question: 'How can we organize this hackathon in a way that gives us the capability to repeat future hackathons with less effort and potentially scale to run more of them?'

That question forced us to think critically about how data science content is created and consumed on the refactored platform, and it pushed us from a closed architecture to an open one. Instead of forcing users to create content using tools within our platform, we opened the architecture so that content contributors can check their content into GitHub and pull it into the platform. This lets us leverage GitHub's infrastructure to empower any data scientist in the world to contribute content to the refactored platform in the form of Jupyter Notebooks. Giving contributors the flexibility of writing content as Jupyter Notebooks comes with its challenges. One such challenge is translating the contributed content into a lesson or lab on the refactored platform. As a first step, we created a temporary solution in the form of a custom refactored Jupyter Notebook, hosted at http://hackathon.refactored.ai. It allows users to upload their Jupyter Notebook and quickly convert it into a format that the refactored platform can import through the CI process. I hope we can eventually eliminate this manual intervention and automatically import Jupyter Notebooks into the platform; that is probably a good artificial intelligence and machine learning challenge to solve.

Setting the stage:

With this open content-contribution architecture and content-format conversion tool, we had enough ammunition to organize a small-scale data science hackathon. For organizing hackathons, we created a public git repository at https://github.com/colaberry/hackathons, which users can fork to work on the challenges. For the UTD data science hackathon, we created a subfolder, utd-fall-2017, and posted the hackathon instructions there. We decided to test two aspects as part of the hackathon: first, the participants' ability to provide data-driven analysis of the datasets; second, their ability to deliver that analysis as a product that can be imported into the refactored platform.

UTD data science club's president Achintya Sen and his team did a remarkable job collaborating with us to plan the hackathon. The hackathon was divided into three stages:

  1. Refactored platform orientation – Oct 19th, 2017
  2. Hackathon day – Oct 28th, 2017
  3. Award ceremony day – Nov 2nd, 2017

Hackathon execution:

In the first stage, Harish Krishnamurthy and Murali Mallina provided an orientation on Python for data science, Jupyter Notebooks, and using the refactored.ai platform to quickly learn data science by doing.

In the second stage, a 6-hour hackathon was held in a large room that accommodated 57 students, who formed 18 teams to participate. The hackathon was filled with energy, ideas, questions, focus, creativity, and lots of fun. The data science club team and volunteers did an incredible job behind the scenes organizing the logistics, food, AV, and more.

I attended the hackathon to observe whether our open architecture was working as expected, how the participants were responding, and whether they could easily follow the instructions, and to gather any other feedback. For the most part, they were able to work very easily with Jupyter Notebooks and the refactored platform. What was most exciting to me and our team was that the participants constantly referred to the lessons and concepts on the refactored platform while hacking away. One area where participants struggled was working with 'assertion' blocks as part of converting their content into the lesson format that can be imported into the refactored platform. I envision a day when we can automate this lesson-conversion process by applying artificial intelligence and machine learning.

Feedback from participants:

The hackathon also provided an opportunity to get feedback from participants on various aspects of the refactored platform and the tools we are developing. The overall feedback was overwhelmingly positive, which encourages us to continue our work to democratize data science. Sharing below some graphs from the feedback we received.

  

Evaluation:

After the hackathon was complete, the refactored team evaluated the 18 pull requests submitted to the public git repository. The evaluation team included Sathwik Mohan and Brinda Krishnaswamy, under the guidance of Harish Krishnamurthy. We came up with various scoring criteria and brainstormed a lot to pick the winning submissions. Even after 2 days of deliberation, evaluation, and scoring, the team could not pick a single winner, as the aggregate scores kept producing ties. Finally, they decided to pick 3 winning entries and 2 runner-up entries.

Award ceremony: 

As the final step, the data science club organized the award ceremony and a networking event on November 2nd. Tommy Haugh and David Freni from the Colaberry recruiting team attended the event. The data science club's advisor and director of the MS Business Analytics program, Khasif Saeed, was also present at the event.

Winners of the UTD 2017 hackathon, along with analysis and feedback:

• Pythonians: The highlight of their notebook1 and notebook2 was the content development that guided their workflow throughout the analysis. They adhered to the Refactored course format, which helps us understand their thought process in arriving at the model. From the beginning, their approach to the dataset was logical and well presented. Exploratory Data Analysis (EDA) is an important part of understanding the data and exploring its weaknesses. We would encourage you to add more on that front; this will enable you to understand what the data is trying to tell you. Conclusions are an important part of model building. Write inferences from your model and explain your coefficients.

• DataFreaks: The major contribution in their notebook was the EDA. Box plots clearly conveyed the number of employed graduates in each category, and a very nice heat map clearly showed multicollinearity. Adding a table of variance inflation factors (VIF) was a good way to weed out unwanted variables. The side-by-side distribution plots of median salary for employed and unemployed graduates were very illustrative; we can clearly see there is a definite relationship. Show model statistics, coefficients, and model accuracy using MSE. This alone helps us understand how well the model can perform.

• MeanSquare: This team doubled their efforts and worked on two datasets. With the graduate students dataset, they started with some very good plots as part of their EDA. The bar plot of total graduates grouped by major category and the heat map checking collinearity both hold a lot of information. A pairplot among predictors is a very easy way to check their distributions and assert model assumptions. The model predictions were good, and model diagnostics through a colored scatter plot of predictions were also well done. In addition to providing mean squared errors for model accuracy, these kinds of diagnostic plots convey the same information in a way that is easy to understand. For the Women in STEM dataset, they had a pie chart, bar plots, and a heat map, in addition to a scatterplot of predicted vs. actual values that captured the model accuracy. Presentation is an equally important part of model building; it is important to convey the thinking that went into your model. We encourage you to summarize and add conclusions at each step.

The runners-up did a very good job but just missed out on a few key points that were expected. Kudos to them for making the top.

• DataArtists: The Happiness Index dataset was a tough nut to crack, but the DataArtists team really owned it. Their approach to this complex dataset was inspiring. With numerous variables and lots of missing values, they handled the data cleaning very well. Their application of multinomial logistic regression and their attempt at a decision tree are commendable. Having some output would have helped us understand the strength of the model. Following the Refactored format for creating a notebook really helps present your findings and model-building logic in a way that is clear to the reader. Focus on adding content and more exploratory data analysis around your variables, and work on perfecting your model and increasing its accuracy.

• Data_Diggers: They followed the refactored course format all through the notebook, with content added at every step. Each exercise was followed by very strong assertion blocks. A scatterplot of median salary versus the proportion of women in STEM was used to clearly show the relationship between the two variables. Conclusions were provided for the plots, which is what we expect from EDA. A bar plot showing the number of men and women in each category is illustrative. Their team was one of the very few who clearly showed the model statistics along with model evaluation metrics, and they used a pair plot to assess model accuracy, which was a unique and very useful approach. As part of EDA and data cleaning, you were expected to check for multicollinearity and missing values and provide summary statistics; this forms an important first step. Always ensure your data is clean before you venture into building a model around it, since collinear predictors can severely affect your model's performance.

In general, we encourage everyone not just to fit a good model to the data but also to understand the hypothesis behind building it. Derive inferences from your model. Every dataset tells a story, and the objective of a model is to extract maximum information from that story. Build conclusions and explain the limitations of the model. Understand how the coefficients relate to the predictors and what their standard errors mean. Assess the model's accuracy and build diagnostic plots. Explain why you chose your predictors and why your model is the best fit. Presenting output goes a long way in communicating what your model is about: explain why you chose a plot, what you learned from it, why you chose a model diagnostic, and what it implies about your model.

Tom and David made some wonderful connections at the event and are looking forward to working with the students and the university.

What’s next?

Traveling on this hackathon journey has been so exciting that it planted seeds for new possibilities, such as organizing inter-college hackathons, or maybe a worldwide hackathon, to test the boundaries of the platform. We have started receiving correspondence from students who are looking for career opportunities, so I am thinking about launching a thousand-virtual-internships initiative for students to contribute to the continuous development of the refactored platform and its content. That could open up a world of opportunities for students and for us. I am excited to see where this takes us.