
Posted on February 7, 2023 by Kevin Guisarde.

SQL Server Joins in the Finance Industry

SQL Server Joins are a fundamental aspect of SQL Server, providing developers with the ability to combine data from multiple tables into a single result set. With Joins, developers can access data from multiple tables and perform complex queries that extract specific information. In this blog, we will explore the different types of SQL Server Joins, including Inner Joins, Left Joins, Right Joins, and Outer Joins, and provide coding examples from the Finance Industry to illustrate these concepts. By understanding the different types of Joins and their applications, developers can enhance their SQL skills and build more powerful applications that extract data more efficiently. Whether you’re new to SQL or a seasoned developer, this blog will help you master SQL Server Joins and unlock their full potential.

Agenda

  1. Introduction to SQL Server Joins
  2. Different Concept Types of SQL Server Joins
  3. Real-world Example Questions in the Finance Industry
  4. Most Commonly Asked Interview Question in SQL Server Joins
  5. Conclusion

Introduction to SQL Server Joins in the Finance Industry

SQL Server Joins are an essential part of SQL Server, and they allow you to combine data from multiple tables into a single result set. In this blog, we will be discussing the different concept types of SQL Server Joins and provide coding examples from the Finance Industry to illustrate the concepts.

Different Concept Types of SQL Server Joins

INNER JOIN

The INNER JOIN is used to combine data from two tables based on a matching value in both tables. In the Finance Industry, you might use an INNER JOIN to combine data from a customer table and a transaction table to get a list of customers and their transactions.

Coding Example:

SELECT *
FROM customer
INNER JOIN [transaction]  -- TRANSACTION is a reserved keyword in T-SQL, so the table name is bracketed
ON customer.customer_id = [transaction].customer_id;

LEFT JOIN

The LEFT JOIN is used to combine data from two tables based on a matching value in the left table and returns all records from the left table and the matching records from the right table. In the Finance Industry, you might use a LEFT JOIN to combine data from a customer table and a transaction table to get a list of customers and their transactions, with customers who have not made any transactions appearing in the result set with NULL values for the transaction data.

Coding Example:

SELECT *
FROM customer
LEFT JOIN [transaction]
ON customer.customer_id = [transaction].customer_id;

RIGHT JOIN

The RIGHT JOIN is the mirror of the LEFT JOIN: it returns all records from the right table and the matching records from the left table. In the Finance Industry, you might use a RIGHT JOIN to combine data from a customer table and a transaction table to get a list of transactions and the corresponding customer data, with any transactions that are not matched to a customer appearing in the result set with NULL values for the customer data.

Coding Example:

SELECT *
FROM customer
RIGHT JOIN [transaction]
ON customer.customer_id = [transaction].customer_id;

FULL OUTER JOIN

The FULL OUTER JOIN is used to combine data from two tables based on a matching value in both tables and returns all records from both tables, with NULL values for non-matching records. In the Finance Industry, you might use a FULL OUTER JOIN to combine data from a customer table and a transaction table to get a list of all customers and transactions, with NULL values for customers who have not made any transactions and transactions that have not been made by any customers.

Coding Example:

SELECT *
FROM customer
FULL OUTER JOIN [transaction]
ON customer.customer_id = [transaction].customer_id;

Real-World Example Questions in the Finance Industry

Script to generate tables and records:

CREATE TABLE customer (
  customer_id INT PRIMARY KEY,
  customer_name VARCHAR(50),
  customer_address VARCHAR(100)
);

INSERT INTO customer (customer_id, customer_name, customer_address)
VALUES (1, 'John Doe', '123 Main St'),
       (2, 'Jane Doe', '456 Main St'),
       (3, 'Bob Smith', '789 Main St');

CREATE TABLE [transaction] (
  transaction_id INT PRIMARY KEY,
  customer_id INT,
  transaction_date DATE,
  transaction_amount DECIMAL(10,2)
);

INSERT INTO [transaction] (transaction_id, customer_id, transaction_date, transaction_amount)
VALUES (1, 1, '2022-01-01', 100.00),
       (2, 2, '2022-01-02', 200.00),
       (3, 1, '2022-01-03', 300.00),
       (4, 3, '2022-01-04', 400.00);

1. Write a query to get a list of all customers and their total transaction amount.

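One possible answer, using the sample tables created above:

SELECT c.customer_id,
       c.customer_name,
       SUM(t.transaction_amount) AS total_transaction_amount
FROM customer c
LEFT JOIN [transaction] t ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.customer_name;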

2. Write a query to get a list of all customers and their latest transaction date.

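One possible answer:

SELECT c.customer_id,
       c.customer_name,
       MAX(t.transaction_date) AS latest_transaction_date
FROM customer c
LEFT JOIN [transaction] t ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.customer_name;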

3. Write a query to get a list of customers who have not made any transactions.

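One possible answer; the LEFT JOIN keeps every customer, and the NULL check keeps only those with no matching transactions:

SELECT c.customer_id, c.customer_name
FROM customer c
LEFT JOIN [transaction] t ON c.customer_id = t.customer_id
WHERE t.transaction_id IS NULL;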

Most Commonly Asked Interview Question in SQL Server Joins

Q: Explain the difference between INNER JOIN and LEFT JOIN.

A: The main difference between INNER JOIN and LEFT JOIN is the way they combine data from two tables. INNER JOIN combines data from two tables based on a matching value in both tables and returns only the matching records. On the other hand, LEFT JOIN combines data from two tables based on a matching value in the left table and returns all records from the left table and the matching records from the right table.

I used INNER JOIN and LEFT JOIN in a previous project where I was working on financial data analysis. I used an INNER JOIN to combine data from a customer table and a transaction table to get a list of customers and their transactions. I also wanted a list of customers who had not made any transactions, so I used a LEFT JOIN, which returned all records from the customer table with NULL values in the transaction columns for customers without transactions; filtering on those NULL values gave me the customers who had not made any transactions.

Conclusion

SQL Server Joins are an essential part of SQL Server and are used to combine data from multiple tables into a single result set. In this blog, we explored the different types of SQL Server Joins such as INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN, with examples from the finance industry. We also covered advanced practice questions, a commonly asked interview question, and a script to generate tables and records.

SQL Server Joins play a critical role in data analysis and understanding the different types and their usage is crucial for anyone working with data. Whether you are a beginner or an experienced developer, this blog should help you build a strong foundation in SQL Server Joins.

Interested in a career in Data Analytics? Book a call with our admissions team or visit training.colaberry.com to learn more.


Posted on February 7, 2023 by Kevin Guisarde.

SQL Server Views In Transportation Industry

SQL Server Views are virtual tables that provide developers with a more efficient and user-friendly way to access data stored in the database. Views can simplify the way data is accessed by providing a consistent, controlled, and secure way of accessing the underlying tables, while hiding the complexity of the database structure. In this blog, we will delve into the different types of SQL Server Views and how they can be utilized in the transportation industry. With views, developers can create customized data views tailored to specific use cases, providing a more streamlined and efficient approach to data management. Whether you’re working in the transportation industry or any other sector, understanding SQL Server Views can help you improve your data management and streamline your application development process.

Agenda

  1. Introduction to SQL Server Views
  2. Types of SQL Server Views
  3. Real-World Example Questions in the Transportation Industry
  4. Most Commonly Asked Interview Question
  5. Conclusion

Introduction to SQL Server Views

SQL Server Views are virtual tables that can be used to simplify the way you access data stored in the database. Views can provide a consistent, controlled, and secure way of accessing the underlying tables, hiding the complexity of the database structure. In this blog, we will discuss the different types of SQL Server Views and how they can be used in the transportation industry.

Types of SQL Server Views

SQL Server Views can be divided into three main categories: Simple View, Complex View, and Indexed View.

Simple View

A simple view is defined by a single SELECT statement that retrieves data from one or more tables. The SELECT statement can include a WHERE clause, aggregate functions, and any other clause that is valid in a SELECT statement. Here is an example of a simple view in the transportation industry:

CREATE VIEW vw_transportation_deliveries
AS
SELECT delivery_id, delivery_date, delivery_destination, delivery_status
FROM deliveries
WHERE delivery_status = 'Delivered'

In this example, we are creating a view vw_transportation_deliveries that retrieves all the deliveries with a status of “Delivered.”

Complex View

A complex view is defined by a SELECT statement that combines data from multiple tables using joins and any other clauses that are valid in a SELECT statement. Here is an example of a complex view in the transportation industry:

CREATE VIEW vw_transportation_delivery_details
AS
SELECT d.delivery_id, d.delivery_date, d.delivery_destination, d.delivery_status,
       v.vehicle_number, v.vehicle_type, v.vehicle_capacity
FROM deliveries d
INNER JOIN vehicles v ON d.vehicle_id = v.vehicle_id

In this example, we are creating a view vw_transportation_delivery_details that retrieves data from two tables, deliveries and vehicles, based on the vehicle_id field.

Indexed View

An indexed view is a view that has a unique clustered index created on it, which materializes the view's results. This type of view is useful when you need to improve query performance. Here is an example of an indexed view in the transportation industry:

CREATE VIEW vw_transportation_delivery_summary
WITH SCHEMABINDING
AS
SELECT delivery_destination,
       SUM(delivery_weight) AS total_delivery_weight,
       COUNT_BIG(*) AS delivery_count
FROM dbo.deliveries
GROUP BY delivery_destination

In this example, we are creating a view vw_transportation_delivery_summary that returns the total delivery weight for each delivery destination. The WITH SCHEMABINDING option binds the view to the schema of the underlying table, which is required before an index can be created on the view, and an indexed view that uses GROUP BY must also include COUNT_BIG(*). We then create a unique clustered index `idx_vw_transportation_delivery_summary` on the view, which materializes the results and helps improve query performance when accessing this view.
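The clustered index referenced above would be created with a statement along these lines; it must be unique and is keyed on the GROUP BY column:

CREATE UNIQUE CLUSTERED INDEX idx_vw_transportation_delivery_summary
ON vw_transportation_delivery_summary (delivery_destination);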

Real-World Example Questions in the Transportation Industry

Script to generate tables and records:

-- create the delivery_destinations table 
CREATE TABLE delivery_destinations ( 
  destination_id INT IDENTITY(1, 1) PRIMARY KEY, 
  destination_name VARCHAR(50), 
  city VARCHAR(50), 
  state VARCHAR(50) 
); 

-- insert sample data into the delivery_destinations table 
INSERT INTO delivery_destinations (destination_name, city, state) 
VALUES 
  ('Destination A', 'City A', 'State A'), 
  ('Destination B', 'City B', 'State B'), 
  ('Destination C', 'City C', 'State C'), 
  ('Destination D', 'City D', 'State D'), 
  ('Destination E', 'City E', 'State E'); 

-- create the deliveries table 
CREATE TABLE deliveries ( 
  delivery_id INT IDENTITY(1, 1) PRIMARY KEY, 
  delivery_destination INT, 
  delivery_date DATE, 
  delivery_start_time DATETIME, 
  delivery_end_time DATETIME, 
  FOREIGN KEY (delivery_destination) REFERENCES delivery_destinations (destination_id) 
); 

-- insert sample data into the deliveries table 
INSERT INTO deliveries (delivery_destination, delivery_date, delivery_start_time, delivery_end_time) 
VALUES 
  (1, '2022-01-01', '2022-01-01 10:00:00', '2022-01-01 11:00:00'), 
  (2, '2022-01-02', '2022-01-02 09:00:00', '2022-01-02 10:00:00'), 
  (3, '2022-01-03', '2022-01-03 08:00:00', '2022-01-03 09:00:00'), 
  (4, '2022-01-04', '2022-01-04 07:00:00', '2022-01-04 08:00:00'), 
  (1, '2022-01-05', '2022-01-05 06:00:00', '2022-01-05 07:00:00'), 
  (2, '2022-01-06', '2022-01-06 05:00:00', '2022-01-06 06:00:00'), 
  (3, '2022-01-07', '2022-01-07 04:00:00', '2022-01-07 05:00:00'), 
  (4, '2022-01-08', '2022-01-08 03:00:00', '2022-01-08 04:00:00');

1. Write a view that lists each delivery along with its destination name, city, and state.

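One possible answer, using the tables created above:

CREATE VIEW vw_delivery_details
AS
SELECT d.delivery_id, d.delivery_date, dd.destination_name, dd.city, dd.state
FROM deliveries d
INNER JOIN delivery_destinations dd ON d.delivery_destination = dd.destination_id;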

2. Write a view to retrieve the total number of deliveries for each delivery destination for the past 6 months, grouped by destination state and city.

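One possible answer; the six-month window is measured relative to the current date:

CREATE VIEW vw_delivery_counts_by_location
AS
SELECT dd.state, dd.city, COUNT(*) AS total_deliveries
FROM deliveries d
INNER JOIN delivery_destinations dd ON d.delivery_destination = dd.destination_id
WHERE d.delivery_date >= DATEADD(MONTH, -6, GETDATE())
GROUP BY dd.state, dd.city;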

3. Write a view to retrieve the average delivery time for each delivery destination, including the destination name, city, state, and average delivery time in hours.

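One possible answer, computing the delivery time from the start and end timestamps:

CREATE VIEW vw_avg_delivery_time
AS
SELECT dd.destination_name, dd.city, dd.state,
       AVG(DATEDIFF(MINUTE, d.delivery_start_time, d.delivery_end_time) / 60.0) AS avg_delivery_time_hours
FROM deliveries d
INNER JOIN delivery_destinations dd ON d.delivery_destination = dd.destination_id
GROUP BY dd.destination_name, dd.city, dd.state;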

Most Commonly Asked Interview Question

Q: What is the difference between a View and a Stored Procedure?

A: A View is a virtual table that retrieves data from one or more tables, whereas a Stored Procedure is a precompiled set of SQL commands that can perform actions such as insert, update, and delete data in the database. A View can be used to simplify the way you access data stored in the database, while a Stored Procedure can be used to perform complex operations.

For example, in a previous project, I used a View to retrieve the sum of delivery weight for each delivery destination in the transportation industry. This View was then used in multiple reports to display the delivery summary. On the same project, I also used a Stored Procedure to perform bulk updates to the delivery status based on certain criteria.

Conclusion

SQL Server Views are a powerful tool that can simplify the way you access data stored in the database. They can provide a consistent, controlled, and secure way of accessing the underlying data while abstracting the complexity of the underlying tables. In the transportation industry, SQL Server Views can be used to aggregate data, retrieve delivery details, and simplify the way you access data for reporting and analysis.


Interested in a career in Data Analytics? Book a call with our admissions team or visit training.colaberry.com to learn more.


Posted on February 7, 2023 by Kevin Guisarde.

SQL Server Indexes

SQL Server Indexes are an essential component of a well-designed database, providing developers with a powerful tool to improve query performance. Indexes are data structures that help SQL Server locate the data requested in a query quickly, by creating a sorted copy of the data that can be searched more efficiently. In this blog, we will explore the different types of indexes available in SQL Server and their applications in the Technology Industry. With indexes, developers can dramatically reduce query response times, improve application performance, and enhance the overall user experience. Whether you’re working with large databases or small, understanding the different types of indexes and how to use them effectively is critical to building efficient and effective applications.

Agenda

  1. Introduction to SQL Server Indexes
  2. Types of Indexes
  3. Real-World Example Questions in Technology Industry
  4. Most Commonly Asked Interview Question
  5. Conclusion

Introduction to SQL Server Indexes

SQL Server Indexes play a crucial role in improving the performance of database queries. They are essentially data structures that help SQL Server quickly locate the data that is requested in a query. In this blog, we will discuss the different types of indexes available in SQL Server and their applications in the Technology Industry.

Types of Indexes

Clustered Indexes

Clustered indexes determine the physical order of data in a table. A table can only have one clustered index, as the physical order of the data can only be determined in one way. Clustered indexes can improve query performance because they allow data to be retrieved in the order it is stored on disk.

Example:
Consider a table called Orders with columns OrderID, CustomerID, and OrderDate. If we frequently run queries that sort the data by OrderDate, it would make sense to create a clustered index on the OrderDate column.

CREATE CLUSTERED INDEX IX_Orders_OrderDate 
ON Orders (OrderDate)

Non-Clustered Indexes

Non-clustered indexes are separate from the table data, and they contain a copy of the indexed columns and a pointer to the location of the actual data in the table. A table can have multiple non-clustered indexes.

Example:
Consider a table called Customers with columns CustomerID, CustomerName, and City. If we frequently run queries that filter by the City column, it would make sense to create a non-clustered index on the City column.

CREATE NONCLUSTERED INDEX IX_Customers_City 
ON Customers (City)

Filtered Indexes

Filtered indexes are indexes that only include a subset of the rows in a table. Filtered indexes can improve query performance by reducing the size of the index, making it more efficient to use.

Example:
Consider a table called Customers with columns CustomerID, CustomerName, and IsActive. If we frequently run queries that filter on IsActive = 1, it would make sense to create a filtered index that only includes rows where IsActive is 1.

CREATE NONCLUSTERED INDEX IX_Customers_IsActive_Filtered 
ON Customers (IsActive) 
WHERE IsActive = 1

Composite Indexes

Composite indexes are indexes that contain multiple columns. A composite index can improve query performance when the index covers the query, meaning that all the columns that the query uses are included in the index.

Example:
Consider a table called Orders with columns OrderID, CustomerID, and OrderDate. If we frequently run queries that filter by both CustomerID and OrderDate, it would make sense to create a composite index on both CustomerID and OrderDate.

CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_OrderDate 
ON Orders (CustomerID, OrderDate)

Real-World Example Questions in Technology Industry

1. Consider a table called Orders with columns OrderID, CustomerID, OrderDate, and TotalAmount. Create an index that will improve the performance of queries that filter by OrderDate and return the TotalAmount.

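One reasonable choice, assuming the Orders table described in the question, is a non-clustered index keyed on OrderDate that includes TotalAmount so the query can be satisfied from the index alone:

CREATE NONCLUSTERED INDEX IX_Orders_OrderDate_TotalAmount
ON Orders (OrderDate)
INCLUDE (TotalAmount);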

2. Consider a table called Employees with columns EmployeeID, FirstName, LastName, HireDate, and DepartmentID. Create an index that will improve the performance of queries that sort by HireDate and filter by DepartmentID.

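One reasonable choice is a composite non-clustered index with the filter column first and the sort column second:

CREATE NONCLUSTERED INDEX IX_Employees_DepartmentID_HireDate
ON Employees (DepartmentID, HireDate);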

3. Consider a table called Invoices with columns InvoiceID, CustomerID, InvoiceDate, TotalAmount, and IsPaid. Create an index that will improve the performance of queries that filter by IsPaid and return the TotalAmount.

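One reasonable choice is a non-clustered index on IsPaid that includes TotalAmount; if the queries always target a single IsPaid value, it could also be defined as a filtered index:

CREATE NONCLUSTERED INDEX IX_Invoices_IsPaid_TotalAmount
ON Invoices (IsPaid)
INCLUDE (TotalAmount);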

Most Commonly Asked Interview Question

Q: What is an Index in SQL Server and how does it improve performance?

A:

An Index in SQL Server is a data structure that helps SQL Server quickly locate the data that is requested in a query. The data in an index is organized in such a way that allows SQL Server to search for data much faster than if it had to search through the entire table.

I have used indexes in several projects to improve the performance of database queries. For example, in a project for a software company, I was tasked with optimizing the queries for their bug-tracking system. I created several non-clustered indexes on the most frequently used columns in the “Bugs” table, such as “Status” and “Date_Modified”. This helped reduce the query execution time significantly, as the indexes made it much easier for SQL Server to locate the relevant data.

As a result of the improved performance, the software company was able to retrieve the information they needed to resolve bugs, which helped them release their software faster and with fewer bugs.

Conclusion

In conclusion, indexes play a crucial role in improving the performance of database queries in SQL Server. By creating indexes on frequently used columns, you can reduce the query execution time and retrieve data much faster.

Interested in a career in Data Analytics? Book a call with our admissions team or visit training.colaberry.com to learn more.


Posted on February 7, 2023 by Kevin Guisarde.

SQL Server Stored Procedures are a valuable tool for managing and maintaining complex database logic. Stored Procedures are precompiled sets of T-SQL statements that can be executed by calling the stored procedure name. They provide a convenient way to encapsulate a series of T-SQL statements into a single executable unit, making it easier to manage and maintain complex database logic. In this blog, we will discuss the benefits of using SQL Server Stored Procedures, including improved performance, security, and ease of maintenance. We will also explore the different types of Stored Procedures and provide examples of how they can be used in various industries. Whether you’re new to SQL Server or an experienced developer, understanding Stored Procedures can help you build more efficient and effective applications, and simplify the management of complex database logic.

Agenda

  1. Introduction to SQL Server Stored Procedures
  2. Different Stored Procedure Types using Examples from the Media Industry
  3. Real-World Example Questions in the Media Industry
  4. A Most Commonly Asked Interview Question in SQL Server Stored Procedures
  5. Conclusion

Introduction to SQL Server Stored Procedures

SQL Server Stored Procedures are precompiled sets of T-SQL statements that can be executed by calling the stored procedure name. They provide a convenient way to encapsulate a series of T-SQL statements into a single executable unit that can be executed repeatedly, making it easier to manage and maintain complex database logic.

Different Stored Procedure Types using Examples From The Media Industry

Simple Stored Procedures

A simple stored procedure is a basic stored procedure that only contains a single SELECT statement. This type of stored procedure is commonly used to retrieve data from a database.

Consider a media database that contains information about movies and their respective ratings. A simple stored procedure can be created to retrieve the titles of movies with a rating of 8 or higher:

CREATE PROCEDURE GetHighRatedMovies 
AS 
BEGIN 
  SELECT Title 
  FROM Movies 
  WHERE Rating >= 8 
END

Parameterized Stored Procedures

A parameterized stored procedure is a stored procedure that accepts parameters. These parameters can be used to filter data or customize the behavior of the stored procedure.

Consider a media database that contains information about movies and their respective ratings. A parameterized stored procedure can be created to retrieve the titles of movies with a specified rating:

CREATE PROCEDURE GetMoviesByRating (@minRating INT) 
AS 
BEGIN 
  SELECT Title 
  FROM Movies 
  WHERE Rating >= @minRating 
END

Stored Procedures with Output Parameters

A stored procedure with output parameters is a stored procedure that returns output in the form of parameters. These parameters can be used to return a value from the stored procedure to the calling code.

Example in Media Industry:
Consider a media database that contains information about movies and their respective ratings. A stored procedure with output parameters can be created to retrieve the total number of movies with a specified rating:

CREATE PROCEDURE GetMovieCountByRating (@minRating INT, @movieCount INT OUTPUT) 
AS 
BEGIN 
  SELECT @movieCount = COUNT(*) 
  FROM Movies 
  WHERE Rating >= @minRating 
END
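For illustration, this is how the procedure above could be called; the variable name @count is just an example:

DECLARE @count INT;
EXEC GetMovieCountByRating @minRating = 8, @movieCount = @count OUTPUT;
SELECT @count AS HighRatedMovieCount;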

Real-World Example Questions in the Media Industry

Script:

CREATE TABLE Movies ( 
  MovieID INT IDENTITY(1,1) PRIMARY KEY, 
  Title VARCHAR(100), 
  ReleaseYear INT, 
  Rating DECIMAL(3,1), 
  BoxOffice INT 
); 

INSERT INTO Movies (Title, ReleaseYear, Rating, BoxOffice) 
VALUES 
  ('The Avengers', 2012, 8.0, 1518594910), 
  ('The Dark Knight', 2008, 9.0, 534858444), 
  ('Inception', 2010, 8.8, 825532764), 
  ('Avatar', 2009, 7.8, 278900000), 
  ('The Lord of the Rings: The Return of the King', 2003, 9.0, 378800000), 
  ('The Matrix', 1999, 8.7, 171300000), 
  ('The Shawshank Redemption', 1994, 9.2, 283400000); 

1.  Write a query to retrieve the titles and release year of all movies that were released in the years 2000 or later, sorted by release year in ascending order.

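One possible answer, using the Movies table created above:

SELECT Title, ReleaseYear
FROM Movies
WHERE ReleaseYear >= 2000
ORDER BY ReleaseYear ASC;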

2. Write a query to retrieve the title and box office earnings of all movies that have a box office earning of more than $1 billion, sorted by box office earnings in descending order.

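One possible answer:

SELECT Title, BoxOffice
FROM Movies
WHERE BoxOffice > 1000000000
ORDER BY BoxOffice DESC;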

3. Write a query to retrieve the average rating and the standard deviation of the ratings of all movies.

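One possible answer, using the built-in AVG and STDEV aggregate functions:

SELECT AVG(Rating) AS AverageRating,
       STDEV(Rating) AS RatingStandardDeviation
FROM Movies;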

A Most Commonly Asked Interview Question in SQL Server Stored Procedures

Q: What is the difference between a stored procedure and a user-defined function in SQL Server?

A: A stored procedure and a user-defined function are two different types of database objects in SQL Server. The main difference between them is their usage and return type.

A stored procedure is used to perform a specific task, such as retrieving data from a database, inserting data into a database, or updating data in a database. Stored procedures can return multiple result sets, output parameters, and an integer return code, but they cannot be used inside a SELECT statement the way a function can.

On the other hand, a user-defined function is used to return a single value or a table. User-defined functions can only return a single value or a table, and they cannot return multiple result sets or output parameters.

In my previous project, I used both stored procedures and user-defined functions to build a database-driven application. I used stored procedures to perform tasks such as retrieving data from a database and inserting data into a database, and I used user-defined functions to return calculated values that were used in various parts of the application.

Conclusion

In conclusion, SQL Server Stored Procedures are a powerful tool for managing complex database logic. They provide a convenient way to encapsulate a series of T-SQL statements into a single executable unit, making it easier to manage and maintain complex database logic. With the different concept types and real-world example questions in the Media Industry, it’s clear that SQL Server Stored Procedures play a crucial role in the field of data analytics.

Interested in a career in Data Analytics? Book a call with our admissions team or visit training.colaberry.com to learn more.


Posted on February 7, 2023 by Kevin Guisarde.

The SQL Partition By Clause is a crucial tool in managing and analyzing large datasets in SQL Server and other relational database management systems. By leveraging this feature, it becomes possible to segment a vast result set into more manageable portions based on one or more columns in the data. This can lead to significant improvements in query execution times and make it easier to perform in-depth data analysis. Whether you are working with massive datasets or need to optimize your query performance, the SQL Partition By Clause is a powerful tool that can help you achieve your goals.

Agenda

  1. Introduction to SQL Partition By Clause
  2. Different Concept Types in SQL Partition By Clause
  3. Real-World Example Questions in Pharmaceutical Industry
  4. Most Commonly Asked Interview Question
  5. Conclusion

Introduction to SQL Partition By Clause

SQL Partition By Clause is an important concept in SQL Server and other relational database management systems. It is used to divide a large result set into smaller, manageable parts based on one or more columns in the data. This can improve the performance of query execution and make it easier to analyze data.

Different Concept Types in SQL Partition By Clause

There are several different types of partitions that can be created using the Partition By clause. Let’s look at each of these in detail, using examples from the Pharmaceutical industry.

Row Number Partition

This type of partition assigns a unique row number to each row in the result set. This is useful for pagination, as you can retrieve a specific range of rows based on the row number.

SELECT ROW_NUMBER() OVER (PARTITION BY ProductName ORDER BY SaleDate) AS RowNumber,     
  ProductName, 
  SaleDate, 
  SaleAmount 
FROM Sales 
WHERE Industry = 'Pharmaceutical'

In the above example, we are using the Row Number partition to assign a unique row number to each row in the result set. The partition is based on the ProductName column and the rows are ordered by SaleDate.

Rank Partition

This type of partition assigns a rank to each row in the result set, based on one or more columns. Rows with the same values will receive the same rank.

SELECT RANK() OVER (PARTITION BY ProductName ORDER BY SaleAmount DESC) AS Rank, 
  ProductName, 
  SaleDate, 
  SaleAmount 
FROM Sales 
WHERE Industry = 'Pharmaceutical'

In the above example, we are using the Rank partition to assign a rank to each row in the result set. The partition is based on the ProductName column and the rows are ordered by SaleAmount in descending order.

Dense Rank Partition

This type of partition assigns a dense rank to each row in the result set, based on one or more columns. Rows with the same values will receive the same rank, and there will not be any gaps in the ranks.

SELECT DENSE_RANK() OVER (PARTITION BY ProductName ORDER BY SaleAmount DESC) AS DenseRank,       
  ProductName, 
  SaleDate, 
  SaleAmount 
FROM Sales 
WHERE Industry = 'Pharmaceutical'

In the above example, we are using the Dense Rank partition to assign a dense rank to each row in the result set. The partition is based on the ProductName column and the rows are ordered by SaleAmount in descending order.

Real-World Example Questions in Pharmaceutical Industry

Before we move on to the example questions, let’s create the script to generate the table and records needed to answer the questions.

-- Script to create tables and data for Real World Example Questions

-- Table to store the Product Information
CREATE TABLE Product_Information (
    Product_ID INT PRIMARY KEY,
    Product_Name VARCHAR(100) NOT NULL,
    Manufacturer_ID INT NOT NULL,
    Product_Type VARCHAR(100) NOT NULL,
    Product_Launch_Date DATE NOT NULL
);

-- Table to store the Sales Information
CREATE TABLE Sales_Information (
    Sales_ID INT PRIMARY KEY,
    Product_ID INT NOT NULL,
    Sales_Date DATE NOT NULL,
    Sales_Quantity INT NOT NULL,
    Sales_Amount DECIMAL(18,2) NOT NULL
);

-- Insert Data into Product_Information
INSERT INTO Product_Information (Product_ID, Product_Name, Manufacturer_ID, Product_Type, Product_Launch_Date)
VALUES (1, 'Lipitor', 101, 'Cholesterol Lowering', '2020-01-01'),
       (2, 'Advil', 102, 'Pain Relief', '2020-01-01'),
       (3, 'Zocor', 101, 'Cholesterol Lowering', '2020-02-01'),
       (4, 'Aleve', 102, 'Pain Relief', '2020-02-01'),
       (5, 'Crestor', 103, 'Cholesterol Lowering', '2020-03-01'),
       (6, 'Tylenol', 102, 'Pain Relief', '2020-03-01');

-- Insert Data into Sales_Information
INSERT INTO Sales_Information (Sales_ID, Product_ID, Sales_Date, Sales_Quantity, Sales_Amount)
VALUES (1, 1, '2021-01-01', 100, 1000),
       (2, 1, '2021-02-01', 120, 1200),
       (3, 1, '2021-03-01', 130, 1300),
       (4, 2, '2021-01-01', 50, 500),
       (5, 2, '2021-02-01', 60, 600),
       (6, 2, '2021-03-01', 70, 700),
       (7, 3, '2021-01-01', 200, 2000),
       (8, 3, '2021-02-01', 220, 2200),
       (9, 3, '2021-03-01', 240, 2400),
       (10, 4, '2021-01-01', 80, 800),
       (11, 4, '2021-02-01', 90, 900),
       (12, 4, '2021-03-01', 100, 1000),
       (13, 5, '2021-01-01', 150, 1500),
       (14, 5, '2021-02-01', 170, 1700),
       (15, 5, '2021-03-01', 190, 1900),
       (16, 6, '2021-01-01', 60, 600),
       (17, 6, '2021-02-01', 70, 700),
       (18, 6, '2021-03-01', 80, 800);

1. Write a query that assigns a row number to each sale of every product, ordered by sales date, using ROW_NUMBER() with PARTITION BY.

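One possible answer, using the tables created above:

SELECT p.Product_Name,
       s.Sales_Date,
       s.Sales_Amount,
       ROW_NUMBER() OVER (PARTITION BY s.Product_ID ORDER BY s.Sales_Date) AS Sale_Row_Number
FROM Sales_Information s
INNER JOIN Product_Information p ON s.Product_ID = p.Product_ID;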

2. Write a query that ranks the products within each product type by their total sales amount, using RANK() with PARTITION BY.


3. Write a query that shows each sale along with a running total of the sales amount for that product, using SUM() with an OVER (PARTITION BY ... ORDER BY ...) clause.


Most Commonly Asked Interview Question and Answer

Q: Can you explain the use of Over(Partition By) clause in SQL?

A: The OVER (PARTITION BY) clause in SQL lets you apply a window function to a set of rows defined by a partition. In other words, the PARTITION BY clause divides the rows of a result set into groups based on the values in one or more columns, and the calculation is performed within each group.

For example, in a previous project, I had to analyze sales data for a pharmaceutical company. I used the Over(Partition By) clause to group the sales data by city and calculate the average salary of pharmaceutical sales representatives for each city. This allowed me to easily identify the cities with the highest and lowest average salaries.

In summary, the Over(Partition By) clause is a powerful tool for data analysis and can be used in a variety of scenarios, such as calculating running totals, moving averages, and percentiles.


Conclusion

In this blog, we covered the SQL Partition By Clause and its uses in the pharmaceutical industry. We went through different concept types and provided real-world examples and coding exercises to help you understand how to use the Over(Partition By) clause in SQL. Finally, we discussed a commonly asked interview question and provided a detailed answer to help you prepare for your next interview.

Interested in a career in Data Analytics? Book a call with our admissions team or visit training.colaberry.com to learn more.


Posted on February 7, 2023 by Kevin Guisarde.

A Comprehensive Guide to SQL Case Statement in Healthcare Industry

The SQL Case Statement is an essential feature of SQL that enables developers and data analysts to build conditional logic into their queries. By evaluating a set of conditions, the Case Statement returns a result that is based on the outcome of the evaluation. This functionality can be used to create complex, multi-level decision trees within a SQL query. With the Case Statement, data analysts can effectively analyze and extract specific data sets from vast data sources, making it an indispensable tool in the data analysis process. Overall, the SQL Case Statement is a powerful feature of SQL that provides developers and data analysts with greater flexibility and precision in their data analysis and decision-making capabilities.

Agenda

  1. Introduction to SQL Case Statement
  2. Types of SQL Case Statement with examples from the Healthcare Industry
  3. Real-World Example Questions in the Healthcare Industry
  4. Most Commonly Asked Interview Question and Answer
  5. Conclusion

Introduction to SQL Case Statement

SQL Case Statement is a conditional statement in SQL that returns a result based on the evaluation of a set of conditions. The Case Statement is used to implement conditional logic in SQL queries, making it a powerful tool for data analysis and decision-making.

Types of SQL Case Statement with Examples From The Healthcare Industry

Simple Case Statement

A Simple Case Statement compares a single expression against a list of values and returns the result for the first match. For example, in the Healthcare Industry, you can use a Simple Case Statement to look up the co-pay amount for a patient based on their insurance plan.

SELECT 
  PatientID, 
  PatientName, 
  InsurancePlan, 
  CASE InsurancePlan 
    WHEN 'Plan A' THEN 50 
    WHEN 'Plan B' THEN 40 
    ELSE 30 
  END AS CoPayAmount 
FROM Patients; 

Searched Case Statement

A Searched Case Statement evaluates a list of Boolean conditions and returns the result for the first condition that is true. For example, in the Healthcare Industry, you can use a Searched Case Statement to categorize patients based on their age.

SELECT 
  PatientID, 
  PatientName, 
  Age, 
  CASE 
    WHEN Age < 18 THEN 'Child' 
    WHEN Age BETWEEN 18 AND 64 THEN 'Adult' 
    ELSE 'Senior' 
  END AS PatientCategory 
FROM Patients; 

Nested Case Statement

A Nested Case Statement is used to evaluate multiple conditions within another Case Statement. For example, in the Healthcare Industry, you can use a Nested Case Statement to categorize patients based on their age and insurance plan.

SELECT 
  PatientID, 
  PatientName, 
  Age, 
  InsurancePlan, 
  CASE 
    WHEN Age < 18 THEN 'Child' 
    ELSE 
      CASE 
        WHEN InsurancePlan = 'Plan A' THEN 'Adult with Plan A' 
        WHEN InsurancePlan = 'Plan B' THEN 'Adult with Plan B' 
      ELSE 'Senior' 
    END 
  END AS PatientCategory 
FROM Patients; 

Real-World Example Questions in the Healthcare Industry

Script to generate tables and records needed for the real-world example questions:

CREATE TABLE Patients 
( 
  PatientID INT PRIMARY KEY, 
  PatientName VARCHAR(100), 
  Age INT, 
  InsurancePlan VARCHAR(100) 
); 
INSERT INTO Patients (PatientID, PatientName, Age, InsurancePlan) 
VALUES (1, 'John Doe', 35, 'Plan A'), (2, 'Jane Doe', 40, 'Plan B'), (3, 'John Smith', 50, 'Plan C'), (4, 'Jane Smith', 65, 'Plan A'), (5, 'Jim Brown', 25, 'Plan B'); 

1. What is the average age of patients with Plan A and Plan B insurance?


2. What is the total number of patients for each insurance plan?


3. List the patients and their age categories (Child, Adult with Plan A, Adult with Plan B, Senior) based on their age and insurance plan.

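One possible answer, treating patients aged 65 and over as Senior and falling back to a plain 'Adult' label for plans other than Plan A and Plan B:

SELECT 
  PatientID, 
  PatientName, 
  CASE 
    WHEN Age < 18 THEN 'Child' 
    WHEN Age >= 65 THEN 'Senior' 
    ELSE 
      CASE 
        WHEN InsurancePlan = 'Plan A' THEN 'Adult with Plan A' 
        WHEN InsurancePlan = 'Plan B' THEN 'Adult with Plan B' 
        ELSE 'Adult' 
      END 
  END AS PatientCategory 
FROM Patients;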

Most Commonly Asked Interview Question and Answer

Q: How do you use the SQL Case Statement in a real-world scenario?

A: I have used the SQL Case Statement in several real-world projects, including in the Healthcare Industry. One specific example is when I was working on a project to categorize patients based on their age and insurance plan. To accomplish this, I used a Nested Case Statement that evaluated the patient’s age and insurance plan and returned the appropriate patient category. The final result was a table that displayed the patient’s information, including their age category, which was used for further analysis and decision-making. The use of the Case Statement made the process of categorizing patients much simpler and more efficient.

Conclusion

The SQL Case Statement is a versatile and powerful tool for data analysis and decision-making in SQL. With the different types of Case Statements, including Simple, Searched, and Nested, you can implement complex conditional logic and return results based on multiple evaluations. By using examples from the Healthcare Industry, you can see the practical applications of the Case Statement and how it can be used to improve your data analysis and decision-making processes.

Interested in a career in Data Analytics? Book a call with our admissions team or visit training.colaberry.com to learn more.

Posted on February 6, 2023 by Kevin Guisarde.

SQL Server transactions are a critical feature of SQL that ensures data consistency and integrity. Transactions are a way to group multiple database operations into a single, atomic unit of work. This means that either all the operations within the transaction are committed, or none of them are committed, ensuring that the database remains in a consistent state at all times. Transactions also provide rollback functionality, which allows all changes made during a transaction to be undone if an error occurs, maintaining data integrity. These features make transactions an essential tool in maintaining data accuracy and consistency, particularly in high-volume, high-transaction environments. In summary, SQL Server transactions provide developers and data analysts with a powerful tool to manage data consistency and integrity while ensuring the database remains in a consistent state, even in the case of errors or failures.

Agenda

  1. Introduction to SQL Server Transactions
  2. Types of Transactions in SQL Server
  3. Real-world Examples in the Mortgage Industry
  4. Commonly Asked Interview Question
  5. Conclusion

Introduction to SQL Server Transactions

SQL Server transactions are used to ensure data consistency and integrity. They allow multiple operations to be performed as a single unit of work, either committing all changes or rolling back all changes if any errors occur. In other words, transactions in SQL Server guarantee that the database remains in a consistent state even in the case of failures or errors.

Types of Transactions in SQL Server

Different Concept Types

  • Implicit Transactions
  • Explicit Transactions
  • Savepoint Transactions
  • Distributed Transactions

Implicit Transactions

By default, SQL Server treats each individual statement, such as a single SELECT or UPDATE, as its own transaction: it is started and committed automatically, and once the statement completes its work cannot be rolled back. SQL Server also offers an implicit transaction mode (SET IMPLICIT_TRANSACTIONS ON), in which a transaction is started automatically by the first qualifying statement but must still be committed or rolled back explicitly.

Example in the Mortgage Industry:

SELECT * FROM Mortgages WHERE LoanAmount > 100000

Explicit Transactions

An explicit transaction is one that is explicitly started by the user. The user must explicitly commit or roll back the transaction.

Example in the Mortgage Industry:

BEGIN TRANSACTION
  UPDATE Mortgages
  SET LoanAmount = LoanAmount + 5000
  WHERE LoanType = 'Fixed'
COMMIT TRANSACTION

Savepoint Transactions

A savepoint transaction allows you to save the state of the transaction and then later roll back to that state.

Example in the Mortgage Industry:

BEGIN TRANSACTION
  UPDATE Mortgages
  SET LoanAmount = LoanAmount + 5000
  WHERE LoanType = 'Fixed'
  SAVE TRANSACTION Savepoint1
  UPDATE Mortgages
  SET LoanAmount = LoanAmount + 10000
  WHERE LoanType = 'Variable'
  ROLLBACK TRANSACTION Savepoint1
COMMIT TRANSACTION

Distributed Transactions

A distributed transaction is a transaction that involves two or more databases.

Example in the Mortgage Industry:

BEGIN DISTRIBUTED TRANSACTION
  UPDATE Mortgages
  SET LoanAmount = LoanAmount + 5000
  WHERE LoanType = 'Fixed'
  UPDATE MortgageRates
  SET Rate = Rate + 0.05
  WHERE LoanType = 'Fixed'
COMMIT TRANSACTION

Real-world Examples in the Mortgage Industry

Script to generate the necessary tables and records:

CREATE TABLE Mortgages (
  LoanID int PRIMARY KEY,
  LoanType varchar(20),
  LoanAmount money
)

INSERT INTO Mortgages (LoanID, LoanType, LoanAmount)
VALUES (1, 'Fixed', 100000),
       (2, 'Variable', 110000),
       (3, 'Fixed', 120000),
       (4, 'Variable', 130000)

CREATE TABLE MortgageRates (
  LoanType varchar(20) PRIMARY KEY,
  Rate decimal(5,2)
)

INSERT INTO MortgageRates (LoanType, Rate)
VALUES ('Fixed', 4.5),
       ('Variable', 5.0)

Advanced Practice Questions

1. What is the total sum of loan amounts for all mortgages with a loan type of ‘Fixed’ and a loan amount greater than $100,000?


2. What is the average loan amount for all mortgages with a loan type of ‘Variable’ and a loan amount less than $100,000?


3. Retrieve the mortgage record for the mortgage with the highest loan amount for each loan type.

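One possible answer, using a correlated subquery against the Mortgages table created above:

SELECT m.LoanID, m.LoanType, m.LoanAmount
FROM Mortgages m
WHERE m.LoanAmount = (SELECT MAX(m2.LoanAmount)
                      FROM Mortgages m2
                      WHERE m2.LoanType = m.LoanType);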

Commonly Asked Interview Question

Q. What is a transaction in SQL Server and how do you use it?

A. A transaction in SQL Server is a sequence of database operations that are executed as a single unit of work. Transactions are used to ensure data consistency and integrity by either committing all changes or rolling back all changes if any errors occur.

For example, in a previous project, I had to transfer funds from one bank account to another. To ensure the accuracy of the data, I used a transaction. I started the transaction, updated the balance of the first account, then checked if the balance was sufficient. If the balance was sufficient, I updated the second account and committed the transaction. If not, I rolled back the transaction.
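A minimal sketch of that pattern, assuming a hypothetical Accounts table with AccountID and Balance columns (not part of the mortgage script above):

BEGIN TRY
  BEGIN TRANSACTION;

  -- Withdraw from the first account
  UPDATE Accounts SET Balance = Balance - 500 WHERE AccountID = 1;

  -- If the balance went negative, abort the transfer
  IF (SELECT Balance FROM Accounts WHERE AccountID = 1) < 0
    THROW 50001, 'Insufficient funds.', 1;

  -- Deposit into the second account
  UPDATE Accounts SET Balance = Balance + 500 WHERE AccountID = 2;

  COMMIT TRANSACTION;
END TRY
BEGIN CATCH
  IF @@TRANCOUNT > 0
    ROLLBACK TRANSACTION;
END CATCH;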

Conclusion

SQL Server transactions play a crucial role in maintaining the consistency and integrity of data in the mortgage industry. Understanding the different types of transactions and how to use them is essential for data professionals in the industry.

Interested in a career in Data Analytics? Book a call with our admissions team or visit training.colaberry.com to learn more.

Jupyter Hub Architecture Diagram

Posted on March 15, 2021 by Yash.

Serving Jupyter Notebooks to Thousands of Users

In our organization, Colaberry Inc., we provide professionals from various backgrounds and levels of experience with the platform and the opportunity to learn Data Analytics and Data Science. The Jupyter Notebook platform is one of the most important tools for teaching Data Science. A Jupyter Notebook is a document, served by an open-source web application, that allows you to create and share content containing live code, equations, visualizations, and narrative text.

In this blog, we will learn the basic architecture of JupyterHub, the multi-user jupyter notebook platform, its working mechanism, and finally how to set up jupyter notebooks to serve a large user base.

Why Jupyter Notebooks?

In our platform, refactored.ai we provide users an opportunity to learn Data Science and AI by providing courses and lessons on Data Science and machine learning algorithms, the basics of the python programming language, and topics such as data handling and data manipulation.

Our approach to teaching these topics is to provide an option to “Learn by doing”. In order to provide practical hands-on learning, the content is delivered using the Jupyter Notebooks technology.

Jupyter notebooks allow users to combine code, text, images, and videos in a single document. This also makes it easy for students to share their work with peers and instructors. Jupyter notebook also gives users access to computational environments and resources without burdening the users with installation and maintenance tasks.

Limitations

One of the limitations of the Jupyter Notebook server is that it is a single-user environment. When you are teaching a group of students learning data science, the basic Jupyter Notebook server falls short of serving all the users.

JupyterHub comes to our rescue when it comes to serving multiple users, with their own separate Jupyter Notebook servers seamlessly. This makes JupyterHub equivalent to a web application that could be integrated into any web-based platform, unlike the regular jupyter notebooks.

JupyterHub Architecture

The below diagram is a visual explanation of the various components of the JupyterHub platform. In the subsequent sections, we shall see what each component is and how the various components work together to serve multiple users with jupyter notebooks.

Components of JupyterHub

Notebooks

At the core of this platform are the Jupyter Notebooks. These are live documents that contain user code, write-up or documentation, and results of code execution in a single document. The contents of the notebook are rendered in the browser directly. They come with a file extension .ipynb. The figure below depicts how a jupyter notebook looks:

 

Notebook Server

As mentioned above, the notebook servers serve jupyter notebooks as .ipynb files. The browser loads the notebooks and then interacts with the notebook server via sockets. The code in the notebook is executed in the notebook server. These are single-user servers by design.

Hub

Hub is the architecture that supports serving jupyter notebooks to multiple users. In order to support multiple users, the Hub uses several components such as Authenticator, User Database, and Spawner.

Authenticator

This component is responsible for authenticating the user via one of the several authentication mechanisms. It supports OAuth, GitHub, and Google to name a few of the several available options. This component is responsible for providing an Auth Token after the user is successfully authenticated. This token is used to provide access for the corresponding user.

Refer to JupyterHub documentation for an exhaustive list of options. One of the notable options is using an identity aggregator platform such as Auth0 that supports several other options.

User Database

Internally, JupyterHub uses a user database to store user information. This information is used to spawn a separate user pod for each logged-in user and to serve the notebooks contained within that user's pod.

Spawner

A spawner is a worker component that creates an individual server, or user pod, for each user allowed to access JupyterHub. This mechanism ensures multiple users are served simultaneously. Note that there is a predefined limit on how many user pods can be spawned for the first time simultaneously, roughly 80 concurrent spawns. However, this does not affect regular use of the individual servers after the initial user pods have been created.

How It All Works Together

The mechanism used by JupyterHub to authenticate multiple users and provide them with their own Jupyter Notebook servers is described below.

  1. The user requests access to the Jupyter notebook via the JupyterHub (JH) server.
  2. The JupyterHub then authenticates the user using one of the configured authentication mechanisms such as OAuth. This returns an auth token to the user to access the user pod.
  3. A separate Jupyter Notebook server is created and the user is provided access to it.
  4. The requested notebook in that server is returned to the user in the browser.
  5. The user then writes code (or documentation text) in the notebook.
  6. The code is then executed in the notebook server and the response is returned to the user's browser.

Deployment and Scalability

The JupyterHub servers can be deployed in two different ways:

  • Deployed on a cloud platform such as AWS or Google Cloud Platform, using Docker and Kubernetes clusters to scale the servers to support thousands of users.
  • A lightweight deployment on a single virtual instance to support a small set of users.

Scalability

In order to support a few thousand users and more, we use the Kubernetes cluster deployment on the Google Cloud platform. Alternatively, this could also have been done on the Amazon AWS platform to support a similar number of users.

This uses a Hub instance and multiple user instances each of which is known as a pod. (Refer to the architecture diagram above). This deployment architecture scales well to support a few thousand users seamlessly.

To learn more about how to set up your own JupyterHub instance, refer to the Zero to JupyterHub documentation.

Conclusion

JupyterHub is a scalable architecture of Jupyter Notebook servers that supports thousands of users in a maintainable cluster environment on popular cloud platforms.

This architecture suits several use cases with thousands of users and a large number of simultaneous users, for example, an online Data Science learning platform such as refactored.ai.


Posted on October 21, 2020 by Yash.

Load Testing Jupyter Notebooks

Introduction

Consider this scenario: you set up a JupyterHub environment (to learn more, go to the JupyterHub section below) so that over 1000 participants of your online workshop can access Jupyter notebooks in JupyterHub. How do you ensure that the workshop runs smoothly? How do you ensure that the cloud servers you allocated for this event are sufficient? You might first reference similar threads to this one:

https://stackoverflow.com/questions/46569059/how-to-stress-load-test-jupyterhub-for-multiple-users

To learn about the implementation, read on.

Performance Testing

Performance, scalability, and reliability of applications are key non-functional requirements of any product or service. This is especially true when a product is expected to be used by a large number of users.

This document overviews the Refactored platform and its JupyterHub environment, describes effective load tests for these types of systems, and addresses some of the main challenges the JupyterHub community faces when load testing in JupyterHub environments. The information presented in this document will benefit anyone interested in running load/stress tests in the JupyterHub environment.

Refactored

Refactored is an interactive, on-demand data training platform powered by AI. It provides a hands-on learning experience to accommodate various learning styles and levels of expertise. The platform consists of two core components:

  • Refactored website
  • JupyterHub environment.

JupyterHub

Jupyter is an open-source tool that provides an interface to create and share documents, including live code, the output of code execution, and visualizations. It also includes cells to create code or project documentation – all in a single document.

JupyterHub brings the power of notebooks to groups of users. It gives users access to computational environments and resources without burdening the users with installation and maintenance tasks. Students, researchers, and data scientists can get their work done in their workspaces on shared resources, that can be managed efficiently by system administrators.

To learn how to create a JupyterHub setup using a Kubernetes cluster, go to https://zero-to-jupyterhub.readthedocs.io/en/latest/.

Load Testing Approach

Running load tests on JupyterHub requires a unique approach. This tool differs significantly in the way it works as compared to a regular web application. Further, a modern authentication mechanism severely limits the options available to run seamless end-to-end tests.

Load Testing Tool

We use k6.io to perform load testing on Refactored in the JupyterHub environment.
k6.io is a developer-centric, free, and open-source load testing tool built to ensure an effective and intuitive performance testing experience.

JupyterHub Testing

To start load testing the JupyterHub environment, we need to take care of the end-to-end flow. This includes the server configurations, login/authentication, serving notebooks, etc.

Since we use a cloud authentication provider on Refactored, we ran into issues testing the end-to-end flow due to severe restrictions on load testing cloud provider components. Generally, load testing such platforms is the responsibility of the cloud application provider, as they are typically well tested for load and scalability. Hence, we decided to temporarily remove the authentication from the cloud provider and use a dummy authenticator for the JupyterHub environment.

To do that, we needed to change the configuration in k8s/config.yaml under the JupyterHub code.

Find the configuration entry that specifies authentication below:

auth:
  type: custom
  custom:
    className: ***oauthenticator.***Auth0OAuthenticator
    config:
      client_id: "!*****************************"
      client_secret: "********************************"
      oauth_callback_url: "https://***.test.com/hub/oauth_callback"
  admin:
    users:
      - admin

In our case, we use a custom authenticator, so we changed it to the following dummy authenticator:

auth:
  type: dummy
  dummy:
    password: "*******1234"
  admin:
    users:
      - admin

GCP Configuration

In the GCP console under the Kubernetes cluster, take a look at the user pool as shown below:

Edit the user pool to reflect the number of nodes required to support load tests.
There might be a calculation involved to figure out the number of nodes.
In our case, we wanted to load test 300 user pods in JupyterHub. We created about 20 nodes as below:

 

k6 Configuration

k6.io is a tool for running load tests. It uses JavaScript (ES6 JS) as the base language for creating the tests.

First, install k6 from the k6.io website. There are 2 versions – cloud & open-source versions. We use the open-source version downloaded into the testing environment.

To install it on Debian Linux-based unix systems:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 379CE192D401AB61
echo "deb https://dl.bintray.com/loadimpact/deb stable main" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get install k6

Check the documentation at https://github.com/loadimpact/k6 for other types of OS.

Running Load Tests

Creating Test Scripts

In our case, we identified the key tests to be performed on the JupyterHub environment:
1. Login.
2. Heartbeat check.
3. Roundtrip for Jupyter notebooks.

To run tests on Refactored, we create the .js module that does the following via JS code.

1. Imports

import { check, group, sleep } from 'k6';
import http from 'k6/http';

2. Configuration options

We set up the configuration options ahead of the tests. These options include the duration of the tests, number of users, maximum users simulated, and other parameters such as shut-down grace time for the tests to complete.

Here is a sample set of options:
export let options = {
  max_vus: 300,
  vus: 100,
  stages: [
    { duration: "30s", target: 10 },
    { duration: "4m", target: 100 },
    { duration: "30s", target: 0 }
  ],
  thresholds: {
    // illustrative threshold on the custom RTT (round-trip time) metric
    "RTT": ["avg<500"]
  }
};

3. Actual tests

We created the actual tests as JS functions within group objects provided by the k6.io framework. We had various groups, including a login group, a heartbeat group, and other individual module check groups. Further groups can be nested within those groups. A sketch of the login and notebook round-trip groups follows the sample below.

Here is a sample set of groups to test our JupyterHub environment:

// url_deepracer_home, url_create_aws, url_intro_autonmous, and url_intro_ml
// are course page URLs defined elsewhere in the test module.
export default function() {

  group('v1 Refactored load testing', function() {

    group('heart-beat', function() {
      let res = http.get("https://refactored.ai");
      check(res, { "status is 200": (r) => r.status === 200 });
    });

    group('course aws deep racer - Home', function() {
      let res = http.get(url_deepracer_home);
      check(res, {
        "status is 200": (r) => r.status === 200,
        "AWS Deepracer Home .. done": (r) => r.body.includes('<h3>AWS DeepRacer</h3>')
      });
    });

    group('course aws deep racer Pre-Workshop - Create your AWS account', function() {
      let res = http.get(url_create_aws);
      check(res, {
        "status is 200": (r) => r.status === 200,
        "Create AWS account.. done": (r) => r.body.includes('<h1 class="main_heading">Create your AWS account</h1>')
      });
    });

    group('course aws deep racer Pre-Workshop - Introduction to Autonomous Vehicle', function() {
      let res = http.get(url_intro_autonmous);
      check(res, {
        "status is 200": (r) => r.status === 200,
        "Introduction to Autonomous Vehicle.. done": (r) => r.body.includes('<h1 class="main_heading">Introduction to Autonomous Vehicles</h1>')
      });
    });

    group('course aws deep racer Pre-Workshop - Introduction to Machine learning', function() {
      let res = http.get(url_intro_ml);
      check(res, {
        "status is 200": (r) => r.status === 200,
        "Introduction to Machine learning.. done": (r) => r.body.includes('<h1 class="main_heading">Introduction to Machine learning</h1>')
      });
    });

  });
}
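The login and notebook round-trip checks mentioned earlier follow the same pattern. Here is a minimal sketch, assuming the dummy authenticator configured above; the base URL, notebook path, and helper function name are illustrative and are not copied from our production script:

// Minimal sketch of the login and notebook round-trip groups (names and URLs are illustrative).
// Uses the check, group, sleep, and http imports shown earlier in the module.
const hub_url = 'https://***.test.com';   // hypothetical JupyterHub base URL

export function loginAndReadNotebook(username) {
  group('login', function() {
    // Dummy authenticator: any username with the shared password set in k8s/config.yaml
    let res = http.post(hub_url + '/hub/login', {
      username: username,
      password: '*******1234'
    });
    check(res, { 'The login is successful..': (r) => r.status === 200 });
  });

  group('get 01-Basic_data_types notebook', function() {
    let res = http.get(hub_url + '/user/' + username + '/notebooks/01-Basic_data_types.ipynb');
    check(res, {
      'Notebook loaded': (r) => r.status === 200,
      '01-Basic_data_types.. done': (r) => r.body.includes('01-Basic_data_types')
    });
    sleep(1);   // brief pause before the next request
  });
}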

Load Test Results

The results of the load test are displayed while running the tests.

Start of the test.

The test is run by using the following command:

root@ip-172-31-0-241:REFACTORED-SITE-STAGE:# k6 run -u 300 -i 300 dsin100days_test.js

The -u and -i parameters specify the number of virtual users to simulate and the number of iterations to perform, respectively.
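The same values can also be set inside the test script rather than on the command line. This is a minimal sketch using standard k6 options; the numbers simply mirror the command above:

// Script-side equivalent of "-u 300 -i 300"
export let options = {
  vus: 300,         // virtual users to simulate
  iterations: 300   // total iterations shared across those users
};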

The first part of the test displays the configuration options, the test scenario, and the list of users created for the test.

 

Test execution.

Further results display the progress of the test. In this case, the login process, the creation of user pods, and the reading of the notebooks are displayed. Here is a snapshot of the output:

INFO[0027] loadtestuser25 Reading 1st notebook

INFO[0027] loadtestuser20 Reading 1st notebook

INFO[0027] loadtestuser220 Reading 1st notebook

INFO[0027] loadtestuser64 Reading 1st notebook

INFO[0027] loadtestuser98 Reading 1st notebook

INFO[0027] loadtestuser194 Reading 1st notebook

INFO[0028] loadtestuser273 Reading 1st notebook

INFO[0028] loadtestuser261 Reading 1st notebook

INFO[0028] loadtestuser218 Reading 1st notebook

INFO[0028] loadtestuser232 Reading 1st notebook

INFO[0028] loadtestuser52 Reading 1st notebook

INFO[0028] loadtestuser175 Reading 1st notebook

INFO[0028] loadtestuser281 Reading 1st notebook

INFO[0028] loadtestuser239 Reading 1st notebook

INFO[0028] loadtestuser112 Reading 1st notebook

INFO[0028] loadtestuser117 Reading 1st notebook

INFO[0028] loadtestuser159 Reading 1st notebook

INFO[0029] loadtestuser189 Reading 1st notebook

Final results

After the load test is completed, a summary of the test results is produced. This includes the time taken to complete the test and other statistics on the actual test.

Here is the final section of the results:

running (04m17.1s), 000/300 VUs, 300 complete and 0 interrupted iterations

default ✓ [======================================] 300 VUs  04m17.1s/10m0s  300/300 shared items

█ v1 Refactored Jupyter Hub load testing

█ login

✗ The login is successful..

↳  89% — ✓ 267 / ✗ 33

█ Jupyter hub heart-beat

✗  Notebooks Availability…done

↳  94% — ✓ 284 / ✗ 16

✓ heart-beat up..

█ get 01-Basic_data_types notebook

✓ Notebook loaded

✓ 01-Basic_data_types.. done

█ get 02-Lists_and_Nested_Lists notebook

✓ 02-Lists_and_Nested_Lists.. done

✓ Notebook loaded

█ get dealing-with-strings-and-dates notebook

✓ Notebook loaded

✓ dealing-with-strings-and-dates.. done

checks…………………: 97.97% ✓ 2367  ✗ 49

data_received…………..: 43 MB  166 kB/s

data_sent………………: 1.2 MB 4.8 kB/s

group_duration………….: avg=15.37s   min=256.19ms med=491.52ms max=4m16s    p(90)=38.11s   p(95)=40.86s

http_req_blocked………..: avg=116.3ms  min=2.54µs   med=2.86µs   max=1.79s    p(90)=8.58µs   p(95)=1.31s

http_req_connecting……..: avg=8.25ms   min=0s       med=0s       max=98.98ms  p(90)=0s       p(95)=84.68ms

http_req_duration……….: avg=3.4s     min=84.95ms  med=453.65ms max=32.04s   p(90)=13.88s   p(95)=21.95s

http_req_receiving………: avg=42.37ms  min=31.37µs  med=135.32µs max=11.01s   p(90)=84.24ms  p(95)=84.96ms

http_req_sending………..: avg=66.1µs   min=25.26µs  med=50.03µs  max=861.93µs p(90)=119.82µs p(95)=162.46µs

http_req_tls_handshaking…: avg=107.18ms min=0s       med=0s       max=1.68s    p(90)=0s       p(95)=1.23s

http_req_waiting………..: avg=3.35s    min=84.83ms  med=370.16ms max=32.04s   p(90)=13.88s   p(95)=21.95s

http_reqs………………: 3115   12.114161/s

iteration_duration………: avg=47.15s   min=22.06s   med=35.86s   max=4m17s    p(90)=55.03s   p(95)=3m51s

iterations……………..: 300    1.166693/s

vus……………………: 1      min=1   max=300

vus_max………………..: 300    min=300 max=300

Interpreting Test Results

In the above results, the key metrics are listed below; a sketch of how to enforce them as pass/fail thresholds follows the list.

  1. http_reqs: the request rate. In this case it is 12.11 requests per second, which is relatively low because the run includes first-time requests: during the initial run, the code is synced with GitHub and the first server spawns introduce idle time. In other runs, there could be as many as 70 requests per second.
  2. vus_max: the maximum number of virtual users supported (300).
  3. iterations: 300 iterations were completed; this number can be scaled up as needed.
  4. http_req_waiting: the average wait time during the round trip, 3.35 s.
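If we want k6 to fail the run automatically when these metrics regress, they can be encoded as thresholds in the options block. This is a minimal sketch; the specific limit values are illustrative assumptions, not figures from our test plan:

export let options = {
  thresholds: {
    // average round-trip wait time should stay under 5 seconds (illustrative limit)
    http_req_waiting: ["avg<5000"],
    // sustain at least 10 requests per second over the whole run (illustrative limit)
    http_reqs: ["rate>10"]
  }
};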

Running Individual Tests

The final step in this process is testing an individual user to capture some key metrics around their usage of the JupyterHub environment.

The key metrics include:

  1. Login-timed test
  2. Notebook roundtrip
  3. Notebook functions: start kernel, run all cells

This is performed using a headless browser tool. In our case, we use PhantomJS because we are familiar with it, although other tools perform as well or better. A minimal timed-check sketch follows the list of performance metrics below.

Before we do the test, we must define the performance required from the page loads. The performance metrics include:

  1. Load the notebook within 30 seconds.
  2. Basic Python code execution must be completed within 30 seconds of the start of the execution.
  3. In exceptional cases, where the code is more complex, execution must not exceed 3 minutes. This applies to the core data science code; there may be further exceptions to this rule for advanced notebooks.
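As a rough illustration of the first metric, here is a minimal PhantomJS-style sketch that times a notebook page load against the 30-second budget. The notebook URL and script name are hypothetical; the full individual-user tests are covered in the next paper:

// phantomjs notebook_load_check.js -- minimal sketch, not the full individual-user test
var page = require('webpage').create();
var start = Date.now();
var BUDGET_MS = 30000;   // 30-second page-load budget from the metrics above
var url = 'https://***.test.com/user/loadtestuser1/notebooks/01-Basic_data_types.ipynb';  // hypothetical notebook URL

page.open(url, function(status) {
    var elapsed = Date.now() - start;
    if (status === 'success' && elapsed <= BUDGET_MS) {
        console.log('PASS: notebook loaded in ' + elapsed + ' ms');
    } else {
        console.log('FAIL: status=' + status + ', elapsed=' + elapsed + ' ms');
    }
    phantom.exit();
});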

In the next technical paper, we will explore how to run individual tests.

Written by Manikandan Rangaswamy

Image of people at office desk in discussion

Posted on March 3, 2020 by Yash .

5 Ways To Ace Your Business Intelligence Interview – Part 4

Human skills are some of the most important (and often forgotten) tools to help you succeed in your business intelligence interviews. In this mini-blog, we’re sharing 4 core human skills you can master to improve your odds of landing your next data science role.

In the age of automation, business information and intelligence, if utilized strategically, has the power to propel a business far above its competitors as well as exponentially boost brand awareness and profitability. This makes business intelligence roles extremely valuable to corporations. 

The BI and analytics industry is expected to soar to a value of $26.50 billion by the end of 2021. Moreover, companies that use BI analytics are five times more likely to make swifter, more informed decisions. In this fourth post of our 5-part mini blog series, we will show you how mastering core human skills can help you ace your next business intelligence interview.

According to Forbes, “dashboards, reporting, end-user self-service, advanced visualization, and data warehousing are the top five most important technologies and initiatives strategic to BI in 2018.” How do your skills line up with these initiatives? Are you ready to succeed in your next business intelligence interview?

5 Simple Ways To Ace Your Next Business Intelligence Interview

1) The First 5 Minutes Matter Most!

2) Come Prepared – Know Exactly What the Role Demands

3) Dress for Success

4) Master Core Human Skills

Career automation and AI are streamlining job roles and cutting out unnecessary positions, so job hunters looking to ace their business intelligence interview need to adapt to the Future of Work and develop presentation skills that are both practical and necessary. The Future of Work isn’t just about knowing tech or being able to code; the most necessary skills are the ones applicants most often overlook at interview time. These four skills are critically important and will help you stand out and ace your next business intelligence interview:

  • Grit 
  • Emotional Intelligence
  • Critical/Creative Thinking 
  • Problem-Solving 

Human skills show that you can “go the long haul” and not give up when situations get challenging. These important skills also show that you are flexible and can think outside of the box. Moreover, presenting these skills by describing how you work on projects and displaying your problem-solving skills is extremely important. Colaberry’s programs help you develop all of the aforementioned skills.

Stay tuned for the next mini-blog in this series. We will be releasing a short blog twice a week to help you ace your next business intelligence interview! 

Our Program 

Colaberry has been providing one-of-a-kind, career-oriented training in data analytics and data science since 2012. We offer instructor-led onsite and online classes. Learn with us in person on our campus in Plano, Texas, or remotely from the comfort of your home. We have helped over 5,000 people to transform their lives with our immersive boot camp-style programs.

In-Demand Skills

Colaberry training programs equip you with in-demand tech and human skills.  Our up-to-date lessons and carefully crafted curriculum set you up for success from day one. Throughout the training and the job search, our mentors will support and guide you as you transition into a fast-paced and exciting field of data analytics and data science. 

Project-Based Learning

Our programs integrate projects that are based on real-world scenarios to help you master a new concept, tool, or skill. We work with you to build your portfolio to showcase your skills and achievements. 

Award Winning Learning Platform

You will be learning using our homegrown technology platform Refactored AI which is recognized as the “Most Promising Work of the Future Solution” in global competition by MIT SOLVE. Our platform also received General Motors’ “Advanced Technology” prize and McGovern Foundation’s “Artificial Intelligence for Betterment of Humanity” prize. 

 Placement Assistance

Colaberry’s program, platform, and ecosystem empower you with the skills you need to succeed in your job interviews and transition into a high-paying career. Over one-third of Colaberry graduates receive job offers after their first in-person interview. We provide continuous mentoring and guidance until you land your job, and twelve months of post-placement support so that you not only survive but thrive in your career.

Financial Aid

At Colaberry, we strive to create opportunities for all. We work with each individual to ensure the cost of the training does not hold them back from becoming future-ready. We offer various payment plans, payment options, and scholarships to work with the financial circumstances of our learners. 

Military Scholarship

Colaberry is committed to supporting men and women who have served our country in uniform. As part of this commitment, we offer Military Scholarships to enable active-duty and retired military members to transition into civilian life. We have already helped numerous veterans by creating a pathway to rewarding and exciting careers in data science and data analytics. We hold alumni events and provide an extensive support system and a strong community of veterans to help our students succeed. Contact our enrollment team to find out more about how we can help you. 

Apply Now to Start Your New Career!