Data Mastery with Python and SQL: Unleashing Efficiency and Security through 4 Strategic Use Cases | by Subha Ganapathi | May, 2023


Leverage techniques to extract hashtags, suppress warnings, and utilize parameterized queries for powerful data analysis

Photo by Kevin Canlas on Unsplash

Data analysis and management are essential components of any modern enterprise’s operations. To effectively harness the power of data, professionals rely on a combination of programming languages and tools that enable efficient data processing, manipulation, and analysis. In this article, we explore the incredible capabilities of Python and SQL, two fundamental languages widely utilized by data analysts and scientists for effective decision-making.

It is a known fact that Python provides a plethora of libraries and frameworks to tackle diverse data-related challenges. Together, Python and SQL form a powerful duo that enables data professionals to unlock the full potential of their data and mine their databases more effectively.

In this article, we look into four distinct use cases that demonstrate the effectiveness and synergy of Python and SQL in addressing various data challenges. Each use case represents a unique scenario where the combined prowess of Python’s flexibility and SQL’s querying capabilities shines through.

Let’s dive in!

Use Case 1: Enhance the readability of SQL Queries written in Python

Imagine you’re using a Web API to connect to a cloud database, such as GridDB, and retrieve data by executing SQL queries. When constructing an HTTP request body, specifically for an API endpoint that accepts JSON payloads, the SQL query needs to be incorporated in the request body. But the challenge lies with real-world queries.

Real-world SQL queries tend to become more intricate and can often become challenging to maintain proper indentation, incorporate line breaks, and format the code for readability. Additionally, when constrained to writing queries in a single line in a Python notebook such as VScode or Jupyter, it becomes impossible to add helpful comment lines to explain the code’s functionality. Notebooks with such queries become difficult to maintain and debug in the long run.

The below solution allows you to write SQL queries in multiple lines, which improves code readability and maintainability. By using proper line breaks and indentation, you can easily organize and understand complex queries without sacrificing clarity.

Here is a better way to write SQL queries in Python.

sql_query1 = (f"""
-- Query to classify sales data by region, category, and sales classification

SELECT region,
CASE WHEN lower(category) LIKE 'a%' THEN 'Category A'
WHEN lower(category) LIKE 'b%' THEN 'Category B'
WHEN lower(category) LIKE 'c%' THEN 'Category C'
ELSE 'Other Categories'
END AS category_classification,
CASE WHEN subquery.total_sales BETWEEN 1 AND 1000 THEN 'Low Sales'
WHEN subquery.total_sales BETWEEN 1001 AND 5000 THEN 'Medium Sales'
WHEN subquery.total_sales > 5000 THEN 'High Sales'
END AS sales_classification
FROM Sales_Dataset
JOIN (
SELECT region, SUM(sales) AS total_sales
FROM Sales_Dataset
GROUP BY region
) AS subquery
ON Sales_Dataset.region = subquery.region
GROUP BY 1, 2
ORDER BY 3 DESC
""")

This can then passed to the request body using the JSON library in Python as shown below –

import json
# Construct the request body
request_body = json.dumps([
{
"type": "sql-select",
"stmt": sql_query1
}
])

# Validate the constructed request body
print(request_body)

It creates a JSON object with a single item, which contains the type of operation (“sql-select”) and the SQL query statement (sql_query1). The json.dumps() function is used to convert the Python dictionary into a JSON string representation.

You can then proceed to posting your request using the requests library.

data_req1 = requests.post(url, data=request_body, headers=header_obj)

Use Case 2: Extract hashtags from a text column

When conducting social media analysis of platforms like TikTok and Instagram, it is common to extract data through APIs and store it in databases like Azure or Redshift. However, the API response often includes the content as a string, with hashtags scattered throughout the video titles. To address this, the following query can be used to extract hashtags from a a text column such as a video title.

select * from 
(SELECT distinct TRIM(SPLIT_PART(title, '#', num)) AS hashtag
FROM social_media_video_info
CROSS JOIN (
SELECT 1 AS num UNION ALL
SELECT 2 AS num UNION ALL
SELECT 3 AS num UNION ALL
SELECT 4 AS num UNION ALL
SELECT 5 AS num UNION ALL
SELECT 6 AS num UNION ALL
SELECT 7 AS num UNION ALL
SELECT 8 AS num UNION ALL
SELECT 9 AS num UNION ALL
SELECT 10 AS num
) AS nums
WHERE num <= LENGTH(title) - LENGTH(REPLACE(title, '#', '')) + 1
AND TRIM(SPLIT_PART(title, '#', num)) <> ''
)
where hashtag not like '% %'

The subquery performs the following steps:

  • It splits the “title” column of the “social_media_video_info” table using the ‘#’ character as the delimiter.
  • The function SPLIT_PART(title, ‘#’, num) extracts the portion of the “title” column delimited by ‘#’ at the specified “num” position.
  • The TRIM() function removes any leading or trailing spaces from the extracted portion.
  • The DISTINCT keyword ensures that only unique hashtags are selected.
  • The CROSS JOIN with the subquery “nums” generates a temporary result set with numbers from 1 to 10.
  • The condition num <= LENGTH(title) — LENGTH(REPLACE(title, ‘#’, ‘’)) + 1 ensures that the splitting is done up to the maximum number of hashtags in the “title” column.
  • The condition TRIM(SPLIT_PART(title, ‘#’, num)) <> ‘’ filters out any empty hashtags.
  • In summary, the query extracts hashtags from the “title” column of the “social_media_video_info” table by splitting the column values using ‘#’ as the delimiter. It ensures that only unique non-empty hashtags are selected, and any hashtags containing spaces are excluded from the result. Note that this query only considers upto 10 hashtags per title.

Use Case 3: Suppressing Future and Deprecation Warnings in Python

The purpose of these lines is to suppress future warnings and deprecation warnings during the execution of the program.

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

This can be useful in situations where you are confident that the code you are using is compatible with the current version of the library and you don’t want to be alerted about potential issues or deprecated features.

Warnings often provide valuable information about potential problems or changes in future versions of libraries. It’s generally recommended to address and resolve the underlying issues causing the warnings rather than ignoring them outright. The ‘warnings’ module also provides a simplefilter() option which can be used to control how warnings are handled in your Python code.

Use Case 4: Use parameterized queries where possible

When executing SQL queries using Python, it is recommended to use parameterized queries or prepared statements instead of directly embedding values in the SQL statement. This helps prevent SQL injection attacks and ensures proper handling of data types.

Let’s say that your application or script uses the below select query –

SELECT * FROM 
TABLE
WHERE
COLUMN1 IN ('abcd')

If an attacker wants to insert a malicious value in your database, they can use this query as a loophole to do so. Below is a basic example of how the attacker can add an insert statement to a ‘select query’ and inject unnecessary values to the database. The grey portion below is the malicious input that the attacker provided which results in 2 queries being executed instead of one — 1. select and 2. insert.

SQL Injection Attack — Scenario Highlighted (Image by author)

Note that this is not limited to an Insert table. An attacker can run another select or update or delete or even do a drop table. Imagine how disastrous a ‘DROP TABLE’ would be!

SQL injection can be prevented by sanitizing the input as well as by parameterization. Let’s see each of them in detail.

Parameterization

Parameterization can be done manually or through packages available in Python. Imagine that you use a survey application from which you send surveys for users to fill out. Users are asked to type-in details in two text boxes provided. Let’s suppose that every time the user inputs the detail in both text boxes, an Insert SQL query runs in the background such as the one below –

theVal1 = comes from survey textbox1
theVal2 = comes from survey textbox2

Below is the application code that runs in the background –

  sql = "INSERT INTO TABLE VALUES ('" + theVal1 + "','" + theVal2 + "')"

Let’s say the 1st user enters A3 in textbox1 and enters A4 in the textbox2. The query that runs in the backend will be as below –

INSERT INTO TABLE VALUES ('A3','A4')

Let’s say the second user is a hacky user. This user understands your table structure and backend query. If they are onto some mischief, they can try to maliciously insert an additional record using this.

Let’s suppose that the user types in the value A1 in textbox 1 and the below value in textbox2

A2 ');INSERT INTO TABLE VALUES ('B1','B2

What essentially happens is that the value gets appended to the backend query to become the below –

INSERT INTO TABLE VALUES ('A1','A2');INSERT INTO TABLE VALUES ('B1','B2')

Thus, 2 records will be inserted by this hacky user.

Your table will have three values, 1 inserted by the first user, and 2 inserted by the second user –

Backend Table (Image by author)

Input Sanitization

Sanitization can be done by escaping special characters in the input. This involves replacing or encoding characters that have a special meaning in the target context (e.g., SQL queries) with their escaped counterparts. For example, in SQL, the single quote character (‘) is commonly escaped by doubling it (‘’) i.e., replacing the single quote in the string with 2 single quotes. In other words, we can manually escape the input values before putting it into the query. We can use str.replace to do this.

While the application code remains the same, we add a couple of string replace statements as shown below –

theVal1 = comes from survey textbox1
theVal2 = comes from survey textbox2

escapedVal1 = theVal1.replace("'", "''")
escapedVal2 = theVal2.replace("'", "''")
sql = "INSERT INTO TABLE VALUES ('" + escapedVal1 + "','" + escapedVal2 + "')"

What happens as a result is that, when the hacky user tries to insert a malicious record, it will get inserted along with the user’s insert statement. It’ll look like the one below –

INSERT INTO TABLE VALUES ('A1','A2'');INSERT INTO TABLE VALUES (''B1'',''B2')

The values inserted in your table would be as below –

Backend Tables after escaping quotes (Image by author)

Thereby, you’ll see in your backend table that one of the users tried to run an insert statement. You have effectively stopped a SQL injection from happening by merely escaping your input variables.

A more optimal way would be to use python libraries such as psycopg2, pyodbc, sqlite3 or SQLAlchemy. These are SQL adapters that have inbuilt support for parameterized queries among other features.

In this article, we explored four practical use cases to enhance your Python programming skills when working with SQL queries. We started with Use Case 1, where we focused on improving the readability of SQL queries written in Python. By utilizing techniques such as query formatting and indentation, we can make our code more organized and easier to understand.

Moving on to Use Case 2, we delved into the extraction of hashtags from a text column. By leveraging SQL and its string manipulation functions, we learnt how to effectively extract relevant hashtags and enhance the data analysis process.

In Use Case 3, we addressed the importance of suppressing future and deprecation warnings in Python. By utilizing the `warnings` module, we can ensure a cleaner Python output and an error-free code execution, avoiding unnecessary distractions and potential compatibility issues.

Lastly, in Use Case 4, we emphasized the significance of using sanitized code and parameterized queries to enhance security, improve performance, and prevent SQL injection attacks.

By understanding and implementing these use cases, Python developers and Data Analysts can elevate their SQL query execution and optimization skills, leading to more robust and efficient code. I believe that incorporating these techniques in real-world scenarios will help create cleaner workflows and will empower you to derive actionable insights efficiently and securely.



Source link

Leave a Comment