Coding corner

Decorator

  • A decorator is a design pattern in Python that allows us to add new functionality to an existing object without modifying its structure. Applying a decorator to a function can be seen as making a wrapper for that function to do some extra tasks without changing the function itself.
  • Decorators make your code much cleaner when you need to measure the execution time of a function, do some logging or caching.
1. Let's consider a first example where we create a decorator that upper-cases any string returned by some function.

def uppercase_decorator(function):
    def wrapper():
        return function().upper()
    return wrapper
@uppercase_decorator
def say_hello():
    return 'Hello there'

say_hello()

Output---> 'HELLO THERE'

2. Next, let's look at a more realistic example: determining the execution time of any function.

! pip install sympy
import time
from sympy.ntheory import factorint

def execution_time_decorator(function):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = function(*args, **kwargs)
        end = time.time()
        execution_time = end - start
        print(f"Function '{function.__name__}' executed in {execution_time} seconds")
        return result
    return wrapper

@execution_time_decorator
def factorization(n):
    return factorint(n)

factorization(25)

Output: ---> Function 'factorization' executed in 0.0 seconds
{5: 2}

* With execution_time_decorator, you can measure the execution time of any function simply by decorating it with @execution_time_decorator. Pretty cool, right :)

3. Let's finish with the logging example.
logging.basicConfig(level = logging.DEBUG) sets the root logger's threshold: any message less severe than that level is ignored. Since DEBUG is the lowest standard level, every message gets logged here.

import logging

logging.basicConfig(level = logging.DEBUG)
logger = logging.getLogger()

def logging_decorator(function):
    def wrapper(*args, **kwargs):
        try:
            result = function(*args, **kwargs)
            return result
        except Exception as e:
            logger.exception(f"Exception from {function.__name__} : {str(e)}")
            raise  # re-raise so the caller still sees the original exception
    return wrapper

@logging_decorator
def divide_by_zero():
    return 1. / 0.

Run the function to see how your custom logging decorator captures the exception!
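One detail worth knowing: a plain wrapper replaces the decorated function's metadata, so attributes like __name__ end up pointing to the wrapper. A minimal sketch (using the standard-library functools.wraps, not part of the examples above) that preserves that metadata:

import functools

def logged(function):
    @functools.wraps(function)  # copies __name__, __doc__, etc. from the original function
    def wrapper(*args, **kwargs):
        print(f"calling {function.__name__}")
        return function(*args, **kwargs)
    return wrapper

@logged
def say_hello():
    return 'Hello there'

print(say_hello.__name__)

Output---> say_hello (it would be 'wrapper' without functools.wraps)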

None and NaN in Python and PySpark

1. Python

  • None is a very special constant in Python that represents a null variable or nothing:
    • You can assign None to a variable (x = None to say that this variable has not taken a value yet). You can also pass it as a default argument of a function (function(x=None)) and make a function return None to indicate that it returns nothing.
    • It’s an instance of its own type (NoneType). It’s not numeric (not int or float).
  • NaN (Not a Number), on the other hand, is a special floating-point value, hence numeric. It represents the result of an invalid or undefined mathematical operation such as dividing zero by zero (see the short sketch below).
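A quick sketch illustrating these differences with plain Python:

import math

x = None
print(type(x))        # <class 'NoneType'>
print(x is None)      # True -- None is checked with 'is', not '=='

y = float('nan')      # NaN; np.nan is the same floating-point value
print(type(y))        # <class 'float'> -- NaN is numeric
print(math.isnan(y))  # True
print(y == y)         # False -- NaN is not equal to anything, not even itself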

In the Pandas world, None in a numeric column is automatically turned into NaN, and both None and NaN are interpreted as missing values in a Pandas dataframe. Let's consider the following example:

import pandas as pd
import numpy as np 
data = [('A', np.nan), ('B', None), (None, 20.0)]
df_pandas = pd.DataFrame(data, columns=['name', 'age'])
print(df_pandas)
   name   age
0     A   NaN
1     B   NaN
2  None  20.0

df_pandas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    2 non-null      object 
 1   age     1 non-null      float64
dtypes: float64(1), object(1)
memory usage: 176.0+ bytes

We can see that the column name has dtype object and the column age has dtype float64 in Pandas. If you want to choose a specific type for each column, for example the nullable string type for name and Float32 for age, use a type cast as follows:

In [ ]:

df_pandas = df_pandas.astype({'name': pd.StringDtype(), 'age': pd.Float32Dtype()})

We then obtain a dataframe with the chosen column types.

In [ ]:

df_pandas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    2 non-null      string 
 1   age     1 non-null      Float32
dtypes: Float32(1), string(1)
memory usage: 167.0 bytes

2. None and NaN in Spark dataframe

In [ ]:

import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder.appName('spark_example').getOrCreate()

Let's take the same data as above, but this time create a Spark dataframe from it.

In [ ]:

# data = [('A', np.nan), ('B', None), (None, 20.0)]
schema = StructType().add('name', StringType()).add('age', FloatType())
df_spark = spark.createDataFrame(data=data, schema=schema)
df_spark.show()
+----+----+
|name| age|
+----+----+
|   A| NaN|
|   B|null|
|null|20.0|
+----+----+

In [ ]:

df_spark.dtypes

Out[ ]:

[('name', 'string'), ('age', 'float')]
  • Firstly, as in the Python case, None represents the null value (nothing) in a Spark dataframe.
  • We observe that, unlike Pandas, a Spark dataframe does distinguish between NaN and null (None) values. Hence, if you want to count all missing values in this Spark dataframe, you have to make use of two distinct functions, isnan() and isNull().

In [ ]:

df_spark.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_spark.columns]).show()
+----+---+
|name|age|
+----+---+
|   1|  2|
+----+---+

Let’s convert the above Spark dataframe into Pandas and see what’ll happen:

In [ ]:

df_converted = df_spark.toPandas()
df_converted

Out[ ]:

   name   age
0     A   NaN
1     B   NaN
2  None  20.0

In [ ]:

df_converted.dtypes

Out[ ]:

name     object
age     float32
dtype: object

You may have already spotted something strange here:

  • In the column name, the null value becomes None, just like in the original list data. However, in the column age, everything seems alright: NaN and null values both turn into NaN (as expected).
  • This is an issue that we should be aware of and handle properly to avoid potential bugs when processing the data (for training a Machine Learning model, for instance).

In [ ]:

df_converted.fillna(value=np.nan, inplace=True)
df_converted

Out[ ]:

   name   age
0     A   NaN
1     B   NaN
2   NaN  20.0

Besides, we observe that the column name, originally of String Type, turned into Object Type in the Pandas dataframe. Well, this is not too annoying for data processing (nowadays most Machine Learning libraries handle object-type columns gracefully). However, if for some reason you need to convert this column to a proper string type, the same casting code works like a charm 🙂

In [ ]:

df_converted['name'] = df_converted['name'].astype(pd.StringDtype())

In [ ]:

df_converted.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    2 non-null      string 
 1   age     1 non-null      float32
dtypes: float32(1), string(1)
memory usage: 164.0 bytes

Note

  • Practically, when working in a Big Data framework, a good practice is to avoid at all costs converting a Spark dataframe to Pandas, since this operation can be enormously time-consuming.
  • However, in case toPandas() is the only option, you can optimize this operation with PyArrow. Apache Arrow is an in-memory columnar data format used in Apache Spark to efficiently transfer data between JVM and Python processes. Simply add the following line to your code, it might ease your pain 😀 spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") (see the snippet below).
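For context, here is that setting next to the conversion it speeds up (a sketch; the config key shown is the Spark 3.x name):

# Enable Apache Arrow for the Spark -> Pandas transfer (Spark 3.x config key)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Subsequent conversions use the columnar Arrow format under the hood
df_converted = df_spark.toPandas()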

Generator in Python

Imagine that you are building a Neural Machine Translation system which can translate English into Vietnamese efficiently. With this kind of ML/AI problem, at some point you have to process a huge amount of text sequences sequentially (appending a special EOS (End of Sentence) token to each sentence, filtering by length, etc.). The key point is that you want to do (generate) things like that one at a time, on demand, instead of computing and storing everything at once. This approach can be achieved via a nice design pattern in Python called a generator.

Put simply, a generator is a special type of iterator that produces a sequence of values on the fly.

  • You can only iterate over a generator once. Generators do not store all their values in memory, which makes them memory-efficient.
  • They generate a sequence of values using the yield statement.
  • This makes generators useful for working with large or infinite sequences where it’s not practical to pre-generate all the values.
  • Let's take a classic programming example where you want to generate the first $n$ values of the Fibonacci sequence.
    • $u_0 = u_1 = 1$
    • $u_n = u_{n-1} + u_{n-2}$
  1. Define the generator with a yield statement that produces the Fibonacci sequence up to index n
def fibonacci_generator(n):
    a = b = 1
    for i in range(n):
        yield a
        a, b = b, a+b

n = 5
gen = fibonacci_generator(n)

The generator object is an iterator and can be used in a for loop or with the next() function to retrieve the values it generates.
When the generator is iterated over, it executes the code inside the generator function until it encounters a yield statement.
At this point, the current value is yielded, and the generator's state is frozen.
The next time the generator is iterated over, execution resumes from where it left off until the next yield statement is encountered.

for i in range(5):
    print(next(gen))

You see that your generator gen produced the first $5$ Fibonacci values $\{1, 1, 2, 3, 5\}$.
* Remark that the first call next(gen) starts the generator; in this case it also returns the first value of the Fibonacci sequence ($1$).
* If you keep asking for values after $n=5$, a StopIteration exception is raised since the generator cannot produce more than $n$ Fibonacci values.
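A small sketch of what exhausting the generator looks like:

gen = fibonacci_generator(5)
print(list(gen))        # [1, 1, 2, 3, 5] -- consumes the generator entirely

try:
    next(gen)           # the generator has nothing left to yield
except StopIteration:
    print('No more values: the generator raised StopIteration')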

The method **send()**
* The send() method is used to send a value back into the generator and resume its execution.
* Let's consider the following example:

def square_generator():
    while True:
        value = yield
        yield value ** 2

gen = square_generator()  # create the generator object
next(gen)    # start the generator: it runs up to the bare 'yield' and yields None
gen.send(5)  # send the value 5 into the generator; 'value' becomes 5 and it yields 25

next(gen)    # advance back to the bare 'yield' (yields None again)
gen.send(10) # yields 100


### Iterate on string
* A string is iterable. If you want to step through a string with **next()**, first get an iterator with the built-in **iter()** function, as follows:

a_string = 'python_tips'
gen = iter(a_string)

for i in range(len(a_string)):
    print(next(gen))

BLEU (Bilingual Evaluation Understudy) score
* To make the idea concrete, let's consider the Machine Translation context where we build a machine that translates one language into another (French into English, for instance). BLEU is one of the most popular metrics to evaluate the quality of your machine translation.
* Consider the two following sentences: the predicted sentence (P) and the target (reference) sentence (T). **(P)** is the output of your machine and **(T)** is a translation from a human (a professional language expert, for instance).
  * (P): my baby is doing just fine
  * (T): my baby is struggling a bit

pred_sentence = 'my baby is doing just fine'  # the predicted sentence (P)
unigram = pred_sentence.split()
print(unigram)
bigram = [(unigram[i], unigram[i+1] ) for i in range(len(unigram) - 1)]
print(bigram)
trigram = [(unigram[i], unigram[i+1], unigram[i+2] ) for i in range(len(unigram) - 2)]
print(trigram)

#### n-grams: $n$ consecutive words or symbols
* 1-gram (unigram): ['my', 'baby', 'is', 'doing', 'just', 'fine']
* 2-gram (bigram): ('my', 'baby'), ('baby', 'is'), ('is', 'doing'), ('doing', 'just'), ('just', 'fine')
* 3-gram (trigram): ('my', 'baby', 'is'), ('baby', 'is', 'doing'), ('is', 'doing', 'just'), ('doing', 'just', 'fine')

#### Precision
* $\text{the nth-precision P(n)} = \dfrac{\text{number of n-grams in (P) that appear in (T)}}{\text{number of n-grams in (P)}}$
* In the above example
  * the unigrams in (P) that appear in (T) are ['my', 'baby', 'is'], hence the numerator of P(1) is 3.
  * the number of unigrams in (P) is the number of words in the sentence (P), hence 6.
  * Then, the precision P(1) = 3/6 = 0.5
* Cons: this precision metric cannot deal with word repetition. If we replace (P) by the sentence "baby baby baby", every unigram in (P) appears in (T), so we get a perfect P(1) = 3/3 = 1 even though the translation is useless. A small computation is sketched below.
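To make the repetition problem tangible, here is a minimal sketch of the (unclipped) unigram precision for the example above (the helper ngram_precision is purely illustrative):

def ngram_precision(pred, ref, n=1):
    # fraction of n-grams of the prediction that also appear in the reference (no count clipping)
    pred_tokens, ref_tokens = pred.split(), ref.split()
    pred_ngrams = [tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1)]
    ref_ngrams = set(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    matches = sum(1 for g in pred_ngrams if g in ref_ngrams)
    return matches / len(pred_ngrams)

ref = 'my baby is struggling a bit'
print(ngram_precision('my baby is doing just fine', ref))  # 0.5
print(ngram_precision('baby baby baby', ref))              # 1.0 -- the repetition problem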