Core Python for Data Analysis and Scientific Computing

Posted by Anonymous and classified in Computers

Key Concepts in Data Science & Scientific Computing

Visualization Techniques

  • Overlapping Histograms: Set a semi-transparent alpha (e.g., alpha=0.5) so that overlapping distributions stay visible for comparison.

Data Structures & Algorithms

  • BFS Implementation: collections.deque is ideal for Breadth-First Search.
  • Grid Representation: Obstacles often represented by a value like 1.

Jupyter Notebook & Markdown

  • Markdown Headings: Use # prefix for headings in Jupyter Markdown cells.

Optimization & Least Squares

  • Normal Equations: Direct matrix inversion for Least Squares: β = (XᵀX)⁻¹Xᵀy.

Numerical Integration & Simulation

  • Orbit Simulation: Runge–Kutta 4th order method is a common integration technique.
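
As a rough illustration of the technique named above, the sketch below advances a circular two-body orbit with one classical RK4 step per iteration. The function names rk4_step and two_body, the unit choice GM = 1, and the step count are assumptions made for this example only.

```python
import math

def two_body(state):
    """Time derivative of [x, y, vx, vy] for an orbit with GM = 1 (assumed units)."""
    x, y, vx, vy = state
    r3 = (x * x + y * y) ** 1.5
    return [vx, vy, -x / r3, -y / r3]

def rk4_step(f, state, dt):
    """Advance `state` by one classical 4th-order Runge-Kutta step of size dt."""
    k1 = f(state)
    k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)])
    k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)])
    k4 = f([s + dt * k for s, k in zip(state, k3)])
    return [s + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

# Circular orbit: radius 1 and speed 1 give a period of 2*pi in these units
n_steps = 1000
dt = 2 * math.pi / n_steps
state = [1.0, 0.0, 0.0, 1.0]
for _ in range(n_steps):
    state = rk4_step(two_body, state, dt)
```

After one full period the state should come back very close to its starting point, which is a convenient sanity check for any integrator.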

Search Algorithms

  • Brute-Force Search: Often implemented using nested loops.
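
A minimal sketch of a nested-loop brute-force search, assuming the task is to find two list entries that add up to a target. The helper name find_pair and the sample data are hypothetical.

```python
def find_pair(values, target):
    """Return the first index pair (i, j), i < j, whose values sum to target, else None."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):   # inner loop checks every later partner
            if values[i] + values[j] == target:
                return i, j
    return None

pair = find_pair([3, 8, 5, 1, 9], 10)   # (3, 4), because 1 + 9 == 10
```

Every pair is tested, so the cost grows quadratically with the list length; that is the usual price of brute force.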

Python Ecosystem Fundamentals

Python Language Features

  • Dynamic Typing: Variables have no fixed type; type is determined at runtime.
  • Indentation: Whitespace defines code blocks (no braces); 4 spaces per level is the convention.
  • Chained Comparisons: 0 < x < 1 is equivalent to 0 < x and x < 1.
  • Comprehension: Builds a new container in one line (e.g., list, dict, set).
  • Exception Handling: try/except is preferred over pre-checks (EAFP - Easier to Ask for Forgiveness than Permission).
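
The features listed above can be sketched together in a few lines. The helper safe_ratio and the sample values are made up for illustration.

```python
def safe_ratio(counts, key, total):
    """EAFP: attempt the lookup and division, and handle failures afterwards."""
    try:
        return counts[key] / total
    except (KeyError, ZeroDivisionError):
        return None

counts = {"a": 3, "b": 1}
ratio = safe_ratio(counts, "a", 4)      # 0.75
missing = safe_ratio(counts, "z", 4)    # None, the KeyError is handled

x = 0.75
in_unit_interval = 0 < x < 1            # same as (0 < x) and (x < 1)

even_squares = [n * n for n in range(6) if n % 2 == 0]   # [0, 4, 16]
```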

Core Python Data Structures

  • List: Ordered, mutable, allows duplicate elements.
  • Tuple: Ordered, immutable; hashable when its elements are, and slightly cheaper to create and store than a list.
  • Dictionary (Dict): Key-value map; average lookup time is O(1).
  • Set: Unordered collection of unique elements; fast membership tests.
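
A short sketch of how the four containers behave, using invented sample values:

```python
readings = [3, 1, 3, 2]          # list: ordered, mutable, duplicates allowed
readings.append(1)

point = (2.0, 5.0)               # tuple: ordered, immutable
lookup = {point: "sensor A"}     # a tuple of hashable items can serve as a dict key

unique = set(readings)           # set: duplicates removed, leaving {1, 2, 3}
has_two = 2 in unique            # membership test, O(1) on average
```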

NumPy Concepts

  • ndarray: N-dimensional homogeneous array; its size is fixed at creation and the data normally sits in one contiguous memory block (views and slices may be strided).
  • Vectorization: Apply operations element-wise without explicit Python loops for performance.
  • Broadcasting: Aligning shapes by "stretching" singleton dimensions for element-wise operations.
  • ufuncs (Universal Functions): Fast C-implemented element-wise functions (e.g., np.sin, np.exp).
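
The following sketch shows vectorization, broadcasting, and a ufunc on a small array. The temperature values and variable names are invented for the example.

```python
import numpy as np

temps_c = np.array([[20.0, 22.0, 19.0],
                    [25.0, 27.0, 24.0]])          # shape (2, 3)

# Vectorization: one expression replaces a double Python loop
temps_f = temps_c * 9 / 5 + 32

# Broadcasting: the (3,) offsets are stretched across both rows
offsets = np.array([0.5, -0.5, 0.0])
corrected = temps_c + offsets                     # shape (2, 3)

# A (2, 1) column of row means is stretched across the three columns
row_means = temps_c.mean(axis=1, keepdims=True)
centered = temps_c - row_means

# ufunc applied element-wise
decay = np.exp(-np.arange(3))
```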

Pandas Principles

  • Series vs. DataFrame: 1-D labeled array vs. 2-D tabular data structure.
  • Indexing: .loc for label-based indexing, .iloc for integer-based indexing.
  • Missing Data: Represented by NaN (Not a Number) for floats; use .dropna() or .fillna().
  • GroupBy: Split-apply-combine pattern for aggregations.
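
A small sketch of these ideas on a made-up table. The column names, index labels, and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Oslo", "Lima"],
     "temp": [4.0, 19.0, np.nan, 21.0]},
    index=[10, 20, 30, 40],                 # labels deliberately differ from positions
)

by_label = df.loc[20, "temp"]               # row labelled 20
by_position = df.iloc[0, 1]                 # first row, second column

filled = df["temp"].fillna(df["temp"].mean())   # NaN replaced by the column mean
dropped = df.dropna()                           # rows containing NaN removed

mean_by_city = df.groupby("city")["temp"].mean()   # split-apply-combine
```

Because the index labels are 10, 20, 30, 40, df.loc[0] would raise a KeyError while df.iloc[0] returns the first row, which is the practical difference between the two indexers.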

Visualization Terms (Matplotlib)

  • Figure vs. Axes: Figure is the overall canvas; Axes is an individual plot area.
  • Legend: ax.legend() displays only labeled artists on the plot.
  • Alpha: Opacity of plot elements (0 for transparent, 1 for opaque).
  • Histogram Bins: Number of bins controls resolution versus noise in a histogram.
  • KDE (Kernel Density Estimate): Smooth estimate of the underlying Probability Density Function (PDF); bandwidth controls smoothness.
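
A hedged sketch of two overlapping histograms that uses the terms above. The synthetic data, the shared range, and the Agg backend (chosen so the example runs without a display) are assumptions.

```python
import random
import matplotlib
matplotlib.use("Agg")                      # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt

random.seed(0)
group_a = [random.gauss(0.0, 1.0) for _ in range(500)]
group_b = [random.gauss(1.5, 1.0) for _ in range(500)]

fig, ax = plt.subplots()                   # Figure is the canvas, Axes the plot area
# Same bins and range for both groups so the bars line up; alpha keeps both visible
ax.hist(group_a, bins=30, range=(-4, 6), alpha=0.5, label="Group A")
ax.hist(group_b, bins=30, range=(-4, 6), alpha=0.5, label="Group B")
ax.axvline(0.75, color="black")            # no label, so it is left out of the legend
ax.set_xlabel("value")
ax.set_ylabel("count")
legend = ax.legend()
```

Raising bins gives finer resolution but noisier bars; lowering it smooths the shape at the cost of detail.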

Discrete Dynamics

  • State Vector: Encapsulates all variables needed to advance a system's state.
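  • Time Stepping (Explicit Euler): xₖ₊₁ = xₖ + f(xₖ)·dt, where f is the rate of change dx/dt (1st-order approximation).
  • Equilibrium: For an update map xₖ₊₁ = F(xₖ), a fixed point x* satisfies x* = F(x*). For the Euler map F(x) = x + f(x)·dt this is the same as f(x*) = 0.
  • Stability: A fixed point of a 1-D map is locally stable when |F′(x*)| < 1.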
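
To make the fixed-point and stability conditions concrete, the sketch below iterates the logistic map xₖ₊₁ = r·xₖ·(1 − xₖ), which is an assumed example system not taken from the notes. Its nonzero fixed point is x* = 1 − 1/r and the slope of the map there is 2 − r.

```python
def logistic(x, r):
    """One step of the logistic map x_{k+1} = r * x_k * (1 - x_k)."""
    return r * x * (1.0 - x)

def iterate(r, x0, n_steps):
    """Apply the map n_steps times starting from x0 and return the final state."""
    x = x0
    for _ in range(n_steps):
        x = logistic(x, r)
    return x

def slope_at_fixed_point(r):
    """Derivative of the map at x* = 1 - 1/r, which simplifies to 2 - r."""
    return 2.0 - r

x_stable = iterate(2.5, 0.2, 200)      # |slope| = 0.5 < 1: settles at x* = 0.6
x_unstable = iterate(3.2, 0.2, 200)    # |slope| = 1.2 > 1: never settles at x* = 0.6875
```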

Optimization & Least Squares

  • Objective Function: The function to minimize (cost) or maximize (gain).
  • Decision Variables: Parameters that are adjusted during optimization.
  • Constraints: Equalities or inequalities that restrict decision variables.
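  • Normal Equations: Closed-form solution for Least Squares: β = (XᵀX)⁻¹Xᵀy. Numerically, solving (XᵀX)β = Xᵀy is generally preferred over forming the inverse.
  • SciPy Optimization: Use scipy.optimize.minimize(fun, x0, bounds=..., constraints=...); bounds and constraints are keyword arguments, and only some methods support them.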
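
A minimal sketch of a straight-line fit via the normal equations, checked against NumPy's own least-squares solver. The data are synthetic and lie exactly on y = 2 + 3x, an assumption made so the expected coefficients are known.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x                               # synthetic, noise-free data

X = np.column_stack([np.ones_like(x), x])       # design matrix with an intercept column

# Normal equations, solved as a linear system instead of inverting X^T X
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least-squares solver, more robust when X is ill-conditioned
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both approaches should recover an intercept of 2 and a slope of 3 here; on noisy or nearly collinear data the lstsq route is usually the safer choice.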

Search Algorithms (BFS)

  • FIFO Queue: Explores nodes layer by layer, shallowest first, so the first time the goal is reached the path found is a shortest one (in an unweighted graph or grid).
  • Visited Set/Map: Avoids revisiting nodes and records parents for backtracking.
  • Grid Encoding: Common representation: 0 for free space, 1 for blocked/obstacle. Neighbors are typically up/down/left/right.
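
The three ingredients above can be sketched as a grid BFS. The function name bfs_shortest_path and the sample grid are hypothetical; the parents dictionary doubles as the visited set.

```python
from collections import deque

def bfs_shortest_path(grid, start, goal):
    """Return the shortest list of (row, col) cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}              # visited cells mapped to their predecessor
    queue = deque([start])
    while queue:
        cell = queue.popleft()           # FIFO: shallowest cells come out first
        if cell == goal:
            path = []
            while cell is not None:      # walk the parents back to the start
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in parents):
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

grid = [
    [0, 0, 0],
    [1, 1, 0],   # 1 marks an obstacle
    [0, 0, 0],
]
path = bfs_shortest_path(grid, (0, 0), (2, 0))
```

Cells are marked as visited when they are enqueued rather than when they are dequeued, which keeps any cell from entering the queue twice.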

Jupyter Notebook & Python Style

  • Shift+Enter: Runs the current cell and moves to the next.
  • Markdown Cells: Cell type for text; prefix headings with #.
  • PEP 8: Python's style guide, which recommends 4-space indentation, snake_case for variables/functions, a maximum of ~79 characters per line, and grouped imports.

Python & Plotting Utilities

  • * and *=: Repeat strings (e.g., "abc" * 3 gives "abcabcabc"; s *= 3 rebinds s to the repeated string).
  • list(filter(func, items)): Keeps only the elements for which func returns True; the rest are dropped.
  • ax.fill_between(x, y1, y2): Fills the area between two y-coordinates.
  • max(d, key=d.get): Returns the key with the maximum value in a dictionary d.
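
A brief sketch of the pure-Python utilities above, with invented sample values:

```python
word = "ab"
word *= 3                                             # "ababab"

values = [4, -1, 7, 0, -3]
positives = list(filter(lambda v: v > 0, values))     # keeps items where the test is True

scores = {"ann": 82, "bo": 95, "cy": 78}
top = max(scores, key=scores.get)                     # key with the largest value
```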

Practical Code Snippets & Examples

Python Language Features

  • List Comprehension: Generate lists concisely.
    e = [i for i in range(10) if i % 2 == 0] # [0, 2, 4, 6, 8]
  • Sorting Lists: Sort in ascending or descending order.
    m = sorted(my_list, reverse=True) # Sorts from largest to smallest
  • F-strings (Formatted String Literals): Embed expressions inside string literals.
    print(f'Distance {v:.2f} m') # Formats 'v' to two decimal places
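  • Finding Max Value and Index: Locate the peak first and round only for display (a rounded value may not exist in the list, so .index() would raise ValueError).
    peak = max(elec_power)
    i = elec_power.index(peak)
    peak_power = round(peak, 2)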
  • New Line Character: \n inserts a new line.
  • Dictionary Definition:
    s = {'a': 1, 'b': 2}
  • Dictionary Comprehension:
    squares = {k: k*k for k in range(1, 6)} # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
  • Looping Through Strings:
    for letter in input_str: # Iterates over each character
        print(letter)
  • Boolean Checks:
    • any(x > 5 for x in data): Returns True if any element in data is greater than 5.
    • all(x > 0 for x in data): Returns True if all elements in data are greater than 0.

Math Module Usage

  • Importing Pi: from math import pi
  • Trigonometric Functions: math.cos(math.radians(angles[i])) (needs import math; the functions expect radians, so degrees are converted first)
  • Square Root: math.sqrt(d)

NumPy Array Operations

  • Array Creation: a = np.array([1, 2, 3])
  • Array Properties:
    • arr.shape: Tuple of dimension sizes, e.g. (rows, columns) for a 2-D array.
    • arr.size: Total number of elements.
    • arr.dtype: Data type of array elements.
  • Statistical Functions:
    • np.mean(data[:, 2]): Mean of the third column.
    • np.max(arr): Maximum value in the array.
    • np.min(arr): Minimum value in the array.
    • np.mean(arr), np.median(arr): Mean and median of the array.
    • np.sum(arr): Sum of array elements.
  • Special Arrays: np.eye(N): Creates an N x N identity matrix.
  • Sorting & Indexing:
    • np.sort(arr): Returns a sorted copy of the array.
    • np.argmax(arr): Returns the index of the maximum value (into the flattened array by default, or along an axis if axis is given).

Pandas DataFrame Operations

  • DataFrame Creation: df = pd.DataFrame(...).
  • Loading & Saving Data:
    • df = pd.read_csv('data.csv'): Loads data from a CSV file.
    • df.to_csv('output.csv', index=False): Saves DataFrame to CSV without index.
  • DataFrame Properties:
    • df.shape: Returns a tuple representing the dimensions (rows, columns).
    • df.describe(): Generates descriptive statistics of the DataFrame.
    • df.count(): Counts non-null observations per column.
    • df.sum(), df.mean(): Column-wise sum and mean of the DataFrame.
  • Data Selection & Filtering:
    • ages = df['age']: Selects a single column (Series).
    • row5 = df.iloc[4]: Selects the 5th row by integer position.
    • adults = df[df.age >= 18]: Filters rows where age is 18 or greater.
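  • Grouping Data: grouped = df.groupby('category').mean(numeric_only=True): Groups by 'category' and computes the mean of each numeric column (recent pandas versions raise an error on non-numeric columns unless numeric_only=True).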

Matplotlib Plotting Commands

  • Basic Plotting:
    ax.plot(x, y, color='blue', alpha=0.6, linestyle='-', marker='o', markersize=5, label='My Data')
    Note: title is usually set via ax.set_title(), not directly in ax.plot().
  • Setting Labels & Legend:
    • ax.set_xlabel('x-axis label'): Sets the label for the x-axis.
    • ax.legend(): Displays the legend for labeled plot elements.
  • Displaying & Saving Plots:
    • plt.show(): Displays all open figures.
    • plt.savefig('plot.png'): Saves the current figure to a file; call it before plt.show(), since the figure may be empty once the window has been closed.
  • Common Plot Types:
    • ax.plot(...): Line plot.
    • ax.scatter(...): Scatter plot (individual points).
    • ax.bar(...): Vertical bar chart.
    • ax.barh(...): Horizontal bar chart.
    • ax.hist(...): Histogram.
    • ax.pie(...): Pie chart.
    • ax.errorbar(...): Plot data with error bars.

Numerical Simulation Example (Projectile Motion)

import numpy as np

# Physical parameters
g = 9.81  # Gravity (m/s^2)
dt = 0.01 # Time step (s)
n_steps = 1000 # Number of simulation steps

# Initial conditions
y0 = 10.0 # Initial height (m)
v0 = 0.0  # Initial velocity (m/s)

def dynamics(xk):
    """
    Compute the next state [y, v] from the current state xk = [y, v].
    Applies explicit Euler integration.
    """
    yk, vk = xk
    vn = vk - g * dt
    yn = yk + vk * dt
    return [yn, vn]

# Pre-allocate state array
x = np.zeros((n_steps, 2))
x[0] = [y0, v0]

# Simulation loop
for k in range(n_steps - 1):
    x[k + 1] = dynamics(x[k])
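
One way to gauge the first-order error of the explicit Euler loop above is to compare it with the closed-form free-fall height y(t) = y0 − ½·g·t². The sketch below is a separate pure-Python check with a hypothetical helper euler_height; it reuses the same g and y0 and assumes a start from rest.

```python
g = 9.81       # gravity (m/s^2), as in the example above
y0 = 10.0      # initial height (m)

def euler_height(t_end, dt):
    """Height after t_end seconds of free fall from rest, using explicit Euler steps."""
    y, v = y0, 0.0
    for _ in range(round(t_end / dt)):
        y, v = y + v * dt, v - g * dt    # both updates use the old y and v
    return y

t_end = 1.0
exact = y0 - 0.5 * g * t_end ** 2        # closed-form free-fall height
err_coarse = abs(euler_height(t_end, 0.01) - exact)
err_fine = abs(euler_height(t_end, 0.005) - exact)
```

Halving dt should roughly halve the error, which is the signature of a first-order method.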
