Social Network Analysis | Chapter 2 | Network Measures

Understanding Network Analysis Metrics

Introduction to Network Measures

  • The lecture introduces the second chapter of social network analysis, focusing on various measures used to quantify networks.
  • It will cover properties of nodes, edges, and the network as a whole, building on previous discussions about microscopic, mesoscopic, and macroscopic levels.

Viral Marketing Example

  • The speaker references popular viral songs like "Gangnam Style" and "Why This Kolaveri Di," highlighting their rapid rise in popularity.
  • Emphasizes that success is not solely due to song quality but also effective publicity through social networks—termed viral marketing.
  • Cites research indicating that influential users' endorsements can significantly boost visibility for content.

Importance of Prominent Users

  • Discusses the necessity of convincing prominent social media users to share content for increased reach.
  • Introduces the idea that identifying key metrics can help locate these influential users within a network.

Key Questions Addressed

  • Lists critical questions for this chapter regarding post visibility dynamics and strategies for publicizing content effectively on social media.

Overview of Network Metrics

  • Outlines different views (microscopic, mesoscopic, macroscopic), each with specific measures:
  • Microscopic: Degree, local clustering coefficient, node centrality.
  • Mesoscopic: Connected components, giant components, group centralities.
  • Macroscopic: Degree distribution, path lengths, edge density.

Exploring Node Degree

Definition and Calculation of Degree

  • Defines degree as the number of edges connected to a node; provides examples from an undirected network illustrating degrees of specific nodes.

Special Cases in Degree Calculation

  • Explains how self-loops contribute twice to a node's degree count; gives an example with node four having five connections due to one self-loop.

Weighted Networks and Directed Graphs

  • In weighted networks, the degree of a node is the sum of the weights of its attached edges. For instance, node one's degree totals 6 + 2 + 3 = 11.

In-Degree vs. Out-Degree

  • Introduces concepts of in-degree (edges directed towards a node) and out-degree (edges directed away from a node), providing examples for clarity.
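The degree definitions above can be sketched in plain Python (no graph library assumed); the small edge lists below are hypothetical examples, not the lecture's figures:

```python
# Degree counts for an undirected graph, plus in-/out-degree for a directed one.
undirected_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
degree = {}
for u, v in undirected_edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
# Note: a self-loop (u, u) would pass through this loop once but add 2 to
# u's degree, matching the special case described above.

directed_edges = [(1, 2), (3, 2), (2, 4)]
in_deg, out_deg = {}, {}
for u, v in directed_edges:
    out_deg[u] = out_deg.get(u, 0) + 1   # edge leaves u
    in_deg[v] = in_deg.get(v, 0) + 1     # edge enters v

print(degree)            # {1: 2, 2: 2, 3: 3, 4: 1}
print(in_deg, out_deg)   # {2: 2, 4: 1} {1: 1, 3: 1, 2: 1}
```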

Sum of Degrees Relation to Edges

Understanding Degree Distribution in Graphs

Introduction to Degree Distribution

  • The sum of the degrees of all nodes equals twice the number of edges, because each edge contributes to the degree of exactly two nodes (the handshaking lemma): Σ_v deg(v) = 2|E|.
  • Degree distribution is defined as the probability distribution of degrees of nodes within a network, where degree is treated as a random variable.

Components of Degree Distribution

  • Denote the degree by k and the number of nodes with degree k by n_k; for example, n_0, n_1, etc., indicate the counts for the respective degrees.
  • The total number of nodes in the graph is N, so the probability of each degree is p_k = n_k/N.

Calculating Mean Degree

  • The mean degree can be measured using the formula: Mean Degree = Σ(k * p_k), which summarizes all probabilities weighted by their respective degrees.

Example Calculation

  • In an example graph with five nodes:
  • No nodes have degree 1 ( n_1 = 0 ).
  • Two nodes have degree 2 ( n_2 = 2 ).
  • Two nodes have degree 3 ( n_3 = 2 ).
  • One node has degree 4 ( n_4 = 1 ).
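The quantities defined above can be sketched in plain Python; the degree sequence below is a hypothetical 5-node example matching the counts given (n_2 = 2, n_3 = 2, n_4 = 1):

```python
from collections import Counter

degrees = [2, 2, 3, 3, 4]
N = len(degrees)

n_k = Counter(degrees)                              # number of nodes with degree k
p_k = {k: n / N for k, n in n_k.items()}            # p_k = n_k / N
mean_degree = sum(k * p for k, p in p_k.items())    # mean degree = sum of k * p_k

# Cumulative and complementary cumulative distributions:
cdf = {k: sum(p for d, p in p_k.items() if d <= k) for k in n_k}
ccdf = {k: 1 - c for k, c in cdf.items()}

print(p_k)           # {2: 0.4, 3: 0.4, 4: 0.2}
print(mean_degree)   # approximately 2.8
```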

Cumulative and Complementary Distributions

  • Cumulative degree distribution (CDD) measures cumulative probabilities up to a certain degree: C(k) is the fraction of nodes with degree less than or equal to k.
  • The complementary cumulative degree distribution (CCDD) is defined as C'(k) = 1 - C(k), the fraction of nodes with degree greater than k.

Importance and Characteristics of Power Law

  • Real-world graphs often follow power law distributions, indicating that one quantity varies as a power function relative to another.

Understanding Power Laws and Their Applications in Networks

Scaling Functions and Power Laws

  • The independent variable x can be scaled up or down, affecting the function's representation without changing its form. For example, scaling y = x^2 to y = (2x)^2 results in y = 4x^2 , maintaining the functional form.
  • When scaling the argument of a power law function, such as y = x^-2 , the relationship remains a power law despite being inversely proportional; as x increases, y decreases.
  • A general power law is represented by the equation f(x) = a · x^(-k), where k is the exponent and a is a constant scale factor. Varying the argument leads to predictable changes in function values.
  • Replacing x with cx gives f(cx) = a · (cx)^(-k) = c^(-k) · f(x), i.e., the function is merely rescaled by c^(-k). This demonstrates that power laws are scale-invariant; they maintain their functional form under scaling transformations.
  • Taking logarithms of both sides transforms the power law into linear equations on log-log scales, revealing straight-line relationships which simplify analysis.
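The log-log linearity can be verified numerically: for an exact power law, a least-squares fit of log f(x) against log x recovers the exponent as the slope. The values of a and k below are arbitrary illustration choices:

```python
import math

a, k = 5.0, 2.5
xs = [1, 2, 4, 8, 16, 32]
log_x = [math.log(x) for x in xs]
log_y = [math.log(a * x ** -k) for x in xs]   # log f = log a - k * log x

# Least-squares slope of log_y against log_x; for collinear points this
# recovers -k up to floating-point error.
n = len(xs)
mx, my = sum(log_x) / n, sum(log_y) / n
slope = sum((lx - mx) * (ly - my) for lx, ly in zip(log_x, log_y)) \
        / sum((lx - mx) ** 2 for lx in log_x)

print(slope)   # approximately -2.5
```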

Importance of Power Laws in Real Networks

  • Power laws are significant because many real-world networks exhibit degree distributions that follow this pattern. An example includes analyzing the worldwide web network's degree distribution.
  • In degree distributions plotted on log-log scales, points often align linearly, indicating adherence to power law behavior. The best-fit line represents this relationship effectively.
  • Most real networks have degree distributions whose power-law exponents fall between 2 and 3 when fitted. This consistency across various networks highlights an essential property of complex systems.
  • Attempts to fit data points with alternative functions often yield poor fits compared to those derived from power laws. The green line represents an inadequate fit for comparison against the established best-fit line for degree distributions.

Network Terminology and Definitions

  • Key definitions: an edge is "incident" to the two nodes it connects; two nodes joined by an edge are "adjacent"; and a "walk" is a sequence of nodes and edges traversed with no restrictions on repetition.
  • A walk through a network allows movement between nodes without constraints on repeating edges or nodes, emphasizing flexibility in navigating network structures.
  • The length of a walk is the number of edges traversed, providing a quantitative handle on connectivity and paths within networks.

Understanding Graph Theory Concepts

Definitions of Key Terms

  • Trail: A restricted walk in a graph where no edge is repeated. For example, moving from node 5 to 2 and then to 3, while not repeating any edges.
  • Path: Similar to a trail but does not allow for repeated nodes. An example path could be from node 5 to 2 to 3 to 4.
  • Cycle: A specific type of path where the first and last nodes are the same, with all other nodes being distinct. This represents a closed loop in the graph.

Measuring Distances in Graphs

  • Distance Between Nodes: Typically measured as the shortest path between two nodes. Multiple paths may exist, but only the shortest is considered for distance measurement.
  • Diameter of a Graph: The longest distance among all pairs of nodes within the network, indicating maximum expansion possible.
  • Average Path Length: Calculated by summing the shortest-path distances between all pairs of nodes and dividing by the number of node pairs, n(n - 1)/2.

Understanding Density in Networks

  • Density (Edge Density): The ratio of edges actually present in a network to the maximum possible number of edges, calculated as e/(n(n - 1)/2) for an undirected graph.
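The distance, diameter, average path length, and density measures above can be sketched in plain Python with breadth-first search; the adjacency dict below is a hypothetical undirected graph:

```python
from collections import deque
from itertools import combinations

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

def bfs_distances(src):
    """Shortest-path (hop) distances from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

pair_dists = [bfs_distances(u)[v] for u, v in combinations(adj, 2)]
n = len(adj)
e = sum(len(nbrs) for nbrs in adj.values()) // 2   # each edge counted twice

diameter = max(pair_dists)                         # longest shortest path
avg_path_length = sum(pair_dists) / (n * (n - 1) / 2)
density = e / (n * (n - 1) / 2)

print(diameter, avg_path_length, density)   # 2, 8/6, 4/6
```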

Clustering Coefficient Insights

Local vs Global Clustering Coefficient

  • Clustering Coefficient Overview: Measures how clusters or communities form within networks; includes local and global coefficients.
  • Local Clustering Coefficient: Focuses on an individual node's neighbors. It assesses how connected these neighbors are amongst themselves.

Calculation Methodology

  • To calculate local clustering coefficient:
  • Identify neighbors of a node (ego).
  • Ignore edges connecting back to the ego.
  • The local clustering coefficient is determined by calculating edge density among these neighbors—actual connections divided by possible connections.
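The calculation steps above can be sketched directly; the example graph is hypothetical, and the ego's neighbors with fewer than two members are treated as having coefficient zero (a common convention, not stated in the lecture):

```python
from itertools import combinations

adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}

def local_clustering(ego):
    nbrs = adj[ego]
    k = len(nbrs)
    if k < 2:
        return 0.0   # assumed convention when the density is undefined
    # Count edges among the neighbors only (edges back to the ego are ignored
    # automatically, since the ego is not in its own neighbor set).
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)   # edge density among neighbors

print(local_clustering(1))   # 1 edge (2-3) out of 3 possible, so 1/3
```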

Implications of Clustering Coefficients

  • A high local clustering coefficient indicates that neighbors are well-connected, suggesting stability within that substructure.
  • Conversely, a low value suggests instability; if ties break easily among neighbors, it reflects weak connectivity within that social structure.

Understanding Clustering Coefficients in Networks

Local Clustering Coefficient

  • The local clustering coefficient is the ratio of edges between the neighbors to the possible edges among them. For example, with four neighbors (six possible edges) and only one actual edge, the coefficient is 1/6.
  • In another case, if there are two edges out of three possible connections, the local clustering coefficient would be 2/3.
  • The local clustering coefficient measures connectivity at the vertex or node level.

Global Clustering Coefficient

  • The global clustering coefficient assesses the entire network rather than individual nodes. It is defined by examining triplets within the graph.
  • A triplet consists of three nodes and can either be open (not fully connected) or closed (fully connected). The global clustering coefficient is determined by dividing the number of closed triplets by the total number of triplets.
  • Closed triplets indicate strong connectivity among nodes; for instance, a triangle formed by three interconnected nodes counts as three closed triplets.
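The triplet-counting definition can be sketched by enumerating, for each node, the triplets centered at it; the graph below is a hypothetical example in which the triangle on nodes 1, 2, 3 contributes three closed triplets:

```python
from itertools import combinations

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

closed = total = 0
for center, nbrs in adj.items():
    for u, v in combinations(nbrs, 2):   # one triplet centered at `center`
        total += 1
        if v in adj[u]:                  # end nodes connected: closed triplet
            closed += 1

global_cc = closed / total
print(closed, total, global_cc)   # 3 closed out of 5 triplets, so 0.6
```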

Analyzing Triplet Structures

  • In the given graph example, a triangle on nodes 1, 2, and 3 yields three closed triplets, one centered at each of its nodes; every triangle thus contributes multiple instances to the closed-triplet count.
  • Open triplets are also counted; for example, a two-edge path such as 2-1-4 does not close into a triangle but still represents a potential connection.

Components in Networks

  • Components refer to disjoint parts of a network where no paths exist between them. Connected components have at least one path connecting any pair of nodes within them.
  • Real-world networks often contain many small components alongside a giant component that encompasses about 80% to 90% of all nodes.
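Connected components, and hence the giant component, can be found with a breadth-first search sweep; the adjacency dict below is a hypothetical graph with three components:

```python
from collections import deque

adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}

seen, components = set(), []
for start in adj:
    if start in seen:
        continue
    comp, queue = {start}, deque([start])
    seen.add(start)
    while queue:                 # BFS collects everything reachable from start
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                comp.add(v)
                queue.append(v)
    components.append(comp)

giant = max(components, key=len)   # the largest (giant) component
print(len(components), giant)      # 3 components; {1, 2, 3} is the largest
```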

Centrality Measures

  • Centrality evaluates how significant or impactful specific nodes or edges are within a network context. Different scenarios may require different centrality quantifications.
  • Four key aspects captured by centrality are prestige, prominence, importance, and power, each reflecting a different dimension of node significance.

Understanding Centrality Measures in Graph Theory

Degree Centrality

  • Degree centrality measures a node's importance by its number of connections; in a follower network this is the number of followers, so more followers means higher degree centrality.

Closeness Centrality

  • Closeness centrality measures how close a node is to all other nodes in the graph, reflecting its ability to spread information quickly.
  • To calculate closeness centrality for a node v , one must measure the shortest path distance from v to all other nodes and sum these distances.
  • The normalization takes the reciprocal of the summed distances so that higher values indicate greater closeness: C_v = (n - 1)/Σ_u d(v, u).
  • A higher closeness centrality value indicates that a node can disseminate information or influence others more effectively due to its proximity to them.
  • It’s important to note that there is no direct correlation between degree centrality and closeness centrality; they serve different purposes in analyzing networks.
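A sketch of closeness centrality using BFS distances; the graph is a hypothetical path 1-2-3-4, and the (n - 1) normalization follows the formula above (texts vary on the exact constant):

```python
from collections import deque

adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

def closeness(v):
    # BFS shortest-path distances from v to every other node.
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    # Reciprocal of the summed distances, scaled by (n - 1).
    return (len(adj) - 1) / sum(dist.values())

print(closeness(2))   # distances 1, 1, 2 sum to 4, so 3/4 = 0.75
```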

Betweenness Centrality

  • Betweenness centrality quantifies how often a node appears on the shortest paths between pairs of other nodes, highlighting its role as an intermediary in the network.
  • For each pair of nodes, multiple shortest paths may exist. Betweenness measures how many of these paths pass through a specific node under consideration.
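A brute-force sketch of this idea: for every pair (s, t), count the fraction of shortest s-t paths that pass through v. The toy graph is hypothetical; real implementations use Brandes' algorithm instead of path enumeration:

```python
from collections import deque

adj = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3, 5}, 5: {4}}

def shortest_paths(s, t):
    """All shortest s-t paths, via BFS over partial paths (toy graphs only)."""
    paths, best = [], [None]
    queue = deque([[s]])
    while queue:
        path = queue.popleft()
        u = path[-1]
        if best[0] is not None and len(path) > best[0]:
            continue                      # longer than a known shortest path
        if u == t:
            best[0] = len(path)
            paths.append(path)
            continue
        for w in adj[u]:
            if w not in path:
                queue.append(path + [w])
    return [p for p in paths if len(p) == best[0]]

def betweenness(v):
    score = 0.0
    nodes = [u for u in adj if u != v]
    for i, s in enumerate(nodes):
        for t in nodes[i + 1:]:
            sps = shortest_paths(s, t)
            score += sum(v in p for p in sps) / len(sps)
    return score

print(betweenness(4))   # 1 + 0.5 + 1 + 1 = 3.5
```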

Understanding Betweenness Centrality

Definition and Calculation of Betweenness Centrality

  • Betweenness centrality measures how often a node appears in the shortest paths between pairs of other nodes in a graph.
  • The formula involves calculating the number of shortest paths between all pairs of nodes (x, y) and counting how many include the node v under inspection.
  • An example is provided where node 3's betweenness centrality is calculated by examining all pairs and their respective shortest paths.

Example Calculation

  • For nodes 1 and 2, there is only one shortest path that does not include node 3, resulting in a count of zero for this pair.
  • In contrast, for nodes 1 and 5, there are two shortest paths; one includes node 3, contributing to its betweenness centrality score.

Interpretation of Betweenness Centrality

  • A high betweenness centrality indicates that a node acts as an important bridge within the network, connecting different communities or clusters.
  • Nodes with high betweenness often act as articulation points; their removal can significantly disrupt network connectivity.

Importance in Network Analysis

  • Articulation points can be critical for information flow between communities; they may reveal sensitive connections if compromised.
  • Identifying these nodes aids in community detection; removing them can clarify cluster structures within networks.

Distinction from Closeness Centrality

  • Betweenness centrality differs from closeness centrality: the former focuses on participation in shortest paths among all pairs while the latter measures proximity to all other nodes from a specific node.
  • Closeness looks at direct distances from one node to others, whereas betweenness assesses overall influence across multiple connections.

Variants of Betweenness Centrality

  • Flow betweenness considers all possible paths rather than just the shortest ones but is computationally more expensive to calculate.

Understanding Graph Centrality Measures

Importance of Edge Betweenness in Clustering

  • Edge betweenness is crucial for graph clustering and segmentation, as it helps identify paths between clusters by minimizing conductance.
  • Conductance measures the number of edges crossing between two communities relative to their size; removing a critical high-betweenness edge can separate these clusters effectively.

Eigenvector Centrality Explained

  • Eigenvector centrality serves as a foundational measure for other centrality metrics, emphasizing connections to prominent nodes.
  • A node's importance increases if connected to highly influential nodes; for example, node E has many inward links but may not be as significant as node C due to its connections.

The Prestige Factor in Centrality

  • Node C's importance is higher than that of node E because it is linked to a prestigious node (e.g., Amitabh Bachchan), while E connects only with less significant nodes.
  • The concept illustrates how being referenced by reputable sources (like Stanford University’s website) enhances a node's visibility and impact.

Mathematical Representation of Eigenvector Centrality

  • Eigenvector centrality is calculated recursively based on the centralities of neighboring nodes, focusing on inward neighbors in directed graphs.
  • The formula involves summing the eigenvector centralities of all inward neighbors, adjusted through an adjacency matrix representation.
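The recursive definition can be computed by power iteration, repeatedly replacing each node's score with the sum of its neighbors' scores and renormalizing. The adjacency matrix below is a hypothetical undirected example (for a directed graph one would sum over inward neighbors):

```python
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
n = len(A)

x = [1.0] * n
for _ in range(100):
    # New score of i = sum of current scores of i's neighbors.
    x_new = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    norm = max(x_new)               # renormalize so values stay bounded
    x = [v / norm for v in x_new]

print([round(v, 3) for v in x])    # the best-connected node (index 2) is 1.0
```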

Transitioning to Katz Centrality

Understanding Centrality in Graphs

The Concept of Neighbors and Their Influence

  • The influence of neighbors diminishes as the distance increases; for example, a direct neighbor has a greater impact than a two-hop or three-hop neighbor.
  • When measuring centrality, immediate neighbors (like Bob, Aziz, Diego, Sri, and Priya) have a higher impact on node v compared to more distant nodes (like John and Cheney).

Attenuation Factor in Centrality Calculation

  • An attenuation factor (alpha), ranging from 0 to 1, is used to discount the influence of farther neighbors on node v.
  • As you consider further hops (k), the alpha value decreases; for instance, if k = 1 (one hop), alpha might be 0.5.

Calculating Centrality Using Adjacency Matrix

  • For one-hop neighbors, the formula involves summing contributions from connected nodes using an adjacency matrix where a_ij = 1 indicates a connection.
  • For second-hop neighbors, the contribution is calculated by squaring the adjacency matrix and applying the reduced alpha value.
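The attenuation idea above underlies Katz-style centrality: contributions from walks of length k are discounted by alpha^k. One way to compute the resulting score is to iterate x = 1 + alpha · A x to a fixed point; the 3-node path graph and alpha = 0.4 below are illustrative choices (alpha must be small enough for convergence):

```python
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
n, alpha = len(A), 0.4

x = [1.0] * n
for _ in range(200):
    # Each node gets a baseline of 1 plus alpha times its neighbors' scores;
    # unrolling this recursion sums alpha**k over walks of every length k.
    x = [1 + alpha * sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

print([round(v, 3) for v in x])   # the middle node accrues the largest score
```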

Understanding Paths Through Matrix Powers

  • Squaring the adjacency matrix allows us to find connections of length two between nodes; cubing it finds connections of length three.
  • This method effectively counts possible paths between nodes at varying lengths based on their connectivity.

Implications of Alpha Values in Centrality Measures

  • As you move through different hops (k), values decrease exponentially due to multiplication with decreasing alpha values.

Understanding Page Rank and Its Origins

Introduction to Alpha Calculation

  • The discussion begins with the concept of alpha, initially set at 0.4, which is squared in subsequent calculations (0.4² = 0.16). This indicates a decreasing contribution in the context being discussed.

Overview of Page Rank

  • Page rank is introduced as a centrality measure, primarily known for its application in Google search algorithms. It has undergone various modifications since its inception.

Historical Context of Page Rank

  • The origins of page rank trace back to Larry Page and Sergey Brin's project at Stanford University between 1996 and 1998, under the guidance of advisor Hector Garcia-Molina.
  • The idea behind page rank was that web pages could be ranked based on link popularity; more links to a page would result in a higher ranking.

Development and Publication

  • In 1998, the foundational paper on page rank was published with contributions from Rajeev Motwani, leading to the initial prototype for Google's search engine.
  • The term "page rank" may refer both to Larry Page's surname and the concept of web pages themselves, though this remains somewhat ambiguous.

Patent History

  • The patent for PageRank was assigned to Stanford University rather than Google because it was developed during the founders' time there. Google acquired an exclusive license in exchange for 1.8 million shares of Google stock.
  • In 2005, Stanford sold those shares for approximately $336 million.

Influence from Citation Analysis

  • Page rank draws significant influence from citation analysis; papers receiving numerous citations are considered impactful, especially if cited by prestigious authors.
  • Eugene Garfield's work in bibliometrics laid groundwork for understanding citation impact within academic literature during the 1950s.

Further Developments Post-Publication

  • Following PageRank's introduction in 1998, Jon Kleinberg published another influential algorithm called HITS (Hyperlink-Induced Topic Search), which shares similarities but differs in methodology.

Measuring Page Rank

  • To measure page rank for a node v_i , one must consider how it is influenced by other nodes pointing towards it within a directed graph representing web pages and hyperlinks.
  • The impact on v_i 's page rank depends on those nodes’ ranks that link to it; specifically, each contributing node’s influence is divided by its out-degree (the number of outgoing links).

Understanding PageRank and Its Components

The Calculation of PageRank

  • The PageRank of a node v_i is calculated based on the contributions from its neighbors, specifically nodes a , b , and c . Each neighbor's PageRank is weighted by their respective out-degrees.
  • For each inward neighbor v_t of v_i , the contribution to the PageRank is determined by dividing the PageRank of v_t by its out-degree.

Uniform Distribution in PageRank

  • A component of the calculation involves distributing a uniform value across all nodes, represented as 1/|V|, where |V| denotes the total number of nodes in the graph.
  • This uniform distribution ensures that even pages with no incoming links receive a small amount of PageRank, preventing them from having a zero value.

Damping Factor in PageRank

  • The damping factor ( d ) balances two components: the contribution from hyperlinks and the uniform distribution. It is typically set to 0.85, which has been widely accepted in literature since its introduction.
  • The damping factor serves as a tunable parameter that can be adjusted for different applications but defaults to 0.85 due to historical precedent.
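Putting the two components together, the update is PR(v) = (1 - d)/|V| + d · Σ_u PR(u)/outdeg(u), summed over v's inward neighbors u. A minimal sketch on a hypothetical 3-page directed graph, iterated to convergence:

```python
# Directed graph: node -> set of nodes it links to (hypothetical example).
out = {1: {2, 3}, 2: {3}, 3: {1}}
d = 0.85                      # damping factor, the conventional default
V = list(out)
pr = {v: 1 / len(V) for v in V}

for _ in range(100):
    new = {}
    for v in V:
        # Sum of PR(u)/outdeg(u) over all pages u that link to v.
        link_part = sum(pr[u] / len(out[u]) for u in V if v in out[u])
        new[v] = (1 - d) / len(V) + d * link_part
    pr = new

print({v: round(p, 3) for v, p in pr.items()})   # page 3 ranks highest
print(round(sum(pr.values()), 3))                # ranks sum to 1
```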

Browsing Behavior and Probability

  • There are two primary ways users access web pages: through hyperlinks or directly typing URLs into browsers. Both methods influence how PageRank is interpreted.
  • The probability of opening any particular webpage uniformly at random is 1/|V|. In contrast, accessing a page via hyperlinks depends on the out-degree of neighboring nodes.

Comparison with HITS Algorithm

  • Introduced by Jon Kleinberg in 1998, HITS (Hyperlink-Induced Topic Search) differs from PageRank by considering both inbound and outbound links when measuring importance.
  • While PageRank focuses solely on inward connections (in-degree), HITS also accounts for outward connections (out-degree), highlighting their significance in citation networks like academic papers.

Understanding Hub and Authority in Social Network Analysis

Importance of Relevant Courses

  • The significance of having a centralized webpage listing relevant courses on social network analysis is emphasized, as it aids users in finding necessary information efficiently.

Hubs and Authorities

  • Unlike PageRank, which looks only at incoming links, HITS considers both the in-degree (incoming links) and out-degree (outgoing links) of nodes, introducing two key concepts: hub nodes (high out-degree) and authority nodes (high in-degree).

Hubness and Authoritativeness

  • Each node should exhibit both hubness (driven by the authoritativeness of the nodes it points to) and authoritativeness (driven by the hubness of the nodes pointing to it), creating a dual quantification.

Recursive Formulation

  • A recursive relationship links the two: the hubness of a node depends on the authoritativeness of its out-neighbors, while its authoritativeness relies on the hubness of its in-neighbors.

Mathematical Representation

  • The mathematical formulation involves vectors representing hubness h and authoritativeness x, with adjacency matrix A. The relationships are expressed as h = Ax for hubness and x = A^T h for authoritativeness.

Eigenvector Equations in Network Analysis

Eigenvector Relationships

  • The equations derived indicate that hubness corresponds to the eigenvector of AA^T, while authoritativeness relates to the eigenvector of A^TA, establishing foundational principles for analyzing networks.
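The coupled equations h = Ax and x = A^T h suggest a simple alternating iteration; the toy directed graph below is hypothetical (node 0 links to 1 and 2, node 1 links to 2):

```python
A = [[0, 1, 1],   # A[i][j] = 1 if page i links to page j
     [0, 0, 1],
     [0, 0, 0]]
n = len(A)
h = [1.0] * n     # hubness scores
x = [1.0] * n     # authoritativeness scores

for _ in range(50):
    # x = A^T h : authority of j comes from hubs that point to j.
    x = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
    # h = A x : hubness of i comes from authorities that i points to.
    h = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    sx, sh = sum(x), sum(h)
    x = [v / sx for v in x]    # normalize to keep values bounded
    h = [v / sh for v in h]

print([round(v, 3) for v in h])   # node 0 is the strongest hub
print([round(v, 3) for v in x])   # node 2 is the strongest authority
```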

Hyperlink-Induced Topic Search

  • This method utilizes the aforementioned formulations to search web pages based on different topics, enhancing relevance through structured link analysis.

Assortative Mixing: Homophily vs. Heterophily

Conceptual Overview

  • Assortative mixing can be categorized into homophily (similar properties connecting nodes, e.g., shared college or interests) and heterophily (different properties connecting nodes, e.g., relations across genders).

Measuring Homophily

Understanding Graph Properties and Metrics

Degree and Correlation

  • The discussion begins with the concept of edges in a graph, using an example sequence of nodes (a, b, c, d, e) to illustrate properties like node degree.
  • The speaker introduces the idea of measuring correlation between two sequences using Pearson correlation, specifically focusing on degree distribution.
  • Clustering coefficients are introduced as another property to measure; the goal is to determine if nodes with similar clustering coefficients are connected.
  • Homophily is quantified by this correlation: a value near +1 indicates homophily, a value near -1 indicates heterophily, and a value near 0 indicates no assortative mixing.

Transitivity and Reciprocity

  • Transitivity is explained as checking if friends of friends are also friends. This can be measured using local or global clustering coefficients.
  • Reciprocity is discussed next; it examines whether connections between nodes are mutual (e.g., if A follows B, does B follow A?).
  • The method for measuring reciprocity involves counting reciprocal pairs in a graph relative to total pairs.

Measuring Reciprocity

  • The adjacency matrix is used for calculating reciprocity; specific conditions must be met for edges to count as reciprocal.
  • To calculate reciprocity mathematically: if both a_ij and a_ji equal 1, they contribute to the count of reciprocal edges.
  • The maximum number of possible reciprocal pairs in a graph is discussed; it's noted that each pair counts as two edges.

Reciprocity via the Adjacency Matrix

  • The square of the adjacency matrix provides insights into paths within the graph. Specifically, diagonal elements indicate paths returning to the same node.
  • Trace calculations from squared matrices yield information about reciprocal pairs present in the graph structure.
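This trace-based count can be sketched directly: the diagonal of A² counts length-2 round trips, so trace(A²) is twice the number of reciprocal pairs. The toy directed graph below is hypothetical:

```python
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
n = len(A)

# A squared: A2[i][j] counts length-2 paths from i to j.
A2 = [[sum(A[i][k] * A[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]
trace = sum(A2[i][i] for i in range(n))   # round trips i -> j -> i
reciprocal_pairs = trace // 2             # each pair counted from both ends

total_edges = sum(map(sum, A))
reciprocity = trace / total_edges         # fraction of edges reciprocated

print(reciprocal_pairs, reciprocity)      # 1 pair; 2 of 4 edges, so 0.5
```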

Neighborhood Similarity Measures

  • Structural equivalence focuses on shared neighbors between nodes; higher intersections indicate greater structural similarity.
  • Normalization techniques such as Jaccard coefficient help quantify structural equivalence by comparing common neighbors against total unique neighbors.
  • Cosine similarity offers another normalization: the number of common neighbors divided by the square root of the product of the two neighborhood sizes, rather than by the size of the union.
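Both overlap normalizations can be sketched on neighbor sets; the undirected graph below is hypothetical, chosen so that nodes 2 and 4 have identical neighborhoods:

```python
import math

adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}

def jaccard(u, v):
    # Common neighbors divided by the union of the two neighborhoods.
    return len(adj[u] & adj[v]) / len(adj[u] | adj[v])

def cosine(u, v):
    # Common neighbors divided by the geometric mean of the neighborhood sizes.
    return len(adj[u] & adj[v]) / math.sqrt(len(adj[u]) * len(adj[v]))

# Nodes 2 and 4 share exactly the neighbors {1, 3}, so both scores are 1.0:
print(jaccard(2, 4), cosine(2, 4))
```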

Degeneracy and Core-Periphery Analysis

Understanding Degeneracy

  • Degeneracy analysis reduces a network by iteratively removing low-degree nodes, assigning each node to a core in the process. This is closely related to core-periphery analysis, also known as k-core decomposition.

K-Core Decomposition Process

  • The k-core decomposition begins with a network where nodes with degree zero are removed first. These nodes are assigned to core zero, indicating they do not connect to any other node.
  • After removing degree zero nodes, the next step involves removing nodes with degree one. This is an iterative process; as nodes are removed, new degree one nodes may emerge that also need removal.
  • The removal of degree one nodes continues until no more such nodes exist in the graph. The remaining structure will consist of higher-degree nodes.

Identifying Higher Cores

  • Once all degree one nodes are removed, the focus shifts to removing degree two nodes. This process continues iteratively until only higher-degree (at least three) nodes remain.
  • Nodes that survive this iterative removal process represent higher cores within the network. Each set of removed lower-degree nodes constitutes a distinct core level (e.g., core 1, core 2).
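The iterative peeling described above can be sketched as follows: repeatedly delete every node whose current degree is below k until none remain; the survivors form the k-core. The graph is a hypothetical example:

```python
adj = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}, 6: set(),
}

def k_core(k):
    g = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    while True:
        low = [u for u, vs in g.items() if len(vs) < k]
        if not low:
            return set(g)                       # no more low-degree nodes
        for u in low:                           # remove and detach them;
            for v in g[u]:                      # this may create new
                g[v].discard(u)                 # low-degree nodes, handled
            del g[u]                            # on the next pass

print(k_core(1))   # node 6 (degree 0) falls out
print(k_core(2))   # peeling 5, 6, then 4 leaves the triangle {1, 2, 3}
```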

Coreness vs Degree

  • It’s important to note that there is no direct correlation between a node's coreness and its degree. A node can have a high degree but still be part of a lower core if it becomes isolated through removals.
  • For example, even if a node has the highest initial degree, it could drop to being part of core one after successive removals due to its connections being eliminated.

Implications in Network Analysis

  • Core values indicate how influential or central a node is within the network structure. High-core value entities often play critical roles despite potentially low visibility (e.g., black market services on social networks).
  • Core-periphery analysis serves as an effective method for identifying key players in complex networks by systematically eliminating lower-degree connections until only significant structures remain.

Additional Metrics Mentioned

Video description

This is supplementary material for the book Social Network Analysis by Dr. Tanmoy Chakraborty. Book Website: https://social-network-analysis.in/ Available for purchase at: https://www.amazon.in/Social-Network-Analysis-Tanmoy-Chakraborty/dp/9354247830