unveiling a world of insights and trends in the IT domain. The survey, rich with developers' responses to questions about their professional experiences, was transformed into a wellspring of valuable information. Using pandas for data cleaning and Tableau for visualization, the project distilled the complex, extensive data into comprehensible and actionable insights. The data was meticulously cleaned so that the ensuing analysis and visualizations were both accurate and meaningful.
The analysis was particularly focused on the top five countries with the highest number of developers who participated in the survey. A dynamic and interactive Tableau dashboard served as the centerpiece of the project, where data was not just presented but brought to life. Users could click on a country, and the entire dashboard would morph to display data pertinent to that specific region. It offered a tailored experience, allowing for a deep dive into the professional landscape, challenges, and trends experienced by developers in different geographical locales.
Beyond the geographical insights, the dashboard was engineered to provide a granular view of the underlying records. A click would unveil a table rich with information about the developers, offering a close-up view of the professional terrain of the IT world in 2022. In essence, this project stood as a bridge between raw data and refined insights, between global trends and individual experiences. It was more than an analysis; it was a narrative of the diverse and dynamic world of IT professionals, told through the lens of data and visualized with the finesse of Tableau.
import pandas as pd
import chardet
import re
import math
# Read the CSV file in binary mode and detect the encoding type
file_path = r'C:\Users\My PC\Desktop\future\tableau\stack over flow\survey_results_public 2922.csv'
with open(file_path, 'rb') as f:
    result = chardet.detect(f.read())
# Read the CSV file into a pandas DataFrame using the detected encoding
df = pd.read_csv(file_path, encoding=result['encoding'])
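Reading the entire file just to detect its encoding can be slow for a large CSV. A minimal alternative sketch that samples only the first bytes (the helper name, sample size, and UTF-8 fallback are my own choices, not part of the original script):

```python
import chardet

def detect_encoding(path, sample_size=100_000):
    # Detect on the first bytes only; usually sufficient and much faster
    # than reading the whole file (sample_size is an arbitrary choice)
    with open(path, 'rb') as f:
        raw = f.read(sample_size)
    result = chardet.detect(raw)
    # Fall back to UTF-8 when chardet is unsure
    if result['encoding'] is None or result['confidence'] < 0.5:
        return 'utf-8'
    return result['encoding']
```

The confidence threshold of 0.5 is arbitrary; chardet reports a confidence score alongside every guess, so any cutoff that suits the data will do.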
# Drop the rows where the 'ConvertedCompYearly' column is null
df = df.dropna(subset=['ConvertedCompYearly'], axis=0)
# Filter the DataFrame to keep only the rows with 'MainBranch' value 'I am a developer by profession'
df = df[df['MainBranch'] == 'I am a developer by profession']
# Specify the columns to drop
columns_to_drop = ['CodingActivities', 'LearnCode', 'LearnCodeOnline', 'LearnCodeCoursesCert', 'YearsCodePro',
'OrgSize', 'PurchaseInfluence', 'BuyNewTool', 'Currency', 'CompTotal', 'CompFreq',
'LanguageWantToWorkWith', 'DatabaseWantToWorkWith', 'PlatformWantToWorkWith',
'WebframeWantToWorkWith', 'MiscTechWantToWorkWith', 'ToolsTechWantToWorkWith',
'NEWCollabToolsWantToWorkWith', 'OpSysPersonal use', 'VCHostingPersonal use',
'VCHostingProfessional use', 'OfficeStackAsyncWantToWorkWith', 'OfficeStackSyncWantToWorkWith',
                   'NEWSOSites', 'SOVisitFreq', 'SOAccount', 'SOPartFreq', 'SOComm',
'Trans', 'Sexuality', 'Ethnicity', 'Accessibility', 'MentalHealth', 'TBranch', 'ICorPM',
'Knowledge_1', 'Knowledge_2', 'Knowledge_3', 'Knowledge_4', 'Knowledge_5', 'Knowledge_6',
'Knowledge_7', 'Frequency_1', 'Frequency_2', 'Frequency_3', 'TimeSearching', 'TimeAnswering',
'Onboarding', 'ProfessionalTech', 'TrueFalse_1', 'TrueFalse_2', 'TrueFalse_3',
'SurveyLength', 'SurveyEase']
# Drop the columns if they exist
for col in columns_to_drop:
    try:
        df = df.drop(col, axis=1)
    except KeyError:
        pass
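The try/except loop works, but pandas can do the same thing in a single call: `drop` accepts `errors='ignore'`, which silently skips any column names that are absent. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({'keep': [1, 2], 'extra': [3, 4]})
# errors='ignore' skips column names that do not exist,
# replacing the per-column try/except loop with one call
df = df.drop(columns=['extra', 'not_a_column'], errors='ignore')
```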
# Replace typographic (curly) quote characters in EdLevel with a plain apostrophe
df['EdLevel'] = df['EdLevel'].apply(lambda x: re.sub(r'[\u2018\u2019\u201A\u201B\u2032\u2035]+', "'", x) if isinstance(x, str) else x)
# Drop the "ResponseId" column if it exists
if 'ResponseId' in df.columns:
    df = df.drop('ResponseId', axis=1)
# Reset the index to start from 0
df.reset_index(inplace=True)
# Rename the "index" column to "Id" and add 1 to make it start from 1
df.rename(columns={'index': 'Id'}, inplace=True)
df['Id'] = df.index + 1
# Save the updated DataFrame back to a CSV file
#df.to_csv(file_path, index=False)
# Replace non-"Man" and non-"Woman" values in the "Gender" column with "Prefer not to say"
df.loc[~df['Gender'].isin(['Man', 'Woman']), 'Gender'] = 'Prefer not to say'
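The negated `isin` mask normalizes every value outside the allowed list in one pass. Note that `isin` never matches None/NaN, so null genders are folded into "Prefer not to say" as well; a toy check:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Man', 'Woman', 'Non-binary', None]})
# ~isin(...) is True for any value not in the list, including nulls
df.loc[~df['Gender'].isin(['Man', 'Woman']), 'Gender'] = 'Prefer not to say'
```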
df.rename(columns={'ConvertedCompYearly': 'Salary'}, inplace=True)
df.rename(columns={'LanguageHaveWorkedWith': 'language'}, inplace=True)
# Remove rows with outlier numbers in the "Salary" column
q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
upper_threshold = q3 + 1.5 * iqr
df = df.loc[df['Salary'] <= upper_threshold]
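The script applies the standard 1.5×IQR rule, but only to the upper tail, presumably because implausibly high salaries are the main concern in this dataset. A self-contained illustration with toy numbers:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 100])  # 100 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper_threshold = q3 + 1.5 * (q3 - q1)  # Tukey's rule, upper fence only
filtered = s[s <= upper_threshold]
```

A symmetric version would also drop values below `q1 - 1.5 * iqr`; for salary data the lower fence is often negative and therefore filters nothing.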
# Save the updated DataFrame back to a CSV file
#df.to_csv(file_path, index=False)
# Split the language column by ';'
df['language'] = df['language'].str.split(';')
# Explode the language column to have one row per language
df = df.explode('language')
# Assign a unique id to each language
df['language_id'] = df.groupby('language').ngroup()
# Remove the index
#df.reset_index(drop=True, inplace=True)
# Split the DatabaseHaveWorkedWith column by ';'
df['DatabaseHaveWorkedWith'] = df['DatabaseHaveWorkedWith'].str.split(';')
# Explode the DatabaseHaveWorkedWith column to have one row per DatabaseHaveWorkedWith
df = df.explode('DatabaseHaveWorkedWith')
# Assign a unique id to each DatabaseHaveWorkedWith
df['DatabaseHaveWorkedWith_id'] = df.groupby('DatabaseHaveWorkedWith').ngroup()
# Split the OfficeStackAsyncHaveWorkedWith column by ';'
df['OfficeStackAsyncHaveWorkedWith'] = df['OfficeStackAsyncHaveWorkedWith'].str.split(';')
# Explode the OfficeStackAsyncHaveWorkedWith column to have one row per OfficeStackAsyncHaveWorkedWith
df = df.explode('OfficeStackAsyncHaveWorkedWith')
# Assign a unique id to each OfficeStackAsyncHaveWorkedWith
df['OfficeStackAsyncHaveWorkedWith_id'] = df.groupby('OfficeStackAsyncHaveWorkedWith').ngroup()
# Split the OfficeStackSyncHaveWorkedWith column by ';'
df['OfficeStackSyncHaveWorkedWith'] = df['OfficeStackSyncHaveWorkedWith'].str.split(';')
# Explode the OfficeStackSyncHaveWorkedWith column to have one row per OfficeStackSyncHaveWorkedWith
df = df.explode('OfficeStackSyncHaveWorkedWith')
# Assign a unique id to each OfficeStackSyncHaveWorkedWith
df['OfficeStackSyncHaveWorkedWith_id'] = df.groupby('OfficeStackSyncHaveWorkedWith').ngroup()
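One caveat worth keeping in mind: chaining `explode` over several multi-valued columns builds a cartesian product. A respondent with three languages and two databases becomes six rows, which inflates any simple counts or averages computed downstream unless they are de-duplicated by respondent Id. A toy illustration:

```python
import pandas as pd

df = pd.DataFrame({'Id': [1],
                   'language': ['Python;SQL;Go'],
                   'DatabaseHaveWorkedWith': ['MySQL;Redis']})
df['language'] = df['language'].str.split(';')
df = df.explode('language')
df['DatabaseHaveWorkedWith'] = df['DatabaseHaveWorkedWith'].str.split(';')
df = df.explode('DatabaseHaveWorkedWith')
# one respondent -> 3 languages x 2 databases = 6 rows
df['language_id'] = df.groupby('language').ngroup()
```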
# Split the PlatformHaveWorkedWith column by ';'
#df['PlatformHaveWorkedWith'] = df['PlatformHaveWorkedWith'].str.split(';')
# Explode the PlatformHaveWorkedWith column to have one row per PlatformHaveWorkedWith
#df = df.explode('PlatformHaveWorkedWith')
# Assign a unique id to each PlatformHaveWorkedWith
#df['PlatformHaveWorkedWith_id'] = df.groupby('PlatformHaveWorkedWith').ngroup()
#df.head()
# Split the values in the DevType column and replace null values with a default value
df['DevType'] = df['DevType'].str.split(';').str[0].fillna('Unknown')
# Convert the Salary column to numeric type and replace NaN values with 0
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce').fillna(0)
df['WorkExp'] = pd.to_numeric(df['WorkExp'], errors='coerce').fillna(0)
def map_yearscode_to_int(yearscode_str):
    if yearscode_str is None:
        return None
    elif yearscode_str == 'Less than 1 year':
        return 0
    elif yearscode_str == 'More than 50 years':
        return 55
    else:
        return yearscode_str
df['YearsCode'] = df['YearsCode'].apply(map_yearscode_to_int)
df['YearsCode'] = df['YearsCode'].astype(float)
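YearsCode mixes numeric strings with two sentinel labels, which is why the mapping above is needed before the cast to float. It can be spot-checked on a toy series (the function is re-stated here so the sketch runs standalone; the value 55 for "More than 50 years" mirrors the script's arbitrary cap):

```python
import pandas as pd

def map_yearscode_to_int(v):
    if v is None:
        return None
    elif v == 'Less than 1 year':
        return 0
    elif v == 'More than 50 years':
        return 55  # arbitrary value above 50, as in the cleaning script
    else:
        return v

s = pd.Series(['10', 'Less than 1 year', 'More than 50 years'])
years = s.apply(map_yearscode_to_int).astype(float)
```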
# Define a function to extract the age from an age range string
def extract_age_from_range(age_str):
    if not isinstance(age_str, str):
        return None
    elif age_str in ['Prefer not to say', None]:
        return None
    elif age_str == '65 years or older':
        return 65
    elif age_str == 'Under 18 years old':
        return 17
    else:
        age_range = age_str.split('-')
        age_min = int(age_range[0])
        age_max = int(age_range[1].split()[0])
        return (age_min + age_max) / 2
# Apply the function to the "Age" column to derive a numeric "Exact age", then floor it
df['Exact age'] = df['Age'].apply(extract_age_from_range)
df['Exact age'] = df['Exact age'].apply(lambda x: math.floor(x) if not math.isnan(x) else None)
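The range-to-midpoint mapping is easy to verify against the survey's own answer formats. A self-contained re-sketch of the function with its expected values:

```python
def extract_age_from_range(age_str):
    # Same logic as the cleaning script, restated so this sketch runs standalone
    if not isinstance(age_str, str) or age_str == 'Prefer not to say':
        return None
    elif age_str == '65 years or older':
        return 65
    elif age_str == 'Under 18 years old':
        return 17
    else:
        lo, rest = age_str.split('-')
        # e.g. '25-34 years old' -> (25 + 34) / 2
        return (int(lo) + int(rest.split()[0])) / 2
```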
# Convert the "Exact age" column to numeric type and replace NaN values with 0
df['Exact age'] = pd.to_numeric(df['Exact age'], errors='coerce').fillna(0)
unique_ages = df['Exact age'].unique()
unique_ages
df.head()
# Exclude responses from Israel
df = df[df['Country'] != 'Israel']
df = df.drop(['MainBranch', 'Blockchain'], axis=1)
# Save the updated DataFrame back to a CSV file
df.to_csv(file_path, index=False)