unveiling a world of insights and trends in the IT domain. The survey, rich with developers' responses to questions about their professional experiences, was transformed into a wellspring of valuable information. Using pandas for data cleaning and Tableau for visualization, the project distilled the complex, extensive data into comprehensible and actionable insights. The data was meticulously cleaned so that the ensuing analysis and visualizations were both accurate and meaningful.
The analysis was particularly focused on the top five countries with the highest number of developers who participated in the survey. A dynamic and interactive Tableau dashboard served as the centerpiece of the project, where data was not just presented but brought to life. Users could click on a country, and the entire dashboard would morph to display data pertinent to that specific region. It offered a tailored experience, allowing for a deep dive into the professional landscape, challenges, and trends experienced by developers in different geographical locales.
Beyond the geographical insights, the dashboard was engineered to provide a granular view of the underlying records. A click would unveil a table rich with information about the developers, offering a close-up view of the professional terrain of the IT world in 2022. In essence, this project stood as a bridge between raw data and refined insights, between global trends and individual experiences. It was more than an analysis; it was a narrative of the diverse and dynamic world of IT professionals, told through the lens of data and visualized with the finesse of Tableau.
import pandas as pd
import chardet
import re
import math
# Read the CSV file in binary mode and detect the encoding type
file_path = r'C:\Users\My PC\Desktop\future\tableau\stack over flow\survey_results_public 2922.csv'
with open(file_path, 'rb') as f:
    result = chardet.detect(f.read())
# Read the CSV file into a pandas DataFrame using the detected encoding
df = pd.read_csv(file_path, encoding=result['encoding'])
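Reading the entire file just to detect its encoding can be slow for a large CSV. A minimal alternative sketch that samples only the first bytes (the helper name, sample size, and UTF-8 fallback are my own choices, not part of the original script):

```python
import chardet

def detect_encoding(path, sample_size=100_000):
    # Detect on the first bytes only; usually sufficient and much faster
    # than reading the whole file (sample_size is an arbitrary choice)
    with open(path, 'rb') as f:
        raw = f.read(sample_size)
    result = chardet.detect(raw)
    # Fall back to UTF-8 when chardet is unsure
    if result['encoding'] is None or result['confidence'] < 0.5:
        return 'utf-8'
    return result['encoding']
```

The confidence threshold of 0.5 is arbitrary; chardet reports a confidence score alongside every guess, so any cutoff that suits the data will do.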
# Drop the rows where the 'ConvertedCompYearly' column is null
df = df.dropna(subset=['ConvertedCompYearly'], axis=0)
# Filter the DataFrame to keep only the rows with 'MainBranch' value 'I am a developer by profession'
df = df[df['MainBranch'] == 'I am a developer by profession']
# Specify the columns to drop
columns_to_drop = ['CodingActivities', 'LearnCode', 'LearnCodeOnline', 'LearnCodeCoursesCert', 'YearsCodePro',
'OrgSize', 'PurchaseInfluence', 'BuyNewTool', 'Currency', 'CompTotal', 'CompFreq',
'LanguageWantToWorkWith', 'DatabaseWantToWorkWith', 'PlatformWantToWorkWith',
'WebframeWantToWorkWith', 'MiscTechWantToWorkWith', 'ToolsTechWantToWorkWith',
'NEWCollabToolsWantToWorkWith', 'OpSysPersonal use', 'VCHostingPersonal use',
'VCHostingProfessional use', 'OfficeStackAsyncWantToWorkWith', 'OfficeStackSyncWantToWorkWith',
                   'NEWSOSites', 'SOVisitFreq', 'SOAccount', 'SOPartFreq', 'SOComm',
'Trans', 'Sexuality', 'Ethnicity', 'Accessibility', 'MentalHealth', 'TBranch', 'ICorPM',
'Knowledge_1', 'Knowledge_2', 'Knowledge_3', 'Knowledge_4', 'Knowledge_5', 'Knowledge_6',
'Knowledge_7', 'Frequency_1', 'Frequency_2', 'Frequency_3', 'TimeSearching', 'TimeAnswering',
'Onboarding', 'ProfessionalTech', 'TrueFalse_1', 'TrueFalse_2', 'TrueFalse_3',
'SurveyLength', 'SurveyEase']
# Drop the columns if they exist
for col in columns_to_drop:
    try:
        df = df.drop(col, axis=1)
    except KeyError:
        pass
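The try/except loop works, but pandas can do the same thing in a single call: `drop` accepts `errors='ignore'`, which silently skips any column names that are absent. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({'keep': [1, 2], 'extra': [3, 4]})
# errors='ignore' skips column names that do not exist,
# replacing the per-column try/except loop with one call
df = df.drop(columns=['extra', 'not_a_column'], errors='ignore')
```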
# Replace typographic (curly) quote characters in EdLevel with a plain apostrophe
df['EdLevel'] = df['EdLevel'].apply(lambda x: re.sub(r'[\u2018\u2019\u201A\u201B\u2032\u2035]+', "'", x) if isinstance(x, str) else x)
# Drop the "ResponseId" column if it exists
if 'ResponseId' in df.columns:
    df = df.drop('ResponseId', axis=1)
# Reset the index to start from 0
df.reset_index(inplace=True)
# Rename the "index" column to "Id" and add 1 to make it start from 1
df.rename(columns={'index': 'Id'}, inplace=True)
df['Id'] = df.index + 1
# Save the updated DataFrame back to a CSV file
#df.to_csv(file_path, index=False)
# Replace non-"Man" and non-"Woman" values in the "Gender" column with "Prefer not to say"
df.loc[~df['Gender'].isin(['Man', 'Woman']), 'Gender'] = 'Prefer not to say'
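The negated `isin` mask normalizes every value outside the allowed list in one pass. Note that `isin` never matches None/NaN, so null genders are folded into "Prefer not to say" as well; a toy check:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Man', 'Woman', 'Non-binary', None]})
# ~isin(...) is True for any value not in the list, including nulls
df.loc[~df['Gender'].isin(['Man', 'Woman']), 'Gender'] = 'Prefer not to say'
```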
df.rename(columns={'ConvertedCompYearly': 'Salary'}, inplace=True)
df.rename(columns={'LanguageHaveWorkedWith': 'language'}, inplace=True)
# Remove rows with outlier numbers in the "Salary" column
q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
upper_threshold = q3 + 1.5 * iqr
df = df.loc[df['Salary'] <= upper_threshold]
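The script applies the standard 1.5×IQR rule, but only to the upper tail, presumably because implausibly high salaries are the main concern in this dataset. A self-contained illustration with toy numbers:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 100])  # 100 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper_threshold = q3 + 1.5 * (q3 - q1)  # Tukey's rule, upper fence only
filtered = s[s <= upper_threshold]
```

A symmetric version would also drop values below `q1 - 1.5 * iqr`; for salary data the lower fence is often negative and therefore filters nothing.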
# Save the updated DataFrame back to a CSV file
#df.to_csv(file_path, index=False)
# Split the language column by ';'
df['language'] = df['language'].str.split(';')
# Explode the language column to have one row per language
df = df.explode('language')
# Assign a unique id to each language
df['language_id'] = df.groupby('language').ngroup()
# Remove the index
#df.reset_index(drop=True, inplace=True)
# Split the DatabaseHaveWorkedWith column by ';'
df['DatabaseHaveWorkedWith'] = df['DatabaseHaveWorkedWith'].str.split(';')
# Explode the DatabaseHaveWorkedWith column to have one row per DatabaseHaveWorkedWith
df = df.explode('DatabaseHaveWorkedWith')
# Assign a unique id to each DatabaseHaveWorkedWith
df['DatabaseHaveWorkedWith_id'] = df.groupby('DatabaseHaveWorkedWith').ngroup()
# Split the OfficeStackAsyncHaveWorkedWith column by ';'
df['OfficeStackAsyncHaveWorkedWith'] = df['OfficeStackAsyncHaveWorkedWith'].str.split(';')
# Explode the OfficeStackAsyncHaveWorkedWith column to have one row per OfficeStackAsyncHaveWorkedWith
df = df.explode('OfficeStackAsyncHaveWorkedWith')
# Assign a unique id to each OfficeStackAsyncHaveWorkedWith
df['OfficeStackAsyncHaveWorkedWith_id'] = df.groupby('OfficeStackAsyncHaveWorkedWith').ngroup()
# Split the OfficeStackSyncHaveWorkedWith column by ';'
df['OfficeStackSyncHaveWorkedWith'] = df['OfficeStackSyncHaveWorkedWith'].str.split(';')
# Explode the OfficeStackSyncHaveWorkedWith column to have one row per OfficeStackSyncHaveWorkedWith
df = df.explode('OfficeStackSyncHaveWorkedWith')
# Assign a unique id to each OfficeStackSyncHaveWorkedWith
df['OfficeStackSyncHaveWorkedWith_id'] = df.groupby('OfficeStackSyncHaveWorkedWith').ngroup()
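One caveat worth keeping in mind: chaining `explode` over several multi-valued columns builds a cartesian product. A respondent with three languages and two databases becomes six rows, which inflates any simple counts or averages computed downstream unless they are de-duplicated by respondent Id. A toy illustration:

```python
import pandas as pd

df = pd.DataFrame({'Id': [1],
                   'language': ['Python;SQL;Go'],
                   'DatabaseHaveWorkedWith': ['MySQL;Redis']})
df['language'] = df['language'].str.split(';')
df = df.explode('language')
df['DatabaseHaveWorkedWith'] = df['DatabaseHaveWorkedWith'].str.split(';')
df = df.explode('DatabaseHaveWorkedWith')
# one respondent -> 3 languages x 2 databases = 6 rows
df['language_id'] = df.groupby('language').ngroup()
```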
# Split the PlatformHaveWorkedWith column by ';'
#df['PlatformHaveWorkedWith'] = df['PlatformHaveWorkedWith'].str.split(';')
# Explode the PlatformHaveWorkedWith column to have one row per PlatformHaveWorkedWith
#df = df.explode('PlatformHaveWorkedWith')
# Assign a unique id to each PlatformHaveWorkedWith
#df['PlatformHaveWorkedWith_id'] = df.groupby('PlatformHaveWorkedWith').ngroup()
#df.head()
# Split the values in the DevType column and replace null values with a default value
df['DevType'] = df['DevType'].str.split(';').str[0].fillna('Unknown')
# Convert the Salary column to numeric type and replace NaN values with 0
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce').fillna(0)
df['WorkExp'] = pd.to_numeric(df['WorkExp'], errors='coerce').fillna(0)
def map_yearscode_to_int(yearscode_str):
    if yearscode_str is None:
        return None
    elif yearscode_str == 'Less than 1 year':
        return 0
    elif yearscode_str == 'More than 50 years':
        return 55
    else:
        return yearscode_str
df['YearsCode'] = df['YearsCode'].apply(map_yearscode_to_int)
df['YearsCode'] = df['YearsCode'].astype(float)
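YearsCode mixes numeric strings with two sentinel labels, which is why the mapping above is needed before the cast to float. It can be spot-checked on a toy series (the function is re-stated here so the sketch runs standalone; the value 55 for "More than 50 years" mirrors the script's arbitrary cap):

```python
import pandas as pd

def map_yearscode_to_int(v):
    if v is None:
        return None
    elif v == 'Less than 1 year':
        return 0
    elif v == 'More than 50 years':
        return 55  # arbitrary value above 50, as in the cleaning script
    else:
        return v

s = pd.Series(['10', 'Less than 1 year', 'More than 50 years'])
years = s.apply(map_yearscode_to_int).astype(float)
```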
# Define a function to extract the age from an age range string
def extract_age_from_range(age_str):
    if not isinstance(age_str, str):
        return None
    elif age_str in ['Prefer not to say', None]:
        return None
    elif age_str == '65 years or older':
        return 65
    elif age_str == 'Under 18 years old':
        return 17
    else:
        age_range = age_str.split('-')
        age_min = int(age_range[0])
        age_max = int(age_range[1].split()[0])
        return (age_min + age_max) / 2
# Apply the function to the "Age" column to derive a numeric "Exact age", then floor it
df['Exact age'] = df['Age'].apply(extract_age_from_range)
df['Exact age'] = df['Exact age'].apply(lambda x: math.floor(x) if not math.isnan(x) else None)
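The range-to-midpoint mapping is easy to verify against the survey's own answer formats. A self-contained re-sketch of the function with its expected values:

```python
def extract_age_from_range(age_str):
    # Same logic as the cleaning script, restated so this sketch runs standalone
    if not isinstance(age_str, str) or age_str == 'Prefer not to say':
        return None
    elif age_str == '65 years or older':
        return 65
    elif age_str == 'Under 18 years old':
        return 17
    else:
        lo, rest = age_str.split('-')
        # e.g. '25-34 years old' -> (25 + 34) / 2
        return (int(lo) + int(rest.split()[0])) / 2
```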
# Convert the "Exact age" column to numeric type and replace NaN values with 0
df['Exact age'] = pd.to_numeric(df['Exact age'], errors='coerce').fillna(0)
unique_ages = df['Exact age'].unique()
unique_ages
df.head()
# Exclude responses from Israel
df = df[df['Country'] != 'Israel']
df = df.drop(['MainBranch', 'Blockchain'], axis=1)
# Save the updated DataFrame back to a CSV file
df.to_csv(file_path, index=False)