State any 5 issues of data wrangling in Python
Data wrangling, which overlaps heavily with data cleaning and data preprocessing, is a crucial step in data analysis. Here are five common issues that arise during data wrangling in Python:
1. Missing Data: Missing data occurs when a dataset contains empty or null values. Common strategies are deletion (dropping the affected rows or columns), imputation (replacing missing values with estimates such as the mean or median), and more advanced techniques such as interpolation or model-based prediction of the missing values.
2. Data Inconsistencies: Inconsistencies arise when the same information is represented differently across the dataset, for example through inconsistent formatting, variations in spelling or capitalization, or different encoding schemes. Cleaning them requires standardizing formats, applying text-cleaning techniques, or using regular expressions to find and correct the variants.
3. Outliers: Outliers are data points that deviate significantly from the rest of the dataset. They can be caused by measurement errors, data entry mistakes, or genuine anomalies. Handling them means first identifying them with statistical methods or visualizations, then deciding whether to remove them, transform them, or treat them separately during the analysis.
4. Duplicate Data: Duplicates are multiple records or observations that are identical or nearly identical. They can introduce bias and reduce the accuracy of the analysis. Handling them involves identifying and removing them based on specific criteria, such as matching values in key fields or in a combination of attributes.
5. Data Formatting and Type Inconsistencies: Inconsistent formats and types hinder analysis and modeling. Examples are mixed date formats, numerical values stored as strings, and categorical variables encoded in different ways. The fix is to convert each column to the correct format and enforce consistent data types across the dataset, using the conversion functions available in Python libraries like pandas.
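As a rough illustration of the five issues above, the following sketch applies one possible fix for each to a small made-up table using pandas. The column names (city, date, sales), the values, and the helper clean_sales are assumptions for illustration only. The median fill and the 1.5 x IQR rule are just one reasonable choice among many, not the only correct approach.

```python
import pandas as pd

# Hypothetical toy data that shows all five issues at once.
raw = pd.DataFrame({
    "city": ["New York", "new york ", "Boston", "Boston",
             "Chicago", "Chicago", "Denver"],
    "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-03",
             "2023-01-04", "2023-01-05", "2023-01-06"],
    "sales": ["100", "110", "105", "105", None, "95", "5000"],
})


def clean_sales(frame):
    """Return a cleaned copy of the hypothetical sales table."""
    df = frame.copy()

    # Issue 5 (types): numbers stored as strings, dates stored as text.
    df["sales"] = pd.to_numeric(df["sales"], errors="coerce")
    df["date"] = pd.to_datetime(df["date"])

    # Issue 2 (inconsistencies): stray whitespace and mixed capitalization.
    df["city"] = df["city"].str.strip().str.title()

    # Issue 4 (duplicates): drop rows identical in every column.
    df = df.drop_duplicates().reset_index(drop=True)

    # Issue 1 (missing data): impute with the column median.
    df["sales"] = df["sales"].fillna(df["sales"].median())

    # Issue 3 (outliers): flag values outside 1.5 * IQR instead of deleting.
    q1, q3 = df["sales"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["is_outlier"] = ~df["sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df


clean = clean_sales(raw)
print(clean)
```

Flagging outliers in a separate column, rather than dropping them, leaves the final decision (remove, transform, or analyze separately) to whoever knows the dataset.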
These are just a few common issues encountered during data wrangling. Effective data wrangling practices involve a combination of Python programming skills, domain knowledge, and understanding of the dataset to handle these issues appropriately and prepare the data for further analysis.
OR (the answer below is based directly on the Data Analysis with Pandas book)
Data wrangling is the process of preparing the data and getting it into a format that can be used for analysis. The unfortunate reality of data is that it is often dirty, meaning that it requires cleaning (preparation) before it can be used. The following are some issues we may encounter with our data:
1. Human errors: Data is recorded (or even collected) incorrectly, such as putting 100 instead of 1000, or typos. In addition, there may be multiple versions of the same entry recorded, such as New York City, NYC, and nyc
2. Computer error: Perhaps we weren't recording entries for a while (missing data)
3. Unexpected values: Maybe whoever was recording the data decided to use ? for a missing value in a numeric column, so now all the entries in the column will be treated as text instead of numeric values
4. Incomplete information: Think of a survey with optional questions; not everyone will answer them, so we have missing data, but not due to computer or human error
5. Resolution: The data may have been collected per second, while we need hourly data for our analysis
6. Relevance of the fields: Often, data is collected or generated as a product of some process rather than explicitly for our analysis. In order to get it to a usable state, we will have to clean it up
7. Format of the data: The data may be recorded in a format that isn't conducive to analysis, which will require that we reshape it
8. Misconfigurations in data-recording process: Data coming from sources such as misconfigured trackers and/or webhooks may be missing fields or passing them in the wrong order
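A few of the issues in this list can be sketched with pandas as follows. This is a minimal sketch, not the book's own code: all data, column names, and variable names (readings, cities, canonical, per_second, wide) are made up for illustration, and other approaches would work equally well.

```python
import pandas as pd

# Unexpected values: a "?" placeholder turns a numeric column into text.
# errors="coerce" converts anything unparseable into NaN.
readings = pd.DataFrame({"temp": ["21.5", "?", "22.1"]})
readings["temp"] = pd.to_numeric(readings["temp"], errors="coerce")

# Human errors: several spellings of the same entry.
# The mapping is hypothetical; values missing from it would become NaN.
cities = pd.Series(["New York City", "NYC", "nyc"])
canonical = {"nyc": "New York City", "new york city": "New York City"}
cities_clean = cities.str.lower().map(canonical)

# Resolution: per-second data aggregated to hourly data.
index = pd.date_range("2023-01-01", periods=7200,
                      freq=pd.Timedelta(seconds=1))
per_second = pd.Series(1.0, index=index)
hourly = per_second.resample(pd.Timedelta(hours=1)).sum()

# Format of the data: reshape a wide table (one column per date)
# into a long table (one row per station and date).
wide = pd.DataFrame({
    "station": ["A", "B"],
    "2023-01-01": [1.0, 2.0],
    "2023-01-02": [1.5, 2.5],
})
long = wide.melt(id_vars="station", var_name="date", value_name="value")

print(readings, cities_clean, hourly, long, sep="\n\n")
```

Whether to sum, average, or take the last value when moving to a coarser resolution depends on what the measurement represents; sum is used here only to keep the example simple.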